Installing the Spark Platform
Installation procedure based on Hadoop 2.7 and Spark 2.1.0.
The installation mainly follows this article: 大數據運算系列: SPARK FOR UBUNTU LTS 16.04 安裝指引 (Big Data Computing Series: Spark installation guide for Ubuntu LTS 16.04).
Prerequisites
The operating system is Ubuntu 16.04 LTS Desktop.
Installing Java
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
Update the environment variables:
vim ~/.bashrc
Add the following line at the end of the file:
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
After the installation finishes, run java -version and check that version information is displayed, which indicates a successful install:
$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
Installing Python
sudo apt-get install python2.7
Test the Python installation:
$ python
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Installing Python 3
sudo apt-get install python3
Test the installation:
ubuntu@testspark:~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
Installing Scala
sudo apt-get install scala
Test the installation:
$ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Type in expressions to have them evaluated.
Type :help for more information.
scala> println("hello world")
hello world
scala>
Update the environment variables:
vim ~/.bashrc
Add the following lines at the end of the file:
export SCALA_HOME=/usr/share/scala
export PATH=$PATH:$SCALA_HOME/bin
Installing SBT
SBT is used to set up a build environment for Scala. The installation commands are as follows:
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
Then paste and run:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
How to use SBT will be explained further in "Setting Up the Spark Build Environment".
Installing Spark (Standalone)
Downloading Spark
Pre-built packages are available from the official site below; a package includes the following components:
Core and Spark SQL
Structured Streaming
MLlib
SparkR
GraphX
http://spark.apache.org/downloads.html
On the download page we can see that the pre-built Spark packages target the Hadoop 2.x platform. Although Hadoop 2.x inherits the Hadoop name, its architecture is very different from the original: it is no longer tied to the MapReduce engine, and instead supports parallel computation expressed as a Directed Acyclic Graph (DAG). In the DAG view, MapReduce is just one special case, and Spark's own parallel execution model is another.
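To make the DAG idea concrete, here is a minimal PySpark sketch of my own (it will only run once the installation below is complete): the chained transformations form a small DAG that is executed only when the final action is called, and the familiar map-plus-reduce pattern is just one special case of such a graph. The app name "dag-demo" and the numbers are arbitrary.

# Minimal PySpark sketch: a job is a DAG of lazy transformations,
# executed only when an action (here reduce) is triggered.
from pyspark import SparkContext

sc = SparkContext("local[*]", "dag-demo")

nums = sc.parallelize(range(1, 11))            # source RDD: 1..10
squares = nums.map(lambda x: x * x)            # transformation (lazy)
evens = squares.filter(lambda x: x % 2 == 0)   # transformation (lazy)
total = evens.reduce(lambda a, b: a + b)       # action: builds and runs the DAG

print(total)   # 220
sc.stop()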
After downloading the Spark 2 package, extract it and move it under /usr/lib/spark:
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz
mv spark-2.1.0-bin-hadoop2.7/ spark
sudo mv spark/ /usr/lib/
Next we configure Spark. As with Hadoop, this involves editing a startup script; Spark ships a template for it: spark-env.sh.template
cd /usr/lib/spark/conf/
cp spark-env.sh.template spark-env.sh
Having copied the file, we add a few settings:
nano spark-env.sh
Add the following settings, mainly the amount of memory available to the worker and the location of the Java installation:
JAVA_HOME=/usr/lib/jvm/java-8-oracle
SPARK_WORKER_MEMORY=4g
Here I also add the environment settings one more time (although adding them in .bashrc alone should be enough):
SCALA_HOME=/usr/share/scala
SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
SPARK_HOME=/usr/lib/spark
PATH=$PATH:$JAVA_HOME/bin
PATH=$PATH:$SCALA_HOME/bin
PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
Next, edit ~/.bashrc and append the environment settings above, so the end of the file should look like this:
$ tail -8 .bashrc
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SCALA_HOME=/usr/share/scala
export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
Make the environment variables take effect immediately (this only needs to be run this once):
source ~/.bashrc
Testing the Spark Environment
$ pyspark
Python 2.7.12 (default, Dec 4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/29 14:32:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/29 14:33:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 2.1.0
/_/
Using Python version 2.7.12 (default, Dec 4 2017 14:50:18)
SparkSession available as 'spark'.
>>>
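While the shell is still open, a quick sanity check such as the following (my own ad-hoc example; the numbers are arbitrary) confirms that the SparkContext exposed as sc is working:

>>> rdd = sc.parallelize(range(100))        # distribute 0..99 across local threads
>>> rdd.map(lambda x: x * 2).sum()          # lazy map, then an action runs the job
9900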
As with Python, press CTRL+D or call exit() to leave the shell.
Running an Example Program (SparkPi)
run-example SparkPi 10
The output is as follows (unimportant log lines have been removed):
$ run-example SparkPi 10
...
18/10/29 14:35:46 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
18/10/29 14:35:46 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
18/10/29 14:35:46 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:35:46 INFO DAGScheduler: Missing parents: List()
18/10/29 14:35:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:42676 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:35:47 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:35:47 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34)
18/10/29 14:35:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
18/10/29 14:35:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:35:47 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:35:47 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
18/10/29 14:35:47 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:46320 after 23 ms (0 ms spent in bootstraps)
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp8594285529631276460.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/spark-examples_2.11-2.1.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp7738169543322034741.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/scopt_2.11-3.3.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
...
18/10/29 14:35:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/10/29 14:35:47 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.528 s
18/10/29 14:35:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.760375 s
Pi is roughly 3.1364911364911365
From the log we can see the final computed result, and that the job was split into 10 partitions processed in parallel. The actual contents of this program will be covered in "Implementation: SparkPi Explained".
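As a preview, and not the actual bundled SparkPi source (which is written in Scala), the computation amounts to the Monte Carlo estimate sketched below in Python; the file name pi_demo.py and the sample count are my own choices.

# pi_demo.py -- rough Python sketch of the Monte Carlo pi estimate that
# SparkPi performs; for illustration only, not the shipped Scala example.
import random
from pyspark import SparkContext

def inside(_):
    # Random point in the unit square; 1 if it lands inside the quarter circle.
    x, y = random.random(), random.random()
    return 1 if x * x + y * y <= 1.0 else 0

if __name__ == "__main__":
    sc = SparkContext(appName="PiDemo")
    slices = 10                   # same partition count as "run-example SparkPi 10"
    n = 100000 * slices           # total number of random samples
    count = sc.parallelize(range(n), slices).map(inside).reduce(lambda a, b: a + b)
    print("Pi is roughly %f" % (4.0 * count / n))
    sc.stop()

A script like this can be launched with spark-submit pi_demo.py on the same standalone setup.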