Installing the Spark Platform

Installation procedure based on Hadoop 2.7 and Spark 2.1.0

The installation process mainly follows this article: "大數據運算系列:SPARK FOR UBUNTU LTS 16.04 安裝指引" (Big Data Computing Series: Spark for Ubuntu LTS 16.04 Installation Guide)

Preliminary Setup

The operating system is Ubuntu 16.04 LTS Desktop.

  • Install Java

 sudo add-apt-repository ppa:webupd8team/java
 sudo apt-get update
 sudo apt-get install oracle-java8-installer

Edit the environment variables

vim ~/.bashrc

Add the following line at the end of the file:

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"

After installation, run java -version; if the version information is displayed, the installation succeeded.

$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
  • Install Python

sudo apt-get install python2.7

Test the Python installation:

$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
  • Install Python 3

sudo apt-get install python3

Test the installation:

ubuntu@testspark:~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 

Install Scala

sudo apt-get install scala

Test the installation:

$ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("hello world")
hello world

scala>

Edit the environment variables

vim ~/.bashrc

Add the following lines at the end of the file:

export SCALA_HOME=/usr/share/scala 
export PATH=$PATH:$SCALA_HOME/bin

Install SBT

SBT is used to set up a Scala-based build environment. The installation commands are as follows:

echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list

Then paste and run the following commands together:

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823

sudo apt-get update

sudo apt-get install sbt

SBT usage will be explained further in "Setting Up the Spark Build Environment".

Install Spark (standalone)

Download Spark: the official site below provides pre-built packages. A package includes the following components:

  • Core and Spark SQL

  • Structured Streaming

  • MLlib

  • SparkR

  • GraphX

http://spark.apache.org/downloads.html

On the download page we can see that Spark is built for the Hadoop 2.x platform. Although Hadoop 2.x inherits the Hadoop name, its architecture differs considerably from the original: instead of tying the cluster to the MapReduce programming model, Hadoop 2.x introduces YARN as a general resource manager, which lets execution engines that organize work as a Directed Acyclic Graph (DAG) run on the cluster. Under the DAG model, MapReduce is merely a special case (a two-stage DAG), and Spark's parallel execution model is likewise DAG-based.
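As a minimal sketch of the DAG idea (intended to be typed into the pyspark shell once the installation below is complete, where sc is the SparkContext the shell provides automatically), the following chains two lazy transformations and only triggers the computation when an action is called:

rdd = sc.parallelize(range(100))                 # source data distributed as an RDD
doubled = rdd.map(lambda x: x * 2)               # transformation: lazy, only recorded in the DAG
filtered = doubled.filter(lambda x: x % 3 == 0)  # another lazy transformation, still nothing runs
print(filtered.count())                          # action: the whole DAG is executed here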

After downloading the Spark 2 package, extract it and move it to /usr/lib/spark:

 wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
 tar zxvf spark-2.1.0-bin-hadoop2.7.tgz
 mv  spark-2.1.0-bin-hadoop2.7/ spark
 sudo mv spark/ /usr/lib/

Next we configure Spark. As with Hadoop, this is done by editing a startup script; Spark ships with a template for it: spark-env.sh.template

cd /usr/lib/spark/conf/ 
cp spark-env.sh.template spark-env.sh 

Having copied the template, we now add a few settings:

nano spark-env.sh 

Add the following settings, mainly the worker memory size and the location of the Java installation.

JAVA_HOME=/usr/lib/jvm/java-8-oracle
SPARK_WORKER_MEMORY=4g

Here I also add the environment settings once more (although adding them to .bashrc should be sufficient):

SCALA_HOME=/usr/share/scala
SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
SPARK_HOME=/usr/lib/spark
PATH=$PATH:$JAVA_HOME/bin
PATH=$PATH:$SCALA_HOME/bin
PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin

Next, edit ~/.bashrc and append the environment settings above. The final result should look like this:

$ tail -8 .bashrc

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SCALA_HOME=/usr/share/scala
export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin

Reload the environment variables so they take effect immediately (only needed for the current shell; new shells read .bashrc automatically):

source ~/.bashrc

Test the Spark environment

$ pyspark
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/29 14:32:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/29 14:33:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

As with Python, you can leave the shell with CTRL+D or exit().
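Beyond simply starting the shell, a quick sanity check like the following (a minimal sketch typed at the pyspark prompt, again relying on the sc that the shell provides) confirms that jobs actually execute:

>>> nums = sc.parallelize([1, 2, 3, 4, 5])       # distribute a small list as an RDD
>>> print(nums.map(lambda x: x * x).sum())       # square each element and sum them up
55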

Run the example program (SparkPi)

run-example SparkPi 10

The output is shown below (unimportant log messages have been removed):

$ run-example SparkPi 10
...
18/10/29 14:35:46 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
18/10/29 14:35:46 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
18/10/29 14:35:46 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:35:46 INFO DAGScheduler: Missing parents: List()
18/10/29 14:35:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:42676 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:35:47 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:35:47 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34)
18/10/29 14:35:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
18/10/29 14:35:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:35:47 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:35:47 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
18/10/29 14:35:47 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:46320 after 23 ms (0 ms spent in bootstraps)
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp8594285529631276460.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/spark-examples_2.11-2.1.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp7738169543322034741.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/scopt_2.11-3.3.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
...
18/10/29 14:35:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/10/29 14:35:47 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.528 s
18/10/29 14:35:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.760375 s
Pi is roughly 3.1364911364911365

From the log we can see the final computed result, as well as the fact that the job was split into 10 parallel tasks. The actual content of this program will be covered in "Implementation: SparkPi Explained".
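For reference, SparkPi estimates Pi with a Monte Carlo method: it scatters random points over a square and counts how many land inside the inscribed circle. The bundled example is written in Scala; the following is only a rough PySpark sketch of the same idea (not the actual SparkPi source), runnable in the pyspark shell where sc is predefined:

from random import random
from operator import add

partitions = 10                        # corresponds to the "10" argument above
n = 100000 * partitions                # total number of random samples

def inside(_):
    # draw a random point in the 2 x 2 square; report whether it falls inside the unit circle
    x, y = random() * 2 - 1, random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(add)
print("Pi is roughly %f" % (4.0 * count / n))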
