# Installing the Spark Platform

The installation steps below are based mainly on this article:\
[大數據運算系列：SPARK FOR UBUNTU LTS 16.04 安裝指引](https://wenhsiaoyi.wordpress.com/2017/04/12/%E5%A4%A7%E6%95%B8%E6%93%9A%E9%81%8B%E7%AE%97%E7%B3%BB%E5%88%97%EF%BC%9Aspark-for-ubuntu-lts-16-04-%E5%AE%89%E8%A3%9D%E6%8C%87%E5%BC%95/)

## Prerequisites

The operating system used here is Ubuntu 16.04 LTS Desktop.

* Install Java

```
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```

Edit the environment variables:

```
vim ~/.bashrc
```

Add the following line at the end of the file:

```
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
```

After installation, run `java -version`; if it prints version information, the installation succeeded:

```
$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
```

* Install Python 2

```
sudo apt-get install python2.7
```

Test the Python installation:

```
$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

* Install Python 3

```
sudo apt-get install python3
```

Test the installation:

```
ubuntu@testspark:~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

## Installing Scala

```
sudo apt-get install scala
```

Test the installation:

```
$ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("hello world")
hello world

scala>
```

Edit the environment variables:

```
vim ~/.bashrc
```

Add the following lines at the end of the file:

```
export SCALA_HOME=/usr/share/scala
export PATH=$PATH:$SCALA_HOME/bin
```

## Installing SBT

SBT is used to build Scala-based projects. Install it with the following command:

```
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
```

Then run the following commands together:

```
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823

sudo apt-get update

sudo apt-get install sbt
```

We will explain how to use SBT in more detail in ["Setting Up the Spark Build Environment"](https://spark-nctu.gitbook.io/spark/~/edit/drafts/-LQ-3faoK9z0QTrRTBZ2/spark-de-jing-jian-li).

## Installing Spark (Standalone)

Download Spark. The official site below provides pre-built packages, which include the following components:

* Core and Spark SQL
* Structured Streaming
* MLlib
* SparkR
* GraphX

![Spark 下載畫面](https://13218333-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LPzeYWaV4cPH8nI_rTk%2F-LQ-FlCt971dEAek0ASH%2F-LQ-BI6FVteRQGRpH84J%2Fspark-download.PNG?alt=media\&token=0a299fac-38ac-4388-988c-538451865da3)

<http://spark.apache.org/downloads.html>

From the download page we can see that Spark is built on the Hadoop 2.0 platform. Although Hadoop 2.0 inherits the Hadoop name, its architecture is quite different from the original Hadoop. For example, Hadoop 2.0 no longer ties the platform to the original MapReduce engine; instead, it allows Directed Acyclic Graph (DAG) style parallel execution. Under the DAG model, MapReduce is just a special case, and so, of course, is Spark's parallel execution model.
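
To build intuition for why MapReduce is a special case of a DAG, note that a MapReduce job is a fixed two-stage pipeline: a map stage followed by a reduce (group-and-aggregate) stage. The classic word-count example can be sketched in plain Python (illustration only, not the Spark or Hadoop API):

```python
# Input data: a few lines of text.
lines = ["hello world", "hello spark"]

# Map stage: emit one (word, 1) pair per word.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce stage: group by key and sum the counts.
def reduce_by_key(pairs):
    counts = {}
    for key, value in pairs:
        counts[key] = counts.get(key, 0) + value
    return counts

counts = reduce_by_key(pairs)
print(counts)  # {'hello': 2, 'world': 1, 'spark': 1}
```

A DAG engine such as Spark's generalizes this: instead of exactly one map followed by one reduce, any number of map, filter, join, and aggregate stages can be chained into an arbitrary acyclic graph.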

After downloading the Spark 2 package, extract it and move it to `/usr/lib/spark`:

```
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz
mv spark-2.1.0-bin-hadoop2.7/ spark
sudo mv spark/ /usr/lib/
```

Next we configure Spark. As with Hadoop, this is done by editing a startup script; Spark ships a template for it: `spark-env.sh.template`.

```
cd /usr/lib/spark/conf/ 
cp spark-env.sh.template spark-env.sh 
```

We copy the template and add some settings:

```
nano spark-env.sh 
```

Add the following settings, mainly the worker memory size and the Java installation path:

```
JAVA_HOME=/usr/lib/jvm/java-8-oracle
SPARK_WORKER_MEMORY=4g
```

Here I also add the environment settings once more (although adding them in `.bashrc` alone should be enough):

```
SCALA_HOME=/usr/share/scala
SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
SPARK_HOME=/usr/lib/spark
PATH=$PATH:$JAVA_HOME/bin
PATH=$PATH:$SCALA_HOME/bin
PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Next, edit `~/.bashrc` and append the environment settings above. The final result should look like this:

```
$ tail -8 .bashrc

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SCALA_HOME=/usr/share/scala
export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Apply the environment variables immediately (this is only needed this one time; new shells pick them up automatically):

```
source ~/.bashrc
```

Test the Spark environment:

```
$ pyspark
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/29 14:32:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/29 14:33:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

```

As with Python, press CTRL+D or type `exit()` to quit.

Run the example program (SparkPi):

```
run-example SparkPi 10
```

The result is as follows (unimportant log lines have been removed):

```
$ run-example SparkPi 10
...
18/10/29 14:35:46 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
18/10/29 14:35:46 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
18/10/29 14:35:46 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:35:46 INFO DAGScheduler: Missing parents: List()
18/10/29 14:35:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:42676 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:35:47 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:35:47 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34)
18/10/29 14:35:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
18/10/29 14:35:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:35:47 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:35:47 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
18/10/29 14:35:47 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:46320 after 23 ms (0 ms spent in bootstraps)
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp8594285529631276460.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/spark-examples_2.11-2.1.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp7738169543322034741.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/scopt_2.11-3.3.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
...
18/10/29 14:35:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/10/29 14:35:47 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.528 s
18/10/29 14:35:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.760375 s
Pi is roughly 3.1364911364911365
```

From the log we can see the final computed result, and that the job was split into 10 tasks processed in parallel. We will walk through the program itself in ["Hands-on: SparkPi Explained"](https://spark-nctu.gitbook.io/spark/~/edit/drafts/-LQ-3faoK9z0QTrRTBZ2/zuo-sparkpi-jie).
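
For a rough idea of what SparkPi computes: it estimates π by Monte Carlo sampling, throwing random points into the square [-1, 1] × [-1, 1] and counting the fraction that land inside the unit circle (that fraction approaches π/4). A plain-Python sketch of the same idea (not the actual Spark example code, which distributes the sampling across tasks):

```python
import random

def estimate_pi(num_samples, seed=42):
    """Monte Carlo estimate of pi: the fraction of random points in the
    square [-1, 1] x [-1, 1] that fall inside the unit circle tends to pi/4."""
    rng = random.Random(seed)  # fixed seed for a reproducible estimate
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples

print(estimate_pi(100000))  # roughly 3.14
```

In the Spark version, the `10` passed to `run-example SparkPi 10` is the number of partitions: each of the 10 tasks samples its own share of points, and the per-task counts are combined with a `reduce`, which is exactly the stage visible in the log above.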
