# Installing the Spark Platform

The installation process mainly follows this article:\
[大數據運算系列：SPARK FOR UBUNTU LTS 16.04 安裝指引](https://wenhsiaoyi.wordpress.com/2017/04/12/%E5%A4%A7%E6%95%B8%E6%93%9A%E9%81%8B%E7%AE%97%E7%B3%BB%E5%88%97%EF%BC%9Aspark-for-ubuntu-lts-16-04-%E5%AE%89%E8%A3%9D%E6%8C%87%E5%BC%95/)

## Prerequisites

The operating system used here is Ubuntu 16.04 LTS Desktop.

* Install Java

```
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
```

Edit the environment variables:

```
vim ~/.bashrc
```

Add the following line at the end of the file:

```
export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
```

After the installation completes, run `java -version`; if the version information is displayed, the installation succeeded:

```
$ java -version
java version "1.8.0_191"
Java(TM) SE Runtime Environment (build 1.8.0_191-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.191-b12, mixed mode)
```

* Install Python

```
sudo apt-get install python2.7
```

Test the Python installation:

```
$ python
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

* Install Python 3

```
sudo apt-get install python3
```

Test the installation:

```
ubuntu@testspark:~$ python3
Python 3.5.2 (default, Nov 23 2017, 16:37:01)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 
```

## Install Scala

```
sudo apt-get install scala
```

Test the installation:

```
$ scala
Welcome to Scala version 2.11.6 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_191).
Type in expressions to have them evaluated.
Type :help for more information.

scala> println("hello world")
hello world

scala>
```

Edit the environment variables:

```
vim ~/.bashrc
```

Add the following lines at the end of the file:

```
export SCALA_HOME=/usr/share/scala
export PATH=$PATH:$SCALA_HOME/bin
```

## Install SBT

SBT is used to set up a Scala-based build environment. The installation commands are as follows:

```
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
```

Then paste and run the following commands together:

```
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823

sudo apt-get update

sudo apt-get install sbt
```

The use of SBT is explained further in ["Setting Up the Spark Build Environment"](https://spark-nctu.gitbook.io/spark/~/edit/drafts/-LQ-3faoK9z0QTrRTBZ2/spark-de-jing-jian-li).

## Install Spark (standalone)

Download Spark. The official download page below provides pre-built packages, which include the following components:

* Core and Spark SQL
* Structured Streaming
* MLlib
* SparkR
* GraphX

![Spark download page](/files/-LQ-BI6FVteRQGRpH84J)

<http://spark.apache.org/downloads.html>

On the download page we can see that Spark is built on top of the Hadoop 2.0 platform. Although Hadoop 2.0 inherits the Hadoop name, its architecture is quite different from the original Hadoop. For example, Hadoop 2.0 moves away from the original MapReduce-centric architecture in favor of Directed Acyclic Graph (DAG) based parallel computation; under the DAG model, MapReduce becomes just a special case, and so, of course, does Spark's parallel execution model.
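
To make the DAG idea concrete, here is a minimal pyspark sketch (it can be run in the `pyspark` shell once the installation below is done; the shell provides the SparkContext as `sc`): transformations such as `map` and `filter` only extend the computation graph, and nothing actually runs until an action such as `count` is called.

```
# Minimal sketch of DAG-style lazy evaluation (assumes the pyspark shell,
# where `sc` is the SparkContext provided by the shell).

nums = sc.parallelize(range(1, 1001), 4)       # an RDD split into 4 partitions

# Transformations: these only add nodes to the DAG, nothing is computed yet.
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# Action: only now does Spark schedule the DAG and run the tasks in parallel.
print(evens.count())                           # 500
```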

After downloading the Spark 2 package, extract it and move it under `/usr/lib/spark`:

```
wget http://d3kbcqa49mib13.cloudfront.net/spark-2.1.0-bin-hadoop2.7.tgz
tar zxvf spark-2.1.0-bin-hadoop2.7.tgz
mv spark-2.1.0-bin-hadoop2.7/ spark
sudo mv spark/ /usr/lib/
```

Next we configure Spark. As with Hadoop, this is done by editing a startup script; Spark ships with a template for it, `spark-env.sh.template`:

```
cd /usr/lib/spark/conf/
cp spark-env.sh.template spark-env.sh
```

Having copied the template, open the new file to add some settings:

```
nano spark-env.sh
```

Add the following settings, mainly the worker memory size and the location of the Java installation:

```
JAVA_HOME=/usr/lib/jvm/java-8-oracle
SPARK_WORKER_MEMORY=4g
```

Here I also add the environment settings once more (although adding them to `.bashrc` alone should be enough):

```
SCALA_HOME=/usr/share/scala
SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
SPARK_HOME=/usr/lib/spark
PATH=$PATH:$JAVA_HOME/bin
PATH=$PATH:$SCALA_HOME/bin
PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Next, edit `~/.bashrc` and append the environment settings above; the end of the file should look like this:

```
$ tail -8 .bashrc

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"
export SCALA_HOME=/usr/share/scala
export SBT_HOME=/usr/share/sbt-launcher-packaging/bin/sbt-launch.jar
export SPARK_HOME=/usr/lib/spark
export PATH=$PATH:$JAVA_HOME/bin
export PATH=$PATH:$SCALA_HOME/bin
export PATH=$PATH:$SBT_HOME/bin:$SPARK_HOME/bin:$SPARK_HOME/sbin
```

Apply the environment variables immediately (this only needs to be run once; new shells will load `.bashrc` automatically):

```
source ~/.bashrc
```

Test the Spark environment:

```
$ pyspark
Python 2.7.12 (default, Dec  4 2017, 14:50:18)
[GCC 5.4.0 20160609] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/29 14:32:59 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/29 14:33:04 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.1.0
      /_/

Using Python version 2.7.12 (default, Dec  4 2017 14:50:18)
SparkSession available as 'spark'.
>>>

```

As with Python, you can leave the shell with CTRL+D or `exit()`.
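
The welcome banner also shows that the shell has already created a `SparkSession` named `spark` and a SparkContext named `sc`. A small sketch of what can be typed at the prompt, using only these shell-provided variables:

```
# Run inside the pyspark shell; `spark` and `sc` are created by the shell.

# DataFrame API through the SparkSession
df = spark.range(0, 100)          # a single-column DataFrame with ids 0..99
print(df.count())                 # 100

# Classic RDD API through the SparkContext
words = sc.parallelize(["spark", "hadoop", "scala", "spark"])
print(words.countByValue())       # per-word counts, e.g. 'spark' -> 2
```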

Run the example program (SparkPi):

```
run-example SparkPi 10
```

The result is shown below (unimportant log lines have been removed):

```
$ run-example SparkPi 10
...
18/10/29 14:35:46 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 10 output partitions
18/10/29 14:35:46 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
18/10/29 14:35:46 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:35:46 INFO DAGScheduler: Missing parents: List()
18/10/29 14:35:46 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:35:47 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:42676 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:35:47 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:35:47 INFO DAGScheduler: Submitting 10 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34)
18/10/29 14:35:47 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
18/10/29 14:35:47 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, localhost, executor driver, partition 2, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, localhost, executor driver, partition 3, PROCESS_LOCAL, 6086 bytes)
18/10/29 14:35:47 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:35:47 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:35:47 INFO Executor: Running task 3.0 in stage 0.0 (TID 3)
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Executor: Running task 2.0 in stage 0.0 (TID 2)
18/10/29 14:35:47 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:46320 after 23 ms (0 ms spent in bootstraps)
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/spark-examples_2.11-2.1.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp8594285529631276460.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/spark-examples_2.11-2.1.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar with timestamp 1540823746156
18/10/29 14:35:47 INFO Utils: Fetching spark://172.16.0.222:46320/jars/scopt_2.11-3.3.0.jar to /tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/fetchFileTemp7738169543322034741.tmp
18/10/29 14:35:47 INFO Executor: Adding file:/tmp/spark-4cf3c215-48c4-4fe4-8c00-fe1cad6637e2/userFiles-81e56a98-0010-4aa5-af77-aff04ebff077/scopt_2.11-3.3.0.jar to class loader
18/10/29 14:35:47 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
...
18/10/29 14:35:47 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
18/10/29 14:35:47 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:38) finished in 0.528 s
18/10/29 14:35:47 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:38, took 0.760375 s
Pi is roughly 3.1364911364911365
```

From the log we can see the final result of the computation, and that the job was split into 10 tasks processed in parallel. The program itself is explained in ["Hands-on: SparkPi Explained"](https://spark-nctu.gitbook.io/spark/~/edit/drafts/-LQ-3faoK9z0QTrRTBZ2/zuo-sparkpi-jie).
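
The real SparkPi example is written in Scala and is covered in the chapter linked above. As a rough pyspark sketch of the same Monte Carlo idea (an approximation for illustration, not the actual source), the partition count plays the same role as the `10` argument passed to `run-example`:

```
# Rough pyspark sketch of the SparkPi idea; paste into the pyspark shell,
# where `sc` is the SparkContext provided by the shell.
import random

partitions = 10                     # same role as the "10" argument above
n = 100000 * partitions             # total number of random points to sample

def inside(_):
    # Sample a point in the 2x2 square and test whether it lands
    # inside the unit circle.
    x = random.random() * 2 - 1
    y = random.random() * 2 - 1
    return 1 if x * x + y * y <= 1 else 0

count = sc.parallelize(range(n), partitions).map(inside).reduce(lambda a, b: a + b)
print("Pi is roughly %f" % (4.0 * count / n))
```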

