# Setting Up the Spark Build Environment

To write Spark programs, we first need to set up a build environment. This is where SBT comes in: SBT (Simple Build Tool) is the standard build tool for Scala (it is not a compiler itself; it drives the Scala compiler). For details, see: <https://www.scala-sbt.org/>

Before building, we create a project and re-run the SparkPi example from the earlier chapter. The build steps below mainly follow this guide:\
<https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_app_example.html>\
Since a few things were modified, the commands are recorded here:

First, create the project folders and copy SparkPi into the source directory:

```
mkdir -p ~/spark-submit/project
mkdir -p ~/spark-submit/src/main/scala
cp /usr/lib/spark/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala ~/spark-submit/src/main/scala/SparkPi.scala
```
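
For reference, these commands produce SBT's standard project layout; the two build files are created in the next steps:

```
spark-submit/
├── build.sbt                 # build definition (created below)
├── project/
│   └── build.properties      # pins the SBT version (created below)
└── src/
    └── main/
        └── scala/
            └── SparkPi.scala # the copied example source
```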

Next, create the SBT build definition:

```
vim ~/spark-submit/build.sbt
```

Paste in the following content (remember to adjust the Scala version number to match your installation):

```
name := "SparkPi"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
```
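
A note on the `%%` operator used above: it appends the Scala binary version taken from `scalaVersion` to the artifact name, so with Scala 2.11 the two declarations below resolve to the same artifact:

```
// equivalent ways to declare the same dependency (Scala 2.11)
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"
libraryDependencies += "org.apache.spark" %  "spark-core_2.11" % "2.1.2"
```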

Then pin the SBT version (substitute the version installed on your machine):

```
vim ~/spark-submit/project/build.properties
```

and enter `sbt.version=1.2.6`.
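
If you are unsure which version numbers to put in `build.sbt` and `build.properties`, the following commands report them; the exact output varies by installation:

```
spark-submit --version   # banner includes the Spark version and "Using Scala version 2.11.x"
sbt sbtVersion           # prints the SBT version in use (run inside the project directory)
```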

Once that is done, the three commands below compile, run, and package the project. The first build downloads a large number of dependencies, so expect a long wait.

```
cd ~/spark-submit
sbt compile   # compile the project
sbt run       # run the project
sbt package   # package the project into a JAR
```
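
Since each `sbt` invocation starts a fresh JVM, repeated builds are usually faster from SBT's interactive shell, where the same commands can be issued at the prompt:

```
cd ~/spark-submit
sbt        # launch the interactive shell once
> compile
> run
> package
```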

For more on compiling, running, and packaging with SBT, see this article:

<https://alvinalexander.com/scala/sbt-how-to-compile-run-package-scala-project>

It walks through a plain Scala project, which makes the relationship between SBT and Scala easier to see. The packaged JAR ends up at `~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar` (SBT derives the file name from the `name`, `scalaVersion`, and `version` settings in build.sbt).

Next, simply submit the JAR to Spark:

```
$ spark-submit ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
```

Various optional parameters can be passed to `spark-submit`; see:\
<https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_submit_example.html#example-running-a-spark-application-with-optional-parameters>
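
As a concrete (hypothetical) illustration, a submission with a few common options might look like this; `--master local[4]` runs locally with four worker threads, and the trailing `100` is SparkPi's optional slice-count argument:

```
spark-submit \
  --master local[4] \
  --driver-memory 1g \
  ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar 100
```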

If the following error appears at run time:

`"ERROR SparkContext: Error initializing SparkContext.`\
`org.apache.spark.SparkException: A master URL must be set in your configuration"`

this happens because the original code never sets a master URL, so Spark does not know where to run. Modify this part of the source to add the `spark.master` setting, then rebuild and rerun:

```
val spark = SparkSession
  .builder
  .appName("Spark Pi")
  .config("spark.master", "local")  // added: run Spark on the local machine
  .getOrCreate()
```
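
Alternatively, you can leave the source untouched and supply the master on the command line instead. Note that a master hard-coded via `.config(...)` takes precedence over the `--master` flag, so omitting it from the code keeps the JAR reusable on a real cluster:

```
spark-submit --master local[*] ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
```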

The run produces output like the following (some lines omitted):

```
~$ spark-submit ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
...
18/10/29 14:57:13 INFO SparkContext: Starting job: reduce at SparkPi.scala:39
18/10/29 14:57:13 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:39) with 2 output partitions
18/10/29 14:57:13 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:39)
18/10/29 14:57:13 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:57:13 INFO DAGScheduler: Missing parents: List()
18/10/29 14:57:13 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35), which has no missing parents
18/10/29 14:57:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:57:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:57:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:38239 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:57:13 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:57:13 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35)
18/10/29 14:57:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/10/29 14:57:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6015 bytes)
18/10/29 14:57:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:57:13 INFO Executor: Fetching spark://172.16.0.222:37044/jars/sparkpi_2.11-1.0.jar with timestamp 1540825032709
18/10/29 14:57:13 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:37044 after 30 ms (0 ms spent in bootstraps)
18/10/29 14:57:13 INFO Utils: Fetching spark://172.16.0.222:37044/jars/sparkpi_2.11-1.0.jar to /tmp/spark-fb7da981-0380-4559-8923-11fb45df3b3a/userFiles-b1fe8cef-a836-47cc-b45c-2e54801f851e/fetchFileTemp4137346296745695944.tmp
18/10/29 14:57:13 INFO Executor: Adding file:/tmp/spark-fb7da981-0380-4559-8923-11fb45df3b3a/userFiles-b1fe8cef-a836-47cc-b45c-2e54801f851e/sparkpi_2.11-1.0.jar to class loader
18/10/29 14:57:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
18/10/29 14:57:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6015 bytes)
18/10/29 14:57:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:57:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 246 ms on localhost (executor driver) (1/2)
18/10/29 14:57:13 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1041 bytes result sent to driver
18/10/29 14:57:13 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:39) finished in 0.271 s
18/10/29 14:57:13 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 19 ms on localhost (executor driver) (2/2)
18/10/29 14:57:13 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:39, took 0.510812 s
18/10/29 14:57:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Pi is roughly 3.1450957254786274
...
```
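
A side note, not from the original guide: the INFO lines above come from Spark's default log4j settings. For quieter output, one common approach (assuming your Spark lives under /usr/lib/spark, as in the copy command earlier) is to create a `log4j.properties` from the bundled template and lower the root log level:

```
cp /usr/lib/spark/conf/log4j.properties.template /usr/lib/spark/conf/log4j.properties
# then edit log4j.properties and change the root level, e.g.:
# log4j.rootCategory=WARN, console
```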

