Setting Up the Spark Build Environment

An SBT-Based Scala Build Environment

To write Spark programs, we first need to set up a build environment. For this we use SBT. SBT (Simple Build Tool) is the standard build tool for Scala projects; if you are interested, see: https://www.scala-sbt.org/

Before building anything, we need to create a project and re-run the SparkPi example from earlier. The setup steps below are based mainly on this guide: https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_app_example.html Since a few things were modified along the way, the commands are recorded here:

First, create the project directories and copy SparkPi into them:

mkdir -p ~/spark-submit/project
mkdir -p ~/spark-submit/src/main/scala
cp /usr/lib/spark/examples/src/main/scala/org/apache/spark/examples/SparkPi.scala ~/spark-submit/src/main/scala/SparkPi.scala
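
For context, SparkPi estimates π by the Monte Carlo method: it throws random points into the unit square and counts how many land inside the unit circle. The sketch below is reconstructed from memory of the Spark 2.x examples, so the file you just copied may differ in small details:

// Sketch of SparkPi.scala (reconstructed; your copy may differ slightly)
package org.apache.spark.examples  // the copied file keeps its original package

import scala.math.random
import org.apache.spark.sql.SparkSession

object SparkPi {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder
      .appName("Spark Pi")
      .getOrCreate()
    // Optional first argument: number of partitions ("slices") to use
    val slices = if (args.length > 0) args(0).toInt else 2
    val n = math.min(100000L * slices, Int.MaxValue).toInt  // avoid Int overflow
    val count = spark.sparkContext.parallelize(1 until n, slices).map { _ =>
      val x = random * 2 - 1
      val y = random * 2 - 1
      if (x * x + y * y <= 1) 1 else 0   // 1 if the point falls inside the circle
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * count / (n - 1))
    spark.stop()
  }
}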

Next, set up the SBT build definition:

vim ~/spark-submit/build.sbt

Paste in the following content (remember to adjust the Scala and Spark version numbers to match your installation; a way to check them is shown after the listing):

name := "SparkPi"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.2"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.2"
resolvers += "Akka Repository" at "http://repo.akka.io/releases/"
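
If you are not sure which version numbers to put in build.sbt, spark-submit --version prints both the Spark version and the Scala version it was built against (output abbreviated here; your version numbers will vary with your installation):

spark-submit --version
# Welcome to ... version 2.1.2
# Using Scala version 2.11.8, ...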

Next, specify the SBT version (match it to the version installed on your machine):

vim ~/spark-submit/project/build.properties

and enter the following line:

sbt.version=1.2.6
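
If you are not sure which SBT version you have, one way to check, assuming the sbt launcher is already on your PATH, is to ask SBT itself (without a build.properties file, SBT reports its launcher's default version, which is exactly the value you want to record):

cd ~/spark-submit
sbt sbtVersion    # prints the SBT version in use for this project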

Once that is done, the following three commands compile, run, and package the project. The first build downloads a large number of dependencies, so expect to wait a while.

cd ~/spark-submit
sbt compile   # compile the project
sbt run       # run the project
sbt package   # package the project into a JAR
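
Instead of invoking SBT once per command, you can also start its interactive shell, which avoids paying the JVM startup cost on every invocation. A minimal session looks like this:

sbt             # start the interactive shell in ~/spark-submit
> compile
> package
> exit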

For more on building with SBT, this article is a good reference:

https://alvinalexander.com/scala/sbt-how-to-compile-run-package-scala-project

That article uses a plain Scala project as its example, which makes the relationship between SBT and Scala easier to understand. The compiled JAR file ends up at ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
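
To verify the packaging, and to find the fully qualified name of the main class (useful for spark-submit's --class flag later), you can list the JAR's contents with the standard JDK jar tool:

jar tf ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
# expect an entry like org/apache/spark/examples/SparkPi.class,
# assuming the copied file kept its original package declaration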

Next, simply run the JAR file with spark-submit:

$ spark-submit ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar

When running spark-submit you can pass various optional parameters; see: https://console.bluemix.net/docs/services/AnalyticsforApacheSpark/spark_submit_example.html#example-running-a-spark-application-with-optional-parameters
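
For illustration, here is a sketch of a parameterized invocation. --class and --master are standard spark-submit flags; the trailing 100 is SparkPi's optional "slices" argument, and the class name assumes the copied file kept its original package declaration (adjust it to match what jar tf showed above):

spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master "local[4]" \
  ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar 100

If the following error appears when you run the application: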

"ERROR SparkContext: Error initializing SparkContext. org.apache.spark.SparkException: A master URL must be set in your configuration"

it is because the original declaration does not specify where the application should run (no master URL). Modify this part of the source code as shown below, then rebuild and run again:

val spark = SparkSession
  .builder
  .appName("Spark Pi")
  .config("spark.master", "local")  // run locally on a single thread
  .getOrCreate()
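
Alternatively, instead of hardcoding the master in the source, you can leave the program unchanged and supply the master URL on the command line, which keeps the same JAR usable across deployment modes:

spark-submit --master "local[*]" ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar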

The output of a run looks like this (some lines omitted):

~$ spark-submit ~/spark-submit/target/scala-2.11/sparkpi_2.11-1.0.jar
...
18/10/29 14:57:13 INFO SparkContext: Starting job: reduce at SparkPi.scala:39
18/10/29 14:57:13 INFO DAGScheduler: Got job 0 (reduce at SparkPi.scala:39) with 2 output partitions
18/10/29 14:57:13 INFO DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:39)
18/10/29 14:57:13 INFO DAGScheduler: Parents of final stage: List()
18/10/29 14:57:13 INFO DAGScheduler: Missing parents: List()
18/10/29 14:57:13 INFO DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35), which has no missing parents
18/10/29 14:57:13 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1832.0 B, free 366.3 MB)
18/10/29 14:57:13 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1172.0 B, free 366.3 MB)
18/10/29 14:57:13 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on 172.16.0.222:38239 (size: 1172.0 B, free: 366.3 MB)
18/10/29 14:57:13 INFO SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:996
18/10/29 14:57:13 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:35)
18/10/29 14:57:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
18/10/29 14:57:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 6015 bytes)
18/10/29 14:57:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/10/29 14:57:13 INFO Executor: Fetching spark://172.16.0.222:37044/jars/sparkpi_2.11-1.0.jar with timestamp 1540825032709
18/10/29 14:57:13 INFO TransportClientFactory: Successfully created connection to /172.16.0.222:37044 after 30 ms (0 ms spent in bootstraps)
18/10/29 14:57:13 INFO Utils: Fetching spark://172.16.0.222:37044/jars/sparkpi_2.11-1.0.jar to /tmp/spark-fb7da981-0380-4559-8923-11fb45df3b3a/userFiles-b1fe8cef-a836-47cc-b45c-2e54801f851e/fetchFileTemp4137346296745695944.tmp
18/10/29 14:57:13 INFO Executor: Adding file:/tmp/spark-fb7da981-0380-4559-8923-11fb45df3b3a/userFiles-b1fe8cef-a836-47cc-b45c-2e54801f851e/sparkpi_2.11-1.0.jar to class loader
18/10/29 14:57:13 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1041 bytes result sent to driver
18/10/29 14:57:13 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 6015 bytes)
18/10/29 14:57:13 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
18/10/29 14:57:13 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 246 ms on localhost (executor driver) (1/2)
18/10/29 14:57:13 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 1041 bytes result sent to driver
18/10/29 14:57:13 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:39) finished in 0.271 s
18/10/29 14:57:13 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 19 ms on localhost (executor driver) (2/2)
18/10/29 14:57:13 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:39, took 0.510812 s
18/10/29 14:57:13 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
Pi is roughly 3.1450957254786274
...
