Spark

Description
Spark is a library and cluster management system that allows parallel, in-memory programming for data processing.
Modulefile
spark
Documentation
Spark homepage: https://spark.apache.org/
Usage
To submit a Spark job, use the following submission script as an example:
#!/bin/bash
#$ -cwd
#$ -j yes
#$ -l h_rt=1:00:0
#$ -pe spark 16
#$ -N spark-test
module list
${SPARK_HOME}/bin/spark-submit --executor-memory 1g --master `cat spark-logs-$JOB_ID/master_url` ./wordcount.py generated.txt wordcounts

Unlike with most other software at ACENET, you must load the appropriate environment modules before you submit the job:

$ module load java/8u45 spark
$ qsub spark-test.sh

Note also the use of the "spark" parallel environment. This environment grants whole nodes only, since Spark by default uses all of the resources on a given host. You must therefore request 16, 32, 48, etc. cores on the Placentia, Fundy, and Glooscap clusters, and 4, 8, 12, etc. cores on Mahone.
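For example, assuming Mahone's 4-core nodes, a request for two whole nodes would use the following line in the submission script (a sketch; adjust the slot count to a multiple of the node size on your cluster):

#$ -pe spark 8

The arguments ./wordcount.py generated.txt wordcounts at the end of the spark-submit line specify the Python script to run and its command line arguments. An example wordcount.py file is given below to help get you started with Spark: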

from pyspark import SparkContext, SparkConf
import sys
import os

# Input file: first command line argument, or ./generated.txt by default
if len(sys.argv) > 1:
  inputFileName = sys.argv[1]
else:
  inputFileName = "./generated.txt"
assert os.path.exists(inputFileName), "input file not found: " + inputFileName

# Output directory: second command line argument, or ./wordcounts by default.
# Spark will not overwrite existing output, so the directory must not exist yet.
if len(sys.argv) > 2:
  outputDirectoryName = sys.argv[2]
else:
  outputDirectoryName = "./wordcounts"
assert not os.path.exists(outputDirectoryName), "output directory already exists: " + outputDirectoryName

# Connect to the Spark cluster started by the job script
conf = SparkConf().setAppName("wordCount")
sc = SparkContext(conf=conf)

# Read the input file, split each line into words, emit (word, 1) pairs,
# and sum the counts for each word
lines = sc.textFile(inputFileName)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Write the (word, count) pairs as text files in the output directory
counts.saveAsTextFile(outputDirectoryName)
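
When the job completes, saveAsTextFile writes the results as a directory of part-files (plus a _SUCCESS marker) rather than a single file. One way to view the combined output from the command line (a sketch; the exact part-file names may vary):

$ cat wordcounts/part-*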

The input file generated.txt given to the wordcount.py Spark script is a plain ASCII text file consisting of words separated by spaces and spread over multiple lines.
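
Any plain text file of space-separated words will do. If you need something to test with, a small generated.txt can be created by hand, for example (a minimal sketch; substitute any text you like):

$ printf 'the quick brown fox jumps over the lazy dog\nthe quick red fox\n' > generated.txt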