Spark

Description
Spark is a library and cluster management system that allows parallel, in-memory programming for data processing.
Modulefile
spark
Documentation
Spark homepage: https://spark.apache.org/
Usage
To submit a Spark job, use the following submission script as an example:
#!/bin/bash
#$ -cwd
#$ -j yes
#$ -l h_rt=1:00:0
#$ -pe spark 16
#$ -N spark-test
module list
${SPARK_HOME}/bin/spark-submit --executor-memory 1g --master `cat spark-logs-$JOB_ID/master_url` ./wordcount.py generated.txt wordcounts

Unlike with most other software at ACENET, you must load the appropriate environment modules before you submit the job:

$ module load java/8u45 spark
$ qsub spark-test.sh

Note also the use of the "spark" parallel environment. This environment grants whole nodes only, since Spark by default uses all of the resources on a given host. You must therefore request 16, 32, 48, etc. cores on the Placentia, Fundy, and Glooscap clusters, and 4, 8, 12, etc. cores on Mahone.
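For example, assuming Mahone's 4-core nodes, a request for two whole nodes would use the following line in the submission script (a sketch; adjust the slot count to a multiple of the node size on your cluster):

#$ -pe spark 8

The arguments ./wordcount.py generated.txt wordcounts at the end of the spark-submit line specify the Python script to run and its command line arguments. An example wordcount.py file is given below to help get you started with Spark: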

from pyspark import SparkContext, SparkConf
import sys
import os

# Input file: first command line argument, or ./generated.txt by default
if len(sys.argv) > 1:
  inputFileName = sys.argv[1]
else:
  inputFileName = "./generated.txt"
assert os.path.exists(inputFileName), "input file not found: " + inputFileName

# Output directory: second command line argument, or ./wordcounts by default.
# Spark will not overwrite existing output, so the directory must not exist yet.
if len(sys.argv) > 2:
  outputDirectoryName = sys.argv[2]
else:
  outputDirectoryName = "./wordcounts"
assert not os.path.exists(outputDirectoryName), "output directory already exists: " + outputDirectoryName

# Connect to the Spark cluster started by the job script
conf = SparkConf().setAppName("wordCount")
sc = SparkContext(conf=conf)

# Read the input file, split each line into words, emit (word, 1) pairs,
# and sum the counts for each word
lines = sc.textFile(inputFileName)
counts = lines.flatMap(lambda line: line.split(" ")) \
              .map(lambda word: (word, 1)) \
              .reduceByKey(lambda a, b: a + b)

# Write the (word, count) pairs as text files in the output directory
counts.saveAsTextFile(outputDirectoryName)
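
When the job completes, saveAsTextFile writes the results as a directory of part-files (plus a _SUCCESS marker) rather than a single file. One way to view the combined output from the command line (a sketch; the exact part-file names may vary):

$ cat wordcounts/part-*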

The input file generated.txt given to the wordcount.py Spark script is a plain ASCII text file consisting of words separated by spaces and spread over multiple lines.
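
Any plain text file of space-separated words will do. If you need something to test with, a small generated.txt can be created by hand, for example (a minimal sketch; substitute any text you like):

$ printf 'the quick brown fox jumps over the lazy dog\nthe quick red fox\n' > generated.txt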