Apache Spark is an open-source framework for large-scale data processing and analysis. Using Apache Spark with Python (PySpark), this workshop shows how to analyze data sets that are too large to be processed on a single computer.
In a hands-on format, participants learn to import data and to apply functions that transform, reduce, and aggregate it. They also learn to develop parallel algorithms that can run on Alliance clusters.
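As a preview of the map-and-reduce pattern the workshop applies with PySpark, the sketch below expresses a word count in plain Python; it is only an illustrative stand-in, since in Spark the same two steps would run in parallel across a cluster rather than serially on one machine.

```python
from functools import reduce

# A tiny corpus standing in for a large data set.
lines = ["big data with spark", "spark runs in parallel", "big clusters"]

# Map step: emit a (word, 1) pair for every word in every line.
pairs = [(word, 1) for line in lines for word in line.split()]

# Reduce step: combine the pairs into per-word counts.
def merge(counts, pair):
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts

word_counts = reduce(merge, pairs, {})
print(word_counts["spark"])  # → 2
```

In PySpark the same logic is typically written with `flatMap` to produce the pairs and `reduceByKey` to combine them, which lets Spark distribute both steps.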
The workshop covers the following topics:
- Introduction to big data and map-reduce
- Overview of Apache Spark
- Importing data with PySpark
- Sorting data by key/value
- Working with structured data (PySpark SQL)
- Developing parallel algorithms
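The key/value topic above can be previewed with a small serial sketch: the plain-Python code below groups values by key, reduces each group, and sorts the results by key, which is the single-machine equivalent of what PySpark's `reduceByKey` and `sortByKey` do across a cluster. The city/temperature records are hypothetical sample data, not from the workshop.

```python
# Key/value records: (city, temperature) pairs, a stand-in for a large data set.
records = [("quebec", -10), ("toronto", -4), ("quebec", -7), ("toronto", -1)]

# Group values by key (the serial analogue of Spark grouping pairs by key).
grouped = {}
for key, value in records:
    grouped.setdefault(key, []).append(value)

# Reduce each group to a maximum, then sort the results by key
# (the serial equivalent of reduceByKey followed by sortByKey).
max_by_key = sorted((key, max(values)) for key, values in grouped.items())
print(max_by_key)  # → [('quebec', -7), ('toronto', -1)]
```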