ACENET: Big Data Analysis with Spark

Tue, Feb 20, 2024 1:00pm - Thu, Feb 22, 2024 4:00pm

Status: Completed

This workshop introduces learners to data analysis with Apache Spark.

Apache Spark is a user-friendly open-source platform for large-scale data processing, analytics, and parallel computing. Using Apache Spark with Python (PySpark), this workshop focuses on analyzing data sets that are too large to be handled and processed by a single computer.

With hands-on guided examples, the workshop covers the basics of Spark and the high-level architecture of Resilient Distributed Datasets (RDDs). The examples are written mainly in Python, so the APIs covered are those available in PySpark, including the Spark Core API (RDD API), Spark SQL, and the pandas API on Spark.

Participants will learn how to import data, use functions to transform, reduce, and compile it, and write parallel algorithms that can run on Alliance clusters.
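
For a rough sense of the transform-and-reduce pattern mentioned above, here is a minimal sketch in plain Python that needs no Spark installation. It is not taken from the workshop materials: the sample lines and the helper name `merge_count` are hypothetical, chosen only for illustration. In PySpark, the corresponding steps would typically use RDD methods such as `flatMap`, `map`, and `reduceByKey`, which spread the same work across many machines.

```python
from functools import reduce

# Hypothetical input: in a real Spark job this would be loaded from a large file.
lines = [
    "spark makes big data simple",
    "big data needs parallel tools",
]

# Transform: split every line into words (similar in spirit to RDD.flatMap).
words = [word for line in lines for word in line.split()]

# Transform: pair each word with a count of 1 (similar in spirit to RDD.map).
pairs = [(word, 1) for word in words]


def merge_count(counts, pair):
    """Fold one (word, n) pair into the running dictionary of counts."""
    word, n = pair
    counts[word] = counts.get(word, 0) + n
    return counts


# Reduce: combine all pairs into per-word totals (similar in spirit to RDD.reduceByKey).
word_counts = reduce(merge_count, pairs, {})

print(word_counts)
```

Because each step is written as a function applied to independent records, the same logic can in principle be distributed over a cluster, which is the idea Spark builds on.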

Prerequisites:

  • ACENET Basics (or experience using the command line to log in to remote systems)
  • Ability to write functions in Python

This workshop will take place over two sessions:

Session 1: Tuesday, February 20, 2024 - 1:00 - 4:00pm (Atlantic time)

Session 2: Thursday, February 22, 2024 - 1:00 - 4:00pm (Atlantic time)

The workshop will be delivered online.

Participants must register using their institutional or organizational email address (not a personal email account, e.g., Gmail).

The sessions will not be recorded.
