Introduction to Big Data and Spark

Learn about Big Data Architectures, Hadoop and Spark

  • Level: beginner
  • Duration: 1-day course
  • Delivered: in-house

What you will learn

This one-day course provides a hands-on introduction to the Big Data ecosystem and to using Hadoop and Apache Spark in practice.

  • Understand the challenges in the Big Data ecosystem
  • Describe the fundamentals of the Hadoop ecosystem
  • Use the core Spark APIs to express data processing queries

Languages and libraries

  • Python programming language
  • Hadoop
  • Spark


Session 1

Introduction to "Big Data"

  • Volume, Velocity, Variety
  • Scaling horizontally
  • Batch vs Streaming
  • NoSQL landscape
  • Lambda architecture

Session 2

Hadoop Ecosystem

  • Architecture overview
  • HDFS
  • The MapReduce pattern
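As a taste of the MapReduce pattern covered in this session, here is a minimal word-count sketch in plain Python; it stands in for the distributed framework, where the map, shuffle, and reduce steps would run in parallel over HDFS blocks (the input lines here are illustrative):

```python
from collections import defaultdict
from functools import reduce

lines = ["big data big spark", "hadoop spark"]

# Map: emit (key, value) pairs for each input record.
mapped = [(word, 1) for line in lines for word in line.split()]

# Shuffle: group values by key (Hadoop does this for you between map and reduce).
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: combine the values for each key.
counts = {word: reduce(lambda a, b: a + b, vals) for word, vals in groups.items()}
print(counts)  # {'big': 2, 'data': 1, 'spark': 2, 'hadoop': 1}
```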

Session 3

Apache Spark

  • Architecture overview
  • Resilient Distributed Datasets (RDDs)
  • Transformations, Actions, and the DAG
  • RDD programming API
  • Using Amazon EMR and Spark

Session 4

Spark Optimizations

  • RDD caching
  • Broadcast variables
  • Accumulators
  • Pipeline tuning


Prerequisites: good working knowledge of Python, some familiarity with matrices, and a basic understanding of machine learning practice (as taught in Introduction to Data Science).


Who should attend

Those who are curious about the Big Data space and want to feel comfortable getting their hands dirty with high-volume, high-velocity, diverse real-world datasets.

Get in touch

Get in touch to discuss team size, pricing and your tech requirements. Send us an email or fill in our contact form, and we'll get back to you soon.

Contact our team