Introduction to Big Data and Spark

Learn about Big Data Architectures, Hadoop and Spark


  • Level: beginner
  • Duration: 1-day course
  • Delivered: in-house

What you will learn

You will learn the fundamentals of Big Data architectures and gain hands-on experience with Hadoop and Spark, skills that are in demand in industry and research.

  • Understand the challenges in the Big Data ecosystem
  • Describe the fundamentals of the Hadoop ecosystem
  • Use the core Spark APIs to express data processing queries

Languages and libraries

  • Python programming language
  • Hadoop
  • Spark

Outline

Session 1

Introduction to "Big Data"

  • Volume, Velocity, Variety
  • Scaling horizontally
  • Batch vs Streaming
  • NoSQL landscape
  • Lambda architecture

Session 2

Hadoop Ecosystem

  • Architecture overview
  • HDFS
  • The MapReduce pattern
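To give a flavour of the MapReduce pattern covered in this session, here is a minimal word-count sketch in plain Python. The map, shuffle, and reduce function names are illustrative, not the Hadoop API; on a real cluster the framework distributes each phase across many machines.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (word, 1) pair for every word in every input line.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group all values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["to be or not to be"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["to"])  # 2
```

Because each phase only sees independent keys or pairs, the same program scales horizontally from one laptop to hundreds of nodes.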

Session 3

Spark

  • Architecture overview
  • Resilient Distributed Datasets (RDDs)
  • Transformations, Actions, and the DAG
  • RDD programming API
  • Using Amazon EMR and Spark
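The key idea behind the RDD API is that transformations (map, filter, ...) are lazy and only build up a lineage, while actions (collect, count, ...) trigger execution. The toy class below sketches that distinction in plain Python; it is a conceptual stand-in, not Spark's actual implementation.

```python
class ToyRDD:
    """A toy stand-in for a Spark RDD: transformations are lazy, actions run the plan."""
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []  # recorded transformations (the lineage/DAG)

    # Transformations: return a new ToyRDD with an extended plan; nothing runs yet.
    def map(self, fn):
        return ToyRDD(self._data, self._plan + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self._data, self._plan + [("filter", pred)])

    # Actions: replay the recorded plan over the data and return a result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._plan:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark the recorded plan is a DAG of stages that the scheduler optimises and ships to executors; the laziness is what makes that whole-pipeline optimisation possible.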

Session 4

Tuning

  • RDD caching
  • Broadcast variables
  • Accumulators
  • Pipeline tuning
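Why RDD caching matters in one picture: unless an intermediate result is persisted, Spark recomputes its entire lineage for every action. The plain-Python analogy below (the names are illustrative, not Spark API) counts how often the "expensive" work actually runs with and without materialising the intermediate result.

```python
calls = {"n": 0}

def expensive_transform(x):
    calls["n"] += 1  # count how many times the work actually runs
    return x * x

data = list(range(5))

# Without caching, every action recomputes the whole lineage:
result1 = [expensive_transform(x) for x in data]  # action 1
result2 = [expensive_transform(x) for x in data]  # action 2
print(calls["n"])  # 10 -- the transform ran twice per element

# With caching, the intermediate result is materialised once and reused:
calls["n"] = 0
cached = [expensive_transform(x) for x in data]   # computed once, kept in memory
action1 = list(cached)                            # action 1 reads the cache
action2 = list(cached)                            # action 2 reads the cache
print(calls["n"])  # 5
```

In Spark this is a one-line change (calling cache() or persist() on the RDD), and the same cost intuition drives the broadcast-variable and accumulator topics in this session: ship shared read-only data once, and aggregate side-effects safely across tasks.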

Prerequisites

Good knowledge of Python, some familiarity with matrices, and a basic understanding of machine learning practice (as taught in Introduction to Data Science).

Audience

Those who are curious about the Big Data space and who want to feel comfortable getting their hands dirty with high volume, high velocity, diverse real-world datasets.


Get in touch

Get in touch to discuss team size, pricing, and your tech requirements. Email training@cambridgespark.com or fill in our contact form, and we'll get back to you soon.

Contact our team