Big Data Analytics in Practice

Learn about Big Data architectures, Hadoop, Spark, Kafka, Spark Streaming and Spark SQL

No upcoming dates. Sign up to our newsletter to find out about the next date.


Level

intermediate

LONDON

CAMBRIDGE

OXFORD

Duration

Two-day course



What you will learn

You will learn about Big Data, the Hadoop ecosystem and Spark in practice. After taking this class you will be able to:

  • Understand the challenges in the Big Data ecosystem
  • Describe the fundamentals of the Hadoop ecosystem
  • Use the core Spark APIs to express data processing queries
  • Design a Big Data architecture
  • Use Kafka for data ingestion
  • Implement stream processing using Spark Streaming
  • Understand Big Data file formats

Languages and libraries

  • Python programming language
  • Hadoop
  • Spark
  • Kafka

Progression paths

Learn state-of-the-art machine learning techniques at our Machine Learning Techniques using Python bootcamp.

Acquire specialised Natural Language Processing skills at our Text Mining and Natural Language Processing with Python bootcamp.


Prerequisites

Audience: Those who are curious about the Big Data space and who want to feel comfortable getting their hands dirty with high-volume, high-velocity, diverse real-world datasets.

Prerequisites: Good knowledge of Python, some familiarity with matrices, and a basic understanding of machine learning practice (as taught in Introduction to Data Science).

Day 1

Introduction to Big Data and Batch Processing

Session 1

Introduction to "Big Data"

  • Volume, Velocity, Variety
  • Scaling horizontally
  • Batch vs Streaming
  • NoSQL landscape
  • Lambda architecture

Session 2

Hadoop Ecosystem

  • Architecture overview
  • HDFS
  • The MapReduce pattern

Session 3

Spark

  • Architecture overview
  • Resilient Distributed Datasets (RDDs)
  • Transformations, actions, and the DAG
  • RDD programming API
  • Using Amazon EMR and Spark

Session 4

Tuning

  • RDD caching
  • Broadcast variables
  • Accumulators
  • Pipeline tuning

Evening

Social

  • Drinks with fellow participants and lecturers

Day 2

Fast Data and Stream Processing

Session 1

Big Data Architectures

  • Types of systems: OLAP vs OLTP
  • Components of a big data architecture
  • Case studies

Session 2

File formats for big data

  • Plain text: CSV, TSV, JSON
  • Set schema: Thrift, Protobuf, Avro
  • Columnar: Parquet, RCFile, ORC

Session 3

Data Ingestion

  • Messaging patterns
  • Kafka architecture
  • Kafka streams

Session 4

Stream processing

  • Spark Streaming
  • Using discretized streams (DStreams)
  • Structured Streaming

Session 5

Serving with fast high-level APIs

  • Spark DataFrames and Datasets
  • Spark SQL

Continuous learning project

Our continuous learning project comprises a real-world problem and dataset to work through in your own time, so you can practise the material and techniques covered during the bootcamp. The package includes model notebook answers with a detailed explanation of the solution and the problem-solving process.

Price: £100 extra

Highlights

Check out video highlights, photos and interviews from our previous bootcamps.


In-house Training

Get in touch to discuss your requirements by emailing contact@cambridgespark.com or by completing our contact form.

We can deliver this course as private training at your office on weekdays.

We can also design a bespoke curriculum matching your specific training objectives.

Coming soon

We currently don't have a scheduled date.

Contact us to register your interest and to get notified about the next course.