Apache Spark Course Curriculum
The sections below describe the complete Apache Spark training course.
Introduction To Big Data And Spark
In this module, learn how to apply data science techniques with parallel programming to explore big (and small) data.
Introduction to Big Data
Challenges with Big Data
Batch vs. Real-Time Big Data Analytics
Batch Analytics – Hadoop Ecosystem Overview
Real-Time Analytics Options
Streaming Data – Storm
In-Memory Data – Spark
What is Spark?
Modes of Spark
Spark Installation Demo
Overview of Spark on a cluster
Spark Standalone Cluster
Spark Baby Steps
In this module, learn how to invoke the Spark shell, build a Spark project with sbt, use distributed persistence, and much more.
Invoking Spark Shell
Creating the Spark Context
Loading a File in Shell
Performing Some Basic Operations on Files in Spark Shell
Building a Spark Project with sbt
Running a Spark Project with sbt
Caching Overview
Distributed Persistence
Spark Streaming Overview
Example: Streaming Word Count
Playing With RDDs In Spark
The main abstraction Spark provides is a resilient distributed dataset (RDD), which is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDDs
Spark Transformations in RDD
Actions in RDD
Loading Data in RDD
Saving Data through RDD
Spark Key-Value Pair RDD
Map Reduce and Pair RDD Operations in Spark
Scala and Hadoop Integration Hands on
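The RDD description at the top of this module (a collection split into partitions that are processed in parallel) can be illustrated with a small conceptual sketch. This is not Spark code: it uses only Python's standard library, threads stand in for cluster nodes, and the names partition and map_reduce are hypothetical helpers invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
from operator import add


def partition(elements, num_partitions):
    """Split a list into num_partitions contiguous, roughly equal chunks."""
    size, extra = divmod(len(elements), num_partitions)
    chunks, start = [], 0
    for i in range(num_partitions):
        # The first `extra` chunks take one additional element each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(elements[start:end])
        start = end
    return chunks


def map_reduce(elements, map_fn, reduce_fn, num_partitions=4):
    """Apply map_fn to every element, then combine results with reduce_fn.

    Each partition is processed independently (a thread stands in for a
    cluster node), and the per-partition results are combined at the end.
    reduce_fn should be associative so the answer does not depend on how
    the data happens to be partitioned.
    """
    def process(chunk):
        return reduce(reduce_fn, map(map_fn, chunk))

    # Skip empty partitions, which occur when there are more partitions
    # than elements.
    non_empty = [c for c in partition(elements, num_partitions) if c]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(process, non_empty))
    return reduce(reduce_fn, partials)


numbers = list(range(1, 11))
total_of_squares = map_reduce(numbers, lambda x: x * x, add)
print(total_of_squares)  # 385
```

In Spark itself, the same map-then-reduce shape is expressed with the parallelize, map, and reduce operations on the Spark context and the RDD, and Spark handles the partitioning and scheduling across the cluster.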
Shark – When Spark Meets Hive
Shark is a component of Spark, an open-source, distributed, fault-tolerant, in-memory analytics system that can be installed on the same cluster as Hadoop. This module of the Spark training introduces Shark.
Why Shark?
Installing Shark
Running Shark
Loading of Data
Hive Queries through Spark
Testing Tips in Scala
Performance Tuning Tips in Spark
Shared Variables: Broadcast Variables
Shared Variables: Accumulators