Prerequisites
To apply for the Hadoop Training in India, you should meet the following prerequisites:
- Knowledge of at least one programming language, such as Java, Python, or R, in order to work with big data analytics tools.
- Basic knowledge of databases and SQL, in order to retrieve and manipulate data.
- Familiarity with basic statistics (for example, regression and distributions) and with mathematical topics such as linear algebra and calculus.
Course Curriculum
Module 1: Introduction to Big Data and Hadoop
- 1.1 Introduction to Big Data and Hadoop
- 1.2 Introduction to Big Data
- 1.3 Big Data Analytics
- 1.4 What is Big Data
- 1.5 Four Vs of Big Data
- 1.6 Case Study: Royal Bank of Scotland
- 1.7 Challenges of Traditional System
- 1.8 Distributed Systems
- 1.9 Introduction to Hadoop
- 1.10 Components of Hadoop Ecosystem
- 1.11 Commercial Hadoop Distributions
Module 2: Hadoop Architecture, Distributed Storage (HDFS) and YARN
- 2.1 Introduction to Hadoop Architecture, Distributed Storage (HDFS), and YARN
- 2.2 What Is HDFS
- 2.3 Need for HDFS
- 2.4 Regular File System vs HDFS
- 2.5 Characteristics of HDFS
- 2.6 HDFS Architecture and Components
- 2.7 High Availability Cluster Implementations
- 2.8 HDFS Component File System Namespace
- 2.9 Data Block Split
- 2.10 Data Replication Topology
- 2.11 HDFS Command Line
- 2.12 YARN Introduction
- 2.13 YARN Use Case
- 2.14 YARN and Its Architecture
- 2.15 Resource Manager
- 2.16 How Resource Manager Operates
- 2.17 Application Master
- 2.18 How YARN Runs an Application
- 2.19 Tools for YARN Developers
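The block-splitting and replication topics above (2.9 and 2.10) can be previewed without a cluster. The sketch below is a plain-Python illustration of the idea only, not HDFS itself; the DataNode names and round-robin placement are invented for the example (real HDFS placement is rack-aware).

```python
# A framework-free sketch of how HDFS splits a file into fixed-size blocks
# and assigns each block to multiple DataNodes for replication.
# Node names and the round-robin placement are illustrative, not how a
# real NameNode chooses targets.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the common HDFS default
REPLICATION_FACTOR = 3          # the default HDFS replication factor

def split_into_blocks(file_size_bytes, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of the given size occupies."""
    blocks = []
    remaining = file_size_bytes
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(num_blocks, datanodes, replication=REPLICATION_FACTOR):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    placement = {}
    for b in range(num_blocks):
        placement[b] = [datanodes[(b + r) % len(datanodes)]
                        for r in range(replication)]
    return placement

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB remainder.
blocks = split_into_blocks(300 * 1024 * 1024)
print([b // (1024 * 1024) for b in blocks])  # [128, 128, 44]
```

The remainder block shows why HDFS suits large files: only the last block of a file is smaller than the block size.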
Module 3: Data Ingestion into Big Data Systems and ETL
- 3.1 Introduction to Data Ingestion into Big Data Systems and ETL
- 3.2 Overview of Data Ingestion
- 3.3 Apache Sqoop
- 3.4 Sqoop and Its Uses
- 3.5 Sqoop Processing
- 3.6 Sqoop Import Process
- 3.7 Sqoop Connectors
- 3.8 Apache Flume
- 3.9 Flume Model
- 3.10 Scalability in Flume
- 3.11 Components in Flume’s Architecture
- 3.12 Configuring Flume Components
- 3.13 Apache Kafka
- 3.14 Aggregating User Activity Using Kafka
- 3.15 Kafka Data Model
- 3.16 Partitions
- 3.17 Apache Kafka Architecture
- 3.18 Producer Side API Example
- 3.19 Consumer Side API
- 3.20 Consumer Side API Example
- 3.21 Kafka Connect
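Topics 3.15 and 3.16 cover how Kafka spreads a topic's records across partitions. A minimal sketch of the key-to-partition idea, in plain Python: real Kafka producers use murmur2 hashing, so `hashlib.md5` here is only an illustrative stand-in, and the partition count is made up.

```python
# A sketch of how a Kafka producer maps record keys to partitions.
# Records with the same key land in the same partition, which is what
# preserves per-key ordering in Kafka. md5 stands in for Kafka's murmur2.

import hashlib

NUM_PARTITIONS = 4  # illustrative topic layout

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Deterministically map a record key to a partition number."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

events = ["user-1", "user-2", "user-1", "user-3"]
partitions = [partition_for(k) for k in events]
# Both "user-1" events map to the same partition.
```

Because partitioning is deterministic per key, consumers reading a partition see each key's events in the order they were produced.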
Module 4: Distributed Processing – MapReduce Framework and Pig
- 4.1 Introduction to Distributed Processing – MapReduce Framework and Pig
- 4.2 Distributed Processing in MapReduce
- 4.3 Word Count Example
- 4.4 Map Execution Phases
- 4.5 Map Execution in a Distributed Two-Node Environment
- 4.6 MapReduce Jobs
- 4.7 Hadoop MapReduce Job Work Interaction
- 4.8 Setting Up the Environment for MapReduce Development
- 4.9 Set of Classes
- 4.10 Creating a New Project
- 4.11 Advanced MapReduce
- 4.12 Data Types in Hadoop
- 4.13 OutputFormats in MapReduce
- 4.14 Using Distributed Cache
- 4.15 Joins in MapReduce
- 4.16 Replicated Join
- 4.17 Introduction to Pig
- 4.18 Components of Pig
- 4.19 Pig Data Model
- 4.20 Pig Interactive Modes
- 4.21 Pig Operations
- 4.22 Various Relations Performed by Developers
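The word-count example from topic 4.3 can be sketched in plain Python to make the map, shuffle, and reduce phases concrete without a Hadoop cluster; the input lines are invented for illustration.

```python
# Word count as map -> shuffle -> reduce, the classic MapReduce example,
# in plain Python (no Hadoop).

from collections import defaultdict

def map_phase(lines):
    """Mapper: emit a (word, 1) pair for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    """Shuffle: group values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Reducer: sum the counts emitted for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

lines = ["the quick brown fox", "the lazy dog", "The fox"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts["the"], counts["fox"])  # 3 2
```

In real Hadoop the mapper and reducer run as separate tasks on different nodes, and the shuffle moves data between them over the network; the logic, however, is exactly this.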
Module 5: Apache Hive
- 5.1 Introduction to Apache Hive
- 5.2 Hive SQL over Hadoop MapReduce
- 5.3 Hive Architecture
- 5.4 Interfaces to Run Hive Queries
- 5.5 Running Beeline from Command Line
- 5.6 Hive Metastore
- 5.7 Hive DDL and DML
- 5.8 Creating New Table
- 5.9 Data Types
- 5.10 Validation of Data
- 5.11 File Format Types
- 5.12 Data Serialization
- 5.13 Hive Table and Avro Schema
- 5.14 Hive Optimization: Partitioning, Bucketing, and Sampling
- 5.15 Non-Partitioned Table
- 5.16 Data Insertion
- 5.17 Dynamic Partitioning in Hive
- 5.18 Bucketing
- 5.19 What Do Buckets Do
- 5.20 Hive Analytics UDF and UDAF
- 5.21 Other Functions of Hive
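What buckets do (topics 5.18 and 5.19) comes down to one rule: a row goes to bucket `hash(column) mod num_buckets`. The sketch below illustrates that rule in plain Python; the column, values, and bucket count are invented, and Python's `hash` stands in for Hive's own hash function.

```python
# A sketch of Hive bucketing: rows are assigned to a fixed number of
# buckets by hashing the bucketing column, so joins and sampling can
# read a predictable subset of files.

NUM_BUCKETS = 4  # illustrative; set with CLUSTERED BY ... INTO n BUCKETS in Hive

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Hive assigns a row to hash(bucketing_column) mod num_buckets."""
    return hash(value) % num_buckets

rows = [("alice", 31), ("bob", 25), ("carol", 40), ("alice", 31)]
buckets = {}
for user_id, age in rows:
    buckets.setdefault(bucket_for(user_id), []).append((user_id, age))

# Identical keys always share a bucket -- the property that lets Hive
# perform bucketed map-side joins and table sampling.
```

Because equal keys always hash to the same bucket, two tables bucketed the same way on the join key can be joined bucket-by-bucket.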
Module 6: NoSQL Databases – HBase
- 6.1 Introduction to NoSQL Databases – HBase
- 6.2 NoSQL Introduction
- 6.3 HBase Overview
- 6.4 HBase Architecture
- 6.5 Data Model
- 6.6 Connecting to HBase
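The HBase data model from topic 6.5 is essentially a sorted, versioned map: row key, then column family, then column qualifier, then timestamped values. A toy in-memory sketch of that shape, with an invented table and column names (this is not the HBase client API):

```python
# A sketch of the HBase data model:
# row key -> column family -> column qualifier -> versions (timestamp, value).
# A default Get returns the newest version of a cell.

from collections import defaultdict

class SketchHBaseTable:
    def __init__(self):
        # row -> family -> qualifier -> list of (timestamp, value), newest first
        self.rows = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))

    def put(self, row, family, qualifier, value, timestamp):
        cell = self.rows[row][family][qualifier]
        cell.append((timestamp, value))
        cell.sort(reverse=True)  # keep the newest version first

    def get(self, row, family, qualifier):
        """Return the latest value for a cell, like a default HBase Get."""
        versions = self.rows[row][family][qualifier]
        return versions[0][1] if versions else None

table = SketchHBaseTable()
table.put("user#42", "info", "name", "Ada", timestamp=1)
table.put("user#42", "info", "name", "Ada L.", timestamp=2)
print(table.get("user#42", "info", "name"))  # Ada L.
```

Rows in real HBase are also kept sorted by row key across RegionServers, which is what makes range scans by key efficient.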
Module 7: Basics of Functional Programming and Scala
- 7.1 Introduction to the Basics of Functional Programming and Scala
- 7.2 Introduction to Scala
- 7.3 Functional Programming
- 7.4 Programming with Scala
- 7.5 Type Inference Classes Objects and Functions in Scala
- 7.6 Collections
- 7.7 Types of Collections
- 7.8 Scala REPL
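The course teaches these ideas in Scala, but the functional core of the module (higher-order functions over immutable collections) can be previewed in Python, which offers the same map/filter/fold trio that Scala collections expose as `.map`, `.filter`, and `.foldLeft`:

```python
# Higher-order functions over a collection: the functional style this
# module develops in Scala, shown here in Python for a quick preview.

from functools import reduce

numbers = [1, 2, 3, 4, 5]

squares = list(map(lambda n: n * n, numbers))        # Scala: numbers.map(n => n * n)
evens = list(filter(lambda n: n % 2 == 0, numbers))  # Scala: numbers.filter(_ % 2 == 0)
total = reduce(lambda acc, n: acc + n, numbers, 0)   # Scala: numbers.foldLeft(0)(_ + _)

print(squares)  # [1, 4, 9, 16, 25]
print(evens)    # [2, 4]
print(total)    # 15
```

The key habit in both languages is the same: describe *what* to compute by passing functions to collection operations, rather than mutating state in loops.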
Module 8: Apache Spark Next-Generation Big Data Framework
- 8.1 Introduction to Apache Spark Next-Generation Big Data Framework
- 8.2 History of Spark
- 8.3 Limitations of MapReduce in Hadoop
- 8.4 Introduction to Apache Spark
- 8.5 Components of Spark
- 8.6 Application of In-Memory Processing
- 8.7 Hadoop Ecosystem vs Spark
- 8.8 Advantages of Spark
- 8.9 Spark Architecture
- 8.10 Spark Cluster in Real World
Module 9: Spark Core Processing RDD
- 9.1 Processing RDD
- 9.2 Introduction to Spark RDD
- 9.3 RDD in Spark
- 9.4 Creating Spark RDD
- 9.5 Pair RDD
- 9.6 RDD Operations
- 9.7 Demo: Spark Transformation Detailed Exploration Using Scala Examples
- 9.8 Demo: Spark Action Detailed Exploration Using Scala
- 9.9 Caching and Persistence
- 9.10 Storage Levels
- 9.11 Lineage and DAG
- 9.12 Need for DAG
- 9.13 Debugging in Spark
- 9.14 Partitioning in Spark
- 9.15 Scheduling in Spark
- 9.16 Shuffling in Spark
- 9.17 Sort Shuffle
- 9.18 Aggregating Data with Pair RDD
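The defining RDD behaviour covered above (topics 9.2 to 9.6) is laziness: transformations like `map` and `filter` only record a lineage, and nothing runs until an action such as `collect` or `count`. The sketch below imitates that contract in plain Python; the class mirrors a few Spark method names but is not Spark itself.

```python
# A framework-free sketch of RDD laziness: transformations build a new
# RDD describing the computation; actions force the whole lineage to run.

class SketchRDD:
    def __init__(self, compute):
        self._compute = compute  # a no-arg function producing an iterator

    @staticmethod
    def parallelize(data):
        return SketchRDD(lambda: iter(list(data)))

    # --- transformations: return a new RDD, nothing executes yet ---
    def map(self, f):
        return SketchRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return SketchRDD(lambda: (x for x in self._compute() if pred(x)))

    # --- actions: trigger evaluation of the lineage ---
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

rdd = SketchRDD.parallelize(range(10))
result = rdd.map(lambda x: x * 2).filter(lambda x: x > 10).collect()
print(result)  # [12, 14, 16, 18]
```

Note that each action replays the chain from the source, which is exactly why real Spark offers `cache()`/`persist()` (topics 9.9 and 9.10) to keep an intermediate result in memory.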
Module 10: Spark SQL – Processing DataFrames
- 10.1 Introduction to Spark SQL – Processing DataFrames
- 10.2 Spark SQL Introduction
- 10.3 Spark SQL Architecture
- 10.4 DataFrames
- 10.5 Demo: Handling Various Data Formats
- 10.6 Demo: Implement Various DataFrame Operations
- 10.7 Demo: UDF and UDAF
- 10.8 Interoperating with RDDs
- 10.9 Demo: Process DataFrame Using SQL Query
- 10.10 RDD vs DataFrame vs Dataset
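The DataFrame operations demoed in this module (select, filter, grouped aggregation) can be illustrated over rows modelled as dictionaries, without Spark; the column names and data below are invented for the example.

```python
# A sketch of DataFrame-style operations -- select, where, groupBy + avg --
# over plain Python rows. Spark runs the same logical plan distributed
# and optimized by Catalyst; the semantics are what this shows.

from collections import defaultdict

rows = [
    {"dept": "eng", "name": "Ada",   "salary": 120},
    {"dept": "eng", "name": "Alan",  "salary": 110},
    {"dept": "ops", "name": "Grace", "salary": 105},
]

def select(data, *cols):
    """Keep only the named columns of each row."""
    return [{c: r[c] for c in cols} for r in data]

def where(data, pred):
    """Keep only the rows matching the predicate."""
    return [r for r in data if pred(r)]

def group_avg(data, key, value):
    """Group rows by `key` and average `value` within each group."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in data:
        sums[r[key]] += r[value]
        counts[r[key]] += 1
    return {k: sums[k] / counts[k] for k in sums}

high_earners = where(rows, lambda r: r["salary"] > 100)
print(group_avg(high_earners, "dept", "salary"))  # {'eng': 115.0, 'ops': 105.0}
```

The named-column structure is the point of topic 10.10: unlike a raw RDD of objects, a DataFrame's schema lets the engine optimize queries before running them.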
Module 11: Stream Processing Frameworks and Spark Streaming
- 11.1 Introduction to Stream Processing Frameworks and Spark Streaming
- 11.2 Overview of Streaming
- 11.3 Real-Time Processing of Big Data
- 11.4 Data Processing Architectures
- 11.5 Spark Streaming
- 11.6 Introduction to DStreams
- 11.7 Transformations on DStreams
- 11.8 Design Patterns for Using ForeachRDD
- 11.9 State Operations
- 11.10 Windowing Operations
- 11.11 Join Operations – Stream-Dataset Join
- 11.12 Streaming Sources
- 11.13 Structured Spark Streaming
- 11.14 Use Case Banking Transactions
- 11.15 Structured Streaming Architecture Model and Its Components
- 11.16 Output Sinks
- 11.17 Structured Streaming APIs
- 11.18 Constructing Columns in Structured Streaming
- 11.19 Windowed Operations on Event-Time
- 11.20 Use Cases
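The windowing idea behind topics 11.10 and 11.19 reduces to aligning each event's timestamp to the start of its window and aggregating per window. A plain-Python sketch of tumbling-window counting, with invented timestamps and window size (real Spark Streaming additionally handles batching, state, and late data):

```python
# A sketch of tumbling-window counting: each event carries a timestamp,
# and we count events per fixed-size window of event time.

from collections import Counter

WINDOW_SECONDS = 10  # illustrative window size

def window_start(timestamp, window=WINDOW_SECONDS):
    """Align a timestamp to the start of its window."""
    return (timestamp // window) * window

def count_per_window(events):
    """events: iterable of (timestamp_seconds, payload) -> {window_start: count}."""
    counts = Counter()
    for ts, _payload in events:
        counts[window_start(ts)] += 1
    return dict(counts)

events = [(1, "a"), (4, "b"), (9, "c"), (12, "d"), (25, "e")]
print(count_per_window(events))  # {0: 3, 10: 1, 20: 1}
```

A sliding window is the same computation with each event contributing to every window that overlaps its timestamp, rather than exactly one.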
Module 12: Spark MLlib – Modeling Big Data with Spark
- 12.1 Introduction to Spark MLlib Modeling Big Data with Spark
- 12.2 Role of Data Scientist and Data Analyst in Big Data
- 12.3 Analytics in Spark
- 12.4 Machine Learning
- 12.5 Supervised Learning
- 12.6 Demo: Classification of Linear SVM
- 12.7 Demo: Linear Regression with Real-World Case Studies
- 12.8 Unsupervised Learning
- 12.9 Demo: Unsupervised Clustering K-Means
- 12.10 Reinforcement Learning
- 12.11 Semi-Supervised Learning
- 12.12 Overview of MLlib
- 12.13 MLlib Pipelines
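The k-means clustering demoed in topic 12.9 is short enough to sketch in plain Python on one-dimensional points; the data, starting centroids, and iteration count below are invented for illustration, and MLlib's version runs the same iteration distributed over an RDD or DataFrame.

```python
# Lloyd's algorithm for k-means on 1-D points: repeatedly assign each
# point to its nearest centroid, then move each centroid to the mean
# of its assigned points.

def kmeans_1d(points, centroids, iterations=10):
    centroids = list(centroids)
    for _ in range(iterations):
        # Assignment step: nearest centroid for each point.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(points, centroids=[0.0, 5.0]))  # [1.5, 10.5]
```

This is unsupervised learning in miniature: no labels are given, and the two centroids settle at the centers of the two obvious groups in the data.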
Module 13: Spark GraphX
- 13.1 Introduction to Spark GraphX
- 13.2 Introduction to Graph
- 13.3 GraphX in Spark
- 13.4 Graph Operators
- 13.5 Join Operators
- 13.6 Graph Parallel System
- 13.7 Algorithms in Spark
- 13.8 Pregel API
- 13.9 Use Case of GraphX
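The Pregel model from topic 13.8 is "think like a vertex": in each superstep every vertex exchanges messages with its neighbours and updates its own state, and the computation stops when nothing changes. A plain-Python sketch, with an invented graph, where each vertex learns the maximum value in its connected component:

```python
# A sketch of Pregel-style vertex-centric computation: supersteps of
# message passing until no vertex state changes. GraphX's Pregel API
# runs the same pattern distributed over a partitioned graph.

def pregel_max(vertices, edges):
    """vertices: {id: value}; edges: (src, dst) pairs, treated as undirected.
    Returns each vertex's final value: the max in its connected component."""
    values = dict(vertices)
    neighbours = {v: set() for v in vertices}
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    changed = True
    while changed:  # one superstep per iteration
        changed = False
        # Each vertex sends its current value to its neighbours...
        inbox = {v: [values[n] for n in neighbours[v]] for v in values}
        # ...and keeps the largest value it has seen so far.
        for v, messages in inbox.items():
            best = max([values[v]] + messages)
            if best != values[v]:
                values[v] = best
                changed = True
    return values

vertices = {1: 3, 2: 6, 3: 2, 4: 1}
edges = [(1, 2), (2, 3), (3, 4)]
print(pregel_max(vertices, edges))  # {1: 6, 2: 6, 3: 6, 4: 6}
```

Algorithms like PageRank and connected components (topic 13.7) follow this same superstep pattern, differing only in the message contents and the vertex update rule.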