Prerequisites
To apply for the Big Data Hadoop Training, you should meet the following prerequisites:
- To learn big data analytics tools, you need to know at least one programming language such as Java, Python, or R.
- You must also have basic knowledge of databases and SQL to retrieve and manipulate data.
- You need knowledge of basic statistics, such as regression and distributions, and mathematical skills such as linear algebra and calculus.
Course Curriculum
Hadoop installation and setup
Topics:
- Introduction to Hadoop
- Hadoop Architecture overview
- Overview of high availability and federation
- Different shell commands available in Hadoop
- Procedure to set up a production cluster
- Overview of configuration files in Hadoop
- Single node cluster installation
- Understanding Spark, Flume, Pig, Scala, and Sqoop
Learning outcome: Upon the completion of this module, you will gain hands-on experience in Hadoop installation, shell commands, cluster installation, etc.
Overview of Big Data Hadoop and Introduction to MapReduce and HDFS
Topics:
- Overview of Big data Hadoop
- Big data and the role of Hadoop
- Components of Hadoop ecosystem
- Distributed file system replication
- Secondary NameNode, block size, and high availability
- YARN: NodeManager and ResourceManager
Learning Outcome: Upon the completion of this chapter, you will understand the data replication process, how HDFS works, and how block size is decided, and gain knowledge of the DataNode and NameNode.
Detailed explanation of MapReduce
Topics:
- Introduction to MapReduce
- Learning the working procedure of MapReduce
- Understanding the Map and Reduce concepts
- Stages in MapReduce
- The terminology used in MapReduce, such as Shuffle, Sort, Combiners, Partitioners, Input Format, and Output Format
Learning Outcome: Upon the completion of this chapter, you will learn the procedure to write a word count program, gain knowledge of the MapReduce Combiner, write a custom partitioner, deploy unit tests, use a local job runner and a tool runner, join data sets, etc.
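The word count program mentioned above is the canonical MapReduce example. As a sketch of the idea without a Hadoop cluster, the three phases can be mimicked in plain Python (the sample lines are the classic illustrative input, not real course data):

```python
from collections import defaultdict

def map_phase(line):
    # Mapper: emit a (word, 1) pair for every word in the input line
    return [(word.lower(), 1) for word in line.split()]

def shuffle(mapped):
    # Shuffle/sort: group all emitted values by key,
    # as the framework does between the Map and Reduce phases
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reducer: sum the counts for each word
    return {word: sum(values) for word, values in groups.items()}

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'deer': 2, 'bear': 2, 'river': 2, 'car': 3}
```

In real Hadoop the same logic is split across a Mapper class, the framework's shuffle, and a Reducer class; a Combiner would apply the reduce function locally on each mapper's output before the shuffle.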
Introduction to Hive
Topics:
- Overview of Hadoop Hive
- Understanding the architecture of Hive
- Comparison between Hive, RDBMS, and Pig
- Creation of database
- Working with Hive Query Language
- Different Hive tables
- Group By and other clauses
- Storing Hive results
- HCatalog and Hive tables
- Hive partitioning and buckets
Learning outcome: By the completion of this module, you will learn the process to create a database in Hive, create Hive tables, drop a database, customize a table, write Hive queries to pull data, and use Hive table partitioning and the Group By clause.
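HiveQL is deliberately close to standard SQL, so a Group By query can be sketched with Python's built-in sqlite3 module standing in for Hive (the `sales` table and its rows are made-up example data; the GROUP BY syntax shown is the same you would write in HiveQL):

```python
import sqlite3

# An in-memory SQLite database stands in for a Hive warehouse here;
# Hive would store the table in HDFS and run the query as a distributed job.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 100), ("west", 200), ("east", 50)])

# Equivalent HiveQL: SELECT region, SUM(amount) FROM sales GROUP BY region;
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 150), ('west', 200)]
```

The difference in Hive is where the work happens, not how the query is written: the same statement is compiled into distributed jobs over data stored in HDFS.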
Advanced Hive and Impala
Topics:
- Indexes in Hive
- Hive Map side join
- User-defined functions in Hive
- Working with complex data types
- Overview of Impala
- Difference between Impala and Hive
- Architecture of Impala
Learning Outcome: This chapter will give you complete knowledge of Hive queries, joining tables, deploying sequence-file tables, writing indexes, and storing data in different tables.
Introduction to Pig
Topics:
- Introduction to Apache Pig
- Pig features
- Schema and various data types in Pig
- Tuples and Fields
- Available functions in Pig, and Pig bags
Learning outcome: By the completion of this chapter you will gain the knowledge to work with Pig: loading data, storing data into files, limiting output to 4 rows, and working with Filter By, Group By, Split, Distinct, and Cross in Pig.
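A Pig relation is a bag of tuples, and the operators named above are relational transformations over it. As an illustration of what FILTER, DISTINCT, LIMIT, and GROUP do (the `students` tuples are invented sample data; real Pig Latin scripts run these operators as MapReduce jobs):

```python
from itertools import islice

# A list of tuples stands in for a Pig relation (a bag of tuples)
students = [("joe", 21), ("amy", 22), ("joe", 21), ("sam", 24), ("kim", 22)]

# FILTER students BY age > 21;
adults = [t for t in students if t[1] > 21]

# DISTINCT students;
distinct = sorted(set(students))

# LIMIT students 4;   -- restricting output to 4 rows
first_four = list(islice(students, 4))

# GROUP students BY age;  -- each group keyed by age
grouped = {}
for name, age in students:
    grouped.setdefault(age, []).append(name)

print(adults)  # [('amy', 22), ('sam', 24), ('kim', 22)]
```

Each Pig operator maps onto one of these set-style transformations, which is why Pig scripts tend to read as short pipelines rather than full programs.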
Sqoop and Flume
Topics:
- Introduction to Apache Sqoop
- Importing and exporting data
- Sqoop Limitations
- Performance improvement with Sqoop
- Flume overview
- Flume Architecture
- The CAP theorem and an introduction to HBase
Learning Outcome: Upon the completion of this module you will be able to generate sequence numbers, consume Twitter data using Flume, create Hive tables with AVRO, create tables in HBase, use AVRO with Pig, and scan, enable, and disable HBase tables.
Writing Spark applications using Scala
Topics:
- Introduction to Spark
- Procedure to write Spark applications with Scala
- Overview of object-oriented programming
- A detailed study of Scala
- Scala Uses
- Executing Scala code
- Scala class constructs such as getters, setters, constructors, abstract classes, extending objects, and overriding methods
- Scala and Java interoperability
- Bobsrockets package
- Anonymous functions and functional programming
- Comparison between mutable and immutable collections
- Control structures in Scala
- Scala REPL and lazy values
- Directed Acyclic Graph (DAG)
- Spark in Hadoop ecosystem and Spark UI
- Developing Spark application using SBT/Eclipse
Learning Outcome: Upon the completion of this module you will gain the knowledge to write Spark applications using Scala and understand how Scala suits Spark's real-time analytics operations.
Spark framework
Topics:
- Introduction to Apache Spark
- Features of Spark
- Spark components
- Comparison between Spark and Hadoop
- Introduction to Scala and RDDs
- Integrating HDFS with Spark
Learning Outcome: Upon the completion of this chapter, you will learn the importance of RDD in Spark and how it makes big data processes faster.
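A key reason RDDs make big data processing faster is lazy evaluation: transformations only describe the computation, and nothing runs until an action requests a result. The same idea can be sketched with Python generators (the function names mirror the RDD API loosely and are illustrative, not Spark's actual signatures):

```python
# A lazy pipeline of generators: like RDD transformations, nothing executes
# until an action (here, sum) pulls data through the whole chain.
def parallelize(data):
    return iter(data)

def rdd_map(rdd, f):
    # Transformation: returns a new lazy recipe, does no work yet
    return (f(x) for x in rdd)

def rdd_filter(rdd, pred):
    return (x for x in rdd if pred(x))

numbers = parallelize(range(1, 11))
squares = rdd_map(numbers, lambda x: x * x)           # still lazy
even_squares = rdd_filter(squares, lambda x: x % 2 == 0)  # still lazy

total = sum(even_squares)   # the "action": evaluation happens only here
print(total)  # 4 + 16 + 36 + 64 + 100 = 220
```

In Spark the deferred recipe is the DAG of transformations, which the scheduler can optimize and distribute across the cluster before any data moves.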
Data Frames and Spark SQL
Topics:
- Introduction to Spark SQL
- Importance of SQL in Spark
- Spark SQL JSON support
- Structured data processing
- Working with parquet files and XML data
- Procedure to read a JDBC table
- Writing a DataFrame to Hive
- Hive context creation
- Role of Spark Dataframe
- Overview of manual schema inference
- JDBC table reading
- Working with CSV files
- Data transformation from DataFrame to JDBC
- Shared variables and accumulators
- User-defined functions in Spark SQL
- Query and Transform data in data frames
- Configuration of Hive on Spark as an execution engine
- Dataframe benefits
Learning Outcome: After finishing this chapter you will gain knowledge to use data frames to query and transform data and get an overview of advantages that arise out of using data frames.
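Spark SQL's JSON support reads line-delimited JSON into a DataFrame whose rows can then be queried and transformed. The shape of that workflow can be sketched with the standard json module (the people and ages below are invented sample records, and the dict comprehension stands in for the DataFrame `select`/`where` calls):

```python
import json

# Line-delimited JSON, the layout spark.read.json() consumes (sample data)
raw = '''{"name": "alice", "age": 34}
{"name": "bob", "age": 28}
{"name": "carol", "age": 41}'''

# Parse each line into a row; Spark would also infer a schema at this step
rows = [json.loads(line) for line in raw.splitlines()]

# Roughly: df.where(df.age > 30).select("name") expressed over plain dicts
names = [r["name"] for r in rows if r["age"] > 30]
print(names)  # ['alice', 'carol']
```

The payoff of the real DataFrame API is that the same filter-and-project logic is planned and optimized by Spark's query engine instead of executing row by row in the driver.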
Machine Learning Using Spark (MLlib)
Topics:
- Overview of Spark MLlib
- Introduction to different algorithms
- Graph processing analysis in Spark
- Understanding iterative algorithms in Spark
- ML algorithms supported by MLlib
- Introduction to machine learning
- Introduction to accumulators
- Overview of Decision Trees, Logistic Regression, and Linear Regression
- Building a recommendation engine
- K-means clustering techniques
Learning Outcome: Upon the completion of this module you will gain hands-on experience in building a recommendation engine.
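K-means, one of the clustering techniques listed above, alternates between assigning points to their nearest centroid and moving each centroid to its cluster's mean. A minimal from-scratch sketch in one dimension (a stand-in for MLlib's distributed KMeans; the points and starting centroids are made-up toy data):

```python
def kmeans_1d(points, centroids, iterations=10):
    """Plain 1-D k-means: repeat assignment and update steps."""
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid's cluster
        clusters = {c: [] for c in centroids}
        for p in points:
            nearest = min(centroids, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        centroids = [sum(members) / len(members) if members else c
                     for c, members in clusters.items()]
    return sorted(centroids)

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
print(kmeans_1d(points, centroids=[1.0, 10.0]))  # [1.5, 10.5]
```

MLlib runs the same two steps over partitioned data, which is why k-means is a natural fit for Spark's strength with iterative algorithms.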
Integration of Apache Kafka and Apache Flume
Topics:
- Introduction to Kafka
- Use of Kafka
- Kafka workflow
- Kafka architecture
- Basic operations
- Configuring a Kafka cluster
- Integration of Apache Kafka and Apache Flume
- Producing and consuming messages
- Kafka monitoring tools
Learning Outcome: Upon the completion of this module, you will gain hands-on exposure in the configuration of Single Node Multi Broker Cluster, Single Node Single Broker Cluster, and integration of Apache Flume and Kafka.
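The producer/consumer workflow at Kafka's core can be illustrated with a bounded in-memory queue (a deliberately simplified stand-in: a real Kafka topic adds durable storage, partitions, consumer offsets, and replication across brokers; the message strings are invented):

```python
import queue

# A bounded in-memory queue stands in for one Kafka topic partition
topic = queue.Queue(maxsize=100)

def produce(messages):
    # Producer: append messages to the end of the topic, in order
    for msg in messages:
        topic.put(msg)

def consume(n):
    # Consumer: read the next n messages in the order they were produced
    return [topic.get() for _ in range(n)]

produce(["event-1", "event-2", "event-3"])
received = consume(3)
print(received)  # ['event-1', 'event-2', 'event-3']
```

In the Flume integration covered by this module, Flume plays the producer role, shipping collected events into a Kafka topic for downstream consumers.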
Spark Streaming
Topics:
- Introduction to Spark Streaming
- Working with Spark streaming
- Spark Streaming Architecture
- Data processing using Spark streaming
- Requesting count and DStream
- Features of Spark Streaming
- Working with advanced data sources
- Sliding window and multi-batch operations
- Discretized Streams (DStreams)
- Spark Streaming workflow
- Output operations on DStreams
- Windowed operators and their use
- Stateful operators
Learning Outcome: After finishing this module you will learn to execute Twitter sentiment analysis, Kafka-Spark Streaming, streaming using a Netcat server, and Spark-Flume Streaming.
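The sliding-window idea behind DStream windowed operators can be sketched with a fixed-length deque: each incoming batch enters the window, the oldest batch falls out, and an aggregate is computed over whatever the window currently holds (the window length of 3 batches and the letter batches are illustrative, not Spark defaults):

```python
from collections import deque

def windowed_counts(batches, window=3):
    """Count records across a sliding window of the last `window` batches."""
    recent = deque(maxlen=window)   # deque drops the oldest batch automatically
    results = []
    for batch in batches:
        recent.append(batch)
        # Aggregate over every batch currently inside the window
        results.append(sum(len(b) for b in recent))
    return results

batches = [["a", "b"], ["c"], ["d", "e", "f"], ["g"]]
print(windowed_counts(batches))  # [2, 3, 6, 5]
```

In Spark Streaming the equivalent is a windowed operator parameterized by window length and slide interval, both multiples of the batch interval.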
Hadoop Administration: Configuration of a cluster
Topics:
- Introduction to Hadoop configuration
- Various parameters to be followed in the configuration process
- Importance of Hadoop configuration file
- Hadoop environment setup
- MapReduce parameters
- HDFS parameters
- The process to include and exclude data nodes
- Data node directory structures
- Overview of the file system image
- Understanding the edit log
Learning Outcome: In this chapter, you will gain hands-on exposure in executing performance tuning in MapReduce.
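The configuration files discussed above are plain XML property lists. A minimal sketch of two commonly edited ones (the hostname and values are illustrative examples, not recommended settings):

```xml
<!-- core-site.xml: tells clients where to find the NameNode -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://master-node:9000</value>
  </property>
</configuration>

<!-- hdfs-site.xml: HDFS parameters such as the replication factor -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
```

Every Hadoop configuration file follows this same `<property>` name/value layout, which is why tuning MapReduce and HDFS parameters is largely a matter of knowing which property names to set.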
Hadoop Administration: Using an Amazon EC2 instance to set up a multi-node cluster
Topics:
- Setting up a 4-node cluster
- Running MapReduce code
- Running MapReduce jobs
- Working with cloud manager setup
Learning Outcome: By the completion of this chapter you will gain hands-on expertise in building a multi-node Hadoop cluster and working knowledge of cloud managers.
Hadoop Administration: Management, Monitoring and Troubleshooting
Topics:
- Basics of checkpoint procedure
- NameNode failure
- Procedure to recover a failed node
- Metadata and data backup
- Safe mode
- Various problems and their solutions
- Adding and removing nodes
Learning Outcome: Upon the completion of this chapter, you will learn the process to recover the MapReduce file system, Hadoop cluster monitoring, the usage of a job scheduler to schedule jobs, the Fair Scheduler and its configuration, the FIFO scheduler, and the MapReduce job submission flow.