Prerequisites
To apply for the PySpark Training Certification, you should meet the following prerequisites:
- Basic knowledge of Big Data.
- Basic Python programming skills are beneficial.
- Basic Data Analytics skills are an added advantage.
Course Curriculum
Module 1: Python
- Environment Setup
- Decision Making
- Loops and Numbers
- Strings
- Lists
- Tuples
- Dictionary
- Date and Time
- Regex
- Functions
- OOP (Object-Oriented Programming)
- Files I/O
- Exceptions
- Sets
- Lambda
- Map and filter
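A few of these Python topics come together in one minimal sketch, using invented sample data, since lambda, map, and filter reappear constantly in PySpark code later in the course:

```python
# Lambda, map, and filter on a plain Python list.
nums = [1, 2, 3, 4, 5]

squares = list(map(lambda n: n * n, nums))        # lambda + map
evens = list(filter(lambda n: n % 2 == 0, nums))  # lambda + filter

print(squares)  # [1, 4, 9, 16, 25]
print(evens)    # [2, 4]
```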
Module 2: Hadoop Distributed File System (HDFS)
- What is HDFS?
- How is data stored in HDFS?
- What is a block?
- Replication factor in HDFS
- Commands in HDFS (see the sketch after this module)
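The HDFS commands covered here can also be driven from Python; a minimal sketch, assuming a cluster where the `hdfs` CLI is on the PATH (the paths and file names below are hypothetical):

```python
import subprocess

# List the contents of an HDFS directory (hypothetical path).
subprocess.run(["hdfs", "dfs", "-ls", "/user/demo"], check=True)

# Copy a local file into HDFS; HDFS splits it into blocks
# (128 MB by default) and replicates each block (3x by default).
subprocess.run(["hdfs", "dfs", "-put", "data.csv", "/user/demo/"], check=True)

# Print the file back from HDFS.
subprocess.run(["hdfs", "dfs", "-cat", "/user/demo/data.csv"], check=True)
```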
Module 3: PySpark
- What is the Hadoop platform?
- Why the Hadoop platform?
- What is Spark?
- Why Spark?
- Evolution of Spark
- Hadoop vs Spark (Spark benefits)
- Architecture of Spark
- Spark components
- Lazy evaluation
- spark-shell and spark-submit
- Setting up memory (driver memory, executor memory)
- Setting up cores (executor cores)
- Running Spark in local mode
- Hadoop MapReduce vs Spark RDD
- Benefits of RDDs over Hadoop MapReduce
- RDD overview; transformations and actions in the context of RDDs
- Demonstration of each RDD API with real-time examples (e.g., cache, unpersist, count, filter, map); see the RDD sketch after this module
- Magic with DataFrames
- Overview of DataFrames
- Reading CSV/Excel files and creating a DataFrame
- Cache/uncache operations on DataFrames
- Persist/unpersist operations on DataFrames
- Partition and repartition concepts for DataFrames
- foreachPartition on DataFrames
- Programming with DataFrames; how to use the DataFrame APIs effectively
- A small project: a Spark job using DataFrame concepts (see the DataFrame sketch after this module)
- Defining a schema for a DataFrame; performing SQL operations on a DataFrame
- Checkpointing a DataFrame
- StructType and ArrayType in DataFrames
- Complex data structures in DataFrames
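A minimal sketch of the RDD APIs named above (cache, count, filter, map), using a locally created RDD; the data is invented for illustration:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-demo").getOrCreate()
sc = spark.sparkContext

# Transformations (filter, map) are lazy; nothing runs yet.
numbers = sc.parallelize(range(1, 101))
evens = numbers.filter(lambda n: n % 2 == 0)
squares = evens.map(lambda n: n * n)

# cache() keeps the RDD in memory after the first action computes it.
squares.cache()

# Actions trigger execution.
print(squares.count())   # 50
print(squares.take(5))   # [4, 16, 36, 64, 100]

# unpersist() releases the cached partitions.
squares.unpersist()
```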
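And a DataFrame counterpart covering the read, cache, repartition, schema, and SQL topics from this module; the file name and column schema are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

# Define an explicit schema instead of inferring it (hypothetical columns).
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

# Read a CSV file into a DataFrame (hypothetical path).
df = spark.read.csv("people.csv", header=True, schema=schema)

df = df.repartition(4)  # change the number of partitions
df.cache()              # keep in memory across actions

# Run SQL against the DataFrame via a temporary view.
df.createOrReplaceTempView("people")
adults = spark.sql("SELECT name FROM people WHERE age >= 18")
adults.show()

df.unpersist()
```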
Module 4: Various data sources
- CSV files
- Excel files
- JSON files
- Parquet files
- Benefits of Parquet files
- Text files
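A minimal sketch of reading and writing these formats with the built-in DataFrame readers (file paths are hypothetical; Excel needs a third-party connector such as spark-excel, so it is omitted here):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sources").getOrCreate()

# CSV, JSON, and text are built-in sources (hypothetical paths).
csv_df = spark.read.csv("input.csv", header=True, inferSchema=True)
json_df = spark.read.json("input.json")
text_df = spark.read.text("input.txt")

# Parquet is columnar and stores the schema with the data,
# which is why it reads faster and compresses better than CSV.
csv_df.write.mode("overwrite").parquet("output.parquet")
parquet_df = spark.read.parquet("output.parquet")
parquet_df.printSchema()
```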
Module 5: Various levels of persistence
- MEMORY_ONLY
- MEMORY_ONLY_SER
- MEMORY_AND_DISK
- MEMORY_AND_DISK_SER
- DISK_ONLY
- OFF_HEAP
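These storage levels are selected through `persist()`; a minimal sketch using a generated DataFrame:

```python
from pyspark import StorageLevel
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("persist").getOrCreate()
df = spark.range(1_000_000)

# MEMORY_AND_DISK: keep partitions in memory, spill to disk when full.
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count()              # first action materializes the cache

print(df.storageLevel)  # shows the active storage level
df.unpersist()
```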
Module 6: User-Defined Functions (UDFs)
- Benefits of UDFs over SQL
- Writing UDFs and applying them to a DataFrame
- Complex UDFs
- Data cleaning using UDFs (see the sketch after this module)
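A minimal sketch of writing a UDF and using it for light data cleaning; the column names and sample data are invented for illustration:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

spark = SparkSession.builder.master("local[*]").appName("udf-demo").getOrCreate()

df = spark.createDataFrame(
    [(" Alice ",), ("BOB",), (None,)], ["raw_name"]
)

# A Python function wrapped as a UDF; runs row by row on the executors.
@udf(returnType=StringType())
def clean_name(name):
    return name.strip().title() if name else "unknown"

cleaned = df.withColumn("name", clean_name(col("raw_name")))
cleaned.show()
```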
Module 7: Connecting Spark With S3
- Connecting Spark with S3
- Reading a file from S3 and performing transformations
- Writing a file to S3
- Preparing and closing the connection while writing a file to S3 (see the sketch after this module)
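A minimal sketch of the S3 round trip, assuming the hadoop-aws connector is available (the connector version, bucket names, column name, and credentials below are all placeholders):

```python
from pyspark.sql import SparkSession

# The hadoop-aws package provides the s3a:// filesystem; the version
# must match your Hadoop build (3.3.4 here is an assumption).
spark = (
    SparkSession.builder.appName("s3-demo")
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:3.3.4")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")  # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")  # placeholder
    .getOrCreate()
)

# Read from a hypothetical bucket, transform, write back as Parquet.
df = spark.read.csv("s3a://my-bucket/input/data.csv", header=True)
df = df.filter(df["status"] == "active")  # hypothetical column
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```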
Module 8: PostgreSQL
- Overview of PostgreSQL
- How to connect Spark with PostgreSQL
- Collection concepts in PostgreSQL
- Performing operations in Spark
- Writing data from Spark to PostgreSQL
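A minimal sketch of reading a PostgreSQL table over JDBC; the host, database, table, credentials, and driver version are all hypothetical, and the PostgreSQL JDBC driver must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("pg-demo")
    # Pull the PostgreSQL JDBC driver (version is an assumption).
    .config("spark.jars.packages", "org.postgresql:postgresql:42.7.3")
    .getOrCreate()
)

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/mydb")  # hypothetical
    .option("dbtable", "public.customers")                   # hypothetical
    .option("user", "postgres")
    .option("password", "secret")                            # placeholder
    .option("driver", "org.postgresql.Driver")
    .load()
)
df.show(5)
```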
Module 9: MySQL Database
- Overview of the MySQL database and its benefits
- Partition key and collection concepts in MySQL
- Connecting MySQL with Spark
- Reading a table from MySQL and performing transformations
- Writing millions of rows to a MySQL table (see the sketch after this module)
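A minimal sketch of the MySQL write path, with the option that matters most when the table holds millions of rows; the connection details and driver version are hypothetical, and MySQL Connector/J must be on the classpath:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("mysql-demo")
    .config("spark.jars.packages", "com.mysql:mysql-connector-j:8.4.0")  # assumption
    .getOrCreate()
)

df = spark.range(5_000_000).withColumnRenamed("id", "order_id")

(
    df.write.format("jdbc")
    .option("url", "jdbc:mysql://localhost:3306/shop")  # hypothetical
    .option("dbtable", "orders")                        # hypothetical
    .option("user", "root")
    .option("password", "secret")                       # placeholder
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("batchsize", 10000)  # rows per JDBC batch insert
    .mode("append")
    .save()
)
```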
Module 10: Spark SQL
- Overview of Spark SQL.
- How to write SQL in Spark.
- Various types of clauses in Spark SQL
- Using UDFs inside Spark SQL (see the sketch after this module)
- SQL fine-tuning in Spark
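A minimal sketch of registering a UDF for use inside Spark SQL; the table and column names are invented:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master("local[*]").appName("sql-demo").getOrCreate()

df = spark.createDataFrame([("alice", 34), ("bob", 17)], ["name", "age"])
df.createOrReplaceTempView("people")

# Register a Python function so SQL statements can call it by name.
spark.udf.register("age_bucket", lambda age: age // 10, IntegerType())

spark.sql("""
    SELECT name, age_bucket(age) AS bucket
    FROM people
    WHERE age IS NOT NULL
    ORDER BY bucket
""").show()
```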
Module 11: Data cleaning
- What are the column data types?
- How many fields match the data type?
- How many fields are mismatched?
- Which fields match?
- Which fields are mismatched?
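One way to answer these questions is to cast each field and count where the cast fails; a minimal sketch with invented data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.master("local[*]").appName("clean-demo").getOrCreate()

# "age" arrives as strings; some values do not match the intended int type.
df = spark.createDataFrame(
    [("alice", "34"), ("bob", "n/a"), ("carol", "29")], ["name", "age"]
)

# cast() returns NULL when a value cannot be converted, which lets us
# count matches vs. mismatches per field.
casted = df.withColumn("age_int", col("age").cast("int"))
matches = casted.filter(col("age_int").isNotNull()).count()
mismatches = casted.filter(col("age").isNotNull() & col("age_int").isNull())

print(f"matching fields: {matches}")  # 2
mismatches.show()                     # the row with "n/a"
```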
Module 12: PySpark Hive Connectivity
- Reading a Hive table with PySpark
- Writing a Hive table with PySpark
- Checkpointing with PySpark and Hive
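A minimal sketch of Hive connectivity, assuming Spark was built with Hive support and can reach a Hive metastore; the database and table names are hypothetical:

```python
from pyspark.sql import SparkSession

# enableHiveSupport() connects the session to the Hive metastore.
spark = (
    SparkSession.builder.appName("hive-demo")
    .enableHiveSupport()
    .getOrCreate()
)

# Read an existing Hive table (hypothetical name).
sales = spark.table("analytics.sales")

# Transform and write back as a managed Hive table.
summary = sales.groupBy("region").count()
summary.write.mode("overwrite").saveAsTable("analytics.sales_by_region")
```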
Module 13: PySpark Broadcast and Accumulator
- PySpark broadcast
- PySpark accumulator
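A minimal sketch of both shared-variable types; the lookup data is invented:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("shared").getOrCreate()
sc = spark.sparkContext

# Broadcast: ship a read-only lookup table to every executor once.
country_codes = sc.broadcast({"US": "United States", "IN": "India"})

# Accumulator: a write-only counter the executors add to.
unknown = sc.accumulator(0)

def expand(code):
    if code not in country_codes.value:
        unknown.add(1)
        return "unknown"
    return country_codes.value[code]

rdd = sc.parallelize(["US", "IN", "FR"])
print(rdd.map(expand).collect())  # ['United States', 'India', 'unknown']
print(unknown.value)              # 1 (reliable only after an action runs)
```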
Module 14: PySpark ArrayType Columns and Operations
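A minimal sketch of common ArrayType column operations (size, array_contains, explode), with invented data:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, size, array_contains, col

spark = SparkSession.builder.master("local[*]").appName("arrays").getOrCreate()

df = spark.createDataFrame(
    [("alice", ["python", "sql"]), ("bob", ["scala"])], ["name", "skills"]
)

df.select("name", size("skills").alias("n_skills")).show()
df.filter(array_contains(col("skills"), "python")).show()
df.select("name", explode("skills").alias("skill")).show()  # one row per element
```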
Module 15: PySpark Storage Levels
Module 16: PySpark MLlib Library
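A minimal sketch of an MLlib (pyspark.ml) pipeline: assemble features into a vector and fit a logistic regression on invented training data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.master("local[*]").appName("ml-demo").getOrCreate()

# Invented training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(1.0, 0.5, 1.0), (0.2, 0.1, 0.0), (0.9, 0.8, 1.0), (0.1, 0.3, 0.0)],
    ["f1", "f2", "label"],
)

# MLlib models expect a single vector column of features.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
features = assembler.transform(train)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(features)
model.transform(features).select("label", "prediction").show()
```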
Module 17: PySpark Structured Streaming
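A minimal sketch of Structured Streaming using the built-in rate source, which generates rows locally so no external system is needed:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("stream").getOrCreate()

# The rate source emits (timestamp, value) rows at a fixed speed.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

evens = stream.filter(stream["value"] % 2 == 0)

# Print each micro-batch to the console; stop after a short run.
query = evens.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)  # run for ~10 seconds, then return
query.stop()
```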
Module 18: Conclusion
- Summarize all the points discussed.