Training Information

Hadoop with PySpark, Linux

We are pleased to offer a comprehensive suite of training solutions tailored to meet your needs. Our services encompass both online and offline corporate training options, ensuring flexibility and accessibility for your team's professional development.

Course Content

Syllabus:

BIG DATA HADOOP

I: INTRODUCTION

What is Big Data?

What is Hadoop?

Need of Hadoop

Sources and Types of Data

Comparison with Other Technologies

Challenges with Big Data

i. Storage

ii. Processing

RDBMS vs Hadoop

Advantages of Hadoop

Hadoop Ecosystem Components

II: HDFS (Hadoop Distributed File System)

Features of HDFS

Name Node, Data Node, Blocks

Configuring Block Size

Hadoop 1.x Architecture (5 Daemons)

i. Name Node

ii. Data Node

iii. Secondary Name Node

iv. Job Tracker

v. Task Tracker

Metadata management

Storage and processing

Replication in Hadoop

Configuring Custom Replication

Fault Tolerance in Hadoop

HDFS Commands
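
The block size and replication topics in this module come down to simple arithmetic, sketched below in plain Python. This is an illustration rather than Hadoop code: the helper name `hdfs_footprint` and the 500 MB file are assumptions, and the defaults used (128 MB blocks, replication factor 3) are the usual Hadoop 2.x defaults, which a cluster may override.

```python
import math

def hdfs_footprint(file_size_mb, block_size_mb=128, replication=3):
    """Return (number of blocks, raw storage in MB across all replicas).

    Hypothetical helper for illustration only. HDFS does not pad the
    last block, so raw storage is file size times replication factor.
    """
    blocks = math.ceil(file_size_mb / block_size_mb)
    raw_storage_mb = file_size_mb * replication
    return blocks, raw_storage_mb

# A 500 MB file with the assumed defaults.
default_blocks, default_raw_mb = hdfs_footprint(500)

# The same file with a custom 64 MB block size and replication factor 2.
custom_blocks, custom_raw_mb = hdfs_footprint(500, block_size_mb=64, replication=2)

print(default_blocks, default_raw_mb)
print(custom_blocks, custom_raw_mb)
```

Smaller blocks mean more block entries for the Name Node to track, while a lower replication factor reduces raw storage at the cost of fault tolerance.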

III: MAP REDUCE

Map Reduce Architecture

Processing Daemons of Hadoop

Job Tracker (Roles and Responsibilities)

Task Tracker (Roles and Responsibilities)

Phases of Map Reduce

i) Mapper phase

ii) Reducer phase

Input split

Input split vs Block size

Partitioner in Map Reduce

Groupings and Aggregations

Data Types in Map Reduce

Map Reduce Programming Model

Driver Code

Mapper Code

Reducer Code

Programming examples

File input formats

File output formats

Merging in Map Reduce

Speculative Execution Model

Speculative Job
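
The mapper and reducer phases listed in this module can be pictured with a minimal, single-process sketch in plain Python. It is not Hadoop code and does not use the Hadoop API: the function names `mapper`, `shuffle`, `reducer` and `word_count`, and the sample lines, are assumptions chosen only to show how data flows between the phases.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle and sort: group all values that share a key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reducer(key, values):
    """Reduce phase: aggregate the grouped values for a single key."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in sequence over an in-memory list of lines."""
    mapped = [pair for line in lines for pair in mapper(line)]
    return dict(reducer(key, values) for key, values in shuffle(mapped).items())

# Hypothetical input, standing in for the lines of an input split.
sample_lines = ["big data hadoop", "hadoop map reduce", "big data"]
counts = word_count(sample_lines)
print(counts)
```

In a real job the framework performs the shuffle between nodes, and the driver code wires the mapper and reducer classes together.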

IV: SQOOP (SQL + HADOOP)

Introduction to Sqoop

SQOOP Import

SQOOP Export

Importing Data From RDBMS to HDFS

Importing Data From RDBMS to HIVE

Importing Data From RDBMS to HBASE

Exporting From HBASE to RDBMS

Exporting From HIVE to RDBMS

Exporting From HDFS to RDBMS

Transformations While Importing / Exporting

Filtering data while importing

Vertical and Horizontal merging while import

Working with delimiters while importing

Groupings and Aggregations while import

Incremental import

Examples and operations

Defining SQOOP Jobs

V: YARN

Introduction

Speculative Execution, Speculative Job and Speculative Task

Comparison of Hadoop 1.x with Hadoop 2.x

Comparison with Previous Versions

YARN Architecture Components

i. Resource Manager

ii. Application Master

iii. Node Manager

iv. Application Manager

v. Resource Scheduler

vi. Job History Server

vii. Container

VI: NOSQL

What is “Not only SQL”?

NOSQL Advantages

What Is the Problem with RDBMS for Large Data Scaling Systems?

Types of NOSQL & Purposes

Key Value Store

Columnar Store

Document Store

Graph Store

Introduction to Cassandra – NOSQL Database

Introduction to MongoDB and CouchDB Databases

Integration of NOSQL Databases with Hadoop

VII: HBASE

Introduction to Bigtable

What is NOSQL and Columnar Store Database

HBASE Introduction

HBase use cases

HBase basics

Column families

Scans

HBase Architecture

Map Reduce Over HBase

HBase data Modeling

HBase Schema design

HBase CRUD operations

Hive & HBase Integration

HBase storage handlers

VIII: HIVE

Introduction

Hive Architecture

Hive Metastore

Hive Query Language

Difference between HQL and SQL

Hive Built in Functions

Loading Data From Local Files To Hive Tables

Loading Data From HDFS Files To Hive Tables

Table Types

Internal (Managed) Tables

External Tables

Hive Working With Unstructured Data

Hive Working With XML Data

Hive Working With JSON Data

Hive Working With URLs And Weblog Data

Hive Unions

Hive Joins

Multi Table / File Inserts

Inserting Into Local Files

Inserting Into HDFS Files

Hive UDF (User-Defined Functions)

Hive UDAF (User-Defined Aggregate Functions)

Hive UDTF (User-Defined Table-Generating Functions)

Partitioned Tables

Non-Partitioned Tables

Multi-column Partitioning

Dynamic Partitions In Hive

Performance Tuning mechanism

Bucketing in hive

Indexing in Hive

Hive Examples

Hive & HBase Integration

PYSPARK

I ) PYSPARK INTRODUCTION

What is Apache Spark?

Why PySpark?

Need for PySpark

Spark: Python vs Scala

PySpark features

Real-life usage of PySpark

PySpark Web/Application

PySpark - SparkSession

PySpark – SparkContext

PySpark – RDD

PySpark – Parallelize

PySpark – repartition() vs coalesce()

PySpark – Broadcast Variables

PySpark – Accumulator

II) PYSPARK - RDD COMPUTATION

Operations on an RDD

Directed Acyclic Graph (DAG)

RDD Actions and Transformations

RDD computation

Steps in RDD computation

RDD persistence

Persistence features

Persistence Options:

1) MEMORY_ONLY

2) MEMORY_ONLY_SER

3) DISK_ONLY

4) MEMORY_AND_DISK

5) MEMORY_AND_DISK_SER

III) PYSPARK - CORE COMPUTING

Fault Tolerance model in Spark

Different ways of creating an RDD

Word Count Example

Creating Spark objects (RDDs) from Python objects (lists)

Increasing the number of partitions

Aggregations Over Structured Data:

reduceByKey()

IV) GROUPINGS AND AGGREGATIONS

i) Single Grouping and Single Aggregation

ii) Single Grouping and multiple Aggregation

iii) multi Grouping and Single Aggregation

iv) Multi Grouping and Multi Aggregation

Differences between reduceByKey() and groupByKey()

Process of groupByKey

Process of reduceByKey

reduce() function

Various Transformations

Various Built-in Functions
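
The difference between groupByKey() and reduceByKey() covered in this module is mostly about how much data crosses the shuffle. The sketch below simulates that in plain Python, without Spark: the two "partitions", the helper names `group_by_key` and `reduce_by_key`, and the record counts are illustrative assumptions, not measurements from a cluster or calls into the PySpark API.

```python
from collections import defaultdict

# Hypothetical data: two "partitions" of (key, value) records.
partitions = [
    [("a", 1), ("b", 2), ("a", 3)],
    [("a", 4), ("b", 5), ("b", 6)],
]

def group_by_key(parts):
    """Send every record through the 'shuffle', then group values by key."""
    shuffled = [record for part in parts for record in part]
    grouped = defaultdict(list)
    for key, value in shuffled:
        grouped[key].append(value)
    return dict(grouped), len(shuffled)

def reduce_by_key(parts, func):
    """Combine values per key inside each partition first, then merge."""
    shuffled = []
    for part in parts:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        shuffled.extend(local.items())  # one record per key per partition
    merged = {}
    for key, value in shuffled:
        merged[key] = func(merged[key], value) if key in merged else value
    return merged, len(shuffled)

grouped, grouped_shuffle = group_by_key(partitions)
sums_from_groups = {key: sum(values) for key, values in grouped.items()}
sums, reduced_shuffle = reduce_by_key(partitions, lambda x, y: x + y)

print(sums_from_groups, "records shuffled:", grouped_shuffle)
print(sums, "records shuffled:", reduced_shuffle)
```

Both approaches give the same totals, but the reduce-style version moves fewer records because it pre-aggregates within each partition, which is why reduceByKey() is usually preferred for aggregations.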

V) Various Actions and Transformations:

countByKey()

countByValue()

sortByKey()

zip()

union()

distinct()

Various count aggregation

Joins

-inner join

-outer join

cartesian()

cogroup()

Other actions and transformations

VI) PySpark SQL - DataFrame

Introduction

Making data Structured

Case Classes

Ways to extract case class objects

1) using function

2) using map with multiple expressions

3) using map with single expression

SQLContext

Data Frames API

DataSet API

RDD vs DataFrame vs DataSet

PySpark – Create a DataFrame

PySpark – Create an empty DataFrame

PySpark – Convert RDD to DataFrame

PySpark – Convert DataFrame to Pandas

PySpark – show()

PySpark – StructType & StructField

PySpark – Row Class

PySpark – Column Class

PySpark – select()

PySpark – collect()

PySpark – withColumn()

PySpark – withColumnRenamed()

PySpark – where() & filter()

PySpark – drop() & dropDuplicates()

PySpark – orderBy() and sort()

PySpark – groupBy()

PySpark – join()

PySpark – union() & unionAll()

PySpark – unionByName()

PySpark – UDF (User Defined Function)

PySpark – map()

PySpark – flatMap()

PySpark – foreach()

PySpark – sample() vs sampleBy()

PySpark – fillna() & fill()

PySpark – pivot() (Row to Column)

PySpark – partitionBy()

PySpark – ArrayType Column (Array)

PySpark – MapType (Map/Dict)

VII) PySpark SQL Functions

PySpark – Aggregate Functions

PySpark – Window Functions

PySpark – Date and Timestamp Functions

PySpark – JSON Functions

PySpark – Read & Write JSON file

VIII) PySpark Built-In Functions

PySpark – when()

PySpark – expr()

PySpark – lit()

PySpark – split()

PySpark – concat_ws()

PySpark – substring()

PySpark – translate()

PySpark – regexp_replace()

PySpark – overlay()

PySpark – to_timestamp()

PySpark – to_date()

PySpark – date_format()

PySpark – datediff()

PySpark – months_between()

PySpark – explode()

PySpark – array_contains()

PySpark – array()

PySpark – collect_list()

PySpark – collect_set()

PySpark – create_map()

PySpark – map_keys()

PySpark – map_values()

PySpark – struct()

PySpark – countDistinct()

PySpark – sum(), avg()

PySpark – row_number()

PySpark – rank()

PySpark – dense_rank()

PySpark – percent_rank()

PySpark – typedLit()

PySpark – from_json()

PySpark – to_json()

PySpark – json_tuple()

PySpark – get_json_object()

PySpark – schema_of_json()

Working Examples

IX) PySpark External Sources

Working with SQL statements

Spark and Hive Integration

Spark and MySQL Integration

Working with CSV

Working with JSON

Transformations and actions on dataframes

Narrow, wide transformations

Addition of new columns, dropping of columns, renaming columns

Addition of new rows, dropping rows

Handling nulls

Joins

Window function

Writing data back to External sources

Creation of tables from DataFrames (Internal tables, Temporary tables)

X) DEPLOYMENT MODES

Local Mode

Cluster Modes (Standalone, YARN)

XI) PYSPARK APPLICATION

Stages and Tasks

Driver and Executor

Building spark applications/pipelines

Deploying spark apps to cluster and tuning

Performance tuning

PySpark Streaming Concepts

Integration with Kafka

PySpark MLlib

PYTHON

1. Python Basics

What is Python

Why Python?

History of python

Applications of Python

Features of Python

Advantages of Python

Versions of Python

Installation of Python

Flavors of Python

Comparison between C, Java and Python

2. Python Operations

Python Modes of Execution

Interactive mode of Execution

Batch mode of Execution

Python Editors and IDEs

Python Data Types

Python Constants

Python Variables

Comments in python

Output: print() Function

input() Function: Accepting Input

Type Conversion

type() and id() Functions

Escape Sequences in Python

Strings in Python

String indices and slicing

3. Operators in Python

Arithmetic Operators

Comparison Operators

Logical Operators

Assignment Operators

Short Hand Assignment Operators

Bitwise Operators

Membership Operators

Identity Operators

4. Python IDEs

PyCharm IDE Installation

Working with PyCharm

PyCharm components

Installing Anaconda

What is Conda?

Anaconda Prompt

Anaconda Navigator

Jupyter Notebook

Jupyter Features

Spyder IDE

Spyder Features

Conda and PIP

5. Flow Control statements

Block/clause

Indentation in Python

Conditional Statements

if statement

if…else statement

if…elif… statement

6. Looping Statements

while loop

while … else

for loop

range() in for loop

Nested for loop

break statement

continue statement

pass statement

7. Strings in Python

Creating Strings

String indexing

String slicing

String Concatenation

String Comparison

String splitting and joining

Finding Sub Strings

String Case Change

Split strings

String methods

8. Collections in Python

Introduction

Lists

Tuples

Sets

Dictionaries

Operations on collections

Functions for collections

Methods of collection

Nested collections

Differences between List, Tuple, Set and Dictionary

9. Python Lists

List properties

List Creation

List indexing and slicing

List Operations

List addresses

List functions

Different ways of creating lists

Nested Lists

List modification

List insertion and deletion

List Methods

10. Python Tuples

Tuple properties

Tuple Creation

Tuple indexing and slicing

Different ways of creating tuples

Tuple Operations

Tuple Addresses

Tuple Functions

Nested Tuples

Tuple Methods

Differences between List and Tuple

11. Python Sets

Set properties

Set Creation

Set Operations

Set Functions

Set Addresses

Set Mathematical Operations

Set Methods

Insertion and Deletion operation

12. Python Dictionary

Dictionary properties

Dictionary Creation

Dictionary Operations

Dictionary Addresses

Nested Dictionaries

Dictionary Methods

Insertion and Deletion of elements

Differences between List, Tuple, Set and Dictionary

13. Functions in Python

Defining a function

Calling a function

Properties of Function

Examples of Functions

Categories of Functions

Argument types

default arguments

non-default arguments

keyword arguments

non keyword arguments

Variable Length Arguments

Variables scope

Call by value and Call by Reference

Passing collections to function

Local and Global variables

Recursive Function

Boolean Function

Passing functions to function

Anonymous or Lambda function

filter() and map() functions

reduce() function
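
As a small illustration of the lambda, filter(), map() and reduce() topics that close this module, the sketch below chains them over a sample list. The list contents and variable names are assumptions made for the example.

```python
from functools import reduce  # reduce() lives in functools in Python 3

numbers = [1, 2, 3, 4, 5, 6]

# filter(): keep only the items for which the lambda returns True.
evens = list(filter(lambda n: n % 2 == 0, numbers))

# map(): apply the lambda to every item.
squares = list(map(lambda n: n * n, evens))

# reduce(): fold the items into a single value, starting from 0.
total = reduce(lambda acc, n: acc + n, squares, 0)

print(evens, squares, total)
```

The same result can be written as `sum(n * n for n in numbers if n % 2 == 0)`, which is often considered the more idiomatic form in Python.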

14. Modules in Python

What is a module?

Different types of module

Creating user defined module

Setting path

The import statement

Normal Import

From … Import

Module Aliases

Reloading a module

dir() function

Working with Standard modules: math, random, datetime and os

15. Packages

Introduction to packages

Defining packages

Importing from packages

__init__.py file

Defining sub packages

Importing from sub packages

16. Errors and Exception Handling

Types of errors

Compile-Time Errors

Run-Time Errors

What is Exception?

Need of Exception handling

Predefined Exceptions

try, except, finally blocks

Nested blocks

Handling Multiple Exceptions

User defined Exceptions

raise statement
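
The exception-handling topics in this module (try, except, finally, multiple handlers, user-defined exceptions and raise) fit together as in the minimal sketch below. The `InsufficientFundsError` class, the `withdraw` function and the amounts are hypothetical names and values invented for illustration.

```python
class InsufficientFundsError(Exception):
    """Hypothetical user-defined exception for this sketch."""

    def __init__(self, balance, amount):
        super().__init__(f"cannot withdraw {amount}; balance is {balance}")
        self.balance = balance
        self.amount = amount

def withdraw(balance, amount):
    """Return the new balance, raising on invalid or oversized amounts."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise InsufficientFundsError(balance, amount)
    return balance - amount

log = []
for amount in (30, 500, -1):
    try:
        remaining = withdraw(100, amount)
        log.append(f"ok: {remaining}")
    except InsufficientFundsError as exc:   # user-defined exception
        log.append(f"declined: {exc}")
    except ValueError as exc:               # predefined exception
        log.append(f"invalid: {exc}")
    finally:                                # runs whether or not an error occurred
        log.append("done")

print(log)
```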

17. File Handling

Introduction

Types of Files in Python

Opening a file

Closing a file

Writing data to files

tell() and seek() methods

Reading data from files

Appending data to files

with open statement

Various functions

18. OOPs Concepts

OOPS Features

Encapsulation

Abstraction

Class

Object

Static and non static variables

Defining methods

Difference between functions & methods

Constructors

Parameterized Constructors

Built-in attributes

Object Reference count

Destructor

Garbage Collection

Inheritance

Types of Inheritances

Object class

Polymorphism

Overriding

super() function

19. Regular Expressions

What is regular expression?

Special characters

Forming regular expression

Compiling regular expressions

Grouping

findall() function

finditer() function

sub() function

match() function

search() function

Matching vs searching

Splitting a string

Replacing text

validations
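
The functions listed in this module can be compared side by side with Python's standard `re` module. The sample text and patterns below are assumptions for illustration, and the e-mail pattern is deliberately simplistic rather than a complete validator.

```python
import re

text = "Order 66 shipped on 2024-01-15 to user42@example.com"

# match() only succeeds at the start of the string; search() scans all of it.
at_start = re.match(r"\d+", text)    # None, because the text starts with "Order"
anywhere = re.search(r"\d+", text)   # first run of digits anywhere in the text

# Compiling a pattern and using groups to pull out its parts.
date_pattern = re.compile(r"(\d{4})-(\d{2})-(\d{2})")
year, month, day = date_pattern.search(text).groups()

# findall() returns every non-overlapping match as a list of strings.
numbers = re.findall(r"\d+", text)

# sub() replaces matches; split() splits on a pattern.
masked = re.sub(r"\d", "#", text)
words = re.split(r"\s+", "a  b   c")

# A simple validation: fullmatch() requires the whole string to match.
is_valid_email = re.fullmatch(r"[\w.]+@[\w.]+\.\w+", "user42@example.com") is not None

print(anywhere.group(), (year, month, day), numbers, words, is_valid_email)
```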

20. Database Access

Introduction

Installing MySQL database

Creating database users

Installing Oracle Python modules

Establishing connection with MySQL

Closing database connections

Connection object

Cursor object

Executing SQL queries

Retrieving data from Database

Using bind variables when executing SQL queries

Transaction Management

Handling errors

21. Python Date and Time

How to Use the date & datetime Classes

Time and date Objects

Calendar in Python

The Time Module

Python Calendar Module

22. Operating System Module

Introduction

getcwd

listdir

chdir

mkdir

rename file/dir

remove file/dir

rmtree()

os help

os operations

23. Advanced concepts

Python Iterator

Python Generator

Python closure

Python Decorators

Web Scraping

PIP

Working with CSV files

Working with XML files

Working with JSON files

Debugging
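
The iterator, generator, closure and decorator topics at the start of this module are closely related, as the short sketch below tries to show. All function names here (`countdown`, `make_multiplier`, `count_calls`, `add`) are hypothetical and exist only for the example.

```python
import functools

def countdown(start):
    """Generator: lazily yields start, start - 1, ..., 1."""
    while start > 0:
        yield start
        start -= 1

def make_multiplier(factor):
    """Closure: the inner function remembers `factor` after this call returns."""
    def multiply(value):
        return value * factor
    return multiply

def count_calls(func):
    """Decorator: wraps func and counts how many times it is called."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper

@count_calls
def add(a, b):
    return a + b

# Iterator protocol: iter() creates an iterator, next() advances it.
iterator = iter([10, 20])
first = next(iterator)
second = next(iterator)

remaining = list(countdown(3))
triple = make_multiplier(3)
tripled = triple(5)
results = [add(1, 2), add(3, 4)]

print(first, second, remaining, tripled, results, add.calls)
```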

24. GUI Programming (tkinter)

Introduction

Components and events

Root window

Labels

Fonts and colors

Buttons, checkbox

Label widget

Message widget

Text widget

Radio button

image

25. Excel Workbook

Installing and working with XlsxWriter

Creating Excel Workbook

Inserting into excel sheet

Inserting data into multiple excel sheets

Creating headers

Installing and working with xlrd module

Reading a specific cell or row or column

Reading specific rows and columns

26. Data Analytics

Introduction

pandas module

NumPy module

Matplotlib module

Working Examples

27. Introduction to Data Science

Machine Learning Introduction

Datasets

Supervised /Unsupervised Learning

Statistical Analysis

Data Analysis

Uni-variate/multi-variate analysis

Correlation Analysis

Algorithm types

Applications

28. Python Pandas

Introduction to Pandas

Creating Pandas Series

Creating Data Frames

Pandas Data Frames from dictionaries

Pandas Data Frames from list

Pandas Data Frames from series

Pandas Data Frames from CSV, Excel

Pandas Data Frames from JSON

Pandas Data Frames from Databases

Pandas Data Functionality

Pandas Timedelta

Creating Data Frames from Timedelta

Pandas Groupings and Aggregations

Converting Data Frames from list

Creating Functions

Converting Different Formats

Pandas and Matplotlib

Pandas use cases

29. Python NumPy

Introduction to NumPy

NumPy Arrays

NumPy Array Indexing

2-D and 3-D Arrays

NumPy Mathematical operations

NumPy Flattening and reshaping

NumPy Horizontal and Vertical Stack

NumPy linspace and arange

NumPy asarray and Random numbers

NumPy iterations and Transpose

NumPy Array Manipulation

NumPy and matplotlib

NumPy Linear Algebra

NumPy String Functions

NumPy operations and use cases

NumPy Working Examples

30. Python Matplotlib

Introduction to matplotlib

Installing matplotlib

Generating graphs

Normal plottings

Generating Bar Graphs

Histograms

Scatter plots

Stack plots

Pie plots

Matplotlib working examples