digital@thrayait.com +60162650525, +919043703606

Training Information

BIG DATA HADOOP

We are pleased to offer a comprehensive suite of training solutions tailored to meet your needs. Our services encompass both online and offline corporate training options, ensuring flexibility and accessibility for your team's professional development.

Course Content

Syllabus:

BIG DATA HADOOP

I: INTRODUCTION

What is Big Data?

What is Hadoop?

Need for Hadoop

Sources and Types of Data

Comparison with Other Technologies

Challenges with Big Data

i. Storage

ii. Processing

RDBMS vs Hadoop

Advantages of Hadoop

Hadoop Ecosystem Components

II: HDFS (Hadoop Distributed File System)

Features of HDFS

Name Node, Data Node, Blocks

Configuring Block Size

HDFS Architecture (5 Daemons)

i. Name Node

ii. Data Node

iii. Secondary Name node

iv. Job Tracker

v. Task Tracker

Metadata management

Storage and processing

Replication in Hadoop

Configuring Custom Replication

Fault Tolerance in Hadoop

HDFS Commands
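
For reference, a few of the HDFS shell commands covered here (the paths and file names are placeholders):

    hdfs dfs -mkdir /user/train                    # create a directory in HDFS
    hdfs dfs -put localfile.txt /user/train        # copy a local file into HDFS
    hdfs dfs -ls /user/train                       # list directory contents
    hdfs dfs -cat /user/train/localfile.txt        # print file contents
    hdfs dfs -setrep 2 /user/train/localfile.txt   # change the replication factor
    hdfs dfs -rm /user/train/localfile.txt         # delete a file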

III: MAP REDUCE

Map Reduce Architecture

Processing Daemons of Hadoop

Job Tracker (Roles and Responsibilities)

Task Tracker (Roles and Responsibilities)

Phases of Map Reduce

i) Mapper phase

ii) Reducer phase

Input split

Input split vs Block size

Partitioner in Map Reduce

Groupings and Aggregations

Data Types in Map Reduce

Map Reduce Programming Model

Driver Code

Mapper Code

Reducer Code

Programming examples

File input formats

File output formats

Merging in Map Reduce

Speculative Execution Model

Speculative Job
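
To make the Mapper and Reducer phases concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets Python scripts stand in for the Mapper and Reducer code developed in this module (the file names and streaming setup are assumptions for the example):

    #!/usr/bin/env python
    # mapper.py - emit (word, 1) for every word read from stdin
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

    #!/usr/bin/env python
    # reducer.py - sum counts per word (streaming input arrives sorted by key)
    import sys
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")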

IV: SQOOP (SQL + HADOOP)

Introduction to Sqoop

SQOOP Import

SQOOP Export

Importing Data From RDBMS to HDFS

Importing Data From RDBMS to HIVE

Importing Data From RDBMS to HBASE

Exporting From HBASE to RDBMS

Exporting From HIVE to RDBMS

Exporting From HDFS to RDBMS

Transformations While Importing / Exporting

Filtering data while importing

Vertical and horizontal merging while importing

Working with delimiters while importing

Groupings and aggregations while importing

Incremental import

Examples and operations

Defining SQOOP Jobs
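
A representative Sqoop import, shown as a sketch; the connection string, table, credentials, and directory names are placeholders:

    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username train -P \
      --table orders \
      --target-dir /user/train/orders \
      --incremental append --check-column order_id

    # saving the same import as a reusable, named job
    sqoop job --create orders_import -- import \
      --connect jdbc:mysql://dbhost/sales --table orders --target-dir /user/train/orders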

V: YARN

Introduction

Speculative Execution, Speculative Job, and Speculative Task

Comparison of Hadoop 1.x with Hadoop 2.x

Comparison with previous versions

YARN Architecture Components

i. Resource Manager

ii. Application Master

iii. Node Manager

iv. Application Manager

v. Resource Scheduler

vi. Job History Server

vii. Container

VI: NOSQL

What is “Not Only SQL”?

NOSQL Advantages

What is the problem with RDBMS for large-scale data systems?

Types of NOSQL & Purposes

Key Value Store

Columnar Store

Document Store

Graph Store

Introduction to Cassandra – NOSQL Database

Introduction to MongoDB and CouchDB Database

Integration of NOSQL Databases with Hadoop

VII: HBASE

Introduction to Bigtable

What are NOSQL and columnar store databases?

HBASE Introduction

HBase use cases

HBase basics

Column families

Scans

HBase Architecture

Map Reduce over HBase

HBase data modeling

HBase schema design

HBase CRUD operations (see the shell sketch below)

Hive & HBase integration

HBase storage handlers
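
An illustrative HBase shell session touching the CRUD operations above; the table and column-family names are invented for the example:

    create 'employee', 'personal'                      # table with one column family
    put 'employee', 'row1', 'personal:name', 'Asha'    # create/update a cell
    get 'employee', 'row1'                             # read a single row
    scan 'employee'                                    # read all rows
    delete 'employee', 'row1', 'personal:name'         # delete a cell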

VIII: HIVE

Introduction

Hive Architecture

Hive Metastore

Hive Query Language (HQL)

Difference between HQL and SQL

Hive Built-in Functions

Loading Data from Local Files to Hive Tables

Loading Data from HDFS Files to Hive Tables

Table Types

Inner Tables

External Tables

Hive Working with Unstructured Data

Hive Working with XML Data

Hive Working with JSON Data

Hive Working with URLs and Weblog Data

Hive Unions

Hive Joins

Multi Table / File Inserts

Inserting Into Local Files

Inserting Into HDFS Files

Hive UDF (User Defined Functions)

Hive UDAF (User Defined Aggregate Functions)

Hive UDTF (User Defined Table-Generating Functions)

Partitioned Tables

Non-Partitioned Tables

Multi-column Partitioning

Dynamic Partitions In Hive

Performance Tuning Mechanisms

Bucketing in Hive

Indexing in Hive

Hive Examples

Hive & HBase Integration
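
A small HiveQL sketch tying several of these topics together; the table names, columns, and paths are placeholders:

    -- external table over files already sitting in HDFS
    CREATE EXTERNAL TABLE logs (ip STRING, url STRING, hits INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION '/user/train/logs';

    -- load a local file into a managed (inner) table
    CREATE TABLE logs_managed LIKE logs;
    LOAD DATA LOCAL INPATH '/tmp/logs.csv' INTO TABLE logs_managed;

    -- grouping and aggregation with HQL
    SELECT url, SUM(hits) AS total_hits
    FROM logs
    GROUP BY url;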

PYSPARK

I) PYSPARK INTRODUCTION

What is Apache Spark?

Why PySpark?

Need for PySpark

Spark: Python vs Scala

PySpark Features

Real-life usage of PySpark

PySpark Web/Application UI

PySpark - SparkSession

PySpark – SparkContext

PySpark – RDD

PySpark – Parallelize

PySpark – repartition() vs coalesce()

PySpark – Broadcast Variables

PySpark – Accumulator
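
A minimal PySpark sketch of the entry points and shared variables listed above; the local master, app name, and sample data are assumptions for the example:

    from pyspark.sql import SparkSession

    # SparkSession is the unified entry point; SparkContext hangs off it
    spark = SparkSession.builder.master("local[*]").appName("intro").getOrCreate()
    sc = spark.sparkContext

    # parallelize a Python list into an RDD with two partitions
    rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)

    # broadcast variable (read-only copy on every executor) and accumulator
    lookup = sc.broadcast({1: "one", 2: "two"})
    counter = sc.accumulator(0)

    def tag(x):
        counter.add(1)                       # accumulators aggregate executor updates on the driver
        return lookup.value.get(x, "other")

    print(rdd.map(tag).collect())            # ['one', 'two', 'other', 'other']
    print(counter.value)                     # 4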

II) PYSPARK - RDD COMPUTATION

Operations on an RDD

Directed Acyclic Graph (DAG)

RDD Actions and Transformations

RDD computation

Steps in RDD computation

RDD persistence

Persistence features

Persistence Options:

1) MEMORY_ONLY

2) MEMORY_ONLY_SER

3) MEMORY_AND_DISK

4) MEMORY_AND_DISK_SER

5) DISK_ONLY
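
How a storage level is applied in practice, as a short sketch:

    from pyspark import StorageLevel

    rdd = sc.parallelize(range(1000))          # sc: an existing SparkContext
    rdd.persist(StorageLevel.MEMORY_AND_DISK)  # memory first, spill to disk if needed
    rdd.count()                                # first action materializes the cache
    rdd.unpersist()                            # release the storage when done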

III) PYSPARK - CORE COMPUTING

Fault Tolerance Model in Spark

Different ways of creating a RDD

Word Count Example (see the sketch at the end of this module)

Creating Spark objects (RDDs) from Python objects (lists)

Increasing the number of partitions

Aggregations Over Structured Data:

reduceByKey()
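
The word-count example referenced in this module, written with the RDD API; the HDFS paths are placeholders:

    # sc: an existing SparkContext
    counts = (sc.textFile("hdfs:///user/train/input.txt")
                .flatMap(lambda line: line.split())
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))   # aggregate counts per key
    counts.saveAsTextFile("hdfs:///user/train/wordcount_out")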

IV) GROUPINGS AND AGGREGATIONS

i) Single Grouping and Single Aggregation

ii) Single Grouping and Multiple Aggregations

iii) Multiple Grouping and Single Aggregation

iv) Multiple Grouping and Multiple Aggregations

Differences between reduceByKey() and groupByKey() (illustrated below)

Process of groupByKey

Process of reduceByKey

reduce() function

Various Transformations

Various Built-in Functions
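
A sketch of the difference noted above: reduceByKey() combines values on each partition before the shuffle, while groupByKey() ships every raw value across the network:

    pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])

    # reduceByKey: partial sums computed map-side before the shuffle - preferred
    print(pairs.reduceByKey(lambda a, b: a + b).collect())   # [('a', 4), ('b', 2)] (order may vary)

    # groupByKey: all raw values are shuffled, then aggregated
    print(pairs.groupByKey().mapValues(sum).collect())       # same result, more shuffle traffic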

V) Various Actions and Transformations:

countByKey()

countByValue()

sortByKey()

zip()

union()

distinct()

Various count aggregations

Joins (see the sketch below)

- inner join

- outer join

cartesian()

cogroup()

Other actions and transformations
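
A quick look at pair-RDD joins from the list above, with invented sample data:

    emp = sc.parallelize([(1, "Asha"), (2, "Ravi")])
    dept = sc.parallelize([(1, "Sales")])

    print(emp.join(dept).collect())           # inner join: [(1, ('Asha', 'Sales'))]
    print(emp.leftOuterJoin(dept).collect())  # outer join keeps (2, ('Ravi', None))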

VI) PySpark SQL - DataFrame

Introduction

Making data Structured

Case Classes

Ways to extract case class objects:

1) using a function

2) using map with multiple expressions

3) using map with a single expression

SQLContext

DataFrame API

Dataset API

RDD vs DataFrame vs Dataset

PySpark – Create a DataFrame

PySpark – Create an empty DataFrame

PySpark – Convert RDD to DataFrame

PySpark – Convert DataFrame to Pandas

PySpark – show()

PySpark – StructType & StructField

PySpark – Row Class

PySpark – Column Class

PySpark – select()

PySpark – collect()

PySpark – withColumn()

PySpark – withColumnRenamed()

PySpark – where() & filter()

PySpark – drop() & dropDuplicates()

PySpark – orderBy() and sort()

PySpark – groupBy()

PySpark – join()

PySpark – union() & unionAll()

PySpark – unionByName()

PySpark – UDF (User Defined Function)

PySpark – map()

PySpark – flatMap()

PySpark – foreach()

PySpark – sample() vs sampleBy()

PySpark – fillna() & fill()

PySpark – pivot() (Row to Column)

PySpark – partitionBy()

PySpark – ArrayType Column (Array)

PySpark – MapType (Map/Dict)
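
A compact DataFrame sketch exercising several of the operations above; the schema and rows are invented for the example:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.master("local[*]").appName("df-demo").getOrCreate()

    df = spark.createDataFrame(
        [("Asha", "Sales", 3000), ("Ravi", "Sales", 4000), ("Mia", "HR", 3500)],
        ["name", "dept", "salary"])

    (df.where(F.col("salary") > 3000)                # filter rows
       .withColumn("bonus", F.col("salary") * 0.1)   # derive a new column
       .select("name", "dept", "bonus")
       .show())

    df.groupBy("dept").agg(F.avg("salary").alias("avg_salary")).show()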

VII) PySpark SQL Functions

PySpark – Aggregate Functions

PySpark – Window Functions

PySpark – Date and Timestamp Functions

PySpark – JSON Functions

PySpark – Read & Write JSON file
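
A window-function sketch (ranking rows within a group), reusing the invented df from the previous example:

    from pyspark.sql import functions as F
    from pyspark.sql.window import Window

    # rank employees by salary within each department
    w = Window.partitionBy("dept").orderBy(F.col("salary").desc())
    df.withColumn("rank_in_dept", F.row_number().over(w)).show()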

VIII) PySpark Built-In Functions

PySpark – when()

PySpark – expr()

PySpark – lit()

PySpark – split()

PySpark – concat_ws()

PySpark – substring()

PySpark – translate()

PySpark – regexp_replace()

PySpark – overlay()

PySpark – to_timestamp()

PySpark – to_date()

PySpark – date_format()

PySpark – datediff()

PySpark – months_between()

PySpark – explode()

PySpark – array_contains()

PySpark – array()

PySpark – collect_list()

PySpark – collect_set()

PySpark – create_map()

PySpark – map_keys()

PySpark – map_values()

PySpark – struct()

PySpark – countDistinct()

PySpark – sum(), avg()

PySpark – row_number()

PySpark – rank()

PySpark – dense_rank()

PySpark – percent_rank()

PySpark – typedLit()

PySpark – from_json()

PySpark – to_json()

PySpark – json_tuple()

PySpark – get_json_object()

PySpark – schema_of_json()

Working Examples
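
A few of the built-in functions above in action; the columns and data are made up, and spark is assumed to be an existing SparkSession:

    from pyspark.sql import functions as F

    people = spark.createDataFrame([("Asha Rao", "2024-01-15")], ["full_name", "joined"])

    (people
       .withColumn("first_name", F.split("full_name", " ")[0])         # split()
       .withColumn("joined_date", F.to_date("joined", "yyyy-MM-dd"))   # to_date()
       .withColumn("tenure_days",
                   F.datediff(F.current_date(), "joined_date"))        # datediff()
       .withColumn("grade",
                   F.when(F.col("tenure_days") > 365, "senior")
                    .otherwise("junior"))                              # when()
       .show())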

IX) PySpark External Sources

Working with SQL statements

Spark and Hive Integration

Spark and MySQL Integration

Working with CSV

Working with JSON

Transformations and actions on dataframes

Narrow vs wide transformations

Adding, dropping, and renaming columns

Adding and dropping rows

Handling nulls

Joins

Window function

Writing data back to External sources

Creating tables from DataFrames (internal tables, temporary tables)
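
Reading from and writing back to external sources, as covered above; paths, credentials, and table names are placeholders, and the MySQL JDBC driver is assumed to be on the classpath:

    # CSV in, with header and inferred schema
    sales = spark.read.csv("hdfs:///data/sales.csv", header=True, inferSchema=True)

    # JSON in
    events = spark.read.json("hdfs:///data/events.json")

    # MySQL via JDBC
    orders = (spark.read.format("jdbc")
              .option("url", "jdbc:mysql://dbhost/sales")
              .option("dbtable", "orders")
              .option("user", "train").option("password", "***")
              .load())

    # write back out, and register a temporary table for SQL
    sales.write.mode("overwrite").parquet("hdfs:///out/sales_parquet")
    sales.createOrReplaceTempView("sales_tmp")
    spark.sql("SELECT COUNT(*) FROM sales_tmp").show()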

X) DEPLOYMENT MODES

Local Mode

Cluster Modes (Standalone, YARN)
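
How the same application is launched in each mode with spark-submit; the script name and host addresses are placeholders:

    # local mode - everything in one JVM, good for development
    spark-submit --master local[*] app.py

    # standalone cluster
    spark-submit --master spark://master-host:7077 app.py

    # YARN cluster mode - the driver runs inside the cluster
    spark-submit --master yarn --deploy-mode cluster app.py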

XI) PYSPARK APPLICATION

Stages and Tasks

Driver and Executor

Building spark applications/pipelines

Deploying Spark applications to a cluster and tuning them

Performance tuning

PySpark Streaming Concepts

Integration with Kafka

PySpark – MLlib
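
A minimal Structured Streaming sketch for the Kafka integration topic; the broker address and topic name are assumptions, and the spark-sql-kafka package must be on the classpath:

    stream = (spark.readStream.format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "events")
              .load())

    # Kafka values arrive as bytes; cast to string before processing
    query = (stream.selectExpr("CAST(value AS STRING) AS value")
             .writeStream.format("console")
             .outputMode("append")
             .start())
    query.awaitTermination()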