Hadoop/Big Data

Module -1: Duration: 75 Minutes

Basic Concepts

  • Big-Data:
    1. What is Data?
    2. What is Big-Data?
    3. Sources of Big-Data.
    4. Structured Vs Unstructured
    5. Big-Data Characteristics: 3Vs
    6. Common Use Cases.
    7. Project Discussion
    8. Lab-1: Connecting to the cluster: executing Unix commands, WinSCP or FileZilla, PuTTY, Edge Node
  • Apache Hadoop
    1. Limitations of the Existing Solution on Big Data
    2. Compare Teradata with Hadoop
    3. How does Hadoop provide a solution for Big Data?
    4. Apache Hadoop's competitors in the market: why Hadoop?
    5. What is Apache Hadoop?
    6. History of Apache Hadoop
    7. Why the name Hadoop?
    8. Doug Cutting.
    9. Forecasting the job market across the globe
    10. Discuss Generation 1 Hadoop and Generation 2 Hadoop
    11. Do you know Hadoop is a Desktop application?
    12. YARN = 10 needs for Hadoop.
    13. Lab-2: Hortonworks, Cloudera, Pivotal distributions

Module -2: Duration: 120 Minutes

  • Distributed File System Advantages and Disadvantages.
  • Apache Hadoop Components
    1. Storage:
      1. Default File System
      2. HDFS
      3. Simple Storage Service
    2. Processing
      1. MapReduce
      2. MPP
      3. Graph Processing
    3. Lab-3: Configuration of File System and Framework
  • Introduction to Hadoop Distributed File System (HDFS)
    1. HDFS Components
    2. Name Node
    3. Secondary Name Node
    4. Data Node
    5. Lab-4: Daemons in the cluster, starting the daemons.
  • Basics of HDFS
    1. HDFS Architecture
    2. Why Block?
    3. Why is the default block size 64 MB?
    4. Why is the default replication factor 3?
    5. Block Management Service?
    6. Lab-5: HDFS basic commands.
  • Replication and Rack Awareness, Lab-6: Show replicated blocks in the cluster
  • Anatomy of File Write on HDFS, Lab-7: Show writing of file on HDFS
  • Anatomy of File Read on HDFS, Lab-8: Show reading of file on HDFS

Module -3: Duration: 75 MIN

  • Introduction to typical Hadoop Cluster
  • Secondary NameNode
    1. What is FsImage?
    2. What is edits.log?
    3. Usage of Secondary NameNode
    4. Lab-9: Show FsImage and edits.log in the cluster
  • NameNode is Single Point of Failure
    1. Generation 1 Hadoop SPOF
    2. Generation 2 Hadoop SPOF handled using HA.
      1. Failover Fencing
      2. STONITH
  • Split Brain Disorder
  • NameNode Scalability:
    1. Generation -1: Is NameNode scalable in Generation 1 Hadoop?
    2. Generation -2: HDFS Federation.

 

Module -4: Duration: 120 MIN

  • Hadoop Cluster Modes.
    1. Standalone
    2. Pseudo-Distributed
    3. Fully-Distributed Mode
    4. Lab: Show configuration changes to run different cluster modes.

 

  • Core Configuration files
    1. core-site.xml
    2. hdfs-site.xml
    3. mapred-site.xml
    4. yarn-site.xml
    5. hadoop-env.sh
    6. masters
  • Running the TeraGen example.
  • Dump of MR Log
  • Hadoop copy commands.

Module -5: Duration: 120 MIN

  • Introduction to MapReduce Framework.
  • MR Framework behind the scenes
  • Traditional way of solving the problem, Lab: Show sample running of WordCount process.
  • MapReduce way of solving the problem, Lab: Show sample running of WordCount process.
  • Generation 1: Executing WordCount MR Job
    1. Job Tracker
    2. Task Tracker
  • Generation 2: Executing WordCount Application
  • How to debug the log files for MapReduce.
  • Differences between Gen1 and Gen2 Hadoop
  • Anatomy of MapReduce Job
    1. Output Collector
    2. Circular Memory
    3. Split files
  • Advantage of MapReduce
    1. Parallel Processing.
    2. Data Locality
  • What is speculative Execution?

Module -6: Duration: 120 MIN

  • Introduction to YARN
    1. Generation -2 Architecture
    2. Gen-2 Components
      1. Client
      2. Resource Manager
        1. Scheduler
        2. Application Manager
      3. Node Manager
      4. Container
      5. Application Master
  • MapReduce Application Phases in YARN
    1. Application Submission
    2. Job Initialization
    3. Task Assignment
    4. Task Execution
    5. Status Update
    6. Failure Recovery
      1. Container Failure
      2. Node Manager Failure
      3. AM Failure
      4. Resource Manager Failure
  • Lab: MR Program Execution on YARN
  • Moving beyond MapReduce in YARN
  • Introduction to Job Queues, Lab: How to define Queues?
  • Schedulers
    1. FIFO Scheduler
    2. Fair Scheduler
    3. Capacity Scheduler

Module -7: Duration: 120 MIN

  • Chart of differences between the MRv1 and MRv2 Java APIs
  • Writing MapReduce applications using MRv1 API
    1. Job: discussion of Configured, Tool, ToolRunner, run(), Configuration, etc.
    2. Mapper
    3. Reducer
  • Writing MapReduce applications using MRv2 API
    1. Job
    2. Mapper
    3. Reducer
  • Writing Weather Temperature Use Case.
  • De-identification of Personal Information
  • Fixed Width File to CSV file
  • JSON to CSV file
  • Combiner
  • Partitioner
  • Assignment: Secondary Sorting Use Case, Matrix Calculation Use Case.

Module -8: Duration: 120 MIN

  • Joins in MapReduce
    1. Map Side Join
    2. Reduce Side Join
    3. Broadcast Join
  • Chain Mappers, Reducers
  • Custom InputFormat
  • OutputFormat
  • Data Types
    1. Pre-Defined Data Types
    2. Custom Data Types
    3. WritableComparable
  • MRUnit
  • Distributed Cache
  • Sequence File
  • Avro File
  • Lab: Writing and executing each use case.

Module -9: Apache PIG

  • Why PIG?
  • Compare PIG Vs MR
  • Where to Use Pig?
  • Pig Execution Modes:
    1. Local
    2. MapReduce
    3. Tez_on_local
  • Pig Latin
  • Pig Data Types
    1. Primitive Data Types
    2. Complex Data Types
      1. Bag
      2. Tuple
      3. Field
      4. Map
    3. NULL
  • Pig Data Flow Language
  • Pig Operators
  • Load and Store Operators
    1. Load
    2. Store
    3. DUMP
  • Transform Operators
    1. FILTER
    2. FOREACH
    3. GROUP
    4. PARALLEL
    5. COGROUP
    6. INNER JOIN
    7. OUTER JOIN
    8. UNION
    9. SPLIT
  • Diagnostic Operators
    1. DESCRIBE
    2. EXPLAIN
    3. ILLUSTRATE
  • Built-in Functions in PIG
  • Pig Properties
  • UDFs
  • Pig HBASE storage Handler
    1. Load from HBASE
    2. Store into HBASE
  • Pig Schema
  • Synopsis: Pig Read Hive Table

Module: Apache Hive

  • History of Hive
  • Hive Architecture
  • Hive Metastore
    1. Embedded Metastore
    2. Local Metastore
    3. Remote Metastore
  • Hive Components
    1. Driver
    2. Shell
    3. Compiler
    4. Execution Engine
  • HiveQL
  • HiveQL data types
  • ACID Hive
    1. Hive Transactions
    2. Hive Updates
  • Partitioning
  • Bucketing
  • Hive Loads
  • Hive Tables
    1. Managed Table
    2. External Table
    3. Native Table
    4. Non-Native Tables
    5. Temporary Tables
  • Hive Views
  • Hive Diagnostic Operators

Module: Advanced Hive

  • Hive on HBASE
  • Join in Hive
    1. Inner Joins
    2. Outer Joins
  • Dynamic Partition
  • Hive SerDe: JSON SerDes
  • Hive UDF
  • Hive Parameters
  • Lab: Creating Hive tables, ORC tables, Avro tables, JSON tables

Module: HBASE

  • Introduction to NoSQL Databases.
  • CAP Theorem
  • HBASE
  • History of HBASE
  • Three major HBASE Components
    1. HBASE Master
    2. Region Server
    3. Client Library
  • HBASE vs RDBMS
  • HBASE Versioning
  • HBASE shell
  • HBASE Column Family
  • Sparse Datastore
  • Horizontal Sharding
  • ROW KEY
  • HBASE Architecture
    1. Zookeeper
    2. WAL
    3. HFILE
    4. MemStore
  • REGION
  • HBASE Write
  • HBASE Read

 

Module: Advanced HBASE

  • HBASE Loading Techniques
    1. HBASE SHELL
    2. HBASE Java Client
    3. PIG to HBASE
    4. SQOOP to HBASE
    5. Hive to HBASE
  • Coprocessors
  • Joins in HBASE
  • BLOOM FILTER

Module: Zookeeper

  • Introduction to Zookeeper
  • Introduction to ZNODE
  • Zookeeper ENSEMBLE
  • ZAB Protocol
  • Atomic Broadcast
  • Zookeeper during failures

Module: OOZIE

  • Introduction to OOZIE
  • OOZIE Components
  • Coordinators
  • Workflow
  • Bundle
  • Creating and Running Workflow
  • Creating and Running Coordinator
  • OOZIE actions
  • OOZIE nodes
  • OOZIE WebUI
  • OOZIE Client

Module: FLUME

  • Introduction to FLUME
  • AGENTS
  • SOURCE,CHANNEL,SINK
  • DIFFERENT SOURCES
  • Working with Twitter Source
  • Dumping MR Logs into FLUME

Module: SQOOP

  • Introduction to SQOOP
  • SQOOP Working principle
  • Sqoop import
  • Sqoop Export
  • Sqoop import query
  • Sqoop Number of Mappers

Module: Project

  • Discussion on Project –I
    1. Implement the project
    2. OOZIE
    3. FLUME
    4. HADOOP COPY COMMANDS
    5. MAPREDUCE
    6. PIG
    7. HIVE
    8. HBASE
    9. MYSQL
    10. WebUI
  • Discussion on Project –II
  • Discussion on Project –III
  • Discussion on Project –IV
  • Discussion on Project –V
