Hadoop 2.x Architecture

Addressing the limitations of Hadoop 1.x, Hadoop 2.x is designed in a generic way so as to accommodate more than just Hadoop MapReduce (MR V1), so that the cluster can also be used for Spark, Kafka, Storm, and other distributed frameworks. Hadoop 2.x is also referred to as YARN (Yet Another Resource Negotiator) or MapReduce V2.

Drawing parallels between Hadoop 1.x and Hadoop 2.x

  • Job Tracker is split into the Resource Manager and the Application Master
  • Task Tracker is split into the Node Manager and Containers

Components of MR V2

  • Resource Manager (one per cluster): It does the job of pure scheduling, i.e., identifying the resources across the cluster and assigning them to the competing applications. It has two modules in it:
    • Scheduler: This is pluggable and can be swapped based on requirements, e.g., the Capacity Scheduler or the Fair Scheduler (see the configuration sketch after the note below).
    • Applications Manager: This accepts job submissions and is responsible for restarting the Application Master in case of its failure.

Note: Resource Manager failure itself can be handled with ZooKeeper (Resource Manager high availability).
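As a rough illustration (assuming the Hadoop 2.x client libraries on the classpath), both the scheduler choice and ZooKeeper-backed RM high availability surface as plain configuration. The property keys below are the stock YARN ones; the values and ZooKeeper hosts are only examples:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class RmConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new YarnConfiguration();

        // Plug in a scheduler implementation (the Capacity Scheduler is the default).
        conf.set(YarnConfiguration.RM_SCHEDULER, // yarn.resourcemanager.scheduler.class
            "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");

        // RM high availability backed by a ZooKeeper quorum
        // (normally these live in yarn-site.xml; the hosts here are made up).
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.zk-address", "zk1:2181,zk2:2181,zk3:2181");

        System.out.println(conf.get(YarnConfiguration.RM_SCHEDULER));
    }
}
```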

  • Application Master (one per application/job): This is responsible for negotiating resources with the Resource Manager, and once it gets the resources (containers inside slave nodes), it talks to the Node Managers to get the tasks executed, tracking their status and progress. The major benefits this component brings to the architecture are:
    • Scalability: By splitting resource management/scheduling from application life-cycle management, the cluster can be scaled much further than MR V1.
    • Generic architecture: By moving all the application-specific runtime complexity into the Application Master, we can launch any type of framework, like Kafka, Storm, Spark, etc.
      The Application Master also supports a very generic resource model; it can request very specific resources like the amount of RAM required, the number of cores, etc. (see the sketch after this list).
  • Node Manager (one per slave node): This is responsible for launching containers inside the node with the amount of resources specified by the Application Master. YARN allows applications (the AM) to launch any process (unlike Java-only tasks in MR V1) with:
    • the command to launch the process within the container
    • environment variables required by the process
    • local resources required on that node prior to launch (like any 3rd-party jars etc.)
    • security tokens (if any)
  • Container (one/many per node): This is the lowest-level slave-node component, used to process the data.
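To make the division of labor concrete, here is a heavily condensed sketch of an Application Master using YARN's AMRMClient/NMClient APIs: it registers with the RM, requests a container with an explicit RAM/core capability, and asks the Node Manager to launch an arbitrary command. The values, the launched command, and the single allocate() call (a real AM polls in a heartbeat loop) are illustrative only:

```java
import java.util.Collections;

import org.apache.hadoop.yarn.api.protocolrecords.AllocateResponse;
import org.apache.hadoop.yarn.api.records.Container;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.FinalApplicationStatus;
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.client.api.NMClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class AmSketch {
    public static void main(String[] args) throws Exception {
        YarnConfiguration conf = new YarnConfiguration();

        // The AM registers itself with the Resource Manager on boot-up.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(conf);
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");

        // Generic resource model: request a container with an explicit
        // amount of RAM (in MB) and number of virtual cores.
        Resource capability = Resource.newInstance(1024, 2);
        rmClient.addContainerRequest(
            new ContainerRequest(capability, null, null, Priority.newInstance(0)));

        NMClient nmClient = NMClient.createNMClient();
        nmClient.init(conf);
        nmClient.start();

        // A real AM calls allocate() in a heartbeat loop; one call here for brevity.
        AllocateResponse response = rmClient.allocate(0.1f);
        for (Container container : response.getAllocatedContainers()) {
            ContainerLaunchContext ctx = Records.newRecord(ContainerLaunchContext.class);
            ctx.setCommands(Collections.singletonList("/bin/date"));         // command to launch
            ctx.setEnvironment(Collections.singletonMap("APP_ENV", "demo")); // env vars for the process
            // ctx.setLocalResources(...) -- 3rd-party jars etc. needed before launch
            // ctx.setTokens(...)         -- security tokens, if any
            nmClient.startContainer(container, ctx); // ask that container's Node Manager to launch it
        }

        // Once the job is done, the AM unregisters, freeing its own container.
        rmClient.unregisterApplicationMaster(FinalApplicationStatus.SUCCEEDED, "", "");
    }
}
```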

Hadoop 2.x/YARN/MR V2 Architecture

[Figure: Hadoop 2.x/YARN/MR V2 architecture diagram]

Application Flow

#1: The client submits a job to the cluster, which is received by the Resource Manager
#2: The Resource Manager (RM) launches the Application Master in one of the available containers (here container #C5)
#3: On boot-up, the Application Master (AM) registers itself with the RM and then requests resources to run the job on the cluster
#4: The RM responds to the AM with 2 containers, #C1 and #C8
#5: The AM then asks the Node Managers (NMs) of #C1 and #C8 to launch the containers
#6: While running the job, the containers report their progress back to the AM
#7: While all of this is going on, the client communicates directly with the AM to get the status of the job (a minimal client-side sketch follows)
#8: Once the job is complete, the AM unregisters itself from the RM, making its container available for other jobs
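The client side of steps #1 and #7 can be sketched with YARN's YarnClient API. The application name is a placeholder, and the AM launch context (which would carry step #2's container spec) is elided:

```java
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ClientFlowSketch {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        // Step #1: ask the RM for a new application and submit it.
        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("demo-app");
        // ctx.setAMContainerSpec(...) would describe how to launch the AM (step #2).
        ApplicationId appId = yarnClient.submitApplication(ctx);

        // Step #7: the client polls the job status; once the AM is up, the
        // report also carries the AM's tracking URL for direct communication.
        ApplicationReport report = yarnClient.getApplicationReport(appId);
        System.out.println(report.getYarnApplicationState() + " " + report.getTrackingUrl());
    }
}
```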

Hadoop 1.x Architecture and Drawbacks

Hadoop is built on two whitepapers published by Google (the Google File System and MapReduce papers), which map to its two core components:

  • HDFS
  • MapReduce

HDFS: Hadoop Distributed File System

It differs from a normal file system in that data copied onto HDFS is split into ‘n’ blocks, and each block is copied to a different node in the cluster. To achieve this, HDFS uses a master-slave architecture (a minimal write sketch follows the list below):

  • HDFS Master => Name Node: Takes the client request and is responsible for orchestrating the data copy across the cluster
  • HDFS Slave => Data Node: Actually stores the blocks of data and coordinates with its master
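A minimal write sketch, assuming a reachable cluster at the hypothetical address hdfs://namenode:8020; the block size and replication values are illustrative (and in Hadoop 1.x the block-size key is dfs.block.size rather than the 2.x dfs.blocksize):

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Illustrative values: split files into 128 MB blocks, keep 3 copies of each block.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
        conf.setInt("dfs.replication", 3);

        // The Name Node resolves the path and decides where each block goes;
        // the client then streams the block data to the chosen Data Nodes.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        try (FSDataOutputStream out = fs.create(new Path("/demo/data.txt"))) {
            out.writeUTF("hello hdfs");
        }
    }
}
```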

MapReduce: This is the processing engine, also implemented in a master-slave architecture.

  • MR Master => Job Tracker: Takes incoming jobs, identifies the available resources across the cluster, divides each job into tasks, and submits them to the cluster (see the submission sketch below)
  • MR Slave => Task Tracker: Actually runs the tasks and coordinates with its master.
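A minimal job-submission sketch using the old org.apache.hadoop.mapred API, which is how clients handed work to the Job Tracker; the paths and the omitted mapper/reducer classes are placeholders:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Mrv1Sketch {
    public static void main(String[] args) throws Exception {
        // JobClient hands the job to the Job Tracker, which splits it into
        // map/reduce tasks and schedules them onto Task Trackers.
        JobConf job = new JobConf(Mrv1Sketch.class);
        job.setJobName("demo-mrv1");
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setMapperClass(...); job.setReducerClass(...);  // application-specific classes
        FileInputFormat.setInputPaths(job, new Path("/demo/in"));
        FileOutputFormat.setOutputPath(job, new Path("/demo/out"));
        JobClient.runJob(job); // blocks until the Job Tracker reports completion
    }
}
```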

Architecture

[Figure: Hadoop 1.x architecture diagram]

Drawbacks

  • The JobTracker is designed in such a way that it is tightly coupled with two important responsibilities, “Resource Management” and “MapReduce Task Execution”. Because of this, the cluster cannot be used for distributed computing technologies other than Hadoop MapReduce, like Spark/Kafka/Storm/…
  • The Name Node can maintain metadata for at most about 4000-5000 data nodes, which limits cluster scalability to 4k-5k nodes
  • Slots are hard-partitioned into Mapper and Reducer slots, so idle slots of one kind cannot run tasks of the other kind (see the configuration sketch below)
  • The JobTracker was a Single Point Of Failure (SPOF)
  • Iterative applications (e.g., Machine Learning) are very slow (up to 10x slower than on YARN)
  • Lack of wire-compatible protocols between client and server in MapReduce applications (unlike Hive and Pig, which can support multiple versions on the same cluster)
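For illustration, the slot counts were fixed per Task Tracker through Hadoop 1.x mapred-site.xml properties like the following (the values are examples), which is what made the partition “hard”:

```java
import org.apache.hadoop.conf.Configuration;

public class SlotConfigSketch {
    public static void main(String[] args) {
        // Each Task Tracker advertises a fixed number of map and reduce slots;
        // a slot of one kind can never run the other kind of task.
        Configuration conf = new Configuration();
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 4);
        conf.setInt("mapred.tasktracker.reduce.tasks.maximum", 2);
        System.out.println("map slots: "
            + conf.getInt("mapred.tasktracker.map.tasks.maximum", 2));
    }
}
```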

Hadoop 2.x was released to address these drawbacks.
