Hadoop 1.x Architecture and Drawbacks
February 13, 2017
Hadoop is built on two whitepapers published by Google: the Google File System (GFS) paper, which HDFS implements, and the MapReduce paper.
- HDFS (Hadoop Distributed File System)
- MapReduce
HDFS: Hadoop Distributed File System
It differs from a normal file system in that data copied onto HDFS is split into fixed-size blocks (64 MB by default in Hadoop 1.x), and each block is copied onto a different node in the cluster. To achieve this, HDFS uses a master-slave architecture:
- HDFS Master => Name Node: Takes client requests, holds the file system metadata, and is responsible for orchestrating data placement across the cluster
- HDFS Slave => Data Node: Stores the actual blocks of data and coordinates with its master
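To make the client's view of this concrete, below is a minimal sketch of writing a file to HDFS with the Java FileSystem API (my own example, not from the original post; the NameNode host, port, and file path are hypothetical). The client contacts the Name Node only for metadata, then streams the bytes block by block to the Data Nodes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical NameNode address; normally picked up from core-site.xml.
        // "fs.default.name" is the Hadoop 1.x key (renamed fs.defaultFS in 2.x).
        conf.set("fs.default.name", "hdfs://namenode-host:9000");
        FileSystem fs = FileSystem.get(conf);

        // create() asks the Name Node where to place each block; the actual
        // bytes are then streamed directly to the chosen Data Nodes.
        Path path = new Path("/user/demo/sample.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(path)) {
            out.writeUTF("hello hdfs");
        }
        fs.close();
    }
}
```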
MapReduce: This is the processing engine, and it also follows a master-slave architecture.
- MR Master => Job Tracker: Accepts incoming jobs, identifies the available resources across the cluster, divides each job into tasks, and submits them to the cluster
- MR Slave => Task Tracker: Runs the tasks and coordinates with its master.
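As an illustration of how jobs meet this architecture, here is a minimal word-count sketch using the Hadoop 1.x "old" API (an assumed example; the input/output paths are hypothetical). JobClient.runJob() submits the job to the Job Tracker, which splits it into map and reduce tasks and hands them to Task Trackers.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
    // Runs in a Mapper slot on a Task Tracker: emits (word, 1) pairs.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // Runs in a Reducer slot: sums the counts for each word.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setReducerClass(Reduce.class);
        FileInputFormat.setInputPaths(conf, new Path("/user/demo/input"));   // hypothetical
        FileOutputFormat.setOutputPath(conf, new Path("/user/demo/output")); // hypothetical
        JobClient.runJob(conf); // submits the job to the JobTracker
    }
}
```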
Architecture
Drawbacks
- The JobTracker is designed such that it is tightly coupled with two distinct responsibilities, “Resource Management” and “MapReduce Task Execution”. Because of this, the cluster can only run Hadoop MapReduce and cannot be shared with other distributed computing technologies like Spark, Kafka, Storm, etc.
- The Name Node keeps the metadata for the entire namespace in memory, so it can manage at most about 4,000-5,000 data nodes; this caps cluster scalability at roughly 4k-5k nodes
- Each Task Tracker's slots are hard-partitioned into Mapper and Reducer slots, so idle Reducer slots cannot run Map tasks (and vice versa), wasting capacity; see the config sketch after this list
- The JobTracker was a single point of failure (SPOF): if it went down, all running jobs were lost
- Iterative applications (e.g., Machine Learning) run very slowly (up to 10x slower than on YARN)
- Lack of wire-compatible protocols between client and server in MapReduce applications (unlike Hive and Pig, which can support multiple versions on the same cluster)
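To illustrate the slot hard-partitioning mentioned above, here is a sketch of the relevant mapred-site.xml entries on a Task Tracker (the slot counts are example values, not recommendations). Each count is fixed at startup, so a node busy with only map work still leaves its reduce slots idle:

```xml
<configuration>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value> <!-- map-only slots; cannot run reduce tasks -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value> <!-- reduce-only slots; idle even when maps are queued -->
  </property>
</configuration>
```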
Hadoop 2.x was released to address these drawbacks.