Java8 Streams

Streams are a functional programming design pattern for processing the elements of a data structure sequentially or in parallel.


List<Order> orders = getOrders();
int qtySum = orders.stream()
                .filter(o -> o.getType().equals("ONLINE"))
                .mapToInt(o -> o.getQuantity())
                .sum();

The code above can be explained as follows:

  • Create a stream from source java collection
  • Add a filter operation to the stream “intermediate operations pipeline”
  • Add a map operation to the stream “intermediate operations pipeline”
  • Add a terminal operation that kicks off the stream processing

A Stream has

  • Source: that the stream can pull objects from
  • Pipeline: of operations that can execute on the elements of the stream
  • Terminal: operation that pulls values down the stream

Streams are lazily evaluated and the stream lifecycle is

  • Creation: from source
  • Configuration: from collection of pipeline operations
  • Execution: (terminal operation is invoked)
  • Cleanup
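
The lazy lifecycle above can be made visible with a small sketch. The class name LazyStreamDemo and the log messages are made up for illustration; the point is only that no filter or map step runs until the terminal operation is invoked.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

class LazyStreamDemo {
    // Records each pipeline event so the order of execution is visible.
    static List<String> trace() {
        List<String> log = new ArrayList<>();
        // Creation + configuration: only describes the pipeline, runs nothing.
        Stream<Integer> pipeline = Arrays.asList(1, 2, 3, 4).stream()
                .filter(x -> { log.add("filter " + x); return x % 2 == 0; })
                .map(x -> { log.add("map " + x); return x * 10; });
        log.add("configured");
        // Execution: the terminal operation pulls elements through one by one.
        pipeline.forEach(x -> log.add("got " + x));
        return log;
    }

    public static void main(String[] args) {
        trace().forEach(System.out::println);
    }
}
```

The first entry in the log is "configured", followed by the interleaved filter/map/got entries for each element.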

A few Java stream sources

// Number Stream
LongStream.range(0, 5).forEach(System.out::println);

// Collection Streams
List<String> cities = Arrays.asList("nyc", "edison", "livingston");
Stream<String> cityStream = cities.stream();

// Character Stream
long cnt = "ABC".chars().count();

// File Streams
Path filePath = Paths.get("/user/ntallapa/test");
Stream<String> lines = Files.lines(filePath); // throws IOException

Stream Terminal Operations

  • Reduction Terminal Operations: produce a single result
    • reduce(f(x)), sum(), min(f(x)), max(f(x)), etc
  • Mutable Reduction Terminal Operations: return multiple results in a container data structure
    • List<Integer> integers = Arrays.asList(1,2,3,3);
    • Set<Integer> integersSet = integers.stream().collect(Collectors.toSet()); // returns 1,2,3
  • Search Terminal Operations: return a result as soon as a match is found
    • findFirst(), findAny(), anyMatch(f(x))
  • Generic Terminal Operations: do any kind of processing on every element
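
As a rough illustration of the four kinds of terminal operation listed above, here is a minimal sketch. The class TerminalOpsDemo and its method names are hypothetical; the stream calls themselves are standard java.util.stream API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.stream.Collectors;

class TerminalOpsDemo {
    static final List<Integer> INTEGERS = Arrays.asList(1, 2, 3, 3);

    // Reduction: folds all elements into one value.
    static int total() {
        return INTEGERS.stream().reduce(0, Integer::sum);
    }

    // Mutable reduction: accumulates elements into a container.
    static Set<Integer> unique() {
        return INTEGERS.stream().collect(Collectors.toSet());
    }

    // Search: may stop as soon as a matching element is seen.
    static Optional<Integer> firstAbove(int threshold) {
        return INTEGERS.stream().filter(x -> x > threshold).findFirst();
    }

    public static void main(String[] args) {
        System.out.println(total());
        System.out.println(unique());
        System.out.println(firstAbove(1));
        // Generic: runs an arbitrary action on every element.
        INTEGERS.stream().forEach(System.out::println);
    }
}
```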

Stream Pipeline Rule: Since streams are meant to process elements sequentially or in parallel, the stream source must not be modified while the pipeline is executing.

Intermediate pipeline operations can be of two types

  • Stateless: filter(f(x)), map(f(x))
  • Stateful: distinct(), limit(n), skip(n)


Java8 Lambdas

Lambdas enable functional programming in Java. The concept of a lambda is available in many languages, but the beauty of it in Java is its BACKWARD COMPATIBILITY. In this post we will discuss the areas below

  • Concept
  • Syntax
  • Functional Interfaces
  • Variable Capture
  • Method References
  • Default Methods


A lambda can be defined as

  • A way of defining anonymous functions
  • It can be assigned to variables
  • It can be passed to functions
  • It can be returned from functions

What are lambdas good for?

  • They are the base for the functional programming model
  • They make parallel programming easier: if we want to keep 100s of cores busy, it is easier to do with functional programming than with object-oriented programming
  • Write compact code (Hadoop1 in Java: 200k lines of code vs Spark1 in Scala: 25k lines of code)
  • Richer Data Structure Collections
  • Develop Cleaner APIs


List<Integer> integers = Arrays.asList(1,2,3);
integers.forEach(x -> System.out.println(x));
integers.forEach((x) -> {
    x = x+10;
    System.out.println(x);
});
integers.forEach((Integer x) -> {
    x = x+10;
    System.out.println(x);
});

We can state the parameter types explicitly, but the Java 8 compiler is able to infer them (Type Inference).

According to the talk "A Peek Under the Hood" by Brian Goetz, a lambda expression in Java is compiled into a generated method, and it is that generated method that gets called.

Functional Interfaces (FI)

An interface with exactly one abstract method (default methods do not count) is called a Functional Interface.

Prior to Java 8 we used to write a function's signature, open a curly brace, write the body of the function and close it. In Java 8 the signature and the body can be written separately:

// define FI
    // enforces that the interface is FI (it fails compilation if below interface has more than one abstract method)
    // It's optional and can be applied only to interfaces
@FunctionalInterface
public interface Consumer<T> {
    void accept(T t);
}

// give the definition
Consumer<Integer> consumer = x -> System.out.println(x);

// use it
List<Integer> integers = Arrays.asList(1,2,3);
integers.forEach(consumer);

A few things to notice here:

  • Here we are separating the body of the function (the lambda assigned to consumer) from its signature (the accept declaration in the interface).
  • The method generated from the lambda expression must have the same signature as the FI method (here the lambda takes one arg ‘x’, throws no Exception and returns nothing).
  • In Java 8, the type of a lambda expression is the FI that the lambda is assigned to (here Consumer<Integer>).

Variable Capture (VC)

Lambdas can interact with variables (local, instance and static) defined outside the body of lambda (aka VC).

List<Integer> integers = Arrays.asList(1,2,3);
int vc=10;
integers.forEach(x -> System.out.println(x+vc));

Note: Local variables used inside a lambda must be final or effectively final; they cannot be reassigned, either inside the lambda or after it captures them.
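
A minimal sketch of what this restriction means in practice. VariableCaptureDemo and sumWithOffset are hypothetical names; using an AtomicInteger as a mutable holder is just one common workaround, not the only one.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class VariableCaptureDemo {
    static int sumWithOffset(List<Integer> values, int offset) {
        // 'offset' is never reassigned, so it is effectively final and can be captured.
        // 'total' is also effectively final: the reference never changes,
        // only the state of the object it points to.
        AtomicInteger total = new AtomicInteger();
        values.forEach(x -> total.addAndGet(x + offset));
        // offset++;  // uncommenting this makes the lambda above fail to compile
        return total.get();
    }

    public static void main(String[] args) {
        System.out.println(sumWithOffset(Arrays.asList(1, 2, 3), 10));
    }
}
```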

Lambda vs Anonymous Inner Classes

  • Inner classes can have state in the form of instance variables whereas lambdas cannot.
  • Inner classes can have multiple methods whereas lambdas cannot.
  • ‘this’ points to the object instance of the anonymous inner class, whereas inside a lambda it points to the enclosing object.
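
The difference in the meaning of ‘this’ can be seen in a short sketch. ThisDemo and its field values are made up for illustration.

```java
import java.util.function.Supplier;

class ThisDemo {
    final String name = "outer";

    // Inside a lambda, 'this' is the enclosing ThisDemo instance.
    String fromLambda() {
        Supplier<String> s = () -> this.name;
        return s.get();
    }

    // Inside an anonymous inner class, 'this' is the anonymous object itself,
    // which can also carry its own state.
    String fromAnonymousClass() {
        Supplier<String> s = new Supplier<String>() {
            final String name = "inner";

            @Override
            public String get() {
                return this.name;
            }
        };
        return s.get();
    }

    public static void main(String[] args) {
        ThisDemo demo = new ThisDemo();
        System.out.println(demo.fromLambda());
        System.out.println(demo.fromAnonymousClass());
    }
}
```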

java.util.function.* contains 43 commonly used functional interfaces, including

  • Consumer<T>: a function that takes an argument of type T and returns void
  • Supplier<T>: a function that takes no argument and returns a result of type T
  • Predicate<T>: a function that takes an argument of type T and returns boolean
  • Function<T, R>: a function that takes an argument of type T and returns a result of type R
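
As a quick sketch of how these four interfaces fit together (the class name CoreInterfacesDemo and the sample values are hypothetical):

```java
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

class CoreInterfacesDemo {
    static String run() {
        StringBuilder out = new StringBuilder();

        Supplier<String> supplier = () -> "spark";           // () -> T
        Predicate<String> isShort = s -> s.length() < 6;     // T -> boolean
        Function<String, Integer> length = s -> s.length();  // T -> R
        Consumer<Object> sink = v -> out.append(v).append(' '); // T -> void

        String word = supplier.get();
        sink.accept(word);
        sink.accept(isShort.test(word));
        sink.accept(length.apply(word));
        return out.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```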

Method References (MRs)

Since a lambda is a way to define an anonymous function, there is a good chance that the function we want to use already exists. In these cases, MRs can be used to pass an existing function in a place where a lambda is expected

public interface Consumer<T> {
    void accept(T t);
}

// static, so that Example::doSomething matches Consumer<Integer>
public static void doSomething(Integer x) {
    // existing logic
}

Consumer<Integer> cons1 = x -> doSomething(x);
// Reuse with MR
Consumer<Integer> cons2 = Example::doSomething;

Note: The signature of the referenced method must match the signature of FI method.

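In this example the target type is Consumer<Integer>, so the referenced method must accept a single argument, and any value it returns is discarded. More generally, an MR can be used with any FI whose method signature the referenced method matches.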

Referencing a Constructor: Constructor method references are quite handy while working with Streams

// Create a new function which has a method that takes 'String' as parameter (LHS of arrow), returns 'Integer' (RHS of arrow as body of method)
Function<String, Integer> mapper1 = x -> new Integer(x);

// Refer to the constructor instead
Function<String, Integer> mapper2 = Integer::new;

References to a specific object instance method:

Consumer<Integer> cons1 = x -> System.out.println(x);
// can also be written as below: this invokes the println() method on the System.out object, passing the consumed value as the param
Consumer<Integer> cons2 = System.out::println;
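
The kinds of method reference discussed above can be put side by side in one sketch. MethodRefDemo and the sample strings are made up for illustration; the referenced methods are standard JDK ones.

```java
import java.util.function.Function;

class MethodRefDemo {
    static String run() {
        // Reference to a static method.
        Function<String, Integer> parse = Integer::parseInt;

        // Reference to a method of one specific object instance.
        String prefix = "id-";
        Function<String, String> addPrefix = prefix::concat;

        // Reference to an instance method of whichever object is passed in.
        Function<String, String> upper = String::toUpperCase;

        // Reference to a constructor.
        Function<String, StringBuilder> build = StringBuilder::new;

        return addPrefix.apply(upper.apply("a"))
                + parse.apply("42")
                + build.apply("!");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```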

Default Methods

This is a very important feature because it addresses the Interface Evolution Problem: how a published interface (like List, Iterable, etc.) can be evolved without breaking existing implementations (backward compatibility).

Default Method: A default method on a Java interface has an implementation provided in the interface and is inherited by the classes that implement it.

public interface Iterable<T> {
    Iterator<T> iterator();
    default void forEach(Consumer<? super T> action) {
        for (T t : this) {
            action.accept(t);
        }
    }
}
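
To see how a default method lets an interface evolve, consider this sketch. Shape, Square and Circle are hypothetical types invented for the example; assume describe() was added to Shape after Square had already been written.

```java
class DefaultMethodDemo {
    // A "published" interface; describe() is assumed to be a later addition.
    interface Shape {
        double area();

        default String describe() {
            return "shape with area " + area();
        }
    }

    // Written before describe() existed: still compiles and inherits the default.
    static class Square implements Shape {
        private final double side;

        Square(double side) { this.side = side; }

        @Override
        public double area() { return side * side; }
    }

    // A newer implementation is free to override the default.
    static class Circle implements Shape {
        private final double radius;

        Circle(double radius) { this.radius = radius; }

        @Override
        public double area() { return Math.PI * radius * radius; }

        @Override
        public String describe() { return "circle of radius " + radius; }
    }

    public static void main(String[] args) {
        System.out.println(new Square(2).describe());
        System.out.println(new Circle(1).describe());
    }
}
```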


Spark Architecture

Spark can be launched in different modes, and each of these modes has a different architecture.

  1. Local: Single JVM
  2. Standalone: Spark's own built-in cluster manager (Static Allocation)
  3. YARN: from Hadoop (Dynamic Allocation)
  4. Mesos: Apache Mesos, a general-purpose cluster manager (Dynamic Allocation)

Of these, modes 2, 3 and 4 are distributed architectures. The Standalone and Mesos architectures are similar to that of YARN.

In YARN, there are two different modes

  • Spark YARN Client Mode Architecture: This is for Spark in the Scala/Python shell (aka Interactive Mode). Here the Spark Driver runs on the Edge Node, and if the Driver Program is killed or the edge node crashes, the application gets killed.
  • Spark YARN Cluster Mode Architecture: This is when the user submits a Spark application using spark-submit. Here the Spark Driver is started inside the Application Master.

Unlike the Hadoop Driver Program, the Spark Driver is also responsible for

  • DAG Scheduler and Task Scheduler: Once Executors are launched inside Containers, they communicate directly with these schedulers. In Spark applications they play a far more important role than the YARN Scheduler.
  • Spark UI: The UI showing the application DAG, Jobs and Stages is served by the Spark Driver.

Spark Terminology – Nodes, Containers, Executors, Cores/Slots, Tasks, Partitions, Jobs, Stages

  • A Spark cluster can be formed with ‘n’ Nodes.
  • Each Node can have 1+ Containers. The number of containers is decided based on the min and max container memory limits in yarn-site.xml.
  • Each Container must have exactly 1 Executor JVM.
  • Each Executor can have 1+ Slots (aka Cores). A Spark application requires a minimum of 2 slots; the recommended range is 8-32. We can choose a maximum of 2-3x the actual physical cores on a node.
  • Tasks are run inside the Slots. A Task is a unit of work assigned to an Executor core/slot by the Task Scheduler.
  • A Partition is a block of data (like a block of an HDFS file). A Spark RDD is split into 1+ partitions. Each Partition requires one thread of computation (aka Task), and hence an RDD with ‘n’ partitions requires ‘n’ Tasks to perform any Transformation.
  • Jobs: A Spark Application is split into ‘n’ Jobs based on the number of Actions inside it. Basically, a Job is launched for every Action.
  • Stages: A Job is divided into ‘m’ Stages. A Stage is a group of operations that can be executed together; for example, map() and filter() can be put together into one stage. Each Stage is finally split into ‘n’ Tasks.
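
The relationship between executors, slots, partitions and tasks can be worked through with some back-of-the-envelope arithmetic. This is only a rough model under the assumption that every task takes about the same time; the class TaskWaves and its methods are hypothetical helpers, not part of Spark.

```java
class TaskWaves {
    // Total task slots available to the application.
    static int slots(int executors, int coresPerExecutor) {
        return executors * coresPerExecutor;
    }

    // One task per partition; tasks that do not fit into the available
    // slots have to wait for a later "wave" (ceiling division).
    static int waves(int partitions, int slots) {
        return (partitions + slots - 1) / slots;
    }

    public static void main(String[] args) {
        int slots = slots(8, 16);  // 8 executors with 16 cores each
        System.out.println(slots + " slots, "
                + waves(1000, slots) + " waves for 1000 partitions");
    }
}
```

With 8 executors of 16 cores each there are 128 slots, so a stage over 1000 partitions needs 8 waves of tasks under this model.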

Dynamic Allocation in Spark

Dynamic Allocation is a Spark feature that allows executors launched by the application to be added or removed dynamically to match the workload.

Unlike static allocation of resources (prior to 1.6.0), where Spark used to reserve a fixed amount of CPU and memory, with Dynamic Allocation the reservation is based purely on the workload.

Note: This is the main difference between Spark Standalone Architecture (static allocation) and Spark YARN/Mesos Architecture.

// flag to enable/disable the DA feature
spark.dynamicAllocation.enabled: true/false

// Lower bound on the number of executors; the application also starts with
// this many unless spark.dynamicAllocation.initialExecutors is set
spark.dynamicAllocation.minExecutors: m

// Application can increase to this many executors at max
spark.dynamicAllocation.maxExecutors: n

// The FIRST time tasks have been backlogged for this long, the number
// of executors is increased
spark.dynamicAllocation.schedulerBacklogTimeout: x secs

// From then on, whenever the backlog persists for this long, it increases
// the number of executors until maxExecutors is hit
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout: y secs

// It releases the executor when it sees no Task is scheduled
// on the Executor for this time
spark.dynamicAllocation.executorIdleTimeout: z secs
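
To get a feel for how the executor count grows between minExecutors and maxExecutors, here is a simplified model. It assumes the backlog persists in every round and that the number of executors requested doubles from one round to the next (1, 2, 4, 8, ...), as the Spark documentation describes; real Spark also takes the number of pending tasks into account. RampUpModel is a hypothetical helper, not Spark code.

```java
import java.util.ArrayList;
import java.util.List;

class RampUpModel {
    // Executor totals after each backlog round, capped at maxExecutors.
    static List<Integer> totals(int minExecutors, int maxExecutors, int rounds) {
        List<Integer> totals = new ArrayList<>();
        int current = minExecutors;
        int toAdd = 1;
        for (int i = 0; i < rounds; i++) {
            current = Math.min(maxExecutors, current + toAdd);
            totals.add(current);
            toAdd *= 2;  // assumed exponential growth of each request
        }
        return totals;
    }

    public static void main(String[] args) {
        System.out.println(totals(2, 20, 5));
    }
}
```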

There were a few issues with dynamic allocation in Streaming Applications because

  • Executors may never be idle, as they run a batch every N secs
  • A Receiver runs on a Slot/Core inside an Executor and never finishes, hence idleTimeout will never be hit

// For streaming applications, disable above switch and enable below one

Tuning Spark Applications

Tuning the performance of Spark Applications can be done at various levels

  • OS Level
  • JVM Level
  • YARN Level
  • Spark Level

OS Level
In yarn-site.xml, we can allocate physical and virtual memory for a container initialized on the node.

JVM Level
We can look at the performance of the JVM Garbage Collection and then fine-tune GC parameters
./bin/spark-submit --name "My app" --master yarn --conf spark.eventLog.enabled=false --conf "spark.executor.extraJavaOptions=-XX:OldSize=100M -XX:MaxNewSize=100M -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" myApp.jar

YARN Level
While submitting the job, we can control

  • Number of executors (an Executor runs inside a container, 1 Executor per Container)
  • Memory for each executor
  • Number of cores for each executor (this value can be raised to a maximum of 2x the actual cores, but be aware that it also raises the memory requirement)
  • Memory Overhead

./bin/spark-submit --name "My app" --master yarn --num-executors 8 --executor-memory 4G --executor-cores 16 --conf "spark.yarn.executor.memoryOverhead=1024M" myApp.jar
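
The memory that such a submission asks YARN for can be estimated with simple arithmetic: each executor container needs the executor memory plus the memory overhead. The sketch below plugs in the numbers from the command above. ExecutorContainerMath is a hypothetical helper; it ignores the Application Master container and any rounding that the YARN scheduler may apply to container requests.

```java
class ExecutorContainerMath {
    // Memory requested from YARN for one executor container, in MB.
    static long containerMb(long executorMemoryMb, long overheadMb) {
        return executorMemoryMb + overheadMb;
    }

    // Memory requested for all executor containers together, in MB.
    static long totalMb(int numExecutors, long executorMemoryMb, long overheadMb) {
        return numExecutors * containerMb(executorMemoryMb, overheadMb);
    }

    public static void main(String[] args) {
        // --num-executors 8 --executor-memory 4G, memoryOverhead=1024M
        System.out.println(containerMb(4096, 1024) + " MB per container");
        System.out.println(totalMb(8, 4096, 1024) + " MB in total");
    }
}
```

Here each container comes to 5120 MB and the eight executors together to 40960 MB, which has to fit within what the NodeManagers can offer.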

Spark Level
Prior to Spark 1.6.0, executor memory (spark.executor.memory) was split into two different pools

  • Storage Memory: where it caches RDDs
  • Execution Memory: where it holds execution objects

From 1.6.0 onwards, they are combined into a unified pool with no hard split between the two. The share of memory given to each is decided dynamically at run time.
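
A rough way to size the unified pool is sketched below. It assumes the formula from the Spark tuning guide, in which the pool is a fraction (spark.memory.fraction) of the heap minus about 300 MB of reserved memory, and spark.memory.storageFraction marks the part of the pool where cached blocks are protected from eviction. The default values of these fractions differ between Spark versions, so they are passed in as parameters. UnifiedMemoryMath is a hypothetical helper.

```java
class UnifiedMemoryMath {
    // Assumption: memory Spark sets aside before sizing the unified pool.
    static final long RESERVED_MB = 300;

    // Size of the unified (execution + storage) pool, in MB.
    static double unifiedMb(long heapMb, double memoryFraction) {
        return (heapMb - RESERVED_MB) * memoryFraction;
    }

    // Part of the pool in which cached blocks are immune to eviction, in MB.
    static double protectedStorageMb(long heapMb, double memoryFraction, double storageFraction) {
        return unifiedMb(heapMb, memoryFraction) * storageFraction;
    }

    public static void main(String[] args) {
        // Example values only: 4 GB heap, fraction 0.6, storage fraction 0.5
        System.out.println(unifiedMb(4096, 0.6));
        System.out.println(protectedStorageMb(4096, 0.6, 0.5));
    }
}
```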


Based on all the above factors, we should tune the memory settings according to

  • Objectives (EFFICIENCY vs RELIABILITY) and
  • Workloads (whether it's BATCH or STREAMING)


Some TIPS:

    • The cost of garbage collection is directly proportional to the number of objects, so try to reduce the number of objects (for example, use an array of int instead of a List)
    • For Batch Applications use the default GC (ParallelGC) and for Streaming Applications use ConcMarkSweepGC
// BATCH Apps: default GC
-XX:+UseParallelGC -XX:ParallelGCThreads=<#>

// Streaming Apps
-XX:+UseConcMarkSweepGC -XX:ParallelCMSThreads=<#>
// G1 GC is available from Java 7 and is considered a good replacement for CMS
    • KRYO Serialization: This can be up to 10x faster than Java Serialization. In general, a 1G file on disk takes 2-3G once stored in memory, which is the cost of Java Serialization.
conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
// We need to register our custom classes with the Kryo serializer
    • TACHYON: Use Tachyon for off-heap storage. The advantage is that even if the Executor JVM crashes, the data stays in the OFF_HEAP storage.


Hadoop and Spark Installation on Raspberry Pi-3 Cluster – Part-4

In this part we will see the configuration of a Slave Node. Here are the steps

  1. Mount second Raspberry Pi-3 device on the nylon standoffs (on top of Master Node)
  2. Load the image from part2 into a sd_card
  3. Insert the sd_card into one Raspberry Pi-3 (RPI) device
  4. Connect RPI to the keyboard via USB port
  5. Connect to monitor via HDMI cable
  6. Connect to Ethernet switch via ethernet port
  7. Connect to USB switch via micro usb slot
  8. Hadoop related changes on Slave node

Steps 1-7 are all physical, so I am skipping them here.

Once the device is powered on, login via external keyboard and monitor and change the hostname from rpi3-0 (which comes from base image) to rpi3-1

Step #8: Hadoop Related Configuration

  • Setup HDFS
sudo mkdir -p /hdfs/tmp  
sudo chown hduser:hadoop /hdfs/tmp  
chmod 750 /hdfs/tmp  
hdfs namenode -format
  • Update /etc/hosts file	localhost	rpi3-0	rpi3-1	rpi3-2	rpi3-3
  • Repeat the above steps for each slave node. For every slave node added, ensure that
    • ssh is set up from the master node to the slave node
    • the slaves file on the master is updated
    • the /etc/hosts file on both master and slave is updated

Start the hadoop/spark cluster

    • Start dfs and yarn services
cd /opt/hadoop-2.7.3/sbin 
    • On master node “jps” should show following
hduser@rpi3-0:~ $ jps
20421 ResourceManager
20526 NodeManager
19947 NameNode
20219 SecondaryNameNode
24555 Jps
20050 DataNode
    • On Slave Node “jps” should show following processes
hduser@rpi3-3:/opt/hadoop-2.7.3/logs $ jps
2294 NodeManager
2159 DataNode
2411 Jps
    • To verify the successful installation, run a hadoop and spark job in cluster mode and you will see the Application Master tracking URL.
    • Run spark Job
      • spark-submit --class com.learning.spark.SparkWordCount --master yarn --executor-memory 512m ~/word_count-0.0.1-SNAPSHOT.jar /ntallapa/word_count/text 2
    • Run example mapreduce job
      • hadoop jar /opt/hadoop-2.7.3/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.3.jar wordcount /ntallapa/word_count/text /ntallapa/word_count/output

Hadoop and Spark Installation on Raspberry Pi-3 Cluster – Part-3

In this part we will see the configuration of Master Node. Here are the steps

  1. Mount first Raspberry Pi-3 device on the nylon standoffs
  2. Load the image from part2 into a sd_card
  3. Insert the sd_card into one Raspberry Pi-3 (RPI) device
  4. Connect RPI to the keyboard via USB port
  5. Connect to monitor via HDMI cable
  6. Connect to Ethernet switch via ethernet port
  7. Connect to USB switch via micro usb slot
  8. DHCPD Configuration
  9. NAT Configuration
  10. DHCPD Verification
  11. Hadoop related changes on Master node

Steps 1-7 are all physical, so I am skipping them here.

Step #8: dhcpd configuration

This node will serve as the DHCP server, the NAT server and the overall controller of the cluster

    • Goto “sudo raspi-config” -> Advanced Options -> HostName -> “rpi3-0” (make sure its rpi3-0 as its our first node)
    • sudo apt-get install isc-dhcp-server
    • sudo nano /etc/dhcp/dhcpd.conf
    • Define subnet which will be the network that all the RPI-3 nodes connect to.
subnet netmask {
        option broadcast-address;
        option routers;
        max-lease-time 7200;
        option domain-name "rpi3";
        option domain-name-servers;
  • Adjust server configuration
  • sudo nano /etc/default/isc-dhcp-server
  • Tell which interface to use at last line. (“eth0”)
    # On what interfaces should the DHCP server (dhcpd) serve DHCP requests?
    #       Separate multiple interfaces with spaces, e.g. "eth0 eth1".
  • Configure the interfaces file of rpi3-0 so that it can act as the DHCP server and NAT server for the rest of the Pi cluster
    • sudo nano /etc/network/interfaces
    • Make the below changes and reboot the PI
      auto eth0
      iface eth0 inet static

Step #9: NAT configuration

    • Now we will configure IP tables to provide Network Address Translation services on our master node rpi3-0
    • sudo nano /etc/sysctl.conf
    • uncomment “net.ipv4.ip_forward=1”
    • sudo sh -c "echo 1 > /proc/sys/net/ipv4/ip_forward"
    • Now that it has been activated, run the 3 commands below to configure IP Tables correctly
sudo iptables -t nat -A POSTROUTING -o wlan0 -j MASQUERADE
sudo iptables -A FORWARD -i wlan0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT
sudo iptables -A FORWARD -i eth0 -o wlan0 -j ACCEPT
  • Make sure we have this setup correct
  • sudo iptables -t nat -S
  • sudo iptables -S
  • In order to avoid losing this config upon reboot, do
    • sudo sh -c "iptables-save > /etc/iptables.ipv4.nat" (save iptables configuration to a file)
    • sudo nano /etc/network/interfaces (add below line to interfaces file)
      post-up iptables-restore < /etc/iptables.ipv4.nat
auto eth0
iface eth0 inet static
	post-up iptables-restore  < /etc/iptables.ipv4.nat

Step #10: Verify dhcpd

  • To see the address that has been assigned to the new PI
  • cat /var/lib/dhcp/dhcpd.leases
  • This would also give us the MAC address of the newly added node
  • It is always handy to have the DHCP server assign a fixed address to each node in the cluster so that it's easy to remember a node by its IP address. For instance, the next node in the cluster is rpi3-1 and it would be helpful to give it a fixed IP. To do this, modify the DHCP server config file
      • sudo nano /etc/dhcp/dhcpd.conf
    host rpi3-1 {
        hardware ethernet MAC_ADDRESS;
    • Eventually this file will have an entry for every node in the cluster
    • Now we can ssh into the new node via IP Address

Step #11: Hadoop Related Configuration

    • Setup SSH
su hduser 
cd ~  
mkdir .ssh  
cat ~/.ssh/ >> ~/.ssh/authorized_keys  
chmod 0750 ~/.ssh/authorized_keys  
// Lets say we added new slave node rpi3-1, then copy the ssh id to the slave node which will enable passwordless login
ssh-copy-id hduser@rpi3-1 (Repeat for each slave node)  
ssh hduser@rpi3-1
    • Setup HDFS
sudo mkdir -p /hdfs/tmp  
sudo chown hduser:hadoop /hdfs/tmp  
chmod 750 /hdfs/tmp  
hdfs namenode -format
    • Edit master and slave config files
        • sudo nano /opt/hadoop-2.7.3/etc/hadoop/masters
        • sudo nano /opt/hadoop-2.7.3/etc/hadoop/slaves
    • Update /etc/hosts file	localhost	rpi3-0	rpi3-1	rpi3-2	rpi3-3


Hadoop and Spark Installation on Raspberry Pi-3 Cluster – Part-2

Cluster Architecture

  • Master Node will be connected to home router via WiFi
  • Master Node to Slave Node connection will be established through Ethernet switch via Ethernet Cables
  • From my Mac (which will be on my home network), I will be able to SSH to the master node and then control the whole cluster

For Spark/Hadoop Cluster, there are few more TODOs that we need to take care of

  • Update /etc/hosts on every node (master and slave) with hostname and ip_address of every other node
  • Use same super user and group to do all installations on every node
  • Enable SSH on every node and establish passwordless SSH communication from Master to every Slave node.
  • Install zip/unzip and java on every node.

In this part we will see the SINGLE NODE SETUP. I will be using a Mac to perform all these steps.

Step #1: Load Raspbian Pi image onto the MicroSD Card

  • Download SD Formatter from
  • Format the disk (follow steps from google)
  • Download the Raspbian_Lite OS from and follow these instructions
    • diskutil list (this will list the newly added disk in my case it was /dev/disk4)
    • diskutil unmountDisk /dev/disk4
    • sudo dd bs=1m if=~/Downloads/2017-01-11-raspbian-jessie-lite.img of=/dev/rdisk4

Step #2: Configure the PI, connect to WiFi and upgrade all latest patches.

  • sudo raspi-config
    • Change pwd
    • Localization options (change locale to us_en, timezone to US-Eastern and wifi country to US)
    • Advanced Options – Memory Split from 64 to 16 (because the Raspbian Lite OS has a very small footprint and no UI)
    • Interfacing Options – Enable SSH
  • sudo vi /etc/network/interfaces
    • Ethernet eth0 is the wired connection which we will be using
    • wlan0 is the wifi adapter on the board: Configure wifi so that we can use wifi to get updates on the Raspbian Lite OS
    • change manual to dhcp => this tells the interface that we get the settings via dhcp AND add SSID and PWD of home router
      • change the line “iface wlan0 inet manual” to
iface wlan0 inet dhcp
    wpa-ssid "SSID/NETWORK_NAME"
    wpa-psk "PASSWORD"
  •  Reboot the interface
    • sudo ifdown wlan0
    • sudo ifup wlan0
  • Now Raspberry PI is connected to WIFI
  • Update and Upgrade the Raspberry PI
    • sudo apt-get update
    • sudo apt-get upgrade

Step #3: Create separate superuser and group

  • We will use this user and group for all our core installations and configuration changes on all nodes
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
sudo adduser hduser sudo
su hduser

Step #4: Download and Install REQUIRED SOFTWARE

  • Download and install zip utility
    • sudo apt-get install zip unzip
  • Download and install java
    • sudo apt-get install oracle-java7-jdk
  • Download, install and configure Spark
    sudo tar -xvzf spark-2.1.0-bin-hadoop2.7.tgz -C /opt/
    sudo chown -R hduser /opt/spark-2.1.0-bin-hadoop2.7
    source ~/.bashrc
    cp $SPARK_HOME/conf/ $SPARK_HOME/conf/  
  • Download, install and configure Hadoop
    sudo mkdir /opt
    cd ~
    sudo tar -xvzf hadoop-2.7.3.tar.gz -C /opt/
    cd /opt
    sudo chown -R hduser:hadoop hadoop-2.7.3/
    sudo nano /opt/hadoop-2.7.3/etc/hadoop/
    export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle-arm-vfp-hflt/jre
  • sudo nano /opt/hadoop-2.7.3/etc/hadoop/hdfs-site.xml
  • sudo nano /opt/hadoop-2.7.3/etc/hadoop/core-site.xml
  • sudo nano /opt/hadoop-2.7.3/etc/hadoop/mapred-site.xml
  • sudo nano /opt/hadoop-2.7.3/etc/hadoop/yarn-site.xml
  • Add environment variables to the bashrc file
    sudo nano ~/.bashrc

    export JAVA_HOME=/usr/lib/jvm/jdk-7-oracle-arm-vfp-hflt/jre
    export HADOOP_HOME=/opt/hadoop-2.7.3
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    export SPARK_HOME=/opt/spark-2.1.0-bin-hadoop2.7 
    export PATH=$PATH:$SPARK_HOME/bin  

Step #5: Create image from SD Card and clone it to all other PIs on the cluster

  • Switch off the Raspberry Pi-3 and take the sd-card out and plug it into mac
  • Run the below commands
    • diskutil list
    • sudo dd if=/dev/disk4 of=~/Downloads/raspberrypi_base_with_hdp.dmg
  • On other nodes
    • diskutil unmountDisk /dev/disk4
    • sudo dd bs=1m if=~/Downloads/raspberrypi_base_with_hdp.dmg of=/dev/rdisk4


Hadoop and Spark Installation on Raspberry Pi-3 Cluster – Part-1

In this part we will go through the hardware configuration of the Raspberry Pi-3 4 Node Cluster setup.

Hardware Configuration



  1. Raspberry Pi 3 Model B 2016 Single Board Computer – 4 = $120 (bought from Microcentre)
  2. Samsung EVO 32GB Class 10 Micro SDHC Card with Adapter (MB-MP32DA/AM) – 4 ($12 each) = $48
  3. Cable Matters 8-Pack, Cat5E Snagless Ethernet Patch Cable in Blue 3 Feet – 4 = $10
  4. TRENDnet 8-Port Unmanaged 10/100 Mbps GREENnet Ethernet Desktop Plastic Housing Switch, TE100-S8 – 1 = $15
  5. Anker [4-Pack] PowerLine Micro USB (1ft) – 1 = $10
  6. USB2TYPEM 3 Feet USB to Type M Barrel 5V DC Power Cable – 1 = $4
  7. 100 Pcs M2.5 x 10mm + 6mm PC Board Hexagonal Hex Threaded Spacer – 1 = $14
  8. ORICO DUB-10P-WH 96W 10 Ports Family-Sized Smart Super USB Charger with 10 × 5V 2.4A Port – White – 1 = $20
  9. Machine Screws (bought from Home Depot) – 2 = under $2
  10. Thin card board from Home Depot – $4

Total Budget under $250

Except for the Machine Screws, Card Board and Raspberry Pis, all of the above were bought on Amazon and NewEgg.

Additionally you can buy heat sinks and fans based on the load and extent you are going to use this cluster.

My personal cluster setup Steps:

  • Drill 4 holes on to the card board as per the holes available on the Raspberry Pi
  • Stack two nylon standoffs together to form 20 such sets
  • Insert screws from below the cardboard and 4 stacked nylon standoffs from above the cardboard
  • Keep stacking Raspberry Pis (RPIs) one by one and tighten the standoffs.
  • Connect 4 micro USB cables from RPIs to the USB switch
  • Connect 4 ethernet cables from RPIs to the Ethernet switch
  • Connect Ethernet switch to the USB switch via USB2TYPEM cable.
  • Insert 32G samsung sd card into each of the RPIs
  • Switch on the USB switch and the entire cluster will be powered.



Important YARN configuration properties

To configure YARN and MapReduce on top of YARN, we should look into a couple of configuration files

  • yarn-site.xml
  • mapred-site.xml


  • yarn.scheduler.minimum-allocation-mb: The minimum allocation for every container request at the RM
  • yarn.scheduler.maximum-allocation-mb: The maximum allocation for every container request at the RM
  • yarn.scheduler.minimum-allocation-vcores: The minimum allocation for every container request at the RM, in terms of virtual CPU cores.
  • yarn.scheduler.maximum-allocation-vcores: The maximum allocation for every container request at the RM, in terms of virtual CPU cores.
  • yarn.nodemanager.resource.memory-mb: Amount of physical memory, that can be allocated for containers. Total RAM on a given node that can be utilized by the node manager to create the containers
  • yarn.nodemanager.resource.cpu-vcores: Number of vcores that can be allocated for containers. This is used by the RM scheduler when allocating resources for containers. This is not used to limit the number of physical cores used by YARN containers.
  • yarn.nodemanager.pmem-check-enabled: Whether physical memory limits will be enforced for containers.
  • yarn.nodemanager.vmem-check-enabled: Whether virtual memory limits will be enforced for containers.
  • yarn.nodemanager.vmem-pmem-ratio: Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.

Virtual Memory: physical + paged memory
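
The effect of yarn.nodemanager.vmem-pmem-ratio can be illustrated with a small calculation. The ratio of 2.1 used here is, as far as I know, the stock default in yarn-default.xml, but treat it as an assumption and check your own configuration. VmemLimitMath is a hypothetical helper.

```java
class VmemLimitMath {
    // Virtual memory a container may use before the vmem check fails, in MB.
    static double vmemLimitMb(long containerPmemMb, double vmemPmemRatio) {
        return containerPmemMb * vmemPmemRatio;
    }

    public static void main(String[] args) {
        // A 1024 MB container with an assumed ratio of 2.1
        System.out.println(vmemLimitMb(1024, 2.1) + " MB of virtual memory allowed");
    }
}
```

With these numbers a 1024 MB container may use roughly 2150 MB of virtual memory before it is killed, provided yarn.nodemanager.vmem-check-enabled is true.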



  • yarn
  • The amount of memory to request from the YARN scheduler for each map task. This is total physical RAM of a Map Task Container.
  • The JVM Heap Size (0.8 times above RAM), so that JVM memory is within the container physical memory
  • mapreduce.reduce.memory.mb: The amount of memory to request from the YARN scheduler for each reduce task.
  • The amount of memory the MR AppMaster needs.
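
The 0.8 rule of thumb for the JVM heap mentioned above can be turned into a quick calculation. This is a sketch only: HeapSizing is a hypothetical helper, and the -Xmx string is simply the usual JVM option format for a maximum heap size.

```java
class HeapSizing {
    // Heap sized at 80% of the container, leaving headroom for non-heap JVM memory.
    static long heapMb(long containerMb) {
        return containerMb * 8 / 10;
    }

    // JVM option string for that heap size.
    static String javaOpts(long containerMb) {
        return "-Xmx" + heapMb(containerMb) + "m";
    }

    public static void main(String[] args) {
        // A 2048 MB map or reduce task container
        System.out.println(javaOpts(2048));
    }
}
```

For a 2048 MB container this gives a heap of 1638 MB, so the JVM stays within the container's physical memory limit.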


