Java8 Streams

Streams are functional programming design pattern for processing of elements of a data structure sequentially or parallely.

Example:

 
List<Order> orders = getOrders();
int qtySum = orders.stream()
                .filter(o -> o.getType().equals("ONLINE"))
                .mapToInt(o -> o.getQuantity())
                .sum();

Above code can be explained as below:

  • Create a stream from source java collection
  • Add a filter operation to the stream “intermediate operations pipeline”
  • Add a map operation to the stream “intermediate operations pipeline”
  • Add a terminal operation that kicks off the stream processing

A Stream has

  • Source: that stream can pull objects from
  • Pipeline: of operations that can execute on the elements of the stream
  • Terminal: operation that pull values down the stream

Streams are lazily evaluated and the stream lifecycle is

  • Creation: from source
  • Configuration: from collection of pipeline operations
  • Execution: (terminal operation is invoked)
  • Cleanup

Few java stream sources

 
// Number Stream
LongStream.range(0, 5).forEach(System.out::println)

// Collection Streams
List<String> cities = Arrays.asList("nyc", "edison", "livingston");
cities.stream().forEach(System.out::println)

// Character Srream
int cnt = "ABC".chars().count()

// File Streams
Path filePath = new Path("/user/ntallapa/test");
Files.list(filePath).forEach(System.out::println);

Stream Terminal Operations

  • Reduction Terminal Operations: results in single result
    • reduce(f(x)), sum(f(x)), min(f(x)), max(f(x)), etc
  • Mutable Reduction Terminal Operations: returns multiple results in a container data structure
    • List integers = Arrays.asList(1,2,3,3);
    • Set integersSet = integers.stream().collect(Collectors.toSet()); // returns 1,2,3
  • Search Terminal Operations: returns result as soon as match is found
    • findFirst(), findAny(), anyMatch(f(x))
  • Generic Terminal Operations: do any kind of processing on every element

Stream Pipeline Rule: Since streams are meant to process elements sequentially/parallely, stream source is not allowed to modify

Intermediate Pipeline operations can be of two types

  • Stateless: filter(f(x)), summaryStatistics()
  • Statefull: distinct(), limit(n), skip(n)

References:
https://www.youtube.com/watch?v=MLksirK9nnE
https://www.youtube.com/watch?v=8pDm_kH4YKY

Java8 Lambdas

Lambdas enable functional programming in Java. Concept of Lambda is available in many languages but the beauty of it in Java is its BACKWARD COMPATIBILITY. In this post we will discuss below areas

  • Concept
  • Syntax
  • Functional Interfaces
  • Variable Capture
  • Method References
  • Default Methods

Concept


Lambda can be defined as

  • a way of defining anonymous functions
  • It can be passed to variables
  • It can be passed to functions
  • Can be returned from functions

What are lambda good for?

  • These are the base for functional programming model
  • Makes parallel programming easier: If we want to make 100s of cores busy then its easier to do with functional programming than that of the object oriented programming
  • Write compact code (Hadoop1 in Java: 200k lines of code VS Spark1 in Scala: 25k Lines of Code)
  • Richer Data Structure Collections
  • Develop Cleaner APIs

Syntax

 
List<Integer> integers = Arrays.asList(1,2,3);
integers.forEach(x -> System.out.println(x));
OR
integers.forEach((x) -> {
    x = x+10;
    System.out.println(x);
});
OR
integers.forEach((Integer x) -> {
    x = x+10;
    System.out.println(x);
});

We can explicitly mention types, but Java8 compiler is able to do Type Inference

Lambda Expression in Java (from video: A Peek Under the Hood by Brian) is converted into a Function and then we call the generated function

Functional Interfaces (FI)


An interface with just one method (one NON-DEFAULT Method) is called a Functional Interface.

Prior to Java8 we use to write the function with signature, open a curly brace and then write body of the function and close it but In Java8:

 
// define FI
@FunctionalInterface 
    // enforces that the interface is FI (it fails compilation if below interface has more than one method)
    // Its optional and can be applied only to interfaces
public interface Consumer<T> {
    void accept(T t);
}

// give the definition
Consumer<Integer> consumer = x -> System.out.println(x);

// use it
List<Integer> integers = Arrays.asList(1,2,3);
integers.forEach(consumer);

Few things to notice here:

  • Here we are separating the body of the function (Line #10) from its signature(Line #6).
  • The method generated from lambda expression must have same signature as that of the FI(see Line#67: lambda takes one arg ‘x’, throws no Exception and returns nothing)
  • In Java8, the type of the lambda expression is same as that of the FI that lambda is assigned to. (see Line#67)

Variable Capture (VC)


Lambdas can interact with variables (local, instance and static) defined outside the body of lambda (aka VC).

 
List<Integer> integers = Arrays.asList(1,2,3);
int vc=10;
integers.forEach(x -> System.out.println(x+vc));

Note: Local variables accessed and used inside the Lambda are final and cannot be modified.

Lambda vs Anonymous Inner Classes

  • Inner classes can have state in the form of class level instance variables whereas lambdas cannot.
  • Inner Classes can have multiple methods whereas Lambda’s cannot
  • ‘this’ points to the object instance of anonymous inner class whereas it points to the enclosing object for lambda

java.util.function.* contains 43 most commonly used functional interfaces

  • Consumer: functions which takes argument of type T and returns void
  • Supplier: functions that takes no argument and returns a result of type T
  • Predicate: functions which takes argument of type T and returns boolean
  • Function<T, R>: function that takes an argument of type T and returns a result of type R

Method References (MRs)


As lambda being a way to define anonymous function, there is a good chance that the function we want to use exists already. In these cases, MRs can be used to pass an existing function in place where lambda is expected

 
@FunctionalInterface 
public interface Consumer<T> {
    void accept(T t);
}

public void doSomething(Integer x) {
    System.out.println(x);
}

Consumer<Integer> cons1 = x -> doSomething(x);
cons1.accept(1);
// Reuse with MR
Consumer<Integer> cons2 = Example::doSomething;
cons2.accept(2);

Note: The signature of the referenced method must match the signature of FI method.

By looking at above definition it is obvious that MR works on method with only one argument and no return type

Referencing a Constructor: Constructor method references are quite handy while working with Streams

 
// Create a new function which has a method that takes 'String' as parameter (LHS to arrow), returns 'Integer' (RHS to arrow as body of method)
Function<String, Integer> mapper1 = x -> new Integer(x);
System.out.println(mapper1.apply("11"));

// Refer a Cons
Function<String, Integer> mapper2 = x -> Integer::new;
System.out.println(mapper2.apply("22"));

References to a specific object instance method:

 
Consumer<Integer> cons1 = x -> doSomething(x);
cons1.accept(1);
// can also be written as: this invokes the println() method on System.out object by passing param '2'
Consumer<Integer> cons2 = System.out::println;
cons2.accept(2);

Default Methods ***


This is very important feature because it addresses Interface Evolution Problem: How a published interface (like List, Iterable, etc) can be evolved without breaking existing implementations (backward compatible)

Default Method: A default method on a java interface has an implementation provided in the interface and is inherited by the classes that implements it.

 
public Iterable<T> { 
    Iterator<T> iterator;
    
    default void forEach(Consumer<? super T> action) {
        for(T t: this) {
            action.accept(t);
        }
    }
}

References:
https://www.youtube.com/watch?v=MLksirK9nnE
https://www.youtube.com/watch?v=8pDm_kH4YKY

The Internal Architecture of JVM

Java Virtual Machine components are depicted in below diagram. Each time we run java_application/java_class, an instance of JVM gets created.

Class Loader Subsystem

Class loader subsystem loads classes and interfaces with fully qualified names into the JVM.

Execution Engine

Execution Engine executes all instructions contained by methods of a loaded class.

While executing a Java program, JVM requires memory for storing objects ,local variables, method arguments, return values, and intermediate computational results and JVM does that memory management on several runtime data areas. The specification of runtime data areas is quite abstract. This abstract nature of JVM specification helps different designers to provide implementation on wide variety of OS and as per choice of the designers. Some implementations may have a lot of memory in which to work, others may have very little. Some implementations may be able to take advantage of virtual memory, others may not.

Method Area and Heap
Each instance of the Java virtual machine has one method area and one heap. These areas are shared by all threads running inside the virtual machine. When the virtual machine loads a class file, it parses information about class type from the binary data contained in the class file. It stored the type information into the method area. As the program runs, the virtual machine places all objects the program instantiates onto the heap.

PC Registers and Stacks
When a new thread is created, it gets its own pc register (program counter) and Java stack. If the thread is executing a Java method (not a native method), the value of the pc register indicates the next instruction to execute. A thread’s Java stack stores the state of Java (not native) method which includes its local variables, the parameters with which it was invoked, its return value (if any), and intermediate calculations.

The Java stack is composed of stack frames (or frames). A stack frame contains the state of one Java method invocation. When a thread invokes a method, the java virtual machine pushes the newly created frame onto that thread’s Java stack. When the method completes, the virtual machine pops and discards the frame for that method.

In JVM ,the instruction set uses the Java stack for storage of intermediate data values. The stack-based architecture of the JVM’s instruction set optimizes code done by just-in-time and dynamic compilers.

Native Method Stacks

The state of native method invocations is stored in an implementation-dependent way in native method stacks, in registers or other implementation-dependent memory areas.

Security aspects of standalone utilities in java

Security aspects of java application development

  • make sure the utility does not accept any file input that can be used in edit (write) mode, this way hacker can pass some system files to corrupt the system
  • Do not store any confidential data of the client/project in string literals because string literals are placed in perm heap space and hence these objects can be dumped to expose all the information
  • Handle all passwords in character arrays as string literals are a security threat (same explanation as above)
  • Do not construct SQL queries from any property files without prepared statement as they are prone to SQL injection attacks.

Immutable Objects and Strings in Java

Why are Strings made immutable?
Security Reasons: as strings are used in class loaders, if it is made mutable it could lead to a potential security threat.
Efficient Memory Usage: We use string for whole lot of reasons with the concept of ‘literal pool/string pool’, when string literal is recreated it points to the already existing literal instead of creating new string in the pool memory.
Note: Along with Strings all the wrappers of primitive data types are made immutable.

Whys is String made final?
In order not to override the existing String class to remove immutable property where we would lose the above mentioned advantages it is made final.

Disadvantages of Strings being immutable?
And the negative side of immutable strings is that even a small change to a big string could lead to the construction of entire string again which is both time and space inefficient. In these scenarios we should go for StringBuffer/StringBuilder.

String pool memory and garbage collection:
These two concepts are contradictory in the sense, intention of coming up with pool concept is for the memory efficiency and for this reason GC wont act on it.
Since string pool memory does not come under the scanner of JAVA’s Garbage Collector (GC); once a string literal is put into pool memory it never dies. The inference from this is NEVER EVER store any user sensitive or confidential information in strings because a hacker can always dump the heap which would give away all the secure information.

String used as hashmap keys because of their immutable property:
Hashmap’s key constraint is that once after inserting key-value pair into it, one should not change the key value (with which hashcode is tied up with) as we manipulate the correct bucket based on the hascode which in turn is associated to the key value. And hence after inserting the key-value pair if one changes the key value, its hashcode changes and hence lookup of this pair ends up in returning null. Because of String’s immutable property one can safely make use of String as key in hashmaps.

other most frequently asked java questions

difference between concurrenthashmap and hashtable


In short, hashtable is synchronized at whole map level where as concurrenthashmap is synchronized at bucket level. Here are the differences:

  • read is completely not synchronized in concurrenthashmap
  • Iteratable/Enumerator does not throw concurrent modification exception while iterting in case the map is modified while iterating
  • write is synchronized only at the bucket level instead at the table level

importance of volatility, atomicity and synchronization


volatile: When a variable is declared volatile, any update to that variable is made directly to main memory instead of CPU cache and hence it guarantees the visibility of changes across threads

private volatile boolean flag;

public void run() {
while(!flag) {
//do some work
}
}

public void markFlag() {
flag = true;
}

atomicity: When a variable is declared atomic, it ensures the operations made on the variable are atomic in nature, it uses compare-and-swap low level cpu operation to accomplish this without the need of synchronization.

synchronization: traditional way of acquiring locks to do read/write operation on state (which is by design time consuming operation)

Serialization Mechanism


Serialization is a mechanism by which you can save the state of an object by converting it to a byte stream.The class whose instances are to be serialized should implement an interface Serializable. Then you pass the instance to the ObjectOutputStream which is connected to a FileOutputStream. This will save the object to a file.

The serializable interface is an empty interface. The class should implement Externalizable interface. This interface contains two methods namely readExternal and writeExternal. You should implement these methods and write the logic for customizing the serialization process.

The serialization mechanism generates an object graph for serialization. Thus it determines whether the included object references are serializable or not. This is a recursive process. Thus when an object is serialized, all the included objects are also serialized along with the original object.One should make sure that all the included objects are also serializable. If any of the objects is not serializable then it throws a NotSerializableException.

There are three exceptions in which serialization doesnot necessarily read and write to the stream. These are

  • Serialization ignores static fields, because they are not part of any particular state.
  • Base class fields are only handled if the base class itself is serializable.
  • Transient fields.

Analysing Java Garbage Collector

Application Phases: Initialization → Steady state → Summary (optional)
– Initialization: startup, lots of allocations
– Steady state: where application spends the most time, running the ‘core’ part of the application
– Summary: produce report of what has happen, might be computationally intensive and maybe has lots of allocation as well
Note: Typically, the steady state is the most important

Performance
Three important performance attributes are throughput, latency, footprint
– Throughput – the percentage of time not spent on GC
– Latencies – the time when an app appears unresponsive because of GC
– Footprint – memory size
Note: Not all three are always important
– Usually select two of the three
– But you need to know what is important for your application

GC Logging
Yes even in production
– Very important, not just for testing, minimal overhead

Correlate application-level with GC/JVM level events
– GC not always to blame for long pauses
– Capture time/context sensitive events that might not be able to reproduce under testing

Flags

– -XX:+PrintGCTimeStamps -XX:+PrintGCDetails -Xloggc:&lt;file&gt;
– -XX:+PrintGCDateStamps (from JDK6u4)
– -XX:+PrintHeapAtGC

GC Logging Output – 1

-XX:+PrintGCTimeStamps -XX:+PrintGCDetails
27.199: [Full GC (System)
[PSYoungGen: 81K-&gt;0K(26304K)]
[PSOldGen: 7497K-&gt;7559K(32256K)] 7579K-&gt;7559K(58560K)
[PSPermGen: 12166K-&gt;12166K(16384K)], <i>0.0495600 secs</i>]
[Times: user=<i>0.05 </i>sys=0.00, real=<i>0.04 secs</i>]
28.503: [GC
[PSYoungGen: 869K-&gt;64K(26304K)] 8429K-&gt;7623K(58560K),
0.0008190 secs]
[Times: user=0.00 sys=0.00, real=0.00 secs]

GC Logging Output – 2

-XX:+PrintGCDateStamps -XX:+PrintGCDetails
2011-04-25T10:55:50.007+0800: 9.140:
[GC [PSYoungGen: 8024K-&gt;1530K(26304K)] 15470K-&gt;10427K(58560K),
0.0023450 secs]
[Times: user=0.00 sys=0.00, real=0.01 secs]
2011-04-25T10:55:50.009+0800: 9.143:
[Full GC (System) [PSYoungGen: 1530K-&gt;0K(26304K)]
[PSOldGen: 8896K-&gt;7135K(32256K)] 10427K-&gt;7135K(58560K)
[PSPermGen: 11640K-&gt;11631K(23808K)], 0.0778080 secs]
[Times: user=0.12 sys=0.00, real=0.07 secs]

GC Logging Output – 3

-XX:+PrintGCHeapAtGC -Xloggc:&lt;file&gt;
{<b>Heap before GC invocations=20 (full 8)</b>:
PSYoungGen total 14144K, used 2287K [0xa3b90000, 0xa4b50000, 0xb3790000)
eden space 12160K, 18% used [0xa3b90000,0xa3dcbe00,0xa4770000)
from space 1984K, 0% used [0xa4770000,0xa4770000,0xa4960000)
to space 1984K, 0% used [0xa4960000,0xa4960000,0xa4b50000)
PSOldGen total 32256K, used 7465K [0x84390000, 0x86310000, 0xa3b90000)
object space 32256K, 23% used [0x84390000,0x84ada710,0x86310000)
PSPermGen total 19968K, used 11994K [0x80390000, 0x81710000, 0x84390000)
object space 19968K, 60% used [0x80390000,0x80f46bb0,0x81710000)
45.215: [GC 9753K-&gt;7538K(46400K), 0.0006110 secs]

Heap after GC invocations=20 (full 8):

PSYoungGen total 14144K, used 72K [0xa3b90000, 0xa4b50000, 0xb3790000)
eden space 12160K, 0% used [0xa3b90000,0xa3b90000,0xa4770000)
from space 1984K, 3% used [0xa4960000,0xa49723b0,0xa4b50000)
to space 1984K, 0% used [0xa4770000,0xa4770000,0xa4960000)
PSOldGen total 32256K, used 7465K [0x84390000, 0x86310000, 0xa3b90000)
object space 32256K, 23% used [0x84390000,0x84ada710,0x86310000)
PSPermGen total 19968K, used 11994K [0x80390000, 0x81710000, 0x84390000)
object space 19968K, 60% used [0x80390000,0x80f46bb0,0x81710000)
}

Choice of JVM
32-bit
– -d32 (default)
– For heap size up to 2.5G/3G

64-bit with/without compressed references
– -d64, -XX:+UseCompressedOops
– For heap size up to 26G or unlimited

32-bit to 64-bit migration
– Higher heap size requirement (approx 20%)
– Throughput impact (up to 20%, without compressed refs)

Internals of Java Garbage Collector

GC Responsibilities
Garbage Detection: Garbage are objects that are no longer reachable
Garbage Reclamation:Make space available to the running program again

HotSpot VM Heap Layout

SS#1: Survivor Space 1
SS#2: Survivor Space 2

How HotSpot’s GC Works

Empirical Statistics
Most objects are very short lived
– 80 – 98% of all newly allocated objects die within a few million instructions
– 80 – 98% of all newly allocated objects die before another megabyte has been allocated

Assumptions
Weak generation hypothesis
– Most new objects die young
– Concentrate effort on managing the young generation
– Make allocate/manage/deallocate cycle fast and efficient
For older objects, manage as little as possible
Keep young and old objects separate
Use different GC algorithms for each generation
– Different requirements for each group

Object Allocation

Garbage collector tuning in JAVA HOTSPOT JVM

Garbage collector tuning in JAVA HOTSPOT JVM

  • GC Basics
  • Analysing GC
  • Memory Footprint
  • Latencies
  • Types of Collectors
  • Summary

Garbage Collector(GC) Basics
What are GC responsibilities? Garbage detection and reclamation
Garbage Detection: garbage is a collection of objects which are no longer reachable
Garbage Reclamation: makes space for the running program again

Analysing GC
Application Phases
• Initialization ? Steady state ? Summary (optional)
– Initialization – startup, lots of allocations
– Steady state – where application spends the most time, running the ‘core’ part of the application
– Summary – produce report of what has happen, might be computationally intensive and maybe has lots of allocation as well
• Typically, the steady state is the most important

Memory Footprint

Footprint Requirements
• Physical memory available to the VM
– Single VM and dedicated machine
• All physical memory on the machine?
– Multiple Vms / other processes on the machine (eg. DB)
• How much memory is the VM allowed to consume?

Latencies Requirements
• How large are the pauses?
– Average GC pause time target
– Max GC pause time target
– How frequently violations can be tolerated?
• How frequent are the pauses?
– GC pauses frequency target

Choice of Collectors
• Serial GC
– Default, single threaded
– For application with small LDS of up to appox 100MB
• Parallel GC / Parallel Old GC
– Best throughput GC, takes advantage of multi processor
– For medium to large heap
• Concurrent Mark-Sweep GC (CMS)
– Good for managing latencies
– For medium to large heap
– Variant of this is the incremental mode

Summary
• Collect GC data
• Understand your application
• Use the material as a guide
– Don’t be afraid to experiment
• Approach presented
– Start with ParallelOldGC
– Move to CMS when necessary
• Relook at your code
– If GC tuning does not bring any tangible improvements

Hashmap Internal Implementation Analysis in Java

Full detailed analysis of java.util.HashMap’s implementation, its internals and working concepts.

java.util.HashMap.java

    /**
     * The maximum capacity, used if a higher value is implicitly specified
     * by either of the constructors with arguments.
     * MUST be a power of two <= 1<<30.
     */
    static final int MAXIMUM_CAPACITY = 1 << 30;

It says the maximum size to which hashmap can expand, i.e, till 2^(30) = 1,073,741,824

java.util.HashMap.java

    /**
     * The default initial capacity - MUST be a power of two.
     */
    static final int DEFAULT_INITIAL_CAPACITY = 16;

    /**
     * The load factor used when none specified in constructor.
     */
    static final float DEFAULT_LOAD_FACTOR = 0.75f;

It says default size of an array is 16 (always power of 2, we will understand soon why it is always power of 2 going further) and load factor means whenever the size of the hashmap reaches to 75% of its current size, i.e, 12, it will double its size by recomputing the hashcodes of existing data structure elements.

Hence to avoid rehashing of the data structure as elements grow it is the best practice to explicitly give the size of the hashmap while creating it.

Do you foresee any problem with this resizing of hashmap in java? Since java is multi threaded it is very possible that more than one thread might be using same hashmap and then they both realize the need for re-sizing the hashmap at the same time which leads to race condition.

What is race condition with respect to hashmaps? When two or more threads see the need for resizing the same hashmap, they might end up adding the elements of old bucket to the new bucket simultaneously and hence might lead to infinite loops. FYI, in case of collision, i.e, when there are different keys with same hashcode, internally we use single linked list to store the elements. And we store every new element at the head of the linked list to avoid tail traversing and hence at the time of resizing the entire sequence of objects in linked list gets reversed, during which there are chances of infinite loops.

For example, lets assume there are 3 keys with same hashcode and hence stored in linked list inside a bucket [below format is in object_value(current_address, next_address) ]
Initial structure: 1(100, 200) –> 2(200, 300) –> 3(300, null)
After resizing by thread#1: 3(300, 200) –> 2(200, 100) –> 1(100, null)
When thread#2 starts resizing, its again starts with 1st element by placing it at the head:
1(100, 300) –> 3(300, 200) –> 2(200, 100) ==> which becomes a infinite loop for next insertion and thread hangs here.

java.util.HashMap.java

	/**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

Here it
1. re-generates the hashcode using hash(int h) method by passing user defined hashcode as an argument
2. generates index based on the re-generated hashcode and length of the data structure.
3. if key exists, it over-rides the element, else it will create a new entry in the hashmap at the index generated in Step-2

Step-3 is straight forward but Step-1&2 needs to have deeper understanding. Let us dive into the internals of these methods…

Note: These two methods are very very important to understand the internal working functionality of hashmap in openjdk

java.util.HashMap.java

	/**
     * Applies a supplemental hash function to a given hashCode, which
     * defends against poor quality hash functions.  This is critical
     * because HashMap uses power-of-two length hash tables, that
     * otherwise encounter collisions for hashCodes that do not differ
     * in lower bits. Note: Null keys always map to hash 0, thus index 0.
     */
    static int hash(int h) {
        // This function ensures that hashCodes that differ only by
        // constant multiples at each bit position have a bounded
        // number of collisions (approximately 8 at default load factor).
        h ^= (h >>> 20) ^ (h >>> 12);
        return h ^ (h >>> 7) ^ (h >>> 4);
    }

	/**
     * Returns index for hash code h.
     */
    static int indexFor(int h, int length) {
        return h & (length-1);
    }

here:
‘h’ is hashcode(because of its int data type, it is 32 bit)
‘length’ is DEFAULT_INITIAL_CAPACITY(because of its int data type, it is 32 bit)

Comment from above source code says…
Applies a supplemental hash function to a given hashCode, which defends against poor quality hash functions. This is critical because HashMap uses power-of-two length hash tables, that otherwise encounter collisions for hashCodes that do not differ in lower bits. What do this means???

It means that if in case the algorithm we wrote for hashcode generation does not distribute/mix lower bits evenly, it will lead to more collisions. For example, we have hashcode logic of “empId*deptId” and if deptId is even, it would always generate even hashcodes because any number multiplied by EVEN is always EVEN. And if we directly depend on these hashcodes to compute the index and store our objects into hashmap then
1. odd places in the hashmap are always empty
2. because of #1, it would leave us to use only even places and hence double the number of collisions

For example,
I am considering some hash codes which our code might generate, which are very valid as they are different, but we will prove all these turns out as as same bucket very soon

1111110111011010101101010111110
1100110111011010101011010111110
1100000111011010101110010111110

I am considering these sequences directly (without using hash function) and pass it for indexFor method, where we do AND operation between ‘hashcode’ and ‘length-1(which will always give sequence of 1’s as length is always power of 2)’. Why is hashmap length always power of 2? it is because when we do (user_hash_code & size-1), it will consider only the first few bits (in this case 4 bits) which are used to decide the bucket location.
As we are considering the length as default length, i.e, 16, binary representation of (16-1) is 1111
this is what happens inside indexFor method

1111110111011010101101010111110 & 0000000000000000000000000001111 = 1110
1100110111011010101011010111110 & 0000000000000000000000000001111 = 1110
1100000111011010101110010111110 & 0000000000000000000000000001111 = 1110

From this we understand that all the objects with these different hascodes would have same index which means they would all go into the same bucket, which is a BIG-FAIL as it leads to arraylist complexity O(n) instead of O(1)

Comment from above source code says…
that otherwise encounter collisions for hashCodes that do not differ in lower bits.

Notice this sequence of 0-15 (2-power-4), its the default size of Hashtable

0000 - 0	1000 - 8
0001 - 1	1001 - 9
0010 - 2	1010 - 10
0011 - 3	1011 - 11
0100 - 4	1100 - 12
0101 - 5	1101 - 13
0110 - 6	1110 - 14
0111 - 7	1111 - 15

If we notice here, hashmap with power-of-two length 16(2^4), only last four digits matter in the allocation of buckets, and these are the 4 binary lower bit digit variations that play prominent role in identifying the right bucket.

Keeping the above sequence in mind, we re-generated the hashcode from hash(int h) by passing the existing hascode which makes sure there is enough variation in the lower bits of the hashcode and then pass it to indexFor() method , this will ensure the lower bits of hashcode are used to identify the bucket and the rest higher bits are ignored.
For example, taking the same hascode sequences from above example

Our hash code 1111110111011010101101010111110 when regenerated with hash(int h) method, it generates new hashcode ==> 1111001111110011100110111011010

this is what happens inside indexFor method
1111110111011010101101010111110 ==> 1111001111110011100110111011010
1100110111011010101011010111110 ==> 1100000010010000101101110011001
1100000111011010101110010111110 ==> 1100110001001000011011110001011

passing these set of new hashcodes to indexFor() method
1111001111110011100110111011010 & 0000000000000000000000000001111 = 1010
1100000010010000101101110011001 & 0000000000000000000000000001111 = 1001
1100110001001000011011110001011 & 0000000000000000000000000001111 = 1011

so here it is clear that because of the regenerated hashcode, the lower bits are well distributed/mixed leading to unique index which leads to different buckets avoiding collisions.

Why only these magic numbers 20, 12, 7 and 4. It is explained in the book: “The Art of Computer Programming” by Donald Knuth.
Here we are XORing the most significant bits of the number into the least significant bits (20, 12, 7, 4). The main purpose of this operation is to make the hashcode differences visible in the least significant bits so that the hashmap elements can be distributed evenly across the buckets.

java.util.HashMap.java

	/**
     * Associates the specified value with the specified key in this map.
     * If the map previously contained a mapping for the key, the old
     * value is replaced.
     *
     * @param key key with which the specified value is to be associated
     * @param value value to be associated with the specified key
     * @return the previous value associated with <tt>key</tt>, or
     *         <tt>null</tt> if there was no mapping for <tt>key</tt>.
     *         (A <tt>null</tt> return can also indicate that the map
     *         previously associated <tt>null</tt> with <tt>key</tt>.)
     */
    public V put(K key, V value) {
        if (key == null)
            return putForNullKey(value);
        int hash = hash(key.hashCode());
        int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k))) {
                V oldValue = e.value;
                e.value = value;
                e.recordAccess(this);
                return oldValue;
            }
        }

        modCount++;
        addEntry(hash, key, value, i);
        return null;
    }

Going back to previous steps:
1. re-generates the hashcode using hash(int h) method by passing user defined hashcode as an argument
2. generates index based on the re-generated hashcode and length of the data structure.
3. if key exists, it over-rides the element, else it will create a new entry in the hashmap at the index generated in STEP-2

Steps1&2 must be clear by now.

Step3:
What happens when two different keys have same hashcode?
1. if the keys are equal, i.e, to-be-inserted key and already-inserted key’s hashcodes are same and keys are same (via reference or via equals() method) then over-ride the previous key-value pair with the current key-value pair.
2. if keys are not equal, then store the key-value pair in the same bucket as that of the existing keys.

When collision happens in hashmap? it happens in case-2 of above question.

How do you retrieve value object when two keys with same hashcode are stored in hashmap? Using hashcode wo go to the right bucket and using equals we find the right element in the bucket and then return it.

How do different keys with same hashcode are stored in hashmap? Usual answer is in bucket but technically they are all stored in a single linked list. Little difference is that insertion of new element to the linked list is made at the head instead of tail to avoid tail traversal.

What is bucket and what can be maximum number of buckets in hashmap? A bucket is an instance of the linked list (Entry Inner Class in my previous post) and we can have as many number of buckets as length of the hashmap at maximum, for example, in a hashmap of length 8, there can be maximum of 8 buckets, each is an instance of linked list.

java.util.HashMap.java

	/**
     * Returns the value to which the specified key is mapped,
     * or {@code null} if this map contains no mapping for the key.
     *
     * <p>More formally, if this map contains a mapping from a key
     * {@code k} to a value {@code v} such that {@code (key==null ? k==null :
     * key.equals(k))}, then this method returns {@code v}; otherwise
     * it returns {@code null}.  (There can be at most one such mapping.)
     *
     * <p>A return value of {@code null} does not <i>necessarily</i>
     * indicate that the map contains no mapping for the key; it's also
     * possible that the map explicitly maps the key to {@code null}.
     * The {@link #containsKey containsKey} operation may be used to
     * distinguish these two cases.
     *
     * @see #put(Object, Object)
     */
    public V get(Object key) {
        if (key == null)
            return getForNullKey();
        int hash = hash(key.hashCode());
		int i = indexFor(hash, table.length);
        for (Entry<K,V> e = table[i]; e != null; e = e.next) {
            Object k;
            if (e.hash == hash && ((k = e.key) == key || key.equals(k)))
                return e.value;
        }
        return null;
    }

1. re-generates the hashcode using hash(int h) method by passing user defined hashcode as an argument
2. generates index based on the re-generated hashcode and length of the data structure.
3. point to the right bucket, i.e, table[i], and traverse through the linked list, which is constructed based on Entry inner class
4. when keys are equal and their hashcodes are equal then return the value mapped to that key

Also look at
creating our own hashmap in java

Mawazo

Mostly technology with occasional sprinkling of other random thoughts

amintabar

Amir Amintabar's personal page

101 Books

Reading my way through Time Magazine's 100 Greatest Novels since 1923 (plus Ulysses)

Seek, Plunnge and more...

My words, my world...

ARRM Foundation

Do not wait for leaders; do it alone, person to person - Mother Teresa

Executive Management

An unexamined life is not worth living – Socrates

javaproffesionals

A topnotch WordPress.com site

thehandwritinganalyst

Just another WordPress.com site

coding algorithms

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey

%d bloggers like this: