best possible way to find the duplicates in an array

This is the most frequently asked question in many interviews. This can be done in many ways and simplest way is the brute-force approach by taking 2 for loops with complexity of O(n*n). But the interviewer looks for the solution with the best time complexity and also without using any existing data structures that are available in Java or other languages.

Now I am going to show you the approach with O(n) complexity at the cost of O(n) space. Full source code can be downloaded from here.

Step1: Construct a simple linked list which holds the integer value.

    class MyInt {
		int value;
		MyInt next;
		
		public MyInt(int value) {
			this.value = value;
		}
	}

Step2: Initialize the above array with the size of the input array.

hashArray = new MyInt[size];

Step3: Loop through the input array and insert elements into the newly constructed hash array

  • In this method we construct the hashcode of each element and then find out the appropriate bucket (array index) in which the element has to be stored.
    public void findDuplicates(Integer currentElement) {
		int idx = getSupplementalHash(currentElement.hashCode()) % size;
		//int idx = (element.hashCode()) % size;
		MyInt existingElement = hashArray[idx];

		for(;existingElement != null; existingElement = existingElement.next) {
			if(existingElement.value == currentElement) {
				// duplicate
				System.out.println(currentElement+" this is a duplicate");
				return;
			} else {
				System.out.println("identified collision, adding "+
                                                   existingElement.value+" to the list");
			}
		}
		
		System.out.println("adding "+currentElement+" to the list");
		MyInt mi = new MyInt(currentElement);
		// insert element at the head to avoid tail traversing
		mi.next = hashArray[idx];
		hashArray[idx] = mi;
		
		System.out.println("------------------------------------");
	}

Example #1
Input Array: {1, 2, 3, 3, 4, 5, 6, 8, 8, 1, 2,8}
Output:

3 is a duplicate
8 is a duplicate
1 is a duplicate
2 is a duplicate
8 is a duplicate

But there is a catch here, if the hashcode is not properly implemented by the user the complexity might go upto O(n*n), see the below example

Example #2: In findDuplicates() method above, comment line #2 and uncomment line #3
Input array: {10,0,100,20,20}
Output

Input array size is 5
adding 10 to the list
------------------------------------
identified collision, adding 10 to the list
adding 0 to the list
------------------------------------
identified collision, adding 0 to the list
identified collision, adding 10 to the list
adding 100 to the list
------------------------------------
identified collision, adding 100 to the list
identified collision, adding 0 to the list
identified collision, adding 10 to the list
adding 20 to the list
------------------------------------
20 is a duplicate

If you notice above output, to find if 20 is a duplicate, it traversed 4 times which is nothing but O(n) for each element and hence it goes back to O(n*n) overall.

Disadvantage: The downside with this approach when hashcodes are not proper (like all ending with 0, all are even, all are odd, etc), they all go into one bucket and thus becoming another array (with complexity O(n)) instead or hasharray (with complexity O(1)). Because of this our target complexity O(n) will be lost.

To address this we will re-compute the hash code again at our end to make sure it is well distributes the index, here is the code

    private int getSupplementalHash(int h) {
		// This function ensures that hashCodes that differ only by
		// constant multiples at each bit position have a bounded
		// number of collisions (approximately 8 at default load factor).
		h ^= (h >>> 20) ^ (h >>> 12);
		return h ^ (h >>> 7) ^ (h >>> 4);
	}

Example #2 (run with getSupplementalHash() method): : In findDuplicates() method above, comment line #3 and uncomment line #2
Output

Input array size is 5
adding 10 to the list
------------------------------------
identified collision, adding 10 to the list
adding 0 to the list
------------------------------------
adding 100 to the list
------------------------------------
adding 20 to the list
------------------------------------
20 is a duplicate

If you notice, the collision is reduced to a greater extent.

Full source code can be downloaded from here.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Mawazo

Mostly technology with occasional sprinkling of other random thoughts

amintabar

Amir Amintabar's personal page

101 Books

Reading my way through Time Magazine's 100 Greatest Novels since 1923 (plus Ulysses)

Seek, Plunnge and more...

My words, my world...

ARRM Foundation

Do not wait for leaders; do it alone, person to person - Mother Teresa

Executive Management

An unexamined life is not worth living – Socrates

Diabolical or Smart

Nitwit, Blubber, Oddment, Tweak !!

javaproffesionals

A topnotch WordPress.com site

thehandwritinganalyst

Just another WordPress.com site

coding algorithms

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey