algorithm to find substring in a string (KMP Algorithm)

String Search Algorithm in java OR String Matching Algorithm in java: KMP Algorithm is one of the many string search algorithms which is better suited in scenarios where ‘pattern to be searched’ remains same whereas ‘text to be searched’ changes.

Full Source Code cab be downloaded here

Knuth-Morris-Pratt Algorithm (KMP) detailed analysis
Understanding this would show us what a careful algorithmic thinker would code even for a small problem like this. This is one of the algorithms which is easy to implement but one needs lot of attention and hard work to understand the algorithm.

Problem Statement: lets assume we want to search for the pattern P in the text T.
T: [abcabdabcabdabcabdabdabc]
P: [abcabdabc]

This algorithm is executed in two phases.

  • Preprocessing Phase
  • Search Phase

PRE PROCESSING PHASE

In this phase, KMP algorithm creates a temporary array which is prepared from the pattern. We call this temporary array as Partial Match Table (PMT), lets call this PMT as b[i].

What is Partial Match Table?
It is an array of size (pattern_length+1) where, for each position of i in pattern p, b[i] is defined such that it takes the ‘length of the longest proper suffix of P[1…i-1]’ that matches with the ‘prefix of P’.
What is longest prefix/suffix match??? Proper prefix is a prefix which is not same as the substring. Recall proper set which is a subset of a set but is not same as the set. This is very important to understand the algorithm. Let me take an example and show it, considering the above pattern ‘abcabdabc’

I am taking all possible substrings in the pattern ‘abcabdabc’:

a
ab
abc
abca
abcab
abcabd
abcabda
abcabdab
abcabdabc

Now take each substring and find all possible set of proper prefixes and proper suffixes,

a         : no prefix and no suffix since there is only one letter
ab        : prefixes[a]
            suffixes[b]
abc       : prefixes[a,ab]
            suffixes[c,bc]
abca      : prefixes[a,ab,abc]
            suffixes[a,ca,bca]
abcab     : prefixes[a,ab,abc,abca]
            suffixes[b,ab,cab,bcab]
abcabd    : prefixes[a,ab,abc,abca,abcab]
            suffixes[d,bd,abd,cabd,bcabd]
abcabda   : prefixes[a,ab,abc,abca,abcab,abcabd]
            suffixes[a,da,bda,abda,cabda,bcabda]
abcabdab  : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda]
            suffixes[b,ab,dab,bdab,abdab,cabdab,bcabdab]
abcabdabc : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda,abcabdab]
            suffixes[c,bc,abc,dabc,bdabc,abdabc,cabdabc,bcabdabc,abcabdabc]

Now look at the longest proper prefix which is same as the proper suffix:

a         : no prefix and no suffix since there is only one letter
ab        : 0
abc       : 0
abca      : 1 [a]
abcab     : 2 [ab]
abcabd    : 0
abcabda   : 1 [a]
abcabdab  : 2 [ab]
abcabdabc : 3 [abc]

Now start computing the Partial Match Table array values:
b[0] is computed by looking at longest prefix and suffix match of the substring of p[0…0-1] as p[-1] is not defined, hence b[0] is given -1
b[1] is computed by looking at longest prefix and suffix match of the substring of p[0…1-1] = b[0] = A, since there wont be any prefix and suffix for one letter word, b[1] is 0.
Note: For any given pattern b[0] and b[1] are always fixed
b[2]: is computed by looking at longest prefix and suffix match of the substring of p[0…2-1] = b[0…1] = AB = 0
… so on, below are the values of partial match table

b[0] = -1
b[1] = 0
b[2] = 0
.
.
.
b[7] = 1
b[8] = 2
b[9] = 3

Final array is:

p    :  a   b   c   a   b   d   a   b   c
p[i] :  0   1   2   3   4   5   6   7   8
b[i] : -1   0   0   0   1   2   0   1   2   3

Note: Why a prefix should match suffix of the pattern? its because when we shift the pattern its the prefix of P which comes towards the suffix. And also the key idea is that if we have successfully matched prefix P[1…i-1] of the pattern with the substring T[j-(i-1)…j-1] of the input string and P(i)!=T(j), then we dont need to reprocess any of the suffix T[j-(i-1)…j-1] since we know this portion of the text string is the prefix of the pattern that we have just matched.

Pre Processing Code:

	/**
	 * Pre processes the pattern array based on proper prefixes and proper
	 * suffixes at every position of the array
	 *
	 * @param ptrn
	 *            word that is to be searched in the search string
	 * @return partial match table which indicates
	 */
	public int[] preProcessPattern(char[] ptrn) {
		int i = 0, j = -1;
		int ptrnLen = ptrn.length;
		int[] b = new int[ptrnLen + 1];

		b[i] = j;
		while (i < ptrnLen) { 			
		        while (j >= 0 && ptrn[i] != ptrn[j]) {
				// if there is mismatch consider the next widest border
				// The borders to be examined are obtained in decreasing order from 
				//  the values b[i], b[b[i]] etc.
				j = b[j];
			}
			i++;
			j++;
			b[i] = j;
		}
		return b;
	}

Code Walkthrough:
Lets analyze the pre processing phase:

Standard Definitions:
The preprocessing algorithm inner while loop:
If p(j) == p(i), A border of width j can be extended by p(i), here it doesnt go into inner while loop.
If p(j) != p(i), the next-widest border is examined by setting j = b[j]. The loop terminates at the latest if no border can be extended (j = -1).
Now let us look at how partial match table is prepared:
Provided that the values b[0], …, b[i] are already known, the value of b[i+1] is computed by checking if a border of the prefix p[0] … p[i-1] can be extended by p[i]. This is the case if p[b(i)] = p(i). The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.

What do these definitions say? you will understand once we walk through the code looking at the below PMT table once again

p    :  a   b   c   a   b   d   a   b   c
p[i] :  0   1   2   3   4   5   6   7   8
b[i] : -1   0   0   0   1   2   0   1   2   3

Initially b[0] is set to -1
At 1st loop, i=0, j=-1, since j<0, it will not enter into inner while loop and i&j are incremented, i.e, i=1&j=0 and b[1]=0
At 2nd loop, i=1, j=0, since a[1] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) and i&j are incremented, i.e, i=2&j=0 and b[2]=0
At 3rd loop, i=2, j=0, since a[2] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[0]=-1) and i&j are incremented, i.e, i=3&j=0 and b[3]=0
At 4th loop, i=3, j=0, since a[3] is equal to a[0], it will not enter into inner while loop and i&j are incremented, i.e, i=4&j=1 and b[4]=1
At 5th loop, i=4, j=1, since a[4] is equal to a[1], it will not enter into inner while loop and i&j are incremented, i.e, i=5&j=2 and b[5]=2
NEXT LOOP IS IMPORTANT TO UNDERSTAND
At 6th loop, i=5, j=2, since a[5] is not equal to a[2], it will enter into inner while loop and set j to 0 (because b[j] i.e, b[2]=0) and still a[5] is not equal to a[0], it will again enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) .
Why j is jumping from 2, 0, and -1?? see the definition ‘The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.’, here it is b[5] i.e 2, b[b[5]] i.e 0, b[b[b[5]]] i.e -1

Now it will exit inner while loop, increment i&j, i.e, i=6&j=0 and b[6]=0
…. so on
by now you might have understood how to evaluate other values of the array (IF NOT PLS PING I will add the other values as well)

SEARCH PHASE

This is pretty simple once we get the partial match table from preprocessing phase. Here we just compare first character of the text T i.e, t(i) with the first character of the pattern P i.e, p(i), if there is a match we increment the pointers i and j and go forward. If there is a mismatch then we look into the partial match table b[j] to shift the position of j to appropriate value based on the prefix/suffix match by using b[j].

Code Walkthrough

/**
	 * Based on the pre processed array, search for the pattern in the text
	 *
	 * @param text
	 *            text over which search happens
	 * @param ptrn
	 *            pattern that is to be searched
	 */
	public void searchSubString(char[] text, char[] ptrn) {
		int i = 0, j = 0;
		// pattern and text lengths
		int ptrnLen = ptrn.length;
		int txtLen = text.length;

		// initialize new array and preprocess the pattern
		int[] b = preProcessPattern(ptrn);

		while (i < txtLen) {
 			while (j >= 0 && text[i] != ptrn[j]) {
				j = b[j];
			}
			i++;
			j++;

			// a match is found
			if (j == ptrnLen) {
				System.out.println("found substring at index:" + (i - ptrnLen));
				j = b[j];
			}
		}
	}

Lets analyze the search phase:

Index 	      0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16  17  18  19  20  21  22  23
Text(T)       a b c a b d a b c a b  d  a  b  c  a  b   d   a   b   d   a   b   c
Pattern(P)    a b c a b d a b c
PMT (b[i])   -1 0 0 0 1 2 0 1 2 3

Here j is a variable used to track the pattern P and i is used to track the text T. Looking at the Text and Pattern, T[0…8] and P[0…8] have a perfect match, once there is a perfect match it enters into if loop at line #26, where j is set from 9 to 3 because b[9]=3; and the program continues from i=8 and j=3, going further there is a mismatch at i=11 and j=5, the program enters into inner while loop where j is set to 2 (because b[5]=2), since p[2] is not equals to t[11], program enters into inner while loop again where j is set to 0 (because b[2]=0), since p[0] is still not equals to t[11], program enters into inner while loop again where j is set to -1 (because b[0]=-1); at j=-1 program exits the inner while loop increments i and j; in next loop at i=12 and j=0, the outer while loop executes with perfect match again at T[12…20] and P[0…8]

Advantages:
Implementation is simple
Time complexity is O(m+n) when compared to the brute force algorithm O(mn)
This algorithm takes atmost 2n comparisions

Disadvantages:
Chances of mismatch increases with big sized alphabets (other than English).

Full Source Code cab be downloaded here

The output of this program is:

 a   b   c   a   b   d   a   b   c   a b e a b c a b d a b c a b d   
printing pattern, partial match table, and its index
 a   b   c   a   b   d   a   b   c    
-1   0   0   0   1   2   0   1   2   3   
 0   1   2   3   4   5   6   7   8   
FOUND SUBSTRING AT i 9 and index: 0
Setting j from 9 to 3

Mismatch happened, between text char e and pattern char d, hence jumping 
  the value of j from 5 to 2 at text index i at 11 based on partial match table
Mismatch happened, between text char e and pattern char c, hence jumping 
  the value of j from 2 to 0 at text index i at 11 based on partial match table
Mismatch happened, between text char e and pattern char a, hence jumping 
  the value of j from 0 to -1 at text index i at 11 based on partial match table

FOUND SUBSTRING AT i 21 and index: 12
Setting j from 9 to 3

I know, this algorithm is too big to digest at one shot, but practising more examples and analysing different patterns with the above explanation would give you a good start. Hope you enjoy this article, happy reading 🙂 🙂 🙂

REFERENCES:
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm
http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/stringmatchingclasses/KmpStringMatcher.java

7 Responses to algorithm to find substring in a string (KMP Algorithm)

  1. Sriram Sundarrajah says:

    DONE THANKS

  2. Chandrsna says:

    Thanks .. !!

  3. gouse says:

    nice one dude. keep posting ur work

  4. Ratheesh a says:

    Hey Bro..

    Logic and explanation is awesome, this helps to clarify most of the string pattern matching test cases.

    Hope you will be updating more..

  5. sr says:

    good explanation..please post some more algorthim

  6. vihang shah says:

    great..easy to understand

Leave a comment

Mawazo

Mostly technology with occasional sprinkling of other random thoughts

amintabar

Amir Amintabar's personal page

101 Books

Reading my way through Time Magazine's 100 Greatest Novels since 1923 (plus Ulysses)

Seek, Plunnge and more...

My words, my world...

ARRM Foundation

Do not wait for leaders; do it alone, person to person - Mother Teresa

Executive Management

An unexamined life is not worth living – Socrates

javaproffesionals

A topnotch WordPress.com site

thehandwritinganalyst

Just another WordPress.com site

coding algorithms

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey