## algorithm to find substring in a string (KMP Algorithm)

May 14, 2013 7 Comments

**String Search Algorithm in java OR String Matching Algorithm in java: ** KMP Algorithm is one of the many string search algorithms which is better suited in scenarios where ‘pattern to be searched’ remains same whereas ‘text to be searched’ changes.

**Full Source Code **cab be downloaded here

**Knuth-Morris-Pratt Algorithm (KMP) detailed analysis**

Understanding this would show us what a careful algorithmic thinker would code even for a small problem like this. This is one of the algorithms which is easy to implement but one needs lot of attention and hard work to understand the algorithm.

**Problem Statement:** lets assume we want to search for the pattern P in the text T.

T: [abcabdabcabdabcabdabdabc]

P: [abcabdabc]

This algorithm is executed in two phases.

- Preprocessing Phase
- Search Phase

**PRE PROCESSING PHASE**

In this phase, KMP algorithm creates a temporary array which is prepared from the pattern. We call this temporary array as Partial Match Table** (PMT)**, lets call this PMT as b[i].

**What is Partial Match Table?**

It is an array of size (pattern_length+1) where, for each position of i in pattern p, b[i] is defined such that it takes the ‘length of the longest proper suffix of P[1…i-1]’ that matches with the ‘prefix of P’.

**What is longest prefix/suffix match???** Proper prefix is a prefix which is not same as the substring. Recall proper set which is a subset of a set but is not same as the set. This is very important to understand the algorithm. Let me take an example and show it, considering the above pattern ‘abcabdabc’

I am taking all possible substrings in the pattern ‘abcabdabc’:

a ab abc abca abcab abcabd abcabda abcabdab abcabdabc

Now take each substring and find all possible set of proper prefixes and proper suffixes,

a : no prefix and no suffix since there is only one letter ab : prefixes[a] suffixes[b] abc : prefixes[a,ab] suffixes[c,bc] abca : prefixes[a,ab,abc] suffixes[a,ca,bca] abcab : prefixes[a,ab,abc,abca] suffixes[b,ab,cab,bcab] abcabd : prefixes[a,ab,abc,abca,abcab] suffixes[d,bd,abd,cabd,bcabd] abcabda : prefixes[a,ab,abc,abca,abcab,abcabd] suffixes[a,da,bda,abda,cabda,bcabda] abcabdab : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda] suffixes[b,ab,dab,bdab,abdab,cabdab,bcabdab] abcabdabc : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda,abcabdab] suffixes[c,bc,abc,dabc,bdabc,abdabc,cabdabc,bcabdabc,abcabdabc]

Now look at the longest proper prefix which is same as the proper suffix:

a : no prefix and no suffix since there is only one letter ab : 0 abc : 0 abca : 1 [a] abcab : 2 [ab] abcabd : 0 abcabda : 1 [a] abcabdab : 2 [ab] abcabdabc : 3 [abc]

Now start computing the Partial Match Table array values:

**b[0]** is computed by looking at longest prefix and suffix match of the substring of p[0…0-1] as p[-1] is not defined, hence b[0] is given -1

**b[1]** is computed by looking at longest prefix and suffix match of the substring of p[0…1-1] = b[0] = A, since there wont be any prefix and suffix for one letter word, b[1] is 0.

**Note: For any given pattern b[0] and b[1] are always fixed**

b[2]: is computed by looking at longest prefix and suffix match of the substring of p[0…2-1] = b[0…1] = AB = 0

**… so on, **below are the values of partial match table

b[0] = -1 b[1] = 0 b[2] = 0 . . . b[7] = 1 b[8] = 2 b[9] = 3

Final array is:

p : a b c a b d a b c p[i] : 0 1 2 3 4 5 6 7 8 b[i] : -1 0 0 0 1 2 0 1 2 3

**Note: Why a prefix should match suffix of the pattern? **its because when we shift the pattern its the prefix of P which comes towards the suffix. And also the key idea is that if we have successfully matched prefix P[1…i-1] of the pattern with the substring T[j-(i-1)…j-1] of the input string and P(i)!=T(j), then we dont need to reprocess any of the suffix T[j-(i-1)…j-1] since we know this portion of the text string is the prefix of the pattern that we have just matched.

**Pre Processing Code:**

/** * Pre processes the pattern array based on proper prefixes and proper * suffixes at every position of the array * * @param ptrn * word that is to be searched in the search string * @return partial match table which indicates */ public int[] preProcessPattern(char[] ptrn) { int i = 0, j = -1; int ptrnLen = ptrn.length; int[] b = new int[ptrnLen + 1]; b[i] = j; while (i < ptrnLen) { while (j >= 0 && ptrn[i] != ptrn[j]) { // if there is mismatch consider the next widest border // The borders to be examined are obtained in decreasing order from // the values b[i], b[b[i]] etc. j = b[j]; } i++; j++; b[i] = j; } return b; }

**Code Walkthrough:**

Lets analyze the pre processing phase:

**Standard Definitions:**

The preprocessing algorithm inner while loop:

If p(j) == p(i), A border of width j can be extended by p(i), here it doesnt go into inner while loop.

If p(j) != p(i), the next-widest border is examined by setting j = b[j]. The loop terminates at the latest if no border can be extended (j = -1).

Now let us look at how partial match table is prepared:

Provided that the values b[0], …, b[i] are already known, the value of b[i+1] is computed by checking if a border of the prefix p[0] … p[i-1] can be extended by p[i]. This is the case if p[b(i)] = p(i). The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.

**What do these definitions say? **you will understand once we walk through the code looking at the below PMT table once again

p : a b c a b d a b c p[i] : 0 1 2 3 4 5 6 7 8 b[i] : -1 0 0 0 1 2 0 1 2 3

Initially b[0] is set to -1

**At 1st loop,** i=0, j=-1, since j<0, it will not enter into inner while loop and i&j are incremented, i.e, i=1&j=0 and b[1]=0

**At 2nd loop,** i=1, j=0, since a[1] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) and i&j are incremented, i.e, i=2&j=0 and b[2]=0

**At 3rd loop,** i=2, j=0, since a[2] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[0]=-1) and i&j are incremented, i.e, i=3&j=0 and b[3]=0

**At 4th loop,** i=3, j=0, since a[3] is equal to a[0], it will not enter into inner while loop and i&j are incremented, i.e, i=4&j=1 and b[4]=1

**At 5th loop,** i=4, j=1, since a[4] is equal to a[1], it will not enter into inner while loop and i&j are incremented, i.e, i=5&j=2 and b[5]=2

**NEXT LOOP IS IMPORTANT TO UNDERSTAND**

**At 6th loop,** i=5, j=2, since a[5] is not equal to a[2], it will enter into inner while loop and set j to 0 (because b[j] i.e, b[2]=0) and still a[5] is not equal to a[0], it will again enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) .

**Why j is jumping from 2, 0, and -1??** see the definition ‘The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.’, here it is b[5] i.e 2, b[b[5]] i.e 0, b[b[b[5]]] i.e -1

Now it will exit inner while loop, increment i&j, i.e, i=6&j=0 and b[6]=0

…. so on

by now you might have understood how to evaluate other values of the array (IF NOT PLS PING I will add the other values as well)

**SEARCH PHASE**

This is pretty simple once we get the partial match table from preprocessing phase. Here we just compare first character of the text T i.e, t(i) with the first character of the pattern P i.e, p(i), if there is a match we increment the pointers i and j and go forward. If there is a mismatch then we look into the partial match table b[j] to shift the position of j to appropriate value based on the prefix/suffix match by using b[j].

Code Walkthrough

/** * Based on the pre processed array, search for the pattern in the text * * @param text * text over which search happens * @param ptrn * pattern that is to be searched */ public void searchSubString(char[] text, char[] ptrn) { int i = 0, j = 0; // pattern and text lengths int ptrnLen = ptrn.length; int txtLen = text.length; // initialize new array and preprocess the pattern int[] b = preProcessPattern(ptrn); while (i < txtLen) { while (j >= 0 && text[i] != ptrn[j]) { j = b[j]; } i++; j++; // a match is found if (j == ptrnLen) { System.out.println("found substring at index:" + (i - ptrnLen)); j = b[j]; } } }

Lets analyze the search phase:

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 Text(T) a b c a b d a b c a b d a b c a b d a b d a b c Pattern(P) a b c a b d a b c PMT (b[i]) -1 0 0 0 1 2 0 1 2 3

Here j is a variable used to track the pattern P and i is used to track the text T. Looking at the Text and Pattern, T[0…8] and P[0…8] have a perfect match, once there is a perfect match it enters into if loop at line #26, where j is set from 9 to 3 because b[9]=3; and the program continues from i=8 and j=3, going further there is a mismatch at i=11 and j=5, the program enters into inner while loop where j is set to 2 (because b[5]=2), since p[2] is not equals to t[11], program enters into inner while loop again where j is set to 0 (because b[2]=0), since p[0] is still not equals to t[11], program enters into inner while loop again where j is set to -1 (because b[0]=-1); at j=-1 program exits the inner while loop increments i and j; in next loop at i=12 and j=0, the outer while loop executes with perfect match again at T[12…20] and P[0…8]

**Advantages:**

Implementation is simple

Time complexity is O(m+n) when compared to the brute force algorithm O(mn)

This algorithm takes atmost 2n comparisions

**Disadvantages:**

Chances of mismatch increases with big sized alphabets (other than English).

**Full Source Code **cab be downloaded here

The output of this program is:

a b c a b d a b c a b e a b c a b d a b c a b d printing pattern, partial match table, and its index a b c a b d a b c -1 0 0 0 1 2 0 1 2 3 0 1 2 3 4 5 6 7 8 FOUND SUBSTRING AT i 9 and index: 0 Setting j from 9 to 3 Mismatch happened, between text char e and pattern char d, hence jumping the value of j from 5 to 2 at text index i at 11 based on partial match table Mismatch happened, between text char e and pattern char c, hence jumping the value of j from 2 to 0 at text index i at 11 based on partial match table Mismatch happened, between text char e and pattern char a, hence jumping the value of j from 0 to -1 at text index i at 11 based on partial match table FOUND SUBSTRING AT i 21 and index: 12 Setting j from 9 to 3

I know, this algorithm is too big to digest at one shot, but practising more examples and analysing different patterns with the above explanation would give you a good start. Hope you enjoy this article, happy reading ðŸ™‚ ðŸ™‚ ðŸ™‚

**REFERENCES:**

http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm

http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/stringmatchingclasses/KmpStringMatcher.java

## Recent Comments