**String Search Algorithm in java OR String Matching Algorithm in java: ** KMP Algorithm is one of the many string search algorithms which is better suited in scenarios where ‘pattern to be searched’ remains same whereas ‘text to be searched’ changes.

**Full Source Code **cab be downloaded here

**Knuth-Morris-Pratt Algorithm (KMP) detailed analysis**

Understanding this would show us what a careful algorithmic thinker would code even for a small problem like this. This is one of the algorithms which is easy to implement but one needs lot of attention and hard work to understand the algorithm.

**Problem Statement:** lets assume we want to search for the pattern P in the text T.

T: [abcabdabcabdabcabdabdabc]

P: [abcabdabc]

This algorithm is executed in two phases.

- Preprocessing Phase
- Search Phase

**PRE PROCESSING PHASE**

In this phase, KMP algorithm creates a temporary array which is prepared from the pattern. We call this temporary array as Partial Match Table** (PMT)**, lets call this PMT as b[i].

**What is Partial Match Table?**

It is an array of size (pattern_length+1) where, for each position of i in pattern p, b[i] is defined such that it takes the ‘length of the longest proper suffix of P[1…i-1]’ that matches with the ‘prefix of P’.

**What is longest prefix/suffix match???** Proper prefix is a prefix which is not same as the substring. Recall proper set which is a subset of a set but is not same as the set. This is very important to understand the algorithm. Let me take an example and show it, considering the above pattern ‘abcabdabc’

I am taking all possible substrings in the pattern ‘abcabdabc’:

a
ab
abc
abca
abcab
abcabd
abcabda
abcabdab
abcabdabc

Now take each substring and find all possible set of proper prefixes and proper suffixes,

a : no prefix and no suffix since there is only one letter
ab : prefixes[a]
suffixes[b]
abc : prefixes[a,ab]
suffixes[c,bc]
abca : prefixes[a,ab,abc]
suffixes[a,ca,bca]
abcab : prefixes[a,ab,abc,abca]
suffixes[b,ab,cab,bcab]
abcabd : prefixes[a,ab,abc,abca,abcab]
suffixes[d,bd,abd,cabd,bcabd]
abcabda : prefixes[a,ab,abc,abca,abcab,abcabd]
suffixes[a,da,bda,abda,cabda,bcabda]
abcabdab : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda]
suffixes[b,ab,dab,bdab,abdab,cabdab,bcabdab]
abcabdabc : prefixes[a,ab,abc,abca,abcab,abcabd,abcabda,abcabdab]
suffixes[c,bc,abc,dabc,bdabc,abdabc,cabdabc,bcabdabc,abcabdabc]

Now look at the longest proper prefix which is same as the proper suffix:

a : no prefix and no suffix since there is only one letter
ab : 0
abc : 0
abca : 1 [a]
abcab : 2 [ab]
abcabd : 0
abcabda : 1 [a]
abcabdab : 2 [ab]
abcabdabc : 3 [abc]

Now start computing the Partial Match Table array values:

**b[0]** is computed by looking at longest prefix and suffix match of the substring of p[0…0-1] as p[-1] is not defined, hence b[0] is given -1

**b[1]** is computed by looking at longest prefix and suffix match of the substring of p[0…1-1] = b[0] = A, since there wont be any prefix and suffix for one letter word, b[1] is 0.

**Note: For any given pattern b[0] and b[1] are always fixed**

b[2]: is computed by looking at longest prefix and suffix match of the substring of p[0…2-1] = b[0…1] = AB = 0

**… so on, **below are the values of partial match table

b[0] = -1
b[1] = 0
b[2] = 0
.
.
.
b[7] = 1
b[8] = 2
b[9] = 3

Final array is:

p : a b c a b d a b c
p[i] : 0 1 2 3 4 5 6 7 8
b[i] : -1 0 0 0 1 2 0 1 2 3

**Note: Why a prefix should match suffix of the pattern? **its because when we shift the pattern its the prefix of P which comes towards the suffix. And also the key idea is that if we have successfully matched prefix P[1…i-1] of the pattern with the substring T[j-(i-1)…j-1] of the input string and P(i)!=T(j), then we dont need to reprocess any of the suffix T[j-(i-1)…j-1] since we know this portion of the text string is the prefix of the pattern that we have just matched.

**Pre Processing Code:**

/**
* Pre processes the pattern array based on proper prefixes and proper
* suffixes at every position of the array
*
* @param ptrn
* word that is to be searched in the search string
* @return partial match table which indicates
*/
public int[] preProcessPattern(char[] ptrn) {
int i = 0, j = -1;
int ptrnLen = ptrn.length;
int[] b = new int[ptrnLen + 1];
b[i] = j;
while (i < ptrnLen) {
while (j >= 0 && ptrn[i] != ptrn[j]) {
// if there is mismatch consider the next widest border
// The borders to be examined are obtained in decreasing order from
// the values b[i], b[b[i]] etc.
j = b[j];
}
i++;
j++;
b[i] = j;
}
return b;
}

**Code Walkthrough:**

Lets analyze the pre processing phase:

**Standard Definitions:**

The preprocessing algorithm inner while loop:

If p(j) == p(i), A border of width j can be extended by p(i), here it doesnt go into inner while loop.

If p(j) != p(i), the next-widest border is examined by setting j = b[j]. The loop terminates at the latest if no border can be extended (j = -1).

Now let us look at how partial match table is prepared:

Provided that the values b[0], …, b[i] are already known, the value of b[i+1] is computed by checking if a border of the prefix p[0] … p[i-1] can be extended by p[i]. This is the case if p[b(i)] = p(i). The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.

**What do these definitions say? **you will understand once we walk through the code looking at the below PMT table once again

p : a b c a b d a b c
p[i] : 0 1 2 3 4 5 6 7 8
b[i] : -1 0 0 0 1 2 0 1 2 3

Initially b[0] is set to -1

**At 1st loop,** i=0, j=-1, since j<0, it will not enter into inner while loop and i&j are incremented, i.e, i=1&j=0 and b[1]=0

**At 2nd loop,** i=1, j=0, since a[1] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) and i&j are incremented, i.e, i=2&j=0 and b[2]=0

**At 3rd loop,** i=2, j=0, since a[2] is not equal to a[0], it will enter into inner while loop and set j to -1 (because b[0]=-1) and i&j are incremented, i.e, i=3&j=0 and b[3]=0

**At 4th loop,** i=3, j=0, since a[3] is equal to a[0], it will not enter into inner while loop and i&j are incremented, i.e, i=4&j=1 and b[4]=1

**At 5th loop,** i=4, j=1, since a[4] is equal to a[1], it will not enter into inner while loop and i&j are incremented, i.e, i=5&j=2 and b[5]=2

**NEXT LOOP IS IMPORTANT TO UNDERSTAND**

**At 6th loop,** i=5, j=2, since a[5] is not equal to a[2], it will enter into inner while loop and set j to 0 (because b[j] i.e, b[2]=0) and still a[5] is not equal to a[0], it will again enter into inner while loop and set j to -1 (because b[j] i.e, b[0]=-1) .

**Why j is jumping from 2, 0, and -1??** see the definition ‘The borders to be examined are obtained in decreasing order from the values b[i], b[b[i]] etc.’, here it is b[5] i.e 2, b[b[5]] i.e 0, b[b[b[5]]] i.e -1

Now it will exit inner while loop, increment i&j, i.e, i=6&j=0 and b[6]=0

…. so on

by now you might have understood how to evaluate other values of the array (IF NOT PLS PING I will add the other values as well)

**SEARCH PHASE**

This is pretty simple once we get the partial match table from preprocessing phase. Here we just compare first character of the text T i.e, t(i) with the first character of the pattern P i.e, p(i), if there is a match we increment the pointers i and j and go forward. If there is a mismatch then we look into the partial match table b[j] to shift the position of j to appropriate value based on the prefix/suffix match by using b[j].

Code Walkthrough

/**
* Based on the pre processed array, search for the pattern in the text
*
* @param text
* text over which search happens
* @param ptrn
* pattern that is to be searched
*/
public void searchSubString(char[] text, char[] ptrn) {
int i = 0, j = 0;
// pattern and text lengths
int ptrnLen = ptrn.length;
int txtLen = text.length;
// initialize new array and preprocess the pattern
int[] b = preProcessPattern(ptrn);
while (i < txtLen) {
while (j >= 0 && text[i] != ptrn[j]) {
j = b[j];
}
i++;
j++;
// a match is found
if (j == ptrnLen) {
System.out.println("found substring at index:" + (i - ptrnLen));
j = b[j];
}
}
}

Lets analyze the search phase:

Index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Text(T) a b c a b d a b c a b d a b c a b d a b d a b c
Pattern(P) a b c a b d a b c
PMT (b[i]) -1 0 0 0 1 2 0 1 2 3

Here j is a variable used to track the pattern P and i is used to track the text T. Looking at the Text and Pattern, T[0…8] and P[0…8] have a perfect match, once there is a perfect match it enters into if loop at line #26, where j is set from 9 to 3 because b[9]=3; and the program continues from i=8 and j=3, going further there is a mismatch at i=11 and j=5, the program enters into inner while loop where j is set to 2 (because b[5]=2), since p[2] is not equals to t[11], program enters into inner while loop again where j is set to 0 (because b[2]=0), since p[0] is still not equals to t[11], program enters into inner while loop again where j is set to -1 (because b[0]=-1); at j=-1 program exits the inner while loop increments i and j; in next loop at i=12 and j=0, the outer while loop executes with perfect match again at T[12…20] and P[0…8]

**Advantages:**

Implementation is simple

Time complexity is O(m+n) when compared to the brute force algorithm O(mn)

This algorithm takes atmost 2n comparisions

**Disadvantages:**

Chances of mismatch increases with big sized alphabets (other than English).

**Full Source Code **cab be downloaded here

The output of this program is:

a b c a b d a b c a b e a b c a b d a b c a b d
printing pattern, partial match table, and its index
a b c a b d a b c
-1 0 0 0 1 2 0 1 2 3
0 1 2 3 4 5 6 7 8
FOUND SUBSTRING AT i 9 and index: 0
Setting j from 9 to 3
Mismatch happened, between text char e and pattern char d, hence jumping
the value of j from 5 to 2 at text index i at 11 based on partial match table
Mismatch happened, between text char e and pattern char c, hence jumping
the value of j from 2 to 0 at text index i at 11 based on partial match table
Mismatch happened, between text char e and pattern char a, hence jumping
the value of j from 0 to -1 at text index i at 11 based on partial match table
FOUND SUBSTRING AT i 21 and index: 12
Setting j from 9 to 3

I know, this algorithm is too big to digest at one shot, but practising more examples and analysing different patterns with the above explanation would give you a good start. Hope you enjoy this article, happy reading ðŸ™‚ ðŸ™‚ ðŸ™‚

**REFERENCES:**

http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/kmpen.htm

http://www.inf.fh-flensburg.de/lang/algorithmen/pattern/stringmatchingclasses/KmpStringMatcher.java

## Recent Comments