Boyer-Moore Pattern-Matching Algorithm


The problem is to determine whether a given (presumably small) string pattern is a substring of a larger string searchString. With the naive method, you start at the beginning of searchString and compare left to right until you find either a full match or a mismatch; on a mismatch you slide pattern one character to the right and start over. This can compare each character of searchString several times. In the worst case you make as many as m*n comparisons, where m is the length of pattern and n is the length of searchString. This can be expensive if searchString is very long.
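As a rough sketch of the naive method (the function name naive_search and the comparison counter are illustrative additions, not part of the original notes), the following Python counts how many character comparisons are made:

```python
def naive_search(pattern, search_string):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(search_string)
    comparisons = 0
    # Try every alignment of pattern against search_string, left to right.
    for start in range(n - m + 1):
        j = 0
        while j < m:
            comparisons += 1
            if pattern[j] != search_string[start + j]:
                break  # mismatch: slide one character to the right
            j += 1
        if j == m:
            return start, comparisons
    return -1, comparisons


# Near-worst case: every alignment matches until the last pattern character.
print(naive_search("aab", "aaaaaaaa"))  # (-1, 18)
```

Here m = 3 and n = 8, and each of the 6 alignments costs 3 comparisons, which shows how the count grows toward m*n.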

What Boyer and Moore discovered is that you can do much better than this simply by comparing from right to left instead of from left to right. Lay pattern against searchString and start comparing at the last position in pattern (position m-1 for us), moving left until you get a mismatch. On a mismatch, shift pattern right by the larger of 1 and d minus the number of characters already matched, where d is the distance from the rightmost occurrence of the mismatched searchString character in pattern to the end of pattern (that is, m-1 minus its position), or m if that character does not occur in pattern at all.

This alone means you may not need to look at every character of searchString, so the total number of comparisons can be less than n. The cost is that you must precompute the shift amount for each character in the alphabet.
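A minimal sketch of this first part alone, assuming 0-based positions and a dictionary in place of a full alphabet table (the names build_last_occurrence and bad_character_search are hypothetical). The shift used is the larger of 1 and (distance from the rightmost occurrence of the mismatched character to the end of pattern) minus (characters matched), with the distance taken as m when the character is absent:

```python
def build_last_occurrence(pattern):
    """Precompute, for each character in pattern, its rightmost index."""
    last = {}
    for index, ch in enumerate(pattern):
        last[ch] = index  # later occurrences overwrite earlier ones
    return last


def bad_character_search(pattern, search_string):
    """Return (index of first match or -1, number of character comparisons)."""
    m, n = len(pattern), len(search_string)
    if m == 0:
        return 0, 0
    last = build_last_occurrence(pattern)
    comparisons = 0
    start = 0
    while start <= n - m:
        j = m - 1  # compare right to left
        while j >= 0:
            comparisons += 1
            if pattern[j] != search_string[start + j]:
                break
            j -= 1
        if j < 0:
            return start, comparisons
        matched = m - 1 - j
        rightmost = last.get(search_string[start + j], -1)
        # Distance from the rightmost occurrence to the end of pattern;
        # this equals m when the character is not in pattern (rightmost == -1).
        distance = m - 1 - rightmost
        start += max(1, distance - matched)
    return -1, comparisons


print(bad_character_search("abc", "xxxxxxabc"))  # (6, 5)
```

In this example only 5 comparisons are made against a searchString of length 9, because each mismatch on 'x' lets pattern jump a full 3 positions.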

Boyer-Moore is not finished, however. A second part notes that if you matched a suffix of pattern from the right before hitting a mismatch, that suffix may also occur earlier in pattern. If so, you can slide pattern right so that the earlier occurrence lies under the part of searchString you just matched.
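The sketch below illustrates this second part and combines it with the character-based shift. It makes several assumptions not stated in the notes: the suffix table is built by brute force for readability (production implementations compute it in time linear in m), an earlier occurrence is allowed to run off the left end of pattern, and the names good_suffix_shifts and boyer_moore_search are hypothetical.

```python
def good_suffix_shifts(pattern):
    """shifts[j] = smallest shift that realigns the suffix pattern[j+1:]
    (already matched) with an earlier copy of itself in pattern."""
    m = len(pattern)
    shifts = [1] * m
    for j in range(m):
        for s in range(1, m + 1):
            # After shifting by s, every matched position i must still see
            # the same character, unless it has run off the left end.
            if all(i - s < 0 or pattern[i - s] == pattern[i]
                   for i in range(j + 1, m)):
                shifts[j] = s
                break
    return shifts


def boyer_moore_search(pattern, search_string):
    """Return the index of the first match, or -1 if there is none."""
    m, n = len(pattern), len(search_string)
    if m == 0:
        return 0
    last = {ch: i for i, ch in enumerate(pattern)}  # rightmost positions
    good = good_suffix_shifts(pattern)
    start = 0
    while start <= n - m:
        j = m - 1
        while j >= 0 and pattern[j] == search_string[start + j]:
            j -= 1
        if j < 0:
            return start
        # Character-based shift: line up the rightmost occurrence of the
        # mismatched character (may be zero or negative, hence the max).
        char_shift = j - last.get(search_string[start + j], -1)
        start += max(1, char_shift, good[j])
    return -1


print(good_suffix_shifts("abab"))               # [2, 2, 2, 1]
print(boyer_moore_search("abab", "abaabab"))    # 3
```

For "abab", a mismatch after matching the suffix "ab" gives a shift of 2, since "ab" occurs again two positions earlier in pattern.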

With the two parts combined, for a pattern longer than 5 characters and a reasonably large searchString, the cost is approximately 0.3*n. This is considerably faster than the best other algorithms, which have a cost of approximately 1.1*n to 1.2*n.
