Catalysoft   Turtle
home products articles about us contact us

Recent Articles

What's Good About Clojure?

Clojure is a relatively new language to appear on the Java Virtual Machine (JVM), although it draws on very mature roots in the form of the LISP langu ...

Should You Care About Requirements Engineering?

Recently, I (Adil) was invited to participate in a one day seminar on the subject of Requirements Engineering. Whilst I have no direct experience of t ...

Tips for Setting Up Your First Business Website

To attract all potential customers to your business you need a presence on the web. The problem is that if you haven't set up a website before, you p ...

What's Good about LISP?

LISP is a general-purpose programming language and is the second-oldest programming language still in use, but how much do you know about it? Did you ...

Open Source Tools for Developers: Why They Matter

From a developer's point of view use of open-source tools has advantages beyond the obvious economic ones. With the open-source database MySQL in mind ...

How to Strike a Match

Discuss this article >>>

Introduction

In my previous article Tame the Beast by Matching Similar Strings, I presented a brief survey of approximate string matching algorithms, and argued their importance for information retrieval tasks. A classic example of information retrieval using similarity searching is entering a keyword into the search string box on Amazon's web site in order to retrieve descriptions of products related to that keyword. Approximate string matching algorithms can be classified as equivalence algorithms and similarity ranking algorithms. In this article, I present a new similarity ranking algorithm, together with its associated string similarity metric. I also include Java source code, so you can easily incorporate the algorithm into your own applications.

The algorithm has been successfully applied to the retrieval of terms from a domain-specific electronic thesaurus, and also to the retrieval of geographical place names.

The algorithm was driven by the following requirements:

A true reflection of lexical similarity - strings with small differences should be recognised as being similar. In particular, a significant substring overlap should point to a high level of similarity between the strings.

A robustness to changes of word order - two strings which contain the same words, but in a different order, should be recognised as being similar. On the other hand, if one string is just a random anagram of the characters contained in the other, then it should (usually) be recognised as dissimilar.

Language Independence - the algorithm should work not only in English, but in many different languages.

For example, 'FRANCE' should be similar to 'FRANÇAIS' and 'REPUBLIC OF FRANCE', and 'REPUBLIC OF FRANCE' should be similar to both 'FRENCH REPUBLIC' and 'REPUBLIQUE FRANCAISE'. We can also make relative statements of similarity. For example 'FRENCH' should be more similar to 'FRENCH FOOD' than it is to 'FRENCH CUISINE', because the size of the common substring is the same in both cases and 'FRENCH FOOD' is the shorter of the two strings.

Existing Algorithms

Existing algorithms, such as the Soundex Algorithm, Edit Distance, and Longest Common Substring, do not perform well against these requirements. (See my previous article for descriptions of the algorithms.) The Soundex Algorithm is an equivalence algorithm, so simply states whether or not two strings are similar. However, it would not recognise any similarity between 'FRANCE' and 'REPUBLIC OF FRANCE', as they start with different letters. The Edit Distance algorithm would acknowledge some similarity between the two strings, but would rate 'FRANCE' and 'QUEBEC' (with a distance of 6) to be more similar than 'FRANCE' and 'REPUBLIC OF FRANCE' (which have a distance of 12). The Longest Common Substring would give 'FRANCE' and 'REPUBLIC OF FRANCE' quite a good rating of similarity (a common substring of length 6). However, it is disappointing that according to this metric, the string 'FRENCH REPUBLIC' is equally similar to the two strings 'REPUBLIC OF FRANCE' and 'REPUBLIC OF CUBA'.

The New Metric

Given the drawbacks of the existing algorithms, I decided to invent a new string similarity metric that rewards both common substrings and a common ordering of those substrings. In addition, I wanted the algorithm to consider not only the single longest common substring, but other common substrings too.

The requirements may sound difficult to satisfy, but my solution is deceptively simple:

Find out how many adjacent character pairs are contained in both strings.

The intention is that by considering adjacent characters, I take account not only of the characters, but also of the character ordering in the original string, since each character pair contains a little information about the original ordering.

Let me explain the algorithm by comparing the two strings 'France' and 'French'. First, I map them both to their upper case characters (making the algorithm insensitive to case differences), then split them up into their character pairs:

FRANCE: {FR, RA, AN, NC, CE}

FRENCH: {FR, RE, EN, NC, CH}

Then I work out which character pairs are in both strings. In this case, the intersection is {FR, NC}. Now, I would like to express my finding as a numeric metric that reflects the size of the intersection relative to the sizes of the original strings. If pairs(x) is the function that generates the pairs of adjacent letters in a string, then my numeric metric of similarity is:

The similarity between two strings s1 and s2 is twice the number of character pairs that are common to both strings divided by the sum of the number of character pairs in the two strings. (The vertical bars in the formula mean ?size of?.) Note that the formula rates completely dissimilar strings with a similarity value of 0, since the size of the letter-pair intersection in the numerator of the fraction will be zero. On the other hand, if you compare a (non-empty) string to itself, then the similarity is 1. For our comparison of 'FRANCE' and 'FRENCH', the metric is computed as follows:

Given that the values of the metric always lie between 0 and 1, it is also very natural to express these values as percentages. For example, the similarity between 'FRANCE' and 'FRENCH' is 40%. From now on, I will express similarity values as percentages, rounded to the nearest whole number.

Ranking Results

Typically, we don't just want to know how similar two strings are. We want to know which of a set of known strings are most similar to a particular string. For example, which of the strings 'Heard', 'Healthy', 'Help', 'Herded', 'Sealed' or 'Sold' is most similar to the string 'Healed'?

All we need do to answer the question is find the similarity between 'Healed' and each of the other words, and then rank the results in order of these values. The results for this example are presented in Table 1.

Word

Similarity

Sealed

80%

Healthy

55%

Heard

44%

Herded

40%

Help

25%

Sold

0%

Table 1: Find the Most Similar Word to 'Healed'

A Real World Example

Now for an example that is a little closer to the real world. Suppose we are building a web site that sells books, and we want to allow our users to type in a string to search against the titles of books in our database. When the user?s string does not exactly match any of the books in our database, we return a list of the books whose title is most similar.

Table 2 contains a list of book titles in its first column. If we search against these book titles with the query strings 'Web Database Applications', 'PHP Web Applications' and 'Web Aplications' using the metric described, then we get the results shown in the six right-most columns of the table. (Note the intentional misspelling of 'Applications' in the third of these strings.) The 'Rank' columns indicate the order in which the books would be returned. Each rank is qualified by its corresponding percentage similarity value (rounded to the nearest whole number) in the next column. The three input strings were carefully chosen to target the book 'Web Database Applications with PHP & MySQL', but with a decreasing quality of input string as you move from 'Web Database Applications' to 'Web Aplications'.

 

 

Web Database Applications

PHP Web Applications

Web Aplications

Book Title

Rank

Value

Rank

Value

Rank

Value

 

Web Database Applications with PHP & MySQL

1

82%

1

68%

1

59%

 

Creating Database Web Applications with PHP and ASP

2

71%

3

59%

3

50%

 

Building Database Applications on the Web Using PHP3

3

70%

4

58%

4

49%

 

Building Web Database Applications with Visual Studio 6

4

67%

5

47%

5

46%

 

Web Application Development With PHP

5

51%

2

67%

2

56%

 

WebRAD: Building Database Applications on the Web with Visual FoxPro and Web Connection

6

49%

6

34%

6

32%

 

Structural Assessment: The Role of Large and Full-Scale Testing

7

12%

8

7%

8

7%

 

How to Find a Scholarship Online

8

10%

7

11%

7

12%

 

Table 2: Three Searches for a Book

Two main results are shown in the table. Firstly, the targeted book is scored as the top match in each case, and secondly, the similarity values decrease as you move left to right in the table, reflecting the intended degradation in the quality of the input.

When presenting results of searches to users, you would normally set a threshold value of say, 20%, and only show the top 10 or so results that lie above the threshold. (We wouldn't want to lose the trust of our users by showing results that are irrelevant.)

It is worth pointing out that I devised this example by searching the books at Amazon.com with exactly the same three input strings as shown above, and selecting the top matches. Interestingly, for the third query 'Web Aplications', Amazon does not show any web application development books in its search results. In fact, Amazon's top match is "How to Find a Scholarship Online", a result which appears to confuse writing web applications with writing applications on the web! This suggests that Amazon's search mechanism is not robust to spelling errors, but nevertheless uses a similarity metric based on product descriptions in an attempt to return a good match.

A Java Implementation

In this section, I describe an implementation of the similarity algorithm in Java. The code is not difficult to understand, so if Java is not your main programming language, it should not be a big task to translate it to your favourite language.

Overview

I have implemented the algorithm as a single class with three static methods. Two of those methods are private, so only one, compareStrings(), is exposed to users of the class. Here is the outline of the class, with the bodies of the methods removed:

import java.util.ArrayList;

 

public class LetterPairSimilarity {

   public static double compareStrings(String str1, String str2) {}

   private static ArrayList wordLetterPairs(String str) {}

   private static String[] letterPairs(String str) {}

}

Until now, I have not discussed how I deal with white space in the input strings. My approach is to dismiss from consideration any character pairs that include white space. In effect, this means first finding the words (or tokens) in your input strings before looking for adjacent character pairs. You can now see how this approach maps to the methods outlined above. CompareStrings() calls wordLetterPairs() for each of its inputs. WordLetterPairs() identifies the words in its input string, and uses the method letterPairs() to compute the pairs of adjacent characters in each of those words. When compareStrings() has collected the character pairs for each of its input strings, it computes the return value according to the similarity metric.

I will now discuss the bodies of the methods in detail, in a bottom-up order.

Computing Letter Pairs

The basis of the algorithm is the method that computes the pairs of characters contained in the input string. This method creates an array of Strings to contain its result. It then iterates through the input string, to extract character pairs and store them in the array. Finally, the array is returned.

/** @return an array of adjacent letter pairs contained in the input string */

   private static String[] letterPairs(String str) {

       int numPairs = str.length()-1;

       String[] pairs = new String[numPairs];

       for (int i=0; i<numPairs; i++) {

           pairs[i] = str.substring(i,i+2);

       }

       return pairs;

   }

 

Taking Account of White Space

This method uses the split() method of the String class to split the input string into separate words, or tokens. It then iterates through each of the words, computing the character pairs for each word. The character pairs are added to an ArrayList, which is returned from the method. An ArrayList is used, rather than an array, because we do not know in advance how many character pairs will be returned. (At this point, the program doesn?t know how much white space the input string contains.)

/** @return an ArrayList of 2-character Strings. */

   private static ArrayList wordLetterPairs(String str) {

       ArrayList allPairs = new ArrayList();

       // Tokenize the string and put the tokens/words into an array

       String[] words = str.split("\\s");

       // For each word

       for (int w=0; w < words.length; w++) {

           // Find the pairs of characters

           String[] pairsInWord = letterPairs(words[w]);

           for (int p=0; p < pairsInWord.length; p++) {

               allPairs.add(pairsInWord[p]);

           }

       }

       return allPairs;

   }

Computing the Metric

This public method computes the character pairs from the words of each of the two input strings, then iterates through the ArrayLists to find the size of the intersection. Note that whenever a match is found, that character pair is removed from the second array list to prevent us from matching against the same character pair multiple times. (Otherwise, 'GGGGG' would score a perfect match against 'GG'.)

/** @return lexical similarity value in the range [0,1] */

   public static double compareStrings(String str1, String str2) {

       ArrayList pairs1 = wordLetterPairs(str1.toUpperCase());

       ArrayList pairs2 = wordLetterPairs(str2.toUpperCase());

       int intersection = 0;

       int union = pairs1.size() + pairs2.size();

       for (int i=0; i<pairs1.size(); i++) {

           Object pair1=pairs1.get(i);

           for(int j=0; j<pairs2.size(); j++) {

               Object pair2=pairs2.get(j);

               if (pair1.equals(pair2)) {

                   intersection++;

                   pairs2.remove(j);

                   break;

               }

           }

       }

       return (2.0*intersection)/union;

   }

Summary

In this article, I presented some requirements for a similarity ranking algorithm such that it would be applicable to many domains, in many languages. I described a metric and associated algorithm that meets those requirements by comparing the adjacent character pairs contained in two strings. I illustrated the algorithm with examples, and presented a Java implementation.

To close the article, I should like to return to some of the examples quoted at the start for which I argued the inadequacy of the existing algorithms. Contrary to the Soundex algorithm and the Edit Distance, my algorithm rates the strings 'FRANCE' and 'REPUBLIC OF FRANCE' to have a good similarity of 56%. On the other hand, the strings 'FRANCE' and 'QUEBEC' are seen to be reassuringly dissimilar, with a similarity of 0%. And 'FRENCH REPUBLIC' is more similar to 'REPUBLIC OF FRANCE' than it is to 'REPUBLIC OF CUBA' with similarities of 72% and 61%, respectively.

In my next article, I will use my similarity metric in an attempt to discover which language had the biggest influence on Tolkien's artificial languages (such as Elvish and Dwarvish) in 'Lord of the Rings'.

Addendum I invented this algorithm in 1992 in the context of a project that needed such an approach. A reader of this article has pointed out that the algorithm I describe is the same as Dice's Coefficient, which I was not aware of at the time I rediscovered the algorithm or at the time that I wrote this article.


Discuss this article >>>


Simon White