Where is soundex and other warm and fuzzy string things

For those people coming from Oracle, SQL Server and MySQL or other databases that have soundex functionality, you may be puzzled, or even frustrated when you try to do something like
WHERE soundex('Wushington') = soundex('Washington')
in PostgreSQL and get a function does not exist error.

Well it does so happen that there is a soundex function in PostgreSQL, and yes it is also called soundex, but is offered as a contrib module and not installed by default. It also has other fuzzy string matching functions in addition to soundex. One of my favorites, the levenshenstein distance function is included as well. In this article we'll be covering the contrib module packaged as fuzzystrmatch.sql. Details of the module can be found in FuzzyStrMatch. The contrib module has been around for sometime, but has changed slightly from PostgreSQL version to PostgreSQL version. We are covering the 8.4 version in this article.

For those unfamiliar with soundex, its a basic approach developed by the US Census in the 1930s as a way of sorting names by pronounciation. Read Census and Soundex for more gory history details.

Given that it is an approach designed primarily for the English alphabet, it sort of makes sense why its not built-in to PostgreSQL, which has more of a diverse international concern. For example if you used it to compare two words in Japanese or Chinese, don't think it would fair too well in any of the database platforms that support this function.

The original soundex algorithm has been improved over the years. Though its still the most common used today, newer variants exist called MetaPhone developed in the 1990s and Double Metaphone (DMetaPhone) developed in 2000 that support additional consonants in other languages such as Slavic, Celtic, Italian, Spanish etc. These two variants are also included in the fuzzystrmatch contrib library. The soundex function still seems to be the most popularly used at least for U.S. This is perhaps because most of the other databases (Oracle, SQL Server, MySQL) have soundex built-in but not the metaphone variants. So in a sense soundex is a more portable function. The other reason is that metaphone and dmetaphone take up a bit more space and are also more processor intensive to compute than soundex. We'll demonstrate some differences between them in this article.

To enable soundex and the other fuzzy string matching functions included, just run the share/contrib/fuzzystrmatch.sql located in your PostgreSQL install folder. This library is an important piece of arsenal for geocoding and genealogy tracking particularly the U.S. streets and surnames data sets. I come from a long line of Minors, Miners, Burnettes and Burnets.

For the next set of exercises, we will be using the places dataset we created in Importing Fixed width data into PostgreSQL with just PSQL.

Using Soundex

One of the useful things about soundex, metaphone, and dmetaphone functions in PostgreSQL is that you can index them to get faster performance when searching. Below is a simple example of creating a functional index with soundex and using it.


CREATE INDEX idx_places_sndx_loc_name
   ON places USING btree (soundex(loc_name));


Although the index is not necessary, it improves speed fairly significantly of queries for larger datasets. Regardless of if you add an index or not, you would use the soundex function in a construct such as below.


SELECT loc_name 
FROM places 
WHERE soundex(loc_name) = soundex('Abeville') 
ORDER BY loc_name;


-- output --
    loc_name
----------------
 Abbeville city
 Abbeville city
 Abbeville city
 Abbeville city
 Abbeville town
 Abbyville city
 Abie village

The above result wasn't too bad, but what if we try

SELECT loc_name 
FROM places 
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY loc_name;


--- results not as appealing --
       loc_name
----------------------
Racine city
Racine city
Raisin City CDP
Rawson city
Regan city
Regina CDP
Riggins city
Rising City village
Rising Star town
Rising Sun-Lebanon CDP
Rising Sun city
Rising Sun town
Risingsun village
Rison city
Rockingham city
Ruskin CDP

Metaphone

As stated soundex, is not as accurate as metaphone and dmetaphone particularly when dealing with non-English languages. If we were to try the above exercises with metaphone, how would they fair?

Metaphone takes an additional argument that defines the length of the metaphone encoding. The longer you make this the more exact the match needs to be. If we repeat our above exercise:

SELECT loc_name 
FROM places 
WHERE metaphone(loc_name,5) = metaphone('rising sun',5)
ORDER BY loc_name;



-- Result is much more appealing
       loc_name
-----------------------
Rising City village
Rising Star town
Rising Sun-Lebanon CDP
Rising Sun city
Rising Sun town
Risingsun village

If we increase our sensitivity a little more, we get fewer and fewer matches till we get nothing. For example:

SELECT loc_name 
FROM places  
WHERE metaphone(loc_name,6) = metaphone('rising sun',6)
ORDER BY loc_name;


-- Yields
        loc_name
-----------------------
 Rising Sun-Lebanon CDP
 Rising Sun city
 Rising Sun town
 Risingsun village

Double MetaPhone

There are two dmetaphone functions in Postgres. There is regular dmetaphone and dmetaphone_alt. The dmetaphone you can think of as returning the primary pronounciation and is similar to what metaphone returns. The dmetaphone_alt will return an alternative pronounciation which may or may not be different from the dmetaphone. If its a non-English word, then the two results tend to be different. For this next exercise, we'll try a slightly foreign place name.

SELECT loc_name 
FROM places
WHERE dmetaphone(loc_name) = dmetaphone('Elle Cahon')
ORDER BY loc_name;


--The above returns no answers, but if instead we use the alt --
SELECT loc_name 
FROM places  
WHERE dmetaphone_alt(loc_name) = dmetaphone_alt('Elle Cahon')
ORDER BY loc_name;



-- We get
   loc_name
---------------
 El Cajon city

If you use soundex for the above matching, you get a bunch of results, none of which are El Cajon. If you use metaphone and set sensitivity to 4, you get some results, but again none are El Cajon.

What is a levenshtein distance?

A levenshtein distance is in a nutshell a measure of how many edits you need to make to a source string to convert it to the target string. So it is the spelling distance between two words.

It is another function packaged in fuzzy string match contrib and also a common favorite for geocoding. Since one of the arguments is not constant for a given column (you are comparing two strings), you can't index it like you can soundex and other soundex variants. It is as a result often used as a secondary search filter to further rank the outputs returned by the soundex family of functions.

In its simplest form, it takes two strings and returns the number of character edits needed to convert one string to another. Below is an example use:

SELECT loc_name , levenshtein(loc_name, 'rising sun')
FROM places
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY levenshtein(loc_name, 'rising sun');


---
        loc_name        | levenshtein
------------------------+-------------
 Rising Sun city        |           7
 Rising Sun town        |           7
 Regina CDP             |           7
 Ruskin CDP             |           7
 Rison city             |           7
 Racine city            |           8
 Riggins city           |           8
 Racine city            |           8
 Rawson city            |           9
 Rising Star town       |           9
 Regan city             |           9
 Raisin City CDP        |          10
 Risingsun village      |          10
 Rockingham city        |          11
 Rising City village    |          13
 Rising Sun-Lebanon CDP |          14

As you can see our best two matches float to the top when we order by the levenshtein distance, but longer text names are being severely penalized. In order to penalize less, we can define weights for insertions/deletions and substitutions by using the other variant of levenshtein that takes the form levenshtein(text source, text target, int insert_cost, int delete_cost, int substitution_cost)

SELECT loc_name , levenshtein(loc_name, 'rising sun',1,0,4)
FROM places
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY levenshtein(loc_name, 'rising sun',1,0,4);


-- With the above we penalize most for substitutions and none for deletions --
-- This makes all our risings float to the top even if they are longish --
        loc_name        | levenshtein
------------------------+-------------
 Rising Sun town        |           2
 Risingsun village      |           2
 Rising Sun-Lebanon CDP |           2
 Rising Sun city        |           2
 Rising Star town       |           3
 Rising City village    |           4
 Raisin City CDP        |           5
 Rison city             |           6
 Ruskin CDP             |           6
 Riggins city           |           6
 Rockingham city        |           6
 Rawson city            |           7
 Regina CDP             |           7
 Racine city            |           7
 Racine city            |           7
 Regan city             |           8