For those people coming from Oracle, SQL Server and MySQL or other databases that have soundex functionality,
you may be puzzled, or even frustrated when you try to do
something like
WHERE soundex('Wushington') = soundex('Washington')
in PostgreSQL and get a function does not exist error.
Well it does so happen that there is a soundex function in PostgreSQL, and yes it is
also called soundex, but is offered as a contrib module and not installed by default. It also has other fuzzy string matching functions in addition to soundex.
One of my favorites, the levenshenstein distance function is included as well. In this article
we'll be covering the contrib module packaged as fuzzystrmatch.sql. Details of the module can be found in FuzzyStrMatch.
The contrib module has been around for sometime, but has changed slightly from PostgreSQL version to PostgreSQL version. We are covering the 8.4 version in this article.
For those unfamiliar with soundex, its a basic approach developed by the US Census in the 1930s as a way of sorting
names by pronounciation. Read Census and Soundex for more gory history details.
Given that it is an approach designed primarily for the English alphabet, it sort of makes sense why its not built-in to PostgreSQL,
which has more of a diverse international concern. For example if you used it to compare two words in Japanese or Chinese,
don't think it would fair too well in any of the database platforms that support this function.
The original soundex algorithm has been improved over the years. Though its still the most common used today, newer variants
exist called MetaPhone developed in the 1990s and Double Metaphone (DMetaPhone) developed in 2000 that support additional
consonants in other languages such as Slavic, Celtic, Italian, Spanish etc.
These two variants are also included in the fuzzystrmatch contrib library. The soundex function still seems to be
the most popularly used at least for U.S. This is perhaps because most of the other databases (Oracle, SQL Server, MySQL) have soundex built-in but not the metaphone variants.
So in a sense soundex is a more portable function. The other reason is that metaphone and dmetaphone take up a bit more space and
are also more processor intensive to compute than soundex. We'll demonstrate some differences between them in this article.
To enable soundex and the other fuzzy string matching functions included, just run the
share/contrib/fuzzystrmatch.sql located in your PostgreSQL install folder. This library is an important piece of arsenal for geocoding and genealogy tracking particularly
the U.S. streets and surnames data sets. I come from a long line of Minors, Miners, Burnettes and Burnets.
For the next set of exercises, we will be using the places dataset we created in Importing Fixed width data into PostgreSQL with just PSQL.
Using Soundex
One of the useful things about soundex, metaphone, and dmetaphone functions in PostgreSQL is that you can index them to get faster performance
when searching. Below is a simple example of creating a functional index with soundex and using it.
CREATE INDEX idx_places_sndx_loc_name
ON places USING btree (soundex(loc_name));
Although the index is not necessary, it improves speed fairly significantly of queries for larger datasets. Regardless
of if you add an index or not, you would use the soundex function in a construct such as below.
SELECT loc_name
FROM places
WHERE soundex(loc_name) = soundex('Abeville')
ORDER BY loc_name;
-- output --
loc_name
----------------
Abbeville city
Abbeville city
Abbeville city
Abbeville city
Abbeville town
Abbyville city
Abie village
The above result wasn't too bad, but what if we try
SELECT loc_name
FROM places
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY loc_name;
--- results not as appealing --
loc_name
----------------------
Racine city
Racine city
Raisin City CDP
Rawson city
Regan city
Regina CDP
Riggins city
Rising City village
Rising Star town
Rising Sun-Lebanon CDP
Rising Sun city
Rising Sun town
Risingsun village
Rison city
Rockingham city
Ruskin CDP
Metaphone
As stated soundex, is not as accurate as metaphone and dmetaphone particularly when dealing with non-English languages.
If we were to try the above exercises with metaphone, how would they fair?
Metaphone takes an additional argument that defines the length of the metaphone encoding. The longer you make this
the more exact the match needs to be. If we repeat our above exercise:
SELECT loc_name
FROM places
WHERE metaphone(loc_name,5) = metaphone('rising sun',5)
ORDER BY loc_name;
-- Result is much more appealing
loc_name
-----------------------
Rising City village
Rising Star town
Rising Sun-Lebanon CDP
Rising Sun city
Rising Sun town
Risingsun village
If we increase our sensitivity a little more, we get fewer and fewer matches till we get nothing. For example:
SELECT loc_name
FROM places
WHERE metaphone(loc_name,6) = metaphone('rising sun',6)
ORDER BY loc_name;
-- Yields
loc_name
-----------------------
Rising Sun-Lebanon CDP
Rising Sun city
Rising Sun town
Risingsun village
Double MetaPhone
There are two dmetaphone functions in Postgres. There is regular dmetaphone and dmetaphone_alt. The dmetaphone
you can think of as returning the primary pronounciation and is similar to what metaphone returns. The dmetaphone_alt will return an alternative
pronounciation which may or may not be different from the dmetaphone. If its a non-English word, then the two results tend to be different.
For this next exercise, we'll try a slightly foreign place name.
SELECT loc_name
FROM places
WHERE dmetaphone(loc_name) = dmetaphone('Elle Cahon')
ORDER BY loc_name;
--The above returns no answers, but if instead we use the alt --
SELECT loc_name
FROM places
WHERE dmetaphone_alt(loc_name) = dmetaphone_alt('Elle Cahon')
ORDER BY loc_name;
-- We get
loc_name
---------------
El Cajon city
If you use soundex for the above matching, you get a bunch of results, none of which are El Cajon. If you use metaphone and set sensitivity to 4, you get some results,
but again none are El Cajon.
What is a levenshtein distance?
A levenshtein distance is in a nutshell a measure of how many edits you need to make to a source string to convert it to
the target string. So it is the spelling distance between two words.
It is another function packaged in fuzzy string match contrib and also a common favorite for geocoding. Since one of the arguments
is not constant for a given column (you are comparing two strings), you can't index it like you can soundex and other soundex variants. It is as a result
often used as a secondary search filter to further rank the outputs returned by the soundex family of functions.
In its simplest form, it takes two strings and returns the number of character edits needed to convert one string to another. Below is an example
use:
SELECT loc_name , levenshtein(loc_name, 'rising sun')
FROM places
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY levenshtein(loc_name, 'rising sun');
---
loc_name | levenshtein
------------------------+-------------
Rising Sun city | 7
Rising Sun town | 7
Regina CDP | 7
Ruskin CDP | 7
Rison city | 7
Racine city | 8
Riggins city | 8
Racine city | 8
Rawson city | 9
Rising Star town | 9
Regan city | 9
Raisin City CDP | 10
Risingsun village | 10
Rockingham city | 11
Rising City village | 13
Rising Sun-Lebanon CDP | 14
As you can see our best two matches float to the top when we order by the levenshtein distance, but longer text names are being severely penalized. In order
to penalize less, we can define weights for insertions/deletions and substitutions by using the other variant of levenshtein
that takes the form levenshtein(text source, text target, int insert_cost, int delete_cost, int substitution_cost)
SELECT loc_name , levenshtein(loc_name, 'rising sun',1,0,4)
FROM places
WHERE soundex(loc_name) = soundex('rising sun')
ORDER BY levenshtein(loc_name, 'rising sun',1,0,4);
-- With the above we penalize most for substitutions and none for deletions --
-- This makes all our risings float to the top even if they are longish --
loc_name | levenshtein
------------------------+-------------
Rising Sun town | 2
Risingsun village | 2
Rising Sun-Lebanon CDP | 2
Rising Sun city | 2
Rising Star town | 3
Rising City village | 4
Raisin City CDP | 5
Rison city | 6
Ruskin CDP | 6
Riggins city | 6
Rockingham city | 6
Rawson city | 7
Regina CDP | 7
Racine city | 7
Racine city | 7
Regan city | 8
Tracked: Jul 21, 18:14
Tracked: Jul 21, 18:20
Tracked: Jan 16, 20:27