Fuzzy string matching with Trigram and Trigraphs

In an earlier article Where is Soundex and other Fuzzy string things we covered the PostgreSQL contrib module fuzzstrmatch which contains the very popular function soundex that is found in other popular relational databases. We also covered the more powerful levenshtein distance, metaphone and dmetaphone functions included in fuzzstrmatch, but rarely found in other relational databases.

As far as fuzzy string matching goes, PostgreSQL has other functions up its sleeves. This time we will cover the contrib module pg_trgm which was introduced in PostgreSQL 8.3. pgtrgm uses a concept called trigrams for doing string comparisons. The pg_trgm module has several functions and gist/gin operators. Like other contrib modules, you just need to run the /share/contrib/pg_trgm.sql file packaged in your PostgreSQL install to enable it in your database.

For this set of exercises, we'll use trigrams to compare words using the same set of data we tested with soundex and metaphones. For the next set of exercises, we will be using the places dataset we created in Importing Fixed width data into PostgreSQL with just PSQL.

The most useful are the similarity function and the % operator. The % operator allows for using a GIST/GIN index and the similarity function allows for narrowing your filter similar to what levenshtein did for us in fuzzstrmatch.

What do PostgreSQL trigrams look like

A text string is composed of n number of trigrams which in PostgreSQL is displayed as an array of 3 character text values stored which consist of all 3 character consecutive groups that can be formed from the text. You can inspect this with the show_trgm function.

It's probably easiest to get a sense of what they are by looking at some. Below are some texts in random order with accompanying trigrams for each text. We chose only locations of less than 10 characters so that we could easily display the trigrams.

One of the simplest and my favorite is that for Leo.

SELECT show_trgm('Leo');

Which yields: {" l"," le","eo ",leo}


SELECT loc_name, show_trgm(loc_name) As tg 
FROM places 
WHERE length(loc_name) < 10
ORDER BY random() limit 10;



-- Result is much more appealing
 loc_name  |                          tg
-----------+-------------------------------------------------------
 Spade CDP | {"  c","  s"," cd"," sp",ade,cdp,"de ","dp ",pad,spa}
 Espy CDP  | {"  c","  e"," cd"," es",cdp,"dp ",esp,"py ",spy}
 Paia CDP  | {"  c","  p"," cd"," pa",aia,cdp,"dp ","ia ",pai}
 Keys CDP  | {"  c","  k"," cd"," ke",cdp,"dp ",eys,key,"ys "}
 Ulen town | {"  t","  u"," to"," ul","en ",len,own,tow,ule,"wn "}
 Lead city | {"  c","  l"," ci"," le","ad ",cit,ead,ity,lea,"ty "}
 Wray city | {"  c","  w"," ci"," wr","ay ",cit,ity,ray,"ty ",wra}
 Mona town | {"  m","  t"," mo"," to",mon,"na ",ona,own,tow,"wn "}
 Derby CDP | {"  c","  d"," cd"," de","by ",cdp,der,"dp ",erb,rby}
 Sells CDP | {"  c","  s"," cd"," se",cdp,"dp ",ell,lls,"ls ",sel}

Building GIST/GIN indexes for trigram

Before we can take full advantage of these functions and opertors, we need to put an index on our text field. pgtrgm supports both GIST and GIN very similar in concept to the TSearch. The main difference between the two indexes is that GIST is slower to query, but faster to build and GIN gives better search performance but slower to build. for our relatively small dataset of 35,000 some odd records, the performance between the two is about the same and took between 1 and 2 seconds to index all entries, with no real noticeable difference between using GIST or GIN in index speed on our PostgreSQL 9.0 beta install.


CREATE INDEX idx_places_trgm_gist_loc_name
   ON places USING gist (loc_name gist_trgm_ops);
  
-- we chose to build with gin
CREATE INDEX idx_places_trgm_gin_loc_name
   ON places USING gin(loc_name gin_trgm_ops);



Doing comparisons

Just like soundex, the % operator and similarity functions are case insensitive. The pg_trigram approach seems especially good at catching misspellings and allows you to easily control your tolerance for a misspelling. It doesn't suffer from the same issues fuzzystrmatch does when dealing with unicode characters, but will also not catch similar pronunciations as effectively.

Now to use we can use the % operator in conjunction with the similarity function or by itself. The similarity function will always return a number between 0 and 1 which is a measure of how similar two text strings are ; with 1 having same set of trigrams.


SELECT loc_name, similarity(loc_name, 'abeville') As sim_score
FROM places 
WHERE loc_name % 'abeville' AND similarity(loc_name, 'abeville') > 0.35
ORDER BY loc_name;





-- output --
    loc_name    | sim_score
----------------+-----------
 Abbeville city |       0.5
 Abbeville city |       0.5
 Abbeville city |       0.5
 Abbeville city |       0.5
 Abbeville town |       0.5

Let us try a similar example to what we tried with soundex.

SELECT loc_name, similarity(loc_name, 'rising sun') As sim_score 
FROM places 
WHERE loc_name % 'rising sun' AND similarity(loc_name, 'rising sun') > 0.35
ORDER BY similarity(loc_name, 'rising sun') DESC, loc_name;



        loc_name        | sim_score
------------------------+-----------
 Rising Sun city        |    0.6875
 Rising Sun town        |    0.6875
 Rising Sun-Lebanon CDP |  0.478261
 Rising Star town       |       0.4
 Risingsun village      |  0.380952

Controlling default weights

What would happen if we used % without the added similarity filter? It would behave as if the similarity filter were set at the default weight. To see the default weight we use this statement.


SELECT show_limit();

--output --
0.3

You can lower or increase the weight with set_limit();

As mentioned. Trigrams are good for catching misspellings, but not as good as soundex and dmetaphone for catching same sound words spelled differently. To demonstrate, if we picked our El Cajon example from before where our dmetaphone excelled and return El Cajon.


SELECT loc_name, similarity(loc_name, 'Elle Cahon') As sim_score 
FROM places
WHERE loc_name % 'Elle Cahon'
ORDER BY similarity(loc_name, 'Elle Cahon') DESC, loc_name;



    loc_name    | sim_score
----------------+-----------
 Ellenton CDP   |  0.333333
 Ellaville city |       0.3
 Ellendale city |       0.3
 Ellendale city |       0.3

It is obviously not as good as our dmetaphone.