There once existed programmers who were asked to explain this snippet of code: 1 + 2
- The C programmer explained "It's a common mathematical expression."
- The C++, Java, C# and other impure object-oriented programmers said "We concur. It's a common mathematical expression."
- The Smalltalk programmer explained "1 adds 2."
- The Lisp programmer stood up, a bit in disgust, and said, "No no! You are doing it all wrong!"
The Lisp Programmer then pulled out
a Polish calculator, punched in + 1 2
,and with a very serious face, explained
"+ should be pushing those other two around."
I find this episode interesting because while the Lisp programmer I feel is more right, the Smalltalk programmer has managed to follow the rest of the crowd and still stick
to her core principle. This brings us to what does this have to do with trigrams
in PostgreSQL 9.1. Well just like 1 + 2
being a common mathematical expression, abc LIKE '%b%'
is a common logical relational database expression that we have long taken for granted as not an indexable operation in most
databases (not any other database to I can think of) until PostgreSQL 9.1, which can utilize trigram indices (the Lisp programmer behind the curtain) to make it fast.
There are 2 main enhancements happening with trigrams in PostgreSQL 9.1
both of which depesz has already touched on in FASTER LIKE/ILIKE
and KNNGIST. This means you can have an even faster trigram search than you ever
have had before and you can do it in such a fashion that doesn't require any PostgreSQL trigram specific syntactical expressions. So while PostgreSQL 9.1 might be understanding LIKE much like all the other databases
you work with, if you have a trigram index in place, it will just be doing it a little faster and sometimes a lot faster using the more clever PostgreSQL 9.1 planner.
This is one example of how you can use applications designed for many databases and still be able to utilize advanced features in
your database of choice. In this article we'll demonstrate.
For this example we'll use a table of 490,000 someodd records consisting of Massachusetts street segments and their names excerpted from TIGER 2010 data. You can
download the trimmed data set from here if you want to play along.
The old story of using Btree Indexes
Normally if you plan to do LIKE searches, you would create a btree ops on the column or functional index on the upper/lower of the column. So here is the
basic old in action:
CREATE INDEX idx_featnames_short_fullname_btree_ops
ON featnames_short USING btree (fullname varchar_pattern_ops);
vacuum analyze featnames_short;
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE 'Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE 'Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 8.45
Total Cost: 8.47
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 0.120
Actual Total Time: 0.121
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Index Scan"
Parent Relationship: "Outer"
Scan Direction: "Forward"
Index Name: "idx_featnames_short_fullname_btree_ops"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 0.00
Total Cost: 8.34
Plan Rows: 41
Plan Width: 11
Actual Startup Time: 0.024
Actual Total Time: 0.093
Actual Rows: 48
Actual Loops: 1
Index Cond: "(((fullname)::text ~>=~ 'Devonshire'::text) AND ((fullname)::text ~<~ 'Devonshirf'::text))"
Filter: "((fullname)::text ~~ 'Devonshire%'::text)"
Triggers:
Total Runtime: 0.161
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE '%Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE '%Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 9768.55
Total Cost: 9768.57
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 125.773
Actual Total Time: 125.774
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 0.00
Total Cost: 9768.45
Plan Rows: 40
Plan Width: 11
Actual Startup Time: 24.995
Actual Total Time: 125.708
Actual Rows: 48
Actual Loops: 1
Filter: "((fullname)::text ~~ '%Devonshire%'::text)"
Triggers:
Total Runtime: 125.823
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE '%Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE '%Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 9768.55
Total Cost: 9768.57
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 842.275
Actual Total Time: 842.276
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Seq Scan"
Parent Relationship: "Outer"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 0.00
Total Cost: 9768.45
Plan Rows: 40
Plan Width: 11
Actual Startup Time: 113.184
Actual Total Time: 842.190
Actual Rows: 48
Actual Loops: 1
Filter: "((fullname)::text ~~* '%Devonshire%'::text)"
Triggers:
Total Runtime: 842.320
The new way
Well in the new way the btree is sometimes faster for some LIKE scenarios, but can't be employed in all where as the trigram works for all.
In the new way we supplement our btree with a trigram index or use a trigram index instead
to do most ILIKE searches and LIKE '%me%' like searches.
Here is the same exercise repeated with adding on a trigram index.
CREATE EXTENSION pg_trgm;
CREATE INDEX idx_featnames_short_fullname_trgm_gist
ON featnames_short USING gist (fullname gist_trgm_ops);
vacuum analyze featnames_short;
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE 'Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname LIKE 'Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 8.44
Total Cost: 8.46
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 0.121
Actual Total Time: 0.123
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Index Scan"
Parent Relationship: "Outer"
Scan Direction: "Forward"
Index Name: "idx_featnames_short_fullname_btree_ops"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 0.00
Total Cost: 8.34
Plan Rows: 40
Plan Width: 11
Actual Startup Time: 0.024
Actual Total Time: 0.092
Actual Rows: 48
Actual Loops: 1
Index Cond: "(((fullname)::text ~>=~ 'Devonshire'::text) AND ((fullname)::text ~<~ 'Devonshirf'::text))"
Filter: "((fullname)::text ~~ 'Devonshire%'::text)"
Triggers:
Total Runtime: 0.167
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE 'Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE 'Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 152.77
Total Cost: 152.79
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 3.647
Actual Total Time: 3.649
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Bitmap Heap Scan"
Parent Relationship: "Outer"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 4.76
Total Cost: 152.67
Plan Rows: 40
Plan Width: 11
Actual Startup Time: 3.403
Actual Total Time: 3.613
Actual Rows: 48
Actual Loops: 1
Recheck Cond: "((fullname)::text ~~* 'Devonshire%'::text)"
Plans:
- Node Type: "Bitmap Index Scan"
Parent Relationship: "Outer"
Index Name: "idx_featnames_short_fullname_trgm_gist"
Startup Cost: 0.00
Total Cost: 4.75
Plan Rows: 40
Plan Width: 0
Actual Startup Time: 3.378
Actual Total Time: 3.378
Actual Rows: 48
Actual Loops: 1
Index Cond: "((fullname)::text ~~* 'Devonshire%'::text)"
Triggers:
Total Runtime: 3.720
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE '%Devonshire%';
EXPLAIN (ANALYZE true, FORMAT yaml)
SELECT DISTINCT fullname
FROM featnames_short
WHERE fullname ILIKE '%Devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 152.77
Total Cost: 152.79
Plan Rows: 2
Plan Width: 11
Actual Startup Time: 5.586
Actual Total Time: 5.588
Actual Rows: 8
Actual Loops: 1
Plans:
- Node Type: "Bitmap Heap Scan"
Parent Relationship: "Outer"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 4.76
Total Cost: 152.67
Plan Rows: 40
Plan Width: 11
Actual Startup Time: 5.351
Actual Total Time: 5.553
Actual Rows: 48
Actual Loops: 1
Recheck Cond: "((fullname)::text ~~* '%Devonshire%'::text)"
Plans:
- Node Type: "Bitmap Index Scan"
Parent Relationship: "Outer"
Index Name: "idx_featnames_short_fullname_trgm_gist"
Startup Cost: 0.00
Total Cost: 4.75
Plan Rows: 40
Plan Width: 0
Actual Startup Time: 5.325
Actual Total Time: 5.325
Actual Rows: 48
Actual Loops: 1
Index Cond: "((fullname)::text ~~* '%Devonshire%'::text)"
Triggers:
Total Runtime: 5.655
Some old rules still apply
If your queries are writtern something like below then the below query will not use an index:
SELECT fullname
FROM featnames_short
WHERE upper(fullname) LIKE '%DEVONSHIRE%';
There is no index above that will work for this, however if you put in an index on:
CREATE INDEX idx_featnames_short_ufullname_trgm_gist
ON featnames_short USING gist (upper(fullname) gist_trgm_ops);
What about citext and case insensitive LIKE found in MySQL and SQL Server
Sadly, we couldn't figure out how to get citext to use an index regardless of what we did. Though the hack we described a while ago of turning your PostgreSQL varchar() columns into
case insensitive columns and putting a trigram gist index on upper(fullname) seems to work just dandy.
SELECT DISTINCT fullname FROM featnames_short WHERE fullname LIKE '%devonshire%';
- Plan:
Node Type: "Aggregate"
Strategy: "Hashed"
Startup Cost: 4113.74
Total Cost: 4114.62
Plan Rows: 88
Plan Width: 12
Actual Startup Time: 2.848
Actual Total Time: 2.850
Actual Rows: 9
Actual Loops: 1
Plans:
- Node Type: "Bitmap Heap Scan"
Parent Relationship: "Outer"
Relation Name: "featnames_short"
Alias: "featnames_short"
Startup Cost: 138.38
Total Cost: 4106.67
Plan Rows: 2825
Plan Width: 12
Actual Startup Time: 2.725
Actual Total Time: 2.826
Actual Rows: 51
Actual Loops: 1
Recheck Cond: "(upper((fullname)::text) ~~ 'DEVONSHIRE%'::text)"
Plans:
- Node Type: "Bitmap Index Scan"
Parent Relationship: "Outer"
Index Name: "idx_featnames_short_ufullname_trgm_gist"
Startup Cost: 0.00
Total Cost: 137.67
Plan Rows: 2825
Plan Width: 0
Actual Startup Time: 2.710
Actual Total Time: 2.710
Actual Rows: 51
Actual Loops: 1
Index Cond: "(upper((fullname)::text) ~~ 'DEVONSHIRE%'::text)"
Triggers:
Total Runtime: 2.883
Be case insensitive (with a varchar datatype), at the expense of making all your varchars in the database case insensitive.