We've been playing around with new phrase text feature of full-text PostgreSQL 9.6. In doing so, I was curious how big a number one can designate as max that words can be apart. I discovered thru trial and error, that the magic number is 16384 which is much bigger than I had suspected.
To test out the new full text search phrase feature, we downloaded data from https://archive.org/download/stackexchange. We picked the data science archive since it's relatively small. Data is prepped in stackexchange_posts.zip.
Load up the data from psql with:
\encoding utf8;
\i stackexchange_posts.sql
Add full text index
ALTER TABLE posts ADD COLUMN fts tsvector;
UPDATE posts SET fts =
setweight( to_tsvector(COALESCE(tags,'') ), 'A') ||
setweight( to_tsvector(COALESCE(title,'') ) , 'B') ||
setweight( to_tsvector(COALESCE(body,'') ), 'C');
CREATE INDEX idx_posts_fts_gin ON posts USING gin(fts);
Do a sample query:
SELECT count(*)
FROM posts
WHERE fts @@ to_tsquery('social <5> ( science | scientist)') and title > '';
count ---------------- 1
Up this to 20
SELECT count(*)
FROM posts
WHERE fts @@ to_tsquery('social <20> ( science | scientist)') and title > '';
count ---------------- 6
Let's get crazy
SELECT count(*)
FROM posts
WHERE fts @@ to_tsquery('social <20000000> ( science | scientist)') and title > '';
ERROR: distance in phrase operator should not be greater than 16384
and just to confirm the error is right, I did try going up one more and exactly 16384.
On a related note, we've started work on our 3rd edition of PostgreSQL: Up and Running, slated to be due out in January 2017, and hoping for pre-release around time of PostgreSQL 9.6 release. 2nd edition did not cover Full text at all, which I regreted. Full Text will have a fairly nice size showing in the 3rd edition. There will be quite a bit of coverage of 9.5 and 9.6 features in it.