TSearch Primer

Friday, February 29. 2008

Recommended Books: Lucerne in Action Solr Enterprise Search Server

What is TSearch?

TSearch is a Full-Text Search engine that is packaged with PostgreSQL. The key developers of TSearch are Oleg Bartunov and Teodor Sigaev who have also done extensive work with GiST and GIN indexes used by PostGIS, PgSphere and other projects. For more about how TSearch and OpenFTS got started check out A Brief History of FTS in PostgreSQL. Check out the TSearch Official Site if you are interested in related TSearch tips or interested in donating to this very worthy project.

Tsearch is different from regular string searching in PostgreSQL in a couple of key ways.

It is well-suited for searching large blobs of text since each word is indexed using a Generalized Inverted Index (GIN) or Generalized Search Tree (GiST) and searched using text search vectors. GIN is generally used for indexing. Search vectors are at word and phrase boundaries.
TSearch has a concept of Linguistic significance using various language dictionaries, ISpell, thesaurus, stop words, etc. therefore it can ignore common words and equate like meaning terms and phrases.
TSearch is for the most part case insensitive.
While various dictionaries and configs are available out of the box with TSearch, one can create new ones and customize existing further to cater to specific niches within industries - e.g. medicine, pharmaceuticals, physics, chemistry, biology, legal matters.

Prior to PostgreSQL 8.3, it was a contrib module located in the shared/contribs folder. As of PostgreSQL 8.3 it is now fully integrated into the PostgreSQL core. The official documents for using TSearch in 8.3 are located in Chapter 12. Full Text Search of the official PostgreSQL documentation.

In this article we shall provide a quick primer to using TSearch in 8.3. In the next month's issue of the Postgres OnLine Journal we shall provide a TSearch cheat sheet similar to our PostgreSQL 8.3 cheat sheet.

Getting Started with Using TSearch

While you can still use TSearch without creating indices, for large fields or huge tables it is highly adviced you create indices before performing TSearch queries. Below are the general steps to take to use TSearch.

Create the Index

In terms of creating indexes you have the choice of GIN or GIST indexes. Pros and cons are itemized in the Chapter 12. Full Text Search: 12.9. GiST and GIN Index Types. In a nutshell - GIST indexes are better when doing weighted queries while GIN indexes are better for standard word queries and larger texts that don't require weighting. GIST indexes are also lossy and produce more false positives thus requiring rechecking of raw data, but faster to build than GIN indexes.

Sample indexes are shown below:

--Single field index
CREATE INDEX idx_sometable_somefield 
    ON sometable 
    USING gin(to_tsvector('english', somefield)); 
--Multi-field index
CREATE INDEX idx_sometable_ts
    ON sometable 
    USING gin(to_tsvector('english', COALESCE(somefield1,'') || ' ' || COALESCE(somefield2,'')));
    
CREATE INDEX idx_sometable_ts_gist
    ON sometable 
    USING gist(to_tsvector('english', COALESCE(somefield1,'') || ' ' || COALESCE(somefield2,'')));
--Index on field of type tsvector
CREATE INDEX idx_sometable_ts
    ON sometable 
    USING gin(tsvector_field);

In the above examples we are indexing based on the English language. There are 16 options pre-packaged with PostgreSQL and TSearch is flexible enough that you can define your own custom ones - say catered for certain niche scientific or medical research disciplines. The choice of languages is listed in pg_catalog.ts_config and the default option selected if none is specified is based on your locale.

As demonstrated above, you can create an index on a tsvector type column, but that would require creating yet another column and a trigger to update it when data changes. This does provide efficiency of not having to recalculate a tsvector or specify it in your query each time you run a ts query. To not have to respecify it, you can also use a view. Much of this is well documented in 12.2. Full Text Search: Table Text Search. A good example of using Triggers to update fulltext fields and storing fulltext fields is provided in the Pagila database (check the film table) as well as the Full Text Search: 12.4.3. Triggers for Automatic Updates chapter of official docs.

Using TSearch - The Query

There are 2 concepts in TSearch that are most important - The TSearch query condition which evaluates to either true or false and the ranking which is a measure of how good the match is.

A TSearch condition uses the match operators (@@ or @@@ -- @@ is used for general searches and weighted GIST searches, and @@@ is slower but must be used for weighted GIN searches). @@ is the more common and always of the form (some_ts_vector @@ some_ts_query). The ts vector is simply what you have indexed as shown above and the secret in the sauce is the some_ts_query. The key operators in a TS Query are & (AND), | (OR) and ! (NOT)

Below are some examples of using TS Query:

--Snippet two  - examples using TQuery
--We want to check if the provided phrase contains the words dog and sick.  
--This returns true
SELECT to_tsvector('english', 'My dog is sick') 
    @@ to_tsquery('english', 'dog & SICK');
    
--This one is false because doggy is not a word boundary for dog
SELECT to_tsvector('english', 'My doggy is sick') 
    @@ to_tsquery('english', 'dog & SICK');
    
--However dogs and dog are lexically equivalent so this is true
SELECT to_tsvector('english', 'I want a dog') 
    @@@ to_tsquery('english', 'want & dogs');
    
--This one is also true
--because ski and skiing 
--are derived from same word (lexeme)
SELECT to_tsvector('english', 'I like to ski') 
    @@ to_tsquery('english', 'like & skiing');
    
--This uses the default locale
SELECT to_tsvector('My dog is sick') 
    @@ to_tsquery('dog & SICK');
    
--This searches for all views that have SUM and order in them
SELECT *
FROM information_schema.views
WHERE to_tsvector(view_definition)  @@ to_tsquery('sum & order');

--Search all views that have SUM or FILM
SELECT *
FROM information_schema.views
WHERE to_tsvector(view_definition)  @@ to_tsquery('sum | film');

--Search all view definitions that have sum but are not about films
SELECT *
FROM information_schema.views
WHERE to_tsvector(view_definition)  @@ to_tsquery('sum & !film');

--Search all view definitions with sum and store but not about film
SELECT *
FROM information_schema.views
WHERE to_tsvector(view_definition)  @@ to_tsquery('(sum & store) & !film ');

Ranking Results

TSearch provides 2 ranking functions - ts_rank and ts_rank_cd. The CD in ts_rand_cd stands for Cover Density. The ranking functions rank the relevance of a ts vector of a document to a ts query based on proximity of words, length of document, and weighting of terms. The higher the rank, the more relevant the document. The ts_rand_cd ranking function penalizes results where the search terms are further apart. The ts_rand does not penalize based on distance. One can further control weight based on position or section in a record using setweight. Some examples are shown below:

--Weight positions are demarcated by the letters A, B, C, D
--Create a fulltext field where the title is 
--marked as weight position A and description is weight position B
ALTER TABLE film 
    ADD COLUMN ftext_weighted tsvector;
UPDATE film 
    SET ftext_weighted = (setweight(to_tsvector(title), 'A') 
        || setweight(to_tsvector(description), 'B'));
CREATE INDEX idx_books_ftext_weighted
    ON film
    USING gin(ftext_weighted);
    
--List top 3 films about Mysql that are epic, documentary or chocolate 
--NOTE: the {0,0,0.10,0.90} corresponds 
--to weight positions D,C, B, A and the sum of the array should be 1
-- which means 
--weight the title higher than the summary
--NOTE: we are doing a subselect here because if we don't the expensive
--highlight function gets called all the results that match the WHERE, not just the highest 3
SELECT title, description, therank,
ts_headline(title || ' ' || description, q, 
    'StartSel = <em>, StopSel = </em>, HighlightAll=TRUE') as htmlmarked_summary
FROM (SELECT title, description,  
    ts_rank('{0,0,0.10,0.90}', ftext_weighted, q) as therank, q
FROM film, to_tsquery('(epic | documentary | chocolate) & mysql') as q
WHERE ftext_weighted @@ q
ORDER BY therank DESC
LIMIT 3) As results;

--List top 3 films with (chocolate, secretary , or mad) and (mysql or boring) in the title or description
--Note the {0,0,0.90,0.10} corresponds to weight positions 
--D,C, B, A which means based on how we weighted our index
--weight the title higher than the summary.  
--This time we are using ts_rank_cd which will penalize
--query words that are further apart
--For highlighting this uses the default ts_headline which is to make terms bold
SELECT title, description,  therank, 
    ts_headline(title || ' ' || description, q) as htmlmarked_summary
FROM (SELECT title, description, 
    ts_rank_cd('{0,0,0.9,0.10}', ftext_weighted, q) as therank, q
    FROM film,
        to_tsquery('(chocolate | secretary | mad) & (mysql | boring)') as q
    WHERE ftext_weighted @@ q
    ORDER BY therank DESC
    LIMIT 3) As results;
    
--This is same as previous except in our filtering 
--we only want to count secretary and mad (:A)
-- if it appears in the title of the document
--NOTE: Since we are using a GIN index, we need to use the slower @@@
SELECT title, description,  therank, ts_headline(title || ' ' || description, q) as htmlmarked_summary
FROM (SELECT title, description, 
    ts_rank_cd('{0,0,0.9,0.10}', ftext_weighted, q) as therank, q
    FROM film,
        to_tsquery('(chocolate | secretary:A | mad:A) & (mysql | boring)') as q
    WHERE ftext_weighted @@@ q
    ORDER BY therank DESC
    LIMIT 3) As results;

Additional Features

TSearch also provides some added functions such as:

highlighting locations of matches with html markup, using the ts_headline function as we have demonstrated above.
Gathering statistics about your documents (e.g which words are filler) with ts_stat to better cater the searching to your specific document set.
Debugging functions for showing debugging information

We'll provide more examples in the upcoming cheat sheet.

Support of FullText in Other Databases

As a side note: MySQL and the major commercial offerings have Full Text Search capabilities as well, but all have different full text search query syntax so there is no real standard one can rely on as far as portability goes. MySQL has integrated FullText search and has for some time, but is limited to only MyISAM tables. MySQL 5.1 seems to have some enhanced features in Full Text Search over prior versions that make it more configurable than prior versions and easier to integrate plugins. SQL Server also has FullText search, but from experience has been always somewhat awkward to use, and it relied on an additional service. The upcoming SQL Server 2008 Full Text search is supposed to be more integrated. Not sure about the other popular DBMSs Oracle, IBM DB2 etc.

We are also not sure about the speed comparisons between the various offerings. Osku Salerma's Masters thesis Design of a Full Text Search index for a database management system written in 2006, provides a cursory comparison of how Full Text Search is implemented in Oracle, MySQL, and PostgreSQL as well as a description of full text terminology and indexing strategies and other fulltext non-database search engines such as Lucerne. I think both MySQL and PostgreSQL have changed a great deal in terms of their Full Text Search speed and offerings so applying the comparison to current version of each is probably a bit unfair.

MySQL has an interesting feature called Full Text Query Expansion which allows results of a first level query to be used to find other related results. I'm not sure how good this is or if it produces garbage results and not sure if its natural language mode has improved since the above article was written. GIST and GIN indexes have definitely improved so TSearch is probably faster than it was 2 years ago.

Other related Resources

Ian Barwick Converting tsearch2 to 8.3
Robert Treat's PostgreSQL full text search testing PART II - has links to other part.
Esoteric Curio's related blog entry to Robert Treat's Benchmarking Lucene

Posted by Leo Hsu and Regina Obe in beginner, contrib spotlight, tsearch at 12:19 | Comment (1) | Trackbacks (0)

Trackbacks

Trackback specific URI for this entry

No Trackbacks

Comments

Display comments as (Linear | Threaded)

No comments

Add Comment

Name
Email
Homepage
In reply to
Comment	E-Mail addresses will not be displayed and will only be used for E-Mail notifications. To prevent automated Bots from commentspamming, please enter the string you see in the image below in the appropriate input box. Your comment will only be submitted if the strings match. Please ensure that your browser supports and accepts cookies, or your comment cannot be verified correctly. Enter the string from the spam-prevention image above: Phone* What is one plus eight?
	Remember Information? Subscribe to this entry

TSearch Primer

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting

Friday, February 29. 2008

TSearch Primer

What is TSearch?

Getting Started with Using TSearch

Create the Index

Using TSearch - The Query

Ranking Results

Additional Features

Support of FullText in Other Databases

Other related Resources

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

TSearch Primer

Postgres OnLine Journal PostGIS in Action About the Authors Consulting

Friday, February 29. 2008

TSearch Primer

What is TSearch?

Getting Started with Using TSearch

Create the Index

Using TSearch - The Query

Ranking Results

Additional Features

Support of FullText in Other Databases

Other related Resources

Entry's Links

Quicksearch

Calendar

Categories

Archives

Subscribe

Blog Administration

Postgres OnLine Journal

PostGIS in Action About the Authors Consulting