SQLAlchemy and full text searching in PostgreSQL

by Noufal Ibrahim

May 17, 2019 in Technology, Databases

PostgreSQL has support for full text search. The basic idea is to create a column of type tsvector and then run full text queries (represented as tsquery strings) against it using the @@ operator. This is different from LIKE queries with %string% patterns since full text search is language aware and can provide features like ranking. As an example,

SELECT 'I am satisfied with postgresql' LIKE '%satisfied%' as found;

will return

 found 
-------
 t

However, if we use "%satisfy%" as the query string, it will fail since LIKE is a purely textual match.

SELECT 'I am satisfied with postgresql' LIKE '%satisfy%' as found;
 found 
-------
 f

If we use a full text query, we will get it right though.

SELECT to_tsvector('I am satisfied with postgresql') @@ to_tsquery('satisfy') as found;
 found 
-------
 t
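Since the search is language aware, matches can also be scored. As a quick sketch using the same sample sentence (the actual score will depend on your text search configuration), ts_rank rates a document against a query:

```sql
-- Rank how well the document matches the query; higher is a better match.
SELECT ts_rank(to_tsvector('I am satisfied with postgresql'),
               to_tsquery('satisfy')) AS rank;
```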

SQLAlchemy is the Python database toolkit of choice. It supports most of the popular databases out there and has dialect specific features for MySQL, PostgreSQL etc. The tsvector type is not natively supported. It’s not very hard to add support for it, but I couldn’t find a single reference that showed me how. After some trial and error, I managed to get it to work (at least for my purposes), so I’m going to write out what I did. I’ll put in links to the articles, posts and other materials on the net that helped me get this working.

Feedback is welcome, as are suggestions on how to make this fully feature complete. Once it’s sufficiently polished, I’ll contribute it as a patch back to SQLAlchemy.

THE TABLE WE WANT

First, I create a table like so

CREATE TABLE example (
 name VARCHAR(10),
 details TEXT
);

And then, I insert 1000 rows into this using

COPY example FROM '/home/noufal/projects/scratch/sa/items.csv' (FORMAT csv);

Now, I have a database to play with. I can do full text searches like so

SELECT * FROM example WHERE to_tsvector('english', details) @@ to_tsquery('life') LIMIT 3;
 name | details 
---------+-------------------------------------------------------------------------------------------------------------------------------------------------
 item-20 | Life is a grand adventure -- or it is nothing. -- Helen Keller 
 item-46 | Life is a gamble at terrible odds; if it was a bet you wouldn't take it. -- Tom Stoppard; Rosencrantz and Guildenstern are Dead 
 item-63 | Life is like a 10 speed bicycle. Most of us have gears we never use. -- C. Schultz

The 'english' argument to to_tsvector is optional. If I skip it, it’ll use the default text search configuration.
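To see what the default actually is, you can inspect the default_text_search_config setting (the value shown on your installation may of course differ):

```sql
-- Display the configuration to_tsvector falls back to
-- when no explicit one is given.
SHOW default_text_search_config;
```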

Full text items that can be searched are referred to as documents in PostgreSQL. Now, I create an extra column to hold the document to be searched like so.

ALTER TABLE example ADD COLUMN details_tsvector TSVECTOR;

Then, I run an update on the table that creates the tsvector documents and puts them into this column like so

UPDATE example SET details_tsvector = to_tsvector(details);

Now, a plain SELECT on the table will show the actual tsvector documents.

SELECT name, details_tsvector FROM example LIMIT 3;
 name | details_tsvector 
--------+---------------------------------------------------------
 item-1 | 'bathroom':7 'left':2 'wallet':4
 item-2 | 'bit':7 'could':11 'difficult':3 'effort':9 'imposs':13
 item-3 | 'avoid':3 'hedg':4 'least':6 'think':11

I can search the table like so

SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life') LIMIT 3;
 name | details 
---------+-------------------------------------------------------------------------------------------------------------------------------------------------
 item-20 | Life is a grand adventure -- or it is nothing. -- Helen Keller 
 item-46 | Life is a gamble at terrible odds; if it was a bet you wouldn't take it. -- Tom Stoppard; Rosencrantz and Guildenstern are Dead 
 item-63 | Life is like a 10 speed bicycle. Most of us have gears we never use. -- C. Schultz

This, of course, is faster than the earlier approach since the documents have already been generated. Running an EXPLAIN shows how the query is actually executed.

EXPLAIN SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life');
 QUERY PLAN 
----------------------------------------------------------
 Seq Scan on example  (cost=0.00..53.30 rows=26 width=92)
   Filter: (details_tsvector @@ to_tsquery('life'::text))

One advantage of tsvector columns (over LIKE queries) is that they can be indexed. If I add an index like so,

CREATE INDEX details_idx ON example USING gin(details_tsvector);

The output of the EXPLAIN command changes.

EXPLAIN SELECT name, details FROM example WHERE details_tsvector @@ to_tsquery('life');
 QUERY PLAN 
----------------------------------------------------------------------------
 Bitmap Heap Scan on example  (cost=12.20..50.17 rows=26 width=92)
   Recheck Cond: (details_tsvector @@ to_tsquery('life'::text))
   ->  Bitmap Index Scan on details_idx  (cost=0.00..12.20 rows=26 width=0)
         Index Cond: (details_tsvector @@ to_tsquery('life'::text))

which will be quicker.

One problem with this approach is that the details_tsvector column is not automatically updated when new rows are added. I can fix this by creating a trigger that computes and stores the value whenever a row is INSERTed or UPDATEd.

CREATE TRIGGER details_tsvector_update BEFORE INSERT OR UPDATE
ON example
FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details');

Now, if I run an INSERT statement, the details_tsvector will get updated.
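For instance, inserting a hypothetical row (the name and details here are made up) and selecting it back should show that the trigger has populated the column:

```sql
-- The trigger fires before the insert and fills in details_tsvector.
INSERT INTO example (name, details)
VALUES ('item-1001', 'Brevity is the soul of wit.');

SELECT name, details_tsvector FROM example WHERE name = 'item-1001';
```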

The table definition now looks like this

 Table "public.example"
 Column | Type | Modifiers 
------------------+-----------------------+-----------
 name | character varying(10) | 
 details | text | 
 details_tsvector | tsvector | 
Indexes:
 "details_idx" gin (details_tsvector)
Triggers:
 details_tsvector_update BEFORE INSERT OR UPDATE ON example FOR EACH ROW EXECUTE PROCEDURE tsvector_update_trigger('details_tsvector', 'pg_catalog.english', 'details')

This is what I need for my application but I need to do it using SQLAlchemy inside my application rather than in SQL. The rest of the article will cover this.