UpStats has had job search and freelancer search for a while, but they were based on basic pattern matching and were quite slow. There are currently several options for full-text search; the most popular are still Elasticsearch and Lucene. However, when full-text search is available right in the database, it's much easier to compose various types of queries. Postgres has full-text search built in (and, as with anything, there are pros and cons to using it). Below I'll describe how full-text search was implemented in UpStats.

I liked the overview here; it gives a good intro to full-text search in Postgres. The discussion in this blogpost is similar but more focused on the ORM aspects, and REFRESH MATERIALIZED VIEW CONCURRENTLY is used inside a trigger that runs on INSERT/UPDATE/DELETE/TRUNCATE. As was mentioned in that post, the MV does offer an advantage: you can pull/aggregate data from multiple related tables into one place and apply to_tsvector to certain combinations of columns. The problem, however, is that the MV takes a long time to refresh.

So the REFRESH MATERIALIZED VIEW CONCURRENTLY statement is very slow, and the trigger sets it to run after each operation [1] on the table that holds the data to be indexed. If a collector run inserts ~20 rows, the MV has to be rebuilt for each of them (in my case, 30 seconds for each new row inserted).
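To see why that hurts, a quick back-of-the-envelope calculation with the figures quoted above:

```python
# Cost of refreshing the MV once per inserted row; both figures
# (rows per batch, seconds per refresh) are the ones from the text.
rows_per_batch = 20      # rows inserted by one collector run
refresh_seconds = 30     # one REFRESH MATERIALIZED VIEW CONCURRENTLY

total_seconds = rows_per_batch * refresh_seconds
print(f"{total_seconds} s (~{total_seconds // 60} min) per collector batch")
# → 600 s (~10 min) per collector batch
```

Ten minutes of refresh work for twenty rows of new data clearly doesn't scale.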

Individual triggers on the odesk_job table to update the materialized view aren't an option either because, as explained here, you can't run INSERT/UPDATE/DELETE/TRUNCATE operations on materialized views.

Incremental updates are also not available for materialized views.

So a different approach was required; I decided to replace the MV with a regular table containing tsvector columns.

Below is a partial schema diagram describing the two tables involved:

A new SQLAlchemy model was required:

from sqlalchemy import ForeignKey, or_, and_
from sqlalchemy import Column,     \
        Integer, String, Boolean,  \
        DateTime, Table,           \
        UniqueConstraint, Unicode, \
        LargeBinary, Text, Float, Index
from models.shared import dbs, OdeskBase
from sqlalchemy.orm import relationship, backref, foreign, remote, aliased
from sqlalchemy.orm.collections import attribute_mapped_collection
from sqlalchemy.orm.session import object_session
from sqlalchemy.dialects.postgresql import TSVECTOR

class OdeskSearchJob(OdeskBase):
    __tablename__ = 'odesk_search_job'
    __table_args__ = (
            Index(
                    'idx_odesk_search_job_tsv_basic',
                    'tsv_basic',
                    postgresql_using='gin'),
            Index(
                    'idx_odesk_search_job_tsv_full',
                    'tsv_full',
                    postgresql_using='gin'),
    )
    job_id = Column(String(300), primary_key=True, index=True)
    date_created = Column(Integer, index=True)
    tsv_basic = Column(TSVECTOR)
    tsv_full = Column(TSVECTOR)

After generating the Alembic migration, some changes had to be made so it could compute the full-text index data. To that end, a new trigger had to be added, together with a function the trigger would run.

The trigger runs every time a job is inserted into or deleted from odesk_job, or every time the odesk_job.snippet column gets updated.

CREATE TRIGGER tsv_odesk_job_snippet_trigger
    AFTER INSERT OR DELETE OR UPDATE OF snippet
    ON odesk_job FOR EACH ROW
    EXECUTE PROCEDURE trig_fulltext_refresh();

The trigger runs the following function, which computes the TSVECTOR data for odesk_search_job [2] from the data in odesk_job:

CREATE OR REPLACE FUNCTION trig_fulltext_refresh() RETURNS trigger AS
$$
BEGIN
    IF (TG_OP = 'INSERT') THEN
        INSERT INTO odesk_search_job (job_id, date_created, tsv_basic, tsv_full)
        VALUES (NEW.job_id, NEW.date_created,
                to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet)),
                to_tsvector(''));
    ELSIF (TG_OP = 'UPDATE') THEN
        WITH upsert AS (
            UPDATE odesk_search_job
               SET tsv_basic = to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet))
             WHERE job_id = OLD.job_id
            RETURNING *
        )
        INSERT INTO odesk_search_job (job_id, date_created, tsv_basic, tsv_full)
        SELECT OLD.job_id, NEW.date_created,
               to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet)),
               to_tsvector('')
         WHERE NOT EXISTS (SELECT 1 FROM upsert);
    ELSIF (TG_OP = 'DELETE') THEN
        DELETE FROM odesk_search_job WHERE job_id = OLD.job_id;
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$
LANGUAGE plpgsql;

The INSERT branch above will insert a new row in odesk_search_job for every new row in odesk_job. The TSVECTOR columns (tsv_basic, tsv_full) will contain data about the tokens present in the job text and their positions inside the text of the job ad.

The UPDATE branch in the function uses an upsert to refresh the TSVECTOR when the corresponding record in odesk_job changes. The tsv_full column is not being used yet, but I will probably use it in the near future.

The DELETE branch simply mimics the same operation from the odesk_job table.
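The mirroring logic of the three branches can be sketched as a toy in-memory model. This is only an illustration, not UpStats code; tokenize() is a crude stand-in for to_tsvector(), and the dict plays the role of the odesk_search_job table:

```python
# Toy model of the trigger: keep a "search table" (dict) in sync
# with the source rows via insert/update/delete handlers.
search = {}  # job_id -> set of tokens (stand-in for a tsvector)

def tokenize(text):
    # Crude stand-in for to_tsvector(): lowercase and split on whitespace
    return set(text.lower().split())

def on_insert(job):
    search[job['job_id']] = tokenize(job['job_title'] + ' ' + job['snippet'])

def on_update(job):
    # Upsert semantics: overwrite the existing row, or insert if missing
    on_insert(job)

def on_delete(job):
    search.pop(job['job_id'], None)

on_insert({'job_id': 'a1', 'job_title': 'Java dev', 'snippet': 'Python too'})
print(search['a1'])   # tokens from title + snippet
on_delete({'job_id': 'a1'})
```

The real trigger does the same thing row-by-row in plpgsql, which is what makes it cheap compared to rebuilding a materialized view.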

Let's run a simple full-text search and see how it performs:

    SELECT job_id FROM odesk_search_job
    WHERE tsv_basic @@ 'java & python'::tsquery;

In the query above we're searching for jobs that contain both the terms "java" and "python" (Postgres's syntax for these types of queries is documented here).

                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on odesk_search_job  (cost=37.25..642.49 rows=161 width=20) (actual time=3.336..9.036 rows=534 loops=1)
   Recheck Cond: (tsv_basic @@ '''java'' & ''python'''::tsquery)
   Heap Blocks: exact=524
   ->  Bitmap Index Scan on idx_odesk_search_job_tsv_basic  (cost=0.00..37.21 rows=161 width=0) (actual time=3.107..3.107 rows=534 loops=1)
         Index Cond: (tsv_basic @@ '''java'' & ''python'''::tsquery)
 Planning time: 1.675 ms
 Execution time: 9.214 ms
(7 rows)

The EXPLAIN ANALYZE output above times a simple SELECT statement; it takes about 9 milliseconds, and we can see that the GIN index on the tsv_basic column is being used.

The equivalent query in SQLAlchemy would look like this:

db_session.query(OdeskSearchJob) \
    .filter(OdeskSearchJob.tsv_basic.match(search_terms, postgresql_regconfig='english'))
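For the raw-tsquery form of the query, the search terms need to be joined with tsquery operators. A minimal sketch of such a helper (make_tsquery is a hypothetical name, not part of UpStats; server-side, plainto_tsquery() does roughly the same job, plus stemming):

```python
import re

def make_tsquery(raw_terms):
    """Turn user input like 'Java Python' into a tsquery string
    like 'java & python' (AND of all terms).

    Hypothetical helper for illustration only; it keeps just
    word characters, so punctuation is dropped.
    """
    terms = re.findall(r'\w+', raw_terms.lower())
    return ' & '.join(terms)

print(make_tsquery('Java Python'))  # → java & python
```

The resulting string is what gets cast with ::tsquery in the SQL example above; for anything user-facing, prefer plainto_tsquery() so Postgres handles quoting and normalization itself.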

For a 180 MB source table (212k+ rows) to be indexed, the full-text index (idx_odesk_search_job_tsv_basic) takes up 372 MB, and the additional table that stores the tsvectors takes 215 MB.

user@garage2:/tmp$ psql -d test1 -c "\di+ '(ix|idx)*search_job*'"
                                          List of relations
 Schema |               Name               | Type  | Owner |      Table       |  Size   | Description 
--------+----------------------------------+-------+-------+------------------+---------+-------------
 public | idx_odesk_search_job_tsv_basic   | index | user  | odesk_search_job | 372 MB  | 
 public | idx_odesk_search_job_tsv_full    | index | user  | odesk_search_job | 7656 kB | 
 public | ix_odesk_search_job_date_created | index | user  | odesk_search_job | 8816 kB | 
 public | ix_odesk_search_job_job_id       | index | user  | odesk_search_job | 13 MB   | 
(4 rows)
user@garage2:/tmp$ psql -d test1 -c "\dt+ 'odesk_search_job'"
                        List of relations
 Schema |       Name       | Type  | Owner |  Size  | Description 
--------+------------------+-------+-------+--------+-------------
 public | odesk_search_job | table | user  | 215 MB | 
(1 row)
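Putting the sizes from the psql output above together gives a rough storage-overhead figure:

```python
# Rough storage-overhead arithmetic, using the MB values reported
# by psql above (rounded figures, so this is only an estimate).
source_table = 180   # odesk_job data being indexed
tsv_table = 215      # odesk_search_job
gin_index = 372      # idx_odesk_search_job_tsv_basic

overhead = (tsv_table + gin_index) / source_table
print(f"full-text search costs ~{overhead:.1f}x the source table's size")
```

So the search infrastructure roughly triples the storage footprint of the indexed data, which is the trade-off for the fast GIN-backed queries shown earlier.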



[1] There are two types of triggers, depending on when you want them to run: BEFORE and AFTER the INSERT/UPDATE/DELETE operation. For an AFTER trigger, the return value is ignored, so you can just RETURN NULL. However, a BEFORE trigger will prevent the row from being inserted if it returns NULL. The documentation gives more details on how the return value is interpreted.


[2] This type of table is very similar to SQLite's FTS4 contentless tables.