UpStats has had job search and freelancer search for a while, but they were based on basic pattern matching and were quite slow. There are currently several options for full-text search; the most popular are still Elasticsearch and Lucene. However, when full-text search is available right in the database, it's much easier to compose various types of queries. Postgres has full-text search built in (and, as with anything, there are pros and cons to using it). Below I'll describe how full-text search was implemented in UpStats.
I liked the overview here; it gives a good intro to full-text search in Postgres. The discussion in this blog post is similar, but more focused on the ORM aspects.
There, REFRESH MATERIALIZED VIEW CONCURRENTLY is used inside a trigger that runs on INSERT/UPDATE/DELETE/TRUNCATE.
First off, as was mentioned in that post, the MV does indeed offer an advantage: you can pull and aggregate data from multiple related tables into one place and apply to_tsvector to certain combinations of columns. The problem, however, is that the MV takes a long time to refresh.
The REFRESH MATERIALIZED VIEW CONCURRENTLY statement is very slow, and the trigger sets it to run after each operation [1] on the table that holds the data to be indexed. If you have a collector inserting ~20 rows, the MV would have to be rebuilt for each of them (in my case, 30 seconds for each new row inserted).
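A quick back-of-the-envelope calculation, using the ~30-second refresh time measured here and an assumed ~20-row collector batch, shows why the per-row refresh doesn't scale:

```python
# Rough cost of refreshing the MV once per inserted row, using the
# ~30 seconds per REFRESH measured above and a ~20-row collector batch.
refresh_seconds = 30
rows_per_batch = 20
total_seconds = refresh_seconds * rows_per_batch
print(total_seconds)  # 600 -> ten minutes of refreshing for a single batch
```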
Individual triggers on the odesk_job table to update the materialized view aren't an option either because, as explained here, you can't run INSERT/UPDATE/TRUNCATE/DELETE operations on a materialized view. Incremental updates are also not available for materialized views.
So a different approach was required: I decided to replace the MV with a regular table containing tsvector columns.
Below is a partial schema diagram describing the two tables involved:
A new SQLAlchemy model was required:

```python
from sqlalchemy import Column, Integer, String, Index
from sqlalchemy.dialects.postgresql import TSVECTOR

from models.shared import OdeskBase


class OdeskSearchJob(OdeskBase):
    __tablename__ = 'odesk_search_job'
    __table_args__ = (
        Index('idx_odesk_search_job_tsv_basic', 'tsv_basic',
              postgresql_using='gin'),
        Index('idx_odesk_search_job_tsv_full', 'tsv_full',
              postgresql_using='gin'),
    )

    job_id = Column(String(300), primary_key=True, index=True)
    date_created = Column(Integer, index=True)
    tsv_basic = Column(TSVECTOR)
    tsv_full = Column(TSVECTOR)
```
After generating the Alembic migration, some changes had to be made so it could compute the full-text index data. To that end, a new trigger had to be added, along with the function the trigger runs.
The trigger runs every time a job is inserted into or deleted from odesk_job, or every time the odesk_job.snippet column is updated.
```sql
CREATE TRIGGER tsv_odesk_job_snippet_trigger
AFTER INSERT OR DELETE OR UPDATE OF snippet ON odesk_job
FOR EACH ROW EXECUTE PROCEDURE trig_fulltext_refresh();
```
The trigger will run the following function, which computes the TSVECTOR data for odesk_search_job [2] using data from odesk_job:
```sql
CREATE OR REPLACE FUNCTION trig_fulltext_refresh() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO odesk_search_job (job_id, date_created, tsv_basic, tsv_full)
        VALUES (NEW.job_id,
                NEW.date_created,
                to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet)),
                to_tsvector(''));
        RETURN NULL;
    ELSIF TG_OP = 'UPDATE' THEN
        WITH upsert AS (
            UPDATE odesk_search_job
            SET date_created = OLD.date_created,
                tsv_basic = to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet)),
                tsv_full = to_tsvector('')
            WHERE job_id = OLD.job_id
            RETURNING *
        )
        INSERT INTO odesk_search_job (job_id, date_created, tsv_basic, tsv_full)
        SELECT OLD.job_id,
               NEW.date_created,
               to_tsvector(LOWER(NEW.job_title || ' ' || NEW.snippet)),
               to_tsvector('')
        WHERE NOT EXISTS (SELECT * FROM upsert);
        RETURN NULL;
    ELSIF TG_OP = 'DELETE' THEN
        DELETE FROM odesk_search_job WHERE job_id = OLD.job_id;
        RETURN NULL;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
```
The INSERT branch above inserts a new row in odesk_search_job for every new row in odesk_job. The TSVECTOR columns (tsv_basic, tsv_full) contain data about the tokens present in the job text and their positions inside the text of the job ad.
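As a rough illustration of what a tsvector stores (a mapping from each token to its positions in the text), here is a toy version in Python. The real to_tsvector also applies a text search configuration (stemming, stop-word removal), which this sketch skips:

```python
def toy_to_tsvector(text):
    """Map each whitespace-separated, lowercased token to the list of
    positions where it occurs -- a simplified picture of a tsvector.
    Unlike Postgres's to_tsvector, no stemming or stop-word removal
    is performed here."""
    positions = {}
    for pos, token in enumerate(text.lower().split(), start=1):
        positions.setdefault(token, []).append(pos)
    return positions

print(toy_to_tsvector("Python developer needed Python and Java"))
# {'python': [1, 4], 'developer': [2], 'needed': [3], 'and': [5], 'java': [6]}
```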
The UPDATE branch in the function uses an UPSERT to update the TSVECTOR whenever the corresponding record in odesk_job changes. The tsv_full column is not being used yet, but I will probably use it in the near future.
The DELETE branch simply mimics the same operation from the odesk_job table.
Let's run a simple full-text search and see how it performs:
```sql
EXPLAIN ANALYZE
SELECT job_id
FROM odesk_search_job
WHERE tsv_basic @@ 'java & python'::tsquery;
```
In the query above we're searching for jobs that contain both the terms "java" and "python" (Postgres's syntax for these types of queries is documented here).
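The `&` operator asks for rows whose tsvector contains both terms. As a sketch of those AND semantics only (a real tsquery also supports `|`, `!`, phrase search, grouping, and stemmed matching, none of which are handled here):

```python
def matches_and_query(tsquery, tokens):
    """Evaluate an AND-only tsquery string such as 'java & python'
    against a set of tokens. Only the '&' operator is handled in
    this toy sketch; real Postgres tsquery is far richer."""
    terms = [t.strip() for t in tsquery.split('&')]
    return all(t in tokens for t in terms)

print(matches_and_query('java & python', {'java', 'python', 'aws'}))  # True
print(matches_and_query('java & python', {'python', 'aws'}))          # False
```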
```
                                                                 QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on odesk_search_job  (cost=37.25..642.49 rows=161 width=20) (actual time=3.336..9.036 rows=534 loops=1)
   Recheck Cond: (tsv_basic @@ '''java'' & ''python'''::tsquery)
   Heap Blocks: exact=524
   ->  Bitmap Index Scan on idx_odesk_search_job_tsv_basic  (cost=0.00..37.21 rows=161 width=0) (actual time=3.107..3.107 rows=534 loops=1)
         Index Cond: (tsv_basic @@ '''java'' & ''python'''::tsquery)
 Planning time: 1.675 ms
 Execution time: 9.214 ms
(7 rows)
```
The EXPLAIN output above times a simple SELECT statement: it takes about 9 milliseconds, and we can see that the GIN index on the tsv_basic column is being used.
The equivalent query in SQLAlchemy would look like this (note the parameter is postgresql_regconfig, which selects the text search configuration):

```python
db_session.query(OdeskSearchJob) \
    .filter(OdeskSearchJob.tsv_basic.match(
        search_terms, postgresql_regconfig='english'))
```
For the 180 MB (212k+ rows) table being indexed, the full-text index (idx_odesk_search_job_tsv_basic) takes up 372 MB, and the additional odesk_search_job table that stores the tsvectors takes another 215 MB:
```
user@garage2:/tmp$ psql -d test1 -c "\di+ '(ix|idx)*search_job*'"
                                       List of relations
 Schema |               Name               | Type  | Owner |      Table       |  Size   | Description
--------+----------------------------------+-------+-------+------------------+---------+-------------
 public | idx_odesk_search_job_tsv_basic   | index | user  | odesk_search_job | 372 MB  |
 public | idx_odesk_search_job_tsv_full    | index | user  | odesk_search_job | 7656 kB |
 public | ix_odesk_search_job_date_created | index | user  | odesk_search_job | 8816 kB |
 public | ix_odesk_search_job_job_id       | index | user  | odesk_search_job | 13 MB   |
(4 rows)
```
```
user@garage2:/tmp$ psql -d test1 -c "\dt+ 'odesk_search_job'"
                      List of relations
 Schema |       Name       | Type  | Owner |  Size  | Description
--------+------------------+-------+-------+--------+-------------
 public | odesk_search_job | table | user  | 215 MB |
(1 row)
```
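Adding up the numbers reported by psql gives a rough picture of the storage cost of this approach (sizes are as reported above, so the ratio is approximate):

```python
# Storage overhead of the full-text setup, using the sizes psql reported.
source_table_mb = 180   # odesk_job, the table being indexed
tsv_table_mb = 215      # odesk_search_job (stores the tsvectors)
gin_index_mb = 372      # idx_odesk_search_job_tsv_basic (GIN index)

extra_mb = tsv_table_mb + gin_index_mb
print(extra_mb)                              # 587 MB of extra storage
print(round(extra_mb / source_table_mb, 1))  # ~3.3x the source table size
```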
[1] There are two types of triggers, depending on when you want them to run: BEFORE and AFTER the INSERT/UPDATE/DELETE operation. For an AFTER trigger, the return value is ignored, so you can just RETURN NULL. A BEFORE trigger, however, will prevent a row from being inserted if the trigger returns NULL. The documentation gives more details on how the return value is interpreted.