home articles sponsorship newsletter

Poking around PostgreSQL full-text search: a beginners primer

Today, I want to share a different type of post. Nothing polished. Just me goofing around with PostgreSQL’s full-text search capabilities. And yes, if you’re wondering how someone can have fun while using full-text search, well, I’m wondering about that myself.

A note: this post is beginners friendly. Even though it is long, I’ll only scratch the topic’s surface.

Let’s start with the basics!

The PostgreSQL documentation says it best:

Full-Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.

PostgreSQL documentation

In layman’s terms, full-text search allows matches between a query and documents outside of exact matches. When querying with =, LIKE, or ILIKE is not enough, you reach for full-text search.

For instance, you have a database full of books, and you’re searching for stories about sorceresses1. But rows in your books.title column only have a few exact matches for this query. On the other hand, books.blurb contains a mixed bag of words like sorceresses, sorceress or sorcerers.

Some of these books might be of interest to you and you’d like them to match your query. Well, full-text search does just that. It takes your query sorceresses, normalizes it, matches it with normalized variations of your documents, and returns the results.

But I’m getting ahead of myself.

What problems does full-text search solves?

Let’s draw on the above example and look for a book filled with sorceresses!

My database currently stores a handful of books:

  CREATE TABLE books (
    title VARCHAR(255),
    blurb TEXT
  );

  INSERT INTO books (title, blurb) VALUES
  ('The Fair Sorceress', 'A thrilling tale of magic and mystery where a witch meets elves.'),
  ('The Silent Grove', 'A journey through an ancient, forgotten forest, with sorcerers and dwarves.'),
  ('The Last Kingdom', 'A historical novel set in medieval times where sorceresses save the day.'),
  ('Deep Depth', 'A suspenseful story of adventure of sailors at sea.'),
  ('Sorcery School', 'A young-adult thriller where a sorceress and a sorcerer fight to keep the school from being eat up by zombies.');

  INSERT 0 5

Let’s try to select some books.

  SELECT * FROM books
  WHERE title = 'sorceresses';

   Title | Blurb
  -------+-------

  (0 rows)

Ugh, that’s disappointing. What about querying on the blurb instead?

  SELECT * FROM books
  WHERE blurb = 'sorceresses';

          Title     |    Blurb
  ------------------+---------------------------------------------------------------------------------------------------------------------------------------
   The Last Kingdom | A historical novel set in medieval times where sorceresses save the day.

  (1 rows)

Better!

But I know I have more relevant books. Let’s add an OR operator so I can search for variation of sorceresses.

  SELECT * FROM books
  WHERE blurb = 'sorceress' OR blurb = 'sorceresses';

        Title           |    Blurb
  ----------------------+---------------------------------------------------------------------------------------------------------------------------------------
   The Last Kingdom     | A historical novel set in medieval times where sorceresses save the day.
   Sorcery School       | A young-adult thriller where a sorceress and a sorcerer fight to keep the school from being eat up by zombies.

  (2 rows)

We’re getting there, but what about The Silent Grove or The Fair Sorceress? They’d be a good match.

I could add new conditions to my WHERE clause. But in real-life applications, word variations and searchable columns could add up quite a bit.

This long-winded (yet contrived) example shows you, reader, that when you hit the limitations of basic search operators, it’s time to reach for PostgreSQL’s full-text search.

How does full-text search works?

From documents to tokens

Full-text search turns documents (for instance, the rows of your database) into tokens. Tokens are the smallest semantic units extracted from your documents. They can be strings, numbers, email addresses, etc. PostgreSQL uses a built-in parser, but you can provide your own parser if you wish.

Let’s take the following string stored in a row of books.blurb:

  A young-adult thriller where a sorceress and a sorcerer fight to keep the school from being eat up by zombies.

PostgreSQL will first turn this string into tokens.

PostgreSQL does not expose (to my knowledge) the intermediate tokenization step, but here’s a workaround I’ve found that’ll help you make sense of this process.

Let’s start by splitting the string into an array of words, unpacking the array, adding a position column indicating the index of each word, and ordering the words alphabetically.

  SELECT token, position
  FROM unnest(string_to_array('A young-adult thriller where a sorceress and a sorcerer fight to keep the school from being eat up by zombies.', ' '))
  WITH ORDINALITY AS t(token, position)
  ORDER BY token;

      token    | position
  -------------+----------
   A           |        1
   a           |        5
   a           |        8
   and         |        7
   being       |       16
   by          |       19
   eat         |       17
   fight       |       10
   from        |       15
   keep        |       12
   school      |       14
   sorcerer    |        9
   sorceress   |        6
   the         |       13
   thriller    |        3
   to          |       11
   up          |       18
   where       |        4
   young-adult |        2
   zombies.    |       20
  (20 rows)

Let’s note a few things:

PostgreSQL strips stop words and suffixes during the next step when it converts tokens into lexemes.

From tokens to lexemes

Once PostgreSQL has converted your documents to tokens, it converts tokens into lexemes.

Lexemes are normalized representations of your tokens: different forms of the same word made alike.

In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.

PostgreSQL documentation

For instance:

      tokens   |  lexemes
  -------------+-------------
   magic       | magic
   MAGICS      | magic
   magical     | magic
   magically   | magic

Wanna try it? Fire up a PostgreSQL console and type the following:

  SELECT to_tsvector('magic MAGICS magical magically');
     to_tsvector
  -----------------
   'magic':1,2,3,4
  (1 row)

magic, magics, magical, and magically are all variations of the stem magic, which PostgreSQL identifies as the relevant representation of the previous variations.

You can see how PostgreSQL distributes the lexemes when transforming several words:

  SELECT to_tsvector('magical MAGIC magician');

    to_tsvector
  --------------------------
    'magic':1,2 'magician':3

  (1 row)

The transformation of tokens into lexemes follows several steps:

This is where full-text search shines: you get a natural language input from your users (i.e. magical elves and fun sorceresses) and you compare it against your data.

Chances are, you won’t get an exact match. So, one way to up your chances of a match while keeping the results relevant is to split the input into manageable semantic chunks.

  SELECT to_tsvector('magical elves and fun sorceresses');

    to_tsvector
  -----------------------------------------
    'elv':2 'fun':4 'magic':1 'sorceress':5

  (1 row)

How does PostgreSQL turn tokens into lexemes? With dictionaries!

PostgreSQL uses default dictionaries, but you can pass or create custom ones based on your needs. Want to work with French documents? Pass french as first argument to your to_tsvector() method. Want to add a bit of fuzzy matching? Add an Ispell dictionary that’ll map synonyms to a single word. Neat!

  SELECT to_tsvector('french', 'les bonnes baguettes et les croissants');
                to_tsvector
  ----------------------------------------
   'baguet':3 'bon':2 'croiss':6 'le':1,5
  (1 row)

I won’t dig much into dictionaries at this point, but feel free to read the documentation. You can do a lot of configuration around dictionaries with PostgreSQL.

Finally, normalized lexemes are stored in a specific datatype called tsvector.

  SELECT pg_typeof(to_tsvector('sorcerers'));

    pg_typeof
  -----------
    tsvector
  (1 row)

Processing queries

So far, we’ve seen how PostgreSQL processes your data.

User queries follow the same basic steps as documents:

However, PostgreSQL adds several strategies when processing user queries.

Boolean and phrase search operators

After converting user input into lexemes, PostgreSQL returns a ts_query where lexemes are combined using either:

  -- Select both 'fun' and 'elves'
  SELECT 'fun & elves'::tsquery;
       tsquery
  -----------------
   'fun' & 'elves'
  (1 row)

  -- Select either 'fun' or 'elves'
  SELECT 'fun | elves'::tsquery;
       tsquery
  -----------------
   'fun' | 'elves'
  (1 row)

  -- Select 'elves' but not 'fun'
  SELECT '! fun & elves'::tsquery;
       tsquery
  ------------------
   !'fun' & 'elves'
  (1 row)

Of course, PostgreSQL provides a bunch of methods to parse a query and turn it into a ts_query.

Parsing queries: to_tsquery, plainto_tsquery, phraseto_tsquery, websearch_to_tsquery

PostgreSQL offers different methods to parse a query, each with its own features.

to_tsquery

to_tsquery takes a query that must be a list of tokens already separated by operators. Tokens can be grouped using parenthesis to control the binding between operators (i.e. which couple of operators takes precedence over another couple).

For instance, I can’t pass a plain sentence to to_tsquery because it expects tokens separated by operators such as &, |, etc.

  SELECT to_tsquery('The fun elves and the magical sorceresses');
  ERROR:  syntax error in tsquery: "The fun elves and the magical sorceresses"

Once I provide a formatted list, PostgreSQL turns my tokens into lexemes, and the fun can begin.

  SELECT to_tsquery('The & fun & elves & and & the & magical & sorceresses');
                to_tsquery
  ---------------------------------------
   'fun' & 'elv' & 'magic' & 'sorceress'
  (1 row)

t_tsquery needs a premilinary normalization on my part, which can be annoying, but it also gives me a lot of control over the query. to_tsquery allows me to specify my operators. Let’s say I want books with fun elves but no magical sorceresses, I can group my tokens like so:

  SELECT to_tsquery('The & fun & elves & and & the & !magical & !sorceresses');
                 to_tsquery
  -----------------------------------------
   'fun' & 'elv' & !'magic' & !'sorceress'
  (1 row)

plainto_tsquery

If I don’t want to normalize my input beforehand, I can delegate the work to PostgreSQL with plainto_tsquery. It’ll normalize my text, remove stop words, and add the & operator inbetween surviving words.

  SELECT plainto_tsquery('The fun elves and the magical sorceresses');
              plainto_tsquery
  ---------------------------------------
   'fun' & 'elv' & 'magic' & 'sorceress'
  (1 row)

plainto_tsquery offers convenience over control. For instance, you can’t use logical operators anymore. For example, you can’t specify you don’t want magical sorceresses anymore.

  SELECT plainto_tsquery('The fun elves and the !magical !sorceresses');
              plainto_tsquery
  ---------------------------------------
   'fun' & 'elv' & 'magic' & 'sorceress'
  (1 row)

phraseto_tsquery

phraseto_tsquery is similar to plainto_tsquery except the & (AND) operator is replace by the <-> (FOLLOWED BY) operator. Also, the stop words are not discarded but are used to compute the number of semantically less important words between lexemes (using the <N> operator).

This is useful when you need to fetch results from a group of words that are semantically more relevant together than separated. Think Lord of the Ring or The Wheel of Time, or any such sentences.

  SELECT phraseto_tsquery('the wheel of time');
    phraseto_tsquery
  --------------------
   'wheel' <2> 'time'
  (1 row)

You’d rather get books from the Wheel of Time series than any random books with the words wheels and time in their blurb. phraseto_tsquery also does not recognize operators.

websearch_to_tsquery

Finally, websearch_to_tsquery offers a similar approach to natural language inputs, but handles some operators too.

The PostgreSQL documentation explains its syntax:

  • unquoted text: text not inside quote marks will be converted to terms separated by & operators, as if processed by plainto_tsquery.
  • "quoted text": text inside quote marks will be converted to terms separated by <-> operators, as if processed by phraseto_tsquery.
  • OR: the word “or” will be converted to the operator.
  • -: a dash will be converted to the ! operator.

PostgreSQL documentation

Let’s search for books about fun, elves, but no magical sorceresses.

SELECT websearch_to_tsquery('The fun elves and -"the magical sorceresses"');
             websearch_to_tsquery
----------------------------------------------
 'fun' & 'elv' & !( 'magic' <-> 'sorceress' )
(1 row)

Books about sorceresses might pop up, as long as the sorceress is not preceded by magical.

The possibilities are endless.

Ranking results

This post is getting very long, so I’ll try to be brief.

PostgreSQL allows you to rank results based on:

  SELECT title,
         ts_rank(
             to_tsvector(blurb),
             websearch_to_tsquery('fun elves and magical sorceresses')
         ) AS rank
  FROM recipes
  ORDER BY rank DESC
  LIMIT 10;

                     title                     |   rank
-----------------------------------------------+----------
 The Fair Sorceress                            |      3.1
 The Last Kingdom                              |      2.4
 Sorcery School                                |  2.01317
 (3 rows)

You can add weights arguments to give an extra boost to specific columns for instance.

I’ll let you read the ad hoc documentation

Conclusion

Well, this post got quite long. If you made it to here, well done! But remember, we only just scratched the surface of full-text search. PostgreSQL provides much much more functionnalities, and search is a field on its own.

Cheers,

Rémi - @remi@ruby.social

  1. I probably should read something other than the Wheel of Time.