Poking around PostgreSQL full-text search: a beginner’s primer
Today, I want to share a different type of post. Nothing polished. Just me goofing around with PostgreSQL’s full-text search capabilities. And yes, if you’re wondering how someone can have fun while using full-text search, well, I’m wondering about that myself.
A note: this post is beginner-friendly. Even though it’s long, I’ll only scratch the surface of the topic.
Let’s start with the basics!
What is full-text search?
The PostgreSQL documentation says it best:
Full-Text Searching (or just text search) provides the capability to identify natural-language documents that satisfy a query, and optionally to sort them by relevance to the query.
In layman’s terms, full-text search allows matches between a query and documents beyond exact matches. When querying with `=`, `LIKE`, or `ILIKE` is not enough, you reach for full-text search.
For instance, you have a database full of books, and you’re searching for stories about sorceresses¹. But rows in your `books.title` column only have a few exact matches for this query. On the other hand, `books.blurb` contains a mixed bag of words like `sorceresses`, `sorceress`, or `sorcerers`.
Some of these books might be of interest to you, and you’d like them to match your query. Well, full-text search does just that: it takes your query `sorceresses`, normalizes it, matches it against normalized variations of your documents, and returns the results.
But I’m getting ahead of myself.
What problems does full-text search solve?
Let’s draw on the above example and look for a book filled with sorceresses!
My database currently stores a handful of books:
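Something along these lines (a sketch: the schema and rows are invented for this post, any handful of books will do):

```sql
CREATE TABLE books (
  id    serial PRIMARY KEY,
  title text,
  blurb text
);

INSERT INTO books (title, blurb) VALUES
  ('The Last Sorceress', 'A young sorceress fights an ancient evil.'),
  ('The Fair Sorceress', 'A fair enchantress roams a magical land.'),
  ('The Silent Grove',   'An elf discovers the source of all sorcery.'),
  ('The Three Brothers', 'Three sorcerers walk into a tavern.'),
  ('A Mundane Tuesday',  'Nothing magical happens. At all.');
```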
Let’s try to select some books.
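A pattern match on the title seems like a reasonable start (assuming the sample table above):

```sql
SELECT title FROM books WHERE title ILIKE '%sorceresses%';
-- (0 rows)
```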
Ugh, that’s disappointing. What about querying on the blurb instead?
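Something like this, still against the invented rows:

```sql
SELECT title FROM books WHERE blurb ILIKE '%sorceress%';
-- The Last Sorceress
```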
Better!
But I know I have more relevant books. Let’s add an OR operator so I can search for variations of `sorceresses`.
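With the sample data above, that could look like:

```sql
SELECT title
FROM books
WHERE blurb ILIKE '%sorceress%'
   OR blurb ILIKE '%sorcerer%';
-- The Last Sorceress
-- The Three Brothers
```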
We’re getting there, but what about The Silent Grove or The Fair Sorceress? They’d be a good match.
I could add new conditions to my `WHERE` clause. But in real-life applications, word variations and searchable columns can add up quite a bit.
This long-winded (yet contrived) example shows you, reader, that when you hit the limitations of basic search operators, it’s time to reach for PostgreSQL’s full-text search.
How does full-text search work?
From documents to tokens
Full-text search turns documents (for instance, the rows of your database) into tokens. Tokens are the smallest semantic units extracted from your documents. They can be strings, numbers, email addresses, etc. PostgreSQL uses a built-in parser, but you can provide your own parser if you wish.
Let’s take the following string stored in a row of `books.blurb`:
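Here’s an invented one (any sentence with stop words, punctuation, and plurals will do):

```sql
'An evil sorceress raises the zombies of the Silent Grove.'
```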
PostgreSQL will first turn this string into tokens.
PostgreSQL does not expose (to my knowledge) the intermediate tokenization step, but here’s a workaround I’ve found that’ll help you make sense of this process.
Let’s start by splitting the string into an array of words, unpacking the array, adding a `position` column indicating the index of each word, and ordering the words alphabetically.
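One way to approximate this (a sketch using `string_to_array` and `unnest ... WITH ORDINALITY`; `position` is quoted because it’s a keyword):

```sql
SELECT word, "position"
FROM unnest(
       string_to_array('An evil sorceress raises the zombies of the Silent Grove.', ' ')
     ) WITH ORDINALITY AS t(word, "position")
ORDER BY word;
```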
Let’s note a few things:
- Stop words are still present.
- Punctuation is still present.
- Words have not been converted to normalized representations yet (`zombies` is still `zombies`: the plural suffix is still attached).
PostgreSQL strips stop words and suffixes during the next step when it converts tokens into lexemes.
From tokens to lexemes
Once PostgreSQL has converted your documents to tokens, it converts tokens into lexemes.
Lexemes are normalized representations of your tokens: different forms of the same word made alike.
In short, then, tokens are raw fragments of the document text, while lexemes are words that are believed useful for indexing and searching.
For instance, wanna try it? Fire up a PostgreSQL console and type the following:
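```sql
-- to_tsvector does the token-to-lexeme conversion in one go
-- (output reproduced from memory):
SELECT to_tsvector('english', 'magic magics magical magically');
-- 'magic':1,2,3,4
```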
`magic`, `magics`, `magical`, and `magically` are all variations of the stem `magic`, which PostgreSQL identifies as the relevant representation of all these variations.
You can see how PostgreSQL distributes the lexemes when transforming several words:
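Running it on our invented blurb from earlier (the exact lexemes may vary slightly with your dictionaries):

```sql
SELECT to_tsvector('english', 'An evil sorceress raises the zombies of the Silent Grove.');
-- 'evil':2 'grove':10 'rais':4 'silent':9 'sorceress':3 'zombi':6
```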
The transformation of tokens into lexemes follows several steps:
- Folding upper-case letters to lower-case.
- Removing suffixes.
- Removing stop words.
- Adding the positional information of each token matched with its lexeme.
This is where full-text search shines: you get a natural language input from your users (e.g. `magical elves and fun sorceresses`) and you compare it against your data.
Chances are, you won’t get an exact match. So, one way to up your chances of a match while keeping the results relevant is to split the input into manageable semantic chunks.
How does PostgreSQL turn tokens into lexemes? With dictionaries!
PostgreSQL uses default dictionaries, but you can pass or create custom ones based on your needs. Want to work with French documents? Pass `french` as the first argument to `to_tsvector()`. Want to add a bit of fuzzy matching? Add an Ispell dictionary that’ll map many linguistic forms of a word to a single lexeme. Neat!
I won’t dig much into dictionaries at this point, but feel free to read the documentation. You can do a lot of configuration around dictionaries with PostgreSQL.
Finally, normalized lexemes are stored in a specific data type called `tsvector`.
Processing queries
So far, we’ve seen how PostgreSQL processes your data.
User queries follow the same basic steps as documents:
- Convert input into tokens.
- Convert tokens into lexemes.
- Return the combined lexemes stored in a `tsquery`.
However, PostgreSQL adds several strategies when processing user queries.
Boolean and phrase search operators
After converting user input into lexemes, PostgreSQL returns a `tsquery` where lexemes are combined using either:
- boolean operators: `&` (AND), `|` (OR), `!` (NOT)
- phrase search operators: `<->` (FOLLOWED BY), `<N>` (FOLLOWED BY where `N` is an integer specifying the distance between the two lexemes being searched for)
Of course, PostgreSQL provides a bunch of functions to parse a query and turn it into a `tsquery`.
Parsing queries: to_tsquery, plainto_tsquery, phraseto_tsquery, websearch_to_tsquery
PostgreSQL offers different functions to parse a query, each with its own features.
to_tsquery
`to_tsquery` takes a query that must be a list of tokens already separated by operators. Tokens can be grouped using parentheses to control the binding between operators (i.e. which group of operators takes precedence over another).
For instance, I can’t pass a plain sentence to `to_tsquery` because it expects tokens separated by operators such as `&`, `|`, etc.
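(the error message below is paraphrased from memory)

```sql
SELECT to_tsquery('english', 'magical elves and fun sorceresses');
-- ERROR:  syntax error in tsquery: "magical elves and fun sorceresses"
```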
Once I provide a formatted list, PostgreSQL turns my tokens into lexemes, and the fun can begin.
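For example (lexemes shown from memory, they may differ slightly on your setup):

```sql
SELECT to_tsquery('english', 'magical & elves & fun & sorceresses');
-- 'magic' & 'elv' & 'fun' & 'sorceress'
```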
`to_tsquery` needs a preliminary normalization on my part, which can be annoying, but it also gives me a lot of control over the query: `to_tsquery` allows me to specify my operators. Let’s say I want books with fun elves but no magical sorceresses; I can group my tokens like so:
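```sql
-- the ! negates the whole (magical & sorceresses) group
-- (output from memory):
SELECT to_tsquery('english', '(fun & elves) & !(magical & sorceresses)');
-- 'fun' & 'elv' & !( 'magic' & 'sorceress' )
```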
plainto_tsquery
If I don’t want to normalize my input beforehand, I can delegate the work to PostgreSQL with `plainto_tsquery`. It’ll normalize my text, remove stop words, and add the `&` operator between the surviving words.
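For instance (note the stop word `and` disappearing; output from memory):

```sql
SELECT plainto_tsquery('english', 'magical elves and fun sorceresses');
-- 'magic' & 'elv' & 'fun' & 'sorceress'
```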
`plainto_tsquery` offers convenience over control. You can’t use logical operators anymore: there’s no way to specify that you don’t want magical sorceresses, for example.
phraseto_tsquery
`phraseto_tsquery` is similar to `plainto_tsquery`, except the `&` (AND) operator is replaced by the `<->` (FOLLOWED BY) operator. Also, stop words are not discarded; instead, they’re used to compute the number of semantically less important words between lexemes (using the `<N>` operator).
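Take a title full of stop words (output from memory):

```sql
SELECT phraseto_tsquery('english', 'The Wheel of Time');
-- 'wheel' <2> 'time'
```

The `<2>` accounts for the discarded `of` sitting between `wheel` and `time`.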
This is useful when you need to fetch results for a group of words that are semantically more relevant together than apart. Think The Lord of the Rings or The Wheel of Time, or any such phrase.
You’d rather get books from the Wheel of Time series than any random book with the words `wheel` and `time` in its blurb. Note that `phraseto_tsquery` does not recognize operators either.
websearch_to_tsquery
Finally, `websearch_to_tsquery` offers a similar approach to natural language inputs, but it handles some operators too.
The PostgreSQL documentation explains its syntax:
- unquoted text: text not inside quote marks will be converted to terms separated by `&` operators, as if processed by `plainto_tsquery`.
- "quoted text": text inside quote marks will be converted to terms separated by `<->` operators, as if processed by `phraseto_tsquery`.
- OR: the word “or” will be converted to the `|` operator.
- -: a dash will be converted to the `!` operator.
Let’s search for books about fun and elves, but no magical sorceresses.
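With `websearch_to_tsquery`, the dash and the quotes do the heavy lifting (output from memory):

```sql
SELECT websearch_to_tsquery('english', 'fun elves -"magical sorceresses"');
-- 'fun' & 'elv' & !( 'magic' <-> 'sorceress' )
```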
Books about sorceresses might pop up, as long as the sorceress is not preceded by magical.
The possibilities are endless.
Ranking results
This post is getting very long, so I’ll try to be brief.
PostgreSQL allows you to rank results based on:
- the frequency of lexemes in documents, with `ts_rank`.
- the density of lexemes in documents (frequency × proximity), with `ts_rank_cd`.
You can pass a weights argument to give an extra boost to specific columns, for instance.
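Here’s what ranking could look like against the sample table from the beginning (a sketch, not a tuned query; the shape mirrors the examples in the PostgreSQL docs):

```sql
-- rank matching blurbs against the query, highest score first
SELECT title,
       ts_rank(to_tsvector('english', blurb), query) AS rank
FROM books,
     websearch_to_tsquery('english', 'magical') AS query
WHERE to_tsvector('english', blurb) @@ query
ORDER BY rank DESC;
```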
I’ll let you read the relevant documentation.
Conclusion
Well, this post got quite long. If you made it this far, well done! But remember, we’ve only just scratched the surface of full-text search. PostgreSQL provides many more functionalities, and search is a field of its own.
Cheers,
Rémi - @remi@ruby.social
¹ I probably should read something other than the Wheel of Time. ↩