Comparing indexes for text search in PostgreSQL (part 1)

During the development of my web application, I needed to add the ability to search for different entities (companies, messages, documents, etc.). Each entity has its own typical length of the string to be searched. For example, a company has a short name (20-40 characters on average), a message has a medium-length text (usually up to 200 characters), and a document can be very long.

I have the following questions:

  1. What indexes are used for text search?
  2. How do they differ from each other? How much disk space do they take up? Which index searches faster?
  3. Which index is better to use for searching by short strings (e.g., by company names), and which one is better for searching by long strings (e.g., by chat messages or by the contents of text documents)? Is it possible to use the same index in all situations?

In this article, I will answer these questions by comparing different indexes and drawing conclusions about which index is better to use for text search in which cases.

To begin with, I want to make a small digression and answer the question: why use indexes at all, and when are they needed?

What is an index and why do you need it

An index is a data structure used to significantly speed up the search in the database and to implement integrity constraints (e.g., a uniqueness constraint, which can be used to make all values in a column unique).

By default, PostgreSQL automatically creates indexes to implement uniqueness (the unique constraint) and the primary key (a combination of the unique and «NOT NULL» constraints). Hence, if you use a primary key, you are already using indexes. PostgreSQL does not create an index for a foreign key, so if you use JOIN, you need to create an index yourself, but that is beyond the scope of this article.
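For example (an illustrative sketch; the companies and employees tables are hypothetical):

CREATE TABLE companies (id int PRIMARY KEY, name text);
/* PostgreSQL automatically creates the unique index companies_pkey for the primary key */
CREATE TABLE employees (id int PRIMARY KEY, company_id int REFERENCES companies (id));
/* No index is created for the foreign key, so to speed up JOINs create it yourself */
CREATE INDEX employees_company_id_idx ON employees (company_id);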

PostgreSQL has many indexes (B-tree, GiST, SP-GiST, GIN, RUM, BRIN, Bloom, etc.). Each of them is used for a specific situation. See this series of articles for more information about these indexes.

If you create an index and do not specify the index type, PostgreSQL creates a B-tree index, which is used in most cases.

CREATE TABLE messages (id int, body text);
CREATE INDEX messages_body_idx ON messages (body);

There is a rule for naming indexes, which you can see here. I will use it in this article.

If you want to use another index, for example, a GIN, you must specify it explicitly:

CREATE INDEX messages_body_idx ON messages USING gin (body);

If you are developing a new application and you don't have any users yet, you may ask me: «Why should I use indexes if I don't have many users yet? When there are a lot of users, that's when I'll add the indexes».

On the one hand, you would be right. As long as your application is used by only a couple of users per day and only a few hundred entities are stored in the database, you will not see the difference between the presence and absence of indexes. Moreover, the PostgreSQL query planner may prefer to scan a table sequentially, as if there were no index at all.

However, when you develop an app, you think over its structure and design a database. This lays the foundation for your app. I think you wouldn't want, when there is an influx of active users, to encounter slow execution of database queries, from which the UI will freeze, which will «infuriate» your users. Moreover, you will not know at what point everything started working slowly until one of your users writes to you. In practice, however, it is extremely rare for someone to write about bugs or slow performance. If something does not work, your users simply leave your app, especially if they have just registered. Before creating indexes, you may need to change the structure of the database and your SQL queries. Are you ready for this? It is faster and more reliable to think through and implement indexes right away, when developing the first version of the app.

To begin with, let's put forward the requirements for the index and sort them by priority. So, the index should:

  1. Support search by any part of the string. For example, by entering «center» in the search bar, the user should see an entity with the name «The children's center».
  2. Find inaccurate matches. For example, for the «centers» query, an entity with the name «The children's center» should also be displayed.
  3. Be case insensitive. For example, for the «Center» query, an entity with the name «The children's center» should be displayed.
  4. Support the LIMIT clause to implement pagination.
  5. Support sorting by relevance to the search term.
  6. Search as quickly as possible.
  7. Take up as little disk space as possible.

You can use either trigrams (pg_trgm) or Full Text Search to search text. Let's deal with both of them, and then go back to comparing indexes.

Trigrams

The search is performed as follows. The string is divided into many trigrams. A trigram consists of 3 consecutive characters.
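All the trigram functions and operators shown below are provided by the pg_trgm extension, so it must be enabled first:

CREATE EXTENSION IF NOT EXISTS pg_trgm;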

SELECT show_trgm('word'); -- {"  w"," wo",ord,"rd ",wor}

Characters that are not a letter or a number are ignored.

SELECT show_trgm('a&b'); -- {"  a","  b"," a "," b "}

The case of letters is also ignored.

SELECT show_trgm('A'); -- {"  a"," a "}

When creating trigrams, 2 spaces are added to the beginning of the string, and 1 space is added to the end.

Let's compare how much one string (what a user entered in the search bar) is similar to another (for example, to the content of a specific message) by counting the number of the same trigrams.

SELECT 'elephant' % 'one elephant'; -- TRUE
SELECT similarity('elephant', 'one elephant'); -- 0.6923077

The % symbol is used to determine how similar 2 strings are. If the value is greater than or equal to pg_trgm.similarity_threshold (0.3 by default), it returns TRUE. The similarity value ranges from 0 (the strings are completely different) to 1 (the strings are the same).

SHOW pg_trgm.similarity_threshold; -- 0.3

You can increase this value, for example, to make the search more accurate:

SET pg_trgm.similarity_threshold = 0.8;

Let's leave this value as the default.

If a user makes a typo in the word «elephant», the result will still be positive, but the similarity value will decrease.

SELECT 'wlephant' % 'one elephant'; -- TRUE
SELECT similarity('wlephant', 'one elephant'); -- 0.375

However, if a user enters the word «cat», which does not occur in the string, the strings will not be similar.

SELECT 'cat' % 'one elephant'; -- FALSE
SELECT similarity('cat', 'one elephant'); -- 0

In PostgreSQL 13, there are 3 different functions for calculating the similarity of 2 strings: similarity, word_similarity, and strict_word_similarity. Each of the functions calculates the similarity of strings in its own way.

To see the difference between these functions, let's take the string «dark wood round table» (for example, we have a product with that name in the products table) and try to determine how similar it is to several different search terms (the strings that a user enters to find this table) using each of the functions. Let's take the following search terms:

  • «dark» and «wood», which have the same length but are located in different places of the string: one is closer to the beginning, the other to the end.
  • «table» and «tabla», to determine how much a typo in a word affects the result.
  • «tabl» and «able», to see whether the first substring (a prefix) is more relevant than the second one (a suffix).
  • «round table», «wood table» and «dark table», to determine how much the distance between words affects the result.

Search for a product with the name «dark wood round table»:

+------------------+------------+-----------------+------------------------+
| Search term      | similarity | word_similarity | strict_word_similarity |
+------------------+------------+-----------------+------------------------+
| dark             | 0.22727273 | 1               | 1                      |
| wood             | 0.22727273 | 1               | 1                      |
| table            | 0.27272728 | 1               | 1                      |
| tabla            | 0.16666667 | 0.6666667       | 0.5                    |
| tabl             | 0.17391305 | 0.8             | 0.5714286              |
| able             | 0.125      | 0.6             | 0.375                  |
| round table      | 0.54545456 | 1               | 1                      |
| wood table       | 0.5        | 0.64705884      | 0.64705884             |
| dark table       | 0.5        | 0.54545456      | 0.54545456             |
+------------------+------------+-----------------+------------------------+
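Each value in this table can be reproduced directly. For example, for the «table» search term:

SELECT similarity('table', 'dark wood round table'); -- 0.27272728
SELECT word_similarity('table', 'dark wood round table'); -- 1
SELECT strict_word_similarity('table', 'dark wood round table'); -- 1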

First, you can see that it is important for the similarity function that the first string is as similar as possible to the second one as a whole. However, for the word_similarity and strict_word_similarity functions, it is important that the first string is as similar as possible to one of the substrings of the second string. The official documentation says that similarity returns a number that indicates how similar the two arguments are, and word_similarity returns a number that indicates the greatest similarity between the set of trigrams in the first string and any continuous extent of an ordered set of trigrams in the second string.

If you used the similarity function, the user would not be able to find a product with the name «dark wood round table» by the search term «table», because the threshold value of pg_trgm.similarity_threshold is 0.3 by default.

Let's check it out:

SELECT 'table' % 'dark wood round table'; -- FALSE

To solve this issue, you can reduce the threshold value, for example, to 0.2:

SET pg_trgm.similarity_threshold = 0.2;
SELECT 'table' % 'dark wood round table'; -- TRUE

However, if the name of the product is a little longer, the user will not be able to find it again:

SELECT 'table' % 'dark brown wood round dining table'; -- FALSE

Therefore, it is preferable to choose the word_similarity and strict_word_similarity functions.

Let's go back to our table. Note that all functions have the same value for «dark» and «wood». It means that the proximity of the first string to the beginning of the second string does not affect the result.

If there is a typo, strict_word_similarity returns a smaller value than word_similarity (0.5 and 0.6666667 respectively). Hence, strict_word_similarity is more sensitive to typos: if a user enters «able» instead of «tabl», then when using strict_word_similarity, the value will decrease more than when using word_similarity.

Let's see if the product with the «dark wood round table» name will be shown if a user enters «tabl» or «able». Previously, to compare 2 strings using the similarity function, we used the % symbol. For word_similarity, the <% symbol is used, and for strict_word_similarity, the <<% symbol.

/* word_similarity */
SELECT 'tabl' <% 'dark wood round table'; -- TRUE
SELECT 'able' <% 'dark wood round table'; -- TRUE

/* strict_word_similarity */
SELECT 'tabl' <<% 'dark wood round table'; -- TRUE
SELECT 'able' <<% 'dark wood round table'; -- FALSE

If you enter «tabl», the product will be shown using both functions. If you enter «able», the product will be shown when using the word_similarity function, but not when using strict_word_similarity. However, it is unlikely that a user will search for the product by typing the end of the word («able» instead of «table»). Most likely, it will be the other way around – users will search by prefix («tabl»). Therefore strict_word_similarity fits better.

Note that each function has its own thresholds, which are equal to the following values:

SHOW pg_trgm.similarity_threshold; -- 0.3
SHOW pg_trgm.word_similarity_threshold; -- 0.6
SHOW pg_trgm.strict_word_similarity_threshold; -- 0.5

Now let's see how much missing words in the search term affect the result. Let me remind you that the product name is «dark wood round table».

See the results of the similarity function in the table above. You can see that the search term «round table» (no missing words) has a value slightly greater than «wood table» (1 missing word), but the reason is that the first string is 1 character longer than the second. The results for the search terms «wood table» (1 missing word) and «dark table» (2 missing words) are equal, which is to be expected, since only the number of identical trigrams matters for similarity. However, the fewer the missing words, the more relevant the search result should be. Thus, the similarity function loses again.

Let's see if our product with the name «dark wood round table» will be shown to a user who uses the search terms we have considered:

/* similarity */
SELECT 'round table' % 'dark wood round table'; -- TRUE
SELECT 'wood table' % 'dark wood round table'; -- TRUE
SELECT 'dark table' % 'dark wood round table'; -- TRUE

/* word_similarity */
SELECT 'round table' <% 'dark wood round table'; -- TRUE
SELECT 'wood table' <% 'dark wood round table'; -- TRUE
SELECT 'dark table' <% 'dark wood round table'; -- FALSE

/* strict_word_similarity */
SELECT 'round table' <<% 'dark wood round table'; -- TRUE
SELECT 'wood table' <<% 'dark wood round table'; -- TRUE
SELECT 'dark table' <<% 'dark wood round table'; -- TRUE

The word_similarity function displays our product using any of our search terms, except «dark table». The value of the function in this case is 0.54545456, while the threshold is 0.6. Let's reduce this threshold to 0.5.

SET pg_trgm.word_similarity_threshold = 0.5;
SELECT 'dark table' <% 'dark wood round table'; -- TRUE

While reducing the threshold for similarity didn't make sense (the product name may be longer), in this case it does.

It is important to note that, unlike similarity, the word_similarity and strict_word_similarity functions do not have a commutative property, so it's important to place the strings in the right order: the first argument should be a search term, and the second – the string in which we are looking for (for example, the name of the product).

SELECT similarity('table', 'dark table'); -- 0.54545456
SELECT similarity('dark table', 'table'); -- 0.54545456

SELECT word_similarity('table', 'dark table'); -- 1
SELECT word_similarity('dark table', 'table'); -- 0.54545456

SELECT strict_word_similarity('table', 'dark table'); -- 1
SELECT strict_word_similarity('dark table', 'table'); -- 0.54545456

As a result, the similarity function is not suitable, because we need to display even those results in which only a small part of the string matches. For example, if a user enters «table» in the search bar, then the application should display the product with the name «dark wood round table».

The functions word_similarity and strict_word_similarity are similar, but I decided to give preference to the latter, because it will give less irrelevant results. For example, the product «dark wood round table» will not be displayed when a user enters the search term «able» if the default threshold of strict_word_similarity is used.
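In an application, such a search could look like this (a sketch, assuming a products table with a name column): the <<% operator filters the products by the threshold, and strict_word_similarity sorts them by relevance.

SELECT name
FROM products
WHERE 'tabl' <<% name
ORDER BY strict_word_similarity('tabl', name) DESC
LIMIT 10;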

Full Text Search

Full Text Search is similar to any popular search engine that displays a list of websites (documents) matching a specific search term. Unlike trigrams, a search query can be more complex. For example, you can find documents that contain only a few words from a query, even if they are far from each other. You can also find documents with an exact occurrence of the search term.

When you use trigrams, the similarity of 2 strings decreases if you enter a word in a different form, for example, in the plural («tables») instead of the singular («table»). Thus, you can lose relevant results by writing one of the words of the search term in a different form. There is no such problem if you use Full Text Search.

The search is performed in the following way. First, the parser splits the string (for example, the name of the product) into a set of tokens. A token can be a number, a word, etc. Then, using the built-in dictionaries, tokens are converted into lexemes in such a way that different forms of the same word are converted into the same lexeme. As a result, a set of lexemes is obtained from the string. Then, using this set of lexemes, it is determined whether the string matches a certain query (a query is also a set of lexemes).

The tsvector type is used to store a set of lexemes. You can convert a string to tsvector using the to_tsvector function.

SELECT to_tsvector('table'); -- 'tabl':1
SELECT to_tsvector('tables'); -- 'tabl':1

With each lexeme, its position in the document is stored. This position will be used to sort documents by relevance.

Any stop words are ignored since they occur too frequently to be useful in searching (for example, «and»).

SELECT to_tsvector('spoons and forks'); -- 'fork':3 'spoon':1

Note that «and» is ignored because it's a stop word. The words «spoons» and «forks» are converted into the lexemes «spoon» and «fork» with their positions in the document («spoon» is in the first place, and «fork» is in the third).

In addition to the position of the lexeme, you can also store its priority. There are 4 priorities in total: A (highest), B, C, and D (lowest; not displayed next to the lexeme). For example, if you need to search by posts, you can set priority A to the title of the post, B to the content of the post, and D to the author's name.

SELECT setweight(to_tsvector('Comparision of indexes'), 'A') || setweight(to_tsvector('The content of the article'), 'B') || setweight(to_tsvector('Ilya Ordin'), 'D');
/* 'articl':8B 'comparis':1A 'content':5B 'ilya':9 'index':3A 'ordin':10 */

The tsquery type is used for search queries. You can create this type using one of the following functions: to_tsquery, plainto_tsquery, phraseto_tsquery and websearch_to_tsquery.

Using to_tsquery, you can compose queries with logical operators. The words will be transformed into lexemes. The @@ operator is used to determine whether a query matches a document (e.g. a product name).

SELECT to_tsquery('tables'); -- 'tabl'
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tables'); -- TRUE

SELECT to_tsquery('tables & dark'); -- 'tabl' & 'dark'
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tables & dark'); -- TRUE

SELECT to_tsquery('tables | armchairs'); -- 'tabl' | 'armchair'
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tables | armchairs'); -- TRUE

SELECT to_tsquery('tables & ! dark'); -- 'tabl' & !'dark'
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tables & ! dark'); -- FALSE

SELECT to_tsquery('round <-> tables'); -- 'round' <-> 'tabl'
SELECT to_tsvector('dark wood round table') @@ to_tsquery('round <-> tables'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('dark <-> tables'); -- FALSE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('dark <3> tables'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('round <1> tables'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('round <3> tables'); -- FALSE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('dark <-> wood <-> round <-> table'); -- TRUE

The <-> operator means that the words follow each other. You can also specify the exact distance between words. For example, <3> means that there should be exactly 2 words between them (the lexeme positions differ by 3). If the distance is different, the document will not match the query. As you can see, «round <3> tables» does not match the document «dark wood round table», but «dark <3> tables» does. <1> is equivalent to <->.

The to_tsquery function is convenient to use when you need to compose a query manually, because logical operators must be used between the words, otherwise there will be an error.

SELECT to_tsquery('dark tables'); -- ERROR:  syntax error in tsquery: "dark tables"

The queries received from users should be passed to other functions, which we will discuss below.

plainto_tsquery transforms all words into a set of lexemes and puts the & operator between them. The phraseto_tsquery function does the same thing, but uses the <-> operator.

SELECT plainto_tsquery('dark wood tables'); -- 'dark' & 'wood' & 'tabl'
SELECT phraseto_tsquery('dark wood tables'); -- 'dark' <-> 'wood' <-> 'tabl'

Accordingly, if it is necessary to show the user, for example, products whose names contain all the words of the query in any order, then you need to use plainto_tsquery, and if you want to search by phrase (the words must follow each other), then phraseto_tsquery.

The websearch_to_tsquery function allows you to compose more complex queries:

SELECT websearch_to_tsquery('dark wood tables'); -- 'dark' & 'wood' & 'tabl'
SELECT websearch_to_tsquery('"dark wood tables"'); -- 'dark' <-> 'wood' <-> 'tabl'
SELECT websearch_to_tsquery('tables or armchairs'); -- 'tabl' | 'armchair'
SELECT websearch_to_tsquery('tables -dark'); -- 'tabl' & !'dark'

Thus, the search for documents will be similar to the search using popular search engines.

Let's search for the product with the name «dark wood round table» using different prefixes.

SELECT to_tsvector('dark wood round table'); -- 'dark':1 'round':3 'tabl':4 'wood':2
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tables'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('table'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tabl'); -- TRUE
SELECT to_tsvector('dark wood round table') @@ to_tsquery('tab'); -- FALSE

The «table» word is transformed to the «tabl» lexeme, so the user needs to enter at least «tabl» in the search bar to find the product with the name «dark wood round table». This product cannot be found using the «tab» search term.

Let's look at another example. Let's add a new product with the name «black leather armchair» to our catalog and try to find it using the «armcha» search term.

SELECT to_tsvector('black leather armchair'); -- 'armchair':3 'black':1 'leather':2
SELECT to_tsvector('black leather armchair') @@ to_tsquery('armcha'); -- FALSE

Unfortunately, the user will not be able to find our new product using this search term. However, we can specify in the query that we want to find all lexemes with such a prefix.

SELECT to_tsvector('black leather armchair') @@ to_tsquery('armcha:*'); -- TRUE

Since the search term is entered by a user, you can preliminarily transform the query on the server side by appending :* to the last word.

For example, the user's search term «black armcha» can be transformed to «black & armcha:*»:

SELECT to_tsquery('black & armcha:*'); -- 'black' & 'armcha':*
SELECT to_tsvector('black leather armchair') @@ to_tsquery('black & armcha:*'); -- TRUE

and the query «black armchairs» to «black & armchairs:*»:

SELECT to_tsquery('black & armchairs:*'); -- 'black' & 'armchair':*
SELECT to_tsvector('black leather armchair') @@ to_tsquery('black & armchairs:*'); -- TRUE

In both cases, a user will see the product with the name «black leather armchair» in the search results.
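This transformation can even be sketched in pure SQL, although in practice it is usually done in the application code, with additional escaping of special characters:

SELECT to_tsquery('english', regexp_replace(trim('black armcha'), '\s+', ' & ', 'g') || ':*');
/* 'black' & 'armcha':* */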

But if we try to make a typo in the lexeme, then we will not be able to find this product:

SELECT to_tsvector('black leather armchair') @@ to_tsquery('armchair'); -- TRUE
SELECT to_tsvector('black leather armchair') @@ to_tsquery('wrmchair'); -- FALSE

This is a significant disadvantage compared to trigrams. However, in this case there is a solution. You can create a separate table with words that are used in all product names and offer the user options for correcting words that were entered with typos. Read more here.
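A sketch of this approach, assuming the product names are stored in the name column of a products table: collect all words into a separate words table using ts_stat (with the simple configuration, so that the words are not stemmed), index them with trigrams, and suggest the closest word.

CREATE TABLE words AS
  SELECT word FROM ts_stat('SELECT to_tsvector(''simple'', name) FROM products');
CREATE INDEX words_word_idx ON words USING gin (word gin_trgm_ops);
SELECT word FROM words WHERE word % 'wrmchair'
ORDER BY similarity(word, 'wrmchair') DESC LIMIT 1; -- armchair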

So far, we have written everything in English. What happens if the search term and the product name are in another language (e.g. Russian)?

By default, the language specified in default_text_search_config is used for token transformation.

SHOW default_text_search_config; -- pg_catalog.english

If we try to transform Russian words to tsvector (e.g. «черные кресла» — «black armchairs»), the lexemes will be the same as the tokens, because there are no such words in the English dictionaries.

SELECT to_tsvector('черные кресла'); -- 'кресла':2 'черные':1

In any of the above-mentioned functions (to_tsvector, to_tsquery, plainto_tsquery, etc.), you can pass the name of the language to be used as the first argument (the name of the pg_catalog schema can be omitted):

SELECT to_tsvector('pg_catalog.russian', 'черные кресла'); -- 'кресл':2 'черн':1
SELECT to_tsvector('russian', 'черные кресла'); -- 'кресл':2 'черн':1

But what if we have a table with documents in different languages? For example, English-speaking users talk in the chat in English, and Russian-speaking users in Russian. There are 2 ways to solve this problem.

Option 1. Create an additional column in a table, where the language of the message will be stored.

CREATE TABLE messages (id int, body varchar, lng regconfig);
INSERT INTO messages VALUES (1, 'Good news everyone!', 'english'), (2, 'Это просто замечательно!', 'russian');
SELECT to_tsvector(lng, body) FROM messages;
/* 'everyon':3 'good':1 'news':2 */
/*  'замечательн':3 'прост':2 'эт':1 */
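To search taking into account the language stored in each row, you can create an expression index (a sketch; the two-argument to_tsvector is immutable, so it can be used in an index expression):

CREATE INDEX messages_body_idx ON messages USING gin (to_tsvector(lng, body));
SELECT * FROM messages WHERE to_tsvector(lng, body) @@ plainto_tsquery('english', 'news');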

If the language of the message is unknown, you can write a function that determines it by running the string through to_tsvector with different languages: the language that produces the shortest lexemes is most likely the language the message is written in.

SELECT to_tsvector('english', 'черные кресла'); -- 'кресла':2 'черные':1
SELECT to_tsvector('russian', 'черные кресла'); -- 'кресл':2 'черн':1

Option 2. Create your own dictionary, which will consist of dictionaries of different languages. Detailed instructions are described here.

In this case, unlike the first option, one message can contain words in several languages at once. For example, if an English-speaking user wrote something and additionally inserted a quote in Russian into the message.

Let's talk a little bit about ranking search results. Most likely, the user needs to be shown the first N results that are most relevant to his query. To do this, you need a function to calculate the relevance. There are 2 such functions in PostgreSQL: ts_rank and ts_rank_cd. The higher the value, the more relevant the search result is.

The value of ts_rank depends on the frequency of lexemes: the more often a lexeme is found in a document, the less valuable it is and the smaller the value.

The ts_rank_cd function is similar to ts_rank, but it additionally takes into account the proximity of lexemes to each other. To do this, it uses the positions of lexemes that are stored in tsvector. By the way, judging by this presentation (slides 22-23), the ts_rank_cd function does not work well with the | operator.
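For example, the N most relevant results can be obtained like this (a sketch, assuming a messages table with a precomputed body_tsv column, which we will create below):

SELECT body
FROM messages
WHERE body_tsv @@ plainto_tsquery('english', 'round table')
ORDER BY ts_rank(body_tsv, plainto_tsquery('english', 'round table')) DESC
LIMIT 10;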

Now let's go back to the indexes.

What indexes will be compared

To speed up the search using trigrams, you can create either a GiST or a GIN index. To speed up Full Text Search, you can use one of the following indexes: GiST, GIN or RUM. Let's compare them all and find out which one is faster.

GiST index for trigrams

First, let's create the following table:

CREATE TABLE messages (id int, body text);

The GiST index for trigrams is created as follows:

CREATE INDEX messages_body_idx ON messages USING gist (body gist_trgm_ops);

Let's talk about how the GiST index works. Let's assume that we have a table with 3 short messages: «table», «able» and «tab». This is what the trigrams will look like for them:

SELECT show_trgm('table'); -- {"  t"," ta",abl,ble,"le ",tab}
SELECT show_trgm('able'); -- {"  a"," ab",abl,ble,"le "}
SELECT show_trgm('tab'); -- {"  t"," ta","ab ",tab}

For each trigram, its own signature (bit string) is created, in which all bits are zero, except one. For example, it might look like this:

  • «  t» – 00100000 (3rd bit)
  • « ta» – 00000100 (6th bit)
  • «abl» – 10000000 (1st bit)
  • «ble» – 00010000 (4th bit)
  • «le » – 00000001 (8th bit)
  • «tab» – 00001000 (5th bit)
  • «  a» – 10000000 (1st bit)
  • « ab» – 00100000 (3rd bit)
  • «ab » – 01000000 (2nd bit)

The bit number is determined by a hash function that takes a trigram as input and returns the bit number. The length of the bit string (signature) in our case is 8 bits, or 1 byte. By default, the signature length is 12 bytes, but it can be changed by specifying the siglen parameter as follows:

CREATE INDEX messages_body_idx ON messages USING gist (body gist_trgm_ops(siglen=32));

The length of the signature can be from 1 to 2024 bytes.

Let's go back to our example. I think you've noticed that some different trigrams have the same signature (for example, «abl» and «  a»), because the hash function returns the same bit number for them. This is called a collision. Soon you will see how this affects the search, but for now it is important to understand that the longer the signature, the less likely it is that the hash function will return the same bit number for 2 different trigrams and they will have the same signature. However, the longer the signature, the more disk space the index takes up.

Then the signatures of all documents (strings) are calculated using logical addition (OR) of the signatures of all the trigrams they contain:

  • «table» – 00100000 («  t») | 00000100 (« ta») | 10000000 («abl») | 00010000 («ble») | 00000001 («le ») | 00001000 («tab») = 10111101
  • «able» – 10000000 («  a») | 00100000 (« ab») | 10000000 («abl») | 00010000 («ble») | 00000001 («le ») = 10110001
  • «tab» – 00100000 («  t») | 00000100 (« ta») | 01000000 («ab ») | 00001000 («tab») = 01101100

The GiST index builds a signature tree, in the leaves of which the signatures of the documents are located (in our case 10111101, 10110001 and 01101100). The root and each internal node of the tree contains references to either other internal nodes or leaves. They store the signature of all the documents they refer to. For example, if in our case there is only the root node, then the signature 11111101 is stored in it (10111101 | 10110001 | 01101100).

When searching, the search term is divided into trigrams and its signature is calculated. Then, starting from the root of the tree, the node is compared with the query signature. If all the positive bits from the query signature match the positive bits of the node signature, then we repeat this operation for all nodes that it refers to. As a result, we will get to the leaves where the references to the rows of the tables are stored.

Recall that we can set the length of the signature. The shorter the signature, the less disk space the index takes up, but the more likely the signatures of different trigrams are to coincide. As a result, when searching, we will sometimes descend into leaves that do not actually match the query, which will slow down the search. We will not get false results, because the GiST index rechecks the results against the real values (if they take up little space, they are stored inside the leaf and the search speed does not suffer much; otherwise, the table has to be accessed, which slows down the search).

So, the question arises: what is the length of the signature to choose? To see how much the signature length affects the search speed and disk space, I suggest comparing 3 signature lengths: 12 bytes (default value), 120 bytes and 1200 bytes. The comparison itself is below.

The GiST index performs a depth-first search, which means that it can immediately produce the first results found. This is what we need to implement pagination using LIMIT.

You can read more about the GiST index here.

GIN index for trigrams

The GIN index for trigrams is created as follows:

CREATE INDEX messages_body_idx ON messages USING gin (body gin_trgm_ops);

The GIN index stores trigrams in a B-tree, the leaves of which contain trigrams and references to the table rows in which these trigrams occur. If there are few references in one leaf node, they are stored as a list (posting list) together with the values, otherwise as a B-tree. You can read more about the GIN index and see how it looks here.

The search using the GIN index is usually performed quickly, because unlike GiST, no unnecessary tree traversals are performed. However, the GIN index cannot search in depth, which means that it cannot immediately return the first found results to us. When using LIMIT, all matching results are collected first, and only then are the first ones selected. Thus, if there are a lot of results, the search will be slow.

You can limit the selection of the GIN index by setting the gin_fuzzy_search_limit parameter, but this is only suitable if the result is not sorted by relevance. Otherwise, the most relevant results will most likely be discarded and will not be shown to the user.
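For example:

SET gin_fuzzy_search_limit = 1000; -- the index will return no more than about 1000 rows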

The GIN index usually takes up little space, because the same lexeme is stored only once. In addition, since references to table rows are stored in sorted order, each subsequent reference is stored not as a full reference, but as a difference between the current reference and the previous one.

So, compared to GiST, the GIN index searches faster and takes up less space, but it does not know how to return the first results found immediately.

LIKE and ILIKE support when using trigrams

In addition to the various similarity functions that we considered earlier, the GiST and GIN indexes for trigrams also support the LIKE and ILIKE operators. Using these operators, the search is faster, because all you need is to find the exact match of the desired substring, instead of determining how much one string is similar to another.

The B-tree index can also use the LIKE operator, but only if the search is performed by prefix. For example, ILIKE 'abcd%' (the string starts with «abcd»). Unlike B-tree, the GiST and GIN indexes using trigrams can search by any part of the string. For example, ILIKE '%abcd%' (the string contains «abcd»). In practice, this is used much more often. The GiST and GIN trigram indexes do not support equality comparison, so if necessary, you will need to create an additional B-tree index.

Limit on the number of characters when using trigrams

When using trigrams with LIKE and ILIKE operators, at least 3 characters must be entered in the search term. Otherwise, the index will not be used. Quote from the official documentation: «For both LIKE and regular-expression searches, keep in mind that a pattern with no extractable trigrams will degenerate to a full-index scan».

Let's try to search by entering 3 characters in ILIKE:

EXPLAIN ANALYZE SELECT * FROM messages WHERE body ILIKE '%heh%';
Bitmap Heap Scan on messages  (cost=844.55..9333.91 rows=90909 width=31) (actual time=8.578..488.423 rows=52910 loops=1)
   Recheck Cond: (body ~~* '%heh%'::text)
   Heap Blocks: exact=7352
   ->  Bitmap Index Scan on messages_body_idx  (cost=0.00..821.82 rows=90909 width=0) (actual time=7.436..7.442 rows=52910 loops=1)
         Index Cond: (body ~~* '%heh%'::text)
 Planning Time: 0.417 ms
 Execution Time: 842.995 ms

The PostgreSQL planner used the index. Let's try to enter 2 characters:

EXPLAIN ANALYZE SELECT * FROM messages WHERE body ILIKE '%he%';
Seq Scan on messages  (cost=0.00..19853.00 rows=494949 width=31) (actual time=0.020..4550.681 rows=472940 loops=1)
   Filter: (body ~~* '%he%'::text)
   Rows Removed by Filter: 527060
 Planning Time: 0.415 ms
 Execution Time: 7684.939 ms

The planner chose sequential scanning instead of using the index, which increased the query execution time by almost 10 times.

There is no such restriction when using similarity functions:

EXPLAIN ANALYZE SELECT * FROM messages WHERE body %>> 'h';
Bitmap Heap Scan on messages  (cost=88.78..455.04 rows=100 width=31) (actual time=1546.646..1546.664 rows=0 loops=1)
   Recheck Cond: (body %>> 'h'::text)
   Rows Removed by Index Recheck: 356196
   Heap Blocks: exact=7353
   ->  Bitmap Index Scan on messages_body_idx  (cost=0.00..88.75 rows=100 width=0) (actual time=32.912..32.918 rows=356196 loops=1)
         Index Cond: (body %>> 'h'::text)
 Planning Time: 0.890 ms
 Execution Time: 1546.862 ms

The planner used the index.

For the LIKE and ILIKE operators, you must specify at least 3 characters to get at least 1 trigram, but when using the similarity functions, 1 character is enough, because 2 trigrams are formed even from a single character, and they will participate in the search.

SELECT show_trgm('h'); -- {"  h"," h "}

If you use LIKE or ILIKE, you need to enforce the minimum length of search terms that users enter in the UI of your application.

GiST index for full text searching

Let's create another table where tsvector will also be stored:

CREATE TABLE messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);

Recall that in the GiST index, in addition to the document signature, the documents themselves can be stored in the leaves, but only if they take up a little space (about half a kilobyte if the page size is 8 KB). Otherwise, the GiST index will store references to the rows of the table where the documents are stored. If we do not store the tsvector in a separate column, then with each such access to the rows of the table, the tsvector of the checked row will be calculated, which will significantly slow down the search. To avoid this, we will store the tsvector directly in the table and calculate it when adding new documents or changing existing ones. This will take up additional disk space, but will speed up the search.

Let's create a GiST index:

CREATE INDEX messages_body_tsv_idx ON messages USING gist (body_tsv);

The GiST index for full text searching works the same way as for trigrams, only lexemes are used instead of trigrams. The difference is that the default signature is 124 bytes long (12 bytes in gist_trgm_ops).

You can change the signature length as follows:

CREATE INDEX messages_body_tsv_idx ON messages USING gist (body_tsv tsvector_ops(siglen=512));

GIN index for full text searching

Let me remind you that for the GiST index, we added a column in which tsvector is stored. For the GIN index, this column may or may not be required. It all depends on which queries will be used. If phrase search operators (<->) are used, in which the distance between words is important (for example, so that the words follow each other strictly), then tsvector also needs to be stored in the table. If they are not used (for example, all user search terms will be transformed using the plainto_tsquery function), then tsvector is not needed in the table. This is due to the fact that when using phrase search operators (<->), you need to know the distance between words. This distance is not stored in the GIN index, so each time the table will be accessed for rechecking. When sorting the result by relevance using ts_rank or ts_rank_cd, tsvector also needs to be stored in the table, because these functions use the positions of lexemes for calculations.

If the column with tsvector is not needed, then the index will be created as follows:

CREATE TABLE messages (id int, body text);
CREATE INDEX messages_body_idx ON messages USING gin (to_tsvector('english', body));

To use an index created in this way, it is necessary to specify in each SQL query the expression that was used when creating the index. In our case, this is to_tsvector('english', body).

INSERT INTO messages VALUES (1, 'Hello everyone!');
EXPLAIN ANALYZE SELECT * FROM messages WHERE to_tsvector('english', body) @@ to_tsquery('hello');
Bitmap Heap Scan on messages  (cost=12.30..24.77 rows=6 width=36) (actual time=0.045..0.088 rows=1 loops=1)
   Recheck Cond: (to_tsvector('english'::regconfig, body) @@ to_tsquery('hello'::text))
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on messages_body_idx  (cost=0.00..12.30 rows=6 width=0) (actual time=0.021..0.033 rows=1 loops=1)
         Index Cond: (to_tsvector('english'::regconfig, body) @@ to_tsquery('hello'::text))
 Planning Time: 0.094 ms
 Execution Time: 0.215 ms

The planner used the index, because the same expression was specified, which was used when creating the index. Let's try not to specify regconfig with the language name.

EXPLAIN ANALYZE SELECT * FROM messages WHERE to_tsvector(body) @@ to_tsquery('hello');
Seq Scan on messages  (cost=10000000000.00..10000000660.88 rows=6 width=36) (actual time=149.690..149.708 rows=1 loops=1)
   Filter: (to_tsvector(body) @@ to_tsquery('hello'::text))
 Planning Time: 0.082 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.865 ms, Inlining 84.521 ms, Optimization 44.615 ms, Emission 20.293 ms, Total 150.295 ms
 Execution Time: 184.425 ms

The index was not used and the search speed dropped significantly.

If a column with tsvector is needed, then the index is created as follows:

CREATE TABLE messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE INDEX messages_body_tsv_idx ON messages USING gin (body_tsv);

The GIN index for full text searching is similar to the GIN index for trigrams, only lexemes are used instead of trigrams.

RUM index for full text searching

The RUM index is very similar to GIN, but it:

  • stores the positions of lexemes, which means that phrase search operators (<->) are supported.
  • is able to search in depth, which means it immediately returns the first results.
  • can return results in order of their relevance.

This index is not included in the standard PostgreSQL distribution. To install it, use the instructions available in the official GitHub repository of the index.

The RUM index is created as follows:

CREATE TABLE messages (id int, body text);
CREATE INDEX messages_body_idx ON messages USING rum (to_tsvector('english', body));
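Thanks to the stored lexeme positions, RUM can not only filter rows, but also return them immediately in order of relevance using its <=> distance operator (a sketch, assuming the index above):

SELECT body
FROM messages
WHERE to_tsvector('english', body) @@ to_tsquery('english', 'tables')
ORDER BY to_tsvector('english', body) <=> to_tsquery('english', 'tables')
LIMIT 10;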

The RUM index can be created with the rum_tsvector_hash_ops operator class. In this case, the hash of the lexemes will be stored instead of the lexemes themselves. This may reduce the size of the index, but the search will be slower and rechecks will be required. Also, this operator class does not support prefix search (:*).

CREATE TABLE messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE INDEX messages_body_idx ON messages USING rum (body_tsv rum_tsvector_hash_ops);

According to the developers of the RUM index, rum_tsvector_hash_ops should be used only if the lexemes are very long (for example, if a lexeme is a chain of nucleotides). For natural languages, hashing lexemes does not make sense, because the lexemes are small, but I suggest checking it anyway.

You can read more about the RUM index here and see it here.

So, as a result, we will compare the following indexes:

  • GiST for trigrams with signatures of 12, 120 and 1200 bytes in length.
  • GIN for trigrams.
  • GiST for full-text search with signatures of 12, 120 and 1200 bytes in length.
  • GIN for full-text search.
  • RUM for full-text search using rum_tsvector_ops.
  • RUM for full-text search using rum_tsvector_hash_ops.

Read about the comparison of indexes in the second part of the article.
