Logo
Englika

Comparing indexes for text search in PostgreSQL (part 2)

Comparing indexes for text search in PostgreSQL (part 2)

In the first part of the article, we looked at how text search is performed using trigrams (pg_trgm) and full-text search, and decided that we would compare the following indexes to speed up the search:

  • GiST for trigrams with signatures of 12, 120 and 1200 bytes in length.
  • GIN for trigrams.
  • GiST for full-text search with signatures of 12, 120 and 1200 bytes in length.
  • GIN for full-text search.
  • RUM for full-text search using rum_tsvector_ops.
  • RUM for full-text search using rum_tsvector_hash_ops.

Preparation

After comparing the indexes, we need to answer to the following questions:

  • What index should I choose for trigrams and for full-text search?
  • Is it worth using trigrams for long strings? If not, at what length of the string is it better to abandon the use of trigrams?
  • What length of the signature should I choose when using the GiST index?
  • How much disk space can rum_tsvector_hash_ops save and is it really not worth using it for natural languages?

To determine how the length of the document affects the index size and search speed, we will create 3 tables for each index:

  • short_messages with a message length of about 25 characters.
  • medium_messages with a message length of about 250 characters.
  • long_messages with a message length of about 2500 characters.

We will compare indexes on the same data, so at the beginning we will create 3 tables, fill them with data and copy these tables to test each index.

CREATE TABLE short_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE TABLE medium_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE TABLE long_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);

Ideally, it would be to fill them with real messages, but since they are not at hand, we will generate them. In order for the generated words to be similar to real ones and some of the lexemes to be shorter than the words themselves, we will compose them by syllables. However, if new words are generated for each string, the number of unique lexemes will be incredibly large, which will greatly increase the size of the GIN index and increase the likelihood of false positives in the GiST index, which will slow down the search for it. So first we will create the words table, from which we will choose words for composing messages.

CREATE TABLE words (id int, word text);

According to these statistics, the average length of 1 English word is 5-6 characters. Let's take the most popular syllables from this page and make words out of them. Each word will consist of 3 syllables. There was the article on The Economist that said that most adult native speakers know 20000-35000 words.

Let's fill our table with 20000 generated words:

INSERT INTO words SELECT i, word
FROM generate_series(1, 20000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(syllable, '')
	FROM (
		SELECT (string_to_array('th,he,an,er,in,re,nd,ou,en,on,ed,to,it,at,ha,ve,as,or,hi,ar,te,es,ver,hat,thi,tha,ent,ion,ith,ire,wit,eve,oul,uld,tio,ter,hen,era,hin,sho,ted,ome', ','))[floor(random() * 42) + 1]
		FROM generate_series(1, 3)
	) as f(syllable)
	WHERE i = i
) as f(word);

We generated the following «words»:

SELECT word FROM words ORDER BY random() LIMIT 10;
   word    
-----------
 totetha
 hiometio
 atoulera
 eratoen
 anoreve
 terhatted
 ontiothi
 reatas
 astioent
 ndwittio
(10 rows)

Let's fill each of our tables with 1 million messages:

INSERT INTO short_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 3 -- ~25 letters
	) as words
	WHERE i = i
) as f(body);

INSERT INTO medium_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 30 -- ~250 letters
	) as words
	WHERE i = i
) as f(body);

INSERT INTO long_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 300 -- ~2500 letters
	) as words
	WHERE i = i
) as f(body);

For convenience, duplicates of the tables short_messages, medium_messages and long_messages were created for each case. For example:

-- GiST for trigrams with signatures of 12 bytes in length
CREATE TABLE sm_trgm_gist_12 AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gist_12 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gist_12 AS (SELECT * FROM long_messages);

-- GiST for trigrams with signatures of 120 bytes in length
CREATE TABLE sm_trgm_gist_120 AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gist_120 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gist_120 AS (SELECT * FROM long_messages);

-- ...

The following queries were used to create indexes:

/* short_messages */
-- GiST for trigrams with signatures of 12, 120 and 1200 bytes in length
CREATE INDEX sm_trgm_gist_12_idx ON sm_trgm_gist_12 USING gist (body gist_trgm_ops(siglen=12));
CREATE INDEX sm_trgm_gist_120_idx ON sm_trgm_gist_120 USING gist (body gist_trgm_ops(siglen=120));
CREATE INDEX sm_trgm_gist_1200_idx ON sm_trgm_gist_1200 USING gist (body gist_trgm_ops(siglen=1200));

-- GIN for trigrams
CREATE INDEX sm_trgm_gin_idx ON sm_trgm_gin USING gin (body gin_trgm_ops);

-- GiST for full-text search with signatures of 12, 120 and 1200 bytes in length
CREATE INDEX sm_fts_gist_12_idx ON sm_fts_gist_12 USING gist (body_tsv tsvector_ops(siglen=12));
CREATE INDEX sm_fts_gist_120_idx ON sm_fts_gist_120 USING gist (body_tsv tsvector_ops(siglen=120));
CREATE INDEX sm_fts_gist_1200_idx ON sm_fts_gist_1200 USING gist (body_tsv tsvector_ops(siglen=1200));

-- GIN for full-text search
CREATE INDEX sm_fts_gin_idx ON sm_fts_gin USING gin (body_tsv);

-- RUM for full-text search using rum_tsvector_ops
CREATE INDEX sm_fts_rum_idx ON sm_fts_rum USING rum (body_tsv);

-- RUM for full-text search using rum_tsvector_hash_ops
CREATE INDEX sm_fts_rum_hash_idx ON sm_fts_rum_hash USING rum (body_tsv rum_tsvector_hash_ops);

/* medium_messages */
-- ...

/* long_messages */
-- ...

Full code for creating tables

CREATE TABLE words (id int, word text);
CREATE TABLE short_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE TABLE medium_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE TABLE long_messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);

INSERT INTO words SELECT i, word
FROM generate_series(1, 20000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(syllable, '')
	FROM (
		SELECT (string_to_array('th,he,an,er,in,re,nd,ou,en,on,ed,to,it,at,ha,ve,as,or,hi,ar,te,es,ver,hat,thi,tha,ent,ion,ith,ire,wit,eve,oul,uld,tio,ter,hen,era,hin,sho,ted,ome', ','))[floor(random() * 42) + 1]
		FROM generate_series(1, 3)
	) as f(syllable)
	WHERE i = i
) as f(word);

INSERT INTO short_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 3 -- ~25 letters
	) as words
	WHERE i = i
) as f(body);

INSERT INTO medium_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 30 -- ~250 letters
	) as words
	WHERE i = i
) as f(body);

INSERT INTO long_messages SELECT i, body
FROM generate_series(1, 1000000) as i
CROSS JOIN LATERAL (
	SELECT string_agg(word, ' ') FROM (
		SELECT word FROM words ORDER BY random() LIMIT 300 -- ~2500 letters
	) as words
	WHERE i = i
) as f(body);

CREATE TABLE sm_trgm_gist_12 AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gist_12 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gist_12 AS (SELECT * FROM long_messages);

CREATE TABLE sm_trgm_gist_120 AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gist_120 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gist_120 AS (SELECT * FROM long_messages);

CREATE TABLE sm_trgm_gist_1200 AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gist_1200 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gist_1200 AS (SELECT * FROM long_messages);

CREATE TABLE sm_trgm_gin AS (SELECT * FROM short_messages);
CREATE TABLE mm_trgm_gin AS (SELECT * FROM medium_messages);
CREATE TABLE lm_trgm_gin AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_gist_12 AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_gist_12 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_gist_12 AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_gist_120 AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_gist_120 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_gist_120 AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_gist_1200 AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_gist_1200 AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_gist_1200 AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_gin AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_gin AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_gin AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_rum AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_rum AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_rum AS (SELECT * FROM long_messages);

CREATE TABLE sm_fts_rum_hash AS (SELECT * FROM short_messages);
CREATE TABLE mm_fts_rum_hash AS (SELECT * FROM medium_messages);
CREATE TABLE lm_fts_rum_hash AS (SELECT * FROM long_messages);

CREATE INDEX sm_trgm_gist_12_idx ON sm_trgm_gist_12 USING gist (body gist_trgm_ops(siglen=12));
CREATE INDEX mm_trgm_gist_12_idx ON mm_trgm_gist_12 USING gist (body gist_trgm_ops(siglen=12));
CREATE INDEX lm_trgm_gist_12_idx ON lm_trgm_gist_12 USING gist (body gist_trgm_ops(siglen=12));

CREATE INDEX sm_trgm_gist_120_idx ON sm_trgm_gist_120 USING gist (body gist_trgm_ops(siglen=120));
CREATE INDEX mm_trgm_gist_120_idx ON mm_trgm_gist_120 USING gist (body gist_trgm_ops(siglen=120));
CREATE INDEX lm_trgm_gist_120_idx ON lm_trgm_gist_120 USING gist (body gist_trgm_ops(siglen=120));

CREATE INDEX sm_trgm_gist_1200_idx ON sm_trgm_gist_1200 USING gist (body gist_trgm_ops(siglen=1200));
CREATE INDEX mm_trgm_gist_1200_idx ON mm_trgm_gist_1200 USING gist (body gist_trgm_ops(siglen=1200));
CREATE INDEX lm_trgm_gist_1200_idx ON lm_trgm_gist_1200 USING gist (body gist_trgm_ops(siglen=1200));

CREATE INDEX sm_trgm_gin_idx ON sm_trgm_gin USING gin (body gin_trgm_ops);
CREATE INDEX mm_trgm_gin_idx ON mm_trgm_gin USING gin (body gin_trgm_ops);
CREATE INDEX lm_trgm_gin_idx ON lm_trgm_gin USING gin (body gin_trgm_ops);

CREATE INDEX sm_fts_gist_12_idx ON sm_fts_gist_12 USING gist (body_tsv tsvector_ops(siglen=12));
CREATE INDEX mm_fts_gist_12_idx ON mm_fts_gist_12 USING gist (body_tsv tsvector_ops(siglen=12));
CREATE INDEX lm_fts_gist_12_idx ON lm_fts_gist_12 USING gist (body_tsv tsvector_ops(siglen=12));

CREATE INDEX sm_fts_gist_120_idx ON sm_fts_gist_120 USING gist (body_tsv tsvector_ops(siglen=120));
CREATE INDEX mm_fts_gist_120_idx ON mm_fts_gist_120 USING gist (body_tsv tsvector_ops(siglen=120));
CREATE INDEX lm_fts_gist_120_idx ON lm_fts_gist_120 USING gist (body_tsv tsvector_ops(siglen=120));

CREATE INDEX sm_fts_gist_1200_idx ON sm_fts_gist_1200 USING gist (body_tsv tsvector_ops(siglen=1200));
CREATE INDEX mm_fts_gist_1200_idx ON mm_fts_gist_1200 USING gist (body_tsv tsvector_ops(siglen=1200));
CREATE INDEX lm_fts_gist_1200_idx ON lm_fts_gist_1200 USING gist (body_tsv tsvector_ops(siglen=1200));

CREATE INDEX sm_fts_gin_idx ON sm_fts_gin USING gin (body_tsv);
CREATE INDEX mm_fts_gin_idx ON mm_fts_gin USING gin (body_tsv);
CREATE INDEX lm_fts_gin_idx ON lm_fts_gin USING gin (body_tsv);

CREATE INDEX sm_fts_rum_idx ON sm_fts_rum USING rum (body_tsv);
CREATE INDEX mm_fts_rum_idx ON mm_fts_rum USING rum (body_tsv);
CREATE INDEX lm_fts_rum_idx ON lm_fts_rum USING rum (body_tsv);

CREATE INDEX sm_fts_rum_hash_idx ON sm_fts_rum_hash USING rum (body_tsv rum_tsvector_hash_ops);
CREATE INDEX mm_fts_rum_hash_idx ON mm_fts_rum_hash USING rum (body_tsv rum_tsvector_hash_ops);
CREATE INDEX lm_fts_rum_hash_idx ON lm_fts_rum_hash USING rum (body_tsv rum_tsvector_hash_ops);

Queries for comparing indexes

The query execution speed for trigrams and full-text search will be compared separately.

Let's increase the work_mem parameter from 4 MB (by default) to 50 MB, so that when using the GIN index, the entire bitmap map is built up to rows and fit in the memory (there will be no lossy heap blocks). In this case, the search will be performed faster. Experimentally, it was found out that this size is sufficient for our tables.

The tests will use PostgresSQL 13.4.

Queries for trigrams

To check the speed of the indexes that use trigrams, we will run 3 different queries: the first two with ILIKE to find all messages that contain a search query (one sorted by relevance, the other without), and the third with %>> (strict_word_similarity) to search for messages that contain a substring, similar to a search query (sorted by relevance). For strict_word_similarity, we will use the standard threshold value of 0.5. The %>> operator does not make sense to test without sorting, because the first results may be very poorly relevant.

All queries will select the first 60 rows. Let me remind you that the GIN index does not know how to return the first results and return them in the right order. Therefore, at the beginning, all records that match the search criteria will be returned, then sorting and selection of the first 60 most relevant rows will take place.

Example of a query for trigrams with ILIKE without sorting:

SELECT body FROM short_messages WHERE body ILIKE '%abcd%' LIMIT 60;

and with sorting:

SELECT body, 'abcd' <<<-> body AS dist FROM short_messages WHERE body ILIKE '%abcd%' ORDER BY dist LIMIT 60;

Example of a query for trigrams with %>> and sorting:

SELECT body, 'abcd' <<<-> body AS dist FROM short_messages WHERE body %>> 'abcd' ORDER BY dist LIMIT 60;

The only difference between queries using GiST, GIN and RUM is that for the first two indexes we will use the ts_rank_cd function to tank the results, and for RUM we will use the <=> operator (its own ts_score function is used). We will return the first 60 results that are most relevant to the search query.

Example of a query for full-text search using GiST and GIN indexes:

SELECT body, ts_rank_cd(body_tsv, to_tsquery('abcd')) AS rank FROM short_messages WHERE body_tsv @@ to_tsquery('abcd') ORDER BY rank DESC LIMIT 60;

Example of a query for full-text search using RUM index:

SELECT body, body_tsv <=> to_tsquery('abcd') AS rank FROM short_messages WHERE body_tsv @@ to_tsquery('abcd') ORDER BY rank DESC LIMIT 60;

We will compare the following 3 varieties of tsquery:

  • A – there is at least one word.
  • A & B – there are both words exist in any places.
  • A <-> B – there are both words that follow each other.

Query execution speed

Let me remind you that in the short_messages table, messages have a length of ~25 characters, medium_messages – ~250 characters, long_messages – ~2500 characters. The speed is specified in milliseconds.

Trigrams with ILIKE (without sorting)

+--------------------+----------------+-----------------+---------------+
|       Index        | short_messages | medium_messages | long_messages |
+--------------------+----------------+-----------------+---------------+
| GiST (siglen=12)   |             11 |              11 |           172 |
| GiST (siglen=120)  |              9 |              14 |           183 |
| GiST (siglen=1200) |             10 |              11 |         29019 |
| GIN                |             11 |              12 |           495 |
+--------------------+----------------+-----------------+---------------+

Trigrams with ILIKE (with sorting)

+--------------------+----------------+-----------------+---------------+
|       Index        | short_messages | medium_messages | long_messages |
+--------------------+----------------+-----------------+---------------+
| GiST (siglen=12)   |          12965 |          171744 |       1232381 |
| GiST (siglen=120)  |          11015 |          176907 |       1306757 |
| GiST (siglen=1200) |           8355 |          191420 |       1478560 |
| GIN                |           1881 |           11879 |        442356 |
+--------------------+----------------+-----------------+---------------+

Trigrams with %>> (with sorting)

+--------------------+----------------+-----------------+---------------+
|       Index        | short_messages | medium_messages | long_messages |
+--------------------+----------------+-----------------+---------------+
| GiST (siglen=12)   |          19160 |          190739 |       1527176 |
| GiST (siglen=120)  |          16420 |          177249 |       1814276 |
| GiST (siglen=1200) |          16393 |          175266 |       2051626 |
| GIN                |            959 |           43536 |        761483 |
+--------------------+----------------+-----------------+---------------+

Full-text search (A)

+-----------------------------+----------------+-----------------+---------------+
|           Index             | short_messages | medium_messages | long_messages |
+-----------------------------+----------------+-----------------+---------------+
| GiST (siglen=12)            |             59 |           17201 |        110395 |
| GiST (siglen=120)           |             40 |           11912 |        133825 |
| GiST (siglen=1200)          |             26 |            2828 |        121076 |
| GIN                         |             12 |            1272 |         11601 |
| RUM (rum_tsvector_ops)      |              7 |            1374 |         20193 |
| RUM (rum_tsvector_hash_ops) |              7 |            1268 |         20372 |
+-----------------------------+----------------+-----------------+---------------+

Full-text search (A & B)

+-----------------------------+----------------+-----------------+---------------+
|           Index             | short_messages | medium_messages | long_messages |
+-----------------------------+----------------+-----------------+---------------+
| GiST (siglen=12)            |             25 |           16413 |       1155500 |
| GiST (siglen=120)           |             12 |            9457 |        160801 |
| GiST (siglen=1200)          |              4 |            2902 |         67691 |
| GIN                         |              6 |              15 |          1107 |
| RUM (rum_tsvector_ops)      |              5 |              16 |          1431 |
| RUM (rum_tsvector_hash_ops) |              4 |              17 |          1382 |
+-----------------------------+----------------+-----------------+---------------+

Full-text search (A <-> B)

+-----------------------------+----------------+-----------------+---------------+
|           Index             | short_messages | medium_messages | long_messages |
+-----------------------------+----------------+-----------------+---------------+
| GiST (siglen=12)            |           1463 |           17395 |       1119352 |
| GiST (siglen=120)           |            751 |            9226 |        151751 |
| GiST (siglen=1200)          |            427 |            5537 |         68095 |
| GIN                         |              8 |              13 |          1047 |
| RUM (rum_tsvector_ops)      |              9 |              13 |            96 |
| RUM (rum_tsvector_hash_ops) |              7 |              14 |            99 |
+-----------------------------+----------------+-----------------+---------------+

Size of indexes

The size is specified in megabytes.

Trigrams

+--------------------+----------------+-----------------+---------------+
|       Index        | short_messages | medium_messages | long_messages |
+--------------------+----------------+-----------------+---------------+
| GiST (siglen=12)   |            133 |             806 |          3200 |
| GiST (siglen=120)  |            111 |             613 |          4714 |
| GiST (siglen=1200) |            137 |             761 |          7373 |
| GIN                |             39 |             227 |           719 |
+--------------------+----------------+-----------------+---------------+
+-----------------------------+----------------+-----------------+---------------+
|           Index             | short_messages | medium_messages | long_messages |
+-----------------------------+----------------+-----------------+---------------+
| GiST (siglen=12)            |             50 |             211 |            77 |
| GiST (siglen=120)           |             41 |             162 |           198 |
| GiST (siglen=1200)          |             49 |             190 |          1619 |
| GIN                         |             16 |             218 |           949 |
| RUM (rum_tsvector_ops)      |             16 |             249 |          3114 |
| RUM (rum_tsvector_hash_ops) |             16 |             248 |          3114 |
+-----------------------------+----------------+-----------------+---------------+

The size of tables:

  • short_messages – 108 MB
  • medium_messages – 782 MB
  • long_messages – 7228 MB

Conclusion

Code examples using the messages table will be used below. For trigrams, this table looks like this:

CREATE TABLE messages (id int, body text);

for full-text search:

CREATE TABLE messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);

Trigrams

When using trigrams, the search speed is much higher when using ILIKE without sorting (do not forget to specify at least 3 characters). Sorting and using similarity funcitons greatly slow down the search.

The search speed of GiST and GIN indexes is approximately the same with a short string length (up to ~250 characters). At the same time, the GIN index is much smaller that the GiST index (by 3.5 times in our test). Therefore, it is worth choosing the GIN index for short strings (e.g. up to 250 characters).

-- For short strings (up to ~250 characters); without sorting by relevance
CREATE INDEX messages_body_idx ON messages USING gin (body gin_trgm_ops);
SELECT body FROM messages WHERE body ILIKE '%abcd%' LIMIT 60;

The GiST index will work much faster (by 3 times in our test) for longer strings (e.g. ~2500 characters), but it will take up more space (by 4.5 times in our test). The signature length should be chosen 12 bytes (by default).

-- For long strings (e.g. ~2500 characters); without sorting by relevance
-- It is better to use RUM with <-> (see below)
CREATE INDEX messages_body_idx ON messages USING gist (body gist_trgm_ops);
SELECT body FROM messages WHERE body ILIKE '%abcd%' LIMIT 60;

It is better not to use sorting, because this is a very expensive operation. If it is extremely necessary, it is better to choose the GIN index, but at the same time strongly limit the length of the strings (e.g. up to 50 characters). In out test the GIN index is faster from 2 to 15 times, depending on the length of the rows. The same is true when using the similarity functions.

GIN and RUM indexes for full-text search is the same in performance for short strings (up to ~250 characters). With long strings (e.g. ~2500 characters), RUM is faster when multiple words are used using the <-> operator, which means they must follow each other. When using a search phrase consisting of 1 word, GIN is faster than RUM (about 2 times in our test; tested on different words).

During the additional testing without sorting by relevance, the RUM index started working as fast as GIN, event with search terms consisting of only 1 word. As a result, RUM more productive on long strings when using the <-> operator (the search query can be transformed using phraseto_tsquery) without sorting by relevance.

-- Without sorting by relevance
CREATE INDEX messages_body_idx ON messages USING rum (body_tsv);
SELECT body FROM messages WHERE body_tsv @@ phraseto_tsquery('apple orange') LIMIT 60;

The size of the index does not reduce when using the rum_tsvector_hash_ops operator class for natural language, which is what the developers of this index mentioned. Therefore, it is better to use rum_tsvector_ops (used by default) so that the lexemes themselves are stored in the index, and not their hash, and unnecessary rechecks are not required, which will slow down the search.

You also need to take into account that the RUM index takes up more disk space than the GIN index, because in addition to lexemes, their positions are stored in it. For example, with string ~2500 characters long, the RUM index is 3 times larget than GIN.

If sorting by relenace is required, then the GIN index is more faster, because for search terms consisting of 1 word, it words faster than RUM (by 2 times in out test).

-- With sorting by relevance
-- It is better not to use sorting
CREATE INDEX messages_body_idx ON messages USING gin (body_tsv);
SELECT body, ts_rank_cd(body_tsv, plainto_tsquery('apple')) AS rank FROM messages WHERE body_tsv @@ to_tsquery('apple') ORDER BY rank DESC LIMIT 60;

The GiST index for full-text search is much slower (from 10 to 70 times in our test) than GIN and RUM.

During the additional test of RUM with <-> (without sorting by relevance) the speed of searching thought the tables short_messages, medium_messages and long_messages was the same (~65 ms), regardless of the fact that the length of the strings is different (from ~25 to ~2500 characters).

On the short_messages and medium_messages tables with message lengths of ~25 and ~250 characters, respectively, the GIN index using trigrams with ILIKE performed a search in 11-12 ms (the GiST index has the same time, but it is better not to use it in this case, because it takes up much more disk space), which is faster than RUM. On the long_messages tables with message length of ~2500 characters, the GiST index using trigrams with ILIKE performed a search in 172 ms, and GIN in 495 ms, which is longer than RUM.

As a result, if the speed of query execution is important and the advantage in recognizing different forms of words is not so important (since each word turns into a lexeme during a full-text search), then:

  • for short strings (e.g. up to 250 characters), it is better to use the GIN index using trigrams with ILIKE without sorting with a signature length of 12 bytes (by default). For example, search by company name.
  • for longer strings (e.g. ~2500 characters), it is better to use the RUM index using full-text search with <-> without sorting. For example, search by messages or documents.

If it is important that different forms of words are take into account when searching, then you can use the RUM index for any string length using full-text search with <-> without sorting.

Cheat sheet

-- For short strings (e.g. up to 250 characters). For example, search by company names.
-- The GIN index with using trigrams with ILIKE without sorting with a signature length of 12 bytes (by default).
CREATE TABLE messages (id int, body text);
CREATE INDEX messages_body_idx ON messages USING gin (body gin_trgm_ops);
SELECT body FROM messages WHERE body ILIKE '%abcd%' LIMIT 60;

-- For longer strings (e.g. ~2500 characters). For example, search by messages or documents.
-- The RUM index using full-text search with <-> without sorting.
CREATE TABLE messages (id int, body text, body_tsv tsvector GENERATED ALWAYS AS (to_tsvector('english', body)) STORED);
CREATE INDEX messages_body_idx ON messages USING rum (body_tsv);
SELECT body FROM messages WHERE body_tsv @@ phraseto_tsquery('apple orange') LIMIT 60;SELECT body, ts_rank_cd(body_tsv, to_tsquery('abcd')) AS rank FROM short_messages WHERE body_tsv @@ to_tsquery('abcd') ORDER BY rank DESC LIMIT 60;

Related posts

How to get N rows per group in SQL

Let's assume that you are developing the home page in an online store. This page should display product categories with 10 products in each. How do you make a query to the database? What indexes do you need to create to speed up query execution?

How to get N rows per group in SQL