There are several ways to get random rows from a table. In this article, we will compare the most popular methods, understand their pros and cons and determine in which cases which method is better to use.
Let's create the products table with 10 millions rows, on which we will test different methods:
CREATE TABLE products (id int, name varchar);
INSERT INTO products SELECT i, md5(random()::text) FROM generate_series(1,10000000) AS i;
CREATE INDEX products_id_key ON products (id);
We will select 50 random rows from the table.
Method 1 – ORDER BY random()
SELECT * FROM products ORDER BY random() LIMIT 50;
Limit (cost=540526.81..540526.93 rows=50 width=45) (actual time=143484.564..143485.538 rows=50 loops=1)
-> Sort (cost=540526.81..565526.81 rows=10000000 width=45) (actual time=143439.405..143439.721 rows=50 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 31kB
-> Seq Scan on products (cost=0.00..208334.00 rows=10000000 width=45) (actual time=0.031..71253.970 rows=10000000 loops=1)
Planning Time: 0.121 ms
JIT:
Functions: 3
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 0.917 ms, Inlining 5.433 ms, Optimization 25.189 ms, Emission 13.572 ms, Total 45.111 ms
Execution Time: 143486.892 ms
The query was executed in 143 sec. As you can see in the query execution plan, at the beginning, 10 million rows are scanned (rows opposite the Seq Scan line), and then random 50 are selected from them (rows opposite the LIMIT line), so the query is executed slowly. The larger the table is, the slower the query will be executed.
Example:
6713420 | 75a9cf740bd16b42dc1e43ce880c25a0
7621610 | f562ff34caaf769791dfc48663969df1
2284385 | fad4514cd3245e60fe5c48c64f98b6b8
792226 | 039b7f34de7ee8d882351c4eb93815e0
1881245 | 5316b953acc4285b2623f34cedbdd011
6286124 | 6d03a97c23c89bdfd52db92f4c33e656
9582555 | 4a4b442a8d7b63b66796e4181fe42a90
508950 | eaeaa945bd54fc8e4e2977b1cc086db4
3597563 | e1e735de8ec03ec72ebf978c7b6b3177
8707436 | 0110b19d81a218c5b3b3b40e817c0e6b
This method is suitable only for small tables. As an experiment, I created 3 separate products tables: with 10 thousand, 100 thousand and 1 million rows. Sampling 50 random rows of them took 4 ms, 28 ms and 207 ms respectively. As a result, if there are more than 100 thousand rows in the table, it is better to use another method.
Method 2 – TABLESAMPLE
In PostgreSQL 9.5, the TABLESAMPLE clause has appeared, with which you can get a random sample from a table. After this clause, the name of the method is specified, which determines how the rows will be selected.
PostgreSQL, by default, provides SYSTEM and BERNOULLI methods for TABLESAMPLE, with which you can select a certain percentage of random rows (for example, 0.005% of all rows). Using the tsm_system_rows extension, you can also add the SYSTEM_ROWS method, which will allow you to select a fixed number of random rows.
Let's take a closer look at each method to understand their advantages and disadvantages.
SYSTEM
In the SYSTEM method, the probability of selecting each page is specified as an argument (all rows of the table are physically stored as pages; the page size, by default, is 8 KB).
SELECT * FROM products TABLESAMPLE SYSTEM(0.005);
Sample Scan on products (cost=0.00..21.00 rows=500 width=37) (actual time=0.314..4.909 rows=480 loops=1)
Sampling: system ('0.005'::real)
Planning Time: 1.671 ms
Execution Time: 9.324 ms
The query was executed fast (in 9 ms), because random pages with rows were immediately selected from the table without sequential scanning.
Since the number of selected pages and the number of rows in each page may be different, the SYSTEM method may return a different number of rows from query to query. In out example, we set the probability of selecting each page to 0.005%. If this method chose the exact number of rows, then we would get exactly 500 rows (0.005% of 10 million). In our case, we got 480 rows.
Let's try to execute this query multiple times:
SELECT COUNT(*) FROM products TABLESAMPLE SYSTEM(0.005); -- 480
SELECT COUNT(*) FROM products TABLESAMPLE SYSTEM(0.005); -- 720
SELECT COUNT(*) FROM products TABLESAMPLE SYSTEM(0.005); -- 600
SELECT COUNT(*) FROM products TABLESAMPLE SYSTEM(0.005); -- 360
It's important to understand that rows are returned in the same order in which they are located on the pages. It means that in the final sample, the rows will most likely be arranged sequentially.
Example:
19796 | 5721e954c9e0ffdae2f89a02f29d723c
19797 | 53e008d3b3740a65afb4ba81e5f29f40
19798 | e097788740c30af4db0dc232b2fa514b
19799 | 61c54409df64eb05dad999ec9056246f
19800 | bc6239e67be89ecd3244488070986a4c
559321 | f4cec3ac0666cf2ffe32afe1ee84b619
559322 | 159b7ca0746b47c4b48af58937105bb1
559323 | 9d7e64f419a237e2613a160e9e79afcc
559324 | 1c85669659c7264d053bceee1dc50b16
559325 | 509af792853a8d82cf45ed65bf5b685d
BERNOULLI
If in the SYSTEM method we specified the probability of selecting each page (rows are stored in pages), then the BERNOULLI method specifies the probability of selecting each row. Thus, the selection from the table is more random and the number of selected rows varies less from query to query.
SELECT * FROM products TABLESAMPLE BERNOULLI(0.005);
Sample Scan on products (cost=0.00..83339.00 rows=500 width=37) (actual time=0.248..229.548 rows=513 loops=1)
Sampling: bernoulli ('0.005'::real)
Planning Time: 0.049 ms
Execution Time: 233.994 ms
The BERNOULLI method was executed longer (in 223 ms) than the SYSTEM (in 9 ms), but the sampling is more random:
186606 | 6dd5f0abc1af80599c15341f396e14c3
215946 | 7cddbe553116bd4951bb77d1a7c25dcf
248615 | b5828f7cc3ee249fedb031f2d2cb1b41
278561 | c6a6f67617566dfbbae4c53196d144e8
305718 | 44e748600746f0ca3545969e6ae1d606
348908 | a6de8624f028c5fe76dc9b7c1ca87b5a
363781 | cb527dcdbe87f4356ffe234f8a80b0c7
402060 | 42fd9855687a597e43168ea14b75fada
450987 | 4f073aa6e70d4d0ed97db6edd50da87d
459344 | f07ef3e8c2e01472341d76d920702552
The BERNOULLI method, like the SYSTEM, returns a different number of rows from query to query, but it changes less:
SELECT COUNT(*) FROM products TABLESAMPLE BERNOULLI(0.005); -- 503
SELECT COUNT(*) FROM products TABLESAMPLE BERNOULLI(0.005); -- 505
SELECT COUNT(*) FROM products TABLESAMPLE BERNOULLI(0.005); -- 489
SELECT COUNT(*) FROM products TABLESAMPLE BERNOULLI(0.005); -- 484
SYSTEM_ROWS
Most often, you need to select a fixed number of random rows. This allows you to make the SYSTEM_ROWS method. To use it, you need to add the tsm_system_rows extension.
CREATE EXTENSION tsm_system_rows;
SELECT * FROM products TABLESAMPLE SYSTEM_ROWS(50);
Sample Scan on products (cost=0.00..4.50 rows=50 width=37) (actual time=0.020..0.642 rows=50 loops=1)
Sampling: system_rows ('50'::bigint)
Planning Time: 0.059 ms
Execution Time: 1.201 ms
The query was executed in 1 ms. Exactly 50 rows were selected.
Like SYSTEM, the SYSTEM_ROWS method selects pages, so we face the same problem – rows from 1 page will most likely be arranged sequentially:
7473356 | d9f944c78aa457d941908374686b2f4e
7473357 | f151b2a93d62014c8282889f2cc8b36a
7473358 | 83eb095b5523aa3b0fd8f349e8d31d51
7473359 | 09db36738765a0dccd5a88a7f7357902
7473360 | 0cf56782d2b18fff6dc5ff7cd82cdab4
5215921 | 6decd1ca973bf7b52f16cb5fecc8d4b1
5215922 | 9d941a0f32f6d9a3b834d481f05cc88b
5215923 | 831278c9cec605d3bca29e5ca66ca043
5215924 | fcfa6406b6170b6a29e4fd855c364baa
5215925 | f408f1d6a4935f5068126d0c9d8aa009
Method 3 – recursive random()
In this case, random rows are sampled N times. To do this, the auxiliary construction WITH RECURSIVE is used. At the beginning, the first query is executed (before UNION ALL), then the second query is iteratively executed (after UNION ALL). The first query calculates the values that will be used in the second query to sample random rows: the minimum ID min_id, the maximum ID max_id and the current row number n. After the second query is executed N times (in our case 50 times), the row selection is completed and the selected IDs are passed to the main query to get all the columns.
WITH RECURSIVE rand AS (
SELECT NULL::int id, min(id) min_id, max(id) max_id, 0 n
FROM products
UNION ALL
SELECT products.id, rand.min_id, rand.max_id, rand.n + CASE WHEN products.id IS NULL THEN 0 ELSE 1 END
FROM rand
LEFT JOIN products ON products.id = (
SELECT floor(random() * (rand.max_id - rand.min_id + 1))::int
)
WHERE rand.n < 50
)
SELECT products.*
FROM rand
INNER JOIN products ON rand.id = products.id;
Nested Loop (cost=179.02..441.54 rows=31 width=37) (actual time=2.576..86.295 rows=50 loops=1)
CTE rand
-> Recursive Union (cost=0.93..178.58 rows=31 width=16) (actual time=0.183..46.175 rows=51 loops=1)
-> Result (cost=0.93..0.94 rows=1 width=16) (actual time=0.169..0.233 rows=1 loops=1)
InitPlan 2 (returns $1)
-> Limit (cost=0.43..0.46 rows=1 width=4) (actual time=0.060..0.086 rows=1 loops=1)
-> Index Only Scan using products_id_key on products products_1 (cost=0.43..285061.74 rows=10009960 width=4) (actual time=0.046..0.053 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
InitPlan 3 (returns $2)
-> Limit (cost=0.43..0.46 rows=1 width=4) (actual time=0.055..0.081 rows=1 loops=1)
-> Index Only Scan Backward using products_id_key on products products_2 (cost=0.43..285061.74 rows=10009960 width=4) (actual time=0.042..0.048 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 1
-> Nested Loop Left Join (cost=0.46..17.70 rows=3 width=16) (actual time=0.835..0.872 rows=1 loops=51)
-> WorkTable Scan on rand rand_1 (cost=0.00..0.22 rows=3 width=12) (actual time=0.008..0.016 rows=1 loops=51)
Filter: (n < 50)
Rows Removed by Filter: 0
-> Index Only Scan using products_id_key on products products_3 (cost=0.46..5.81 rows=1 width=4) (actual time=0.794..0.802 rows=1 loops=50)
Index Cond: (id = (SubPlan 1))
Heap Fetches: 0
SubPlan 1
-> Result (cost=0.00..0.03 rows=1 width=4) (actual time=0.008..0.015 rows=1 loops=50)
-> CTE Scan on rand (cost=0.00..0.62 rows=31 width=4) (actual time=1.417..46.770 rows=50 loops=1)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 1
-> Index Scan using products_id_key on products (cost=0.43..8.45 rows=1 width=37) (actual time=0.751..0.758 rows=1 loops=50)
Index Cond: (id = rand.id)
Planning Time: 0.344 ms
Execution Time: 86.865 ms
The query was executed in 86 ms. Example:
6830883 | 26fd6b3aafd8fc474a665ea2655cf02e
6097219 | e008f0d9209f59b53af7912bf193fc31
655526 | 815b1cfbf08e64a2e754e585ac2e6e27
9487691 | 46fabc8b003be186eb163b56d7b168e5
7229501 | 098575f64eab9e9d21ab2cc3dffe4616
3535162 | 1ce3c65aa9db72863e2d9f991087d633
3609568 | 9050232f67abe00a7786bcbf5ab8306c
5748236 | 65226bcde08efc34ddc05ad4a3616ed9
2213019 | 92a5f53dcef8fc655f411ebe12f442c5
5555903 | b5d30d266204c346d1ec16842b431358
Cheat sheet
-- Selecting 1 random row
SELECT * FROM products TABLESAMPLE SYSTEM_ROWS(1);
-- Selecting N random rows for small tables (up to ~100000 rows)
SELECT * FROM products ORDER BY random() LIMIT 50;
-- Selecting N random rows for big tables
WITH RECURSIVE rand AS (
SELECT NULL::int id, min(id) min_id, max(id) max_id, 0 n
FROM products
UNION ALL
SELECT products.id, rand.min_id, rand.max_id, rand.n + CASE WHEN products.id IS NULL THEN 0 ELSE 1 END
FROM rand
LEFT JOIN products ON products.id = (
SELECT floor(random() * (rand.max_id - rand.min_id + 1))::int
)
WHERE rand.n < 50
)
SELECT id FROM rand WHERE id IS NOT NULL;
-- Selecting % random rows (the number of rows may vary from query to query)
SELECT * FROM products TABLESAMPLE BERNOULLI(0.005)