In this article, we will look at what soft deletion is, why it is needed, and how best to implement it.
Hard deletion is the process of completely removing a row from a table using the DELETE statement. With soft deletion, a table row is only marked as deleted but remains in the table. This is useful for two reasons. First, it allows you to quickly delete many rows at once (for example, when a user of a CRM system deletes all deals that satisfy a certain condition) or to delete a single row with a large number of references (for example, when a user of an online store deletes their account, with which many orders are associated). Second, soft deletion allows you to restore rows that have already been deleted (for example, accidentally deleted deals in a CRM system).
Soft deletion can be implemented in the following ways:
- Add the deleted_at column that will store the timestamp when the row was deleted. If the row is not deleted (hereinafter referred to as «active»), the value will be NULL. Queries to the table must contain the condition WHERE deleted_at IS NULL if active rows are requested, or WHERE deleted_at IS NOT NULL if deleted rows are requested.
- Create a separate table to store deleted rows. Alternatively, you can create a separate deleted schema for this, in which all tables with deleted rows will be stored. For example, we have the chats and deleted.chats tables. When deleting a row, you must perform the DELETE operation in the chats table and INSERT in the deleted.chats table.
When implementing the first method, some indexes must be made partial (more on this below), and all queries must contain the condition WHERE deleted_at IS NULL or WHERE deleted_at IS NOT NULL (this can be simplified; more on this below). Soft deletion and recovery will be fast, because only the deleted_at column of the affected rows needs to be updated.
In the second case, indexes do not need to be changed and conditions do not need to be added, because active and deleted rows are stored in different tables. However, mass deletion and recovery will be much slower, because it is necessary to delete rows from one table and insert them into another.
If rows are deleted rarely, one at a time, and are not referenced by many other rows, then the second option may be preferable. However, most often the deleted rows have many references (for example, when deleting a user, you need to delete all their chats, each of which may contain millions of messages).
I do not consider table partitioning here, because it has the same problem: during mass deletion, rows have to be deleted from one partition and inserted into another (this happens under the hood).
Let's assume that a user has a chat with 1 million messages (we'll omit the relation with a user for simplicity):
CREATE TABLE chats (id serial primary key, name text);
CREATE TABLE messages (id serial primary key, chat_id int REFERENCES chats ON DELETE CASCADE, body text);
INSERT INTO chats (name) VALUES ('first chat');
INSERT INTO messages (chat_id, body) SELECT 1, md5(random()::text) FROM generate_series(1, 1000000);
If the user wants to delete the chat, they will have to wait for some time, because both the chat and all related messages must be deleted from the table. In addition, the corresponding entries must be removed from indexes.
DELETE FROM chats WHERE id = 1; -- 6.1 sec
If ON DELETE CASCADE is replaced with ON DELETE SET NULL, the query will still take a long time, because all related rows must be updated by setting chat_id to NULL.
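For reference, the alternative foreign key mentioned above would look like this (a sketch; the table otherwise matches the earlier definition):

```sql
-- Same messages table, but on chat deletion the related messages
-- are kept with chat_id set to NULL instead of being deleted
CREATE TABLE messages (
    id serial primary key,
    chat_id int REFERENCES chats ON DELETE SET NULL,
    body text
);
```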
Soft deletion comes to the rescue. Let's add the deleted_at column to the chats table. By default, this field will be NULL (chat is active).
ALTER TABLE chats ADD COLUMN deleted_at timestamp;
When a user deletes the chat, the current date must be set in the deleted_at column.
UPDATE chats SET deleted_at = now() WHERE id = 1; -- 3.5 ms
Now the condition deleted_at IS NULL must be added to all queries (SELECT, UPDATE, DELETE) that will access the table with active chats:
SELECT * FROM chats WHERE deleted_at IS NULL;
UPDATE chats SET name = 'my chat' WHERE id = 1 AND deleted_at IS NULL;
DELETE FROM chats WHERE id = 1 AND deleted_at IS NULL;
The SELECT query returns an empty result, because the previous query marked our chat as deleted. The UPDATE query does not change the deleted chat, because its deleted_at column is not NULL. The same applies to the DELETE query.
Since the deleted_at column is NULL by default, new chats will always be visible to the user.
INSERT INTO chats (name) VALUES ('second chat');
SELECT * FROM chats WHERE deleted_at IS NULL;
id | name | deleted_at
----+-------------+------------
2 | second chat |
(1 row)
One of the main disadvantages of soft deletion is that the deleted_at IS NULL condition must be used everywhere. If this condition is omitted in some query, the user will see deleted data, which is unacceptable, especially if the deleted rows contain sensitive data (for example, a bank card number).
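Continuing the example above, a single forgotten condition is enough to leak the deleted chat:

```sql
-- The condition is forgotten: the soft-deleted 'first chat'
-- is returned alongside the active ones
SELECT * FROM chats;

-- Correct query: only active chats are returned
SELECT * FROM chats WHERE deleted_at IS NULL;
```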
Creating a view
In order not to use the same condition in all queries, you can create a view of a query:
CREATE VIEW available_chats AS SELECT * FROM chats WHERE deleted_at IS NULL;
Instead of using the chats table, we will refer to the available_chats «table». The following 2 queries are identical:
SELECT * FROM available_chats;
SELECT * FROM chats WHERE deleted_at IS NULL;
The available_chats «table» is not stored anywhere. Every time it is accessed, a query is made to the chats table with the condition deleted_at IS NULL.
In addition to SELECT, you can perform other operations:
INSERT INTO available_chats (name) VALUES ('third chat'); -- INSERT 0 1
UPDATE available_chats SET name = 'another chat' WHERE name = 'third chat'; -- UPDATE 1
DELETE FROM available_chats WHERE name = 'another chat'; -- DELETE 1
However, through the available_chats view it is not possible to update or delete a chat that does not match the condition deleted_at IS NULL (that is, a deleted chat).
UPDATE available_chats SET name = 'another chat' WHERE id = 1; -- UPDATE 0
DELETE FROM available_chats WHERE id = 1; -- DELETE 0
So far, we have used the available_chats view for active chats. However, it is better to give the view the name chats and rename the table itself, for example, to chats_all. In this case, existing queries that refer to chats do not need to be changed.
ALTER TABLE chats RENAME TO chats_all;
ALTER VIEW available_chats RENAME TO chats;
Let's also add a view for deleted chats:
CREATE VIEW chats_deleted AS SELECT * FROM chats_all WHERE deleted_at IS NOT NULL;
As a result, we have:
- the chats_all table with all chats.
- the chats view with active chats.
- the chats_deleted view with deleted chats.
Creating a trigger
When deleting a chat from the chats «table» (active chats), a soft deletion (UPDATE) must be performed instead of a physical deletion (DELETE), setting the current timestamp in the deleted_at column. When deleting a chat from the chats_deleted «table», a real DELETE must be performed.
We implement the logic of soft deletion using a trigger:
CREATE FUNCTION soft_delete() RETURNS trigger AS $$
DECLARE
fullTableName text := quote_ident(TG_TABLE_SCHEMA) || '.' || quote_ident(TG_TABLE_NAME);
BEGIN
EXECUTE 'UPDATE ' || fullTableName || ' SET deleted_at = now() WHERE id = $1' USING OLD.id;
RETURN OLD;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER soft_delete
INSTEAD OF DELETE ON chats
FOR EACH ROW
EXECUTE FUNCTION soft_delete();
Here, the soft deletion logic is placed in a separate trigger function soft_delete, which can be reused for different tables. In it, we define the variable fullTableName with the schema and table name for which the trigger fired (in our case, the public schema and the chats table). We take these names from the variables TG_TABLE_SCHEMA and TG_TABLE_NAME. The quote_ident function wraps the string in double quotes if necessary (for example, when the table name looks like camelCaseTableName).
SELECT quote_ident('table_name'); -- table_name
SELECT quote_ident('tableName'); -- "tableName"
Strings are concatenated using the || operator.
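As an aside, the same query string can be built with the format function, whose %I specifier applies identifier quoting automatically (equivalent to the quote_ident calls above). Inside the trigger function body it would look like this:

```sql
-- %I quotes each identifier the same way quote_ident does
EXECUTE format('UPDATE %I.%I SET deleted_at = now() WHERE id = $1',
               TG_TABLE_SCHEMA, TG_TABLE_NAME) USING OLD.id;
```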
In the body of the soft_delete trigger function, we execute a query to update the deleted_at column, setting the current timestamp for the row for which the trigger was triggered.
The trigger function returns OLD, thereby reporting that it has successfully performed the operation. The affected row will be considered processed (DELETE 1), and subsequent triggers (if any) will be called. If the trigger function returned NULL (this can be found in some implementations), the affected row would NOT be considered processed (DELETE 0), even though the UPDATE would still be performed, and subsequent triggers (if any) would not be called.
-- The trigger function is used that returns OLD
DELETE FROM chats WHERE id = 1; -- DELETE 1
-- The trigger function is used that returns NULL
DELETE FROM chats WHERE id = 1; -- DELETE 0
Therefore, it is important that our trigger function returns OLD at the end.
The soft_delete trigger (the names of the function and the trigger do not have to match) fires when a DELETE is performed on the chats «table». In this case, instead of DELETE, the soft_delete function is called for each row.
If 1 million rows are deleted from the chats «table», the soft_delete function will be called for each row to update the deleted_at column. As a result, 1 million UPDATE queries will be executed. To avoid this, you can limit the maximum number of rows deleted at a time (for example, 1000).
DELETE FROM chats WHERE ctid = ANY(ARRAY(SELECT ctid FROM chats LIMIT 1000)); -- DELETE 1000
As a result, when deleting rows from the chats «table», we do the soft deletion by setting the current timestamp in the deleted_at column.
When deleting rows from the chats_deleted «table», we do the hard deletion:
DELETE FROM chats_deleted WHERE name = 'second chat'; -- DELETE 1
SELECT * FROM chats_deleted WHERE name = 'second chat'; -- 0 rows
SELECT * FROM chats_all WHERE name = 'second chat'; -- 0 rows
Removing sensitive data
In some cases, when a user deletes a row, some sensitive data in it must also be erased (for example, a bank card number or an identifier that could be used to charge recurring payments again).
Suppose we have the users table with the credit_card_number column. When a user deletes their account, the credit card number must be erased (when restoring the account, the card number can be requested again).
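Note that an INSTEAD OF trigger can only be attached to a view, so users here is assumed to follow the same table-plus-view pattern as chats. The setup might look like this (a sketch; the exact column set is an assumption):

```sql
-- Physical table with all users, including soft-deleted ones
CREATE TABLE users_all (
    id serial primary key,
    name text,
    credit_card_number text,
    deleted_at timestamp
);
-- View with active users, to which the trigger below is attached
CREATE VIEW users AS SELECT * FROM users_all WHERE deleted_at IS NULL;
```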
To do this, set the columns with sensitive data (in our case, the credit_card_number column) to NULL in the trigger:
CREATE FUNCTION soft_delete_user() RETURNS trigger AS $$
DECLARE
fullTableName text := quote_ident(TG_TABLE_SCHEMA) || '.' || quote_ident(TG_TABLE_NAME);
BEGIN
EXECUTE 'UPDATE ' || fullTableName || ' SET credit_card_number = NULL, deleted_at = now() WHERE id = $1' USING OLD.id;
RETURN OLD;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER soft_delete
INSTEAD OF DELETE ON users
FOR EACH ROW
EXECUTE FUNCTION soft_delete_user();
Changing uniqueness constraints
For example, let's create a new companies table with a uniqueness constraint on the name column. To keep the example simple, we will not create views or triggers for it.
CREATE TABLE companies (
id int,
name text,
deleted_at timestamp,
CONSTRAINT companies_name_key UNIQUE (name)
);
What happens if we add a new company, soft delete it and try to create a new company with the same name?
INSERT INTO companies (id, name) VALUES (1, 'my company');
UPDATE companies SET deleted_at = now() WHERE id = 1;
INSERT INTO companies (id, name) VALUES (2, 'my company');
-- ERROR: duplicate key value violates unique constraint "companies_name_key"
-- DETAIL: Key (name)=(my company) already exists.
The uniqueness constraint throws an error because the company still exists in the table. This can be fixed by making only the names of active companies unique. Unfortunately, this cannot be implemented with a regular uniqueness constraint, but there are other options.
Option 1. Create a partial index that will check for uniqueness.
CREATE UNIQUE INDEX companies_name_key ON companies (name) WHERE deleted_at IS NULL;
Option 2. Create an exclusion constraint that checks the name column for equality (equivalent to a UNIQUE constraint, but the regular uniqueness constraint is faster).
ALTER TABLE companies ADD CONSTRAINT companies_name_key EXCLUDE (name WITH =) WHERE deleted_at IS NULL;
Let's use the first option:
ALTER TABLE companies DROP CONSTRAINT companies_name_key;
CREATE UNIQUE INDEX companies_name_key ON companies (name) WHERE deleted_at IS NULL;
Let's try to add a company with the same name twice:
INSERT INTO companies (id, name) VALUES (2, 'my company'); -- INSERT 0 1
INSERT INTO companies (id, name) VALUES (3, 'my company');
-- ERROR: duplicate key value violates unique constraint "companies_name_key"
-- DETAIL: Key (name)=(my company) already exists.
The first company was added successfully, because there is no active company with the same name. The second time there was an error, because such a company already exists.
What happens if we try to restore a deleted company with the same name?
UPDATE companies SET deleted_at = NULL WHERE id = 1;
-- ERROR: duplicate key value violates unique constraint "companies_name_key"
-- DETAIL: Key (name)=(my company) already exists.
An error occurred because we have already created another company with the same name. As a solution, when this error occurs, you can ask the user for a new company name.
UPDATE companies SET name = 'another company', deleted_at = NULL WHERE id = 1;
The query was completed successfully, because there are no companies with a new name.
It is important to note that if we implemented a uniqueness constraint on two columns (for example, deleted and name), then an attempt to soft delete a company would throw an error if a company with that name already exists among the deleted ones. By analogy: imagine an empty trash bin on your computer. You delete a file 1.jpg and, after a while, delete another file with the same name. At this moment, the trash bin reports that the file cannot be deleted, because a file with this name already exists among the deleted ones. That would be strange. Therefore, we used a partial index that checks only active companies for uniqueness.
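This behavior is easy to reproduce with a hypothetical companies_flagged table that uses a boolean deleted flag in a two-column constraint:

```sql
CREATE TABLE companies_flagged (
    id int,
    name text,
    deleted boolean DEFAULT false,
    CONSTRAINT companies_flagged_key UNIQUE (name, deleted)
);

INSERT INTO companies_flagged (id, name) VALUES (1, 'my company');
UPDATE companies_flagged SET deleted = true WHERE id = 1;  -- first deletion is fine
INSERT INTO companies_flagged (id, name) VALUES (2, 'my company');
-- Soft deleting the second company fails: ('my company', true) already exists
UPDATE companies_flagged SET deleted = true WHERE id = 2;
-- ERROR: duplicate key value violates unique constraint "companies_flagged_key"
```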
Changing indexes
Let me remind you that to access the table with active chats, we use the chats view, and to access deleted chats – chats_deleted. Both of these views access the physical chats_all table.
-- When executing queries
SELECT * FROM chats;
SELECT * FROM chats_deleted;
-- They are actually executed as
SELECT * FROM chats_all WHERE deleted_at IS NULL;
SELECT * FROM chats_all WHERE deleted_at IS NOT NULL;
Indexes used to speed up searching active and deleted chats should include this condition. Otherwise, instead of the index immediately returning the results, a separate stage will filter them by the condition deleted_at IS NULL or deleted_at IS NOT NULL.
-- A standalone chats table is used for this benchmark
CREATE TABLE chats (id int, name text, deleted_at timestamp);
INSERT INTO chats (id, name, deleted_at) SELECT i, md5(random()::text), (CASE WHEN random() < 0.1 THEN now() ELSE NULL END) FROM generate_series(1, 1000000) AS i;
CREATE EXTENSION IF NOT EXISTS pg_trgm; -- provides the gin_trgm_ops operator class
CREATE INDEX chats_name_idx ON chats USING gin (name gin_trgm_ops);
EXPLAIN ANALYZE SELECT * FROM chats WHERE name ILIKE '%abc%' AND deleted_at IS NULL;
Bitmap Heap Scan on chats (cost=24.77..393.40 rows=90 width=45) (actual time=2.850..2237.357 rows=6593 loops=1)
Recheck Cond: (name ~~* '%abc%'::text)
Filter: (deleted_at IS NULL)
Rows Removed by Filter: 738
Heap Blocks: exact=4910
-> Bitmap Index Scan on chats_name_idx (cost=0.00..24.75 rows=100 width=0) (actual time=1.940..1.947 rows=7331 loops=1)
Index Cond: (name ~~* '%abc%'::text)
Planning Time: 0.367 ms
Execution Time: 2286.334 ms
As you can see, 7331 chats (active and deleted) containing the search term «abc» were found first, and then 738 deleted chats that do not match the condition deleted_at IS NULL were filtered out. The search took 2.2 sec.
Let's make a partial index by adding the condition deleted_at IS NULL to it.
DROP INDEX chats_name_idx;
CREATE INDEX chats_name_idx ON chats USING gin (name gin_trgm_ops) WHERE deleted_at IS NULL;
EXPLAIN ANALYZE SELECT * FROM chats WHERE name ILIKE '%abc%' AND deleted_at IS NULL;
Bitmap Heap Scan on chats (cost=24.70..357.97 rows=90 width=45) (actual time=6.397..528.855 rows=6593 loops=1)
Recheck Cond: ((name ~~* '%abc%'::text) AND (deleted_at IS NULL))
Heap Blocks: exact=4574
-> Bitmap Index Scan on chats_name_idx (cost=0.00..24.67 rows=90 width=0) (actual time=5.627..5.634 rows=6593 loops=1)
Index Cond: (name ~~* '%abc%'::text)
Planning Time: 4.438 ms
Execution Time: 579.336 ms
This time we immediately selected the 6593 active chats containing the search term «abc», so the search ran about 4 times faster (579 ms).
If you need to sort by the deleted_at column, it should be included in the index. For example, let's get the first 100 deleted chats with id > 1000, sorted by deleted_at in descending order, so that recently deleted chats are displayed first.
DROP INDEX chats_name_idx;
CREATE INDEX chats_deleted_at_id_idx ON chats (deleted_at, id) WHERE deleted_at IS NOT NULL;
EXPLAIN ANALYZE SELECT * FROM chats WHERE id > 1000 AND deleted_at IS NOT NULL ORDER BY deleted_at DESC LIMIT 100;
Limit (cost=0.42..18.64 rows=100 width=45) (actual time=0.110..3.289 rows=100 loops=1)
-> Index Scan Backward using chats_deleted_at_id_idx on chats (cost=0.42..18616.35 rows=102138 width=45) (actual time=0.087..1.170 rows=100 loops=1)
Index Cond: (id > 1000)
Planning Time: 0.092 ms
Execution Time: 4.330 ms
The first 100 deleted chats that satisfy our condition were selected immediately. The search was completed in 4 ms.
Automatic deletion
Rows marked as deleted are best removed from the table over time so that they do not take up a lot of disk space. For example, you can give the user 30 days to restore deleted rows. After this period, the rows will be deleted from the table forever.
Most likely, the percentage of deleted rows in the table will be small, so you can use the DELETE operation. If you need to delete most of the rows, you can use the ways described here and here.
It is best to run the deletion process during non-working hours, for example, at night once a day (or on weekends once a week). Deleting a large number of rows can be broken into batches (for example, 10 thousand rows each).
DELETE FROM chats WHERE ctid = ANY(ARRAY(SELECT ctid FROM chats LIMIT 10000));
The ctid field is unique for each row in the table and indicates the physical location of the row (tuple) in PostgreSQL. Using ctid in the condition instead of id significantly speeds up the deletion.
-- Both tests use the following table with 1 million rows
CREATE TABLE messages (id serial primary key, body text);
INSERT INTO messages (body) SELECT md5(random()::text) FROM generate_series(1, 1000000);
-- The deletion takes 6.4 sec when using the id
DELETE FROM messages WHERE id = ANY(ARRAY(SELECT id FROM messages));
-- The deletion takes 2.9 sec when using the ctid (2 times faster)
DELETE FROM messages WHERE ctid = ANY(ARRAY(SELECT ctid FROM messages));
The number of rows deleted per iteration strongly depends on many parameters (for example, which indexes the table has, how many rows reference the deleted ones, etc.). Alternatively, you can first decide how long one iteration should take, and then empirically select the number of rows per iteration that fits this time. You can start with 10 thousand, gradually increasing this number (50 thousand, 100 thousand, 1 million).
If you want to delete a row referenced by millions of other rows, you can first delete the child rows in several iterations, and then delete the parent row.
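The batched cleanup described above can be sketched as a procedure (PostgreSQL 11+; the 30-day retention period and the batch size are assumptions, tune them as described):

```sql
CREATE PROCEDURE purge_deleted_chats(batch_size int DEFAULT 10000) AS $$
DECLARE
    affected int;
BEGIN
    LOOP
        -- Delete one batch of rows whose recovery window has expired
        DELETE FROM chats_all
        WHERE ctid = ANY(ARRAY(
            SELECT ctid FROM chats_all
            WHERE deleted_at < now() - interval '30 days'
            LIMIT batch_size));
        GET DIAGNOSTICS affected = ROW_COUNT;
        EXIT WHEN affected = 0;
        -- Commit between batches to release locks and let
        -- autovacuum reclaim the space
        COMMIT;
    END LOOP;
END;
$$ LANGUAGE plpgsql;

-- Run during non-working hours, e.g. from cron via psql:
-- CALL purge_deleted_chats();
```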
Cheat sheet
-- Creating a table and query views for active and deleted chats
CREATE TABLE chats_all (id serial primary key, name text, deleted_at timestamp);
CREATE VIEW chats AS SELECT * FROM chats_all WHERE deleted_at IS NULL;
CREATE VIEW chats_deleted AS SELECT * FROM chats_all WHERE deleted_at IS NOT NULL;
-- Creating a trigger
CREATE FUNCTION soft_delete() RETURNS trigger AS $$
DECLARE
fullTableName text := quote_ident(TG_TABLE_SCHEMA) || '.' || quote_ident(TG_TABLE_NAME);
BEGIN
EXECUTE 'UPDATE ' || fullTableName || ' SET deleted_at = now() WHERE id = $1' USING OLD.id;
RETURN OLD;
END;
$$ LANGUAGE plpgsql;
CREATE TRIGGER soft_delete
INSTEAD OF DELETE ON chats
FOR EACH ROW
EXECUTE FUNCTION soft_delete();
-- Soft deletion
DELETE FROM chats WHERE id = 1;
-- Hard deletion
DELETE FROM chats_deleted WHERE id = 1;
-- Uniqueness constraint
CREATE UNIQUE INDEX chats_all_name_key ON chats_all (name) WHERE deleted_at IS NULL;
-- Request a new name from the user if an error occurs during recovery
UPDATE chats SET name = 'another chat', deleted_at = NULL WHERE id = 1;
-- Use partial index (include deleted_at if sorting by this column is required)
CREATE INDEX chats_name_idx ON chats USING gin (name gin_trgm_ops) WHERE deleted_at IS NULL;
CREATE INDEX chats_deleted_at_id_idx ON chats (deleted_at, id) WHERE deleted_at IS NOT NULL;
-- Automatic deletion of rows (during non-working hours: at night, on weekends)
DELETE FROM chats WHERE ctid = ANY(ARRAY(SELECT ctid FROM chats LIMIT 10000));