Migrations and declarative schema management

When you develop a project alone, questions about how to manage the database schema usually don't come up during development. If you need to add a new column to a table, you can just do it in your local database without any consequences. Maybe you haven't even thought about it, because your ORM synchronizes the database schema automatically (although it is recommended to disable this synchronization in production in order to avoid irrevocable deletion of data [read more below about renaming]).

Once you have made the first release, each change should be applied not only to your local database, but also to the databases in other environments (staging, production, etc). And if the database schema can be changed by at least 2 people, you have to orchestrate its synchronization between team members even during development.

To make changes to the database, you can use:

  1. Migrations (imperative schema management).
  2. Declarative schema management.

Migrations

The most popular way is to create a new file every time you want to make a series of changes to the database. Most of these changes add new columns, change enum fields, change default values, or add indexes. A script applies these migrations one by one in the order they were created. Locally you have to run this script manually, but in production (and other environments) this should be done automatically by CI/CD.

The list of migration files may look like this:

0001_initial.sql
0002_add_full_name_to_users.sql
0003_create_sales_table.sql

The first one defines the initial database schema, the others make changes sequentially. For example, the initial migration creates the users table:

-- 0001_initial.sql
CREATE TABLE users (
  id serial PRIMARY KEY,
  first_name text NOT NULL,
  last_name text NOT NULL
);

The second migration adds a new column full_name:

-- 0002_add_full_name_to_users.sql
ALTER TABLE users ADD COLUMN full_name text NOT NULL
  GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;

Serial number vs timestamp

As a prefix in the file name, both the serial number (0001, 0002, etc) and the timestamp (20230302102511 [<yyyy><MM><dd><HH><mm><ss>]) can be used. Each of these has advantages and disadvantages.

When using a timestamp, 2 developers can create a new migration file at the same time: each file will have a unique prefix unless they were created in the same second, which is extremely unlikely. On the contrary, when using a serial number, both new migration files will get the same prefix (N+1).

However, in some cases, there may be a problem when using a timestamp. For example, let's imagine that John created a migration that renames the old_name column to new_name, and Mia created a migration a few seconds later, in which the type of the old_name column changes. John will push his changes to the remote repository and CI/CD will rename the old_name column to new_name in production. After that, Mia will pull John's changes from the remote repository, merge them with her branch without conflicts (after all, John's migration file has a different name, because a timestamp is used) and then push all the changes into the remote repository without any problems. The error will occur only at the moment when CI/CD tries to apply Mia's migration against the production database (the old_name column does not exist anymore).

If in this situation you use serial numbers in the names of the migration files, then when Mia pulls John's changes from the remote repository, she will get a conflict (a file with that name already exists). In this case, she will see that the column has already been renamed, create a new migration file with the next serial number and modify her SQL query.

In this example, when using a timestamp, Mia discovered the problem only at the stage of applying her migration to the production database. But what happens if there is no error at all? Let's say John transformed the data in a certain column and Mia did the same. As a result, the data in this column will be transformed twice and no one will see any conflict or error.

To avoid such problems, either use serial numbers or review the code very carefully.

Run migrations inside transactions

If one of the SQL queries throws an error during a migration, then all the queries executed before it must be rolled back. This can be achieved by performing migrations within transactions.

Thus, either the migration will be performed completely, or it will not be performed at all.
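
For example, a migration wrapped in a transaction might look like this (a minimal sketch for PostgreSQL, where DDL is transactional; some databases, such as MySQL, commit DDL statements implicitly and cannot roll them back):

BEGIN;

ALTER TABLE users ADD COLUMN full_name text;
UPDATE users SET full_name = first_name || ' ' || last_name;

-- If any statement above fails, the transaction is rolled back and the
-- database stays in the state it was in before the migration.
COMMIT;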

Downgrade migrations

Usually, migration tools allow you to define 2 scripts for each change to the database: up and down. The first one applies the changes, the second one rolls them back.

-- 0002_add_full_name_to_users.up.sql
ALTER TABLE users ADD COLUMN full_name text NOT NULL
  GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED;

-- 0002_add_full_name_to_users.down.sql
ALTER TABLE users DROP COLUMN full_name;

Many tools support downgrade migrations. Theoretically, it may be necessary to undo changes to the database (roll back one or more recent migrations) if something has been broken. Practically, such migrations are usually not used because:

  • Some migrations cannot be rolled back. For example, try to revert a migration that drops a NOT NULL column containing data.
  • In most cases, you test migrations extremely carefully, double-check everything several times, apply them and never roll back. To be honest, I have not yet heard of a person who has applied a downgrade migration at least once.
  • In the rare case when you need to revert a migration, you can create another forward migration.
  • Down migrations are very likely to contain mistakes, because they are not checked as thoroughly as up migrations.

Thus, most likely, writing downgrade migrations will be a waste of your time. It is better to test up migrations more thoroughly.

Language

A migration file can be written not only in pure SQL, but also using a query builder in any language; it depends on the migration tool used. For example, Knex has a built-in tool that reads migration files written in JavaScript/TypeScript that contain 2 exported functions: up and down.

// 20230302102511_add_full_name_to_users.ts
import { Knex } from 'knex';

// Adds the full_name column to the users table.
export const up = async (knex: Knex) =>
  knex.schema.alterTable('users', (table) => {
    table.text('full_name');
  });

// Rolls the change back by dropping the column.
export const down = async (knex: Knex) =>
  knex.schema.alterTable('users', (table) => {
    table.dropColumn('full_name');
  });

But it is better to write migrations in SQL, because:

  • There will be no unpredictable SQL queries generated by query builders (e.g. Knex).
  • SQL is much easier and faster to read when you know it (and you should know it if you work with a database).
  • When you or a member of your team develops a new project in another language (e.g. Python), you will either have to add support for migrations in this language and prepare/maintain separate CI/CD scripts for it, or write migrations for this project in a different language (what if the developer who is creating the new project in Python doesn't know the TypeScript that is used for writing migrations?).

Zero downtime migrations

Some migrations are performed in a few seconds, some may take many hours. If migrations are performed before the application starts, it will not be available until the migration is completed. In addition, if an error occurs during the migration, your project will not start. Thus, migrations must be performed separately.

To make zero downtime migration you should:

  1. Apply the first migration (aka regular migration) that will be compatible with the current version of the code. For example, add a new full_name column, but don't delete the first_name and last_name columns yet. The current version of the code will continue to use first_name and last_name.
  2. Deploy a new version of the code that uses the full_name column instead of first_name and last_name.
  3. Apply the second migration (aka post-deployment migration), which will remove the first_name and last_name columns.
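
For example, the regular and post-deployment migrations from steps 1 and 3 might look like this (a sketch; note that the generated full_name column from the earlier example would not work here, because a generated column cannot outlive the columns it depends on, so a plain column with a backfill is used instead):

-- regular migration (applied before the deployment)
ALTER TABLE users ADD COLUMN full_name text;
UPDATE users SET full_name = first_name || ' ' || last_name;

-- post-deployment migration (applied after the new code is deployed)
ALTER TABLE users DROP COLUMN first_name;
ALTER TABLE users DROP COLUMN last_name;
ALTER TABLE users ALTER COLUMN full_name SET NOT NULL;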

This process may get a little more complicated, for example, if you want to use a table with a different structure. To do this, you need:

  1. Apply a migration that creates a new table.
  2. Deploy the code that writes to the new table and reads from both (if the rows are not in the old table, try to find them in the new one). Before reading from the old table, check that it still exists (this check will be needed once the old table has been deleted).
  3. Apply a migration that moves the data from the old table to the new one.
  4. Apply a migration that deletes the old table. This step can be merged with the previous one, but you will most likely want to check that all the data has been successfully transferred to the new table before deleting the old one.
  5. Deploy the code that reads only from the new table.
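
Steps 3 and 4 might look like this (a sketch, assuming the old users table is replaced by a hypothetical accounts table whose id column has a unique constraint):

-- step 3: move the data from the old users table to the hypothetical new accounts table
INSERT INTO accounts (id, full_name)
SELECT id, first_name || ' ' || last_name
FROM users
ON CONFLICT (id) DO NOTHING; -- skip rows already written to the new table by the new code

-- step 4: delete the old table after checking that all the data has been transferred
DROP TABLE users;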

Disadvantages of migrations

1. Huge number of files.

After a while, you will have quite a lot of files with migrations (0001_create_users_table.sql, 0002_add_full_name_to_users.sql, ..., 5372_create_sales_table.sql). This will cause the deployment of the local database to take much longer.

If, when testing a project, you recreate the database before running each test or create a separate database for each thread (if multithreaded testing is used), then when running each test, you will additionally wait until all migrations are applied to the database.

This problem can be solved by collapsing migrations into a single file from time to time, making a dump of the current version of the schema (e.g. using pg_dump -s in PostgreSQL). This can be done whenever the number of migrations exceeds a certain value.

2. No full schema.

To look at the current version of the database schema, you must deploy the database, apply all migrations and create a dump of the current schema, saving it into a file.

You can't just go into your git-repository and open the file with the full schema to see which columns are in a particular table or which value is used by default in a specific column.

This problem can be solved by making a dump of the current schema after each application of new migrations. As a result, you will be able to:

  • View the entire database schema.
  • View schema changes using git diff.
  • Quickly find out who made this or that change, and find out when it was done and why, using git blame.

3. Manual changes -> unexpected results.

Let's look at the following example. Suppose you made some changes to the schema of your local database manually (e.g. added a new column) and forgot about it after some time. Now you decide to create a new migration that touches these changes (e.g. deletes this column). The migration is successfully applied to the local database, but something unexpected may happen in production. The same can happen if you manually change the database schema not locally, but in production (e.g. to make an urgent hotfix).

This problem can be solved by saving a hash of the schema after each applied migration and checking it before applying a new one. If the current schema does not match the schema that existed at the time of the last applied migration, an error should occur.
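
For example, in PostgreSQL a rough fingerprint of the schema can be computed from the system catalogs (a sketch; real tools usually hash a full schema dump instead):

-- Compute a hash of all columns in the public schema. Store it after each
-- migration and compare it with the stored value before applying the next one.
SELECT md5(string_agg(table_name || '.' || column_name || ':' || data_type, ','
                      ORDER BY table_name, column_name)) AS schema_hash
FROM information_schema.columns
WHERE table_schema = 'public';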

The most popular migration tools: flyway, liquibase, sqitch, dbmate, nomad.

Declarative schema management

The declarative approach allows you to store only the desired state of the database schema (the latest schema dump). A script figures out the differences between the current schema and the desired one, generates the appropriate SQL queries that make the changes, and applies them.

It can be done by diff tools like skeema (MySQL, MariaDB), migra (PostgreSQL), atlas (PostgreSQL, MySQL, MariaDB, SQLite), sqldef (PostgreSQL, MySQL, SQLite, SQL Server), etc. This approach has been used by Facebook for more than 10 years, as well as by other companies (e.g. SendGrid).

You can just work with your schema file: add new columns, change default values, etc. The script will automatically make the changes to the database to reach this state.
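
For example, suppose the production database still contains the users table from the beginning of this article, and you add an email column to the schema file (the exact SQL depends on the diff tool used):

-- schema.sql (the desired state)
CREATE TABLE users (
  id serial PRIMARY KEY,
  first_name text NOT NULL,
  last_name text NOT NULL,
  email text
);

-- The diff tool compares this file with the database and generates something like:
ALTER TABLE users ADD COLUMN email text;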

This approach eliminates all the disadvantages that I mentioned earlier:

  1. There is only one file with the entire schema (or several – one for each table for convenience). Thus, you can deploy a new database in milliseconds (for development or tests).
  2. Since the whole schema is stored in one file, you can see which columns are in a particular table or what the default value of a specific column is. Using git diff, you can view the changes made to the schema, and using git blame, you can find out who made these changes, when and why.
  3. You can make changes to a database manually. For example, if you add a new column to the schema file, but this column already exists in the database in production, no changes will occur.

Renaming

At the moment, declarative schema management tools are not able to understand what was done: whether a column was renamed, or deleted and a new one created. If you rename a column, DROP and CREATE statements will be generated and the data will be lost. So, if you want to rename something with one command, use migrations.
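
For example, if you rename full_name to display_name in the schema file (a hypothetical new name), a diff tool will typically generate something like this, and the data stored in full_name is lost:

ALTER TABLE users DROP COLUMN full_name;         -- the old column and its data are dropped
ALTER TABLE users ADD COLUMN display_name text;  -- the "renamed" column is created empty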

We discussed earlier that in order to avoid downtime during migrations, you should separate the execution of migrations from the deployment of the project. Otherwise, your project may be unavailable for several hours until all the necessary migrations are applied.

For example, if you want to rename a column using migrations without project downtime, you need to:

  1. Apply a migration that creates a new column.
  2. Deploy the code that writes to a new column and reads from both.
  3. Apply a migration that moves the data from the old column to a new one.
  4. Apply a migration that deletes the old column.
  5. Deploy the code that reads only from the new column.

This workflow can be implemented not only with migrations, but also with a declarative schema management tool, because there is no renaming involved: only adding and deleting.

No changes without your approval

When you modify the file with your schema, you push it to the remote git repository. The CI/CD script determines the differences between the database schema and the schema in the file using a diff tool, and then creates a new pull request with the attached SQL queries that will be applied after your confirmation. If everything is okay, you approve it and CI/CD applies these changes to the database. Any DROP statement should require additional confirmation. This helps to avoid data loss.

Thus, all changes in the production database will be under your control.

Data migrations

Used abbreviations:

  • DDL (Data Definition Language) is used for schema management (CREATE, ALTER, DROP).
  • DML (Data Manipulation Language) is used for data management (SELECT, INSERT, DELETE, UPDATE).

Declarative schema management tools generate only DDL. If you want to run DML statements, use migrations. Make sure your CI/CD script runs everything in the right order: DDL -> DML -> deploy the code.

Sometimes it is necessary to preserve the execution order of DDL and DML. For example, commit_1 creates a new column (DDL), commit_2 fills in the data in this column (DML), commit_3 drops this column (DDL). If the DDL statements are executed by the declarative schema management tool and the DML statements by migrations, the DML queries will throw an error because the column will not exist. To preserve the order of these changes, use migrations for both DDL and DML.
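
A sketch of this example with hypothetical file and column names: if all three changes are kept as migrations, they are applied strictly in this order, whereas a declarative tool would see no net change to the column (it is added and then dropped), so the UPDATE would fail.

-- 0005_add_discount_to_sales.sql (commit_1, DDL)
ALTER TABLE sales ADD COLUMN discount numeric;

-- 0006_backfill_discount.sql (commit_2, DML)
UPDATE sales SET discount = 0 WHERE discount IS NULL;

-- 0007_drop_discount_from_sales.sql (commit_3, DDL)
ALTER TABLE sales DROP COLUMN discount;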

Workflow

Development environment

During development, you are playing with the database schema, so it is convenient to apply changes automatically without any confirmation (except for DROP statements, for example, if you rename a column).

Start syncing your schema before starting the project and do it whenever you change the schema file.

Production environment

Use the declarative approach for schema changes, and migrations for renaming things and manipulating data.

Any change to the database schema in production must be checked manually. The whole process may look as follows.

Check whether the schema has been changed. You need to do this every time, not just when the schema.sql file has been changed, because the production database could have been changed manually.

If the schema hasn't been changed, then the CI/CD pipeline should build the project, test and deploy it.

If the schema has been changed (at this moment, the code must support both the current and new schema), then:

  1. Get the schema changes using a declarative schema management tool. For example, using dsm, execute the diff command. As a result, you'll have a list of SQL queries which, once executed, will make the schema of your production database the same as the one specified in the local file (e.g. schema.sql or schema/*.sql).
  2. Create a new branch (e.g. change-schema-<timestamp>), add a file with the SQL queries that will transform the schema (e.g. pending.sql), make a commit, push it to the remote repository and create a merge request. GitLab CI allows you to create a merge request automatically; to do it, you need to specify extra push options: merge_request.create, merge_request.target, merge_request.title. See more here.
  3. Each time the code is changed during the merge request, run a pipeline that checks pending.sql and tests the code against both the schema currently used in production and the new one. If everything is fine, you can approve the merge request.
  4. When the merge request is merged, check pending.sql and test the code against the current and the new schema again. You should do it because it is possible that the merge request was approved even though the pipeline in the previous step failed.
  5. Deploy the project and run all the SQL queries from pending.sql.
  6. Remove the pending.sql file, make a commit and push it to the remote repository.
  7. Remove support for the old schema. This will start the pipeline described above (the one for the case when the schema hasn't been changed).

Diagram: CI/CD pipeline.
