Navigating SQL Migrations with Confidence: Introducing sql-compare

Delivering SQL schema change at scale.

Jul 16, 2024

As long as I can remember, SQL has been a cornerstone of my engineering journey. My early days at university were filled with monotonous Oracle-based SQL courses, which I found uninspiring. Knowing I would likely never use Oracle, I shifted my focus to MySQL. Over time, I discovered the limitations of MySQL and was introduced to PostgreSQL, thanks to Dimitri. I even organized a few meetups in Paris and encouraged Dimitri to publish "The Art of PostgreSQL," arguably the best book on SQL. Eventually, I embraced PostgreSQL wholeheartedly.

SQL databases are a timeless technology that continues to evolve. From Timescale to pgvector, new advancements are continually emerging. However, one persistent challenge has been managing database migrations. Modifying your data model is crucial for evolving your application, but it’s often a daunting task. At Mergify, like many companies, we’ve faced this challenge head-on.

We've tried various solutions, from custom Python scripts to using migra, an open-source project that is unfortunately no longer maintained. Each solution had its drawbacks, leading us to a crossroads where we had to decide on our next move.

The Initial Struggle

At Mergify, PostgreSQL is the backbone of our data handling, from managing the state of GitHub objects to maintaining our event log. From the beginning, we’ve interacted with the database exclusively using an ORM, choosing SQLAlchemy for its maturity, framework agnosticism, and support for asynchronous I/O since version 2.0.0.

Given our frequent production deployments, a robust CI/CD pipeline is essential to handle database evolution smoothly. Every schema modification must be rigorously tested and automatically applied to the production database, adhering to the principles outlined in Martin Fowler's "Evolutionary Database Design." Version-controlling each database artifact and scripting every change as a migration are critical steps in this process.

We chose Alembic to manage our database migrations. Maintained by the SQLAlchemy team, Alembic is a command-line tool that can automatically create migration scripts from your SQLAlchemy models. Each script is version-controlled alongside your source code. Alembic applies these migrations to the database, recording the revision number in the alembic_version table to ensure only new migrations are applied subsequently. This command is typically executed in the continuous delivery pipeline to keep the production database up-to-date.

A Naive Beginning

Our initial approach to testing migration scripts was straightforward: create two databases—one using SQLAlchemy models and the other using only the migration scripts—and ensure they have identical schemas. This involved:

Creating PostgreSQL servers using Docker: On a new server, create two empty databases.
Generating schemas: Use the first database to create artifacts with SQLAlchemy models, and use Alembic to run migration scripts on the second database.
Comparing schemas: Dump each database schema into SQL files using pg_dump and compare them using Python’s filecmp and difflib builtin libraries.

Here’s an example command to dump a database schema into an SQL file:

pg_dump \
    --dbname=postgresql://user:password@host:port/database \
    --schema-only \
    --exclude-table=alembic_version \
    --format=p \
    --encoding=UTF8 \
    --file /path/to/dump.sql

To compare the files:

assert filecmp.cmp(schema_dump_creation_path, schema_dump_migration_path, shallow=False)

If the test fails, use difflib to display the differences:

def filediff(path1: pathlib.Path, path2: pathlib.Path) -> str:
    with path1.open() as f1, path2.open() as f2:
        diff = difflib.unified_diff(
            f1.readlines(),
            f2.readlines(),
            path1.name,
            path2.name,
        )
        return "Database dump differences: \n" + "".join(diff)

While effective, this test had limitations, such as sensitivity to column order. PostgreSQL doesn’t easily allow changing column positions, necessitating consistent column order in models and production databases.

The Complexity Grows

As our models grew more complex, our naive test struggled to keep up. Consider the following example:

class Base:
    updated_at: orm.Mapped[datetime.datetime] = orm.mapped_column(
        sqlalchemy.DateTime(timezone=True),
        server_default=sqlalchemy.func.now(),
    )

class User(Base):
    id: orm.Mapped[int] = orm.mapped_column(
        sqlalchemy.BigInteger,
        primary_key=True,
    )

In this setup, the updated_at column is added to every child model, such as User. Adding a new column to User, like name, would misalign the order, causing schema mismatches.

To address this, we needed to compare schemas while ignoring column order. We explored various tools:

Alembic: Can compare schemas to generate migration scripts but misses some differences.
Migra: An unmaintained tool that compares database schemas effectively.
SQL dumps: The most reliable format but challenging to parse and compare directly.

Building the Solution: sql-compare

It was clear that our current solutions were insufficient. We needed a hero to rescue us from the perils of SQL migration management, so we developed sql-compare.

sql-compare is a Python library that uses sqlparse to parse SQL files and compare schemas, ignoring irrelevant differences like comments, whitespace, and column order. This new tool became an integral part of our workflow, catching migration issues that other tools might miss.

The main challenge was filtering and grouping tokens by column definition before sorting them. Despite these complexities, sql-compare emerged victorious, enabling us to ensure seamless migrations and maintain schema integrity.

The Journey Forward

We’ve open-sourced sql-compare to help others facing similar challenges. You can try it by running pip install sql-compare. We plan to enhance sql-compare, such as creating functions to retrieve all schema differences for better test results. If you have suggestions or want to contribute, feel free to submit issues or pull requests on our GitHub repository.

Conclusion

Managing database migrations is a complex but essential task for evolving applications. With sql-compare, we found our solution, ensuring seamless migrations, maintaining schema integrity, and continuing to deliver high-quality software. Our journey through the challenges of SQL migrations has taught us valuable lessons, and with sql-compare, we’re better equipped to face the future.

Damien

Jul 17

Hello, we are using `alembic check` for this purpose at my daily job. I would be very interested if you could elaborate/give an example on which differences are missed ? (if so, we might have unaligned models & DB schemas without knowing it) Thanks !

Expand full comment

1 reply by Julien Danjou

euri10

Jul 16

pgmodeler built-in diff is very good as well though gui-based

1 more comment...

Continuous Rambling

Discussion about this post