MySQL Time-Travel with near-ZERO overhead: How bintrail skips the write load

After the Time Travel SQL for MySQL, finally post, a reader asked four questions that I want to answer in depth, because they’re the ones every DBA running multi-TB workloads ends up asking:

Where is history stored: same tablespace, or separate?
Is there automatic retention/TTL, or does history grow unbounded like MariaDB?
Can you monitor history growth independently of the main table?
How does this interact with partitioning?

Short version: bintrail’s history is in a separate database, with hourly partitions, opt-in TTL via a daemon, and its own observability surface. None of those things share the source table’s tablespace, lifecycle, or partitioning. That’s a deliberate choice and it’s the main reason it scales differently than MariaDB’s WITH SYSTEM VERSIONING.

Let’s go through each.

1. Where history is stored

bintrail uses two MySQL instances (or two DSNs against the same instance, if you really want; but in production they should be separate):

--source-dsn: your production MySQL. bintrail never writes to it. It only reads the binlog stream.
--index-dsn: a separate MySQL where bintrail keeps the indexed event history in a table called binlog_events.

When the shim serves a query against a virtual schema like _flashback.users AS OF '...', it reads from binlog_events on the index DSN. The source database is untouched. You can drop the index entirely and your production tablespace is unaffected.

For long-term retention, aged hourly partitions are archived to Parquet (local disk or S3) and dropped from MySQL. The shim still serves queries against archived ranges transparently, by reading the Parquet files via DuckDB and merging the results with the live MySQL partitions.

So the storage tiering looks like this:

hot      │ MySQL binlog_events (recent hours/days)
warm     │ Parquet local (older but on-disk)
cold     │ Parquet on S3
source   │ untouched — your production tablespace

For a multi-TB source table, this matters: history I/O does not contend with your OLTP buffer pool, your replication threads, or your backup windows. Run the index DSN on a different host, a different EBS volume, a different instance class. It’s just another MySQL.

2. Retention and TTL

This is where bintrail diverges hardest from MariaDB.

MariaDB’s WITH SYSTEM VERSIONING stores history inside the same physical table as current rows, and there is no automatic expiration. The MariaDB KB is explicit about this: “Because it stores all the history, a system-versioned table might grow very large over time. There are many options to trim down the space and remove the old history.” Those options: DELETE HISTORY FROM t BEFORE SYSTEM_TIME '...' or ALTER TABLE … DROP PARTITION , are operator-driven; nothing prunes history on its own.

You can use PARTITION BY SYSTEM_TIME to split history off into its own partitions (INTERVAL 1 WEEK, LIMIT 100000, etc.). MariaDB 10.9.1+ added the AUTO keyword to auto-create new history partitions as needed, but even with AUTO, dropping old partitions is still manual. The lifecycle is “we’ll give you the rotation primitives; you wire up the retention policy.”

In practice this is the failure mode I’ve seen most often in the wild: history quietly doubles, triples, then 10x’s the table size and nobody notices until optimize bites.

bintrail does it the other way: history is a stream of partitions you actively manage, and there’s a built-in command to manage them:

# One-shot: drop partitions older than 7 days, archive to S3 first
bintrail rotate \
  --index-dsn "$IDX" \
  --retain 7d \
  --archive-dir /var/lib/bintrail/archive \
  --bintrail-id "$BINTRAIL_ID" \
  --archive-s3 s3://my-bucket/bintrail/

If you want this to happen continuously without involving cron, run it as a daemon:

# Continuous: rotate every hour
bintrail rotate \
  --index-dsn "$IDX" \
  --retain 7d \
  --archive-dir /var/lib/bintrail/archive \
  --bintrail-id "$BINTRAIL_ID" \
  --archive-s3 s3://my-bucket/bintrail/ \
  --daemon --interval 1h

Under that mode, every hour bintrail:

Adds future hourly partitions ahead of now() so writes never hit the p_future catch-all (cheap if they already exist).
Walks partitions older than --retain, writes each one to a Parquet file, uploads to S3 if configured, and only then drops it.
Updates archive_state so future queries know where the dropped data lives.

Safety property: If any archive write fails, no partitions are dropped. The command aborts before touching the table, leaves what’s archived in place, and the next run with --retry resumes (Parquet files already on disk are skipped; S3 uploads recorded as completed in archive_state are skipped). You can’t accidentally drop a partition that didn’t make it to long-term storage. This is the engineering invariant that makes “automated rotation” something you can actually run on a production schedule.

Does history grow unbounded if you don’t run rotate? Yes. Without rotate, binlog_events keeps accumulating hourly partitions forever. The design choice is explicit lifecycle, not implicit GC. The flip side is that you can keep 30 days hot in MySQL and 5 years cold in Parquet at S3 prices, and the shim will transparently serve both.

So the comparison isn’t really “automatic vs manual”, it’s:

MariaDB: history lifecycle is implicit, lives in your hot tablespace, and you discover its cost when it’s already a problem.
bintrail: history lifecycle is explicit, lives outside your hot path, and tiers to cheap storage when you tell it to.

3. Monitoring history growth separately

Because binlog_events is a regular MySQL table on its own DSN, you can use every MySQL tool you already have: information_schema.PARTITIONS, SHOW TABLE STATUS, your existing Datadog/Grafana MySQL integration, etc. Nothing about it is special, it’s just rows.

bintrail also ships a status command with four sections:

bintrail status --index-dsn "$IDX"

Indexed Files: every file or stream chunk processed, with row counts and the bintrail_id of the server that indexed it (multiple bintrail instances can share one index database).
Partitions: every hourly partition with its boundary and an estimated row count from information_schema.PARTITIONS.TABLE_ROWS (InnoDB estimate, close enough for capacity planning, not exact).
Summary: files grouped by bintrail_id, with totals.
Archives: total archived partitions, total bytes, S3 upload count. Read from archive_state, which also tracks file_size_bytes and row_count per archived partition.

--format json works on status if you want to pipe into something else.

What’s not turnkey yet, honestly: there are Prometheus metrics for the stream indexer (events received, events indexed, batch sizes, replication lag, errors-by-type), but there isn’t a per-partition size gauge exposed natively. You can build a Grafana panel showing “MB per hour partition over the last 7 days” from information_schema.PARTITIONS and archive_state directly (both are regular SQL tables) but it’s not auto-emitted. That’s on the list.

4. Interaction with source-table partitioning

The reader’s question was specifically about how Time Travel SQL interacts with the partitioning of the source table: your orders, users, whatever. The crisp answer: it doesn’t, and that’s the whole architectural difference.

bintrail reads the binlog. The binlog is a logical row-event stream: every row insert/update/delete, with before and after images, regardless of how the source table is physically laid out. Hash-partitioned by user_id? Range-partitioned by created_at? Not partitioned at all? Sub-partitioned by region? bintrail doesn’t care, doesn’t see it, doesn’t constrain it. Your source partitioning strategy, partition pruning, ALTER TABLE … REORGANIZE PARTITION behavior, and .ibd file layout are all unaffected by the existence of bintrail.

Contrast that with MariaDB system-versioned tables. The MariaDB KB recommends PARTITION BY SYSTEM_TIME as the way to keep history-row reads from interfering with current-row reads — but PARTITION BY SYSTEM_TIME is the table’s partitioning. If orders was already PARTITION BY RANGE (YEAR(created_at)) for archival reasons, enabling history-partitioning replaces that scheme — you can’t have both PARTITION BY RANGE (created_at) and PARTITION BY SYSTEM_TIME on the same table, because a CREATE TABLE statement takes one top-level PARTITION BY clause. Sub-partitioning helps in narrow cases but doesn’t change the top-level constraint.

With bintrail, history lives in a separate database with its own partitioning, so the tradeoff doesn’t exist. Source keeps whatever partitioning it had; the index DB has its own, independent layout.

How the history is partitioned

binlog_events is range-partitioned by hour:

PARTITION BY RANGE (TO_SECONDS(event_timestamp)) (
  PARTITION p_2026052300 VALUES LESS THAN (...),
  PARTITION p_2026052301 VALUES LESS THAN (...),
  ...
  PARTITION p_future     VALUES LESS THAN MAXVALUE
)

One hourly partition per hour, plus a p_future catch-all that should never actually receive events in steady state (you keep it ahead of now() via --add-future).

A few practical consequences:

Drop a single hour of history without touching the source or any other partition. ALTER TABLE binlog_events DROP PARTITION p_2026020114 is a DDL metadata operation “any data that was stored in the dropped partitions … is discarded” not a row-by-row DELETE or a redo-log-per-row delete. The performance gap between DROP and DELETE on a partition with millions of rows is the whole reason bintrail uses range partitioning here.
Archive layout mirrors partitions, in Hive format, so other engines can read it directly:
```
s3://bucket/bintrail/bintrail_id=<uuid>/event_date=2026-02-01/event_hour=14/events.parquet
```
Athena (partition_projection), DuckDB (hive_partitioning), and Spark all read this layout natively. A year of historical change data is queryable from any of them without bintrail in the loop.
TO_SECONDS not UNIX_TIMESTAMP in the partition expression. The MySQL 8.0 manual lists both as allowed, but UNIX_TIMESTAMP() is only valid against TIMESTAMP columns. Against a DATETIME column the conversion depends on session timezone, which MySQL rejects with Error 1486 (ER_PARTITION_FUNCTION_IS_NOT_ALLOWED), the “timezone-dependent expression” trap. TO_SECONDS() is a pure calendar function (seconds since year zero, no timezone involved), so it works for DATETIME regardless of time_zone.

bintrail vs MariaDB system-versioned tables, side by side

Sources for the MariaDB column: System-Versioned Tables and Partitioning system-versioned tables.

Dimension	MariaDB `WITH SYSTEM VERSIONING`	bintrail
Storage location	Same physical table; history rows accessible via `ROW_START`/`ROW_END` pseudo-columns	Separate database (`--index-dsn`)
Impact on source tablespace	Grows the source table	Zero
Impact on source latency	UPDATE retains the old row and writes the new one; DELETE marks the row historical instead of removing it	Zero (reads the binlog asynchronously)
Retention	Manual: `DELETE HISTORY` or `ALTER TABLE … DROP PARTITION`	`bintrail rotate --retain` (one-shot, cron, systemd timer, or `--daemon`)
Atomicity of retention	`DELETE HISTORY` is just SQL	Archive must succeed before partition is dropped
Cold storage tier	None (history shares the source’s tablespace)	Parquet on local disk or S3
Monitoring	Standard table tools	`bintrail status` + standard tools + `archive_state` SQL
Time-travel query syntax	`SELECT * FROM t FOR SYSTEM_TIME AS OF TIMESTAMP'...'`	`SELECT * FROM _flashback.t AS OF '...'`
DDL changes	Restricted (the MariaDB KB notes “PARTITION BY SYSTEM_TIME is not supported when using transaction-precise system versioning” and various ALTER caveats)	`bintrail snapshot` captures schema versions; history is decoupled from source DDL
Works on MySQL?	No, MariaDB only	Yes: this is the whole point

Closing

The pattern bintrail follows is the same one data warehouses figured out a decade ago: history is a stream, not a column. Stream it out of the hot path, partition it by time, tier it to cheap storage, and serve queries by reading the right partitions. MariaDB’s in-table approach is convenient for small workloads and gets expensive fast for multi-TB ones. Externalizing the stream is what makes Time Travel SQL viable at the size where you actually need it.

If you want to try this on a real workload, the Quick start takes about ten minutes against a Docker MySQL. The full Time Travel SQL walkthrough (ProxySQL + shim under systemd) is in docs/time-travel-sql.md, and the rotation/archive internals are in docs/rotation-and-status.md. If you’re running multi-TB production MySQL and want to talk through the deployment model, grab 15 minutes on my calendar.

Running MariaDB?

A few people have already asked the natural follow-up: “OK but what if my source is MariaDB?” Today bintrail doesn’t support MariaDB binlogs: the parser, the GTID handling, and the replication setup are MySQL-specific. We’re scoping what it would take. If you’re running MariaDB in production and would actually use time-travel SQL on it, tell us: version, topology (replication / Galera / etc.), and rough event volume. That’s the input we need to decide whether to prioritize it this quarter or next. Real conversations beat the survey-of-one in my head.