Meta Rebuilds Petabyte-Scale MySQL Ingestion With Reverse Shadowing and Continuous Checksums

Meta's data infrastructure team has finished migrating the ingestion platform that pulls several petabytes per day off its MySQL social graph into the analytics warehouse, replacing a sprawl of customer-owned pipelines with a single self-managed service. The team, led by engineer Zihao Tao, described the work in a May 12 Engineering at Meta post, and InfoQ summarized the architecture on May 30. Meta reports that 100% of the workload has moved over and the legacy system is fully deprecated.

The interesting part is not the destination. It is the rollout pattern. Meta ran tens of thousands of change data capture (CDC) jobs through a three-phase canary that included a reverse shadow stage, in which the new pipeline owns production while the old one keeps running as a live validator. Most teams skip that second phase and treat the cutover as a one-way door. Meta treated it as a hinge.

Inside the three-phase shadow rollout

Each job moved through three stages. In the shadow phase, the new pipeline consumed the same MySQL source as production but wrote to a separate shadow table in a pre-production environment. Engineers compared row counts and checksums against the live target, and also measured compute and storage so a flip would not starve the production cluster of capacity.

In the reverse shadow phase, the roles swapped. The new job wrote to the production target table, and the legacy job kept running but now wrote to the shadow table. That arrangement kept a live integrity signal flowing after the cutover and, just as important, kept a hot rollback path. If a divergence showed up, the team did not need to rebuild the old pipeline; it simply promoted the legacy job back to production. Only after that validation window passed quietly did the cleanup phase delete the legacy job.
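The role swap is easier to see as a small state machine. The sketch below is illustrative only and is not Meta's code: the `Phase` enum, `write_targets`, and `next_phase` are hypothetical names, and it assumes a failed validation in reverse shadow sends the job back to the shadow phase.

```python
from enum import Enum


class Phase(Enum):
    SHADOW = "shadow"
    REVERSE_SHADOW = "reverse_shadow"
    CLEANUP = "cleanup"


def write_targets(phase: Phase) -> dict:
    """Return which table each pipeline writes to in the given phase."""
    if phase is Phase.SHADOW:
        return {"legacy": "production", "new": "shadow"}
    if phase is Phase.REVERSE_SHADOW:
        # Roles swap: the legacy job keeps running as a live validator.
        return {"legacy": "shadow", "new": "production"}
    # Cleanup: the legacy job has been deleted, so only the new job writes.
    return {"new": "production"}


def next_phase(phase: Phase, validation_passed: bool) -> Phase:
    """Advance on clean validation; otherwise hold or swap roles back."""
    if phase is Phase.SHADOW:
        return Phase.REVERSE_SHADOW if validation_passed else Phase.SHADOW
    if phase is Phase.REVERSE_SHADOW:
        # Rollback is only a role swap because the legacy job never stopped.
        return Phase.CLEANUP if validation_passed else Phase.SHADOW
    return Phase.CLEANUP
```

The point of the sketch is that rollback from reverse shadow is a transition between two states that both already exist, not a rebuild.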

How Meta validates billions of rows without sampling

Validation is where the post gets specific. Meta did not sample. For each shadow partition the tooling read the matching production partition and compared row count plus checksum. Mismatches were logged to Scuba, Meta's real-time observability store. An hourly job then read those Scuba logs, ran queries to surface example mismatched rows, and wrote the debug context back. According to Tao, the team "continuously monitored row count and checksum mismatches between the production jobs and the shadow jobs" and only promoted a job once the gap closed in pre-production and verification passed.
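A per-partition comparison of this kind might look like the following sketch. Meta's post does not publish its checksum function, so the order-independent sum of truncated SHA-256 row hashes is an assumption, as are the names `partition_fingerprint` and `compare_partitions` and the shape of the mismatch record.

```python
import hashlib
from typing import Iterable, Optional, Tuple


def partition_fingerprint(rows: Iterable[tuple]) -> Tuple[int, int]:
    """Return (row_count, checksum) for a partition.

    The checksum is a sum of per-row hashes modulo 2**64, so it does not
    depend on row order (an assumed design, not Meta's documented one).
    """
    count = 0
    checksum = 0
    for row in rows:
        digest = hashlib.sha256(repr(row).encode("utf-8")).digest()
        checksum = (checksum + int.from_bytes(digest[:8], "big")) % (1 << 64)
        count += 1
    return count, checksum


def compare_partitions(partition_id: str,
                       prod_rows: Iterable[tuple],
                       shadow_rows: Iterable[tuple]) -> Optional[dict]:
    """Return None when the partitions match, else a mismatch record to log."""
    prod_count, prod_sum = partition_fingerprint(prod_rows)
    shadow_count, shadow_sum = partition_fingerprint(shadow_rows)
    if (prod_count, prod_sum) == (shadow_count, shadow_sum):
        return None
    return {
        "partition": partition_id,
        "prod_count": prod_count,
        "shadow_count": shadow_count,
        "count_match": prod_count == shadow_count,
        "checksum_match": prod_sum == shadow_sum,
    }
```

In this sketch a non-`None` result is what would be written to the mismatch log; finding the specific offending rows is left to a separate follow-up query, as in the hourly job described above.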

That tooling is still running after the migration finished, now as part of release validation. The lesson is that a checksum harness built for a one-off project becomes load-bearing the moment anyone trusts it, so it might as well be designed for permanent operation.

Stopping bad data from cascading through CDC

Both the legacy and the new pipeline are CDC systems, similar in shape to Debezium or a hand-rolled binlog reader. Each job carries an internal full dump table for the initial snapshot, an internal delta table for incremental changes, and the target table that analysts and ML pipelines query. A central management service tracks job metadata, schemas, and table names.

The risk in any CDC topology is that a bad partition propagates forward through every downstream merge. Meta's fix is metadata-driven. A delta partition marked bad halts new landings and fires an alert. A target partition marked bad causes the system to fall back to an older clean partition and reapply the additional deltas needed to catch up. Rollback then becomes a query over metadata for all bad-marked partitions, followed by targeted backfill, rather than a full rebuild.
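The fallback logic can be sketched as a planning function over partition metadata. Everything here is a simplified assumption rather than Meta's schema: partitions carry a hypothetical integer `seq`, target partition N is assumed to equal target N-1 plus delta N, and `LandingHalted` stands in for the halt-and-alert behavior.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Partition:
    seq: int           # hypothetical monotonically increasing sequence number
    bad: bool = False  # metadata flag set when the partition is marked bad


class LandingHalted(Exception):
    """Raised when a required delta partition is marked bad."""


def plan_rebuild(targets: List[Partition],
                 deltas: List[Partition],
                 upto: int) -> Tuple[int, List[int]]:
    """Plan how to build target partition `upto`.

    Returns (base_seq, delta_seqs): the newest clean target partition older
    than `upto`, plus the deltas to reapply on top of it.
    """
    clean = [t.seq for t in targets if not t.bad and t.seq < upto]
    if not clean:
        raise LookupError("no clean base partition; a full dump is needed")
    base = max(clean)
    needed = sorted((d for d in deltas if base < d.seq <= upto),
                    key=lambda d: d.seq)
    bad = [d.seq for d in needed if d.bad]
    if bad:
        # A bad delta must not be merged forward: stop and alert instead.
        raise LandingHalted(f"bad delta partitions: {bad}")
    return base, [d.seq for d in needed]


def bad_partitions(partitions: List[Partition]) -> List[int]:
    """The rollback query: every partition currently marked bad."""
    return [p.seq for p in partitions if p.bad]
```

With targets 1 and 2 clean and target 3 marked bad, planning partition 4 falls back to target 2 and reapplies deltas 3 and 4, which is the "older clean partition plus more deltas" behavior described above.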

The full dump bill that forced creative reuse

At petabyte scale, the initial full dump for a new job is slow and expensive, and any data quality issue that requires re-snapshotting doubles that cost. Meta's response was twofold. First, jobs with known unresolved issues were excluded from batches so they would not waste a full dump cycle on a problem the team already knew about. Second, and more interesting, the new pipeline reused snapshot partitions delivered by the legacy system as its own initial snapshots during the early migration stages. The old system was already paying for the dump, so the new system inherited it.

What this playbook means for any team running CDC under load

For the CTOs and platform leads we talk to, the takeaway is sharper than "Meta is big." If you operate Debezium, Fivetran, Airbyte, or a homegrown binlog reader feeding a warehouse, the reverse shadow phase is the cheapest insurance you are probably not buying. It costs roughly one extra copy of each job's compute for the duration of the validation window, and it converts a rollback from a recovery project into a config flip.

The second concrete move is continuous checksum and row count comparison rather than sampling. Sampling tells you the median pipeline is fine. It does not tell you the one table the finance team depends on has been off by 0.4% for six weeks. If you cannot afford full comparison on every table, run it on the tables whose owners would escalate to your CEO, and budget the Scuba-equivalent storage for the mismatch logs up front.

The third move is centralization. Meta replaced fragmented customer-owned pipelines with one warehouse service. Every mid-size data org we see hits the same wall around the point where five teams are each running their own CDC stack and nobody owns the SLA. Pulling those into a platform team is a political project as much as a technical one, but Meta's post is a useful internal exhibit for making that case.

The next signal to watch

The open question is whether Meta or one of the managed CDC vendors codifies the reverse shadow pattern into a product feature in the next two to three quarters. If Debezium 3.x ships a first-class "dual write validator" mode, or if Fivetran or Confluent surface a comparable rollout flow by their next major release, that is a strong signal the industry has accepted Meta's framing as the default. If the pattern stays a custom build that only hyperscalers attempt, expect the next round of public CDC failure postmortems to read exactly like the ones from the past three years.
