Deterministic vs statistical data quality


"Data quality" covers two genuinely different jobs that are easy to conflate. One estimates whether data looks right. The other proves whether it is right. Knowing which you need — and when — saves you from buying the wrong tool for the moment that matters.

The statistical approach: is this normal?

Statistical data quality, the world of data observability and anomaly detection, learns what your data usually looks like and alerts when reality deviates. It watches freshness, row volume, null rates, and column distributions, builds a baseline from history, and flags the outliers. Modern tools often do this with machine learning, which lets them cover hundreds of tables that no one could monitor by hand.

This is real, useful work. It catches the unknown-unknowns: the pipeline that silently stopped, the upstream schema change nobody announced, the sudden spike that signals a bug. Its defining property is that the output is probabilistic. The tool is saying, "based on the past, this looks unusual — you should investigate." That's a well-informed suspicion, not a verdict.
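
As a rough illustration of "based on the past, this looks unusual", here is a minimal sketch of a baseline check on daily row counts. It uses a plain z-score in place of the machine-learning models that observability tools use. The function name, the sample counts, and the threshold of 3 standard deviations are all assumptions made for the example.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Flag `today` if it sits more than `threshold` standard deviations
    away from the mean of `history` (a simple z-score test)."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return today != baseline
    z_score = abs(today - baseline) / spread
    return z_score > threshold


# Hypothetical row counts for the last seven loads of one table.
daily_row_counts = [10_120, 9_980, 10_050, 10_210, 9_940, 10_075, 10_130]

print(is_anomalous(daily_row_counts, 10_090))  # an ordinary day
print(is_anomalous(daily_row_counts, 1_200))   # the load mostly stopped
```

The result depends on the history and the threshold you choose. Change either and the same value can move from "normal" to "unusual", which is what makes the output a suspicion and not a verdict.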

The deterministic approach: does this match?

Deterministic data quality — reconciliation — asks a narrower, harder question: does dataset A match dataset B, exactly? It doesn't learn a baseline or estimate anything. It reads the actual values from both sides, aligns them on a key, and compares them cell by cell. The output isn't a score; it's the specific rows and columns that differ.

Because there's no model, the result is reproducible. Run the same comparison on the same data twice and you get identical output. That property — the same inputs always yielding the same answer — is what makes it evidence rather than opinion.
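
The key-aligned, cell-by-cell comparison described above can be sketched in a few lines. This is an illustrative toy, not how DataRecs or any particular tool is implemented. The `reconcile` function, the row layout, and the sample data are assumptions, and the sketch expects each key to appear at most once per side.

```python
from decimal import Decimal

MISSING = "<missing>"


def reconcile(source, target, key):
    """Align two lists of row dicts on `key` and return every difference
    as a (key, column, source_value, target_value) tuple."""
    src = {row[key]: row for row in source}
    tgt = {row[key]: row for row in target}
    diffs = []
    # Sorting keys and columns keeps the output order stable between runs.
    for k in sorted(src.keys() | tgt.keys()):
        left, right = src.get(k), tgt.get(k)
        if left is None or right is None:
            diffs.append((
                k,
                "<row>",
                MISSING if left is None else "present",
                MISSING if right is None else "present",
            ))
            continue
        for col in sorted(left.keys() | right.keys()):
            a, b = left.get(col, MISSING), right.get(col, MISSING)
            if a != b:
                diffs.append((k, col, a, b))
    return diffs


# Hypothetical rows from an old system and its migrated copy.
legacy = [
    {"id": 1, "amount": Decimal("100.00"), "status": "paid"},
    {"id": 2, "amount": Decimal("250.10"), "status": "open"},
    {"id": 3, "amount": Decimal("75.00"), "status": "paid"},
]
migrated = [
    {"id": 1, "amount": Decimal("100.00"), "status": "paid"},
    {"id": 2, "amount": Decimal("250.01"), "status": "open"},
    {"id": 4, "amount": Decimal("12.00"), "status": "open"},
]

diffs = reconcile(legacy, migrated, "id")
for diff in diffs:
    print(diff)
```

The output is a list of specific rows and columns, and running `reconcile` again on the same inputs returns the same list. Amounts are held as `Decimal` so the equality test is not affected by floating-point representation.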

Statistical quality tells you a table looks wrong. Deterministic quality tells you which values are wrong. Both are valuable; only one is proof.

Why the distinction matters

Most of the time, a probability is exactly what you want. You can't hand-check every table in a warehouse, so a tool that surfaces the suspicious ones is a force multiplier. But some moments demand certainty, and in those moments a confidence score is not enough:

A migration cutover. Before you switch the old system off, you need to prove the new one holds the same data — not that it "probably" does.
A financial control. Showing that a ledger and its sub-ledger agree is a yes/no fact, and an auditor will want the exact reconciling items, not a model's output.
A regulated report. When a regulator asks you to demonstrate the numbers are correct, "our anomaly detector didn't fire" is not a demonstration.

The failure mode of the statistical approach in these moments is specific and quiet: a value can drift by an amount that is real but statistically unremarkable, and slide under the anomaly threshold. A few rows with a rounding error, or a subset of records attached to the wrong key, will not look unusual in aggregate. A deterministic comparison catches all of it, because it isn't looking for unusual; it's checking for equal.
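
To make that failure mode concrete, here is a small sketch with made-up numbers. One order amount has two digits transposed in the replica. The daily total moves by 450 cents, well inside normal day-to-day variation, so a z-score check on the total stays quiet while a value-by-value comparison reports the exact row. The data, the variable names, and the threshold of 3 are assumptions for illustration.

```python
from statistics import mean, stdev

# Hypothetical history of the daily total, in cents, used as the baseline.
history_totals = [1_002_300, 998_700, 1_001_150, 999_400, 1_000_900]

# Today's amounts in the source system and in its replica, in cents.
source = {"ord-1": 250_000, "ord-2": 310_500, "ord-3": 140_000, "ord-4": 300_000}
replica = {"ord-1": 250_000, "ord-2": 310_050, "ord-3": 140_000, "ord-4": 300_000}

# Statistical check: is the replica's total unusual compared with history?
replica_total = sum(replica.values())
z_score = abs(replica_total - mean(history_totals)) / stdev(history_totals)
flagged = z_score > 3.0

# Deterministic check: does every value match, key by key?
mismatches = {
    k: (source.get(k), replica.get(k))
    for k in sorted(source.keys() | replica.keys())
    if source.get(k) != replica.get(k)
}

print(flagged)     # False: the total looks like any other day
print(mismatches)  # {'ord-2': (310500, 310050)}
```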

They're complementary, not rivals

The mature answer is to use both, for the jobs each is good at. Let observability run broad and continuous across the warehouse, catching surprises you didn't know to look for. Reach for reconciliation at the high-stakes checkpoints, where you have two specific datasets that must agree and you need to be able to prove it.

A good rule of thumb: if the question is "is anything off across all my data?", that's a statistical job. If the question is "do these two systems agree, exactly?", that's a deterministic one.

DataRecs is built for the second question. It compares two sources value by value and returns the precise differences — auditable, reproducible, cross-engine. If you want the head-to-head framing, see data reconciliation vs data observability, or the shorter argument on why we build for proof, not probability.
