Is This How Everyone Does It?
Short answer: no. This is a design pattern — one approach that works well for a specific class of data engineering problems. Teams at companies like Meta or Amazon are dealing with petabyte-scale distributed systems and often build entirely different architectures optimized for their constraints. The right pattern depends on your data volume, team skill set, tooling, and SLA requirements.
What's described here is a clean, practical pattern for teams running SQL-based ETL pipelines against a traditional or cloud data warehouse.
The Two-Table Pattern: Stage + Base
The design rests on a strict separation of responsibilities between two layers:
| Layer | Table | Responsibility |
|---|---|---|
| Stage | STG_CUSTOMERS | All ETL transformation logic runs here |
| Base | BASE_CUSTOMERS | Clean, versioned history — what users query |
The rule is simple: every piece of ETL logic belongs in the load into STG. Field mapping, type casting, deduplication, business rules, derived columns — all of it happens before you touch Base. STG is where you do the work.
Once STG is loaded and validated, the move from STG → Base is a one-to-one copy using the incremental merge logic described below. Note that this is the opposite of a kill-and-fill (truncate the target and reload everything): you're not rebuilding from scratch, you're applying a clean diff to the current records.
Source System
│
▼ (all ETL logic lives here)
┌──────────┐
│ STG │
│ (staging)│
└──────────┘
│
│ one-to-one incremental copy
▼
┌──────────┐
│ BASE │ ◀── Users query here
│ (base) │
└──────────┘
This separation keeps Base predictable. You always know exactly what's in it because the only way data gets there is through a controlled, well-defined merge from STG.
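That controlled run order can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production orchestrator — table names and the validation step are placeholders, and only the "new key" branch of the merge is shown here:

```python
import sqlite3

def run_pipeline(conn, source_rows):
    """One pipeline run: rebuild STG, validate it, then merge into BASE.
    All transformation happens on the way into STG; the STG -> BASE step
    only applies the diff."""
    cur = conn.cursor()
    # 1. STG is rebuilt from scratch every run -- it holds no history.
    cur.execute("DELETE FROM stg_customers")
    cur.executemany(
        "INSERT INTO stg_customers (surrogate_key, first_name) VALUES (?, ?)",
        source_rows,
    )
    # 2. Validate STG before BASE is touched; abort the run on failure.
    (n,) = cur.execute("SELECT COUNT(*) FROM stg_customers").fetchone()
    if n == 0:
        raise RuntimeError("empty STG load -- refusing to merge into BASE")
    # 3. Controlled merge -- only the 'new key' branch, for brevity.
    cur.execute("""
        INSERT INTO base_customers (surrogate_key, first_name, is_current)
        SELECT s.surrogate_key, s.first_name, 'Y'
        FROM stg_customers s
        WHERE NOT EXISTS (SELECT 1 FROM base_customers b
                          WHERE b.surrogate_key = s.surrogate_key)
    """)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg_customers (surrogate_key TEXT, first_name TEXT);
    CREATE TABLE base_customers (surrogate_key TEXT, first_name TEXT, is_current TEXT);
""")
run_pipeline(conn, [("sk_jane", "Jane")])
run_pipeline(conn, [("sk_jane", "Jane")])  # second run: no duplicate insert
print(conn.execute("SELECT COUNT(*) FROM base_customers").fetchone()[0])  # 1
```

Because BASE is only reachable through step 3, a failure in steps 1 or 2 leaves it untouched — stale, but never corrupt.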
The Two Tracking Columns
This pattern implements SCD Type 2 (Slowly Changing Dimension Type 2) — a standard technique for maintaining full history as attributes change over time. It runs on two columns:
1. SURROGATE_KEY
A hash of the business/natural key — the attribute that uniquely identifies who this person or entity is. For customer data, that's typically something like SSN.
SURROGATE_KEY = MD5(ssn)
This never changes. The same person always hashes to the same surrogate key regardless of how many times their address or phone number changes. It's purely an identity question: is this person already in Base?
2. CHKSUMTXT
A hash of the attributes you care about tracking changes for — the things that describe the entity. For a customer, that might be:
- First name
- Last name
- Address
- Phone number
CHKSUMTXT = MD5(CONCAT(first_name, '|', last_name, '|', address, '|', phone))
If any one of those attributes changes, CHKSUMTXT changes, and that triggers an SCD2 update: the old row gets expired and a new row is inserted capturing the new state. You decide which columns go into CHKSUMTXT based on what your business actually needs to track historically.
-- Building both columns in STG
-- (the '|' delimiter guards against concatenation collisions,
-- e.g. ('ab','c') and ('a','bc') hashing to the same value)
SELECT
    MD5(ssn) AS surrogate_key,
    MD5(CONCAT(first_name, '|', last_name, '|', address, '|', phone_number)) AS chksumtxt,
    ssn,
    first_name,
    last_name,
    address,
    phone_number,
    CURRENT_TIMESTAMP AS load_ts
FROM raw_customers;
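The behavior of the two hashes can be checked directly in Python — hashlib.md5 here stands in for the warehouse's MD5 function, and the values are illustrative:

```python
import hashlib

def md5_hex(*fields):
    """MD5 of pipe-delimited fields, mirroring MD5(CONCAT(..., '|', ...))."""
    return hashlib.md5("|".join(fields).encode("utf-8")).hexdigest()

ssn = "111-22-3333"
before = md5_hex("Jane", "Doe", "123 Main St", "555-0100")
after = md5_hex("Jane", "Doe", "456 Oak Ave", "555-0100")

# Identity is stable: the same SSN always yields the same surrogate key.
print(md5_hex(ssn) == md5_hex(ssn))  # True
# A tracked attribute changed, so CHKSUMTXT changes -> SCD2 update fires.
print(before != after)               # True
```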
The Three Scenarios
One concrete customer serves as the running example throughout: Jane Doe, SSN 111-22-3333.
Scenario 1 — New Record (Surrogate Key Not In Base)
Jane is a brand new customer — her surrogate key (MD5(ssn)) doesn't exist anywhere in Base. Insert her as an active record.
STG incoming:
| surrogate_key | chksumtxt | ssn | first_name | last_name | address | phone |
|---|---|---|---|---|---|---|
| sk_jane | cks_A | 111-22-3333 | Jane | Doe | 123 Main St | 555-0100 |
BASE before:
| (no rows) |
|---|
BASE after:
| surrogate_key | chksumtxt | first_name | last_name | address | phone | is_current | effective_ts | expiry_ts |
|---|---|---|---|---|---|---|---|---|
| sk_jane | cks_A | Jane | Doe | 123 Main St | 555-0100 | Y | 2026-05-01 | NULL |
Scenario 2 — Changed Record (Surrogate Key Exists, CHKSUMTXT Different)
Jane moved to a new address. Her SSN hasn't changed (same person, same surrogate_key), but her address changed, so CHKSUMTXT is now different.
SCD2 kicks in:
- Expire the existing row: set is_current = 'N' and stamp expiry_ts
- Insert a new row with the updated attributes and is_current = 'Y'
STG incoming:
| surrogate_key | chksumtxt | first_name | last_name | address | phone |
|---|---|---|---|---|---|
| sk_jane | cks_B | Jane | Doe | 456 Oak Ave | 555-0100 |
BASE before:
| surrogate_key | chksumtxt | first_name | last_name | address | phone | is_current | effective_ts | expiry_ts |
|---|---|---|---|---|---|---|---|---|
sk_jane | cks_A | Jane | Doe | 123 Main St | 555-0100 | Y | 2026-05-01 | NULL |
BASE after:
| surrogate_key | chksumtxt | first_name | last_name | address | phone | is_current | effective_ts | expiry_ts |
|---|---|---|---|---|---|---|---|---|
| sk_jane | cks_A | Jane | Doe | 123 Main St | 555-0100 | N | 2026-05-01 | 2026-05-07 |
| sk_jane | cks_B | Jane | Doe | 456 Oak Ave | 555-0100 | Y | 2026-05-07 | NULL |
You now have a full history of every address Jane ever had, without deleting a single row.
The same logic applies if her first name, last name, or phone number changes — any attribute inside CHKSUMTXT triggers this.
Scenario 3 — Unchanged Record (Surrogate Key Exists, CHKSUMTXT Same)
Jane's record comes through again in the next load, but nothing about her has changed. CHKSUMTXT is identical. Skip it entirely. No update, no insert.
STG incoming:
| surrogate_key | chksumtxt | first_name | last_name | address | phone |
|---|---|---|---|---|---|
| sk_jane | cks_B | Jane | Doe | 456 Oak Ave | 555-0100 |
BASE (current row):
| surrogate_key | chksumtxt | first_name | last_name | address | phone | is_current |
|---|---|---|---|---|---|---|
| sk_jane | cks_B | Jane | Doe | 456 Oak Ave | 555-0100 | Y |
→ No action taken. This is where incremental loading earns its name — if a million customers load unchanged, you process zero rows against Base.
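The three-way decision reduces to a tiny pure function — a sketch of the branching logic, not the warehouse implementation:

```python
def classify(base_current, stg_row):
    """Decide the action for one STG row, given its current BASE row
    (None if the surrogate key is absent from BASE)."""
    if base_current is None:
        return "insert"             # Scenario 1: new record
    if base_current["chksumtxt"] != stg_row["chksumtxt"]:
        return "expire_and_insert"  # Scenario 2: changed record
    return "skip"                   # Scenario 3: unchanged record

jane = {"surrogate_key": "sk_jane", "chksumtxt": "cks_B"}
print(classify(None, jane))                    # insert
print(classify({"chksumtxt": "cks_A"}, jane))  # expire_and_insert
print(classify({"chksumtxt": "cks_B"}, jane))  # skip
```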
The Merge Logic in SQL
MERGE INTO base_customers AS base
USING stg_customers AS stg
    ON base.surrogate_key = stg.surrogate_key
    AND base.is_current = 'Y'
-- Scenario 2: record changed -> expire the old row
WHEN MATCHED AND base.chksumtxt <> stg.chksumtxt THEN
    UPDATE SET
        base.is_current = 'N',
        base.expiry_ts = CURRENT_TIMESTAMP
-- Scenario 3: record unchanged -> no action. Standard MERGE does not
-- allow an empty THEN branch, so unchanged rows are skipped simply by
-- not satisfying any WHEN condition.
-- Scenario 1: new record -> insert
WHEN NOT MATCHED THEN
    INSERT (surrogate_key, chksumtxt, ssn, first_name, last_name,
            address, phone_number, is_current, effective_ts, expiry_ts)
    VALUES (stg.surrogate_key, stg.chksumtxt, stg.ssn, stg.first_name,
            stg.last_name, stg.address, stg.phone_number,
            'Y', CURRENT_TIMESTAMP, NULL);

-- Scenario 2 (continued): after expiring the old row, insert the new
-- version. The key exists in BASE but no longer has a current row --
-- exactly the rows the MERGE just expired. (Joining only on
-- chksumtxt <> would also match older history rows and insert duplicates.)
INSERT INTO base_customers
    (surrogate_key, chksumtxt, ssn, first_name, last_name,
     address, phone_number, is_current, effective_ts, expiry_ts)
SELECT
    stg.surrogate_key, stg.chksumtxt, stg.ssn, stg.first_name,
    stg.last_name, stg.address, stg.phone_number,
    'Y', CURRENT_TIMESTAMP, NULL
FROM stg_customers stg
WHERE EXISTS (SELECT 1 FROM base_customers b
              WHERE b.surrogate_key = stg.surrogate_key)
  AND NOT EXISTS (SELECT 1 FROM base_customers b
                  WHERE b.surrogate_key = stg.surrogate_key
                    AND b.is_current = 'Y');
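The whole expire-and-insert cycle can be exercised end to end with Python's built-in sqlite3. SQLite has no MERGE statement, so this sketch uses the equivalent UPDATE + INSERT pair; table and hash names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg (surrogate_key TEXT, chksumtxt TEXT, address TEXT);
    CREATE TABLE base (surrogate_key TEXT, chksumtxt TEXT, address TEXT,
                       is_current TEXT, expiry_ts TEXT);
""")

def merge(conn):
    cur = conn.cursor()
    # Scenario 2a: expire current rows whose checksum changed.
    cur.execute("""
        UPDATE base SET is_current = 'N', expiry_ts = datetime('now')
        WHERE is_current = 'Y'
          AND EXISTS (SELECT 1 FROM stg
                      WHERE stg.surrogate_key = base.surrogate_key
                        AND stg.chksumtxt <> base.chksumtxt)""")
    # Scenarios 1 and 2b: insert any STG row with no current BASE row.
    cur.execute("""
        INSERT INTO base
        SELECT surrogate_key, chksumtxt, address, 'Y', NULL FROM stg
        WHERE NOT EXISTS (SELECT 1 FROM base
                          WHERE base.surrogate_key = stg.surrogate_key
                            AND base.is_current = 'Y')""")
    conn.commit()

# Run 1: Jane is new (Scenario 1).
conn.execute("INSERT INTO stg VALUES ('sk_jane', 'cks_A', '123 Main St')")
merge(conn)
# Run 2: Jane moved (Scenario 2).
conn.execute("DELETE FROM stg")
conn.execute("INSERT INTO stg VALUES ('sk_jane', 'cks_B', '456 Oak Ave')")
merge(conn)
# Run 3: no change (Scenario 3) -- nothing happens.
merge(conn)

rows = conn.execute(
    "SELECT chksumtxt, is_current FROM base ORDER BY is_current").fetchall()
print(rows)  # [('cks_A', 'N'), ('cks_B', 'Y')]
```

Running merge() a third time with identical STG data changes nothing — the operation is idempotent, which is exactly the property that makes pipeline retries safe.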
Why We Never Delete
Hard deletes are permanent: once a row is gone from Base, the pipeline has no way to reconstruct it. Deleting in Base would mean:
- No audit trail of what the data looked like previously
- Broken downstream reports that joined on those rows
- No way to distinguish "was deleted from source" vs "pipeline bug"
Instead, a soft delete pattern (setting is_current = 'N') handles source-side deletions gracefully — you mark the row expired and let consumers decide whether to include inactive records.
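Detecting a source-side delete is the same comparison in reverse: any current BASE key that no longer arrives in STG gets expired. A sqlite3 sketch, assuming STG holds a full snapshot of the source (incremental feeds would need an explicit delete signal instead):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stg (surrogate_key TEXT);
    CREATE TABLE base (surrogate_key TEXT, is_current TEXT, expiry_ts TEXT);
    INSERT INTO base VALUES ('sk_jane', 'Y', NULL), ('sk_bob', 'Y', NULL);
    INSERT INTO stg VALUES ('sk_jane');  -- Bob vanished from the source feed
""")

# Soft delete: expire current rows whose key no longer comes from source.
conn.execute("""
    UPDATE base SET is_current = 'N', expiry_ts = datetime('now')
    WHERE is_current = 'Y'
      AND NOT EXISTS (SELECT 1 FROM stg
                      WHERE stg.surrogate_key = base.surrogate_key)""")

flags = dict(conn.execute("SELECT surrogate_key, is_current FROM base"))
print(flags)  # {'sk_jane': 'Y', 'sk_bob': 'N'}
```

Bob's row survives with is_current = 'N' — consumers filtering on current rows no longer see him, but the audit trail keeps everything he ever was.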
The Full Picture
Every pipeline run:
STG populated ─────────────────────────┐
▼
┌─────────────────────────┐
│ Compare STG vs BASE │
└─────────────────────────┘
│
┌───────────────────┼───────────────────┐
▼ ▼ ▼
New record? CHKSUMTXT CHKSUMTXT
(key missing changed? same?
in BASE) (key exists) (key exists)
│ │ │
INSERT new Expire old + No-op,
row with INSERT new row skip it
is_current='Y' with is_current='Y'
Result: BASE always has valid data. Pipeline failures only affect how fresh it is, never whether it exists.
Summary
| Condition | Action |
|---|---|
| Key not in Base | Insert new row (is_current = 'Y') |
| Key exists, CHKSUMTXT changed | Expire old row + insert updated row |
| Key exists, CHKSUMTXT same | Skip — no action |
| Source-side delete | Expire old row (is_current = 'N') — never hard delete |
This pattern gives you resilience (Base survives Stage failures), auditability (full row history), and efficiency (only process what changed). It scales from a few thousand rows to hundreds of millions without changing the logic — only the compute underneath it.