Remove Duplicates

Transform

Drop duplicate rows that match on a chosen set of key columns. Choose whether the first or last occurrence is kept; non-key columns from the unkept rows are discarded. Use it to deduplicate user lists by email, collapse repeated event records by (user_id, day), or pick the latest snapshot per entity.

Remove Duplicates is not a streaming transform: it reads the entire input into memory before producing any output, so peak memory scales with row count on large inputs.

For each row, the transform computes a key by JSON.stringify’ing the values of the configured key columns. Rows that produce the same key are duplicates. Depending on keepStrategy, either the first or the last occurrence is kept; the others are dropped entirely (their non-key columns are not merged in).

The relative order of kept rows matches their order in the input. If no key columns are configured, all rows pass through unchanged.
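
A minimal sketch of this behavior in TypeScript (the `Row` type, the `dedupe` name, and the exact key serialization are illustrative assumptions, not the shipped implementation):

```ts
type Row = Record<string, unknown>;

interface DedupeOptions {
  columns: string[];              // key columns
  keepStrategy: "first" | "last"; // which occurrence survives
}

// Illustrative sketch of the behavior described above, not the shipped code.
function dedupe(rows: Row[], { columns, keepStrategy }: DedupeOptions): Row[] {
  if (columns.length === 0) return rows; // no key columns configured: passthrough

  const byKey = new Map<string, Row>(); // Map preserves insertion order
  for (const row of rows) {
    const key = JSON.stringify(columns.map((c) => row[c]));
    if (keepStrategy === "first") {
      if (!byKey.has(key)) byKey.set(key, row); // keep only the first occurrence
    } else {
      byKey.delete(key);   // re-inserting moves the entry to the end of the Map,
      byKey.set(key, row); // so output order tracks the kept (last) occurrence
    }
  }
  return [...byKey.values()];
}
```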

Input: One tabular data connection. Output: A subset of input rows — the chosen occurrence per duplicate group, with non-key columns intact for that row.

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| `columns` | `string[]` | Key columns. Two rows with the same values in all key columns are duplicates. Empty list = passthrough. | `[]` |
| `keepStrategy` | `"first" \| "last"` | Which occurrence to retain when duplicates are found. | `"first"` |
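
For illustration, a configuration with both options spelled out might look like this (the plain-object shape is an assumption; only the `columns` and `keepStrategy` names come from the table above):

```ts
// Assumed plain-object options shape; the option names come from the table above.
const removeDuplicatesOptions = {
  columns: ["email"],    // rows matching on every key column are duplicates
  keepStrategy: "first", // keep the earliest occurrence in input order
} as const;
```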

Deduplicate by email, keep first occurrence

A signup export has multiple rows per user. You want one row per email, keeping the earliest signup.

Before:

| email | full_name | signup_date |
| --- | --- | --- |
| alice@example.com | Alice Anderson | 2024-03-12 |
| bob@example.com | Bob Brown | 2024-04-01 |
| alice@example.com | Alice A. | 2024-09-22 |
| carol@example.com | Carol Chen | 2025-01-18 |
| bob@example.com | Robert Brown | 2025-02-10 |

Configuration: columns: ["email"], keepStrategy: "first".

After:

| email | full_name | signup_date |
| --- | --- | --- |
| alice@example.com | Alice Anderson | 2024-03-12 |
| bob@example.com | Bob Brown | 2024-04-01 |
| carol@example.com | Carol Chen | 2025-01-18 |

Deduplicate by device_id, keep last occurrence

Sort the data by date upstream, then deduplicate on the entity key with keepStrategy: "last" to retain the newest record per entity.

Before: (already sorted ascending by recorded_at)

| device_id | status | recorded_at |
| --- | --- | --- |
| dev-01 | ok | 2025-06-01 09:00:00 |
| dev-02 | ok | 2025-06-01 09:05:00 |
| dev-01 | warn | 2025-06-01 12:30:00 |
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |

Configuration: columns: ["device_id"], keepStrategy: "last".

After:

| device_id | status | recorded_at |
| --- | --- | --- |
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |
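
As a sketch of the pattern, reusing the illustrative `dedupe` from earlier with the readings above: sort ascending by timestamp, then keep the last occurrence per device.

```ts
// Uses the illustrative dedupe() sketch from earlier on this page (an assumption, not the real API).
const readings = [
  { device_id: "dev-01", status: "ok",    recorded_at: "2025-06-01 09:00:00" },
  { device_id: "dev-02", status: "ok",    recorded_at: "2025-06-01 09:05:00" },
  { device_id: "dev-01", status: "warn",  recorded_at: "2025-06-01 12:30:00" },
  { device_id: "dev-02", status: "error", recorded_at: "2025-06-01 14:20:00" },
  { device_id: "dev-01", status: "ok",    recorded_at: "2025-06-01 18:45:00" },
];

const newestPerDevice = dedupe(
  // Lexicographic comparison is chronological here because the timestamps share a fixed format.
  [...readings].sort((a, b) => a.recorded_at.localeCompare(b.recorded_at)),
  { columns: ["device_id"], keepStrategy: "last" }
);
// -> dev-02 error 14:20:00, then dev-01 ok 18:45:00 (matches the After table)
```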

Deduplicate by account and day (composite key)

Two columns together form the deduplication key. Rows that match on both key columns are collapsed.

Before:

| account | day | logins |
| --- | --- | --- |
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
| acct-7001 | 2025-04-10 | 7 |

Configuration: columns: ["account", "day"], keepStrategy: "first".

After:

| account | day | logins |
| --- | --- | --- |
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
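
With a composite key, both column values feed into a single key string. Assuming the values are serialized together as in the sketch above, rows 1 and 4 collide while row 3 stays distinct:

```ts
// Assumed serialization of the key-column values as an array (as in the sketch above).
JSON.stringify(["acct-7001", "2025-04-10"]); // '["acct-7001","2025-04-10"]' (rows 1 and 4 collide)
JSON.stringify(["acct-7001", "2025-04-11"]); // '["acct-7001","2025-04-11"]' (row 3 stays distinct)
```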

Notes

  • Key matching is type-sensitive. Keys are computed by JSON.stringify of the key column values, so the string "42" and the number 42 are treated as different keys (they serialize as "\"42\"" vs "42"; the snippet after this list shows both). If your data mixes types in the same column from different upstream sources, normalize types first (e.g. via Type Coercion or Formula). See apps/web/src/transforms/deduplicate/logic.ts:41-52.
  • keepStrategy: "last" retains the row that appears latest in input order, not the row with the latest timestamp. This transform does not sort. To pick the newest row by date, sort ascending on the timestamp column upstream, then deduplicate with keepStrategy: "last".
  • Non-key columns from dropped rows are discarded. If duplicates have different values in non-key columns, only the kept row’s values survive — there is no merge or aggregation. Use Group By if you need to combine values across duplicates (e.g. summing, taking max).
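
The type-sensitivity caveat comes down to how JSON.stringify renders the two values:

```ts
JSON.stringify("42"); // '"42"' (string value: the quotes end up inside the key)
JSON.stringify(42);   // '42'   (number value: bare digits), so the two keys never match
```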

Related transforms

  • Sort Rows — sort upstream to control which duplicate keepStrategy: "last" retains.
  • Group By — aggregate rather than discard non-key data when collapsing duplicates.
  • Filter Rows — narrow the dataset before deduplicating to reduce memory usage.