Remove Duplicates

Transform

Drop duplicate rows that match on a chosen set of key columns. Choose whether the first or last occurrence is kept; non-key columns from the unkept rows are discarded. Use it to deduplicate user lists by email, collapse repeated event records by (user_id, day), or pick the latest snapshot per entity.

Remove Duplicates is not a streaming transform: it reads the entire input into memory before producing any output, so peak memory scales with row count on large inputs.

For each row, the transform computes a key by JSON.stringify’ing the values of the configured key columns. Rows that produce the same key are duplicates. Depending on keepStrategy, either the first or the last occurrence is kept; the others are dropped entirely (their non-key columns are not merged in).

The relative order of kept rows matches their order in the input. If no key columns are configured, all rows pass through unchanged.
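
A minimal sketch of this behavior in TypeScript (the `Row` type, the `dedupe` name, and the exact key serialization are illustrative assumptions, not the shipped implementation):

```ts
type Row = Record<string, unknown>;

interface DedupeOptions {
  columns: string[];              // key columns
  keepStrategy: "first" | "last"; // which occurrence survives
}

// Illustrative sketch of the behavior described above, not the shipped code.
function dedupe(rows: Row[], { columns, keepStrategy }: DedupeOptions): Row[] {
  if (columns.length === 0) return rows; // no key columns configured: passthrough

  const byKey = new Map<string, Row>(); // Map preserves insertion order
  for (const row of rows) {
    const key = JSON.stringify(columns.map((c) => row[c]));
    if (keepStrategy === "first") {
      if (!byKey.has(key)) byKey.set(key, row); // keep only the first occurrence
    } else {
      byKey.delete(key);   // re-inserting moves the entry to the end of the Map,
      byKey.set(key, row); // so output order tracks the kept (last) occurrence
    }
  }
  return [...byKey.values()];
}
```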

Input: One tabular data connection. Output: A subset of input rows — the chosen occurrence per duplicate group, with non-key columns intact for that row.

| Option | Type | Description | Default |
| --- | --- | --- | --- |
| `columns` | `string[]` | Key columns. Two rows with the same values in all key columns are duplicates. Empty list = passthrough. | `[]` |
| `keepStrategy` | `"first" \| "last"` | Which occurrence to retain when duplicates are found. | `"first"` |
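
For illustration, a configuration with both options spelled out might look like this (the plain-object shape is an assumption; only the `columns` and `keepStrategy` names come from the table above):

```ts
// Assumed plain-object options shape; the option names come from the table above.
const removeDuplicatesOptions = {
  columns: ["email"],    // rows matching on every key column are duplicates
  keepStrategy: "first", // keep the earliest occurrence in input order
} as const;
```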

Deduplicate by email, keep first occurrence

A signup export has multiple rows per user. You want one row per email, keeping the earliest signup.

Before:

| email | full_name | signup_date |
| --- | --- | --- |
| alice@example.com | Alice Anderson | 2024-03-12 |
| bob@example.com | Bob Brown | 2024-04-01 |
| alice@example.com | Alice A. | 2024-09-22 |
| carol@example.com | Carol Chen | 2025-01-18 |
| bob@example.com | Robert Brown | 2025-02-10 |

Configuration: columns: ["email"], keepStrategy: "first".

After:

| email | full_name | signup_date |
| --- | --- | --- |
| alice@example.com | Alice Anderson | 2024-03-12 |
| bob@example.com | Bob Brown | 2024-04-01 |
| carol@example.com | Carol Chen | 2025-01-18 |

Deduplicate by device_id, keep last occurrence

Sort the data by date upstream, then deduplicate on the entity key with keepStrategy: "last" to retain the newest record per entity.

Before: (already sorted ascending by recorded_at)

| device_id | status | recorded_at |
| --- | --- | --- |
| dev-01 | ok | 2025-06-01 09:00:00 |
| dev-02 | ok | 2025-06-01 09:05:00 |
| dev-01 | warn | 2025-06-01 12:30:00 |
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |

Configuration: columns: ["device_id"], keepStrategy: "last".

After:

| device_id | status | recorded_at |
| --- | --- | --- |
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |
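
As a sketch of the pattern, reusing the illustrative `dedupe` from earlier with the readings above: sort ascending by timestamp, then keep the last occurrence per device.

```ts
// Uses the illustrative dedupe() sketch from earlier on this page (an assumption, not the real API).
const readings = [
  { device_id: "dev-01", status: "ok",    recorded_at: "2025-06-01 09:00:00" },
  { device_id: "dev-02", status: "ok",    recorded_at: "2025-06-01 09:05:00" },
  { device_id: "dev-01", status: "warn",  recorded_at: "2025-06-01 12:30:00" },
  { device_id: "dev-02", status: "error", recorded_at: "2025-06-01 14:20:00" },
  { device_id: "dev-01", status: "ok",    recorded_at: "2025-06-01 18:45:00" },
];

const newestPerDevice = dedupe(
  // Lexicographic comparison is chronological here because the timestamps share a fixed format.
  [...readings].sort((a, b) => a.recorded_at.localeCompare(b.recorded_at)),
  { columns: ["device_id"], keepStrategy: "last" }
);
// -> dev-02 error 14:20:00, then dev-01 ok 18:45:00 (matches the After table)
```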

Deduplicate by account and day (composite key)

Two columns together form the deduplication key. Rows that match on both key columns are collapsed.

Before:

| account | day | logins |
| --- | --- | --- |
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
| acct-7001 | 2025-04-10 | 7 |

Configuration: columns: ["account", "day"], keepStrategy: "first".

After:

| account | day | logins |
| --- | --- | --- |
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
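
With a composite key, both column values feed into a single key string. Assuming the values are serialized together as in the sketch above, rows 1 and 4 collide while row 3 stays distinct:

```ts
// Assumed serialization of the key-column values as an array (as in the sketch above).
JSON.stringify(["acct-7001", "2025-04-10"]); // '["acct-7001","2025-04-10"]' (rows 1 and 4 collide)
JSON.stringify(["acct-7001", "2025-04-11"]); // '["acct-7001","2025-04-11"]' (row 3 stays distinct)
```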

Notes

  • Key matching is type-sensitive. Keys are computed by JSON.stringify of the key column values, so the string "42" and the number 42 are treated as different keys (they serialize as "\"42\"" vs "42"; the snippet after this list shows both). If your data mixes types in the same column from different upstream sources, normalize types first (e.g. via Type Coercion or Formula). See apps/web/src/transforms/deduplicate/logic.ts:41-52.
  • keepStrategy: "last" retains the row that appears latest in input order, not the row with the latest timestamp. This transform does not sort. To pick the newest row by date, sort ascending on the timestamp column upstream, then deduplicate with keepStrategy: "last".
  • Non-key columns from dropped rows are discarded. If duplicates have different values in non-key columns, only the kept row’s values survive — there is no merge or aggregation. Use Group By if you need to combine values across duplicates (e.g. summing, taking max).
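
The type-sensitivity caveat comes down to how JSON.stringify renders the two values:

```ts
JSON.stringify("42"); // '"42"' (string value: the quotes end up inside the key)
JSON.stringify(42);   // '42'   (number value: bare digits), so the two keys never match
```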

Related transforms

  • Sort Rows — sort upstream to control which duplicate keepStrategy: "last" retains.
  • Group By — aggregate rather than discard non-key data when collapsing duplicates.
  • Filter Rows — narrow the dataset before deduplicating to reduce memory usage.