Remove Duplicates
Drop duplicate rows that match on a chosen set of key columns. Choose whether the first or last occurrence is kept; non-key columns from the unkept rows are discarded. Use it to deduplicate user lists by email, collapse repeated event records by
(user_id, day), or pick the latest snapshot per entity.
How it works
Remove Duplicates reads the entire input into memory before producing output, so it requires the full dataset (it is not a streaming transform). On large inputs, peak memory scales with row count.
For each row, the transform computes a key by JSON.stringify’ing the values of the configured key columns. Rows that produce the same key are duplicates. Depending on keepStrategy, either the first or the last occurrence is kept; the others are dropped entirely (their non-key columns are not merged in).
The relative order of kept rows matches their order in the input. If no key columns are configured, all rows pass through unchanged.
Input: One tabular data connection. Output: A subset of input rows — the chosen occurrence per duplicate group, with non-key columns intact for that row.
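The behavior described above can be sketched as a small standalone helper. This is a hypothetical illustration, not the transform's actual implementation (which lives in apps/web/src/transforms/deduplicate/logic.ts):

```typescript
type Row = Record<string, unknown>;

// Hypothetical sketch: build a key per row by JSON.stringify'ing the
// key-column values, then keep the first or last occurrence per key.
function deduplicate(
  rows: Row[],
  columns: string[],
  keepStrategy: "first" | "last" = "first",
): Row[] {
  if (columns.length === 0) return rows; // no key columns: passthrough

  const kept = new Map<string, Row>();
  for (const row of rows) {
    const key = JSON.stringify(columns.map((c) => row[c]));
    if (keepStrategy === "first") {
      if (!kept.has(key)) kept.set(key, row); // earliest occurrence wins
    } else {
      kept.delete(key); // re-insert so the kept row takes its input position
      kept.set(key, row);
    }
  }
  return [...kept.values()];
}

// Example with made-up sample data: keep the first row per email.
const rows = [
  { email: "a@example.com", n: 1 },
  { email: "b@example.com", n: 2 },
  { email: "a@example.com", n: 3 },
];
const out = deduplicate(rows, ["email"]); // two rows: n = 1 and n = 2
```

The delete-then-set for keep: last makes the Map's insertion order track the position of the kept occurrence, matching the rule that kept rows preserve their input order.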
Options
| Option | Type | Description | Default |
|---|---|---|---|
| columns | string[] | Key columns. Two rows with the same values in all key columns are duplicates. Empty list = passthrough. | [] |
| keepStrategy | "first" \| "last" | Which occurrence to retain when duplicates are found. | "first" |
Examples
Deduplicate by email, keep first occurrence
A signup export has multiple rows per user. You want one row per email, keeping the earliest signup.
Before:
| email | full_name | signup_date |
|---|---|---|
| [email protected] | Alice Anderson | 2024-03-12 |
| [email protected] | Bob Brown | 2024-04-01 |
| [email protected] | Alice A. | 2024-09-22 |
| [email protected] | Carol Chen | 2025-01-18 |
| [email protected] | Robert Brown | 2025-02-10 |
Configuration: key columns: ["email"], keep: first.
After:
| email | full_name | signup_date |
|---|---|---|
| [email protected] | Alice Anderson | 2024-03-12 |
| [email protected] | Bob Brown | 2024-04-01 |
| [email protected] | Carol Chen | 2025-01-18 |
Latest snapshot per entity (keep last)
Sort the data by date upstream, then deduplicate on the entity key with keep: last to retain the newest record per entity.
Before: (already sorted ascending by recorded_at)
| device_id | status | recorded_at |
|---|---|---|
| dev-01 | ok | 2025-06-01 09:00:00 |
| dev-02 | ok | 2025-06-01 09:05:00 |
| dev-01 | warn | 2025-06-01 12:30:00 |
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |
Configuration: key columns: ["device_id"], keep: last.
After:
| device_id | status | recorded_at |
|---|---|---|
| dev-02 | error | 2025-06-01 14:20:00 |
| dev-01 | ok | 2025-06-01 18:45:00 |
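The sort-upstream-then-keep-last pattern can be sketched as a standalone snippet (hypothetical helper name and row type; not the transform's source):

```typescript
type Reading = { device_id: string; status: string; recorded_at: string };

// Hypothetical sketch: sort ascending by timestamp, then keep the last
// occurrence per device_id, so the newest reading per device survives.
function latestPerDevice(rows: Reading[]): Reading[] {
  const sorted = [...rows].sort((a, b) =>
    a.recorded_at.localeCompare(b.recorded_at),
  );
  const kept = new Map<string, Reading>();
  for (const row of sorted) {
    kept.delete(row.device_id); // re-insert so order follows the kept row
    kept.set(row.device_id, row);
  }
  return [...kept.values()];
}
```

Run against the table above, this yields exactly the two "After" rows, in the same order.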
Composite key: (account, day)
Two columns together form the deduplication key. Rows that match on both are collapsed.
Before:
| account | day | logins |
|---|---|---|
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
| acct-7001 | 2025-04-10 | 7 |
Configuration: key columns: ["account", "day"], keep: first.
After:
| account | day | logins |
|---|---|---|
| acct-7001 | 2025-04-10 | 3 |
| acct-7002 | 2025-04-10 | 1 |
| acct-7001 | 2025-04-11 | 5 |
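One plausible way the composite key could be formed (the exact serialization is an assumption; the docs only state that key-column values are JSON.stringify'ed):

```typescript
// Both key-column values serialized together; rows that match on both
// account and day produce the same key string and are treated as duplicates.
const key = JSON.stringify(["acct-7001", "2025-04-10"]);
// key === '["acct-7001","2025-04-10"]'
```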
Tips and Edge Cases
- Key matching is type-sensitive. Keys are computed by JSON.stringify of the key column values, so the string "42" and the number 42 are treated as different keys (they serialize as "\"42\"" vs "42"). If your data mixes types in the same column from different upstream sources, normalize types first (e.g. via Type Coercion or Formula). See apps/web/src/transforms/deduplicate/logic.ts:41-52.
- last retains the row that appears latest in input order, not by any timestamp. This transform doesn’t sort. To pick the newest row by date, sort ascending on the timestamp column upstream, then deduplicate with keepStrategy: "last".
- Non-key columns from dropped rows are discarded. If duplicates have different values in non-key columns, only the kept row’s values survive — there is no merge or aggregation. Use Group By if you need to combine values across duplicates (e.g. summing, taking max).
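The type-sensitivity caveat in concrete terms:

```typescript
// The string "42" and the number 42 serialize to different JSON text,
// so they form distinct deduplication keys even in the same column.
const stringKey = JSON.stringify("42"); // '"42"'
const numberKey = JSON.stringify(42);   // '42'
// stringKey !== numberKey: the two rows are not considered duplicates
```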
Related Transforms
- Sort Rows — sort upstream to control which duplicate keep: last retains.
- Group By — aggregate rather than discard non-key data when collapsing duplicates.
- Filter Rows — narrow the dataset before deduplicating to reduce memory usage.