Remove duplicate rows
Description
The Remove Duplicate Rows activity filters out duplicate entries from a dataset by evaluating values in a specified column. It preserves the first occurrence of each unique value and removes all subsequent duplicates, helping to clean and normalize data for further processing.
This is particularly useful when data merges or imports have introduced duplicate records and only the first instance of each record, in input order, should be retained.
Use this activity to:
- Clean data before analysis or export
- Remove redundant records during ETL processing
- Prepare datasets for machine learning or reporting by ensuring uniqueness
Use case: In a CRM export where customers may appear multiple times due to recent activity, deduplicate by Email or Customer ID to retain only the first row that appears for each customer.
Input
Type | Status |
---|---|
Data | Required |
Output
Output Type | Format | Description |
---|---|---|
Data | Table | The cleaned dataset with only the first instance of each duplicate retained. |
Configuration Fields
Field Name | Description |
---|---|
Column Name | The column based on which duplicate detection is performed. If two or more rows share the same value in this column, only the first row is retained. |
Rows with empty or null values in this column are treated as duplicates of each other only when those values are exactly identical.
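The first-occurrence rule can be sketched in a few lines of Python. This is an illustration only, not the activity's actual implementation: the function name `remove_duplicate_rows` and the representation of rows as dictionaries are assumptions made for the example. Values are compared with plain equality, so two `None` values (or two empty strings) count as identical, while `None` and `""` do not; whether the activity behaves the same way for nulls should be verified against real data.

```python
def remove_duplicate_rows(rows, column_name):
    """Keep the first row for each distinct value in column_name.

    Hypothetical helper for illustration; values must be hashable.
    """
    seen = set()
    kept = []
    for row in rows:
        value = row.get(column_name)  # a missing key is treated as None (assumption)
        if value in seen:
            continue  # a previous row already carried this exact value
        seen.add(value)
        kept.append(row)
    return kept


rows = [
    {"ID": 1, "Email": "a@example.com"},
    {"ID": 2, "Email": None},
    {"ID": 3, "Email": "a@example.com"},  # duplicate of ID 1
    {"ID": 4, "Email": None},             # exactly identical to ID 2's value
    {"ID": 5, "Email": ""},               # empty string differs from None
]
deduped = remove_duplicate_rows(rows, "Email")
print([r["ID"] for r in deduped])  # [1, 2, 5]
```

Tracking seen values in a set keeps the pass linear in the number of rows and preserves the original row order.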
Sample Input
ID | Name | Age | City |
---|---|---|---|
101 | John | 25 | New York |
102 | Alice | 30 | Chicago |
103 | John | 25 | New York |
104 | Bob | 40 | Boston |
105 | Alice | 30 | Chicago |
In this example, the rows with ID 103 and 105 are duplicates of the rows with ID 101 and 102, respectively, based on the Name column.
Sample Configuration
Field | Value |
---|---|
Column Name | Name |
Sample Output
ID | Name | Age | City |
---|---|---|---|
101 | John | 25 | New York |
102 | Alice | 30 | Chicago |
104 | Bob | 40 | Boston |
The duplicate rows for John and Alice (IDs 103 and 105) were removed, keeping only the first occurrence of each name in input order.
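To check the sample by hand, the same result can be reproduced outside the activity. The snippet below is a rough equivalent under stated assumptions, not the product's code: it relies on Python dictionaries preserving insertion order, and the variable names `sample_input`, `first_by_name`, and `sample_output` are made up for the example.

```python
sample_input = [
    {"ID": 101, "Name": "John", "Age": 25, "City": "New York"},
    {"ID": 102, "Name": "Alice", "Age": 30, "City": "Chicago"},
    {"ID": 103, "Name": "John", "Age": 25, "City": "New York"},
    {"ID": 104, "Name": "Bob", "Age": 40, "City": "Boston"},
    {"ID": 105, "Name": "Alice", "Age": 30, "City": "Chicago"},
]

# Column Name = Name: store only the first row seen for each name.
first_by_name = {}
for row in sample_input:
    # setdefault leaves the stored row untouched when the name is already present
    first_by_name.setdefault(row["Name"], row)

sample_output = list(first_by_name.values())
for row in sample_output:
    print(row["ID"], row["Name"], row["Age"], row["City"])
# 101 John 25 New York
# 102 Alice 30 Chicago
# 104 Bob 40 Boston
```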