RemoveDuplicatesTransformer
Passo a passo de como iniciar sua jornada na Dadosfera
RemoveDuplicatesTransformer
Overview
Transforms DataStage RemoveDuplicates stages to SQL using DISTINCT or ROW_NUMBER(). Removes duplicate rows based on specified key columns or all columns.
Stage Type: REMOVE_DUPLICATES (remdup operator)
Output: SQL with DISTINCT or ROW_NUMBER()
Capabilities
- Remove exact duplicates (all columns match) using DISTINCT
- Remove duplicates by key columns using ROW_NUMBER() with PARTITION BY
- Keep first or last occurrence of duplicates
- Preserves input schema (no columns added/removed)
- Column filtering via keep clause
DataStage Stage Example
Example 1: Remove Duplicates by Key Columns (Keep Last)
Input (.dsx format)
#### STAGE: RemoveDuplicate
## Operator
remdup
## Operator options
-keep last
-key 'TYPE_CODE'
-key 'SOURCE_INFO'
## General options
[ident('RemoveDuplicate'); jobmon_ident('RemoveDuplicate')]
## Inputs
0< [] 'InputData:input.v'
## Outputs
0> [modify (
keep
MONTH_REF,DESC_FIELD_1,DESC_FIELD_2,TYPE_CODE,
SOURCE_INFO,INFO_DETAIL,LOAD_DATE,SOURCE_SYSTEM,
JOB_NAME;
)] 'RemoveDuplicate:output.v'
Input Schema
[
{"name": "MONTH_REF", "type": "string"},
{"name": "DESC_FIELD_1", "type": "string"},
{"name": "DESC_FIELD_2", "type": "string"},
{"name": "TYPE_CODE", "type": "string"},
{"name": "SOURCE_INFO", "type": "string"},
{"name": "INFO_DETAIL", "type": "string"},
{"name": "LOAD_DATE", "type": "date"},
{"name": "SOURCE_SYSTEM", "type": "string"},
{"name": "JOB_NAME", "type": "string"}
]Generated SQL Output
SELECT
MONTH_REF,
DESC_FIELD_1,
DESC_FIELD_2,
TYPE_CODE,
SOURCE_INFO,
INFO_DETAIL,
LOAD_DATE,
SOURCE_SYSTEM,
JOB_NAME
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY TYPE_CODE, SOURCE_INFO
ORDER BY TYPE_CODE DESC, SOURCE_INFO DESC
) AS _row_num
FROM inputdata_V8S0
) AS ranked
WHERE _row_num = 1Output Schema
[
{"name": "MONTH_REF", "type": "string"},
{"name": "DESC_FIELD_1", "type": "string"},
{"name": "DESC_FIELD_2", "type": "string"},
{"name": "TYPE_CODE", "type": "string"},
{"name": "SOURCE_INFO", "type": "string"},
{"name": "INFO_DETAIL", "type": "string"},
{"name": "LOAD_DATE", "type": "date"},
{"name": "SOURCE_SYSTEM", "type": "string"},
{"name": "JOB_NAME", "type": "string"}
]Example 2: Remove Duplicates (Keep First)
Input (.dsx format)
#### STAGE: DeduplicateData
## Operator
remdup
## Operator options
-keep first
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
## General options
[ident('DeduplicateData'); jobmon_ident('DeduplicateData')]
## Inputs
0< [] 'SourceData:input.v'
## Outputs
0> [] 'DeduplicateData:Output.v'
Generated SQL Output
SELECT
CUSTOMER_ID,
PRODUCT_ID,
ORDER_DATE,
AMOUNT
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY CUSTOMER_ID, PRODUCT_ID
ORDER BY CUSTOMER_ID, PRODUCT_ID
) AS _row_num
FROM sourcedata_V0S5
) AS ranked
WHERE _row_num = 1Example 3: Remove Exact Duplicates (All Columns)
Input (.dsx format)
#### STAGE: RemoveExactDupes
## Operator
remdup
## General options
[ident('RemoveExactDupes'); jobmon_ident('RemoveExactDupes')]
## Inputs
0< [] 'RawData:input.v'
## Outputs
0> [] 'RemoveExactDupes:Output.v'
Generated SQL Output
SELECT DISTINCT
CUSTOMER_ID,
PRODUCT_ID,
ORDER_DATE,
AMOUNT
FROM rawdata_V0S12How It Works
1. Identify Deduplication Strategy
No Key Columns (Exact Duplicates)
## Operator
remdup
# No -key options specified
Uses SELECT DISTINCT to remove rows where all columns match exactly.
With Key Columns
-key 'TYPE_CODE'
-key 'SOURCE_INFO'
Uses ROW_NUMBER() OVER (PARTITION BY ...) to remove rows where key columns match.
2. Determine Keep Option
-
-keep first (default): Keeps the first occurrence, removes subsequent duplicates
- Orders by key columns ASC (default sort order)
ORDER BY TYPE_CODE, SOURCE_INFO -
-keep last: Keeps the last occurrence, removes earlier duplicates
- Orders by key columns DESC
ORDER BY TYPE_CODE DESC, SOURCE_INFO DESC
3. Apply Keep Clause
Filters output columns based on keep list in modify section:
keep
MONTH_REF,DESC_FIELD_1,TYPE_CODE,
SOURCE_INFO,LOAD_DATE;
Only these columns appear in final SELECT.
4. Generate SQL
For Exact Duplicates (No Keys)
SELECT DISTINCT
<all_columns or keep_columns>
FROM input_tableFor Key-Based Deduplication (Keep First)
SELECT
<keep_columns or all_columns>
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY <key_columns>
ORDER BY <key_columns>
) AS _row_num
FROM input_table
) AS ranked
WHERE _row_num = 1For Key-Based Deduplication (Keep Last)
SELECT
<keep_columns or all_columns>
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY <key_columns>
ORDER BY <key_columns> DESC
) AS _row_num
FROM input_table
) AS ranked
WHERE _row_num = 15. Preserve Schema
Output schema matches the columns in keep clause (or all input columns if no keep clause).
Deduplication Strategies
Exact Duplicate Removal
Use when you want to remove rows that are completely identical across all columns.
Use Case:
Input:
ID NAME VALUE
1 Alice 100
1 Alice 100 ← Duplicate (all columns match)
2 Bob 200
Output:
ID NAME VALUE
1 Alice 100
2 Bob 200
Key-Based Deduplication (Keep First)
Use when you want to keep the first occurrence of rows with matching key columns.
Use Case:
Input:
TYPE SOURCE VALUE
A SYS1 100 ← Keep (first occurrence)
A SYS1 150 ← Remove
B SYS2 200
Output:
TYPE SOURCE VALUE
A SYS1 100
B SYS2 200
Key-Based Deduplication (Keep Last)
Use when you want to keep the last occurrence of rows with matching key columns.
Use Case:
Input:
TYPE SOURCE VALUE
A SYS1 100 ← Remove
A SYS1 150 ← Keep (last occurrence)
B SYS2 200
Output:
TYPE SOURCE VALUE
A SYS1 150
B SYS2 200
Composite Keys
You can deduplicate on multiple key columns:
Input
-key 'REGION_CODE'
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
-keep first
Generated SQL
SELECT
REGION_CODE,
CUSTOMER_ID,
PRODUCT_ID,
ORDER_DATE,
AMOUNT
FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY REGION_CODE, CUSTOMER_ID, PRODUCT_ID
ORDER BY REGION_CODE, CUSTOMER_ID, PRODUCT_ID
) AS _row_num
FROM input_table
) AS ranked
WHERE _row_num = 1Only rows with all three keys matching are considered duplicates.
Keep Clause
The keep clause in the modify section filters which columns appear in output:
Without Keep Clause
## Outputs
0> [] 'RemoveDuplicate:output.v'
All columns from input are included in output.
With Keep Clause
## Outputs
0> [modify (
keep
MONTH_REF,TYPE_CODE,SOURCE_INFO,
LOAD_DATE,JOB_NAME;
)] 'RemoveDuplicate:output.v'
Only specified columns appear in output SELECT.
Limitations
- ⚠️ Keep last with no natural order - When
-keep lastis used without a timestamp column, the result may depend on physical row order. Consider adding an explicit ORDER BY timestamp if available. - ⚠️ Multiple inputs not supported - RemoveDuplicates stage must have exactly one input
- ⚠️ No custom ORDER BY - When using key columns, ordering is always by the key columns themselves. For timestamp-based "latest" logic, consider using AGGREGATOR with MAX(timestamp) or add a timestamp to the ORDER BY clause manually.
- ⚠️ ORDER BY uses key columns only - The ORDER BY clause uses the same columns as PARTITION BY. If you need to keep the row with the latest timestamp, you may need to modify the generated SQL or use a different approach.
Related Transformers
- AggregatorStageTransformer - For grouping with aggregations (COUNT, SUM, etc.)
- FunnelStageTransformer - For combining multiple inputs and removing duplicates
- ModifyStageTransformer - For column filtering after deduplication
Updated 5 days ago
