RemoveDuplicatesTransformer

Passo a passo de como iniciar sua jornada na Dadosfera

RemoveDuplicatesTransformer

Overview

Transforms DataStage RemoveDuplicates stages to SQL using DISTINCT or ROW_NUMBER(). Removes duplicate rows based on specified key columns or all columns.

Stage Type: REMOVE_DUPLICATES (remdup operator) Output: SQL with DISTINCT or ROW_NUMBER()

Capabilities

  • Remove exact duplicates (all columns match) using DISTINCT
  • Remove duplicates by key columns using ROW_NUMBER() with PARTITION BY
  • Keep first or last occurrence of duplicates
  • Preserves input schema (no columns added/removed)
  • Column filtering via keep clause

DataStage Stage Example

Example 1: Remove Duplicates by Key Columns (Keep Last)

Input (.dsx format)

#### STAGE: RemoveDuplicate
## Operator
remdup
## Operator options
-keep last
-key 'TYPE_CODE'
-key 'SOURCE_INFO'

## General options
[ident('RemoveDuplicate'); jobmon_ident('RemoveDuplicate')]
## Inputs
0< [] 'InputData:input.v'
## Outputs
0> [modify (
keep
  MONTH_REF,DESC_FIELD_1,DESC_FIELD_2,TYPE_CODE,
  SOURCE_INFO,INFO_DETAIL,LOAD_DATE,SOURCE_SYSTEM,
  JOB_NAME;
)] 'RemoveDuplicate:output.v'

Input Schema

[
  {"name": "MONTH_REF", "type": "string"},
  {"name": "DESC_FIELD_1", "type": "string"},
  {"name": "DESC_FIELD_2", "type": "string"},
  {"name": "TYPE_CODE", "type": "string"},
  {"name": "SOURCE_INFO", "type": "string"},
  {"name": "INFO_DETAIL", "type": "string"},
  {"name": "LOAD_DATE", "type": "date"},
  {"name": "SOURCE_SYSTEM", "type": "string"},
  {"name": "JOB_NAME", "type": "string"}
]

Generated SQL Output

SELECT
  MONTH_REF,
  DESC_FIELD_1,
  DESC_FIELD_2,
  TYPE_CODE,
  SOURCE_INFO,
  INFO_DETAIL,
  LOAD_DATE,
  SOURCE_SYSTEM,
  JOB_NAME
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY TYPE_CODE, SOURCE_INFO
            ORDER BY TYPE_CODE DESC, SOURCE_INFO DESC
        ) AS _row_num
    FROM inputdata_V8S0
) AS ranked
WHERE _row_num = 1

Output Schema

[
  {"name": "MONTH_REF", "type": "string"},
  {"name": "DESC_FIELD_1", "type": "string"},
  {"name": "DESC_FIELD_2", "type": "string"},
  {"name": "TYPE_CODE", "type": "string"},
  {"name": "SOURCE_INFO", "type": "string"},
  {"name": "INFO_DETAIL", "type": "string"},
  {"name": "LOAD_DATE", "type": "date"},
  {"name": "SOURCE_SYSTEM", "type": "string"},
  {"name": "JOB_NAME", "type": "string"}
]

Example 2: Remove Duplicates (Keep First)

Input (.dsx format)

#### STAGE: DeduplicateData
## Operator
remdup
## Operator options
-keep first
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'

## General options
[ident('DeduplicateData'); jobmon_ident('DeduplicateData')]
## Inputs
0< [] 'SourceData:input.v'
## Outputs
0> [] 'DeduplicateData:Output.v'

Generated SQL Output

SELECT
  CUSTOMER_ID,
  PRODUCT_ID,
  ORDER_DATE,
  AMOUNT
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY CUSTOMER_ID, PRODUCT_ID
            ORDER BY CUSTOMER_ID, PRODUCT_ID
        ) AS _row_num
    FROM sourcedata_V0S5
) AS ranked
WHERE _row_num = 1

Example 3: Remove Exact Duplicates (All Columns)

Input (.dsx format)

#### STAGE: RemoveExactDupes
## Operator
remdup
## General options
[ident('RemoveExactDupes'); jobmon_ident('RemoveExactDupes')]
## Inputs
0< [] 'RawData:input.v'
## Outputs
0> [] 'RemoveExactDupes:Output.v'

Generated SQL Output

SELECT DISTINCT
  CUSTOMER_ID,
  PRODUCT_ID,
  ORDER_DATE,
  AMOUNT
FROM rawdata_V0S12

How It Works

1. Identify Deduplication Strategy

No Key Columns (Exact Duplicates)

## Operator
remdup
# No -key options specified

Uses SELECT DISTINCT to remove rows where all columns match exactly.

With Key Columns

-key 'TYPE_CODE'
-key 'SOURCE_INFO'

Uses ROW_NUMBER() OVER (PARTITION BY ...) to remove rows where key columns match.

2. Determine Keep Option

  • -keep first (default): Keeps the first occurrence, removes subsequent duplicates

    • Orders by key columns ASC (default sort order)
    ORDER BY TYPE_CODE, SOURCE_INFO
  • -keep last: Keeps the last occurrence, removes earlier duplicates

    • Orders by key columns DESC
    ORDER BY TYPE_CODE DESC, SOURCE_INFO DESC

3. Apply Keep Clause

Filters output columns based on keep list in modify section:

keep
  MONTH_REF,DESC_FIELD_1,TYPE_CODE,
  SOURCE_INFO,LOAD_DATE;

Only these columns appear in final SELECT.

4. Generate SQL

For Exact Duplicates (No Keys)

SELECT DISTINCT
  <all_columns or keep_columns>
FROM input_table

For Key-Based Deduplication (Keep First)

SELECT
  <keep_columns or all_columns>
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY <key_columns>
            ORDER BY <key_columns>
        ) AS _row_num
    FROM input_table
) AS ranked
WHERE _row_num = 1

For Key-Based Deduplication (Keep Last)

SELECT
  <keep_columns or all_columns>
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY <key_columns>
            ORDER BY <key_columns> DESC
        ) AS _row_num
    FROM input_table
) AS ranked
WHERE _row_num = 1

5. Preserve Schema

Output schema matches the columns in keep clause (or all input columns if no keep clause).

Deduplication Strategies

Exact Duplicate Removal

Use when you want to remove rows that are completely identical across all columns.

Use Case:

Input:
ID  NAME   VALUE
1   Alice  100
1   Alice  100  ← Duplicate (all columns match)
2   Bob    200

Output:
ID  NAME   VALUE
1   Alice  100
2   Bob    200

Key-Based Deduplication (Keep First)

Use when you want to keep the first occurrence of rows with matching key columns.

Use Case:

Input:
TYPE  SOURCE  VALUE
A     SYS1    100   ← Keep (first occurrence)
A     SYS1    150   ← Remove
B     SYS2    200

Output:
TYPE  SOURCE  VALUE
A     SYS1    100
B     SYS2    200

Key-Based Deduplication (Keep Last)

Use when you want to keep the last occurrence of rows with matching key columns.

Use Case:

Input:
TYPE  SOURCE  VALUE
A     SYS1    100   ← Remove
A     SYS1    150   ← Keep (last occurrence)
B     SYS2    200

Output:
TYPE  SOURCE  VALUE
A     SYS1    150
B     SYS2    200

Composite Keys

You can deduplicate on multiple key columns:

Input

-key 'REGION_CODE'
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
-keep first

Generated SQL

SELECT
  REGION_CODE,
  CUSTOMER_ID,
  PRODUCT_ID,
  ORDER_DATE,
  AMOUNT
FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY REGION_CODE, CUSTOMER_ID, PRODUCT_ID
            ORDER BY REGION_CODE, CUSTOMER_ID, PRODUCT_ID
        ) AS _row_num
    FROM input_table
) AS ranked
WHERE _row_num = 1

Only rows with all three keys matching are considered duplicates.

Keep Clause

The keep clause in the modify section filters which columns appear in output:

Without Keep Clause

## Outputs
0> [] 'RemoveDuplicate:output.v'

All columns from input are included in output.

With Keep Clause

## Outputs
0> [modify (
keep
  MONTH_REF,TYPE_CODE,SOURCE_INFO,
  LOAD_DATE,JOB_NAME;
)] 'RemoveDuplicate:output.v'

Only specified columns appear in output SELECT.

Limitations

  • ⚠️ Keep last with no natural order - When -keep last is used without a timestamp column, the result may depend on physical row order. Consider adding an explicit ORDER BY timestamp if available.
  • ⚠️ Multiple inputs not supported - RemoveDuplicates stage must have exactly one input
  • ⚠️ No custom ORDER BY - When using key columns, ordering is always by the key columns themselves. For timestamp-based "latest" logic, consider using AGGREGATOR with MAX(timestamp) or add a timestamp to the ORDER BY clause manually.
  • ⚠️ ORDER BY uses key columns only - The ORDER BY clause uses the same columns as PARTITION BY. If you need to keep the row with the latest timestamp, you may need to modify the generated SQL or use a different approach.

Related Transformers

  • AggregatorStageTransformer - For grouping with aggregations (COUNT, SUM, etc.)
  • FunnelStageTransformer - For combining multiple inputs and removing duplicates
  • ModifyStageTransformer - For column filtering after deduplication