AggregatorStageTransformer

Passo a passo de como iniciar sua jornada na Dadosfera

AggregatorStageTransformer

Overview

Transforms DataStage AGGREGATOR stages to SQL GROUP BY statements. Performs grouping and aggregation operations (COUNT, SUM, AVG, MAX, MIN).

Stage Type: AGGREGATOR (group operator) Output: SQL SELECT with GROUP BY

Capabilities

  • Single and multiple grouping keys
  • COUNT(*) aggregation
  • SUM, AVG, MAX, MIN aggregations
  • Automatic output schema generation
  • Supports composite GROUP BY clauses

DataStage Stage Example

Input (.dsx format)

#### STAGE: AggregateData
## Operator
group
## Operator options
-hash
-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'
-countField 'SALES_COUNT'
-sumField 'SALE_AMOUNT' 'TOTAL_SALES'
-avgField 'SALE_AMOUNT' 'AVG_SALES'
## General options
[ident('AggregateData'); jobmon_ident('AggregateData')]
## Inputs
0< [] 'SalesData:input.v'
## Outputs
0> [] 'AggregateData:Output.v'

Input Schema

[
  {"name": "CUSTOMER_ID", "type": "string"},
  {"name": "PRODUCT_ID", "type": "string"},
  {"name": "SALE_AMOUNT", "type": "number"},
  {"name": "SALE_DATE", "type": "date"}
]

Generated SQL Output

SELECT
  CUSTOMER_ID,
  PRODUCT_ID,
  COUNT(*) AS SALES_COUNT,
  SUM(SALE_AMOUNT) AS TOTAL_SALES,
  AVG(SALE_AMOUNT) AS AVG_SALES
FROM salesdata_V0S8
GROUP BY CUSTOMER_ID, PRODUCT_ID

Output Schema

[
  {"name": "CUSTOMER_ID", "type": "string"},
  {"name": "PRODUCT_ID", "type": "string"},
  {"name": "SALES_COUNT", "type": "number"},
  {"name": "TOTAL_SALES", "type": "number"},
  {"name": "AVG_SALES", "type": "number"}
]

How It Works

1. Identify Grouping Keys

Parses -key options to identify GROUP BY columns:

-key 'CUSTOMER_ID'
-key 'PRODUCT_ID'

Generates:

GROUP BY CUSTOMER_ID, PRODUCT_ID

2. Extract Aggregation Functions

Parses aggregation options:

OptionSQLExample
-countField 'ROW_COUNT'COUNT(*) AS ROW_COUNTCount rows per group
-sumField 'AMOUNT' 'TOTAL'SUM(AMOUNT) AS TOTALSum of AMOUNT column
-avgField 'AMOUNT' 'AVG_AMT'AVG(AMOUNT) AS AVG_AMTAverage of AMOUNT
-maxField 'VALUE' 'MAX_VAL'MAX(VALUE) AS MAX_VALMaximum VALUE
-minField 'VALUE' 'MIN_VAL'MIN(VALUE) AS MIN_VALMinimum VALUE

3. Build SELECT Clause

Combines grouping keys and aggregations:

SELECT
  <grouping_key_1>,
  <grouping_key_2>,
  COUNT(*) AS <count_field>,
  SUM(<input_col>) AS <sum_field>,
  ...
FROM input_table
GROUP BY <grouping_key_1>, <grouping_key_2>

4. Infer Output Schema

  • Grouping keys inherit types from input schema
  • COUNT fields typed as INTEGER
  • SUM/AVG fields typed as DECIMAL
  • MAX/MIN fields inherit input column type

Aggregation Functions

COUNT

-countField 'ROW_COUNT'

Generates:

COUNT(*) AS ROW_COUNT

Counts rows per group.

SUM

-sumField 'SALE_AMOUNT' 'TOTAL_SALES'

Generates:

SUM(SALE_AMOUNT) AS TOTAL_SALES

Sums values of SALE_AMOUNT column per group.

AVG

-avgField 'SALE_AMOUNT' 'AVG_SALES'

Generates:

AVG(SALE_AMOUNT) AS AVG_SALES

Calculates average of SALE_AMOUNT per group.

MAX

-maxField 'SALE_DATE' 'LATEST_DATE'

Generates:

MAX(SALE_DATE) AS LATEST_DATE

Finds maximum value per group.

MIN

-minField 'SALE_DATE' 'EARLIEST_DATE'

Generates:

MIN(SALE_DATE) AS EARLIEST_DATE

Finds minimum value per group.

Single Key Example

Input

-key 'REGION_CODE'
-countField 'REGION_COUNT'

Generated SQL

SELECT
  REGION_CODE,
  COUNT(*) AS REGION_COUNT
FROM input_table
GROUP BY REGION_CODE

Composite Key Example

Input

-key 'REGION_CODE'
-key 'CATEGORY_CODE'
-key 'PERIOD_CODE'
-countField 'ROW_COUNT'

Generated SQL

SELECT
  REGION_CODE,
  CATEGORY_CODE,
  PERIOD_CODE,
  COUNT(*) AS ROW_COUNT
FROM input_table
GROUP BY REGION_CODE, CATEGORY_CODE, PERIOD_CODE

Multiple Aggregations

You can combine multiple aggregation functions:

Input

-key 'CUSTOMER_ID'
-countField 'ORDER_COUNT'
-sumField 'ORDER_VALUE' 'TOTAL_VALUE'
-avgField 'ORDER_VALUE' 'AVG_VALUE'
-maxField 'ORDER_DATE' 'LAST_ORDER_DATE'
-minField 'ORDER_DATE' 'FIRST_ORDER_DATE'

Generated SQL

SELECT
  CUSTOMER_ID,
  COUNT(*) AS ORDER_COUNT,
  SUM(ORDER_VALUE) AS TOTAL_VALUE,
  AVG(ORDER_VALUE) AS AVG_VALUE,
  MAX(ORDER_DATE) AS LAST_ORDER_DATE,
  MIN(ORDER_DATE) AS FIRST_ORDER_DATE
FROM input_table
GROUP BY CUSTOMER_ID

Limitations

  • ⚠️ No HAVING clause - Post-aggregation filters not supported (use downstream FILTER stage)
  • ⚠️ No complex expressions - Aggregations only work on simple column references, not expressions
  • ⚠️ No DISTINCT aggregations - COUNT(DISTINCT col) not supported
  • ⚠️ Hash partitioning ignored - -hash option is DataStage-specific and not converted to SQL
  • ⚠️ No ROLLUP/CUBE - Advanced grouping sets not supported

Related Transformers

  • TransformerStageTransformer - For post-aggregation column derivations
  • ModifyStageTransformer - For post-aggregation column filtering
  • RemoveDuplicatesTransformer - For simple deduplication without aggregation