ImportStageTransformer

Overview

Transforms DataStage IMPORT stages (CSV/sequential file readers) to SQL SELECT statements. Maps file paths to pre-loaded Snowflake tables.

Stage Type: INPUT (import operator with PxSequentialFile)
Output: SQL SELECT from mapped Snowflake table

Recommended Approach

For IMPORT stages reading CSV or sequential files, we strongly recommend:

  1. Ingest the files into Snowflake using the Dadosfera Ingestion module
  2. Map the file path to the target Snowflake table in the input mapping configuration
  3. The converter will generate a simple SELECT from the target table

This approach ensures data is already loaded in Snowflake before the converted SQL runs.

Example Workflow

Original (DataStage):

File: /data/input/sales_data.csv
Schema: ID, NAME, AMOUNT

Recommended:

  1. Ingest the file using the Dadosfera Ingestion module:

    • Source: /data/input/sales_data.csv
    • Target: PUBLIC.SALES_DATA_TABLE
    • Run ingestion pipeline
  2. Configure input mapping:

    {
      "/data/input/sales_data.csv": "PUBLIC.SALES_DATA_TABLE"
    }
  3. Converter generates:

    SELECT ID, NAME, AMOUNT FROM PUBLIC.SALES_DATA_TABLE

Capabilities

  • Maps CSV/sequential file paths to Snowflake tables
  • Preserves column schema from file definition
  • Handles delimited files (comma, semicolon, tab, etc.)
  • Supports nullable columns
  • Column list from inline schema or -firstLineColumnNames

DataStage Stage Example

Input (.dsx format)

#### STAGE: ImportFileData
## Operator
import
## Operator options
-schema record
  {final_delim=end, record_delim_string='\\r\\n', delim=';', quote=none}
(
  REGION_CODE:nullable ustring[max=50];
  SEGMENT_CODE:nullable ustring[max=50];
  PRODUCT_NAME:nullable ustring[max=255];
  VALUE_AMOUNT:nullable ustring[max=255];
  STATUS_FLAG:nullable ustring[max=10];
)
-rejects continue
-reportProgress yes
-firstLineColumnNames
-file '[&\"$PathDataFiles\"]/input_data/processed/[&\"file_name\"].csv'

## General options
[ident('ImportFileData'); jobmon_ident('ImportFileData')]
## Outputs
0> [] 'ImportFileData:output.v'

Input Mapping Configuration

{
  "[&\"$PathDataFiles\"]/input_data/processed/[&\"file_name\"].csv": "PUBLIC.INPUT_DATA_TABLE"
}

Generated SQL Output

SELECT
  REGION_CODE,
  SEGMENT_CODE,
  PRODUCT_NAME,
  VALUE_AMOUNT,
  STATUS_FLAG
FROM PUBLIC.INPUT_DATA_TABLE

Output Schema

[
  {"name": "REGION_CODE", "type": "string"},
  {"name": "SEGMENT_CODE", "type": "string"},
  {"name": "PRODUCT_NAME", "type": "string"},
  {"name": "VALUE_AMOUNT", "type": "string"},
  {"name": "STATUS_FLAG", "type": "string"}
]

How It Works

1. Extract File Path

Parses the -file option to identify the source file:

-file '[&\"$PathDataFiles\"]/input_data/processed/sales_2024.csv'
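
A minimal Python sketch of what this extraction could look like (the function name and regex are illustrative, not the converter's actual implementation):

import re

# Illustrative sketch: the converter's real parsing logic may differ.
def extract_file_path(operator_options: str) -> str | None:
    # Capture the single-quoted value that follows the -file option.
    match = re.search(r"-file\s+'([^']+)'", operator_options)
    return match.group(1) if match else None

options = "-file '/data/input/sales_data.csv'"
print(extract_file_path(options))  # /data/input/sales_data.csv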

2. Lookup Snowflake Table

Uses the input mapping to find the corresponding Snowflake table:

{
  "[&\"$PathDataFiles\"]/input_data/processed/sales_2024.csv": "PUBLIC.SALES_2024_TABLE"
}
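
Conceptually this is a plain dictionary lookup against the input mapping configuration; a hedged Python sketch (the names input_mapping and lookup_table are illustrative):

input_mapping = {
    "/data/input/sales_data.csv": "PUBLIC.SALES_DATA_TABLE",
}

def lookup_table(file_path: str) -> str:
    # An unmapped file cannot be converted: its data was never ingested.
    if file_path not in input_mapping:
        raise KeyError(f"No Snowflake table mapped for file: {file_path}")
    return input_mapping[file_path]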

3. Extract Schema

Reads column definitions from inline schema:

-schema record
  {final_delim=end, record_delim_string='\\r\\n', delim=';', quote=none}
(
  REGION_CODE:nullable ustring[max=50];
  SEGMENT_CODE:nullable ustring[max=50];
  ...
)
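
A simplified Python sketch of pulling column names and nullability out of the schema block (illustrative only; the converter's real parser handles the full record syntax):

import re

def parse_schema_columns(schema_body: str) -> list[dict]:
    # Matches definitions such as "REGION_CODE:nullable ustring[max=50];"
    pattern = re.compile(r"^\s*(\w+):(nullable\s+)?(\w+)", re.MULTILINE)
    return [
        {"name": name, "nullable": bool(nullable), "datastage_type": ds_type}
        for name, nullable, ds_type in pattern.findall(schema_body)
    ]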

4. Generate SELECT Statement

Creates a SQL statement selecting the specified columns from the mapped table:

SELECT
  REGION_CODE,
  SEGMENT_CODE,
  ...
FROM PUBLIC.SALES_2024_TABLE
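
The final assembly step is simple string building; a minimal sketch (the function name build_select is illustrative):

def build_select(columns: list[str], table: str) -> str:
    column_list = ",\n  ".join(columns)
    return f"SELECT\n  {column_list}\nFROM {table}"

print(build_select(["REGION_CODE", "SEGMENT_CODE"], "PUBLIC.SALES_2024_TABLE"))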

File Options

Delimiter Configuration

-schema record
  {delim=';'}          ← Semicolon delimiter
  {delim=','}          ← Comma delimiter (CSV)
  {delim='\\t'}        ← Tab delimiter (TSV)

Record Delimiter

{record_delim_string='\\r\\n'}   ← Windows line endings (CRLF)
{record_delim_string='\\n'}      ← Unix line endings (LF)

Quote Handling

{quote='"'}          ← Double quote for text fields
{quote=none}         ← No quoting

First Line Options

-firstLineColumnNames    ← Skip first row (contains headers)

Without this option, every row, including the header line, is treated as data.
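
Because the recommended workflow loads files before the converted SQL runs, these parse options apply at ingestion time rather than in the generated SELECT. Assuming the ingestion pipeline reads CSVs through a Snowflake file format (an assumption about the ingestion backend, not something the converter emits), the options correspond roughly as follows:

# Rough, illustrative correspondence between DataStage parse options and
# Snowflake CSV file format options; actual ingestion settings may differ.
DATASTAGE_TO_SNOWFLAKE_CSV_OPTIONS = {
    "delim=';'": "FIELD_DELIMITER = ';'",
    "record_delim_string='\\r\\n'": "RECORD_DELIMITER = '\\r\\n'",
    "quote='\"'": "FIELD_OPTIONALLY_ENCLOSED_BY = '\"'",
    "quote=none": "FIELD_OPTIONALLY_ENCLOSED_BY = NONE",
    "-firstLineColumnNames": "SKIP_HEADER = 1",
}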

Column Type Mapping

DataStage types in schema are converted to Snowflake types:

DataStage Type    Snowflake Type
ustring[max=N]    string
string[max=N]     string
int32             number
int64             number
decimal[p,s]      number
date              date
timestamp         timestamp

All types can be marked nullable.
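
The mapping above, expressed as a small Python lookup (a sketch; the dictionary and function names are illustrative and the converter's internal structure may differ):

DATASTAGE_TO_SNOWFLAKE_TYPES = {
    "ustring": "string",
    "string": "string",
    "int32": "number",
    "int64": "number",
    "decimal": "number",
    "date": "date",
    "timestamp": "timestamp",
}

def map_type(ds_type: str) -> str:
    # Drop length/precision qualifiers such as "[max=50]" or "[10,2]".
    base = ds_type.split("[", 1)[0]
    return DATASTAGE_TO_SNOWFLAKE_TYPES.get(base, "string")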

Input Mapping Examples

Simple File Mapping

{
  "/data/sales.csv": "PUBLIC.SALES_TABLE"
}

Parameterized File Path

DataStage often uses parameters in file paths:

-file '[&\"$BasePath\"]/input/[&\"region\"]_data.csv'

Map these paths either by replacing the parameters with their actual values or by using pattern matching:

{
  "[&\"$BasePath\"]/input/[&\"region\"]_data.csv": "PUBLIC.REGION_DATA_TABLE"
}

Multiple Files

{
  "/data/sales_2023.csv": "PUBLIC.SALES_2023",
  "/data/sales_2024.csv": "PUBLIC.SALES_2024",
  "/data/customers.csv": "PUBLIC.CUSTOMERS"
}

Reject Handling

-rejects continue

A DataStage option that continues processing when malformed records are encountered. In the converted pipeline this is handled during the ingestion phase, not in the SQL conversion.
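
If the ingestion pipeline loads files with Snowflake COPY INTO (an assumption about the ingestion backend), the closest equivalent is the ON_ERROR copy option, sketched below:

# Hedged illustration: "-rejects continue" roughly corresponds to telling the
# load to skip malformed rows instead of aborting, e.g. ON_ERROR = 'CONTINUE'
# in a Snowflake COPY INTO statement issued by the ingestion pipeline.
INGESTION_ON_ERROR = "CONTINUE"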

Limitations

  • ⚠️ Requires input mapping - File paths must be mapped to Snowflake tables; no automatic file reading in SQL
  • ⚠️ Pre-ingestion required - Files must be loaded into Snowflake using the Dadosfera Ingestion module before SQL execution
  • ⚠️ Parameter resolution - DataStage job parameters in file paths must be resolved or mapped appropriately
  • ⚠️ No file format conversion - CSV parsing, delimiter handling, and data type conversions must be done during ingestion, not in SQL
  • ⚠️ Reject records - Malformed record handling must be configured in the ingestion pipeline

Best Practices

When to Use IMPORT Transformer

  • Simple file reads with pre-defined schema
  • Files already ingested into Snowflake tables
  • Straightforward column mappings (1:1 from file to table)

Pre-Ingestion with Dadosfera

Always use the Dadosfera Ingestion module to:

  1. Load CSV/sequential files into Snowflake tables
  2. Handle schema mapping and data type conversions
  3. Configure error handling for malformed records
  4. Set up incremental loads if needed

Then configure input_mapping to point to the ingested tables.

Related Transformers

  • InputStageTransformer - For Oracle database inputs (not files)
  • CopyStageTransformer - For reading DataStage datasets (not files)
  • ModifyStageTransformer - For post-import column filtering/renaming