AWS S3

An object storage service offered by Amazon Web Services, Inc. Dadosfera replicates the data and loads it into the Dadosfera platform.

Functionalities

Feature | Supported | Notes
Set pipeline frequency | Yes | -
Incremental Synchronization | Yes | -
Full load Synchronization | No | -
Entity Selection | No | -
Column Selection | No | -
Micro-transformation: Hash | No | -

📘 File Types

The currently supported file types are: CSV, JSON, and Parquet.

Only one file type can be loaded at a time.

Quick Guide

To start creating a Pipeline, go to the "Collect" module, select "Pipelines", and click "New Pipeline".

Choose the Data Source

Use an already registered source or register a new one.

Connection Parameters (source registration)

Field Name | Description | Example
Access Key ID | Access key provided by an IAM user or the AWS account root user. It should be 20 characters long. | ACHLNDKM6AIPSWH3TP
Secret Access Key | Secret key provided along with the AWS Access Key ID. It should be 40 characters long. | KwYmQq/zZQAjc+pMRiQ
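
Before registering the source, it can help to confirm that the key pair can actually reach the bucket. Below is a minimal sketch using boto3 (outside of Dadosfera), reusing the illustrative credentials and bucket name from this guide; replace them with your own values.

import boto3
from botocore.exceptions import ClientError

# Illustrative values only; use your own credentials and bucket
ACCESS_KEY_ID = "ACHLNDKM6AIPSWH3TP"
SECRET_ACCESS_KEY = "KwYmQq/zZQAjc+pMRiQ"
BUCKET = "my_bucket"

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
)

try:
    # Fails if the credentials are invalid or lack s3:ListBucket permission on the bucket
    response = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1)
    print("Credentials OK:", response.get("KeyCount", 0), "object(s) visible")
except ClientError as error:
    print("Check the key pair or the bucket permissions:", error)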

Pipeline Information

Assign a name and a brief description to your Pipeline.

Pipeline Configuration Parameters (CSV)

Fill in the fields below with the parameters and click "Continue".

Field Name | Description | Example
File Type | CSV | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.csv
Separator | Delimiter character of the columns in the file | ";", ",", "tab", "/"
Encoding Type | Character set standard | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE
Enable Header | Whether the file has a header row | -
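
The CSV parameters above map directly onto an ordinary CSV read. The sketch below is not how Dadosfera collects the file internally; it is just a boto3/pandas illustration of what Separator, Encoding Type, and Enable Header control, using the example bucket and path from the table.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Example bucket and key from the table above
obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.csv")

df = pd.read_csv(
    io.BytesIO(obj["Body"].read()),
    sep=";",           # Separator
    encoding="utf-8",  # Encoding Type
    header=0,          # Enable Header = yes; use header=None if the file has no header row
)
print(df.head())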

📘 CSV with Multiline

If your CSV has field values that break across lines, they will only be collected correctly by Dadosfera if those values are enclosed in quotes. Example:

name,age,address
john,29,"19 Street
Ohio"

Multiline not accepted:

name,age,address
john,
29,
19 Street Ohio
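
If you produce the CSV yourself, most libraries quote multiline fields automatically. A minimal Python sketch that writes the accepted example above:

import csv

rows = [
    ["name", "age", "address"],
    ["john", 29, "19 Street\nOhio"],  # the address value spans two lines
]

# QUOTE_MINIMAL quotes any field containing the delimiter, a quote character, or a newline,
# which produces the accepted multiline format shown above
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)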

Pipeline Configuration Parameters (JSON)

Field Name | Description | Example
File Type | JSON | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.json
Encoding Type | Character set standard | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE

Pipeline Configuration Parameters (Parquet)

Field Name | Description | Example
File Type | Parquet | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.parquet
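
As with CSV, the JSON and Parquet parameters correspond to a plain file read. The sketch below uses boto3 and pandas (with pyarrow for Parquet) only as an illustration, assuming the example bucket and hypothetical file names; it is not Dadosfera's internal collector.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Hypothetical keys matching the tables above
json_obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.json")
df_json = pd.read_json(io.BytesIO(json_obj["Body"].read()), encoding="utf-8")

parquet_obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.parquet")
df_parquet = pd.read_parquet(io.BytesIO(parquet_obj["Body"].read()))  # requires pyarrow or fastparquet

print(df_json.shape, df_parquet.shape)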

📘 Special Characters

Even if the "Encoding Type" is adjusted, special characters may not appear in the catalog by default.

📘 General Observations on Data Types

  • A column will only be recognized as a number type when all elements of the column are whole numbers.

  • A column will only be recognized as a float type when all elements of the column follow the floating-point format, i.e., a decimal number that uses a period as the decimal separator (e.g., xxxx.xx).

  • The use of commas as thousands separators (x,xxx.xx) results in the column being converted to the text type (see the sketch after this list).
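
The rules above match how common CSV parsers infer column types. A small pandas illustration (pandas here is just an example parser, not necessarily the engine Dadosfera uses):

import io

import pandas as pd

sample = io.StringIO(
    "a,b,c\n"
    '1,1.5,"1,500.00"\n'
    '2,2.25,"2,300.10"\n'
)

df = pd.read_csv(sample)
print(df.dtypes)
# a      int64   -> whole numbers only, recognized as a number type
# b    float64   -> decimal point format, recognized as float
# c     object   -> comma-formatted values (x,xxx.xx) fall back to text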

Frequency

  • Finally, configure the desired frequency for your pipeline to run. You can choose from the presented options or insert a custom frequency using a cron expression. To learn more, click here.
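
For reference, a custom frequency uses the standard five-field cron syntax (minute, hour, day of month, month, day of week); check the linked page for the exact format accepted by Dadosfera. A few common examples:

0 6 * * *       # every day at 06:00
0 */4 * * *     # every 4 hours
30 2 * * 1      # every Monday at 02:30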

📘 Incremental Data Extraction

When working with Amazon S3, we often encounter a hierarchical organization of data. A common structure is to have a main directory (e.g., "data") and within it, several subdirectories that segment the data, often by periods such as months or years.
Example structure:

data/
|-- 2023/
    |-- 01/
        |-- 0.csv
        |-- 1.csv
    |-- 02/
        |-- 0.csv
        |-- 1.csv

In the example above, data is our main directory and 2023/01/ and 2023/02/ are subdivisions, often called "partitions." Each partition contains individual data files.

In this incremental approach, we create a single pipeline that points to the main directory (data). This pipeline is smart enough to identify and automatically extract all the files from the subdivisions or partitions.
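
Conceptually, the pipeline walks every object under the main prefix. The equivalent listing with boto3 (shown only to illustrate the idea, not Dadosfera's actual implementation) would look like the sketch below, assuming the example bucket name my_bucket.

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my_bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])  # e.g. data/2023/01/0.csv, data/2023/01/1.csv, data/2023/02/0.csv, ...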

That's it! Now just wait for the collection to run at the scheduled day and time.

If you want to execute the pipeline immediately, you can do so manually: go to "Pipelines", then "List", and click "Synchronize Pipeline".

After a few minutes, your pipeline will be cataloged in the exploration tab as a Data Asset.

📘

The table is cataloged with an additional column, _processing_timestamp. In incremental collections where values are overwritten, this makes it possible to retrieve the latest value of each record. The column also indicates when each record was written, which assists in data analysis.
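
As an illustration of how _processing_timestamp helps in incremental collections, the pandas sketch below keeps only the most recent version of each record; the id column is a hypothetical business key used just for the example.

import pandas as pd

# Tiny illustrative table; "id" is a hypothetical business key
df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "value": ["old", "new", "only"],
        "_processing_timestamp": ["2024-01-01 10:00", "2024-01-02 09:00", "2024-01-01 12:00"],
    }
)
df["_processing_timestamp"] = pd.to_datetime(df["_processing_timestamp"])

# Keep only the latest written version of each record
latest = df.sort_values("_processing_timestamp").drop_duplicates(subset="id", keep="last")
print(latest)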

You can also check the pipeline details under "View Pipeline" in the pipeline list, such as the summary, the list of entities and collected columns, the execution history, and micro-transformations.