AWS S3

An object storage service offered by Amazon Web Services, Inc. Dadosfera replicates the data and loads it into the Dadosfera platform.

Functionalities

Feature | Supported | Notes
Set pipeline frequency | Yes | -
Incremental Synchronization | Yes | -
Full load Synchronization | No | -
Entity Selection | No | -
Column Selection | No | -
Micro-transformation: Hash | No | -

📘 File Types

The currently supported file types are: CSV, JSON, and Parquet.

Only one file type can be loaded at a time.

Quick Guide

To start creating a Pipeline, go to the "Collect" module, select "Pipelines", and click "New Pipeline".

Choose the Data Source

Use an already registered source or register a new one.

Connection Parameters (source registration)

Field Name | Description | Example
Access Key ID | Access key provided by an IAM user or the AWS account root user. It should be 20 characters long. | ACHLNDKM6AIPSWH3TP
Secret Access Key | Secret key provided along with the AWS Access Key ID. It should be 40 characters long. | KwYmQq/zZQAjc+pMRiQ
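
Before registering the source, it can help to confirm that the key pair can actually reach the bucket. Below is a minimal sketch using boto3 (outside of Dadosfera), reusing the illustrative credentials and bucket name from this guide; replace them with your own values.

import boto3
from botocore.exceptions import ClientError

# Illustrative values only; use your own credentials and bucket
ACCESS_KEY_ID = "ACHLNDKM6AIPSWH3TP"
SECRET_ACCESS_KEY = "KwYmQq/zZQAjc+pMRiQ"
BUCKET = "my_bucket"

s3 = boto3.client(
    "s3",
    aws_access_key_id=ACCESS_KEY_ID,
    aws_secret_access_key=SECRET_ACCESS_KEY,
)

try:
    # Fails if the credentials are invalid or lack s3:ListBucket permission on the bucket
    response = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=1)
    print("Credentials OK:", response.get("KeyCount", 0), "object(s) visible")
except ClientError as error:
    print("Check the key pair or the bucket permissions:", error)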

Pipeline Information

Assign a name and a brief description to your Pipeline.

Pipeline Configuration Parameters (CSV)

Fill in the fields below with the parameters and click "Continue".

Field Name | Description | Example
File Type | CSV | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.csv
Separator | Delimiter character of the columns in the file | ";", ",", "tab", "/"
Encoding Type | Character set standard | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE
Enable Header | Whether the file has a header row | -
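
The CSV parameters above map directly onto an ordinary CSV read. The sketch below is not how Dadosfera collects the file internally; it is just a boto3/pandas illustration of what Separator, Encoding Type, and Enable Header control, using the example bucket and path from the table.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Example bucket and key from the table above
obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.csv")

df = pd.read_csv(
    io.BytesIO(obj["Body"].read()),
    sep=";",           # Separator
    encoding="utf-8",  # Encoding Type
    header=0,          # Enable Header = yes; use header=None if the file has no header row
)
print(df.head())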

📘 CSV with Multiline

If your CSV has field values that break across lines, they will only be collected correctly by Dadosfera if those values are enclosed in quotes. Example:

name,age,address
john,29,"19 Street
Ohio"

Multiline not accepted:

name,age,address
john,
29,
19 Street Ohio
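
If you produce the CSV yourself, most libraries quote multiline fields automatically. A minimal Python sketch that writes the accepted example above:

import csv

rows = [
    ["name", "age", "address"],
    ["john", 29, "19 Street\nOhio"],  # the address value spans two lines
]

# QUOTE_MINIMAL quotes any field containing the delimiter, a quote character, or a newline,
# which produces the accepted multiline format shown above
with open("data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, quoting=csv.QUOTE_MINIMAL).writerows(rows)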

Pipeline Configuration Parameters (JSON)

Field Name | Description | Example
File Type | JSON | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.json
Encoding Type | Character set standard | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE

Pipeline Configuration Parameters (Parquet)

Field Name | Description | Example
File Type | Parquet | -
Name | Descriptive name of the file | My spreadsheet
Bucket | Name of the source bucket of the file | my_bucket
File Name to Extract | Path to the file: everything after the bucket name. (Note: the file name must not contain spaces or special characters for the collection step to succeed.) | mydata/2021/data.parquet
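
As with CSV, the JSON and Parquet parameters correspond to a plain file read. The sketch below uses boto3 and pandas (with pyarrow for Parquet) only as an illustration, assuming the example bucket and hypothetical file names; it is not Dadosfera's internal collector.

import io

import boto3
import pandas as pd

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

# Hypothetical keys matching the tables above
json_obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.json")
df_json = pd.read_json(io.BytesIO(json_obj["Body"].read()), encoding="utf-8")

parquet_obj = s3.get_object(Bucket="my_bucket", Key="mydata/2021/data.parquet")
df_parquet = pd.read_parquet(io.BytesIO(parquet_obj["Body"].read()))  # requires pyarrow or fastparquet

print(df_json.shape, df_parquet.shape)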

📘 Special Characters

Even if the "Encoding Type" is adjusted, special characters may not appear in the catalog by default.

📘 General Observations on Data Types

  • A column will only be recognized as a number type when all elements of the column are whole numbers.

  • A column will only be recognized as a float type when all elements of the column follow the floating-point format, i.e., a decimal number that uses a period as the decimal separator (e.g., xxxx.xx).

  • The use of commas as thousands separators (x,xxx.xx) results in the column being converted to the text type (see the sketch after this list).
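
The rules above match how common CSV parsers infer column types. A small pandas illustration (pandas here is just an example parser, not necessarily the engine Dadosfera uses):

import io

import pandas as pd

sample = io.StringIO(
    "a,b,c\n"
    '1,1.5,"1,500.00"\n'
    '2,2.25,"2,300.10"\n'
)

df = pd.read_csv(sample)
print(df.dtypes)
# a      int64   -> whole numbers only, recognized as a number type
# b    float64   -> decimal point format, recognized as float
# c     object   -> comma-formatted values (x,xxx.xx) fall back to text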

Frequency

  • Finally, configure the desired frequency for your pipeline to run. You can choose from the presented options or insert a custom frequency using a cron expression. To learn more, click here.
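
For reference, a custom frequency uses the standard five-field cron syntax (minute, hour, day of month, month, day of week); check the linked page for the exact format accepted by Dadosfera. A few common examples:

0 6 * * *       # every day at 06:00
0 */4 * * *     # every 4 hours
30 2 * * 1      # every Monday at 02:30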

📘 Incremental Data Extraction

When working with Amazon S3, we often encounter a hierarchical organization of data. A common structure is to have a main directory (e.g., "data") and within it, several subdirectories that segment the data, often by periods such as months or years.
Example structure:

data/
|-- 2023/
    |-- 01/
        |-- 0.csv
        |-- 1.csv
    |-- 02/
        |-- 0.csv
        |-- 1.csv

In the example above, data is our main directory and 2023/01/ and 2023/02/ are subdivisions, often called "partitions." Each partition contains individual data files.

In this incremental approach, we create a single pipeline that points to the main directory (data). This pipeline is smart enough to identify and automatically extract all the files from the subdivisions or partitions.
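
Conceptually, the pipeline walks every object under the main prefix. The equivalent listing with boto3 (shown only to illustrate the idea, not Dadosfera's actual implementation) would look like the sketch below, assuming the example bucket name my_bucket.

import boto3

s3 = boto3.client("s3")  # assumes AWS credentials are already configured

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my_bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])  # e.g. data/2023/01/0.csv, data/2023/01/1.csv, data/2023/02/0.csv, ...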

That's it! Now just wait for the collection to run at the scheduled day and time.

If you want to execute the pipeline immediately, you can do so manually: go to "Pipelines", then "List", and click "Synchronize Pipeline".

After a few minutes, your pipeline will be cataloged in the exploration tab as a Data Asset.

📘

The table is cataloged with an additional column, _processing_timestamp. In incremental collections where values are overwritten, this makes it possible to retrieve the latest value of each record. The column also indicates when each record was written, which assists in data analysis.
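
As an illustration of how _processing_timestamp helps in incremental collections, the pandas sketch below keeps only the most recent version of each record; the id column is a hypothetical business key used just for the example.

import pandas as pd

# Tiny illustrative table; "id" is a hypothetical business key
df = pd.DataFrame(
    {
        "id": [1, 1, 2],
        "value": ["old", "new", "only"],
        "_processing_timestamp": ["2024-01-01 10:00", "2024-01-02 09:00", "2024-01-01 12:00"],
    }
)
df["_processing_timestamp"] = pd.to_datetime(df["_processing_timestamp"])

# Keep only the latest written version of each record
latest = df.sort_values("_processing_timestamp").drop_duplicates(subset="id", keep="last")
print(latest)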

You can also check the pipeline details under "View Pipeline" in the pipeline list, such as the summary, the list of entities and collected columns, the execution history, and micro-transformations.