[EN] AWS S3
Object storage service offered by Amazon Web Services, Inc. Dadosfera replicates the data and loads it into the platform.
Functionalities
Feature | Supported | Notes |
---|---|---|
Set pipeline frequency | Yes | |
Incremental Synchronization | Yes | |
Full load Synchronization | No | |
Entity Selection | No | |
Column Selection | No | |
Micro-transformation: Hash | No | |
File Types
The currently supported file types are: CSV, JSON, and Parquet.
Only one file type can be loaded at a time.
Quick Guide
To start creating a Pipeline, go to the "Collect" module, then "Pipelines", and click "New Pipeline".
Choose the Data Source
Use an already registered source or register a new one
Connection Parameters (source registration)
Field Name | Description | Example |
---|---|---|
Access Key ID | Access key provided by an IAM user or AWS account root user. Should have 20 characters. | ACHLNDKM6AIPSWH3TP |
Secret Access Key | Secret key provided along with the AWS Access Key. Should have 40 characters. | KwYmQq/zZQAjc+pMRiQ |
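If you want to confirm, before registering the source, that this key pair can read the bucket, a quick check with Python's boto3 library might look like the sketch below (the bucket name and credential values are placeholders):

```python
import boto3
from botocore.exceptions import ClientError

# Placeholder credentials -- use the same values you plan to register in Dadosfera
session = boto3.Session(
    aws_access_key_id="YOUR_ACCESS_KEY_ID",
    aws_secret_access_key="YOUR_SECRET_ACCESS_KEY",
)
s3 = session.client("s3")

try:
    # List a few objects to confirm the IAM user can read the bucket
    response = s3.list_objects_v2(Bucket="my_bucket", MaxKeys=5)
    for obj in response.get("Contents", []):
        print(obj["Key"])
except ClientError as err:
    print(f"Access check failed: {err}")
```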
Pipeline Information
Assign a name and a brief description to your Pipeline.
Pipeline Configuration Parameters (CSV)
Fill in the fields below with the parameters and click "Continue".
Field Name | Description | Example |
---|---|---|
File Type | CSV | - |
Name | Descriptive name of the file | My spreadsheet |
Bucket | Name of the source bucket of the file | my_bucket |
File Name to Extract | Path to the file, i.e., everything after the bucket name. (Note: the file name must not contain spaces or special characters, otherwise the collection step will fail) | mydata/2021/data.csv |
Separator | Delimiter character of columns in the file | ";", ",", "tab", "/" |
Encoding Type | Standard of the character set | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE |
Enable Header | Does the file have a header? | - |
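To sanity-check the separator and encoding before configuring the pipeline, you can read a local copy of the file with the same parameters. A minimal sketch with pandas (file name and parameter values are placeholders):

```python
import pandas as pd

# Same values you plan to set in the pipeline:
# sep matches "Separator", encoding matches "Encoding Type",
# header=0 corresponds to "Enable Header" being on.
df = pd.read_csv("data.csv", sep=";", encoding="utf-8", header=0)
print(df.head())
```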
CSV with Multiline
If your CSV contains data with line breaks inside a field, it will only be collected correctly by Dadosfera if that field is enclosed in quotes. Example:
```
name,age,address
john,29,"19
Street Ohio"
```
Multiline not accepted:
```
name,age,address
john, 29, 19
Street Ohio
```
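The difference can be verified with Python's csv module (used here only for illustration, outside the product):

```python
import csv
import io

# A field containing a line break is only read back as one record when quoted.
quoted = 'name,age,address\njohn,29,"19\nStreet Ohio"\n'
print(list(csv.reader(io.StringIO(quoted))))
# [['name', 'age', 'address'], ['john', '29', '19\nStreet Ohio']]

# Without quotes, the line break splits the record into two broken rows.
unquoted = 'name,age,address\njohn,29,19\nStreet Ohio\n'
print(list(csv.reader(io.StringIO(unquoted))))
# [['name', 'age', 'address'], ['john', '29', '19'], ['Street Ohio']]
```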
Pipeline Configuration Parameters (JSON)
Field Name | Description | Example |
---|---|---|
File Type | JSON | - |
Name | Descriptive name of the file | My spreadsheet |
Bucket | Name of the source bucket of the file | my_bucket |
File Name to Extract | Path to the file, i.e., everything after the bucket name. (Note: the file name must not contain spaces or special characters, otherwise the collection step will fail) | mydata/2021/data.json |
Encoding Type | Standard of the character set | UTF-8, ISO-8859-1, US-ASCII, UTF-16BE, UTF-16, UTF-16LE |
Pipeline Configuration Parameters (Parquet)
Field Name | Description | Example |
---|---|---|
File Type | Parquet | - |
Name | Descriptive name of the file | My spreadsheet |
Bucket | Name of the source bucket of the file | my_bucket |
File Name to Extract | Path to the file, i.e., everything after the bucket name. (Note: the file name must not contain spaces or special characters, otherwise the collection step will fail) | mydata/2021/data.parquet |
Special Characters:
Even if the "Encoding Type" is adjusted, special characters may not appear in the catalog by default.
General Observations on Data Types:
A column will only be recognized as the number type when all elements of the column are whole numbers.
A column will only be recognized as the float type when all elements of the column follow the floating-point format, i.e., a decimal number with a period as the decimal separator (e.g., xxxx.xx).
Using commas as thousands separators in numeric values (x,xxx.xx) causes the column to be converted to the text type.
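As a local illustration of these rules (using pandas purely as an analogy; the actual type inference is performed by Dadosfera during cataloging):

```python
import io
import pandas as pd

csv_data = io.StringIO(
    'whole,floating,formatted\n'
    '10,1234.56,"1,234.56"\n'
    '25,0.5,"2,000.00"\n'
)
df = pd.read_csv(csv_data)
print(df.dtypes)
# whole        int64    <- all whole numbers: recognized as a number type
# floating   float64    <- all values follow xxxx.xx: recognized as float
# formatted   object    <- thousands separators turn the column into text
```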
Frequency
Finally, configure the desired frequency for your pipeline to run. You can choose from the presented options or enter a custom frequency using a cron expression.
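As an illustration, a cron expression for a pipeline that runs every day at 06:00 would be (the five fields are minute, hour, day of month, month, and day of week):

```
0 6 * * *
```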
Incremental Data Extraction
When working with Amazon S3, we often encounter a hierarchical organization of data. A common structure is to have a main directory (e.g., "data") and within it, several subdirectories that segment the data, often by periods such as months or years.
Example structure:

```
data/
|-- 2023/
|   |-- 01/
|   |   |-- 0.csv
|   |   |-- 1.csv
|   |-- 02/
|       |-- 0.csv
|       |-- 1.csv
```
In the example above, data is our main directory and 2023/01/ and 2023/02/ are subdivisions, often called "partitions." Each partition contains individual data files.
In this incremental approach, we create a single pipeline that points to the main directory (data). This pipeline is smart enough to identify and automatically extract all the files from the subdivisions or partitions.
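Conceptually, this is equivalent to listing every object under the main prefix, as in the boto3 sketch below (the bucket name and prefix are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# List every file under the main directory ("data/"), regardless of partition.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my_bucket", Prefix="data/"):
    for obj in page.get("Contents", []):
        print(obj["Key"])  # e.g. data/2023/01/0.csv, data/2023/02/1.csv, ...
```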
Done! Now just wait for the collection to run at the scheduled date and time.
If you want to execute the pipeline immediately, it is possible to do so manually. Go to "Pipelines", "List" and "Synchronize Pipeline".
After a few minutes, your pipeline will be cataloged in the exploration tab as a Data Asset.
The table will be cataloged with an added _processing_timestamp column. Thus, in incremental collections where values are overwritten, it is possible to retrieve the latest value of each record. In addition, this column, which stores the date when the record was written, helps with data analysis.
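For example, once the Data Asset is available, keeping only the most recent version of each record could look like the pandas sketch below (the id and value columns are hypothetical; only _processing_timestamp is added by Dadosfera):

```python
import pandas as pd

# Hypothetical cataloged table with an "id" key and the _processing_timestamp column.
df = pd.DataFrame({
    "id": [1, 1, 2],
    "value": ["old", "new", "only"],
    "_processing_timestamp": pd.to_datetime(
        ["2024-01-01", "2024-02-01", "2024-01-15"]
    ),
})

# Keep only the most recently processed row for each record.
latest = df.sort_values("_processing_timestamp").drop_duplicates("id", keep="last")
print(latest)
```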
Under "View Pipeline", you can also check the pipeline details in the pipeline list, such as the summary, the list of entities and collected columns, the execution history, and micro-transformations.