Files

Features

| Feature | Supported? |
| :--- | :--- |
| Full Refresh Sync | Yes |
| Incremental Sync | No |
| Replicate Incremental Deletes | No |
| Replicate Folders (multiple Files) | No |
| Replicate Glob Patterns (multiple Files) | No |
This source produces a single table for the target file, as it currently replicates only one file at a time. Note that you should provide the `dataset_name`, which dictates how the table will be identified in the destination (since the URL can contain complex characters).

Storage Providers

| Storage Providers | Supported? |
| :--- | :--- |
| HTTPS | Yes |
| Google Cloud Storage | Yes |
| Amazon Web Services S3 | Yes |
| SFTP | Yes |
| SSH / SCP | Yes |
| local filesystem | Local use only (inaccessible for Airbyte Cloud) |

File / Stream Compression

| Compression | Supported? |
| :--- | :--- |
| Gzip | Yes |
| Zip | No |
| Bzip2 | No |
| Lzma | No |
| Xz | No |
| Snappy | No |

File Formats

| Format | Supported? |
| :--- | :--- |
| CSV | Yes |
| JSON | Yes |
| HTML | No |
| XML | No |
| Excel | Yes |
| Excel Binary Workbook | Yes |
| Feather | Yes |
| Parquet | Yes |
| Pickle | No |
This connector does not support syncing unstructured data files such as raw text, audio, or videos.

Getting Started (Airbyte Cloud)

Setup through Airbyte Cloud is exactly the same as the open-source setup, except that local files are disabled.

Getting Started (Airbyte Open-Source)

  1. Once the File Source is selected, you should define both the storage provider along with its URL and the format of the file.
  2. Depending on the provider chosen and the privacy of the data, you will have to configure more options.

Provider Specific Information

  • In case of GCS, it is necessary to provide the content of the service account key file to access private buckets. See the settings of the BigQuery Destination.
  • In case of AWS S3, the pair of `aws_access_key_id` and `aws_secret_access_key` is necessary to access private S3 buckets (a configuration sketch follows this list).
  • In case of AzBlob, it is necessary to provide the `storage_account` in which the blob you want to access resides. Either a `sas_token` (info) or a `shared_key` (info) is necessary to access private blobs.
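As a loose illustration of the S3 case, the provider portion of a configuration might look roughly like this. Only `aws_access_key_id` and `aws_secret_access_key` come from the notes above; the `storage` field name, the surrounding structure, and the credential values are placeholders or assumptions, and the exact schema is defined by the connector's spec:

```python
# Hypothetical sketch of the provider section for a private S3 bucket.
# The credential values are placeholders; "storage" is an assumed field name.
provider = {
    "storage": "S3",
    "aws_access_key_id": "AKIAXXXXXXXXXXXXXXXX",
    "aws_secret_access_key": "****************",
}
```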

Reader Options

The reader in charge of loading the file format is currently based on Pandas IO Tools. It is possible to customize how the file is loaded into a Pandas DataFrame as part of this source connector. This is done through the `reader_options`, which should be in JSON format and depend on the chosen file format. See the pandas documentation for the chosen format:
For example, if the CSV format is selected, then options from the read_csv function are available.
  • It is therefore possible to customize the delimiter (or `sep`), for example to `\t` in the case of tab-separated files.
  • The header line can be ignored with `header=0` and column names customized with `names`.
  • etc.
We would therefore provide the following JSON in the `reader_options`:

```json
{ "sep" : "\t", "header" : 0, "names": "column1, column2"}
```
If you select the JSON format, then options from the read_json reader are available.
For example, you can use `{"orient" : "records"}` to change how the orientation of the data is interpreted (if the data looks like `[{column -> value}, … , {column -> value}]`).
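For illustration only, here is a rough sketch of how such reader options end up as keyword arguments to the pandas readers. This is not the connector's actual code, and the URLs are placeholders:

```python
import json

import pandas as pd

# reader_options arrive as a JSON object and are passed through as keyword
# arguments to the pandas reader matching the chosen format.
reader_options = json.loads('{"sep": "\\t", "header": 0}')

# CSV format -> pandas.read_csv (URL is a placeholder)
df_csv = pd.read_csv("https://example.com/data.tsv", **reader_options)

# JSON format -> pandas.read_json (URL is a placeholder)
df_json = pd.read_json("https://example.com/data.json", orient="records")
```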

Changing data types of source columns

Normally, Airbyte tries to infer the data type from the source, but you can use `reader_options` to force specific data types. If you input `{"dtype":"string"}`, all columns will be forced to be parsed as strings. If you only want a specific column to be parsed as a string, simply use `{"dtype" : {"column name": "string"}}`.
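For illustration, this is how those dtype options behave when forwarded to pandas.read_csv; the URL and column name are placeholders:

```python
import pandas as pd

# Force every column to be parsed as a string ...
df_all_strings = pd.read_csv("https://example.com/data.csv", dtype="string")

# ... or force only one specific column.
df_one_string = pd.read_csv("https://example.com/data.csv", dtype={"column name": "string"})
```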

Examples

Here is a list of examples of possible file inputs:

| Dataset Name | Storage | URL | Reader Impl | Service Account | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| hr_and_financials | GCS | gs://airbyte-vault/financial.csv | smart_open or gcfs | {"type": "service_account", "private_key_id": "XXXXXXXX", ...} | data from a private bucket, a service account is necessary |
| landsat_index | GCS | gcp-public-data-landsat/index.csv.gz | smart_open | | Using smart_open, we don't need to specify the compression (note the gs:// is optional too, same for other providers) |
Examples with reader options:

| Dataset Name | Storage | URL | Reader Impl | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- |
| landsat_index | GCS | gs://gcp-public-data-landsat/index.csv.gz | GCFS | {"compression": "gzip"} | Additional reader options to specify a compression option to read_csv |
| GDELT | S3 | s3://gdelt-open-data/events/20190914.export.csv | | {"sep": "\t", "header": null} | TSV data separated by tabs, without a header row, from AWS Open Data |
| server_logs | local | /local/logs.log | | {"sep": ";"} | After making sure a local text file exists at /tmp/airbyte_local/logs.log with server log lines delimited by ';' |
Example for SFTP:

| Dataset Name | Storage | User | Password | Host | URL | Reader Options | Description |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Test Rebext | SFTP | demo | password | test.rebext.net | /pub/example/readme.txt | {"sep": "\r\n", "header": null, "names": ["text"], "engine": "python"} | We use the python engine for read_csv in order to handle a delimiter of more than one character while providing our own column names. |
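For reference, those reader options roughly translate into a pandas call like the following. The local file path is a placeholder; the connector itself handles the SFTP transport:

```python
import pandas as pd

# A separator longer than one character (here "\r\n") requires the python engine;
# header=None together with names=["text"] yields one "text" column per line.
df = pd.read_csv(
    "readme.txt",  # placeholder for the remote /pub/example/readme.txt
    sep="\r\n",
    header=None,
    names=["text"],
    engine="python",
)
```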
Please see (or add) more at airbyte-integrations/connectors/source-file/integration_tests/integration_source_test.py for further usage examples.

Performance Considerations and Notes

In order to read large files from a remote location, this connector uses the smart_open library. However, it is possible to switch to either the GCSFS or S3FS implementation, as both are natively supported by the pandas library. This choice is made through the optional reader_impl parameter (a brief sketch follows the notes below).
  • Note that for the local filesystem, the file probably has to be stored somewhere in the /tmp/airbyte_local folder, with the same limitations as the CSV Destination, so the URL should also start with /local/.
  • The JSON implementation needs to be tweaked in order to produce a more complex catalog and is still in an experimental state: simple JSON schemas should work at this point, but data with multiple layers of nesting may not be handled well.
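As a rough illustration of the two reading strategies, reusing the public landsat_index example above. This is a sketch, not the connector's actual code, and it requires the relevant filesystem packages (e.g. gcsfs and smart_open's GCS extras) to be installed:

```python
import pandas as pd
from smart_open import open as smart_open

# Default strategy: smart_open streams the remote object and transparently
# decompresses the .gz extension, so no compression option is needed.
with smart_open("gs://gcp-public-data-landsat/index.csv.gz", "rb") as f:
    df = pd.read_csv(f)

# Alternative strategy (GCSFS/S3FS): pandas resolves the URL itself through
# gcsfs or s3fs, and the compression is passed explicitly via reader_options.
df_alt = pd.read_csv("gs://gcp-public-data-landsat/index.csv.gz", compression="gzip")
```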

Changelog

| Version | Date | Pull Request | Subject |
| :--- | :--- | :--- | :--- |
| 0.2.8 | 2021-12-06 | 8524 | Update connector fields title/description |
| 0.2.7 | 2021-10-28 | 7387 | Migrate source to CDK structure, add SAT testing. |
| 0.2.6 | 2021-08-26 | 5613 | Add support to xlsb format |
| 0.2.5 | 2021-07-26 | 4953 | Allow non-default port for SFTP type |
| 0.2.4 | 2021-06-09 | 3973 | Add AIRBYTE_ENTRYPOINT for Kubernetes support |
| 0.2.3 | 2021-06-01 | 3771 | Add Azure Storage Blob Files option |
| 0.2.2 | 2021-04-16 | 2883 | Fix CSV discovery memory consumption |
| 0.2.1 | 2021-04-03 | 2726 | Fix base connector versioning |
| 0.2.0 | 2021-03-09 | 2238 | Protocol allows future/unknown properties |
| 0.1.10 | 2021-02-18 | 2118 | Support JSONL format |
| 0.1.9 | 2021-02-02 | 1768 | Add test cases for all formats |
| 0.1.8 | 2021-01-27 | 1738 | Adopt connector best practices |
| 0.1.7 | 2020-12-16 | 1331 | Refactor Python base connector |
| 0.1.6 | 2020-12-08 | 1249 | Handle NaN values |
| 0.1.5 | 2020-11-30 | 1046 | Add connectors using an index YAML file |