Azure Blob Storage
This page contains the setup guide and reference information for the Azure Blob Storage source connector.
Cloud storage may incur egress costs. Egress refers to data that is transferred out of the cloud storage system, such as when you download files or access them from a different location. For more information, see the Azure Blob Storage pricing guide.
Prerequisites
- Tenant ID of the Microsoft Azure Application user
- Azure Blob Storage account name
- Azure Blob Storage container (bucket) name
Minimum permissions (for the Storage Blob Data Reader role):
[
{
"actions": [
"Microsoft.Storage/storageAccounts/blobServices/containers/read",
"Microsoft.Storage/storageAccounts/blobServices/generateUserDelegationKey/action"
],
"notActions": [],
"dataActions": [
"Microsoft.Storage/storageAccounts/blobServices/containers/blobs/read"
],
"notDataActions": []
}
]
Setup guide
Step 1: Set up Azure Blob Storage
- Create a storage account and grant roles (see the Azure documentation for details).
To use the OAuth2 or Client Credentials authentication methods, Access Control (IAM) should be set up. The Storage Blob Data Reader role is recommended.
Follow these steps to set up an IAM role:
- Go to the Azure portal, select the storage account (or container) you'd like to sync from, and open Access Control (IAM) -> Role assignments.
- Click Add and select Add role assignment from the dropdown list.
- Search for Storage Blob Data Reader in the search box, select the role from the list, and click Next.
- Select User, group, or service principal, click Members, select the member(s) so they appear in the table, and click Next.
- (Optional) Add conditions to restrict the role assignments a user can create.
- Click Review + assign.
Follow these steps to set up a Service Principal to use the Client Credentials authentication method:
- In the Azure portal, navigate to your Service Principal's App Registration.
- Note the Directory (tenant) ID and Application (client) ID in the Overview panel.
- In the Manage / Certificates & secrets panel, click Client Secrets and create a new secret. Note the Value of the secret.
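To sanity-check the Service Principal and its role assignment before configuring the connector, you can try listing blobs in the container with the Azure SDK for Python. This is a minimal sketch, assuming the azure-identity and azure-storage-blob packages are installed; all angle-bracket values are placeholders for the IDs, secret, account, and container noted above.

```python
# Minimal sketch: verify the Service Principal can list blobs in the container.
# Assumes `pip install azure-identity azure-storage-blob`; all values are placeholders.
from azure.identity import ClientSecretCredential
from azure.storage.blob import BlobServiceClient

credential = ClientSecretCredential(
    tenant_id="<Directory (tenant) ID>",
    client_id="<Application (client) ID>",
    client_secret="<Client secret Value>",
)

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential=credential,
)

container = service.get_container_client("<container-name>")
# Storage Blob Data Reader grants read/list access, so this listing should succeed.
for blob in container.list_blobs():
    print(blob.name)
```

If this call fails with an authorization error, re-check the role assignment from the steps above before moving on to the connector setup.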
Step 2: Set up the Azure Blob Storage connector in Airbyte
For Airbyte Cloud:
- Log into your Airbyte Cloud account.
- Click Sources and then click + New source.
- On the Set up the source page, select Azure Blob Storage from the Source type dropdown.
- Enter a name for the Azure Blob Storage connector.
- Enter the name of your Azure Blob Storage account.
- Enter your Tenant ID and click Authenticate your Azure Blob Storage account.
- Log in and authorize the Azure Blob Storage account.
- Enter the name of the Container containing your files to replicate.
- Add a stream
- Write the File Type
- In the Format box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are CSV, Parquet, Avro and JSONL. Toggling the Optional fields button within the Format box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the File Format section below.
- Give a Name to the stream
- (Optional) If you want to enforce a specific schema, you can enter an Input schema. By default, this value is set to `{}` and the schema will be automatically inferred from the file(s) you are replicating. For details on providing a custom schema, refer to the User Schema section.
- Optionally, enter the Globs which dictate which files should be synced. This is a glob pattern that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the Path Patterns section below.
- (Optional) Enter the endpoint to use for the data replication.
- (Optional) Enter the desired start date from which to begin replicating data.
For Airbyte Open Source:
- Navigate to the Airbyte Open Source dashboard.
- Click Sources and then click + New source.
- On the Set up the source page, select Azure Blob Storage from the Source type dropdown.
- Enter a name for the Azure Blob Storage connector.
- Enter the name of your Azure Storage Account and container.
- Choose the Authentication method.
- If you are accessing through a Storage Account Key, choose Authenticate via Storage Account Key and enter the key.
- If you are accessing through a Service Principal, choose Authenticate via Client Credentials. See above for setting up IAM role bindings for the Service Principal and for obtaining the app registration details.
  - Enter the Directory (tenant) ID value from the app registration in the Azure Portal into the Tenant ID field.
  - Enter the Application (client) ID from the Azure Portal into the Client ID field. Note that this is not the secret ID.
  - Enter the secret Value from the Azure Portal into the Client Secret field.
- Add a stream
- Write the File Type
- In the Format box, use the dropdown menu to select the format of the files you'd like to replicate. The supported formats are CSV, Parquet, Avro and JSONL. Toggling the Optional fields button within the Format box will allow you to enter additional configurations based on the selected format. For a detailed breakdown of these settings, refer to the File Format section below.
- Give a Name to the stream
- (Optional) If you want to enforce a specific schema, you can enter an Input schema. By default, this value is set to `{}` and the schema will be automatically inferred from the file(s) you are replicating. For details on providing a custom schema, refer to the User Schema section.
- Optionally, enter the Globs which dictate which files should be synced. This is a glob pattern that allows Airbyte to pattern match the specific files to replicate. If you are replicating all the files within your bucket, use `**` as the pattern. For more precise pattern matching options, refer to the Path Patterns section below.
- (Optional) Enter the endpoint to use for the data replication.
- (Optional) Enter the desired start date from which to begin replicating data.
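If you use the Storage Account Key method, a quick local check with azure-storage-blob can likewise confirm the key and container name before you configure the source. This is a sketch with placeholder values, assuming the package is installed.

```python
# Minimal sketch: verify Storage Account Key access to the container.
# Assumes `pip install azure-storage-blob`; all values are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url="https://<storage-account-name>.blob.core.windows.net",
    credential="<storage-account-key>",  # the account key string acts as the credential
)

container = service.get_container_client("<container-name>")
print([blob.name for blob in container.list_blobs()])
```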
Supported sync modes
The Azure Blob Storage source connector supports the following sync modes:
| Feature | Supported? |
|---------|------------|
| Full Refresh Sync | Yes |
| Incremental Sync | Yes |
| Replicate Incremental Deletes | No |
| Replicate Multiple Files (pattern matching) | Yes |
| Replicate Multiple Streams (distinct tables) | Yes |
| Namespaces | No |
Supported Streams
File Compressions
| Compression | Supported? |
|-------------|------------|
| Gzip | Yes |
| Zip | No |
| Bzip2 | Yes |
| Lzma | No |
| Xz | No |
| Snappy | No |
Please let us know any specific compressions you'd like to see support for next!
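As a plain-Python illustration of the two supported compressions (this is not connector code), gzip- and bzip2-compressed CSVs can be read with the standard library; the file names below are hypothetical.

```python
# Illustration only: reading gzip- and bzip2-compressed CSVs, the two compressions
# the connector supports. File names are hypothetical examples.
import bz2
import csv
import gzip

with gzip.open("myFolder/data.csv.gz", mode="rt", newline="") as f:
    for row in csv.reader(f):
        print(row)

with bz2.open("myFolder/data.csv.bz2", mode="rt", newline="") as f:
    for row in csv.reader(f):
        print(row)
```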
Path Patterns
(tl;dr -> path pattern syntax using wcmatch.glob. GLOBSTAR and SPLIT flags are enabled.)
This connector can sync multiple files by using glob-style patterns, rather than requiring a specific path for every file. This enables:
- Referencing many files with just one pattern, e.g. `**` would indicate every file in the bucket.
- Referencing future files that don't exist yet (and therefore don't have a specific path).
You must provide a path pattern. You can also provide many patterns split with | for more complex directory layouts.
Each path pattern is a reference from the root of the bucket, so don't include the bucket name in the pattern(s).
Some example patterns:
- `**` : match everything.
- `**/*.csv` : match all files with a specific extension.
- `myFolder/**/*.csv` : match all csv files anywhere under myFolder.
- `*/**` : match everything at least one folder deep.
- `*/*/*/**` : match everything at least three folders deep.
- `**/file.*|**/file` : match every file called "file" with any extension (or no extension).
- `x/*/y/*` : match all files that sit in folder x -> any folder -> folder y.
- `**/prefix*.csv` : match all csv files with a specific prefix.
- `**/prefix*.parquet` : match all parquet files with a specific prefix.
Let's look at a specific example, matching the following bucket layout:
myBucket
    -> log_files
    -> some_table_files
        -> part1.csv
        -> part2.csv
    -> images
    -> more_table_files
        -> part3.csv
    -> extras
        -> misc
            -> another_part1.csv
We want to pick up part1.csv, part2.csv and part3.csv (excluding another_part1.csv for now). We could do this a few different ways:
- We could pick up every csv file called "partX" with the single pattern `**/part*.csv`.
- To be a bit more robust, we could use the dual pattern `some_table_files/*.csv|more_table_files/*.csv` to pick up relevant files only from those exact folders.
- We could achieve the above in a single pattern by using `*table_files/*.csv`. This could however cause problems in the future if new unexpected folders started being created.
- We can also recursively wildcard, so adding the pattern `extras/**/*.csv` would pick up any csv files nested in folders below "extras", such as "extras/misc/another_part1.csv".
As you can probably tell, there are many ways to achieve the same goal with path patterns. We recommend using a pattern that ensures clarity and is robust against future additions to the directory structure.
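Because the connector evaluates patterns with wcmatch.glob (GLOBSTAR and SPLIT flags enabled), you can try candidate patterns against a list of object keys locally before configuring a stream. Below is a minimal sketch under that assumption; it uses the example layout above and is not part of the connector itself.

```python
# Rough sketch: test glob patterns locally with wcmatch.glob, the library the
# connector's path-pattern syntax is based on (GLOBSTAR and SPLIT enabled).
# Requires `pip install wcmatch`; paths mirror the example bucket layout above.
from wcmatch import glob

paths = [
    "some_table_files/part1.csv",
    "some_table_files/part2.csv",
    "more_table_files/part3.csv",
    "extras/misc/another_part1.csv",
]

FLAGS = glob.GLOBSTAR | glob.SPLIT  # SPLIT lets one string hold patterns joined with "|"

for pattern in [
    "**/part*.csv",
    "some_table_files/*.csv|more_table_files/*.csv",
    "*table_files/*.csv",
    "extras/**/*.csv",
]:
    matches = [p for p in paths if glob.globmatch(p, pattern, flags=FLAGS)]
    print(f"{pattern!r} -> {matches}")
```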
User Schema
Providing a schema allows for more control over the output of this stream. Without a provided schema, columns and datatypes will be inferred from the first created file in the bucket matching your path pattern and suffix. This will probably be fine in most cases, but there may be situations where you want to enforce a schema instead, e.g.:
- You only care about a specific known subset of the columns. The other columns would all still be included, but packed into the `_ab_additional_properties` map.
- Your initial dataset is quite small (in terms of number of records), and you think the automatic type inference from this sample might not be representative of the data in the future.
- You want to purposely define types for every column.
- You know the names of columns that will be added to future data and want to include these in the core schema as columns rather than have them appear in the `_ab_additional_properties` map.
Or any other reason! The schema must be provided as valid JSON as a map of `{"column": "datatype"}`, where each datatype is one of:
- string
- number
- integer
- object
- array
- boolean
- null
For example:
{"id": "integer", "location": "string", "longitude": "number", "latitude": "number"}
{"username": "string", "friends": "array", "information": "object"}