GCS Data Lake [ARCHIVED]
This page guides you through setting up the GCS Data Lake destination connector.
This connector is Airbyte's official support for the Iceberg protocol on Google Cloud Storage. It writes the Iceberg table format to GCS using a supported Iceberg catalog.
Prerequisites
The GCS Data Lake connector requires two things:
- A Google Cloud Storage bucket
- A supported Iceberg catalog. Currently, the connector supports these catalogs:
  - BigLake
  - Polaris
Setup guide
Follow these steps to set up your GCS storage and Iceberg catalog permissions.
GCS setup and permissions
Create a GCS bucket
- Open the Google Cloud Console
- Click Cloud Storage > Buckets
- Click CREATE BUCKET
- Choose a bucket name and location
- Select a storage class and access control settings
- Click CREATE
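If you prefer to script this step instead of using the console, the following is a minimal sketch using the google-cloud-storage Python client. The bucket name, location, and storage class are placeholders; replace them with your own choices.

```python
# Sketch: create a GCS bucket with the google-cloud-storage client.
# Bucket name, location, and storage class are placeholders for your own values.
from google.cloud import storage

client = storage.Client()  # uses Application Default Credentials

bucket = storage.Bucket(client, name="my-data-lake")
bucket.storage_class = "STANDARD"
new_bucket = client.create_bucket(bucket, location="us-central1")

print(f"Created bucket {new_bucket.name} in {new_bucket.location}")
```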
Create a service account
- In the Google Cloud Console, navigate to IAM & Admin > Service Accounts
- Click CREATE SERVICE ACCOUNT
- Give it a name (for example: airbyte-gcs-data-lake)
- Grant the following roles:
  - Storage Admin - For full GCS bucket access
  - BigQuery Data Editor - For BigLake catalog operations
  - BigQuery User - For BigQuery operations
  - Service Usage Consumer - For using GCP services
- Click CREATE KEY and choose the JSON format
- Download the JSON key file
- In Airbyte, paste the entire contents of this JSON file into the Service Account JSON field
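Before pasting the key into Airbyte, you can optionally confirm that it grants access to your bucket. This is a minimal sketch using the google-cloud-storage Python client; the key-file path and bucket name are placeholders for your own values.

```python
# Sketch: verify the downloaded service account key can read the target bucket.
# The key-file path and bucket name are placeholders for your own values.
from google.cloud import storage

client = storage.Client.from_service_account_json("airbyte-gcs-data-lake.json")

bucket = client.get_bucket("my-data-lake")  # raises if the account lacks access
blobs = list(client.list_blobs(bucket, max_results=5))
print(f"OK: can read {bucket.name}, found {len(blobs)} object(s)")
```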
Iceberg catalog setup and permissions
The rest of the setup process differs depending on the catalog you're using.
BigLake
The BigLake catalog is Google Cloud's managed Iceberg catalog service. To use BigLake, you must first create a BigLake catalog in your GCP project. The service account you created earlier, with the roles granted above, should have the permissions it needs to access this catalog.
Polaris
To authenticate with Apache Polaris, follow these steps:
- Set up your Polaris catalog and create a principal with the necessary permissions. Refer to the Apache Polaris documentation for detailed setup instructions.
- When creating a principal in Polaris, you'll receive OAuth credentials (a Client ID and Client Secret). Keep these credentials secure.
- Grant the required privileges to your principal's catalog role. You can either:
  - Option A: grant the broad CATALOG_MANAGE_CONTENT privilege (recommended for simplicity). This single privilege allows the connector to manage tables and namespaces in the catalog.
  - Option B: grant specific granular privileges:
    - TABLE_LIST - List tables in a namespace
    - TABLE_CREATE - Create new tables
    - TABLE_DROP - Delete tables
    - TABLE_READ_PROPERTIES - Read table metadata
    - TABLE_WRITE_PROPERTIES - Update table metadata
    - TABLE_WRITE_DATA - Write data to tables
    - NAMESPACE_LIST - List namespaces
    - NAMESPACE_CREATE - Create new namespaces
    - NAMESPACE_READ_PROPERTIES - Read namespace metadata
- Ensure that your Polaris catalog has been configured with the appropriate storage credentials to access your GCS bucket.
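To sanity-check the Polaris setup before running a sync, you can connect with PyIceberg, which speaks the same Iceberg REST protocol the connector uses. This is a sketch under a few assumptions: the URI, catalog name, and credentials are placeholders, and the OAuth scope shown is a common Polaris default that may differ in your deployment.

```python
# Sketch: confirm the Polaris principal's credentials work against the catalog.
# The URI, catalog name, client ID, and client secret are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "uri": "http://localhost:8181/api/catalog",
        "credential": "<client-id>:<client-secret>",  # OAuth client credentials
        "warehouse": "quickstart_catalog",             # Polaris catalog name
        "scope": "PRINCIPAL_ROLE:ALL",                 # common Polaris scope; adjust if needed
    },
)

# Listing namespaces exercises NAMESPACE_LIST and verifies the token exchange.
print(catalog.list_namespaces())
```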
Configuration
In Airbyte, configure the following fields:
Common fields (all catalog types)
| Field | Required | Description |
|---|---|---|
| GCS Bucket Name | Yes | The name of your GCS bucket (for example: my-data-lake) |
| Service Account JSON | Yes | The complete JSON content from your service account key file |
| GCP Project ID | No | The GCP project ID. If not specified, it's extracted from the service account JSON |
| GCP Location | Yes | The GCP location/region (for example: us, us-central1, eu) |
| Warehouse Location | Yes | Root path for Iceberg data in GCS (for example: gs://my-bucket/warehouse) |
| Catalog Type | Yes | Select the type of Iceberg catalog to use: BigLake or Polaris |
| Main Branch Name | No | Iceberg branch name (default: main) |
BigLake-specific fields
When Catalog Type is set to BigLake, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| BigLake Catalog Name | Yes | Name of your BigLake catalog (from the setup step) |
| BigLake Database | Yes | Default database/namespace for tables |
Polaris-specific fields
When Catalog Type is set to Polaris, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| Polaris Server URI | Yes | The base URL of your Polaris server (for example: http://localhost:8181/api/catalog) |
| Catalog Name | Yes | The name of the catalog in Polaris (for example: quickstart_catalog) |
| Client ID | Yes | The OAuth Client ID for authenticating with the Polaris server |
| Client Secret | Yes | The OAuth Client Secret for authenticating with the Polaris server |
Output schema
How Airbyte generates the Iceberg schema
In each stream, Airbyte maps top-level fields to Iceberg fields. Airbyte maps nested fields (objects, arrays, and unions) to string columns and writes them as serialized JSON.
This is the full mapping between Airbyte types and Iceberg types.
| Airbyte type | Iceberg type |
|---|---|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone* | Time |
| Time without timezone | Time |
| Timestamp with timezone* | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |
*Airbyte converts the time with timezone and timestamp with timezone types to Coordinated Universal Time (UTC) before writing to the Iceberg file.
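As an illustration of this mapping, a record containing both scalar and nested fields lands in Iceberg with the scalars typed natively and the nested values serialized to JSON strings. A rough sketch of the effect (not the connector's actual code):

```python
# Sketch: how a source record's fields map to Iceberg column values.
# Top-level scalars keep a native Iceberg type; nested values become JSON strings.
import json

source_record = {
    "id": 42,                                       # Integer -> Long
    "email": "user@example.com",                    # String  -> String
    "address": {"city": "Berlin", "zip": "10115"},  # Object  -> String (JSON)
    "tags": ["new", "trial"],                       # Array   -> String (JSON)
}

iceberg_row = {
    key: json.dumps(value) if isinstance(value, (dict, list)) else value
    for key, value in source_record.items()
}

print(iceberg_row)
# {'id': 42, 'email': 'user@example.com',
#  'address': '{"city": "Berlin", "zip": "10115"}', 'tags': '["new", "trial"]'}
```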
Managing schema evolution
This connector never rewrites existing Iceberg data files. This means Airbyte can only handle specific source schema changes:
- Adding or removing a column
- Widening a column
- Changing the primary key
You have the following options to manage schema evolution:
- To handle unsupported schema changes automatically, use Full Refresh - Overwrite as your sync mode.
- To handle unsupported schema changes as they occur, wait for a sync to fail, then take action to restore it. Either:
Deduplication
This connector uses a merge-on-read strategy to support deduplication.
- Airbyte translates the stream's primary keys to Iceberg's identifier columns.
- An "upsert" is an equality-based delete on that row's primary key, followed by an insertion of the new data.
Assumptions about primary keys
The GCS Data Lake connector assumes that one of two things is true:
- The source never emits the same primary key twice in a single sync attempt.
- If the source emits the same primary key multiple times in a single attempt, it always emits those records in cursor order from oldest to newest.
If these conditions aren't met, you may see inaccurate data in Iceberg in the form of older records taking precedence over newer records. If this happens, use append or overwrite as your sync modes.
An unknown number of API sources have streams that don't meet these conditions. Airbyte knows Stripe and Monday don't, but there are probably others.
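To see why ordering matters, consider upserts keyed on the primary key when duplicates arrive out of cursor order. A simplified sketch of the behavior (not the connector's actual code):

```python
# Sketch: upserts keyed by primary key keep whichever record arrives last,
# not whichever record is newest by cursor.
records_in_arrival_order = [
    {"id": 1, "updated_at": "2024-05-02", "status": "active"},   # newer record...
    {"id": 1, "updated_at": "2024-05-01", "status": "pending"},  # ...emitted after an older one
]

table = {}
for record in records_in_arrival_order:
    table[record["id"]] = record  # equality delete + insert, per primary key

print(table[1]["status"])  # "pending" -- the older record wins, which is likely not what you want
```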
Branching and data availability
Iceberg supports Git-like semantics over your data. This connector leverages those semantics to provide resilient syncs.
- In each sync, each microbatch creates a new snapshot.
- During truncate syncs, the connector writes the refreshed data to the airbyte_staging branch, then replaces the main branch with airbyte_staging at the end of the sync. Since most query engines target the main branch, people can query your data until the end of a truncate sync, at which point it's atomically swapped to the new version.
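If you want to inspect the staging data while a truncate sync is still running, most engines can read a specific Iceberg branch. Here's a sketch using PySpark; it assumes a Spark session already configured with the Iceberg runtime and your catalog, and the catalog, database, and table names are placeholders.

```python
# Sketch: read specific Iceberg branches with Spark.
# Assumes the Iceberg Spark runtime and your catalog are already configured
# for this session (configs omitted for brevity); table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-branches").getOrCreate()

main_df = spark.read.format("iceberg").option("branch", "main").load("my_catalog.db.users")
staging_df = spark.read.format("iceberg").option("branch", "airbyte_staging").load("my_catalog.db.users")

print(main_df.count(), staging_df.count())
```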
Branch replacement
At the end of a stream's sync, the current main branch is replaced with the airbyte_staging branch. The connector intentionally avoids fast-forwarding to better handle potential compaction issues.
Important Warning: any changes made to the main branch outside of Airbyte's operations after a sync begins will be lost during this process.
Compaction
Do not run compaction during a truncate refresh sync to prevent data loss. During a truncate refresh sync, the system deletes all files that don't belong to the latest generation. This includes:
- Files without generation IDs (compacted files)
- Files from previous generations
If compaction runs at the same time as a truncate sync, the sync deletes the newly compacted files (which lack generation IDs) even though they contain current data, causing data loss.
Reference
No configuration specification is available for this connector.