GCS Data Lake [ARCHIVED]
This page guides you through setting up the GCS Data Lake destination connector.
This connector is Airbyte's official support for the Iceberg protocol on Google Cloud Storage. It writes the Iceberg table format to GCS using a supported Iceberg catalog.
Prerequisites
The GCS Data Lake connector requires two things:
- A Google Cloud Storage bucket
- A supported Iceberg catalog. Currently, the connector supports these catalogs:
  - BigLake
  - Polaris
Setup guide
Follow these steps to set up your GCS storage and Iceberg catalog permissions.
GCS setup and permissions
Create a GCS bucket
- Open the Google Cloud Console
- Click Cloud Storage > Buckets
- Click CREATE BUCKET
- Choose a bucket name and location
- Select a storage class and access control settings
- Click CREATE
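If you prefer to script this step instead of using the console, the following is a minimal sketch using the google-cloud-storage Python client. The bucket name, location, and storage class are placeholders; replace them with your own choices.

```python
# Sketch: create a GCS bucket with the google-cloud-storage client.
# Bucket name, location, and storage class are placeholders for your own values.
from google.cloud import storage

client = storage.Client()  # uses Application Default Credentials

bucket = storage.Bucket(client, name="my-data-lake")
bucket.storage_class = "STANDARD"
new_bucket = client.create_bucket(bucket, location="us-central1")

print(f"Created bucket {new_bucket.name} in {new_bucket.location}")
```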
Create a service account
- In the Google Cloud Console, navigate to IAM & Admin > Service Accounts
- Click CREATE SERVICE ACCOUNT
- Give it a name (for example: airbyte-gcs-data-lake)
- Grant the following roles:
  - Storage Admin - For full GCS bucket access
  - BigQuery Data Editor - For BigLake catalog operations
  - BigQuery User - For BigQuery operations
  - Service Usage Consumer - For using GCP services
- Click CREATE KEY and choose the JSON format
- Download the JSON key file
- In Airbyte, paste the entire contents of this JSON file into the Service Account JSON field
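Before pasting the key into Airbyte, you can optionally confirm that it grants access to your bucket. This is a minimal sketch using the google-cloud-storage Python client; the key-file path and bucket name are placeholders for your own values.

```python
# Sketch: verify the downloaded service account key can read the target bucket.
# The key-file path and bucket name are placeholders for your own values.
from google.cloud import storage

client = storage.Client.from_service_account_json("airbyte-gcs-data-lake.json")

bucket = client.get_bucket("my-data-lake")  # raises if the account lacks access
blobs = list(client.list_blobs(bucket, max_results=5))
print(f"OK: can read {bucket.name}, found {len(blobs)} object(s)")
```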
Iceberg catalog setup and permissions
The rest of the setup process differs depending on the catalog you're using.
BigLake
The BigLake catalog is Google Cloud's managed Iceberg catalog service. To use BigLake, you must first create a BigLake catalog in your GCP project. The service account you created earlier, with the roles granted above, should have the permissions it needs to access this catalog.
Polaris
To authenticate with Apache Polaris, follow these steps:
- Set up your Polaris catalog and create a principal with the necessary permissions. Refer to the Apache Polaris documentation for detailed setup instructions.
- When creating a principal in Polaris, you'll receive OAuth credentials (a Client ID and Client Secret). Keep these credentials secure.
- Grant the required privileges to your principal's catalog role. You can either:
  - Option A: grant the broad CATALOG_MANAGE_CONTENT privilege (recommended for simplicity). This single privilege allows the connector to manage tables and namespaces in the catalog.
  - Option B: grant specific granular privileges:
    - TABLE_LIST - List tables in a namespace
    - TABLE_CREATE - Create new tables
    - TABLE_DROP - Delete tables
    - TABLE_READ_PROPERTIES - Read table metadata
    - TABLE_WRITE_PROPERTIES - Update table metadata
    - TABLE_WRITE_DATA - Write data to tables
    - NAMESPACE_LIST - List namespaces
    - NAMESPACE_CREATE - Create new namespaces
    - NAMESPACE_READ_PROPERTIES - Read namespace metadata
- Ensure that your Polaris catalog has been configured with the appropriate storage credentials to access your GCS bucket.
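To sanity-check the Polaris setup before running a sync, you can connect with PyIceberg, which speaks the same Iceberg REST protocol the connector uses. This is a sketch under a few assumptions: the URI, catalog name, and credentials are placeholders, and the OAuth scope shown is a common Polaris default that may differ in your deployment.

```python
# Sketch: confirm the Polaris principal's credentials work against the catalog.
# The URI, catalog name, client ID, and client secret are placeholders.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "polaris",
    **{
        "uri": "http://localhost:8181/api/catalog",
        "credential": "<client-id>:<client-secret>",  # OAuth client credentials
        "warehouse": "quickstart_catalog",             # Polaris catalog name
        "scope": "PRINCIPAL_ROLE:ALL",                 # common Polaris scope; adjust if needed
    },
)

# Listing namespaces exercises NAMESPACE_LIST and verifies the token exchange.
print(catalog.list_namespaces())
```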
Configuration
In Airbyte, configure the following fields:
Common fields (all catalog types)
| Field | Required | Description |
|---|---|---|
| GCS Bucket Name | Yes | The name of your GCS bucket (for example: my-data-lake) |
| Service Account JSON | Yes | The complete JSON content from your service account key file |
| GCP Project ID | No | The GCP project ID. If not specified, it's extracted from the service account JSON |
| GCP Location | Yes | The GCP location/region (for example: us, us-central1, eu) |
| Warehouse Location | Yes | Root path for Iceberg data in GCS (for example: gs://my-bucket/warehouse) |
| Catalog Type | Yes | Select the type of Iceberg catalog to use: BigLake or Polaris |
| Main Branch Name | No | Iceberg branch name (default: main) |
BigLake-specific fields
When Catalog Type is set to BigLake, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| BigLake Catalog Name | Yes | Name of your BigLake catalog (from the setup step) |
| BigLake Database | Yes | Default database/namespace for tables |
Polaris-specific fields
When Catalog Type is set to Polaris, configure these additional fields:
| Field | Required | Description |
|---|---|---|
| Polaris Server URI | Yes | The base URL of your Polaris server (for example: http://localhost:8181/api/catalog) |
| Catalog Name | Yes | The name of the catalog in Polaris (for example: quickstart_catalog) |
| Client ID | Yes | The OAuth Client ID for authenticating with the Polaris server |
| Client Secret | Yes | The OAuth Client Secret for authenticating with the Polaris server |
Output schema
How Airbyte generates the Iceberg schema
In each stream, Airbyte maps top-level fields to Iceberg fields. Airbyte maps nested fields (objects, arrays, and unions) to string columns and writes them as serialized JSON.
This is the full mapping between Airbyte types and Iceberg types.
| Airbyte type | Iceberg type |
|---|---|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone* | Time |
| Time without timezone | Time |
| Timestamp with timezone* | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |
*Airbyte converts the time with timezone and timestamp with timezone types to Coordinated Universal Time (UTC) before writing to the Iceberg file.
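As an illustration of this mapping, a record containing both scalar and nested fields lands in Iceberg with the scalars typed natively and the nested values serialized to JSON strings. A rough sketch of the effect (not the connector's actual code):

```python
# Sketch: how a source record's fields map to Iceberg column values.
# Top-level scalars keep a native Iceberg type; nested values become JSON strings.
import json

source_record = {
    "id": 42,                                       # Integer -> Long
    "email": "user@example.com",                    # String  -> String
    "address": {"city": "Berlin", "zip": "10115"},  # Object  -> String (JSON)
    "tags": ["new", "trial"],                       # Array   -> String (JSON)
}

iceberg_row = {
    key: json.dumps(value) if isinstance(value, (dict, list)) else value
    for key, value in source_record.items()
}

print(iceberg_row)
# {'id': 42, 'email': 'user@example.com',
#  'address': '{"city": "Berlin", "zip": "10115"}', 'tags': '["new", "trial"]'}
```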
Managing schema evolution
This connector never rewrites existing Iceberg data files. This means Airbyte can only handle specific source schema changes:
- Adding or removing a column
- Widening a column
- Changing the primary key
You have the following options to manage schema evolution:
- To handle unsupported schema changes automatically, use Full Refresh - Overwrite as your sync mode.
- To handle unsupported schema changes as they occur, wait for a sync to fail, then take action to restore it. Either:
Deduplication
This connector uses a merge-on-read strategy to support deduplication.
- Airbyte translates the stream's primary keys to Iceberg's identifier columns.
- An "upsert" is an equality-based delete on that row's primary key, followed by an insertion of the new data.
Assumptions about primary keys
The GCS Data Lake connector assumes that one of two things is true:
- The source never emits the same primary key twice in a single sync attempt.
- If the source emits the same primary key multiple times in a single attempt, it always emits those records in cursor order from oldest to newest.
If these conditions aren't met, you may see inaccurate data in Iceberg in the form of older records taking precedence over newer records. If this happens, use append or overwrite as your sync modes.
An unknown number of API sources have streams that don't meet these conditions. Airbyte knows Stripe and Monday don't, but there are probably others.
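To see why ordering matters, consider upserts keyed on the primary key when duplicates arrive out of cursor order. A simplified sketch of the behavior (not the connector's actual code):

```python
# Sketch: upserts keyed by primary key keep whichever record arrives last,
# not whichever record is newest by cursor.
records_in_arrival_order = [
    {"id": 1, "updated_at": "2024-05-02", "status": "active"},   # newer record...
    {"id": 1, "updated_at": "2024-05-01", "status": "pending"},  # ...emitted after an older one
]

table = {}
for record in records_in_arrival_order:
    table[record["id"]] = record  # equality delete + insert, per primary key

print(table[1]["status"])  # "pending" -- the older record wins, which is likely not what you want
```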
Branching and data availability
Iceberg supports Git-like semantics over your data. This connector leverages those semantics to provide resilient syncs.
- In each sync, each microbatch creates a new snapshot.
- During truncate syncs, the connector writes the refreshed data to the airbyte_staging branch, then replaces the main branch with airbyte_staging at the end of the sync. Since most query engines target the main branch, people can query your data until the end of a truncate sync, at which point it's atomically swapped to the new version.
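If you want to inspect the staging data while a truncate sync is still running, most engines can read a specific Iceberg branch. Here's a sketch using PySpark; it assumes a Spark session already configured with the Iceberg runtime and your catalog, and the catalog, database, and table names are placeholders.

```python
# Sketch: read specific Iceberg branches with Spark.
# Assumes the Iceberg Spark runtime and your catalog are already configured
# for this session (configs omitted for brevity); table names are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("inspect-branches").getOrCreate()

main_df = spark.read.format("iceberg").option("branch", "main").load("my_catalog.db.users")
staging_df = spark.read.format("iceberg").option("branch", "airbyte_staging").load("my_catalog.db.users")

print(main_df.count(), staging_df.count())
```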
Branch replacement
At the end of a stream's sync, the current main branch is replaced with the airbyte_staging branch. The connector intentionally avoids fast-forwarding to better handle potential compaction issues.
Important Warning: any changes made to the main branch outside of Airbyte's operations after a sync begins will be lost during this process.
Compaction
Do not run compaction during a truncate refresh sync to prevent data loss. During a truncate refresh sync, the system deletes all files that don't belong to the latest generation. This includes:
- Files without generation IDs (compacted files)
- Files from previous generations
If compaction runs at the same time as a truncate sync, the sync deletes the newly compacted files (which lack generation IDs) even though they contain current data, causing data loss.
Reference
No configuration specification is available for this connector.