GCS Data Lake [ARCHIVED]

This page guides you through setting up the GCS Data Lake destination connector.

This connector is Airbyte's official support for the Apache Iceberg table format on Google Cloud Storage. It writes Iceberg tables to GCS using a supported Iceberg catalog.

Prerequisites

The GCS Data Lake connector requires two things:

  1. A Google Cloud Storage bucket
  2. A supported Iceberg catalog. Currently, the connector supports these catalogs:
    • BigLake
    • Polaris

Setup guide

Follow these steps to set up your GCS storage and Iceberg catalog permissions.

GCS setup and permissions

Create a GCS bucket

  1. Open the Google Cloud Console
  2. Click Cloud Storage > Buckets
  3. Click CREATE BUCKET
  4. Choose a bucket name and location
  5. Select a storage class and access control settings
  6. Click CREATE
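
If you prefer to script this step, the sketch below creates an equivalent bucket with the google-cloud-storage Python client. The project ID, bucket name, location, and storage class are placeholders, and it assumes Application Default Credentials are already configured.

```python
# A minimal sketch of creating the bucket programmatically instead of via the
# Console. All names below are placeholders.
from google.cloud import storage

client = storage.Client(project="my-gcp-project")  # assumes default credentials

bucket = storage.Bucket(client, name="my-data-lake")
bucket.storage_class = "STANDARD"

# Creates the bucket in the given location; raises Conflict if the name is taken.
client.create_bucket(bucket, location="us-central1")
```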

Create a service account

  1. In the Google Cloud Console, navigate to IAM & Admin > Service Accounts

  2. Click CREATE SERVICE ACCOUNT

  3. Give it a name (for example: airbyte-gcs-data-lake)

  4. Grant the following roles:

    • Storage Admin - For full GCS bucket access
    • BigQuery Data Editor - For BigLake catalog operations
    • BigQuery User - For BigQuery operations
    • Service Usage Consumer - For using GCP services
  5. Click CREATE KEY and choose the JSON format

  6. Download the JSON key file

  7. In Airbyte, paste the entire contents of this JSON file into the Service Account JSON field
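
Optionally, before pasting the key into Airbyte, you can sanity-check it with a short script like the sketch below. The file path and bucket name are placeholders.

```python
# Verify the downloaded JSON key is valid and can reach the bucket.
from google.cloud import storage
from google.oauth2 import service_account

creds = service_account.Credentials.from_service_account_file(
    "airbyte-gcs-data-lake-key.json"  # placeholder path to the downloaded key
)
client = storage.Client(credentials=creds, project=creds.project_id)

# Listing a few objects confirms the Storage role was granted correctly.
for blob in client.list_blobs("my-data-lake", max_results=5):
    print(blob.name)
```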

Iceberg catalog setup and permissions

The rest of the setup process differs depending on the catalog you're using.

BigLake

The BigLake catalog is Google Cloud's managed Iceberg catalog service. To use BigLake, you need to have created a BigLake catalog in your GCP project. The service account you created earlier should have the necessary permissions to access this catalog.

Polaris

To authenticate with Apache Polaris, follow these steps:

  1. Set up your Polaris catalog and create a principal with the necessary permissions. Refer to the Apache Polaris documentation for detailed setup instructions.

  2. When creating a principal in Polaris, you'll receive OAuth credentials (Client ID and Client Secret). Keep these credentials secure.

  3. Grant the required privileges to your principal's catalog role. You can either:

    Option A: grant the broad CATALOG_MANAGE_CONTENT privilege (recommended for simplicity):

    • This single privilege allows the connector to manage tables and namespaces in the catalog

    Option B: grant specific granular privileges:

    • TABLE_LIST - List tables in a namespace
    • TABLE_CREATE - Create new tables
    • TABLE_DROP - Delete tables
    • TABLE_READ_PROPERTIES - Read table metadata
    • TABLE_WRITE_PROPERTIES - Update table metadata
    • TABLE_WRITE_DATA - Write data to tables
    • NAMESPACE_LIST - List namespaces
    • NAMESPACE_CREATE - Create new namespaces
    • NAMESPACE_READ_PROPERTIES - Read namespace metadata
  4. Ensure that your Polaris catalog has been configured with the appropriate storage credentials to access your GCS bucket.
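
Before configuring Airbyte, you can optionally verify the principal's credentials with pyiceberg's REST catalog support, as in this sketch. The server URI, catalog name, and credentials are placeholders that mirror the Polaris configuration fields described below.

```python
# A minimal sketch for testing the Polaris principal's OAuth credentials
# outside Airbyte, using pyiceberg.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "quickstart_catalog",
    **{
        "type": "rest",
        "uri": "http://localhost:8181/api/catalog",
        # pyiceberg sends this as an OAuth client_credentials grant.
        "credential": "<client-id>:<client-secret>",
        "warehouse": "quickstart_catalog",
    },
)

# If the principal's catalog role has NAMESPACE_LIST, this should succeed.
print(catalog.list_namespaces())
```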

Configuration

In Airbyte, configure the following fields:

Common fields (all catalog types)

| Field | Required | Description |
|-------|----------|-------------|
| GCS Bucket Name | Yes | The name of your GCS bucket (for example: my-data-lake) |
| Service Account JSON | Yes | The complete JSON content from your service account key file |
| GCP Project ID | No | The GCP project ID. If not specified, it's extracted from the service account JSON |
| GCP Location | Yes | The GCP location/region (for example: us, us-central1, eu) |
| Warehouse Location | Yes | Root path for Iceberg data in GCS (for example: gs://my-bucket/warehouse) |
| Catalog Type | Yes | Select the type of Iceberg catalog to use: BigLake or Polaris |
| Main Branch Name | No | Iceberg branch name (default: main) |

BigLake-specific fields

When Catalog Type is set to BigLake, configure these additional fields:

| Field | Required | Description |
|-------|----------|-------------|
| BigLake Catalog Name | Yes | Name of your BigLake catalog (from the setup step) |
| BigLake Database | Yes | Default database/namespace for tables |

Polaris-specific fields

When Catalog Type is set to Polaris, configure these additional fields:

| Field | Required | Description |
|-------|----------|-------------|
| Polaris Server URI | Yes | The base URL of your Polaris server (for example: http://localhost:8181/api/catalog) |
| Catalog Name | Yes | The name of the catalog in Polaris (for example: quickstart_catalog) |
| Client ID | Yes | The OAuth Client ID for authenticating with the Polaris server |
| Client Secret | Yes | The OAuth Client Secret for authenticating with the Polaris server |

Output schema

How Airbyte generates the Iceberg schema

In each stream, Airbyte maps top-level fields to Iceberg fields. Airbyte maps nested fields (objects, arrays, and unions) to string columns and writes them as serialized JSON.

This is the full mapping between Airbyte types and Iceberg types.

| Airbyte type | Iceberg type |
|--------------|--------------|
| Boolean | Boolean |
| Date | Date |
| Integer | Long |
| Number | Double |
| String | String |
| Time with timezone* | Time |
| Time without timezone | Time |
| Timestamp with timezone* | Timestamp with timezone |
| Timestamp without timezone | Timestamp without timezone |
| Object | String (JSON-serialized value) |
| Array | String (JSON-serialized value) |
| Union | String (JSON-serialized value) |

*Airbyte converts the time with timezone and timestamp with timezone types to Coordinated Universal Time (UTC) before writing to the Iceberg file.
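
As an illustration of the mapping above, the sketch below shows the serialization the connector applies: top-level scalar fields keep typed columns, while objects and arrays land in string columns as serialized JSON. The record shape is hypothetical.

```python
import json

# A hypothetical source record with scalar and nested fields.
record = {
    "id": 42,                  # Integer -> Iceberg Long
    "name": "widget",          # String  -> Iceberg String
    "tags": ["a", "b"],        # Array   -> Iceberg String (JSON)
    "meta": {"color": "red"},  # Object  -> Iceberg String (JSON)
}

row = {
    "id": record["id"],
    "name": record["name"],
    # Nested values are written as serialized JSON strings.
    "tags": json.dumps(record["tags"]),
    "meta": json.dumps(record["meta"]),
}
print(row)  # {'id': 42, 'name': 'widget', 'tags': '["a", "b"]', 'meta': '{"color": "red"}'}
```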

Managing schema evolution

This connector never rewrites existing Iceberg data files. This means Airbyte can only handle specific source schema changes:

  • Adding or removing a column
  • Widening a column
  • Changing the primary key

You have the following options to manage schema evolution:

  • To handle unsupported schema changes automatically, use Full Refresh - Overwrite as your sync mode.
  • To handle unsupported schema changes as they occur, wait for a sync to fail, then take one of these actions to restore it:
    • Manually edit your table schema in Iceberg directly (see the sketch after this list).
    • Refresh your connection in Airbyte.
    • Clear your connection in Airbyte.
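
For the manual-edit option, one way to widen a column is pyiceberg's UpdateSchema API, sketched below. The catalog and table names are placeholders, and the sketch assumes a catalog configured in ~/.pyiceberg.yaml.

```python
# A hedged sketch of one manual schema edit: widening an int column to long
# after an unsupported source change. Names are placeholders.
from pyiceberg.catalog import load_catalog
from pyiceberg.types import LongType

catalog = load_catalog("my_catalog")  # assumes catalog config in ~/.pyiceberg.yaml
table = catalog.load_table("my_namespace.my_table")

# Widen the column in place; existing data files are not rewritten.
with table.update_schema() as update:
    update.update_column("quantity", field_type=LongType())
```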

Deduplication

This connector uses a merge-on-read strategy to support deduplication.

Assumptions about primary keys

The GCS Data Lake connector assumes that one of two things is true:

  • The source never emits the same primary key twice in a single sync attempt.
  • If the source emits the same primary key multiple times in a single attempt, it always emits those records in cursor order from oldest to newest.

If these conditions aren't met, you may see inaccurate data in Iceberg in the form of older records taking precedence over newer records. If this happens, use append or overwrite as your sync modes.

An unknown number of API sources have streams that don't meet these conditions. Airbyte knows the Stripe and Monday sources don't, but there are probably others.
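
To see why emission order matters, the following simplified model treats deduplication as last-write-wins per primary key within a batch, which mirrors the behavior described above. The records are illustrative.

```python
# Simplified model: the destination keeps the last record it sees per primary
# key within a batch, so out-of-cursor-order emission leaves a stale record.
records = [
    {"id": 1, "updated_at": "2025-01-02", "status": "shipped"},
    {"id": 1, "updated_at": "2025-01-01", "status": "pending"},  # older, emitted last
]

deduped = {}
for r in records:
    deduped[r["id"]] = r  # last write wins, regardless of cursor value

# The stale "pending" record wins because it arrived last.
print(deduped[1]["status"])  # -> pending
```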

Branching and data availability

Iceberg supports Git-like semantics over your data. This connector leverages those semantics to provide resilient syncs.

  • In each sync, each microbatch creates a new snapshot.
  • During truncate syncs, the connector writes the refreshed data to the airbyte_staging branch and replaces the main branch with airbyte_staging at the end of the sync. Since most query engines target the main branch, your existing data remains queryable throughout a truncate sync; at the end, main is atomically swapped to the new version.
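
You can observe these per-microbatch snapshots by inspecting the table's history, for example with pyiceberg, as in this sketch (catalog and table names are placeholders):

```python
# Inspect the snapshots produced by sync microbatches.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("my_catalog")  # assumes catalog config in ~/.pyiceberg.yaml
table = catalog.load_table("my_namespace.my_table")

# Each microbatch appends a snapshot to the table's history.
for snapshot in table.snapshots():
    print(snapshot.snapshot_id, snapshot.timestamp_ms, snapshot.summary)
```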

Branch replacement

At the end of a truncate sync, the connector replaces the current main branch with the airbyte_staging branch. It intentionally avoids fast-forwarding to better handle potential compaction issues.

Important warning: any changes made to the main branch outside of Airbyte's operations after a sync begins will be lost during this process.

Compaction

caution

To prevent data loss, do not run compaction during a truncate refresh sync. During a truncate refresh sync, the system deletes all files that don't belong to the latest generation. This includes:

  • Files without generation IDs (compacted files)
  • Files from previous generations

If compaction runs simultaneously with the sync, the sync can delete newly compacted files that still hold current-generation data, causing data loss.
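
If you schedule compaction yourself, for example with Iceberg's rewrite_data_files Spark procedure, make sure it never overlaps a truncate refresh sync. A minimal PySpark sketch, assuming a Spark session already configured with your Iceberg catalog and placeholder names:

```python
# Run Iceberg's rewrite_data_files compaction procedure from PySpark.
# Schedule this so it never overlaps a truncate refresh sync.
from pyspark.sql import SparkSession

# Assumes the session is configured with the Iceberg extensions and
# a catalog named my_catalog; all names are placeholders.
spark = SparkSession.builder.appName("iceberg-compaction").getOrCreate()

spark.sql(
    "CALL my_catalog.system.rewrite_data_files(table => 'my_namespace.my_table')"
)
```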

Reference

No configuration specification is available for this connector.

Changelog

| Version | Date | Pull Request | Subject |
|---------|------|--------------|---------|
| 1.0.2 | 2025-11-13 | 69317 | Connector generally available |
| 1.0.1 | 2025-11-13 | 69212 | Initial release of GCS Data Lake destination with BigLake and Polaris catalog support |