Skip to main content

Databricks Lakehouse

Overview

This destination syncs data to Delta Lake on Databricks Lakehouse. Each stream is written to its own delta-table.

This connector requires a JDBC driver to connect to the Databricks cluster. By using the driver and the connector, you must agree to the JDBC ODBC driver license. This means that you can only use this connector to connect third party applications to Apache Spark SQL within a Databricks offering using the ODBC and/or JDBC protocols.

Currently, this connector requires 30+MB of memory for each stream. When syncing multiple streams, it may run into an out-of-memory error if the allocated memory is too small. This performance bottleneck is tracked in this issue. Once this issue is resolved, the connector should be able to sync an almost infinite number of streams with less than 500MB of memory.

Sync Mode

FeatureSupportNotes
Full Refresh SyncWarning: this mode deletes all previously synced data in the configured bucket path.
Incremental - Append Sync
Incremental - Deduped History
Namespaces

Data Source

Managed table

Please check Databricks documentation about What is managed tables

External data source

Databricks Delta Lake supports various cloud storage as the data source. Currently, only Amazon S3 and Azure Blob Storage are supported by this connector.

Note: The External data source is under development stage, and it is recommended to use Managed tables instead.

Configuration

CategoryParameterTypeNotes
DatabricksServer HostnamestringRequired. Example: abc-12345678-wxyz.cloud.databricks.com. See documentation. Please note that this is the server for the Databricks Cluster. It is different from the SQL Endpoint Cluster.
HTTP PathstringRequired. Example: sql/protocolvx/o/1234567489/0000-1111111-abcd90. See documentation.
PortstringOptional. Default to "443". See documentation.
Personal Access TokenstringRequired. Example: dapi0123456789abcdefghij0123456789AB. See documentation.
GeneralDatabricks catalogstringOptional. The name of the catalog. If not specified otherwise, the "hive_metastore" will be used.
Database schemastringOptional. The default schema tables are written. If not specified otherwise, the "default" will be used.
Purge Staging DatabooleanThe connector creates staging files and tables on S3 or Azure. By default, they will be purged when the data sync is complete. Set it to false for debugging purposes.
Data Source - S3Bucket NamestringName of the bucket to sync data into.
Bucket PathstringSubdirectory under the above bucket to sync the data into.
RegionstringSee documentation for all region codes.
Access Key IDstringAWS/Minio credential.
Secret Access KeystringAWS/Minio credential.
S3 Filename patternstringThe pattern allows you to set the file-name format for the S3 staging file(s), next placeholders combinations are currently supported: {date}, {date:yyyy_MM}, {timestamp}, {timestamp:millis}, {timestamp:micros}, {part_number}, {sync_id}, {format_extension}. Please, don't use empty space and not supportable placeholders, as they won't recognized.
Data Source - AzureAccount NamestringName of the account to sync data into.
Container NamestringContainer under the above account to sync the data into.
SAS tokenstringShared-access signature token for the above account.
Endpoint domain namestringUsually blob.core.windows.net.

⚠️ Please note that under "Full Refresh Sync" mode, data in the configured bucket and path will be wiped out before each sync. We recommend you provision a dedicated S3 or Azure resource for this sync to prevent unexpected data deletion from misconfiguration. ⚠️

Staging Files (Delta Format)

S3

Data streams are first written as staging delta-table (Parquet + Transaction Log) files on S3, and then loaded into Databricks delta-tables. All the staging files will be deleted after the sync is done. For debugging purposes, here is the full path for a staging file:

s3://<bucket-name>/<bucket-path>/<uuid>/<stream-name>

For example:

s3://testing_bucket/data_output_path/98c450be-5b1c-422d-b8b5-6ca9903727d9/users/_delta_log
↑ ↑ ↑ ↑ ↑
| | | | transaction log
| | | stream name
| | database schema
| bucket path
bucket name

Azure

Similarly, streams are first written to a staging location, but the Azure option uses CSV format. A staging table is created from the CSV files.

Unmanaged Spark SQL Table

Currently, all streams are synced into unmanaged Spark SQL tables. See documentation for details. In summary, you have full control of the location of the data underlying an unmanaged table. In S3, the full path of each data stream is:

s3://<bucket-name>/<bucket-path>/<database-schema>/<stream-name>

For example:

s3://testing_bucket/data_output_path/public/users
↑ ↑ ↑ ↑
| | | stream name
| | database schema
| bucket path
bucket name

In Azure, the full path of each data stream is:

abfss://<container-name>@<account-name>.dfs.core.windows.net/<database-schema>/<stream-name>

Please keep these data directories on S3/Azure. Otherwise, the corresponding tables will have no data in Databricks.

Output Schema

Each table will have the following columns:

ColumnTypeNotes
_airbyte_ab_idstringUUID.
_airbyte_emitted_attimestampData emission timestamp.
Data fields from the source streamvariousAll fields in the staging files will be expanded in the table.

Under the hood, an Airbyte data stream in Json schema is first converted to an Avro schema, then the Json object is converted to an Avro record, and finally the Avro record is outputted to the Parquet format. Because the data stream can come from any data source, the Json to Avro conversion process has arbitrary rules and limitations. Learn more about how source data is converted to Avro and the current limitations here.

Getting started

Requirements

  1. Credentials for a Databricks cluster. See documentation.
  2. Credentials for an S3 bucket or Azure container. See documentation.
  3. Grant the Databricks cluster full access to the S3 bucket or Azure container. Or mount it as Databricks File System (DBFS). See documentation.

Suppose you are interested in learning more about the Databricks connector or details on how the Delta Lake tables are created. You may want to consult the tutorial on How to Load Data into Delta Lake on Databricks Lakehouse.

CHANGELOG

VersionDatePull RequestSubject
1.0.02023-04-13#23965Added: Managed table storage type, Databricks Catalog field
0.3.12022-10-15#18032Add SSL=1 to the JDBC URL to ensure SSL connection.
0.3.02022-10-14#15329Add support for Azure storage.
2022-09-01#16243Fix Json to Avro conversion when there is field name clash from combined restrictions (anyOf, oneOf, allOf fields)
0.2.62022-08-05#14801Fix multiply log bindings
0.2.52022-07-15#14494Make S3 output filename configurable.
0.2.42022-07-14#14618Removed additionalProperties: false from JDBC destination connectors
0.2.32022-06-16#13852Updated stacktrace format for any trace message errors
0.2.22022-06-13#13722Rename to "Databricks Lakehouse".
0.2.12022-06-08#13630Rename to "Databricks Delta Lake" and add field orders in the spec.
0.2.02022-05-15#12861Use new public Databricks JDBC driver, and open source the connector.
0.1.52022-05-04#12578In JSON to Avro conversion, log JSON field values that do not follow Avro schema for debugging.
0.1.42022-02-14#10256Add -XX:+ExitOnOutOfMemoryError JVM option
0.1.32022-01-06#7622 #9153Upgrade Spark JDBC driver to 2.6.21 to patch Log4j vulnerability; update connector fields title/description.
0.1.22021-11-03#7288Support Json additionalProperties.
0.1.12021-10-05#6792Require users to accept Databricks JDBC Driver Terms & Conditions.
0.1.02021-09-14#5998Initial private release.