The Airbyte Specification
As a quick recap, the Airbyte Specification requires an Airbyte Source to support 4 distinct operations:
Spec - The configuration required to interact with the underlying technical system, e.g. database information, authentication information, etc.
Check - Validate that the provided configuration is correct and grants sufficient permissions to perform all required operations on the Source.
Discover - Discover the Source's schema. This lets users select a subset of the data to sync, which is useful when they do not need all of it.
Read - Perform the actual syncing process. Data is read from the Source, parsed into AirbyteRecordMessages, and sent to the Airbyte Destination. Depending on how the Source is implemented, this sync can be incremental or a full refresh.
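The four operations can be pictured as an interface that every source implements. The sketch below is a simplified, standard-library-only stand-in and not the CDK's actual classes: `ToySource`, `InMemorySource`, the `api_key` field, and the shape of the dictionaries are all hypothetical names chosen for illustration.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, Iterator, List


class ToySource(ABC):
    """Hypothetical stand-in for the four operations; not the CDK's Source class."""

    @abstractmethod
    def spec(self) -> Dict[str, Any]:
        """Describe the configuration the source needs."""

    @abstractmethod
    def check(self, config: Dict[str, Any]) -> bool:
        """Return True if the configuration is usable."""

    @abstractmethod
    def discover(self, config: Dict[str, Any]) -> List[str]:
        """List the names of the streams that can be synced."""

    @abstractmethod
    def read(self, config: Dict[str, Any], selected: List[str]) -> Iterator[Dict[str, Any]]:
        """Yield one record message per row of each selected stream."""


class InMemorySource(ToySource):
    """Toy source whose 'underlying system' is a dict mapping names to rows."""

    def __init__(self, tables: Dict[str, List[Dict[str, Any]]]):
        self._tables = tables

    def spec(self):
        return {"type": "object", "required": ["api_key"]}

    def check(self, config):
        # The configuration is valid when every required key is present.
        return all(key in config for key in self.spec()["required"])

    def discover(self, config):
        return sorted(self._tables)

    def read(self, config, selected):
        for name in selected:
            for row in self._tables[name]:
                yield {"type": "RECORD", "stream": name, "data": row}


source = InMemorySource({"charges": [{"id": 1}], "transactions": [{"id": 7}, {"id": 8}]})
config = {"api_key": "secret"}
is_valid = source.check(config)
available = source.discover(config)
messages = list(source.read(config, ["transactions"]))
```

Here the user picks only `transactions` after Discover, so Read emits two record messages and never touches `charges`.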
A core concept discussed here is the Source.
The Source contains one or more Streams (or Airbyte Streams). A Stream is the other key concept for understanding how Airbyte models the data syncing process. A Stream models one of the logical data groups that make up the larger Source. If the Source is an RDBMS, each Stream is a table. In a REST API setting, each Stream corresponds to one resource within the API; e.g. a Stripe Source would have one Stream for Transactions, one for Charges, and so on.
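As a rough mental model, a Source is a named container of Streams. The dataclasses below are purely illustrative: `SourceModel`, `StreamModel`, and the `postgres` example are hypothetical and do not correspond to CDK types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StreamModel:
    """Hypothetical description of one logical group of data."""
    name: str


@dataclass
class SourceModel:
    """Hypothetical description of a source and the streams it exposes."""
    name: str
    streams: List[StreamModel] = field(default_factory=list)


# A relational database exposes one stream per table...
database = SourceModel("postgres", [StreamModel("users"), StreamModel("orders")])
# ...while an API exposes one stream per resource.
stripe = SourceModel("stripe", [StreamModel("transactions"), StreamModel("charges")])

stream_names = [stream.name for stream in stripe.streams]
```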
Airbyte provides abstract base classes which make it much easier to perform certain categories of tasks, e.g. HttpStream makes it easy to create HTTP API-based streams. However, if those do not satisfy your use case (for example, if you're pulling data from a relational database), you can always directly implement the Airbyte Protocol by subclassing the CDK's Source class.
The Source class implements the Spec operation by looking for a file named spec.yaml (or spec.json) in the module's root by default. This is expected to be a JSON Schema file that specifies the required configuration. The Exchange Rates source contains an example of such a file.
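To give a feel for what such a file might contain, the sketch below parses a small hypothetical spec and reports which required settings a user configuration is missing. It is not the Exchange Rates spec: the `base` and `start_date` fields and the `missing_required` helper are invented for illustration, and nesting the schema under a top-level `connectionSpecification` key is an assumption about the file layout.

```python
import json
from typing import Any, Dict, List

# Hypothetical contents of a spec file; the field names are illustrative only.
SPEC_TEXT = """
{
  "connectionSpecification": {
    "$schema": "http://json-schema.org/draft-07/schema#",
    "type": "object",
    "required": ["base", "start_date"],
    "properties": {
      "base": {"type": "string", "description": "Base currency code"},
      "start_date": {"type": "string", "description": "First day to sync, YYYY-MM-DD"}
    }
  }
}
"""


def missing_required(spec: Dict[str, Any], config: Dict[str, Any]) -> List[str]:
    """Return the required configuration keys that the user did not supply."""
    schema = spec["connectionSpecification"]
    return [key for key in schema.get("required", []) if key not in config]


spec = json.loads(SPEC_TEXT)
missing = missing_required(spec, {"base": "USD"})
complete = missing_required(spec, {"base": "USD", "start_date": "2021-01-01"})
```

This only checks for the presence of required keys; full JSON Schema validation (types, formats, nested objects) would need a dedicated validator.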
Note that while this is the most flexible way to implement a source connector, it is also the most toilsome, as you will be required to manually manage state, validate input, conform to the Airbyte Protocol message formats, and more. We recommend using a subclass of Source unless you cannot fulfill your use case otherwise.
AbstractSource is a more opinionated implementation of Source. It implements Source's 4 methods as follows:
Check delegates to the check_connection function. The function's config parameter contains the user-provided configuration, specified in the spec.yaml returned by Spec. check_connection uses this configuration to validate access and permissions. An example can be found in the source for the same Exchange Rates API.
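The general shape of such a check can be sketched as follows. This is not the Exchange Rates connector's code: `ToyApiSource`, `fake_ping`, and the `access_key` setting are hypothetical, the class deliberately does not subclass the CDK so the snippet runs on its own, and the `(logger, config)` parameters with a `(success, error)` return value are modelled on the CDK as an assumption.

```python
import logging
from typing import Any, Callable, Mapping, Optional, Tuple


class ToyApiSource:
    """Hypothetical stand-in; a real connector would subclass the CDK's AbstractSource."""

    def __init__(self, ping: Callable[[Mapping[str, Any]], None]):
        # `ping` stands in for a cheap authenticated request to the API.
        self._ping = ping

    def check_connection(
        self, logger: logging.Logger, config: Mapping[str, Any]
    ) -> Tuple[bool, Optional[str]]:
        """Return (True, None) on success, or (False, reason) on failure."""
        try:
            self._ping(config)
        except Exception as error:  # report any failure instead of raising
            logger.warning("Connection check failed: %s", error)
            return False, str(error)
        return True, None


def fake_ping(config: Mapping[str, Any]) -> None:
    """Pretend API call that rejects a missing access key."""
    if not config.get("access_key"):
        raise PermissionError("access_key is missing or empty")


source = ToyApiSource(fake_ping)
logger = logging.getLogger("toy-source")
ok_result = source.check_connection(logger, {"access_key": "abc123"})
bad_result = source.check_connection(logger, {})
```

Returning the failure reason rather than raising lets the caller show the user why the configuration was rejected.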
Stream Abstract Base Class
AbstractSource also owns a set of Streams. This is populated via the streams function. Discover and Read rely on this populated set.
Discover returns an AirbyteCatalog representing all the distinct resources the underlying API supports. Here is the entrypoint for those interested in reading the code. See schemas for more information on how to declare the schema of a stream.
Read creates an in-memory stream reading from each of the AbstractSource's streams. Here is the entrypoint for those interested.
As the code examples show, the AbstractSource delegates to the set of Streams it owns to fulfill both Discover and Read. Thus, implementing the streams function is required when using the CDK.
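The delegation described above can be illustrated with a small self-contained model. None of these classes are the CDK's: `ToyStream`, `ToyAbstractSource`, `ToyStripeSource`, and the dictionary-based catalog and record messages are hypothetical simplifications meant only to show why Discover and Read both depend on `streams`.

```python
from typing import Any, Dict, Iterator, List, Mapping, Set


class ToyStream:
    """Hypothetical stream: a name, a JSON schema, and some records."""

    def __init__(self, name: str, schema: Dict[str, Any], records: List[Dict[str, Any]]):
        self.name = name
        self.schema = schema
        self._records = records

    def read_records(self) -> Iterator[Dict[str, Any]]:
        yield from self._records


class ToyAbstractSource:
    """Hypothetical stand-in showing how Discover and Read lean on streams()."""

    def streams(self, config: Mapping[str, Any]) -> List[ToyStream]:
        raise NotImplementedError("subclasses must return their streams")

    def discover(self, config: Mapping[str, Any]) -> Dict[str, Any]:
        # The catalog is a description of every stream the source owns.
        return {
            "streams": [
                {"name": stream.name, "json_schema": stream.schema}
                for stream in self.streams(config)
            ]
        }

    def read(self, config: Mapping[str, Any], selected: Set[str]) -> Iterator[Dict[str, Any]]:
        # Read each selected stream in turn and wrap rows as record messages.
        for stream in self.streams(config):
            if stream.name in selected:
                for row in stream.read_records():
                    yield {"type": "RECORD", "stream": stream.name, "data": row}


class ToyStripeSource(ToyAbstractSource):
    def streams(self, config):
        id_schema = {"type": "object", "properties": {"id": {"type": "integer"}}}
        return [
            ToyStream("transactions", id_schema, [{"id": 1}, {"id": 2}]),
            ToyStream("charges", id_schema, [{"id": 10}]),
        ]


source = ToyStripeSource()
catalog = source.discover({})
records = list(source.read({}, selected={"charges"}))
```

The subclass only supplies `streams`; both `discover` and `read` are inherited, which mirrors why implementing that one function is the main obligation.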
A summary of what we've covered so far on how to use the Airbyte CDK:
- A concrete implementation of the AbstractSource object is required.
- This involves:
- implementing the check_connection function.
- creating the appropriate Stream classes and returning them in the streams function.
- placing the above-mentioned spec.yaml file in the right place.
- implementing the
We've covered how the AbstractSource works with the Stream interface in order to fulfill the Airbyte Specification. Although developers are welcome to implement their own object, the CDK saves developers the hassle of doing so in the case of HTTP APIs with the