Understanding the YAML file
The low-code framework involves editing a boilerplate YAML file. This section takes a deep dive into the components of that YAML file.
Streams define the schema of the data to sync, as well as how to read it from the underlying API source. A stream generally corresponds to a resource within the API. They are analogous to tables for a relational database source.
By default, the schema of a stream's data is defined as a JSONSchema file stored on disk, one file per stream.
Alternately, the stream's data schema can be stored in YAML format inline in the YAML file by including the optional schema_loader key. If the data schema is provided inline, any schema on disk for that stream will be ignored.
More information on how to define a stream's schema can be found here.
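As an illustration, keeping the schema inline via the schema_loader key might be sketched as follows (the property names `id` and `updated_at` are hypothetical, not part of the framework):

```yaml
schema_loader:
  type: InlineSchemaLoader
  schema:
    $schema: "http://json-schema.org/draft-07/schema#"
    type: object
    properties:
      id:            # hypothetical record field
        type: string
      updated_at:    # hypothetical record field
        type: string
        format: date-time
```

With this block present, any schema file on disk for the stream is ignored in favor of the inline definition.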
The stream object is represented in the YAML file as:

```yaml
DeclarativeStream:
  description: A stream whose behavior is described by a set of declarative low code components
  type: object
  properties:
    schema_loader:
      definition: The schema loader used to retrieve the schema for the current stream
      anyOf:
        - "$ref": "#/definitions/InlineSchemaLoader"
        - "$ref": "#/definitions/JsonFileSchemaLoader"
    stream_cursor_field:
      definition: The field of the records being read that will be used during checkpointing
      anyOf:
        - type: string
        - type: array
          items:
            type: string
    transformations:
      definition: A list of transformations to be applied to each output record in the stream
      type: array
      items:
        anyOf:
          - "$ref": "#/definitions/AddFields"
          - "$ref": "#/definitions/CustomTransformation"
          - "$ref": "#/definitions/RemoveFields"
```
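Tying these keys together, a stream definition that applies a transformation might be sketched like this (the stream name, retriever reference, and added field are hypothetical):

```yaml
customers_stream:
  type: DeclarativeStream
  name: "customers"        # hypothetical stream name
  primary_key: "id"
  retriever:
    "$ref": "#/definitions/retriever"
  transformations:
    # AddFields appends a new field to every output record
    - type: AddFields
      fields:
        - path: ["source_name"]
          value: "example-api"   # hypothetical static value
```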
More details on streams and sources can be found in the basic concepts section.
Configuring a stream for incremental syncs
If you want to allow your stream to be configured so that only data that has changed since the prior sync is replicated to a destination, you can specify a DatetimeBasedCursor on your stream.
Given a start time, an end time, and a step function, it will partition the interval [start, end] into small windows of the size described by the step.
More information on incremental_sync configurations and the DatetimeBasedCursor component can be found in the incremental syncs section.
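As a sketch, an incremental_sync configuration using DatetimeBasedCursor might look like the following (the cursor field, datetime format, and config key are hypothetical; check the reference for the exact set of required fields):

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: "updated_at"                    # hypothetical record field to checkpoint on
  datetime_format: "%Y-%m-%dT%H:%M:%S"
  start_datetime: "{{ config['start_date'] }}"  # hypothetical key in the user-supplied config
  end_datetime: "{{ now_utc().strftime('%Y-%m-%dT%H:%M:%S') }}"
  step: "P1D"                                   # partition [start, end] into one-day windows
  cursor_granularity: "PT1S"
```

The step is an ISO 8601 duration; together with the start and end datetimes it determines how the sync interval is split into request windows.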
The data retriever defines how to read the data for a Stream and acts as an orchestrator for the data retrieval flow.
It is described by:
- Requester: Describes how to submit requests to the API source
- Paginator: Describes how to navigate through the API's pages
- Record selector: Describes how to extract records from an HTTP response
- Partition router: Describes how to retrieve data across multiple resource locations
Each of those components (and their subcomponents) is defined by an explicit interface and one or more implementations. The developer can choose and configure the implementation they need depending on the specifications of the integration they are building against.
The Retriever is defined as part of the Stream configuration; different Streams for a given Source can use different Retriever definitions if needed.
The schema of a retriever object is:

```yaml
SimpleRetriever:
  description: Retrieves records by synchronously sending requests to fetch records. The retriever acts as an orchestrator between the requester, the record selector, the paginator, and the partition router.
```
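A concrete retriever composed of those parts might be sketched as follows (the URL, path, and field names are hypothetical):

```yaml
retriever:
  type: SimpleRetriever
  requester:
    type: HttpRequester
    url_base: "https://api.example.com/v1"   # hypothetical API
    path: "/customers"
    http_method: "GET"
  record_selector:
    type: RecordSelector
    extractor:
      type: DpathExtractor
      field_path: ["results"]                # assumes records live under a "results" key
  paginator:
    type: NoPagination
```

Swapping any subcomponent (for example, replacing NoPagination with a cursor- or offset-based paginator) changes only that part of the retrieval flow; the rest of the configuration is unaffected.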
Routing to Data that is Partitioned in Multiple Locations
Some sources might require specifying additional parameters that are needed to retrieve data. Using the PartitionRouter component, you can specify a static or dynamic set of elements that will be iterated over and made available for use when a connector dispatches requests to get data from a source.
More information on how to configure the partition_router field on a Retriever to retrieve data from multiple locations can be found in the iteration section.
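For example, iterating over a static set of locations with a ListPartitionRouter might be sketched as (the values and the cursor field name are hypothetical):

```yaml
partition_router:
  type: ListPartitionRouter
  values: ["A", "B", "C"]     # static set of partitions to iterate over
  cursor_field: "location"    # hypothetical name requests can reference per partition
```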
Combining Incremental Syncs and Iterable Locations
A stream can be configured to support incrementally syncing data that is spread across multiple partitions by defining both an incremental_sync on the stream and a partition_router on the stream's retriever.
During a sync where both are configured, the Cartesian product of these parameters will be calculated and the connector will repeat requests to the source using the different combinations of parameters to get all of the data.
For example, if we had a DatetimeBasedCursor requesting data over a 3-day range partitioned by day, and a ListPartitionRouter with the three locations A, B, and C, this would result in the following combinations being used to request data:
| Partition | Date Range                                |
| --------- | ----------------------------------------- |
| A         | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
| B         | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
| C         | 2022-01-01T00:00:00 - 2022-01-01T23:59:59 |
| A         | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
| B         | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
| C         | 2022-01-02T00:00:00 - 2022-01-02T23:59:59 |
| A         | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
| B         | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
| C         | 2022-01-03T00:00:00 - 2022-01-03T23:59:59 |
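The 3-day-by-3-location example could be sketched as the following combined configuration (dates, cursor field, and location names are hypothetical); the connector would issue one request per combination of partition and date window:

```yaml
incremental_sync:
  type: DatetimeBasedCursor
  cursor_field: "updated_at"            # hypothetical record field
  datetime_format: "%Y-%m-%dT%H:%M:%S"
  start_datetime: "2022-01-01T00:00:00"
  end_datetime: "2022-01-03T23:59:59"
  step: "P1D"                           # a 3-day range partitioned by day
  cursor_granularity: "PT1S"
retriever:
  type: SimpleRetriever
  partition_router:
    type: ListPartitionRouter
    values: ["A", "B", "C"]             # the three locations
    cursor_field: "location"
```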