Troubleshooting MongoDB Sources
Connector Limitations
MongoDB Oplog and Change Streams
MongoDB's Change Streams are based on the replica set Oplog, which retains changes only for a limited window. Syncs that run less frequently than the Oplog's retention period may miss data.
We recommend adjusting the Oplog size for your MongoDB cluster to ensure it holds at least 24 hours of changes. For optimal results, we suggest expanding it to maintain a week's worth of data. To adjust your Oplog size, see the corresponding tutorials for MongoDB Atlas (fully-managed) and MongoDB shell (self-hosted).
If you are running into an issue similar to "invalid resume token", it may mean you need to:
- Increase the Oplog retention period.
- Increase the Oplog size.
- Increase the Airbyte sync frequency.
You can run the commands outlined in this tutorial to verify the current state of your Oplog. The expected output resembles:
configured oplog size: 10.10546875MB
log length start to end: 94400 (26.22hrs)
oplog first event time: Mon Mar 19 2012 13:50:38 GMT-0400 (EDT)
oplog last event time: Wed Oct 03 2012 14:59:10 GMT-0400 (EDT)
now: Wed Oct 03 2012 15:00:21 GMT-0400 (EDT)
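As a rough check, the `log length start to end` value in that output can be compared against your sync interval. The following Python sketch parses that line and applies a safety margin. The function names and the 2x margin are illustrative assumptions, not Airbyte or MongoDB recommendations.

```python
import re


def oplog_window_hours(replication_info: str) -> float:
    """Extract the Oplog window, in hours, from replication info text."""
    match = re.search(r"log length start to end:\s*(\d+)", replication_info)
    if match is None:
        raise ValueError("no 'log length start to end' line found")
    # The value is reported in seconds.
    return int(match.group(1)) / 3600


def window_covers_sync(window_hours: float, sync_interval_hours: float,
                       safety_factor: float = 2.0) -> bool:
    """Return True if the window exceeds the sync interval by the margin.

    safety_factor is an assumed margin; tune it for your own workload.
    """
    return window_hours >= sync_interval_hours * safety_factor


sample = (
    "configured oplog size: 10.10546875MB\n"
    "log length start to end: 94400 (26.22hrs)\n"
)
hours = oplog_window_hours(sample)
print(round(hours, 2))                # window length in hours
print(window_covers_sync(hours, 12))  # 12-hour sync interval
print(window_covers_sync(hours, 24))  # 24-hour sync interval
```

With the sample output above, a 12-hour sync interval fits inside the window with the assumed margin, while a 24-hour interval does not.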
When importing a large MongoDB collection for the first time, the import duration might exceed the Oplog retention period. The Oplog is crucial for incremental updates, and an invalid resume token will require the MongoDB collection to be re-imported to ensure no source updates were missed.
MongoDB CDC Limitations
MongoDB limits BSON documents to a maximum size of 16MB. During CDC (Change Data Capture) syncs, change stream events can exceed this limit when documents are large, causing a BSONObjectTooLarge error. This typically occurs during incremental syncs when change stream events include the full document content.
If you encounter this error, you have several options to resolve it:
- Switch the affected stream to Full Refresh sync mode instead of Incremental mode. Full Refresh does not use change streams and is not subject to this limitation.
- If you are using Post Image update capture mode, switch to Lookup mode. Lookup mode retrieves the current document state separately, which can reduce the size of change stream events.
- Restructure large documents in your MongoDB collection to stay under the 16MB limit.
- Deselect streams containing documents that exceed the size limit.
For more information about MongoDB's document size limits, see the MongoDB documentation on limits.
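To find out which documents are close to the limit before choosing one of the options above, one approach is to measure document sizes with the `$bsonSize` aggregation operator (available in MongoDB 4.4 and later). The sketch below only builds the pipeline. The helper name, the `bsonSize` output field, and the 50% threshold are assumptions made for illustration.

```python
BSON_LIMIT_BYTES = 16 * 1024 * 1024  # MongoDB's maximum BSON document size


def large_document_pipeline(fraction_of_limit: float = 0.5,
                            top_n: int = 10) -> list:
    """Build a pipeline that lists the largest documents in a collection.

    Documents smaller than fraction_of_limit * 16 MiB are filtered out.
    """
    threshold = int(BSON_LIMIT_BYTES * fraction_of_limit)
    return [
        # Compute the BSON size of each whole document; _id is kept by default.
        {"$project": {"bsonSize": {"$bsonSize": "$$ROOT"}}},
        {"$match": {"bsonSize": {"$gte": threshold}}},
        {"$sort": {"bsonSize": -1}},
        {"$limit": top_n},
    ]


pipeline = large_document_pipeline(fraction_of_limit=0.5, top_n=5)
print(pipeline)
```

With PyMongo, the result could be passed to `collection.aggregate(pipeline)`. Note that the pipeline reads every document in the collection, so it is best run against a secondary or during off-peak hours.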
Update Capture Mode: Lookup vs Post Image
When using CDC (Incremental sync mode), the Update Capture Mode setting determines how Airbyte retrieves the full document content for update events. The default mode is Lookup, but Lookup and Post Image have important behavioral differences:
Lookup (default)
Lookup fetches the document’s latest available state when the update event is processed.
If a document is updated multiple times in rapid succession, or if multiple updates happen between syncs, Airbyte may capture the newest available version instead of each intermediate state. When this happens, multiple update events can show the same final version of the document, and the intermediate full-document states are not captured.
Use Lookup when
- Your MongoDB version is earlier than 6.0.
- You only need the latest version of each document and do not need to capture every intermediate change.
- Your documents are very large and you want to reduce the size of change stream events.
Post Image (requires MongoDB 6.0+)
Post Image uses MongoDB's built-in change stream post-images, which capture the document state immediately after each individual change. This is useful when you need the exact document state after every update, rather than the latest available version that Lookup may return.
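The difference between the two modes can be illustrated with a toy in-memory model. This is not connector code: it assumes a burst of updates that are all processed after the last update has been applied, and the `simulate` function and sample document are hypothetical.

```python
from copy import deepcopy


def simulate(initial: dict, updates: list) -> tuple:
    """Return (lookup_view, post_image_view) for a burst of updates.

    Assumes every update event is processed only after the whole burst
    has been applied to the document.
    """
    current = deepcopy(initial)
    post_images = []
    for change in updates:
        current.update(change)
        # Post Image: the state right after this individual change.
        post_images.append(deepcopy(current))
    # Lookup: each event resolves to the document as it exists when the
    # event is processed, which here is the final state.
    lookup = [deepcopy(current) for _ in updates]
    return lookup, post_images


lookup, post = simulate(
    {"_id": 1, "status": "new"},
    [{"status": "paid"}, {"status": "shipped"}],
)
print(lookup)  # both events show the final state
print(post)    # each event shows its own state
```

In this model, Lookup reports `shipped` for both events, while Post Image reports `paid` and then `shipped`.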
When to use Post Image
- If you need accurate per-update document states.
- If your MongoDB version is 6.0 or later.
Requirements for Post Image mode
- MongoDB 6.0+
- Collections must be configured to return pre and post images. If this configuration is not enabled, Airbyte may not be able to retrieve the expected document state for update events.
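In MongoDB 6.0 and later, pre- and post-images are enabled per collection with the `collMod` command and its `changeStreamPreAndPostImages` option. The sketch below builds that command as a Python dictionary. The helper name and the `orders` collection are hypothetical, and the command should be checked against the MongoDB documentation for your server version before you run it.

```python
def enable_pre_post_images_command(collection_name: str) -> dict:
    """Build a collMod command that enables change stream pre- and post-images."""
    return {
        # The command name must be the first key.
        "collMod": collection_name,
        "changeStreamPreAndPostImages": {"enabled": True},
    }


command = enable_pre_post_images_command("orders")  # "orders" is a placeholder
print(command)
```

With PyMongo, the dictionary could be passed to `db.command(command)` on the database that owns the collection. The user running it needs permission to modify the collection.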
Post Image can increase the size of change stream events because the full document is included in the event. For very large documents, change stream events may exceed MongoDB’s 16 MiB BSON limit and fail with BSONObjectTooLarge errors. This risk can also apply to Lookup when the full document and change event metadata are large.
Supported MongoDB Clusters
- Only the replica set cluster type is supported.
- TLS/SSL is required by this connector. TLS/SSL is enabled by default for MongoDB Atlas clusters. To enable a TLS/SSL connection for a self-hosted MongoDB instance, refer to the MongoDB documentation.
- Views, capped collections and clustered collections are not supported.
- Empty collections are excluded from schema discovery.
- Collections whose documents use different data types for the _id field are not supported. All _id values within a collection must be the same data type.
- Atlas DB clusters are only supported on a dedicated M10 tier and above. Lower tiers may fail during connection setup.
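To check whether a collection mixes _id data types, one option is to group its documents by the BSON type of `_id` using the `$type` aggregation operator. The following is a minimal sketch, with hypothetical helper names, that builds the pipeline and interprets its result.

```python
def id_type_pipeline() -> list:
    """Build a pipeline that counts documents per BSON type of _id."""
    return [
        {"$group": {"_id": {"$type": "$_id"}, "count": {"$sum": 1}}},
        {"$sort": {"count": -1}},
    ]


def has_mixed_id_types(type_counts: list) -> bool:
    """Return True if the aggregation result contains more than one _id type."""
    return len({row["_id"] for row in type_counts}) > 1


# Example result rows, shaped like the output of the pipeline above.
rows = [{"_id": "objectId", "count": 10}, {"_id": "string", "count": 2}]
print(has_mixed_id_types(rows))
```

With PyMongo, `list(collection.aggregate(id_type_pipeline()))` would produce rows in this shape. The pipeline scans the whole collection, so run it with care on large collections.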
Schema Discovery & Enforcement
- Schema discovery samples documents to collect all distinct top-level fields. The approach is modelled after MongoDB Compass sampling and is used for efficiency. By default, 10,000 documents are sampled, and this sample size applies to every collection discovered in the target database. The value can be increased up to 100,000 documents to increase the likelihood that all fields will be discovered. The trade-off is time: a higher value makes sampling each collection take longer.
- When running with Schema Enforced set to false, there is no attempt to discover any schema. See more in Schema Enforcement.
Schema discovery performance impact
Because MongoDB collections are schemaless, documents in the same collection can have different fields and data types. The connector attempts to infer a schema by sampling documents, but no sample size can guarantee a complete or stable schema. New fields can be added to documents at any time, and a schema derived from today's sample may not represent tomorrow's data. Keep this inherent limitation in mind when choosing between schema-enforced and schemaless modes.
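A back-of-the-envelope calculation shows why sample size matters. If a field appears in a fraction p of documents and the sample draws n documents independently and uniformly, the chance of never seeing the field is (1 - p)^n. Independent uniform sampling is a simplifying assumption here, and the function name is illustrative.

```python
def miss_probability(field_frequency: float, sample_size: int) -> float:
    """Chance that a field never appears in a random sample.

    field_frequency is the fraction of documents containing the field.
    Assumes documents are sampled independently and uniformly.
    """
    if not 0.0 <= field_frequency <= 1.0:
        raise ValueError("field_frequency must be between 0 and 1")
    return (1.0 - field_frequency) ** sample_size


# A field present in 1 of every 10,000 documents.
for n in (1_000, 10_000, 100_000):
    print(n, miss_probability(1 / 10_000, n))
```

Under these assumptions, such a field is missed roughly 90% of the time with 1,000 samples and roughly 37% of the time with the default 10,000. At 100,000 samples, a miss becomes very unlikely.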
When schema enforcement is enabled, the Discover phase executes a $sample aggregation pipeline against every collection in each configured database. These pipelines run concurrently using parallel threads, one per collection. Each pipeline samples up to 10,000 documents by default, then processes them through $project, $unwind, and $group stages to extract field names and types.
On clusters with hundreds of collections, this means hundreds of simultaneous aggregation queries hitting the database at once. The $sample stage performs a random collection scan, which can be I/O-intensive on large collections. Combined with the downstream aggregation stages, this can exhaust available CPU and memory on your MongoDB nodes.
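For a sense of what each per-collection query involves, the sketch below builds a pipeline with the same four stages. It is an approximation written for illustration, not the connector's actual pipeline: the use of `$objectToArray`, the `fields` name, and the output shape are assumptions.

```python
def field_discovery_pipeline(sample_size: int = 10_000) -> list:
    """Build an illustrative pipeline that lists top-level fields and types."""
    return [
        # Randomly pick up to sample_size documents.
        {"$sample": {"size": sample_size}},
        # Turn each document into an array of {k: field name, v: value} pairs.
        {"$project": {"fields": {"$objectToArray": "$$ROOT"}}},
        # Emit one record per field per sampled document.
        {"$unwind": "$fields"},
        # Collect the distinct BSON types seen for each field name.
        {"$group": {
            "_id": "$fields.k",
            "types": {"$addToSet": {"$type": "$fields.v"}},
        }},
    ]


pipeline = field_discovery_pipeline(sample_size=1_000)
print([list(stage)[0] for stage in pipeline])
```

Because `$unwind` emits one record per field per sampled document, the work grows with both the sample size and the number of fields, which is why many of these queries running at once can be costly.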
Recommended approaches
These approaches address the root cause of the performance risk by reducing or eliminating the discovery workload.
- Disable schema enforcement. Set Schema Enforced to false to skip the sampling-based discovery entirely. In schemaless mode, the connector samples only one document per collection to confirm the _id field exists. This dramatically reduces the load on your cluster, but all data is returned as a single JSON object per document rather than individual typed fields. See Schema Enforcement for configuration details.
- Reduce the discovery sample size. If you need schema enforcement, lower the Discovery Sample Size setting to reduce the number of documents sampled per collection. The default is 10,000. A smaller value such as 1,000 reduces the load on your cluster but may miss fields in collections with highly variable document structures. See the Discovery Sample Size configuration parameter.
Other alternatives
These approaches do not reduce the discovery workload itself, but can help isolate it from your production traffic.
- Direct reads to a secondary node. Add readPreference=secondary or readPreference=secondaryPreferred to your MongoDB connection string. This routes the discovery queries to a secondary replica set member instead of the primary, protecting your primary node from the additional load.
  mongodb+srv://cluster0.abcd1.mongodb.net/?readPreference=secondaryPreferred
- Use MongoDB Atlas analytics nodes. If you use MongoDB Atlas (M10 tier or above), you can provision analytics nodes that are isolated from your operational workload. Direct Airbyte's reads to an analytics node by adding read preference tags to your connection string:
  mongodb+srv://cluster0.abcd1.mongodb.net/?readPreference=secondary&readPreferenceTags=nodeType:ANALYTICS
  This fully isolates the discovery workload from your production traffic.
- Schedule syncs during off-peak hours. If you cannot isolate the read workload, schedule your Airbyte syncs to run during periods of low production traffic. Schema discovery runs at the start of every sync, so timing matters.
- Reduce the number of configured databases. The connector discovers collections across all configured databases. If you only need data from specific databases, remove unnecessary databases from your source configuration to reduce the total number of collections discovered.
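The connection-string changes described for secondary and analytics nodes can be applied programmatically. The following Python sketch uses only the standard library to add or replace the read preference options on an existing URI. The `with_read_preference` helper is hypothetical, and the cluster hostname is the placeholder used in the examples above.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_read_preference(uri: str, mode: str = "secondaryPreferred",
                         tags: str = "") -> str:
    """Return the URI with readPreference (and optional tags) set."""
    parts = urlsplit(uri)
    # Keep existing options except any previous read preference settings.
    params = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("readPreference", "readPreferenceTags")
    ]
    params.append(("readPreference", mode))
    if tags:
        params.append(("readPreferenceTags", tags))
    # Options must follow a "/" after the host list.
    path = parts.path or "/"
    query = urlencode(params, safe=":,")  # keep tag separators readable
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


base = "mongodb+srv://cluster0.abcd1.mongodb.net/"
print(with_read_preference(base))
print(with_read_preference(base, "secondary", "nodeType:ANALYTICS"))
```

The helper replaces any read preference already present rather than appending a duplicate, so it can be applied repeatedly to the same URI.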
Vendor-Specific Connector Limitations
Not all implementations or deployments of a database will be the same. This section lists specific limitations and known issues with the connector based on how or where it is deployed.
Self-Hosted MongoDB
Airbyte does not support self-signed SSL certificates for SSH tunnels.
AWS DocumentDB
The Airbyte connector does not support custom SSL certificates, which DocumentDB requires.