Conduit Architecture

Here is an overview of the internal Conduit Architecture.

Conduit is split into the following layers:

API layer

Exposes the public APIs used to communicate with Conduit. It exposes 2 types of APIs:

gRPC

This is the main API provided by Conduit. The gRPC API definition can be found in api.proto, it can be used to generate code for the client.

HTTP

The HTTP API is generated using grpc-gateway and forwards the requests to the gRPC API. Conduit exposes an openapi definition that describes the HTTP API, which is also exposed through Swagger UI on http://localhost:8080/openapi/.

Orchestration layer

The orchestration layer is responsible for coordinating the flow of operations between the core services. It also takes care of transactions, making sure that changes made to specific entities are not visible to the outside until the whole operation succeeded. There are five orchestrators, each responsible for actions related to one of the main entities:

Pipeline Orchestrator.
Connector Orchestrator.
Processor Orchestrator.
Connector Plugin Orchestrator.
Processor Plugin Orchestrator.

Core layer

We regard the core to be the combination of the entity management layer and the pipeline engine. It provides functionality to the orchestrator layer and does not concern itself with where requests come from and how single operations are combined into more complex flows.

Entity management

This layer is concerned with the creation, editing, deletion and storage of the main entities. You can think about this as a simple CRUD layer. It can be split up further using the main entities:

Pipeline

This is the central entity managed by the Pipeline Service that ties together all other components. A pipeline contains the configuration that defines how pipeline nodes should be connected together in a running pipeline. It has references to at least one source and one destination connector and zero or multiple processors, a pipeline that does not meet the criteria is regarded as incomplete and can't be started. A pipeline can be either running, stopped or degraded (stopped because of an error). The pipeline can only be edited if it's not in a running state.

Connector

A connector takes care of receiving or forwarding records to connector plugins, depending on its type (source or destination). It is also responsible for tracking the connector state as records flow through it. The Connector Service manages the creation of connectors and permanently stores them in the Connector Store. A connector can be configured to reference a number of processors, which will be executed only on records that are received from or forwarded to that specific connector.

Connector Plugin - interfaces with Conduit on one side, and with the standalone connector plugin on the other and facilitates the communication between them. A standalone connector plugin is a separate process that implements the interface defined in conduit-connector-protocol and provides the read/write functionality for a specific resource (e.g. a database).

Processor

Processors are stateless components that operate on a single record and can execute arbitrary actions before forwarding the record to the next node in the pipeline. A processor can also choose to drop a record without forwarding it. They can be attached either to a connector or to a pipeline, based on that they are either processing only records that flow from/to a connector or all records that flow through a pipeline.

Processor Plugin - interfaces with Conduit on one side, and with the standalone processor plugin on the other and facilitates the communication between them. A standalone processor plugin is a WASM binary that implements the interface defined in conduit-processor-sdk and provides the logic for processing a record (e.g. transforming its content).

Pipeline Engine

A pipeline engine is responsible for running a pipeline: reading records from a source, calling processors for the read data, sending it to destinations, writing failed records into a dead-letter queue, handling errors, etc. We sometimes call it a pipeline architecture; the terms are used interchangeably.

Conduit has two pipeline engines with two different internal architectures: v1 ( the default) and v2 (under preview).

Pipeline engine (architecture) v1

info

Pipeline architecture v1 is the default architecture used by Conduit.

The pipeline engine consists of nodes that can be connected with Go channels to form a data pipeline.

Nodes

A node is a lightweight component that runs in its own goroutine and runs as long as the incoming channel is open. As soon as the previous node stops forwarding records and closes its out channel, the current node also stops running and closes its out channel. This continues down the pipeline until all records are drained and the pipeline gracefully stops. In case a node experiences an error, all other nodes will be notified and stop running as soon as possible without draining the pipeline.

Pipeline engine (architecture) v2

warning

Pipeline architecture v2 is currently in preview and is not yet the default architecture. It can be enabled with the --preview.pipeline-arch-v2 flag when running Conduit.

Once we implement the remaining features and gather sufficient data on the new architecture's stability—both on the Conduit Platform and from community feedback—it will become the default architecture.

An engine with pipeline architecture v2 is composed of multiple tasks that send batches of records in a specific order. For example, a source task reads a batch of records from a source connector, then sends the batch to processor tasks (if any), which are then sent to a destination task.

The major differences between v1 and v2 are the following:

Goroutines:
- v1 uses multiple goroutines for a single pipeline. Individual records are moved between pipelines.
- v2 uses a single goroutine for all the work associated with a single source.
Batching:
- v1 collects individual records from a source. The Connector SDK builds batches at the destination.
- v2 collects batches of records from a source. The whole batch is moved as a single unit through the whole pipeline.

Benefits

Pipeline architecture v2 has shown significant performance improvements when compared to v1 that has been described here. It's also easier to maintain and troubleshoot due to its simplicity.

Limitations

Currently, only pipelines with a single source and a single destination are supported. We plan to add support for multiple destinations, followed by support for multiple sources.

Persistence layer

This layer is used directly by the Orchestration layer and indirectly by the Core layer, and Schema registry service (through stores) to persist data. It provides the functionality of creating transactions and storing, retrieving and deleting arbitrary data like configurations or state.

More information on storage.

Connector utility services

Schema registry service

The schema service is responsible for managing the schema of the records that flow through the pipeline. It provides functionality to infer a schema from a record. The schema is stored in the schema store and can be referenced by connectors and processors. By default, Conduit provides a built-in schema registry, but this service can be run separately from Conduit.

More information on Schema Registry.

scarf pixel conduit-site-docs-what-is-core-concepts

API layer​

gRPC​

HTTP​

Orchestration layer​

Core layer​

Entity management​

Pipeline​

Connector​

Processor​

Pipeline Engine​

Pipeline engine (architecture) v1​

Nodes​

Pipeline engine (architecture) v2​

Benefits​

Limitations​

Persistence layer​

Connector utility services​

Schema registry service​