
openai-embeddings

Generate embeddings for records using OpenAI models.

Description

The openai-embeddings processor generates vector embeddings for a record using OpenAI's embeddings API.

Configuration parameters

version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          # APIKey is the OpenAI API key.
          # Type: string
          api_key: ""
          # BackoffFactor is the factor by which the backoff increases. Defaults
          # to 2.0
          # Type: float
          backoff_factor: "2.0"
          # Dimensions is the number of dimensions the resulting output
          # embeddings should have.
          # Type: int
          dimensions: ""
          # EncodingFormat is the format to return the embeddings in. Can be
          # "float" or "base64".
          # Type: string
          encoding_format: ""
          # Field is the reference to the field to process. Defaults to
          # ".Payload.After".
          # Type: string
          field: ".Payload.After"
          # InitialBackoff is the initial backoff duration in milliseconds.
          # Defaults to 1000ms (1s).
          # Type: int
          initial_backoff: "1000"
          # MaxBackoff is the maximum backoff duration in milliseconds. Defaults
          # to 30000ms (30s).
          # Type: int
          max_backoff: "30000"
          # MaxRetries is the maximum number of retries for API calls. Defaults
          # to 3.
          # Type: int
          max_retries: "3"
          # Model is the OpenAI embeddings model to use (e.g.,
          # text-embedding-3-small).
          # Type: string
          model: ""
          # Whether to decode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.decode.key.enabled: "true"
          # Whether to decode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.decode.payload.enabled: "true"
          # Whether to encode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.encode.key.enabled: "true"
          # Whether to encode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.encode.payload.enabled: "true"
          # User is the user identifier for OpenAI API.
          # Type: string
          user: ""
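
The retry settings above (initial_backoff, backoff_factor, max_backoff, max_retries) describe an exponential backoff. As a rough illustration of how those four values might interact, the sketch below computes the wait before each retry. It assumes plain exponential growth capped at max_backoff, with no jitter; the processor's actual retry logic is not shown in this page and may differ. The function name backoff_delays_ms is hypothetical.

```python
def backoff_delays_ms(initial_backoff=1000, backoff_factor=2.0,
                      max_backoff=30000, max_retries=3):
    """Return the assumed wait (in ms) before each retry attempt.

    Assumption: delay starts at initial_backoff, is multiplied by
    backoff_factor after every attempt, and never exceeds max_backoff.
    """
    delays = []
    delay = float(initial_backoff)
    for _ in range(max_retries):
        delays.append(int(min(delay, max_backoff)))
        delay *= backoff_factor
    return delays


# With the documented defaults there are three retries.
print(backoff_delays_ms())  # [1000, 2000, 4000]
```

Under these assumptions, raising max_retries past five with the default factor would make the cap matter: the sixth delay would be 32000 ms uncapped, so it is held at 30000 ms.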

Examples

Generate embeddings for text

This example generates embeddings for the text stored in .Payload.After. The processor replaces the text in that field with the embedding, serialized as a JSON array of floating-point numbers. These embeddings can be used for semantic search, clustering, or other machine learning tasks.

Configuration parameters

version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          api_key: "your-openai-api-key"
          backoff_factor: "2.0"
          field: ".Payload.After"
          initial_backoff: "1000"
          max_backoff: "30000"
          max_retries: "3"
          model: "text-embedding-3-small"

Record difference

Lines marked - show the record before processing; lines marked + show it after.

  {
    "position": "dGVzdC1wb3NpdGlvbg==",
    "operation": "create",
    "metadata": {
      "key1": "val1"
    },
    "key": "test-key",
    "payload": {
      "before": null,
-     "after": "This is a sample text to generate embeddings for."
+     "after": "[0.1,0.2,0.3,0.4,0.5]"
    }
  }
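
For the downstream uses mentioned in this example (such as semantic search), the value written to payload.after has to be parsed back into numbers. The sketch below is one way a consumer might do that, assuming the field holds a JSON array string exactly as in the record above. The helper names parse_embedding and cosine_similarity are hypothetical and are not part of Conduit.

```python
import json
import math


def parse_embedding(after):
    """Decode a JSON array string (as in payload.after) into a list of floats."""
    return [float(x) for x in json.loads(after)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# The "after" value from the example record above.
doc_vec = parse_embedding("[0.1,0.2,0.3,0.4,0.5]")

# A made-up query vector, for illustration only.
query_vec = [0.5, 0.4, 0.3, 0.2, 0.1]

print(len(doc_vec))
print(cosine_similarity(doc_vec, query_vec))
```

Both vectors must come from the same model and use the same dimensions setting for the comparison to be meaningful.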
