openai-embeddings
Generate embeddings for records using OpenAI models.
Description
The openai-embeddings processor generates vector embeddings for a record field using OpenAI's embeddings API.
Configuration parameters
version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          # APIKey is the OpenAI API key.
          # Type: string
          api_key: ""
          # BackoffFactor is the factor by which the backoff increases. Defaults
          # to 2.0
          # Type: float
          backoff_factor: "2.0"
          # Dimensions is the number of dimensions the resulting output
          # embeddings should have.
          # Type: int
          dimensions: ""
          # EncodingFormat is the format to return the embeddings in. Can be
          # "float" or "base64".
          # Type: string
          encoding_format: ""
          # Field is the reference to the field to process. Defaults to
          # ".Payload.After".
          # Type: string
          field: ".Payload.After"
          # InitialBackoff is the initial backoff duration in milliseconds.
          # Defaults to 1000ms (1s).
          # Type: int
          initial_backoff: "1000"
          # MaxBackoff is the maximum backoff duration in milliseconds. Defaults
          # to 30000ms (30s).
          # Type: int
          max_backoff: "30000"
          # MaxRetries is the maximum number of retries for API calls. Defaults
          # to 3.
          # Type: int
          max_retries: "3"
          # Model is the OpenAI embeddings model to use (e.g.,
          # text-embedding-3-small).
          # Type: string
          model: ""
          # Whether to decode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.decode.key.enabled: "true"
          # Whether to decode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.decode.payload.enabled: "true"
          # Whether to encode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.encode.key.enabled: "true"
          # Whether to encode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.encode.payload.enabled: "true"
          # User is the user identifier for OpenAI API.
          # Type: string
          user: ""
| Name | Type | Default | Description |
|---|---|---|---|
| api_key | string | null | APIKey is the OpenAI API key. |
| backoff_factor | float | 2.0 | BackoffFactor is the factor by which the backoff increases. Defaults to 2.0. |
| dimensions | int | null | Dimensions is the number of dimensions the resulting output embeddings should have. |
| encoding_format | string | null | EncodingFormat is the format to return the embeddings in. Can be "float" or "base64". |
| field | string | .Payload.After | Field is the reference to the field to process. Defaults to ".Payload.After". |
| initial_backoff | int | 1000 | InitialBackoff is the initial backoff duration in milliseconds. Defaults to 1000ms (1s). |
| max_backoff | int | 30000 | MaxBackoff is the maximum backoff duration in milliseconds. Defaults to 30000ms (30s). |
| max_retries | int | 3 | MaxRetries is the maximum number of retries for API calls. Defaults to 3. |
| model | string | null | Model is the OpenAI embeddings model to use (e.g., text-embedding-3-small). |
| sdk.schema.decode.key.enabled | bool | true | Whether to decode the record key using its corresponding schema from the schema registry. |
| sdk.schema.decode.payload.enabled | bool | true | Whether to decode the record payload using its corresponding schema from the schema registry. |
| sdk.schema.encode.key.enabled | bool | true | Whether to encode the record key using its corresponding schema from the schema registry. |
| sdk.schema.encode.payload.enabled | bool | true | Whether to encode the record payload using its corresponding schema from the schema registry. |
| user | string | null | User is the user identifier for OpenAI API. |
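To make the retry settings concrete, the sketch below computes the wait before each retry under a plain exponential backoff scheme. This is an assumption about how initial_backoff, backoff_factor, max_backoff, and max_retries combine. The processor's actual implementation is not shown here and may differ, for example by adding jitter. The helper name backoff_schedule is hypothetical and not part of Conduit.

```python
def backoff_schedule(initial_ms=1000, factor=2.0, max_ms=30000, max_retries=3):
    """Return the assumed wait (in ms) before each retry.

    Assumes plain exponential backoff: each wait is the previous one
    multiplied by `factor`, capped at `max_ms`. Defaults mirror the
    documented defaults of the processor settings.
    """
    delays = []
    delay = float(initial_ms)
    for _ in range(max_retries):
        delays.append(int(min(delay, max_ms)))  # never exceed max_backoff
        delay *= factor
    return delays


if __name__ == "__main__":
    print(backoff_schedule())               # waits with the default settings
    print(backoff_schedule(max_retries=7))  # later waits hit the max_backoff cap
```

Under this assumption the defaults give waits of 1 s, 2 s, and 4 s, so max_backoff only matters once max_retries or backoff_factor is raised.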
Examples
Generate embeddings for text
This example generates embeddings for the text stored in .Payload.After. The embeddings are returned as a JSON array of floating point numbers. These embeddings can be used for semantic search, clustering, or other machine learning tasks.
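As an illustration of the semantic search use case, the sketch below ranks stored embeddings by cosine similarity to a query embedding. It runs downstream of the pipeline and is not part of the processor. The function names cosine_similarity and rank_by_similarity and the toy vectors are hypothetical. Real embeddings have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def rank_by_similarity(query, documents):
    """Return document ids ordered from most to least similar to `query`.

    `documents` maps an id to its embedding (a list of floats).
    """
    return sorted(
        documents,
        key=lambda doc_id: cosine_similarity(query, documents[doc_id]),
        reverse=True,
    )


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real embeddings.
    docs = {"a": [0.0, 1.0], "b": [1.0, 0.1]}
    print(rank_by_similarity([1.0, 0.0], docs))
```

A linear scan like this is fine for small collections. Larger ones typically use a vector index instead.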
Configuration parameters
version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          api_key: "your-openai-api-key"
          backoff_factor: "2.0"
          field: ".Payload.After"
          initial_backoff: "1000"
          max_backoff: "30000"
          max_retries: "3"
          model: "text-embedding-3-small"
| Name | Value |
|---|---|
| api_key | your-openai-api-key |
| backoff_factor | 2.0 |
| field | .Payload.After |
| initial_backoff | 1000 |
| max_backoff | 30000 |
| max_retries | 3 |
| model | text-embedding-3-small |
Record difference
Before (-) and after (+):
 {
   "position": "dGVzdC1wb3NpdGlvbg==",
   "operation": "create",
   "metadata": {
     "key1": "val1"
   },
   "key": "test-key",
   "payload": {
     "before": null,
-    "after": "This is a sample text to generate embeddings for."
+    "after": "[0.1,0.2,0.3,0.4,0.5]"
   }
 }
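In the processed record above, payload.after holds the embedding as a string containing a JSON array. The sketch below shows how a downstream consumer might turn that string back into a list of floats. It assumes the record arrives serialized as JSON exactly as in the diff and that the float encoding format is in use. A base64-encoded result would need different handling. The helper embedding_from_record is hypothetical.

```python
import json


def embedding_from_record(record):
    """Extract the embedding vector from a processed record.

    Assumes `record` is a dict shaped like the record shown above, with
    payload.after holding a JSON array serialized as a string.
    """
    vector = json.loads(record["payload"]["after"])
    if not isinstance(vector, list):
        raise ValueError("payload.after does not contain a JSON array")
    for value in vector:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("embedding contains a non-numeric value")
    return [float(value) for value in vector]


# The processed record from the example above.
record = {
    "position": "dGVzdC1wb3NpdGlvbg==",
    "operation": "create",
    "metadata": {"key1": "val1"},
    "key": "test-key",
    "payload": {"before": None, "after": "[0.1,0.2,0.3,0.4,0.5]"},
}

if __name__ == "__main__":
    print(embedding_from_record(record))
```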