`openai-embeddings`

Generate embeddings for records using OpenAI models.

Description

The `openai-embeddings` processor generates vector embeddings for a record by calling OpenAI's embeddings API.

Configuration parameters


version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          # APIKey is the OpenAI API key.
          # Type: string
          api_key: ""
          # BackoffFactor is the factor by which the backoff increases. Defaults
          # to 2.0
          # Type: float
          backoff_factor: "2.0"
          # Dimensions is the number of dimensions the resulting output
          # embeddings should have.
          # Type: int
          dimensions: ""
          # EncodingFormat is the format to return the embeddings in. Can be
          # "float" or "base64".
          # Type: string
          encoding_format: ""
          # Field is the reference to the field to process. Defaults to
          # ".Payload.After".
          # Type: string
          field: ".Payload.After"
          # InitialBackoff is the initial backoff duration in milliseconds.
          # Defaults to 1000ms (1s).
          # Type: int
          initial_backoff: "1000"
          # MaxBackoff is the maximum backoff duration in milliseconds. Defaults
          # to 30000ms (30s).
          # Type: int
          max_backoff: "30000"
          # MaxRetries is the maximum number of retries for API calls. Defaults
          # to 3.
          # Type: int
          max_retries: "3"
          # Model is the OpenAI embeddings model to use (e.g.,
          # text-embedding-3-small).
          # Type: string
          model: ""
          # Whether to decode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.decode.key.enabled: "true"
          # Whether to decode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.decode.payload.enabled: "true"
          # Whether to encode the record key using its corresponding schema from
          # the schema registry.
          # Type: bool
          sdk.schema.encode.key.enabled: "true"
          # Whether to encode the record payload using its corresponding schema
          # from the schema registry.
          # Type: bool
          sdk.schema.encode.payload.enabled: "true"
          # User is the user identifier for OpenAI API.
          # Type: string
          user: ""

| Name | Type | Default | Description |
| --- | --- | --- | --- |
| `api_key` | string | null | APIKey is the OpenAI API key. |
| `backoff_factor` | float | `2.0` | BackoffFactor is the factor by which the backoff increases. Defaults to 2.0. |
| `dimensions` | int | null | Dimensions is the number of dimensions the resulting output embeddings should have. |
| `encoding_format` | string | null | EncodingFormat is the format to return the embeddings in. Can be "float" or "base64". |
| `field` | string | `.Payload.After` | Field is the reference to the field to process. Defaults to ".Payload.After". |
| `initial_backoff` | int | `1000` | InitialBackoff is the initial backoff duration in milliseconds. Defaults to 1000ms (1s). |
| `max_backoff` | int | `30000` | MaxBackoff is the maximum backoff duration in milliseconds. Defaults to 30000ms (30s). |
| `max_retries` | int | `3` | MaxRetries is the maximum number of retries for API calls. Defaults to 3. |
| `model` | string | null | Model is the OpenAI embeddings model to use (e.g., text-embedding-3-small). |
| `sdk.schema.decode.key.enabled` | bool | `true` | Whether to decode the record key using its corresponding schema from the schema registry. |
| `sdk.schema.decode.payload.enabled` | bool | `true` | Whether to decode the record payload using its corresponding schema from the schema registry. |
| `sdk.schema.encode.key.enabled` | bool | `true` | Whether to encode the record key using its corresponding schema from the schema registry. |
| `sdk.schema.encode.payload.enabled` | bool | `true` | Whether to encode the record payload using its corresponding schema from the schema registry. |
| `user` | string | null | User is the user identifier for OpenAI API. |
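To make the retry settings easier to reason about, the sketch below computes the wait before each retry from `initial_backoff`, `backoff_factor`, `max_backoff`, and `max_retries`. It assumes plain capped exponential backoff (each delay is the previous one multiplied by the factor, never exceeding the maximum). The processor's actual implementation is not shown on this page and may differ, for example by adding jitter. The function name `backoff_schedule` is hypothetical.

```python
def backoff_schedule(initial_backoff_ms=1000, backoff_factor=2.0,
                     max_backoff_ms=30000, max_retries=3):
    """Return the assumed wait in milliseconds before each retry.

    Assumption: capped exponential backoff without jitter. This is an
    illustration of the settings, not the processor's confirmed behavior.
    """
    delays = []
    delay = float(initial_backoff_ms)
    for _ in range(max_retries):
        # Never wait longer than the configured maximum.
        delays.append(min(delay, float(max_backoff_ms)))
        delay *= backoff_factor
    return delays


# With the documented defaults, three retries are attempted.
print(backoff_schedule())  # [1000.0, 2000.0, 4000.0]

# With more retries, the cap of 30000 ms takes effect.
print(backoff_schedule(max_retries=7))
```

Under these assumptions, the defaults add at most 7 seconds of waiting per record before the processor gives up on the API call.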

Examples

Generate embeddings for text

This example generates embeddings for the text stored in `.Payload.After`. The processor replaces the text with the embedding, returned as a JSON array of floating-point numbers. These embeddings can be used for semantic search, clustering, or other machine learning tasks.

Configuration parameters


version: 2.2
pipelines:
  - id: example
    status: running
    connectors:
      # define source and destination ...
    processors:
      - id: example
        plugin: "openai-embeddings"
        settings:
          api_key: "your-openai-api-key"
          backoff_factor: "2.0"
          field: ".Payload.After"
          initial_backoff: "1000"
          max_backoff: "30000"
          max_retries: "3"
          model: "text-embedding-3-small"

| Name | Value |
| --- | --- |
| `api_key` | `your-openai-api-key` |
| `backoff_factor` | `2.0` |
| `field` | `.Payload.After` |
| `initial_backoff` | `1000` |
| `max_backoff` | `30000` |
| `max_retries` | `3` |
| `model` | `text-embedding-3-small` |

Record difference

Before
{
  "position": "dGVzdC1wb3NpdGlvbg==",
  "operation": "create",
  "metadata": {
    "key1": "val1"
  },
  "key": "test-key",
  "payload": {
    "before": null,
    "after": "This is a sample text to generate embeddings for."
  }
}

After
{
  "position": "dGVzdC1wb3NpdGlvbg==",
  "operation": "create",
  "metadata": {
    "key1": "val1"
  },
  "key": "test-key",
  "payload": {
    "before": null,
    "after": "[0.1,0.2,0.3,0.4,0.5]"
  }
}
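In the processed record above, `payload.after` holds the embedding as a JSON array serialized into a string. A downstream consumer could parse it and compare vectors roughly as follows. This is a minimal sketch: the helper names `parse_embedding` and `cosine_similarity` are hypothetical, and the record is the illustrative five-element example from this page, not real model output.

```python
import json
import math


def parse_embedding(after_field):
    """Parse the JSON array stored in payload.after into a list of floats."""
    return [float(x) for x in json.loads(after_field)]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# The relevant part of the processed record shown above.
record = {"payload": {"before": None, "after": "[0.1,0.2,0.3,0.4,0.5]"}}

vec = parse_embedding(record["payload"]["after"])
print(len(vec))  # 5
print(cosine_similarity(vec, vec))  # very close to 1.0 for identical vectors
```

Cosine similarity is one common choice for the semantic search use case; a vector database would normally handle this comparison at scale.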
