Discovery Reference Guide

The Pureinsights Discovery Platform is a cloud-based family of products that helps create, maintain and monitor production-ready AI applications.

No single recipe fits all: each collection, use case and implementation is different, and evolving from a prototype to a live solution is not a trivial task. This is where Discovery shows its value:

An architecture that follows a pay-as-you-go model for only the resources required by the specific use case.
A no-code approach with building blocks configured through finite-state machines that provides flexibility while reducing the hassle of developer-related tasks such as error handling and orchestration of services.
Configuration changes and tuning happen on the fly, without the downtime of redeploying.
Data processing pipelines to extract, transform and load collections from different sources (ETL).
Data storage as a "push" model alternative to traditional ETL solutions.
Custom REST endpoints with advanced capabilities that adapt to the complexities of processing a query with minimum overhead.
Observability with standard monitoring and alerting tools.

What’s new in 2.2.0?

Troubleshooting Tool for Ingested Records

When a seed execution completes successfully, its final state is stored and can be accessed through the Records API. This helps keep track of processed records: it exposes the status of each record, any errors raised during its execution, and other relevant metadata.

Support for connections with mTLS

You can now configure a Server to communicate with services protected by mTLS. As with the already supported certificates, you can add keys, together with the certificates associated with those keys, to establish a connection with your favorite services protected by mTLS.

Vespa integration

We added two new Discovery Ingestion and Discovery QueryFlow components to ingest and query documents from a Vespa app, either in Vespa Cloud or self-hosted.

Troubleshooting Tool for Seed Executions

A new tool was added to help troubleshoot seed executions by providing a summary of the status of all jobs involved. It allows users to quickly identify how many jobs are in states such as DONE, FAILED, RUNNING, or CREATED, making it easier to monitor execution progress and spot potential issues during runtime.
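
As a rough illustration of what such a summary contains, the sketch below counts jobs per state. The `jobs` list and its `state` field are hypothetical stand-ins and not the response format of the tool; only the state names come from the description above.

```python
from collections import Counter

# State names taken from the description above.
KNOWN_STATES = ("CREATED", "RUNNING", "DONE", "FAILED")


def summarize_jobs(jobs):
    """Count jobs per state, always reporting the known states (even when zero)."""
    counts = Counter(job["state"] for job in jobs)
    summary = {state: counts.get(state, 0) for state in KNOWN_STATES}
    # Keep any unexpected state visible instead of silently dropping it.
    for state, count in counts.items():
        summary.setdefault(state, count)
    return summary


# Hypothetical job records; the real payload may differ.
jobs = [{"state": "DONE"}, {"state": "DONE"}, {"state": "FAILED"}, {"state": "RUNNING"}]
print(summarize_jobs(jobs))
```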

Breaking changes

The configuration for Ingestion batch policies has been restructured

To make the configuration of record grouping mechanisms in Discovery Ingestion more intuitive, the policies referring to an outgoing batch of records are now part of a new type of policy, referred to as Outbound Policy. Additionally, policies regarding records should now be placed in a recordPolicy configuration, though the previous record name is still accepted.

Please note that Batch Policies defined outside an Outbound Policy no longer affect outbound batches of records; they are ignored. This change applies to both the global recordPolicy (or record) setting of a Seed and its optional override in a Pipeline processor state.

In order to include settings for outgoing record batches, the Batch Policies must now be included as follows:

{
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        <Batch configuration>
      }
    },
    "batchPolicy": {
      // Settings found here will now be ignored
      // because they aren't included as part of the "outboundPolicy"
    }
  },
  "batchPolicy": {
    // These settings will also be ignored
  }
}

It’s recommended to check the affected entities’ respective sections for details of how outbound batch settings are included in the general configurations.
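
When reviewing existing configurations for this change, a small check can flag Batch Policies that are now ignored. The helper below is a hypothetical sketch, not part of Discovery; it only inspects the locations described above (the top level, and directly under `recordPolicy` or the legacy `record` key).

```python
def ignored_batch_policies(config):
    """Return the paths of batchPolicy blocks that no longer affect outbound batches."""
    ignored = []
    # A batchPolicy at the top level is outside any Outbound Policy.
    if "batchPolicy" in config:
        ignored.append("batchPolicy")
    # Both the recommended "recordPolicy" key and the legacy "record" key are accepted.
    for key in ("recordPolicy", "record"):
        policy = config.get(key)
        if isinstance(policy, dict) and "batchPolicy" in policy:
            ignored.append(f"{key}.batchPolicy")
    return ignored


seed_config = {
    "recordPolicy": {
        "outboundPolicy": {"batchPolicy": {}},  # still honored
        "batchPolicy": {},                      # ignored
    },
    "batchPolicy": {},                          # ignored
}
print(ignored_batch_policies(seed_config))
```

The same check can be applied to the override of a Pipeline processor state, since the change affects both places.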

The Ingestion component Engine Score has been renamed to Insights

The Ingestion component Engine Score is now named Insights. We changed the purpose of this component to provide more actions that generate information for later analysis, so it was renamed to reflect that purpose more accurately.

Now the configuration for an engine score action will look a little bit different:

{
  "type": "insights",
  "name": "My component",
  "config": {
    "action": "engine-score:non-contextual",
    ...
  },
  ...
}

Please note that the type is now insights instead of engine-score, and the name of the action has changed to engine-score:non-contextual.
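
For configurations kept as JSON files, the type rename can be applied mechanically. The sketch below is a hypothetical helper (not part of Discovery) that rewrites only the type. Since this guide does not list the previous action names, the action is left untouched and still needs a manual review.

```python
import copy


def migrate_engine_score(component):
    """Return (component, changed); the legacy type is renamed on a copy."""
    if component.get("type") != "engine-score":
        return component, False
    migrated = copy.deepcopy(component)
    migrated["type"] = "insights"
    # The action name also changed, but the old names are not listed in this
    # guide, so "config.action" is intentionally left as-is for manual review.
    return migrated, True


legacy = {"type": "engine-score", "name": "My component", "config": {}}
migrated, changed = migrate_engine_score(legacy)
print(migrated["type"], changed)
```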

Basics

Discovery is a platform composed of three products:

Discovery Ingestion: a distributed ETL, supported by a finite-state machine that represents the data transformation and loading.
Discovery Staging: an abstraction for a Document Database, where collections are represented as buckets with an HTTP interface for simple CRUD operations.
Discovery QueryFlow: a configurable REST API with custom endpoints, supported by a finite-state machine that represents the query processing.

All products are supported by the Discovery Core Libraries and API: a common layer for shared concepts and configurations.

Each independent product has its own value and can exist by itself (although both Discovery Ingestion and Discovery QueryFlow require Discovery Staging for their internal operations). However, using them together brings all the tools to create an end-to-end solution.

Architecture

One of the main goals of Discovery is to be cloud agnostic: using the native resources of each cloud provider without affecting the application itself.

This decision has a direct impact on the overall architecture, as external services must be abstracted so that each one can be mapped to a managed service available on each cloud provider.

The relational database is the main storage for configurations, metadata and the state of the multiple executions.
The document database is the service abstracted by Discovery Staging. The default implementation for all cloud providers is MongoDB Atlas.
The object storage supports the file server and data processing of binary files in Discovery Ingestion.
The message queue handles asynchronous communication between the components.
The secrets manager stores secure information such as passwords and credentials. The default implementation is an internal secrets provider.

The Discovery Core Libraries and API is an interface for every Discovery product to interact with these external services. However, despite this standardization, each independent product is designed around its own needs: Discovery QueryFlow and Discovery Staging are monoliths due to their need for fast responses, while Discovery Ingestion follows an event-driven architecture that targets scalability with distributed components processing data in parallel.

AWS

Discovery can be integrated with Amazon Web Services (AWS) using managed services that natively support the installation requirements.

The application is deployed using Amazon Elastic Container Service and AWS Lambda in a private subnet, later exposed with Amazon API Gateway and Amazon Route 53.

Other services such as Amazon EventBridge and Amazon CloudWatch support the correct control, autoscaling and monitoring of all components.

Monitoring

All Discovery products constantly publish metrics to a selected monitoring and observability tool.

Integrations

Integrations to external services are represented by Servers, optionally authenticated with a Credential that references an encrypted Secret.

Servers are reusable configurations that can later be referenced in Discovery Ingestion Components and Discovery QueryFlow Components.

Connecting to an external service

Servers API

Create a new Server

$ curl --request POST 'core-api:8080/v2/server' --data '{ ... }'

List all Servers

$ curl --request GET 'core-api:8080/v2/server'

Get a single Server

$ curl --request GET 'core-api:8080/v2/server/{id}'

Test a Server connection

$ curl --request GET 'core-api:8080/v2/server/{id}/ping'

Note	Not all integrations support the `/ping` endpoint.

Update an existing Server

$ curl --request PUT 'core-api:8080/v2/server/{id}' --data '{ ... }'

Note	The type of an existing server can’t be modified.

Delete an existing Server

$ curl --request DELETE 'core-api:8080/v2/server/{id}'

Clone an existing Server

$ curl --request POST 'core-api:8080/v2/server/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Server

Search for Servers using DSL Filters

$ curl --request POST 'core-api:8080/v2/server/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Servers

$ curl --request GET 'core-api:8080/v2/server/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search
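
The curl calls above can also be scripted. The following is a minimal sketch using only the Python standard library. It builds the requests without sending them, and it makes several assumptions that are not stated in this guide: that the Core API is served over plain HTTP at `core-api:8080`, that a JSON `Content-Type` header is expected, and that `SERVER-ID` and the MongoDB connection string are placeholders.

```python
import json
import urllib.request

# Assumption: plain HTTP; the guide only shows the host, port and path.
BASE_URL = "http://core-api:8080/v2"


def server_request(method, path="", body=None):
    """Build (but do not send) a request against the Servers API."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    request = urllib.request.Request(f"{BASE_URL}/server{path}", data=data, method=method)
    if data is not None:
        # Assumption: the API expects JSON bodies to be declared as such.
        request.add_header("Content-Type", "application/json")
    return request


# Create a new Server (placeholder values).
create = server_request("POST", body={
    "type": "mongo",
    "name": "My MongoDB Server",
    "config": {"servers": ["mongodb://localhost:27017"]},
})

# Test the connection of an existing Server; SERVER-ID is a placeholder.
ping = server_request("GET", "/SERVER-ID/ping")

print(create.get_method(), create.full_url)
print(ping.get_method(), ping.full_url)
```

To actually send a request, pass it to `urllib.request.urlopen(...)` from a host that can reach the Core API.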

A server has the properties to create an authenticated connection to an external service:

{
  "type": "my-external-service",
  "name": "My External Service Configuration",
  "config": {
    ...
  },
  ...
}

type

(Required, String) The type of external supported service

name

(Required, String) The unique name to identify the external service

description

(Optional, String) The description for the configuration

config

(Required, Object) The configuration to connect to the external service

credential

(Optional, UUID) The ID of the credential to authenticate in the external service

certificates

(Optional, Object) The custom certificates for encrypted connection (SSL/TLS), loaded using the file storage. The value can be either the string with the location of the certificate, or a detailed configuration

Details

{
  "certificates": {
    "sampleA": {
      "type": "X.509",
      "value": "certificates/sample.crt"
    },
    "sampleB": {
      "value": "certificates/sample.crt"
    },
    "sampleC": "certificates/sample.crt"
  },
  ...
}

type: (Optional, String) The type of certificate. Defaults to X.509
value: (Required, String) The location of the certificate in the file storage

Note	The existence of the certificate will be verified.
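
The three forms shown above (`sampleA`, `sampleB`, `sampleC`) are equivalent. The sketch below shows one way to expand them to the detailed form; it is an illustration of the documented default, not the code Discovery uses.

```python
DEFAULT_CERTIFICATE_TYPE = "X.509"


def normalize_certificates(certificates):
    """Expand every entry to the detailed form: {"type": ..., "value": ...}."""
    normalized = {}
    for alias, entry in certificates.items():
        if isinstance(entry, str):
            # Short form: the string is the location in the file storage.
            entry = {"value": entry}
        if "value" not in entry:
            raise ValueError(f"certificate '{alias}' is missing the required 'value'")
        normalized[alias] = {
            "type": entry.get("type", DEFAULT_CERTIFICATE_TYPE),
            "value": entry["value"],
        }
    return normalized


certificates = {
    "sampleA": {"type": "X.509", "value": "certificates/sample.crt"},
    "sampleB": {"value": "certificates/sample.crt"},
    "sampleC": "certificates/sample.crt",
}
print(normalize_certificates(certificates))
```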

keys

(Optional, Object) The keys with their respective certificate chain for encrypted connection (mTLS), loaded using the file storage.

Details

{
  "keys": {
    "keyA": {
      "value": "keys/sample.pem",
      "certificateChain": [
        {
          "type": "X.509",
          "value": "certificates/sample.pem"
        },
        {
          "value": "certificates/sample.pem"
        },
        "certificates/sample.pem"
      ]
    }
  }
}

value: (Required, String) The location of the key in the file storage. The contents are expected to be a PKCS#8-encoded key in PEM format.
certificateChain: (Required, List) One or more public certificates associated with the key

Note	The existence of the key will be verified.
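
Before uploading a key, it can help to confirm that the file is in the expected encoding. An unencrypted PKCS#8 key in PEM format starts with `-----BEGIN PRIVATE KEY-----`, whereas a PKCS#1 RSA key starts with `-----BEGIN RSA PRIVATE KEY-----`. The sketch below checks only that PEM armor. It is a heuristic, it does not parse the key, and it assumes (this guide does not say) that encrypted PKCS#8 keys are not accepted.

```python
PKCS8_HEADER = "-----BEGIN PRIVATE KEY-----"
PKCS8_FOOTER = "-----END PRIVATE KEY-----"


def looks_like_pkcs8_pem(text):
    """Heuristic check on the PEM armor only; the key itself is not parsed."""
    lines = [line.strip() for line in text.strip().splitlines()]
    return len(lines) >= 3 and lines[0] == PKCS8_HEADER and lines[-1] == PKCS8_FOOTER


# Placeholder bodies: not real key material.
pkcs8_sample = "-----BEGIN PRIVATE KEY-----\nMIIB...\n-----END PRIVATE KEY-----\n"
pkcs1_sample = "-----BEGIN RSA PRIVATE KEY-----\nMIIB...\n-----END RSA PRIVATE KEY-----\n"

print(looks_like_pkcs8_pem(pkcs8_sample))
print(looks_like_pkcs8_pem(pkcs1_sample))
```

A key in another encoding can usually be converted with OpenSSL, for example `openssl pkcs8 -topk8 -nocrypt -in key.pem -out key.pkcs8.pem`.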

circuitBreaker

(Optional, Object) The circuit breaker configuration as a mechanism to handle request errors and limitations of an external service

Details

{
  "circuitBreaker": {
    "waitInOpenState": "90s",
    "maxTestRequests": 1
  },
  ...
}

waitInOpenState: (Optional, Duration) The maximum time to wait in OPEN state before transitioning to HALF_OPEN state
maxTestRequests: (Optional, Integer) The maximum number of requests allowed in the HALF_OPEN state

Note	While all server types can be configured with a circuit breaker, not all may necessarily utilize it.
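
To show how the two settings interact, here is a toy circuit breaker. It is a sketch of the general pattern and not Discovery's implementation: the CLOSED state, the `failure_threshold` parameter and the injectable `clock` are assumptions added for illustration, while `wait_in_open_state` and `max_test_requests` mirror the documented fields.

```python
import time


class CircuitBreaker:
    """Toy CLOSED -> OPEN -> HALF_OPEN state machine."""

    def __init__(self, wait_in_open_state=90.0, max_test_requests=1,
                 failure_threshold=3, clock=time.monotonic):
        self.wait_in_open_state = wait_in_open_state  # seconds, mirrors waitInOpenState
        self.max_test_requests = max_test_requests    # mirrors maxTestRequests
        self.failure_threshold = failure_threshold    # assumption: not a documented setting
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None
        self.test_requests = 0

    def allow_request(self):
        # After waiting long enough in OPEN, start letting test requests through.
        if self.state == "OPEN" and self.clock() - self.opened_at >= self.wait_in_open_state:
            self.state = "HALF_OPEN"
            self.test_requests = 0
        if self.state == "CLOSED":
            return True
        if self.state == "HALF_OPEN" and self.test_requests < self.max_test_requests:
            self.test_requests += 1
            return True
        return False

    def record_success(self):
        # Any success closes the circuit again.
        self.state = "CLOSED"
        self.failures = 0

    def record_failure(self):
        self.failures += 1
        # A failed test request, or too many failures, opens the circuit.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()
```

With `waitInOpenState` set to `90s` and `maxTestRequests` set to `1`, a breaker like this rejects calls for 90 seconds after opening, then lets a single request through to probe the external service.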

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Authentication and Credentials

Credentials API

Create a new Credential

$ curl --request POST 'core-api:8080/v2/credential' --data '{ ... }'

List all Credentials

$ curl --request GET 'core-api:8080/v2/credential'

Get a single Credential

$ curl --request GET 'core-api:8080/v2/credential/{id}'

Note	Reading credentials will never expose the referenced secret.

Update an existing Credential

$ curl --request PUT 'core-api:8080/v2/credential/{id}' --data '{ ... }'

Note	The type of an existing credential can’t be modified.

Delete an existing Credential

$ curl --request DELETE 'core-api:8080/v2/credential/{id}'

Clone an existing Credential

$ curl --request POST 'core-api:8080/v2/credential/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Credential

Search for Credentials using DSL Filters

$ curl --request POST 'core-api:8080/v2/credential/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Credentials

$ curl --request GET 'core-api:8080/v2/credential/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

A credential references a secret with the authentication parameters required to connect to an external service:

{
  "type": "my-external-service",
  "name": "My External Service Credential",
  "secret": "MY_SECRET",
  ...
}

When the secret provider is internal, it is possible to create a secret during the creation of the credential:

{
  "type": "my-external-service",
  "name": "My External Service Credential",
  "secret": {
    "name": "MY_SECRET",
    "content": {
      "username": <username>,
      "password": <password>
    },
    ...
  },
  ...
}

Note	It is assumed that the referenced secret exists, and has the correct JSON-formatted authentication information. However, this is only a soft-reference, and any deletion of secret keys won’t be noticed until the next time it is required.

type

(Required, String) The type of credentials for the external supported service

name

(Required, String) The unique name to identify the credentials

description

(Optional, String) The description for the configuration

secret

(Required, String or Object) Either the secret key to connect to the external service, or an object with the authentication details

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Secrets

Secrets API

Create a new Secret

$ curl --request POST 'core-api:8080/v2/secret' --data '{ ... }'

List all Secrets

$ curl --request GET 'core-api:8080/v2/secret'

Get a single Secret

$ curl --request GET 'core-api:8080/v2/secret/{id}'

Note	Reading secrets will never expose their encrypted data.

Update an existing Secret

$ curl --request PUT 'core-api:8080/v2/secret/{id}' --data '{ ... }'

Delete an existing Secret

$ curl --request DELETE 'core-api:8080/v2/secret/{id}'

A secret is a representation of a secure JSON. This document could be anything, but its most common usage is for credentials:

{
  "name": "MY_SECRET",
  "content": {
    "username": <username>,
    "password": <password>
  },
  ...
}

Note	When the secrets are backed by an external service, Discovery won’t expose any CRUD operations for their management.

name

(Required, String) The unique name to identify the secret

description

(Optional, String) The description for the configuration

content

(Required, Object) The JSON to securely store

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Supported Services

Amazon Bedrock

{
  "type": "amazon-bedrock",
  "name": "My Amazon Bedrock Server",
  "config": {
    ...
  }
}

Note	This integration supports the circuit breaker configuration.

Server Configuration for Amazon Bedrock

region

(Required, String) The AWS region

apiCallTimeout

(Optional, Duration) The complete duration of an API call

connection

(Optional, Object) The configuration of the connection to Amazon Bedrock

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

backoffPolicy

(Optional, Object) The configuration for retries to Amazon Bedrock

Details

type: (Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL
initialDelay: (Optional, Duration) The initial delay before retrying. Defaults to 50ms
retries: (Optional, Integer) The maximum number of retries. Defaults to 5
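
To get a feel for the defaults, the sketch below lists the delay before each retry. It rests on two assumptions that this guide does not confirm: that NONE means no retries at all, and that EXPONENTIAL doubles the delay on every attempt (the `multiplier` parameter is hypothetical).

```python
def retry_delays_ms(policy_type="EXPONENTIAL", initial_delay_ms=50, retries=5, multiplier=2):
    """List the delay before each retry, in milliseconds."""
    if policy_type == "NONE":
        # Assumption: no backoff policy means the call is not retried.
        return []
    if policy_type == "CONSTANT":
        return [initial_delay_ms] * retries
    if policy_type == "EXPONENTIAL":
        # Assumption: each delay is the previous one times `multiplier`.
        return [initial_delay_ms * multiplier ** attempt for attempt in range(retries)]
    raise ValueError(f"unknown backoff policy type: {policy_type}")


print(retry_delays_ms())
print(retry_delays_ms("CONSTANT"))
```

Under these assumptions the default policy waits at most 1550 ms in total across its five retries, excluding the duration of the calls themselves.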

Authentication

Credentials Secret for Amazon Bedrock with AWS Credentials

{
  "type": "aws",
  "name": "My Amazon Bedrock Credentials",
  "secret": {
    ...
  }
}

accessKeyId: (Required, String) The ID of your access key, used to identify the user
secretAccessKey: (Required, String) The secret access key, used to authenticate the user
sessionToken: (Optional, String) The session token from an AWS token service, used to authenticate that this user has received temporary permission to access some resource
expirationTime: (Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown but it may still expire at some time

Note	See Manage access keys for IAM users

Elasticsearch

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Server Configuration for Elasticsearch

servers

(Required, Array of Strings) The URI for the Elasticsearch installation. Multiple servers will be invoked in round-robin

pathPrefix

(Optional, String) The path prefix to add to the servers on each call

connection

(Optional, Object) The configuration of the HTTP connection to Elasticsearch

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed
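
The round-robin behavior mentioned for `servers` can be pictured with the following sketch. It illustrates the general technique only; the class name and the URIs are hypothetical, and the actual client may handle failures or concurrency differently.

```python
import itertools


class RoundRobin:
    """Hand out the configured servers in a repeating cycle."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._cycle = itertools.cycle(list(servers))

    def next_server(self):
        return next(self._cycle)


# Hypothetical URIs, for illustration only.
picker = RoundRobin(["http://es-1:9200", "http://es-2:9200"])
for _ in range(3):
    print(picker.next_server())
```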

Server Configuration for Elastic Cloud

cloudId

(Required, String) The ID of the instance in Elastic Cloud

connection

(Optional, Object) The configuration of the HTTP connection to Elasticsearch

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Authentication

Credentials Secret for Elasticsearch with HTTP Basic Authentication

{
  "type": "http",
  "name": "My Elasticsearch Credentials",
  "secret": {
    ...
  }
}

username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials

Credentials Secret for Elasticsearch with HTTP Bearer Token

{
  "type": "http",
  "name": "My Elasticsearch Credentials",
  "secret": {
    ...
  }
}

token: (Required, String) The token of the credentials

Credentials Secret for Elasticsearch with API Key

{
  "type": "http",
  "name": "My Elasticsearch Credentials",
  "secret": {
    ...
  }
}

apiKey: (Required, String) The API key of the credentials

DSL

Table 1. DSL Filters for Elasticsearch
Filter	Elasticsearch Query Operator
Equals	Term Query when `normalize` is `false`, otherwise Match Query
Less Than	Range Query
Less Than or Equal To	Range Query
Between	Range Query
Greater Than	Range Query
Greater Than or Equal To	Range Query
In	Terms Query
Exists	Exists Query
And	Bool Query with `must` clauses
Or	Bool Query with `should` clauses
Regex	Regexp Query
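
The mapping in Table 1 can be sketched as a small translator. The tuple encoding of the filters (for example `("equals", field, value)`) is a hypothetical stand-in, not the Discovery DSL syntax, and `normalize` is passed as a function argument purely for illustration. Between is left out because the table does not say whether its bounds are inclusive.

```python
def to_elasticsearch(filter_, normalize=False):
    """Translate a tuple-encoded filter into an Elasticsearch query clause."""
    op, *args = filter_
    if op == "equals":
        field, value = args
        # Table 1: Term Query when normalize is false, otherwise Match Query.
        return {"match" if normalize else "term": {field: value}}
    if op in ("lt", "lte", "gt", "gte"):
        field, value = args
        return {"range": {field: {op: value}}}
    if op == "in":
        field, values = args
        return {"terms": {field: list(values)}}
    if op == "exists":
        (field,) = args
        return {"exists": {"field": field}}
    if op == "regex":
        field, pattern = args
        return {"regexp": {field: pattern}}
    if op in ("and", "or"):
        clause = "must" if op == "and" else "should"
        return {"bool": {clause: [to_elasticsearch(child, normalize) for child in args]}}
    raise ValueError(f"unsupported filter: {op}")


print(to_elasticsearch(("and", ("gte", "year", 2020), ("exists", "author"))))
```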

Hugging Face

{
  "type": "hugging-face",
  "name": "My Hugging Face Server",
  "config": {
    ...
  }
}

Server Configuration for Hugging Face

servers

(Required, Array of Strings) The URI for the Hugging Face Inference API service. Multiple servers will be invoked in round-robin

connection

(Optional, Object) The configuration of the HTTP connection to Hugging Face

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Authentication

Credentials Secret for Hugging Face

{
  "type": "http",
  "name": "My Hugging Face Credentials",
  "secret": {
    ...
  }
}

token: (Required, String) The token of the credentials

MongoDB

{
  "type": "mongo",
  "name": "My MongoDB Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Server Configuration for MongoDB/MongoDB Atlas

servers

(Required, Array of Strings) The connection string for the MongoDB/MongoDB Atlas installation. Multiple servers represent a replica set

connection

(Optional, Object) The configuration of the connection to MongoDB

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressors

(Optional, Array of Strings) A list of data compressors. One of SNAPPY, ZLIB or ZSTD

tls

(Optional, Boolean) true if the connection should be done through SSL. Defaults to false

retryWrites

(Optional, Boolean) true if the connection should retry requests. Defaults to true

Authentication

Credentials Secret for MongoDB/MongoDB Atlas using SCRAM-SHA-1

{
  "type": "mongo",
  "name": "My MongoDB Credentials",
  "secret": {
    "mechanism": "SCRAM-SHA-1",
    ...
  }
}

mechanism: (Required, String) The authentication mechanism. Must be SCRAM-SHA-1
username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials
source: (Required, String) The database name associated with the user’s authentication data. Defaults to admin

Credentials Secret for MongoDB/MongoDB Atlas using SCRAM-SHA-256

{
  "type": "mongo",
  "name": "My MongoDB Credentials",
  "secret": {
    "mechanism": "SCRAM-SHA-256",
    ...
  }
}

mechanism: (Required, String) The authentication mechanism. Must be SCRAM-SHA-256
username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials
source: (Required, String) The database name associated with the user’s authentication data. Defaults to admin

Credentials Secret for MongoDB/MongoDB Atlas using AWS IAM

{
  "type": "mongo",
  "name": "My MongoDB Credentials",
  "secret": {
    "mechanism": "MONGODB-AWS",
    ...
  }
}

mechanism: (Required, String) The authentication mechanism. Must be MONGODB-AWS
accessKeyId: (Required, String) The AWS access key ID
secretAccessKey: (Optional, String) The AWS secret access key
sessionToken: (Optional, String) The AWS session token for authentication with temporary credentials when using an AssumeRole request, or when working with AWS resources that specify this value such as Lambda

DSL

Table 2. DSL Filters for MongoDB
Filter	Mongo Query Operator
Equals	`$eq`
Less Than	`$lt`
Less Than or Equal To	`$lte`
Between	An `$and` of `$gte` and `$lt`
Greater Than	`$gt`
Greater Than or Equal To	`$gte`
In	`$in`
Exists	`$exists`
And	`$and`
Or	`$or`
Not	`$not`
Regex	`$regex` with regular expressions
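
As an illustration of Table 2, the sketch below turns a filter into a MongoDB query document. The tuple encoding of the filters is hypothetical and is not the Discovery DSL syntax. Not is omitted because MongoDB's `$not` wraps an operator expression on a single field, so a general translation depends on the shape of the negated filter.

```python
SIMPLE_OPERATORS = {
    "equals": "$eq", "lt": "$lt", "lte": "$lte",
    "gt": "$gt", "gte": "$gte", "in": "$in", "regex": "$regex",
}


def to_mongo(filter_):
    """Translate a tuple-encoded filter into a MongoDB query document."""
    op, *args = filter_
    if op in SIMPLE_OPERATORS:
        field, value = args
        return {field: {SIMPLE_OPERATORS[op]: value}}
    if op == "between":
        field, low, high = args
        # Table 2: an $and of $gte and $lt, so the upper bound is exclusive.
        return {"$and": [{field: {"$gte": low}}, {field: {"$lt": high}}]}
    if op == "exists":
        (field,) = args
        return {field: {"$exists": True}}
    if op in ("and", "or"):
        return {f"${op}": [to_mongo(child) for child in args]}
    raise ValueError(f"unsupported filter: {op}")


print(to_mongo(("between", "price", 10, 20)))
```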

Table 3. DSL Filters for MongoDB Atlas Search
Filter	Mongo Query Operator
Equals	Text Operator with a string query
Less Than	Range Operator with numbers or dates
Less Than or Equal To	Range Operator with numbers or dates
Between	Range Operator with numbers or dates
Greater Than	Range Operator with numbers or dates
Greater Than or Equal To	Range Operator with numbers or dates
In	In Operator
Exists	Exists Operator
And	Compound Operator with `must`
Or	Compound Operator with `should`
Regex	Regex Operator with the keyword analyzer

Neo4j

{
  "type": "neo4j",
  "name": "My Neo4j Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Server Configuration for Neo4j

server

(Required, String) The URI to connect to Neo4j, following the supported schemes

connection

(Optional, Object) The configuration of the connection to Neo4j

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

Authentication

Credentials Secret for Neo4j

{
  "type": "neo4j",
  "name": "My Neo4j Credentials",
  "secret": {
    ...
  }
}

username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials

OpenAI

{
  "type": "openai",
  "name": "My OpenAI Server",
  "config": {
    ...
  }
}

Note	This integration supports the circuit breaker configuration.

Server Configuration for OpenAI

organizationId

(Optional, String) The Organization ID to be added to the request headers

connection

(Optional, Object) The configuration of the connection for OpenAI

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Authentication

Credentials Secret for OpenAI

{
  "type": "http",
  "name": "My OpenAI Credentials",
  "secret": {
    ...
  }
}

apiKey: (Required, String) The API key of the credentials

OpenSearch

{
  "type": "opensearch",
  "name": "My OpenSearch Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Server Configuration for OpenSearch

servers

(Required, Array of Strings) The URI for the OpenSearch installation. Multiple servers will be invoked in round-robin

pathPrefix

(Optional, String) The path prefix to add to the servers on each call

connection

(Optional, Object) The configuration of the HTTP connection to OpenSearch

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Server Configuration for AWS OpenSearch

endpoint

(Required, String) The host to make the request to, without the http://

signature

(Required, Object) The signature for the requests

Details

region: (Required, String) The AWS region of the service
serviceName: (Required, String) The signing service name

connection

(Optional, Object) The configuration of the HTTP connection to OpenSearch

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Authentication

Credentials Secret for OpenSearch with HTTP Basic Authentication

{
  "type": "http",
  "name": "My OpenSearch Credentials",
  "secret": {
    ...
  }
}

username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials

Credentials Secret for OpenSearch with AWS Authentication

{
  "type": "aws",
  "name": "My OpenSearch Credentials",
  "secret": {
    ...
  }
}

accessKeyId: (Required, String) The ID of your access key, used to identify the user
secretAccessKey: (Required, String) The secret access key, used to authenticate the user
sessionToken: (Optional, String) The session token from an AWS token service, used to authenticate that this user has received temporary permission to access some resource
expirationTime: (Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown but it may still expire at some time

Note	See Manage access keys for IAM users

DSL

Table 4. DSL Filters for OpenSearch
Filter	OpenSearch Query Operator
Equals	Term Query when `normalize` is `false`, otherwise Match Query
Less Than	Range Query
Less Than or Equal To	Range Query
Between	Range Query
Greater Than	Range Query
Greater Than or Equal To	Range Query
In	Terms Query
Exists	Exists Query
And	Bool Query with `must` clauses
Or	Bool Query with `should` clauses
Regex	Regexp Query

Solr

{
  "type": "solr",
  "name": "My Solr Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Server Configuration for Solr

servers

(Required, Array of Strings) The URI for the Solr installation. Multiple servers will be invoked in round-robin

connection

(Optional, Object) The configuration of the HTTP connection to Solr

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

Authentication

Credentials Secret for Solr

{
  "type": "http",
  "name": "My Solr Credentials",
  "secret": {
    ...
  }
}

username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials

DSL

Table 5. DSL Filters for Solr
Filter	Solr Query Operator
Equals	Equality Query
Less Than	Less Than Query
Less Than or Equal To	Less Than or Equals Query
Between	Range Query
Greater Than	Greater Than Query
Greater Than or Equal To	Greater Than or Equals Query
In	Terms Query
Exists	Exists Query
And	And Query
Or	Or Query

Vespa

{
  "type": "vespa",
  "name": "My Vespa Server",
  "config": {
    ...
  }
}

Note	This integration supports the `/ping` endpoint.

Note	This integration supports the circuit breaker configuration.

Note	Vespa Cloud is protected with mTLS. See Server to configure the keys or use token authentication.

Server Configuration for Vespa

servers

(Required, Array of Strings) The URI for the Vespa service. Multiple servers will be invoked in round-robin

connection

(Optional, Object) The configuration of the connection for Vespa

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

backoffPolicy

(Optional, Object) The configuration for retries to the Vespa service

Details

type: (Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL
initialDelay: (Optional, Duration) The initial delay before retrying. Defaults to 50ms
retries: (Optional, Integer) The maximum number of retries. Defaults to 5
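To illustrate how the three policy types differ, the Python sketch below computes the wait before each retry. It rests on assumptions: the `retry_delays` helper is hypothetical, the doubling factor used for EXPONENTIAL is assumed because the growth factor is not stated in this guide, and NONE is assumed to mean that no retries are attempted.

```python
def retry_delays(policy: str = "EXPONENTIAL",
                 initial_delay_ms: int = 50,
                 retries: int = 5) -> list[int]:
    """Return the delay (in ms) before each retry for a backoff policy (sketch)."""
    if policy == "NONE":
        # Assumption: NONE means no retries are attempted
        return []
    if policy == "CONSTANT":
        # The same delay is used before every retry
        return [initial_delay_ms] * retries
    if policy == "EXPONENTIAL":
        # Assumption: the delay doubles on every attempt
        return [initial_delay_ms * 2 ** attempt for attempt in range(retries)]
    raise ValueError(f"Unknown backoff policy: {policy}")


print(retry_delays())            # [50, 100, 200, 400, 800]
print(retry_delays("CONSTANT"))  # [50, 50, 50, 50, 50]
```

With the documented defaults (EXPONENTIAL, 50ms, 5 retries) and the assumed doubling, the total wait across all retries would be 1550ms.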

scroll

(Optional, Object) The scroll configuration for paginated requests

Details

{
  "scroll": {
    "size": 50
  }
}

size: (Required, String) The size of the scroll request

Authentication

Credentials Secret for Vespa with HTTP Basic Authentication

{
  "type": "http",
  "name": "My Vespa Credentials",
  "secret": {
    ...
  }
}

username: (Required, String) The username of the credentials
password: (Required, String) The password of the credentials

Credentials Secret for Vespa with HTTP Bearer Token

{
  "type": "http",
  "name": "My Vespa Credentials",
  "secret": {
    ...
  }
}

token: (Required, String) The token of the credentials

Credentials Secret for Vespa with API Key

{
  "type": "http",
  "name": "My Vespa Credentials",
  "secret": {
    ...
  }
}

apiKey: (Required, String) The API key of the credentials

Voyage AI

{
  "type": "voyage-ai",
  "name": "My Voyage AI Server",
  "config": {
    ...
  }
}

Note	This integration supports the circuit breaker configuration.

Server Configuration for Voyage AI

connection

(Optional, Object) The configuration of the connection for Voyage AI

Details

connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s

pool

(Optional, Object) The configuration for the connection pool

Details

size: (Optional, Integer) The size of the connection pool. Defaults to 5
keepAlive: (Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m

compressRequest

(Optional, Boolean) true if the requests must be compressed

followRedirects

(Optional, Boolean) true if redirects must be followed

backoffPolicy

(Optional, Object) The configuration of the back off policy for Voyage AI

Details

type: (Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL
initialDelay: (Optional, Duration) The initial delay before retrying. Defaults to 50ms
retries: (Optional, Integer) The maximum number of retries. Defaults to 5

Authentication

Credentials Secret for Voyage AI

{
  "type": "http",
  "name": "My Voyage AI Credentials",
  "secret": {
    ...
  }
}

token: (Required, String) The token of the credentials

DSL

The Discovery Domain-Specific Language is a standardized definition of how to write JSON expressions that can be applied to all Discovery products.

Filters

"Equals" Filter

The value of the field must exactly match the one provided.

{
  "equals": {
    "field": "my-field",
    "value": "my-value",
    "normalize": true
  }
}

When supported, the normalize field enables normalization as described by the filter provider. It is enabled by default.

"Less Than" Filter

The value of the field must be less than the one provided.

{
  "lt": {
    "field": "my-field",
    "value": 1
  }
}

"Less Than or Equal to" Filter

The value of the field must be less than or equal to the one provided.

{
  "lte": {
    "field": "my-field",
    "value": 1
  }
}

"Between" Filter

The value of the field must be greater than or equal to the "from" value (inclusive), and less than the "to" value (exclusive).

{
  "between": {
    "field": "my-field",
    "from": 1,
    "to": 10
  }
}

"Greater Than" Filter

The value of the field must be greater than the one provided.

{
  "gt": {
    "field": "my-field",
    "value": 1
  }
}

"Greater Than or Equal To" Filter

The value of the field must be greater than or equal to the one provided.

{
  "gte": {
    "field": "my-field",
    "value": 1
  }
}

"In" Filter

The value of the field must be one of the provided values.

{
  "in": {
    "field": "my-field",
    "values": [
      "my-value-a",
      "my-value-b"
    ]
  }
}

"Empty" Filter

Checks if a field is empty:

For a collection, true if its size is 0.
For a String, true if its length is 0.
For any other type, true if it is null.

{
  "empty": {
    "field": "my-field"
  }
}

"Exists" Filter

Checks if a field exists.

{
  "exists": {
    "field": "my-field"
  }
}

"Not" Filter

Negates the inner clause.

{
  "not": {
    "equals": {
      "field": "my-field",
      "value": "my-value"
    }
  }
}

"Null" Filter

Checks if a field is null. Note that while the "exists" filter checks whether the field is present or not, the "null" filter expects the field to be present but with null value.

{
  "null": {
    "field": "my-field"
  }
}

"Regex" Filter

Checks if a field matches the given regex pattern.

{
  "regex": {
    "field": "my-field",
    "pattern": "my-pattern"
  }
}

"Boolean" Filter

and: All conditions in the list must evaluate to true.

{
  "and": [
    {
      "equals": {
        "field": "my-field-a",
        "value": "my-value-a"
      }
    }, {
      "equals": {
        "field": "my-field-b",
        "value": "my-value-b"
      }
    }
  ]
}

or: At least one condition in the list must evaluate to true.

{
  "or": [
    {
      "equals": {
        "field": "my-field-a",
        "value": "my-value-a"
      }
    }, {
      "equals": {
        "field": "my-field-b",
        "value": "my-value-b"
      }
    }
  ]
}
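The semantics of the filters above can be summarized with a small in-memory evaluator. This Python sketch is not how any provider executes the filters. `matches` is a hypothetical helper that assumes flat (top-level) field names, ignores the `normalize` flag, and assumes that a regex must match the whole value.

```python
import operator
import re

COMPARATORS = {"lt": operator.lt, "lte": operator.le,
               "gt": operator.gt, "gte": operator.ge}


def matches(dsl_filter: dict, doc: dict) -> bool:
    """Evaluate a DSL filter against a flat dictionary (sketch)."""
    (name, body), = dsl_filter.items()  # each filter is a single-key object

    # Boolean and negation filters wrap other filters
    if name == "and":
        return all(matches(f, doc) for f in body)
    if name == "or":
        return any(matches(f, doc) for f in body)
    if name == "not":
        return not matches(body, doc)

    field = body["field"]
    present = field in doc
    value = doc.get(field)

    if name == "exists":
        return present
    if name == "null":
        # The field must be present, with a null value
        return present and value is None
    if name == "empty":
        if isinstance(value, (str, list, tuple, dict, set)):
            return len(value) == 0
        return value is None
    if value is None:
        return False  # the remaining filters need a value to compare
    if name == "equals":
        return value == body["value"]
    if name in COMPARATORS:
        return COMPARATORS[name](value, body["value"])
    if name == "between":
        # "from" is inclusive, "to" is exclusive
        return body["from"] <= value < body["to"]
    if name == "in":
        return value in body["values"]
    if name == "regex":
        # Assumption: the pattern must match the whole value
        return re.fullmatch(body["pattern"], str(value)) is not None
    raise ValueError(f"Unknown filter: {name}")
```

For example, with the document `{"price": 5}`, a "between" filter from 1 to 10 matches, while one from 1 to 5 does not, because the upper bound is exclusive.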

Projections

A projection lets you select which fields (attributes) are included in or excluded from a response:

If no includes or excludes fields are defined, all fields are returned.
If only the includes fields are defined, only those fields are returned.
If only the excludes fields are defined, all available fields except the excluded ones are returned.
If both includes and excludes fields are defined, both are included in the projection.

Note

The details of how projections are processed might vary between use cases of the DSL and/or providers, especially when it comes to projections with both included and excluded fields. It is recommended to check the documentation of the specific component or API that will be used for details such as projections that are not allowed.

{
  "includes": ["my-field-a", "my-field-b"],
  "excludes": ["my-field-c", "my-field-d"]
}
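As a rough illustration of these rules, the Python sketch below applies a projection to a flat dictionary. `project` is a hypothetical helper. When both lists are defined it applies the includes first and then the excludes, which is only one possible interpretation since, as noted above, the exact behavior can vary by component and provider.

```python
def project(doc: dict, includes=None, excludes=None) -> dict:
    """Apply a DSL projection to a flat dictionary (sketch)."""
    includes = includes or []
    excludes = excludes or []

    if includes:
        # Only the listed fields are kept
        result = {key: value for key, value in doc.items() if key in includes}
    else:
        # No includes defined: start from all fields
        result = dict(doc)

    # Assumption: excludes are applied after includes
    return {key: value for key, value in result.items() if key not in excludes}


doc = {"my-field-a": 1, "my-field-b": 2, "my-field-c": 3}
print(project(doc))                           # all three fields
print(project(doc, includes=["my-field-a"]))  # {'my-field-a': 1}
print(project(doc, excludes=["my-field-c"]))  # {'my-field-a': 1, 'my-field-b': 2}
```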

Expression Language

The Expression Language is a flexible but simple way to manage and handle configurations. In a JSON, the use of expressions allows values to be left unresolved so that they can be evaluated later, in context:

{
  "dynamicField": "#{ first_math_function('input') + second_math_function('input') }",
  "staticField": "value"
}

As shown in the previous example, the syntax of an expression is one or multiple constants, operators and functions wrapped between the #{ and } tokens.

Note	The Expression Language is case-sensitive and all functions are defined in snake_case.
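To make the `#{ ... }` syntax concrete, the following Python sketch walks a parsed JSON configuration and lists every expression it contains. It only locates expressions and does not evaluate them. `find_expressions` is a hypothetical helper, and the regular expression is a simplification that would not handle a `}` character inside an expression.

```python
import re

# Simplified pattern: text between "#{" and the next "}"
EXPRESSION = re.compile(r"#\{(.*?)\}", re.DOTALL)


def find_expressions(config, path=""):
    """Yield (location, expression) pairs for every #{ ... } in a parsed JSON."""
    if isinstance(config, dict):
        for key, value in config.items():
            yield from find_expressions(value, f"{path}/{key}")
    elif isinstance(config, list):
        for index, value in enumerate(config):
            yield from find_expressions(value, f"{path}/{index}")
    elif isinstance(config, str):
        for match in EXPRESSION.finditer(config):
            yield path, match.group(1).strip()


config = {
    "dynamicField": "#{ first_math_function('input') + second_math_function('input') }",
    "staticField": "value",
}
for location, expression in find_expressions(config):
    print(location, "->", expression)
```

Run against the example above, only `dynamicField` is reported, since `staticField` holds a plain value.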

Constants

Table 6. Basic Constants
Constant	Value
NULL	`null`

Table 7. Mathematical Constants
Constant	Value
PI	3.14159265...
E	2.71828182...

Table 8. Boolean Constants
Constant	Value
TRUE	`true`
FALSE	`false`

Operators

Table 9. Mathematical Operators
Operator	Token
Plus	+
Minus	-
Multiplication	*
Division	/
Power of	^
Modulo	%

Table 10. Equality and Relational Operators
Operator	Token
Equals	=
Equals	==
Not equals	<>
Not equals	!=
Greater than	>
Greater than or equal to	>=
Less than	<
Less than or equal to	<=

Table 11. Boolean Operators
Operator	Token
And	&&
Or	\|\|
Not	!

Table 12. Date/Time Operators
Operator	Token
Plus	+
Minus	-

Table 13. String Operators
Operator	Token
Concat	+

Functions

Table 14. Basic Functions
Function	Description	Example
`coalesce(any, ...)`	Returns the first non-null value, or null if there are none	`coalesce(null, 2, 3) = 2`

Table 15. Mathematical Functions
Function	Description	Example
`abs(number)`	Returns the absolute value of a value	`abs(-7) = 7`
`ceiling(number)`	Rounds a number towards positive infinity	`ceiling(1.1) = 2`
`fact(number)`	Returns the factorial of a number	`fact(5) = 120`
`floor(number)`	Rounds a number towards negative infinity	`floor(1.9) = 1`
`log(number)`	Performs the logarithm with base e on a value	`log(5) = 1.609`
`log10(number)`	Performs the logarithm with base 10 on a value	`log10(5) = 0.698`
`max(number, ...)`	Returns the highest value from all the parameters provided	`max(5, 55, 6, 102) = 102`
`min(number, ...)`	Returns the lowest value from all the parameters provided	`min(5, 55, 6, 102) = 5`
`random()`	Returns random number between 0 and 1	`random() = 0.1613...`
`round(number, integer)`	Rounds a decimal number to a specified scale	`round(0.5652, 2) = 0.57`
`sum(number, ...)`	Returns the sum of the parameters	`sum(0.5, 3, 1) = 4.5`
`sqrt(number)`	Returns the square root of the value provided	`sqrt(4) = 2`

Table 16. Trigonometric Functions
Function	Description	Example
`acos(number)`	Returns the arc-cosine in degrees	`acos(1) = 0`
`acosh(number)`	Returns the hyperbolic arc-cosine in degrees	`acosh(1.5) = 0.96`
`acosr(number)`	Returns the arc-cosine in radians	`acosr(0.5) = 1.04`
`acot(number)`	Returns the arc-co-tangent in degrees	`acot(1) = 45`
`acoth(number)`	Returns the hyperbolic arc-co-tangent in degrees	`acoth(1.003) = 3.141`
`acotr(number)`	Returns the arc-co-tangent in radians	`acotr(1) = 0.785`
`asin(number)`	Returns the arc-sine in degrees	`asin(1) = 90`
`asinh(number)`	Returns the hyperbolic arc-sine in degrees	`asinh(6.76) = 2.61`
`asinr(number)`	Returns the arc-sine in radians	`asinr(1) = 1.57`
`atan(number)`	Returns the arc-tangent in degrees	`atan(1) = 45`
`atan2(number, number)`	Returns the angle of arc-tangent2 in degrees	`atan2(1, 0) = 90`
`atan2r(number, number)`	Returns the angle of arc-tangent2 in radians	`atan2r(1, 0) = 1.57`
`atanh(number)`	Returns the hyperbolic arc-tangent in degrees	`atanh(0.5) = 0.54`
`atanr(number)`	Returns the arc-tangent in radians	`atanr(1) = 0.78`
`cos(number)`	Returns the cosine in degrees	`cos(180) = -1`
`cosh(number)`	Returns the hyperbolic cosine in degrees	`cosh(PI) = 11.591`
`cosr(number)`	Returns the cosine in radians	`cosr(PI) = -1`
`cot(number)`	Returns the co-tangent in degrees	`cot(45) = 1`
`coth(number)`	Returns the hyperbolic co-tangent in degrees	`coth(PI) = 1.003`
`cotr(number)`	Returns the co-tangent in radians	`cotr(0.785) = 1`
`csc(number)`	Returns the co-secant in degrees	`csc(270) = -1`
`csch(number)`	Returns the hyperbolic co-secant in degrees	`csch(3*PI/2) = 0.017`
`cscr(number)`	Returns the co-secant in radians	`cscr(3*PI/2) = -1`
`deg(number)`	Converts an angle from radians to degrees	`deg(0.785) = 45`
`rad(number)`	Converts an angle from degrees to radians	`rad(45) = 0.785`
`sin(number)`	Returns the sine in degrees	`sin(150) = 0.5`
`sinh(number)`	Returns the hyperbolic sine in degrees	`sinh(2.61) = 6.762`
`sinr(number)`	Returns the sine in radians	`sinr(2.61) = 0.5`
`sec(number)`	Returns the secant in degrees	`sec(120) = -2`
`sech(number)`	Returns the hyperbolic secant in degrees	`sech(2.09) = 0.243`
`secr(number)`	Returns the secant in radians	`secr(2.09) = -2`
`tan(number)`	Returns the tangent in degrees	`tan(360) = 0`
`tanh(number)`	Returns the hyperbolic tangent in degrees	`tanh(2*PI) = 1`
`tanr(number)`	Returns the tangent in radians	`tanr(2*PI) = 0`
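In Table 16, the functions without a suffix work in degrees, while the variants with the `r` suffix work in radians. The short Python sketch below reproduces three of the table's examples with the standard `math` module to show the difference. The helper names `sin_deg` and `asin_deg` are hypothetical, and this is plain Python, not the Expression Language itself.

```python
import math


def sin_deg(angle_in_degrees: float) -> float:
    """Sine of an angle given in degrees, like sin(number) in Table 16."""
    return math.sin(math.radians(angle_in_degrees))


def asin_deg(value: float) -> float:
    """Arc-sine returned in degrees, like asin(number) in Table 16."""
    return math.degrees(math.asin(value))


print(round(sin_deg(150), 3))   # 0.5   (compare with sin(150) = 0.5)
print(round(asin_deg(1), 3))    # 90.0  (compare with asin(1) = 90)
print(round(math.asin(1), 2))   # 1.57  (compare with asinr(1) = 1.57)
```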

Table 17. Date/Time Functions
Function	Description	Example
`format(string, string)`	Formats a date/time with a given pattern as described in Date/Time	`format('2023-11-28T20:46', "d MMM uuuu") = "28 Nov 2023"`
`now()`	Gets the current datetime	`now() = 2023-12-06T09:27:21.123456Z`
`to_date(string)`	Parses the input pattern as described in Date/Time	`to_date("2023-12-08T10:30:00Z")`
`to_date(number, number, number, number?, number?, number?)`	When providing a set of integers, year, month and day are required. Hour, minute and second are all optional. The order of the parameters must be as previously mentioned	`to_date(2023, 12, 8, 10, 30) = "2023-12-08T10:30:00Z"`

Table 18. Logic Functions
Function	Description	Example
`if(boolean, any, any)`	Conditional operation where if the boolean expression evaluates to `true`, the second parameter is returned. Otherwise, the third parameter is returned	`if(TRUE, 5+1, 6+2) = 6`
`not(boolean)`	Negates a boolean expression	`not(TRUE) = false`

Table 19. String Functions
Function	Description	Example
`lower(string)`	Converts String to lower case	`lower("THIS IS A TEST") = "this is a test"`
`upper(string)`	Converts String to upper case	`upper("this is a test") = "THIS IS A TEST"`
`starts_with(string, string, boolean?)`	Verifies with a boolean if a string begins with a given substring. Case sensitivity can optionally be specified. If the case sensitivity flag is not sent, it will be set to `true` by default	`starts_with("This is a test", "This", true) = true`
`ends_with(string, string, boolean?)`	Verifies with a boolean if a string ends with a given substring. Case sensitivity can optionally be specified. If the case sensitivity flag is not sent, it will be set to `true` by default	`ends_with("This is a test", "Test", false) = true`
`regex(string, pattern)`	Returns a boolean specifying if a string matches a given pattern	`regex("This is a Test", ".t") = true`
`is_empty(string)`	Returns a boolean specifying if a String is empty	`is_empty("") = true`
`is_blank(string)`	Whether the variable is a blank String	`is_blank("") = true`
`size(string)`	Returns the length of a given string	`size("Test") = 4`
`concat(string, ...)`	Concatenates a given set of strings	`concat("This", "is", "a test") = Thisisa test`
`split(string, pattern)`	Splits a String into a List by a regex value	`split("This,is,a,test", ",") = ["This", "is", "a", "test"]`
`strip(string)`	Strips the punctuation, replacing it by a space	`strip("This.is.a.test") = This is a test`
`contains(string, string)`	Returns a boolean specifying if the first string contains the second one	`contains("Hello World!", "World") = true`
`uuid()`	Generates a random UUID v4	`uuid() = "7583bc66-60a4-4ce5-a64d-15f245d52027"`
`to_number(string)`	Returns the number represented by a string	`to_number("5.2") = 5.2`

Table 20. Hash Functions
Function	Description	Example
`md5(any)`	Hashes a given object using MD5	`md5("This is a Test") = "2e674a93..."`
`sha256(any)`	Hashes a given object using SHA-256	`sha256("This is a Test") = "401b022b962452749..."`

Table 21. List Functions
Function	Description	Example
`is_empty(array)`	Returns a boolean specifying if a list is empty	`is_empty(array) = true`, where `array = []`
`size(array)`	Returns the amount of items in the list	`size(array) = 3`, where `array = [a, b, c]`
`contains(array, any)`	Returns a boolean specifying if the list contains the value	`contains(array, 3) = true`, where `array = [1, 2, 3]`
`get(array, number)`	Returns the value in the given index	`get(array, 1) = b`, where `array = [a, b, c]`
`concat(array, ...)`	Joins a given set of arrays into one	`concat(array1, array2, array3) = [a, b, c]`, where `array1 = [a]`, `array2 = [b]`, `array3 = [c]`

Note	An alternative syntax to access to the value on the position `i` of a given array is `array[i]`

Table 22. Map Functions
Function	Description	Example
`is_empty(map)`	Returns a boolean specifying if a map is empty	`is_empty(map) = true`, where `map = [<,>]`
`size(map)`	Returns the amount of items in the map	`size(map) = 3`, where `map = [<a,1>, <b,2>, <c,3>]`
`contains(map, any)`	Returns a boolean specifying if the map contains the key	`contains(map, a) = true`, where `map = [<a,1>, <b,2>, <c,3>]`
`get(map, any)`	Returns the value with the given key	`get(map, b) = 2`, where `map = [<a,1>, <b,2>, <c,3>]`

Table 23. JSON Functions
Function	Description	Example
`jsonpath(json, string)`	Finds a specific value within a JSON with a JSONPath string	`jsonpath(data('path/to/json'), '$.some.path')`

Table 24. Discovery Functions
Function	Description	Example
`file(string, string?)`	Reads a file from the storage. The file can optionally be obtained as a byte array representation if the `BYTES` parameter is sent. If not, it will be set to `STRING` by default. The result of this function is cached, and it is aware of any change on the file	`file('my-file.txt', 'STRING') = "plain text"`

Script Engine

The Script Engine enables the execution of scripts for advanced handling of execution data. It supports multiple scripting languages and provides tools for JSON manipulation and logging.

Bindings

Each script has bindings to interact with the execution context where it runs:

data

(Object) Allows the creation and manipulation of JsonNode instances. Also, if the script runs as part of Discovery Ingestion or Discovery QueryFlow, the binding exposes the corresponding context data.

Details

Method	Description
ArrayNode arrayNode()	Returns a new JSON array
JsonNode get(String)	Obtains a deep copy of a value or node from the data generated during the execution. The input must be the JSON Pointer of the data field, for example: data.get("/myfield")
NullNode nullNode()	Returns a new null node
ObjectNode objectNode()	Returns a new JSON object
JsonNode output()	Obtains the value that will be used as output for the component
JsonNode parseJson(String)	Takes a JSON in String format and parses it into a JSON document
JsonNode valueToJson(Object)	Takes any object and tries to convert it into a JSON document
void set(parameter)	Sets the output field to a primitive type (integer, long, str, float, double) or a JsonNode. In dynamically typed languages such as Python, the method infers the type of the parameter. If the use case requires it, an explicit cast can help control the type of the output node
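The `get` method described above takes a JSON Pointer. For readers unfamiliar with that syntax, the following Python sketch resolves a pointer against a plain dictionary and returns a deep copy of the result. It only illustrates the pointer syntax (RFC 6901). `get_by_pointer` is a hypothetical helper and is not the implementation behind the `data` binding.

```python
import copy


def get_by_pointer(document, pointer: str):
    """Resolve an RFC 6901 JSON Pointer and return a deep copy of the node."""
    if pointer == "":
        return copy.deepcopy(document)  # the empty pointer is the whole document
    if not pointer.startswith("/"):
        raise ValueError("A JSON Pointer must be empty or start with '/'")

    node = document
    for token in pointer[1:].split("/"):
        # Unescape per RFC 6901: "~1" is "/" and "~0" is "~"
        token = token.replace("~1", "/").replace("~0", "~")
        if isinstance(node, list):
            node = node[int(token)]  # array elements are addressed by index
        else:
            node = node[token]
    return copy.deepcopy(node)


execution_data = {"myfield": {"items": [10, 20, 30]}}
print(get_by_pointer(execution_data, "/myfield/items/1"))  # 20
```

Because the result is a deep copy, changing the returned node does not alter the original data.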

log

(Object) Exposes an SLF4J logger that can log messages directly into the application

Python script example

value = 5
if (value <= 10):
  log.error("Example of an error")

Groovy script example

// Build a new JSON object and set it as the output of the component
var response = data.objectNode();
response.put("intValue", 3);
data.set(response);

log.info("Node set with value: " + data.output().get("intValue").asText());

JavaScript script example

var requestBody = data.get("/numberTest").asInt();
data.set(requestBody);

Template Engine

The Template Engine, provided by FreeMarker, acts as a blueprint that uses a given input to generate various types of documents as either plain text or JSON.

It is supported in both Discovery Ingestion and Discovery QueryFlow, and can take a standard template and process it with contextual structured data to create a verbalized representation of the information.

Consider the following input data:

{
  "name": "mary",
  "users": [
    {
      "name": "Jane Doe",
      "id": 0
    },
    {
      "name": "Mary",
      "id": 2
    },
    {
      "name": "Alice",
      "id": 3
    }
  ]
}

When processed with the following template:

Hello, ${name?capitalize}!

Users registered:
<#list users as user>
  <#if user.id == 0>
    name: admin, ID: ${user.id}
  <#else>
    Name: ${user.name}, ID: ${user.id}
  </#if>
</#list>

Then, the output would be:

Hello, Mary!

Users registered:
name: admin, ID: 0
Name: Mary, ID: 2
Name: Alice, ID: 3

Hello, ${name?capitalize}! outputs a greeting using the name specified in the JSON data, with its first letter capitalized. Given the JSON data, it outputs Hello, Mary! because mary is capitalized.
<#list users as user> is a directive that iterates over the users array in the JSON data.
<#if user.id == 0> checks, within the list, whether the user’s id is 0. If so, it outputs name: admin, ID: 0 instead of the user’s actual name.
<#else> handles all other users (where id is not 0) and outputs Name: ${user.name}, ID: ${user.id} with the values of the current user.

Template Language

Placeholders

Placeholders are references to the data model passed to the template.

Syntax:

${variableName}

Example data model:

{
  "name": "Mary"
}

Example:

${name}

Output: Mary

Comments

Comments are a way to add notes or explanations within your templates.

Syntax:

<#-- Comment -->

Example:

<#-- Hello this is my comment -->

Directives

These are instructions that control the processing flow of the template (like loops and conditionals). The full list of the directives can be found in the Directive reference of the FreeMarker documentation.

Assign

Used to define a variable.

Syntax:

<#assign name1=value1>

Example:

<#assign x=1>

Attempt, Recover

Used for error handling in templates. The attempt block contains the template code that might fail, and the recover block defines what to output if an error occurs in the attempt block.

Syntax:

<#attempt>
  attempt block
<#recover>
  recover block
</#attempt>

Example:

<#attempt>
  ${user.name}
<#recover>
  Unknown User
</#attempt>

Function, Return

Used to create a method variable. The return directive must have a parameter that specifies the return value of the method.

Syntax:

<#function name param1 param2 ... paramN>
  ...
<#return returnValue>
  ...
</#function>

Example:

<#function avg x y>
  <#return (x + y) / 2>
</#function>

${avg(10, 20)}

Output: 15.

Global

Used to define a variable for all namespaces.

Syntax:

<#global name=value>

Example:

<#global x=1>

If, else, elseif

Used to conditionally skip a section of the template.

Syntax:

<#if condition>
  ...
<#elseif condition2>
  ...
<#else>
  ...
</#if>

Example:

<#if x == 1>
  x is 1
<#elseif x == 2>
  x is 2
<#else>
  x is not 1 nor 2
</#if>

Import

Used to make all macros and functions from another template file available in the current template through a namespace. Path is the path of the template file to import and hash is the name of the variable by which you can access that namespace.

Syntax:

<#import path as hash>

Example:

<#import "library.ftl" as lib>

Include

Used to include the content of another template file into the current template output. It does not make macros or functions from the included file available in the current namespace. Path is the path of the template file to include.

Syntax:

<#include path>

Example:

<#include "header.ftl">

List

Used for iterating over a collection.

else: Within a list, it is used to specify output if the list is empty.
items: Marks the part of the list body that is repeated for each item, so that the content around it is output only once.
sep: Used to output something between items, like a separator.
break: Exits the loop prematurely.
continue: Skips the current iteration and moves to the next item.

Syntax:

<#list sequence as item>
  Part repeated for each item
</#list>

Example:

<#list users as user>
  ${user.name}
<#else>
  No users found.
</#list>

Macro

Used to define a reusable block of template code.

Syntax:

<#macro name param1 param2 ... paramN>
  ...
</#macro>

Example:

<#macro test>
  Test text
</#macro>

<#-- call the macro: -->
<@test/>

Output: Test text

File Storage

Files API

Upload a File

$ curl --request PUT 'core-api:8080/v2/file/my/key/to/my/file' --form 'file=@"/../../test.txt"'

Retrieve a File

$ curl --request GET 'core-api:8080/v2/file/my/key/to/my/file'

List all Files

$ curl --request GET 'core-api:8080/v2/file'

Delete a file

$ curl --request DELETE 'core-api:8080/v2/file/my/key/to/my/file'

The File Storage handles files with the help of a dedicated "folder" in the Object Storage.

File names can be constructed as nested paths, where a slash at the end of each sub path denotes this sub path as a parent/folder. File names must follow these rules:

Parent names can contain alphanumeric characters (i.e. [A-Z], [a-z], [0-9]), hyphen (i.e. -), underscore (i.e. _) and spaces.
The length must be between 1 and 255 characters.
A nested path can consist of up to 10 levels.
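The naming rules above can be checked on the client side before calling the Files API. The following Python sketch is one possible reading of them, not the validation performed by the server. `validate_file_key` is a hypothetical helper, the 1 to 255 character limit is assumed to apply to the whole key, a level is assumed to be one parent folder, and the last path segment (the file name itself) is not checked against the parent-name character set.

```python
import re

PARENT_NAME = re.compile(r"[A-Za-z0-9_\- ]+")
MAX_LENGTH = 255
MAX_LEVELS = 10


def validate_file_key(key: str) -> list[str]:
    """Return the list of rule violations for a file key (empty if valid)."""
    problems = []
    if not 1 <= len(key) <= MAX_LENGTH:
        # Assumption: the length limit applies to the whole key
        problems.append("length must be between 1 and 255 characters")

    # Everything before the last slash is treated as parent folders
    *parents, _file_name = key.split("/")
    if len(parents) > MAX_LEVELS:
        # Assumption: a level is one parent folder in the path
        problems.append("path is nested deeper than 10 levels")
    for parent in parents:
        if not PARENT_NAME.fullmatch(parent):
            problems.append(f"invalid parent name: {parent!r}")
    return problems


print(validate_file_key("my/key/to/my/file"))  # []
print(validate_file_key("bad!/file"))          # ["invalid parent name: 'bad!'"]
```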

Note	When executing the endpoints to upload or delete files, the Expression Language is notified about the changes to clear its internal cache.

Discovery Staging

Discovery Staging is a REST API on top of a Document Database. Its goal is to simplify and standardize the interactions of all Discovery products with any supported provider that can handle JSON content, while offering the end user features such as:

A push model alternative for the ETL process in Discovery Ingestion.
An intermediate repository for Discovery Ingestion, reducing the time and costs of content reprocessing for each processing iteration.
Advanced search capabilities in Discovery QueryFlow such as facet snapping based on the user’s input.

Supported Providers

MongoDB

As the NoSQL industry standard for document storage, and with a MongoDB Atlas managed service available in the marketplace of all major Cloud Providers, MongoDB is the default Document Database provider for Discovery Staging in all Discovery installations.

DocumentDB

Amazon DocumentDB (with MongoDB compatibility) is a good alternative Document Database provider for Discovery Staging in installations fully managed by AWS.

Content

Content API

Get a single document

$ curl --request GET 'staging-api:8081/v2/content/{bucketName}/{contentId}?action={action}&include={include}&exclude={exclude}'

Path Parameters

bucketName: (Required, String) The name of the bucket.
contentId: (Required, String) The document ID.

Query Parameters

action: (Optional, String) The actions to filter the documents. Defaults to STORE.
include: (Optional, Array of Strings) The fields of the document’s content to include in the response.
exclude: (Optional, Array of Strings) The fields of the document’s content to exclude from the response.

Store

$ curl --request POST 'staging-api:8081/v2/content/{bucketName}/{contentId}?parentId={parentId}'

Path Parameters

bucketName: (Required, String) The name of the bucket.
contentId: (Required, String) The document ID.

Query Parameters

parentId: (Optional, String) The parent ID of the documents.

Details

Note	This endpoint is capable of updating an existing document by using the `contentId` of the original document

The final content must not exceed the maximum size supported by the chosen provider. Exceeding the limit results in a 413 error in the Staging API’s response.

Documents are stored with metadata. This adds to the size of the final document beyond the content in the request body, so it is recommended to keep the body below the limit set by the provider.

The following table details the maximum provider limits:

Provider	Limit
MongoDB	BSON size limit (~16 MB / 16793600 bytes)
DocumentDB	BSON size limit (~16 MB / 16793600 bytes)
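Since the stored document is larger than the request body, a client can check the body size with some headroom before sending it. The Python sketch below is a rough estimate only. `fits_in_staging` and the 64 KiB margin are hypothetical, the actual metadata overhead is not stated in this guide, and the size of the JSON text is used as an approximation of the stored BSON size.

```python
import json

BSON_LIMIT_BYTES = 16_793_600       # limit listed for MongoDB and DocumentDB
METADATA_MARGIN_BYTES = 64 * 1024   # hypothetical headroom for the stored metadata


def fits_in_staging(content: dict, margin: int = METADATA_MARGIN_BYTES) -> bool:
    """Estimate whether a document body is safely below the provider limit."""
    # Size of the UTF-8 encoded JSON body; the stored BSON size can differ
    body_size = len(json.dumps(content).encode("utf-8"))
    return body_size + margin <= BSON_LIMIT_BYTES


print(fits_in_staging({"title": "small document"}))  # True
```

A document that fails this check would need to be reduced or split before it is stored, instead of relying on the 413 response.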

Delete

$ curl --request DELETE 'staging-api:8081/v2/content/{bucketName}/{contentId}'

Path Parameters

bucketName: (Required, String) The name of the bucket.
contentId: (Required, String) The document ID.

Delete multiple documents

$ curl --request DELETE 'staging-api:8081/v2/content/{bucketName}?parentId={parentId}' --data '{ ... }'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Query Parameters

parentId: (Optional, String) The parent ID of the documents.

Body

The body payload is an optional DSL Filter to apply to the delete.

Note	The `body` and `parentId` parameters are optional, but in order to avoid deleting all documents, at least one of them must be included.
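A client can enforce the rule from the note before any request is sent. The following Python sketch builds the URL and body for the bulk delete and refuses to do so when neither a filter nor a `parentId` is given. `build_bulk_delete` is a hypothetical helper, and the `http://` scheme in the usage line is an assumption, since the examples in this guide only show the host and port.

```python
import json
from urllib.parse import quote, urlencode


def build_bulk_delete(base_url, bucket, parent_id=None, dsl_filter=None):
    """Return (url, body) for a bulk delete, refusing an unscoped request."""
    if parent_id is None and not dsl_filter:
        # Neither scope was given: do not build a request at all
        raise ValueError("Provide a parentId, a DSL filter, or both")

    url = f"{base_url}/v2/content/{quote(bucket, safe='')}"
    if parent_id is not None:
        url += "?" + urlencode({"parentId": parent_id})
    body = json.dumps(dsl_filter) if dsl_filter else None
    return url, body


url, body = build_bulk_delete(
    "http://staging-api:8081",  # assumed scheme
    "my-bucket",
    dsl_filter={"equals": {"field": "my-field", "value": "my-value"}},
)
print(url)
print(body)
```

The returned pair can then be sent as a DELETE request with any HTTP client.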

Scroll

$ curl --request POST 'staging-api:8081/v2/content/{bucketName}/scroll?token={token}&parentId={parentId}&size={size}&action={action}' --data '{ ... }'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Query Parameters

token: (Optional, Hex String) The token to paginate the documents.
parentId: (Optional, String) The parent ID of the documents.
size: (Optional, Int) The number of documents to scroll. Defaults to 25.
action: (Optional, Array of String) The actions to filter the documents. Defaults to STORE.

Body

The body payload is an optional DSL Filter and an optional DSL Projection to apply to the scroll

{
  "fields": <Projection DSL>,
  "filters": <Filter DSL>
}

Details

Note	The `scroll` functionality is meant for iterating through all the documents in a bucket that match the applied filters and projections. Sorting is not available because the order is not relevant when scrolling.

Search

$ curl --request POST 'staging-api:8081/v2/content/{bucketName}/search?parentId={parentId}&action={action}&page={page}&size={size}&sort={sort}' --data '{ ... }'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Query Parameters

parentId: (Optional, String) The parent ID of the documents.
action: (Optional, Array of String) The actions to filter the documents. Defaults to STORE.
page: (Optional, Int) The page number. Defaults to 0.
size: (Optional, Int) The size of the page. Defaults to 20.
sort: (Optional, Array of String) The sort definition for the page.

Body

The body payload is an optional DSL Filter and an optional DSL Projection to apply to the search

{
  "fields": <Projection DSL>,
  "filters": <Filter DSL>
}

Note

The search functionality should be used when a collection must be sorted and the top results retrieved in pages according to the provided sorting criteria. It also allows applying filters and defining which fields to include in or exclude from the retrieved documents. It should not be used in place of the scroll endpoint, because it is meant to work as a query-and-match feature, implemented with the tools offered by the provider, that returns the most relevant results.

For both the search and scroll endpoints, all the query-string parameters and the body are optional. The action parameter can contain multiple values.

The fields and filters properties apply only to the fields within the content field of the items in the bucket.

The content of the bucket is the data, stored as JSON.
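As an illustration of how the search body combines a projection and a filter, the Python sketch below assembles the URL and JSON payload for a search request. `build_search` is a hypothetical helper, the `http://` scheme is an assumption, and the `sort`, `action` and `parentId` parameters are left out because their exact formats are not detailed here.

```python
import json
from urllib.parse import quote, urlencode


def build_search(base_url, bucket, filters=None, fields=None, page=0, size=20):
    """Return (url, body) for a search request on a bucket (sketch)."""
    query = urlencode({"page": page, "size": size})
    url = f"{base_url}/v2/content/{quote(bucket, safe='')}/search?{query}"

    body = {}
    if fields:
        body["fields"] = fields    # Projection DSL
    if filters:
        body["filters"] = filters  # Filter DSL
    return url, json.dumps(body)


url, body = build_search(
    "http://staging-api:8081",  # assumed scheme
    "my-bucket",
    filters={"gte": {"field": "my-field", "value": 1}},
    fields={"includes": ["my-field-a", "my-field-b"]},
)
print(url)
print(body)
```

Both `fields` and `filters` refer to fields inside the `content` of each stored item, as described above.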

Buckets

Buckets API

Get All

$ curl --request GET 'staging-api:8081/v2/bucket'

Get information

$ curl --request GET 'staging-api:8081/v2/bucket/{bucketName}'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Delete

$ curl --request DELETE 'staging-api:8081/v2/bucket/{bucketName}'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Purge

$ curl --request DELETE 'staging-api:8081/v2/bucket/{bucketName}/purge'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Note	Only purges the documents with the `DELETE` action.

Delete Index

$ curl --request DELETE 'staging-api:8081/v2/bucket/{bucketName}/index/{indexName}'

Path Parameters

bucketName: (Required, String) The name of the bucket.
indexName: (Required, String) The name of the index.

Create an index

$ curl --request PUT 'staging-api:8081/v2/bucket/{bucketName}/index/{indexName}' --data '{ ... }'

Path Parameters

bucketName: (Required, String) The name of the bucket.
indexName: (Required, String) The name of the index.

Body

[
  { "fieldA": "ASC" },
  { "fieldB": "DESC" },
  ...
]

Note	An empty or null body is also allowed. In that case, an ascending index is created with the index name as the field name.

Create a bucket

$ curl --request POST 'staging-api:8081/v2/bucket/{bucketName}' --data '{ ... }'

Path Parameters

bucketName: (Required, String) The name of the bucket.

Body

{
  "indices": [
    {
      "name": "myIndexA",
      "fields": [
        { "fieldA": "ASC" },
        { "fieldB": "DESC" },
        ...
      ]
    },
    ...
  ],
  "config":{}
}

A bucket is a complete collection of data. Several operations can be performed on a bucket, where the results vary depending on the user input from the HTTP request.

Metadata description

{
  "name": "<Text>",
  "documentCount": {
    "STORE": "<Number>",
    "DELETE": "<Number>"
  },
  "content": {
    "oldest": "<StagingDocument>",
    "newest": "<StagingDocument>"
  },
  "indices": [
    {
      "name": "<Text>",
      "fields": [
        {
          "fieldA": "ASC|DESC"
        }
      ]
    }
  ]
}

Property	Type	Description
name	Text	The bucket name
documentCount	JSON Object	The total number of documents in the bucket, divided by action
documentCount.STORE	Number	The number of documents currently in the bucket with a STORE action type
documentCount.DELETE	Number	The number of documents currently in the bucket with a DELETE action type
content	JSON Object	The content of the bucket, including the oldest and newest documents
content.oldest	Staging Document	The oldest document in the bucket
content.newest	Staging Document	The newest document in the bucket
indices	JSON Array	Array with the name and fields of every index in the bucket

Index description

Property	Type	Description
name	Text	The index name
fields	Key/Value Pair Array	The fields used for the index. The key of every element is the name of the field, and the value is its sort direction (ASC or DESC). Ascending by default.

Note	All indices are over the content of the document

Note	When creating an index, if any of the fields is duplicated, the last value specified for that field takes precedence.

Note	In the value of the fields, besides `ASC` or `DESC`, the sort direction can also be given as a number: 0 → `ASC`, 1 → `DESC`
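The index rules described in the notes above can be modeled with a short Python sketch. It is not the service implementation: the function name `normalize_index_fields` is hypothetical, and keeping a duplicated field at the position of its first occurrence is an assumption, since the guide only states which value wins.

```python
def normalize_index_fields(index_name, body=None):
    """Model how an index body is interpreted, following the documented notes."""
    # Both the textual and the numeric forms of the sort direction are accepted.
    directions = {"ASC": "ASC", "DESC": "DESC", 0: "ASC", 1: "DESC"}

    # Empty or null body: ascending index on a field named after the index.
    if not body:
        return [{index_name: "ASC"}]

    merged = {}
    for entry in body:
        for field, direction in entry.items():
            # A later specification of the same field overrides the earlier one.
            merged[field] = directions[direction]
    return [{field: direction} for field, direction in merged.items()]


print(normalize_index_fields("myIndexA"))
print(normalize_index_fields("myIndexA", [{"fieldA": "ASC"}, {"fieldB": 1}, {"fieldA": "DESC"}]))
```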

Discovery Ingestion

Discovery Ingestion is a fully-featured extract, transform, and load (ETL) tool that orchestrates the communication with external services while applying data enrichment to the records detected in the given data source. It enables features such as:

Flexibility to represent complex data processing scenarios through a finite-state machine.
Distributed, auto-scalable model that only consumes resources as needed.
Extensive component library for data source scanning, records processing and hooks triggering.

Data Seed

Seeds API

Create a new Seed

$ curl --request POST 'ingestion-api:8080/v2/seed' --data '{ ... }'

Start the execution of an existing Seed

$ curl --request POST 'ingestion-api:8080/v2/seed/{id}?scanType={scan-type}' --data '{ ... }'

Query Parameters

scanType: (Required, String) The scan type for the seed execution. Currently, only FULL is supported

Body

The body payload contains the execution properties, which override the ones configured in the Seed

Halt all executions of a Seed

$ curl --request POST 'ingestion-api:8080/v2/seed/{id}/halt'

List all Seeds

$ curl --request GET 'ingestion-api:8080/v2/seed'

Get a single Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{id}'

Update an existing Seed

$ curl --request PUT 'ingestion-api:8080/v2/seed/{id}' --data '{ ... }'

Note	The type of an existing seed can’t be modified.

Reset the metadata of an existing Seed

$ curl --request POST 'ingestion-api:8080/v2/seed/{id}/reset'

Delete an existing Seed

$ curl --request DELETE 'ingestion-api:8080/v2/seed/{id}'

Clone an existing Seed

$ curl --request POST 'ingestion-api:8080/v2/seed/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Seed

Search for Seeds using DSL Filters

$ curl --request POST 'ingestion-api:8080/v2/seed/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Seeds

$ curl --request GET 'ingestion-api:8080/v2/seed/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

A seed defines the data source for the configuration and the pipeline to follow during the processing of each record through the finite-state machine.

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  ...
}

type

(Required, String) The name of the component to execute

name

(Required, String) The unique name to identify the configuration

description

(Optional, String) The description for the configuration

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration

Details

{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}

id: (Required, UUID) The ID of the server configuration for the integration
credential: (Optional, UUID) The ID of the credential to override the default authentication in the external service

pipeline

(Required, UUID) The ID of the pipeline configuration for all detected records

recordPolicy

(Optional, Object) The global configuration for each record during its processing. Can also be referred to as record

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    ...
  },
  ...
}

id

(Optional, String) The expression that represents the ID of the record during its processing through the finite-state machine. If not provided, the plain ID of the record will be used

retryPolicy

(Optional, Object) The retry policy for failed records in the Seed Execution

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "retryPolicy": {
      ...
    },
    ...
  },
  ...
}

maxRetries: (Required, Integer) The maximum number of retries for processing the record. Retries are executed from the point where the record failed. Defaults to 3

timeoutPolicy

(Optional, Object) The timeout policy for records in the Seed Execution

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "timeoutPolicy": {
      ...
    },
    ...
  },
  ...
}

scan: (Required, Duration) The timeout for the scan of each slice with records. Defaults to 1h
process: (Required, Duration) The timeout for each record during its execution through the finite-state machine. Defaults to 60s

errorPolicy

(Optional, Object) The error policy for records in the Seed Execution

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "errorPolicy": {
      ...
    },
    ...
  },
  ...
}

scan

(Required, String) The error policy for scanned records. Defaults to FATAL

FATAL: A single failed document aborts the complete process
IGNORE: Ignores the record scan error and the Seed Execution continues

processor

(Required, String) The error policy for records during their execution through the finite-state machine. Defaults to FAIL

FATAL: A single failed document aborts the complete process
FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected
IGNORE: Ignores the record processing error and its execution continues

outboundPolicy

(Optional, Object) The policy for groups of records sent for processing within the finite-state machine. Applied by default to outbound records (i.e., records sent to the next state) in each state thereafter

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "outboundPolicy": {
      ...
    },
    ...
  },
  ...
}

batchPolicy

(Optional, Object) The batch policy for outbound batches of records

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        ...
      }
    },
    ...
  },
  ...
}

maxCount: (Required, Integer) The maximum record count in a batch before flushing. Defaults to 25
flushAfter: (Required, Duration) The timeout to flush if no other condition has been met. Defaults to 1m
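To see how the policies above nest inside recordPolicy, the following Python sketch builds a seed payload with every documented default written out explicitly. The helper `build_seed`, the placeholder pipeline ID, and the choice to replace a whole policy object when it is overridden are assumptions made for illustration only.

```python
import copy
import json

# Documented defaults, arranged as they nest inside "recordPolicy".
DEFAULT_RECORD_POLICY = {
    "retryPolicy": {"maxRetries": 3},
    "timeoutPolicy": {"scan": "1h", "process": "60s"},
    "errorPolicy": {"scan": "FATAL", "processor": "FAIL"},
    "outboundPolicy": {"batchPolicy": {"maxCount": 25, "flushAfter": "1m"}},
}


def build_seed(seed_type, name, pipeline_id, config, record_policy=None):
    """Return a seed payload with the record policy spelled out."""
    policy = copy.deepcopy(DEFAULT_RECORD_POLICY)
    # Assumption of this sketch: an override replaces the whole policy object.
    for key, value in (record_policy or {}).items():
        policy[key] = value
    return {
        "type": seed_type,
        "name": name,
        "config": config,
        "pipeline": pipeline_id,
        "recordPolicy": policy,
    }


seed = build_seed(
    "my-component-type",
    "My Component Seed",
    "00000000-0000-0000-0000-000000000000",  # hypothetical pipeline ID
    {},
    record_policy={"retryPolicy": {"maxRetries": 5}},
)
print(json.dumps(seed, indent=2))
```

The resulting JSON can be used as the body of the create or update Seed requests.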

beforeHooks

(Optional, Object) The Hooks to execute before starting the record processing

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "beforeHooks": {
    "hooks": [
      ...
    ],
    "timeout": "60s",
    "errorPolicy": "IGNORE"
  },
  ...
}

hooks

(Required, Array of Objects) The list of Hooks to execute

Details

{
  "hooks": [
    {
      "id": <Hook ID>,
      ...
    }
  ],
  "timeout": "60s",
  "errorPolicy": "IGNORE"
}

id

(Required, UUID) The ID of the Hook to execute

errorPolicy

(Optional, String) Overrides the global policy for errors during the execution of the Hook

FATAL: A single failed hook aborts the complete process
IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Optional, Duration) Overrides the global timeout for the execution of the Hook

active

(Optional, Boolean) false to disable the execution of the Hook

errorPolicy

(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE

FATAL: A single failed hook aborts the complete process
IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s
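The relationship between the global hook settings and the per-hook overrides can be sketched in Python as follows. The function name `effective_hook_settings` and the hook IDs are hypothetical, and treating a missing active flag as true is an assumption.

```python
def effective_hook_settings(hooks_config):
    """Resolve the error policy and timeout that apply to each active hook."""
    # Global values, with the documented defaults when they are omitted.
    global_policy = hooks_config.get("errorPolicy", "IGNORE")
    global_timeout = hooks_config.get("timeout", "60s")

    resolved = []
    for hook in hooks_config["hooks"]:
        # Assumption: a hook is active unless "active" is explicitly false.
        if hook.get("active", True) is False:
            continue
        resolved.append({
            "id": hook["id"],
            # Per-hook values override the global ones when present.
            "errorPolicy": hook.get("errorPolicy", global_policy),
            "timeout": hook.get("timeout", global_timeout),
        })
    return resolved


before_hooks = {
    "hooks": [
        {"id": "hook-a"},
        {"id": "hook-b", "errorPolicy": "FATAL", "timeout": "5m"},
        {"id": "hook-c", "active": False},
    ],
    "timeout": "60s",
    "errorPolicy": "IGNORE",
}
for settings in effective_hook_settings(before_hooks):
    print(settings)
```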

afterHooks

(Optional, Object) The Hooks to execute after completing the record processing

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "afterHooks": {
    "hooks": [
      ...
    ],
    "timeout": "60s",
    "errorPolicy": "IGNORE"
  },
  ...
}

hooks

(Required, Array of Objects) The list of Hooks to execute

Details

{
  "hooks": [
    {
      "id": <Hook ID>,
      ...
    }
  ],
  "timeout": "60s",
  "errorPolicy": "IGNORE"
}

id

(Required, UUID) The ID of the Hook to execute

errorPolicy

(Optional, String) Overrides the global policy for errors during the execution of the Hook

FATAL: A single failed hook aborts the complete process
IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Optional, Duration) Overrides the global timeout for the execution of the Hook

active

(Optional, Boolean) false to disable the execution of the Hook

errorPolicy

(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE

FATAL: A single failed hook aborts the complete process
IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the seed itself, in processors and in hooks

Details

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    "myProperty": "#{ seed.properties.keyA }"
  },
  "properties": {
    "keyA": "valueA"
  },
  "pipeline": <Pipeline ID>,
  ...
}

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ seed.properties.keyA }"
  },
  ...
}
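The effect of referencing a seed property from a configuration can be approximated with the Python sketch below. It is a deliberately simplified model that only substitutes `#{ seed.properties.<key> }` placeholders with a regular expression. The actual Expression Language is richer, and the function name `resolve_properties` is hypothetical.

```python
import re

# Matches placeholders such as "#{ seed.properties.keyA }".
PLACEHOLDER = re.compile(r"#\{\s*seed\.properties\.([A-Za-z0-9_]+)\s*\}")


def resolve_properties(config, properties):
    """Recursively replace seed property placeholders in a configuration."""
    if isinstance(config, dict):
        return {key: resolve_properties(value, properties) for key, value in config.items()}
    if isinstance(config, list):
        return [resolve_properties(value, properties) for value in config]
    if isinstance(config, str):
        # An unknown property raises KeyError in this sketch.
        return PLACEHOLDER.sub(lambda match: str(properties[match.group(1)]), config)
    return config


seed_properties = {"keyA": "valueA"}
processor_config = {"myProperty": "#{ seed.properties.keyA }"}
print(resolve_properties(processor_config, seed_properties))
```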

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Records

Seed Records reflect the status and parent-child relationship of records during a specific seed’s latest execution.

Each seed record in a given seed is identifiable by combining its seed, its parent plainId (if any) and its own plainId assigned at scan time. These IDs are also used to generate a more system-friendly ID for each record, known as hashId. The hashId is produced by first applying the SHA-256 algorithm to the combination of the parent plainId plus the record plainId, and then using padded Base64URL encoding to make the result safe to use in URLs.
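A minimal Python sketch of the hashId derivation follows. The function name `hash_id` is hypothetical, and two details are assumptions because the guide does not specify them: the parent and record plain IDs are joined by plain string concatenation (parent first), and the result is encoded as UTF-8 before hashing.

```python
import base64
import hashlib


def hash_id(plain_id, parent_plain_id=None):
    """Derive a URL-safe hash ID from the parent and record plain IDs."""
    # Assumption: plain concatenation, parent first, no separator.
    combined = (parent_plain_id or "") + plain_id
    digest = hashlib.sha256(combined.encode("utf-8")).digest()
    # urlsafe_b64encode keeps the "=" padding, as the padded Base64URL form requires.
    return base64.urlsafe_b64encode(digest).decode("ascii")


print(hash_id("1"))
print(hash_id("child-id", parent_plain_id="parent-id"))
```

For a record without a parent whose plain ID is `1`, the sketch yields `a4ayc_80_OGda4BO_1o_V0etpOqiLx1JwB5S3beHW0s=`.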

Records API

List all Records of a given Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/record'

Get a Record by Seed and ID

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/record/{recordId}'

{
  "id": {
    ...
  },
  "creationTimestamp": "2025-04-22T16:51:44Z",
  "lastUpdatedTimestamp": "2025-04-22T16:51:44Z",
  "parent": "FOMM0WPHMpEuBIxMg34VxOkMBi67eVq5R9V3BuLRDdg=",
  "status": "FAILURE",
  "errors": [
    ...
  ]
}

id

(Object) The ID of the record

Details

{
  "plain": "1",
  "hash": "a4ayc_80_OGda4BO_1o_V0etpOqiLx1JwB5S3beHW0s="
}

plain: (String) The ID of the record before hashing it
hash: (String) The ID of the record as a Base64URL string

creationTimestamp

(Timestamp) The timestamp when the record was created

lastUpdatedTimestamp

(Timestamp) The timestamp when the record was last updated

parent

(String) The parent record’s id as a Base64URL string

status

(String) The status of the record

Details

SUCCESS: The record was successfully processed
FAILURE: The record reported errors during its processing
QUARANTINE: The record has been processed many times, and it should not be processed again

errors

(Array of Objects) The record’s errors, if any

Pipeline

Pipelines API

Create a new Pipeline

$ curl --request POST 'ingestion-api:8080/v2/pipeline' --data '{ ... }'

List all Pipelines

$ curl --request GET 'ingestion-api:8080/v2/pipeline'

Get a single Pipeline

$ curl --request GET 'ingestion-api:8080/v2/pipeline/{id}'

Update an existing Pipeline

$ curl --request PUT 'ingestion-api:8080/v2/pipeline/{id}' --data '{ ... }'

Delete an existing Pipeline

$ curl --request DELETE 'ingestion-api:8080/v2/pipeline/{id}'

Clone an existing Pipeline

$ curl --request POST 'ingestion-api:8080/v2/pipeline/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Pipeline

Search for Pipelines using DSL Filters

$ curl --request POST 'ingestion-api:8080/v2/pipeline/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Pipelines

$ curl --request GET 'ingestion-api:8080/v2/pipeline/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

A pipeline is the definition of the finite-state machine for records processing:

{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    "stateA": {
      ...
    },

    "stateB": {
      ...
    }
  },
  ...
}

name

(Required, String) The unique name to identify the pipeline

description

(Optional, String) The description for the configuration

initialState

(Required, String) The state to use as the starting point of the pipeline, as defined in the states field

states

(Required, Object) The states associated to the pipeline

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Processors

Processors API

Create a new Processor

$ curl --request POST 'ingestion-api:8080/v2/processor' --data '{ ... }'

List all Processors

$ curl --request GET 'ingestion-api:8080/v2/processor'

Get a single Processor

$ curl --request GET 'ingestion-api:8080/v2/processor/{id}'

Update an existing Processor

$ curl --request PUT 'ingestion-api:8080/v2/processor/{id}' --data '{ ... }'

Note	The type of an existing processor can’t be modified.

Delete an existing Processor

$ curl --request DELETE 'ingestion-api:8080/v2/processor/{id}'

Clone an existing Processor

$ curl --request POST 'ingestion-api:8080/v2/processor/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Processor

Search for Processors using DSL Filters

$ curl --request POST 'ingestion-api:8080/v2/processor/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Processors

$ curl --request GET 'ingestion-api:8080/v2/processor/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

Each component is stateless, and it’s driven by the configuration defined in the processor and by the context created by the current seed execution. This design makes the processor the main building block of Discovery Ingestion.

Processors are intended to solve very specific tasks, which makes them reusable and simple to integrate into any part of the configuration.

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    ...
  },
  ...
}

type

(Required, String) The name of the component to execute

name

(Required, String) The unique name to identify the configuration

description

(Optional, String) The description for the configuration

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration

Details

{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}

id: (Required, UUID) The ID of the server configuration for the integration
credential: (Optional, UUID) The ID of the credential to override the default authentication in the external service

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Hooks

Hooks are a type of Processor, detached from the record processing. They are useful for one-off pre- or post-actions associated with a long seed execution (e.g. creating indices, changing aliases…).

There are two types of hooks: BEFORE_HOOK and AFTER_HOOK. Before hooks are executed at the beginning of an execution, before the record processing starts, and after hooks are executed at the end, once the record processing is complete.

Data Processing with a State Machine

State Types

Processor State

Executes one or more processors in sequence:

{
  "myProcessorState": {
    "type": "processor",
    "processors": [
      ...
    ]
  }
}

type

(Required, String) The type of state. Must be processor

processors

(Required, Array of Objects) The processors to execute

Details

{
  "stateA": {
    "type": "processor",
    "processors": [
      {
        "id": <Processor ID>,
        ...
      }
    ],
    ...
  }
}

id

(Required, UUID) The ID of the processor to execute

outputField

(Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component

active

(Optional, Boolean) false to disable the execution of the processor. Defaults to true

recordPolicy

(Optional, Object) The custom records configuration for the execution of the processor. Overrides the global one defined in the seed being executed. Can also be referred to as record

Details

{
  "id": <Processor ID>,
  "recordPolicy": {
    ...
  }
}

id

(Optional, String) The expression that represents the ID of the record during its processing. If not provided, the plain ID of the record will be used

timeout

(Optional, Duration) The timeout for each record during its processing

retryable

(Optional, Boolean) Whether the processor should be retried if failed. Defaults to false

errorPolicy

(Required, String) The error policy for records during their processing

FATAL: A single failed document aborts the complete process
FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected
IGNORE: Ignores the record processing error and its execution continues

outboundPolicy

(Optional, Object) The custom policy for groups of records sent for processing to the next state within the finite-state machine

Details

{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      ...
    }
  }
}

batchPolicy

(Optional, Object) The batch policy for outbound batches of records, once their processor execution is completed

Details

{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        ...
      }
    }
  }
}

maxCount: (Required, Integer) The maximum record count in a batch before flushing. Defaults to 25
flushAfter: (Required, Duration) The timeout to flush if no other condition has been met. Defaults to 1m

next

(Optional, String) The next state for the execution after the completion of the current state. If not provided, the current one will be assumed as the final state

onError

(Optional, String) The state to use as a fallback if the execution of the current state fails. If undefined, the current execution will complete with the corresponding error message

The output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:

{
  "defaultFieldName": {
    "outputKey": "outputValue"
  }
}

Switch State

Use DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution given the first matching condition:

{
  "mySwitchState": {
    "type": "switch",
    "options": [
      ...
    ],
    "default": "myDefaultState"
  }
}

type

(Required, String) The type of state. Must be switch

options

(Required, Array of Objects) The options to evaluate in the state

Details

{
  "type": "switch",
  "options": [
    {
      "condition": {
        "equals": {
          "field": "/my/input/field",
          "value": "valueA"
        },
        ...
      },
      "state": "myFirstState"
    },
    ...
  ],
  ...
}

condition: (Required, Object) The predicate described as a DSL Filter over the JSON processing data
state: (Optional, String) The next state for the finite-state machine if the condition evaluates to true

default

(Optional, String) The default state for the finite-state machine if no option evaluates to true

Note	If no state for the finite-state machine is selected, the current one will be assumed as the final state.
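The way processor and switch states chain together can be illustrated with a small Python walker over a pipeline definition. It is only a sketch: it does not run any processor, it supports only the `equals` condition shown above, its JSON Pointer resolution handles nested objects but not arrays, and the names `resolve_pointer`, `next_state` and `walk` are hypothetical.

```python
def resolve_pointer(document, pointer):
    """Resolve a JSON Pointer such as '/my/input/field' against nested dicts."""
    current = document
    for token in pointer.lstrip("/").split("/"):
        token = token.replace("~1", "/").replace("~0", "~")
        if not isinstance(current, dict) or token not in current:
            return None
        current = current[token]
    return current


def next_state(state, data):
    """Return the name of the state that follows, or None for a final state."""
    if state["type"] == "switch":
        # The first option whose condition matches decides the transition.
        for option in state["options"]:
            condition = option["condition"]["equals"]
            if resolve_pointer(data, condition["field"]) == condition["value"]:
                return option.get("state")
        return state.get("default")
    # Processor state: follow "next" if present.
    return state.get("next")


def walk(pipeline, data, max_steps=100):
    """List the states a record would visit, starting at initialState."""
    visited = []
    current = pipeline["initialState"]
    while current is not None and len(visited) < max_steps:
        visited.append(current)
        current = next_state(pipeline["states"][current], data)
    return visited


pipeline = {
    "name": "My Pipeline",
    "initialState": "route",
    "states": {
        "route": {
            "type": "switch",
            "options": [{
                "condition": {"equals": {"field": "/my/input/field", "value": "valueA"}},
                "state": "stateA",
            }],
            "default": "stateB",
        },
        "stateA": {"type": "processor", "processors": [], "next": "stateB"},
        "stateB": {"type": "processor", "processors": []},
    },
}
print(walk(pipeline, {"my": {"input": {"field": "valueA"}}}))
print(walk(pipeline, {"my": {"input": {"field": "other"}}}))
```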

Seed Execution

Seed Executions API

List all Seed Executions of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution'

Get a single Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}'

Halt an existing Seed Execution of a Seed

$ curl --request POST 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/halt'

List all Audit Log entries for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/audit'

Get the Seed configuration for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/config/seed'

Get a Pipeline configuration for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/config/pipeline/{pipelineId}'

Get a Processor configuration for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/config/processor/{processorId}'

Get a Server configuration for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/config/server/{serverId}'

Get a Credential configuration for an existing Seed Execution of a Seed

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/config/credential/{credentialId}'

Get the summary of jobs from a seed execution

$ curl --request GET 'ingestion-api:8080/v2/seed/{seedId}/execution/{executionId}/job/summary'

{
  "id": "1ed146d8-e5d8-49df-9b65-b9f6396183ff",
  "creationTimestamp": "2025-03-13T08:59:15Z",
  "lastUpdatedTimestamp": "2025-03-13T09:46:59Z",
  "triggerType": "MANUAL",
  "status": "DONE",
  "scanType": "FULL",
  "stages": [
      "BEFORE_HOOKS",
      "INGEST",
      "AFTER_HOOKS"
  ]
}

id

(UUID) A unique ID that identifies the seed execution

creationTimestamp

(Timestamp) The timestamp when the execution was triggered

lastUpdatedTimestamp

(Timestamp) The timestamp when the execution was last updated

triggerType

(String) The origin that triggered the execution. Currently, only MANUAL is supported

status

(String) The status of the execution

Details

CREATED: The seed has been triggered, but the execution has not started
RUNNING: The seed is being executed
HALTING: The seed execution received a HALT request, but some processing might still be happening
HALTED: The seed execution is completely halted
DONE: The seed completed its execution successfully
FAILED: The seed failed during its execution

scanType

(String) The scan type for the execution. Currently, only FULL

stages

(Array of Strings) The completed stages of the execution

Details

BEFORE_HOOKS: The hooks before record data processing (if any).
INGEST: The record data processing
AFTER_HOOKS: The hooks after record data processing (if any).

The Seed Execution represents the currently active Seed. It is updated each time a stage of the execution completes.

When the user starts the execution of a seed, a copy of every user-created configuration related to the seed is stored for use during the entirety of the seed execution:

The configuration of the seed being executed
The configuration of any hook that’s used by the seed in execution
The configuration of the pipeline assigned to the seed
The configuration of any processor that’s used in any step of the pipeline
The configuration of any server that’s used either by the seed, or any of the related processors
The configuration of any credential that’s used either by the seed, or any of the related processors, or any of the related servers.

This means that changing the configuration of the entities used during the execution of a seed will have no effect on the outcome of it. This is done to avoid unexpected and inconsistent behaviors.

Note

For security reasons, when the snapshot of the configuration of a credential is stored, the associated secrets are not included in it. A reference to the underlying secret is saved instead. This means that changes applied to secrets in the middle of a seed execution can unpredictably affect the current execution.

Every generated record is tagged with a corresponding action to apply during a specific execution:

CREATE: It is a new record for the seed.
UPDATE: The record was processed during a previous seed execution, but its content has changed.
DELETE: The record is marked to be deleted.

During a seed execution, every record has a status that changes as the seed is processed:

PROCESSING: The record was detected and is currently being processed.
FAILED: The processing of the record failed.
DONE: The record was successfully processed.

Record Data Channels

During a seed execution, records can produce data in JSON format, as well as binary files.

JSON data is stored in a dedicated bucket within Discovery Staging and can be later referenced using JSON Pointers.

Binary data, such as images, videos or PDFs, is stored in a dedicated container inside the Object Storage.

Record Batches

Seeds can configure how batches are flushed through the finite-state machine.

The seed configuration, and its override in the processor state, define the boundaries of the batch. The first condition to be met triggers the flush, which sends all the records in the batch to the next stage in the pipeline (such as the next processor, the next state of the state machine or even the end of the pipeline).
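The flush rule (whichever of maxCount or flushAfter is reached first) can be sketched in Python as shown below. This is an illustrative model, not the platform's implementation: the class `RecordBatcher` is hypothetical, and starting the flushAfter timer when the first record enters an empty batch is an assumption.

```python
import time


class RecordBatcher:
    """Collects records and flushes on the first boundary that is reached."""

    def __init__(self, max_count=25, flush_after_seconds=60.0, clock=time.monotonic):
        self.max_count = max_count
        self.flush_after = flush_after_seconds
        self._clock = clock
        self._records = []
        self._opened_at = None

    def add(self, record):
        """Add a record; return the flushed batch, or None if nothing was flushed."""
        if not self._records:
            # Assumption: the timer starts with the first record of a batch.
            self._opened_at = self._clock()
        self._records.append(record)
        return self._flush_if_due()

    def poll(self):
        """Check the time boundary without adding a record."""
        return self._flush_if_due()

    def _flush_if_due(self):
        if not self._records:
            return None
        full = len(self._records) >= self.max_count
        expired = self._clock() - self._opened_at >= self.flush_after
        if full or expired:
            batch, self._records = self._records, []
            return batch
        return None


# Usage with a fake clock so the time-based flush can be shown deterministically.
now = [0.0]
batcher = RecordBatcher(max_count=2, flush_after_seconds=60, clock=lambda: now[0])
batcher.add("record-1")
print(batcher.add("record-2"))  # count boundary reached
batcher.add("record-3")
now[0] = 61.0
print(batcher.poll())           # time boundary reached
```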

Expression Language Extensions

Table 25. *Discovery Ingestion* Expression Language Variables
Variable	Description	Example
`seed.id`	The ID of the seed in execution	`966f6b3f-7066-4fd5-8885-f45fef2fd59d`
`seed.type`	The type of the seed	`my-component-type`
`seed.name`	The name of the seed	`My Component Seed`
`seed.description`	The description of the seed	`Description of My Seed`
`seed.labels`	The labels of the seed, grouped by key	`[<keyA,[valueA,valueB]>, <keyB,valueC>]`
`seed.properties`	The properties to use during placeholders resolution	`{ "keyA": "valueA" }`
`execution.id`	The ID of the seed execution	`966f6b3f-7066-4fd5-8885-f45fef2fd59d`
`execution.startTimestamp`	The start time of the seed execution	`2025-03-05T20:24:39Z`
`execution.scanType`	The scan type of the seed execution	`FULL`
`execution.triggerType`	Trigger type of the seed execution	`MANUAL`
`execution.properties`	The properties to use during placeholders resolution	`{ "keyA": "valueA" }`
`processor.id`	The ID of the processor	`966f6b3f-7066-4fd5-8885-f45fef2fd59d`
`processor.type`	The type of the processor	`my-component-type`
`processor.name`	The name of the processor	`My Component Processor`
`processor.description`	The description of the processor	`Description of My Processor`
`processor.labels`	The labels of the processor, grouped by key	`[<keyA,[valueA,valueB]>, <keyB,valueC>]`
`pipeline.id`	The ID of the pipeline	`966f6b3f-7066-4fd5-8885-f45fef2fd59d`
`pipeline.name`	The name of the pipeline	`My Component Pipeline`
`pipeline.description`	The description of the pipeline	`Description of My Pipeline`
`pipeline.labels`	The labels of the pipeline, grouped by key	`[<keyA,[valueA,valueB]>, <keyB,valueC>]`
`record.id`	The ID of a generated record from a seed execution	`my-document-id`
`record.action`	The action of a generated record from a seed execution	`CREATE`
`record.parent`	The parent ID of a generated record from a seed execution	`my-document-parent-id`

Components

Elasticsearch

Uses the Elasticsearch integration to invoke the Elasticsearch API.

Scan Action: scan, search-after

Seed that uses the Search after parameter to retrieve all the documents from an index.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Scan Action",
  "config": {
    "action": "search-after",
    "index": "my-index",
    "sort": [
      ...
    ]
  },
  "pipeline": <Pipeline>,
  "server": <Elasticsearch Server>,
  ...
}

index

(Required, Array of String) The list of Elasticsearch indexes to search on

sort

(Required, Array of Objects) The list of sort options

Details

[
  { "<field>": "<sort_value>" },
  { "<field>": { "<sort_option>": "<sort_value>", ... } },
  ...
]

query

(Optional, Object) The query body for the search request. If not provided, a match all query will be used instead

size

(Optional, Integer) The maximum number of hits to return. Defaults to 100

metadata

(Optional, Boolean) Whether to include the metadata or not. Defaults to false
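
The search-after scan can be pictured as a loop that requests one page of hits at a time and passes the sort values of the last hit as the cursor for the next page. The Python sketch below simulates that loop against an in-memory list rather than a real index. `DOCS`, `fake_search` and `scan_all` are hypothetical names used only for illustration and are not part of Discovery or of any Elasticsearch client.

```python
from typing import Any, Optional

# Hypothetical in-memory "index"; a real scan would query Elasticsearch instead.
DOCS = [{"id": i, "title": f"doc-{i}"} for i in range(1, 8)]


def fake_search(size: int, search_after: Optional[list[Any]] = None) -> list[dict]:
    """Return up to `size` hits sorted by `id`, starting after the given sort values."""
    ordered = sorted(DOCS, key=lambda d: d["id"])
    if search_after is not None:
        ordered = [d for d in ordered if d["id"] > search_after[0]]
    # Each hit carries its sort values, mirroring the `sort` array of a search hit.
    return [{"_source": d, "sort": [d["id"]]} for d in ordered[:size]]


def scan_all(size: int = 100) -> list[dict]:
    """Page through every document using the sort values of the last hit as cursor."""
    records: list[dict] = []
    cursor: Optional[list[Any]] = None
    while True:
        hits = fake_search(size, cursor)
        if not hits:
            break  # an empty page means the index has been fully scanned
        records.extend(hit["_source"] for hit in hits)
        cursor = hits[-1]["sort"]
    return records


print(len(scan_all(size=3)))
```

The sort must be deterministic for this to work, which is presumably why the `sort` field is required for this action.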

Hook Action: aliases

Hook that executes a native Elasticsearch query to the Aliases API.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Hook Action",
  "config": {
    "action": "aliases",
    "actions": [
      ...
    ]
  },
  "server": <Elasticsearch Server>,
  ...
}

actions: (Required, Array) The request body

Note	Currently, if at least one of the actions on the list is successful, the whole request is considered successful. The request only fails if none of them succeeds.

Hook Action: create-index

Hook that executes a native Elasticsearch query to the Create Index API.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Hook Action",
  "config": {
    "action": "create-index",
    "index": "my-index",
    "body": {
      ...
    }
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index name
body: (Required, Object) The request body
waitForActiveShards: (Optional, Integer) The number of copies of each shard that must be active before proceeding with the operation
masterTimeout: (Optional, String) The period to wait for the master node
timeout: (Optional, String) The period to wait for a response

Processor Action: bulk, hydrate

Processor that executes a bulk request to the Elasticsearch Bulk API

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "hydrate",
    "index": "my-index",
    "data": "#{ data('/my/record') }",
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index

(Required, String) The Elasticsearch index to perform the action

data

(Required, Object) The data to hydrate

allowOverride

(Optional, Boolean) Whether to allow overriding an existing document or not. Defaults to true

bulk

(Optional, Object) The bulk configuration

Details

pipeline

(Optional, String) The ID of the Elasticsearch Pipeline to use to preprocess incoming documents

routing

(Optional, String) Used to route operations to a specific shard

waitForActiveShards

(Optional, String) The number of copies of each shard that must be active before proceeding with the Elasticsearch operation

timeout

(Optional, String) The period of time to wait for some operations

requireAlias

(Optional, Boolean) Whether the request’s actions must target an index alias

refresh

(Optional, String) The refresh type. Supported are: TRUE, FALSE and WAIT_FOR

Details

WAIT_FOR: Waits for a refresh to make the Elasticsearch operation visible to search

TRUE: Refreshes the affected shards to make the Elasticsearch operation visible to search

FALSE: Performs no refresh-related action

flush

(Optional, Object) The flush configuration

Details

maxOperations: (Optional, Integer) The maximum number of operations. Defaults to 1000
maxConcurrent: (Optional, Integer) The maximum number of concurrent requests waiting to be executed by Elasticsearch. Defaults to 1
maxSize: (Optional, String) The maximum size of the bulk request. Defaults to 5MB
flushInterval: (Optional, Duration) The interval between flushes
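
To make the flush settings more concrete, here is a minimal Python sketch of a buffer that flushes when either an operation count or a byte size limit is reached. `BulkBuffer` is a hypothetical class, not the component's implementation. The defaults mirror the documented `maxOperations` (1000) and `maxSize` (5MB, assumed here to mean 5 * 1024 * 1024 bytes), and the time-based `flushInterval` is left out for brevity.

```python
import json


class BulkBuffer:
    """Hypothetical buffer that flushes on an operation-count or byte-size limit."""

    def __init__(self, max_operations: int = 1000, max_size_bytes: int = 5 * 1024 * 1024):
        self.max_operations = max_operations
        self.max_size_bytes = max_size_bytes
        self.pending: list[dict] = []
        self.pending_bytes = 0
        # Stands in for the bulk requests that would be sent to Elasticsearch.
        self.flushed: list[list[dict]] = []

    def add(self, operation: dict) -> None:
        """Queue an operation and flush if either limit has been reached."""
        self.pending.append(operation)
        self.pending_bytes += len(json.dumps(operation).encode("utf-8"))
        if len(self.pending) >= self.max_operations or self.pending_bytes >= self.max_size_bytes:
            self.flush()

    def flush(self) -> None:
        """Emit the pending operations as one batch and reset the counters."""
        if self.pending:
            self.flushed.append(self.pending)
            self.pending, self.pending_bytes = [], 0


buffer = BulkBuffer(max_operations=3)
for i in range(7):
    buffer.add({"index": {"_id": str(i)}})
buffer.flush()  # final flush for the leftover operation
print([len(batch) for batch in buffer.flushed])
```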

Insights

The Insights component is designed to provide various actions that generate different metrics or information for later analysis.

Processor Action: engine-score:non-contextual

Processor that calculates the query score based on the result position metadata field value only. The engine scoring action is designed to power the Engine Scoring Dashboards by evaluating the quality of a search engine’s results in terms of precision and recall.

{
  "type": "insights",
  "name": "My Engine-Score Non-Contextual Processor Action",
  "config": {
    "action": "engine-score:non-contextual",
    "resultPosition": "25",
    ...
  }
}

resultPosition: (Required, Integer) Field containing the position of the search result to be used in the engine scoring calculation.
kfactor: (Optional, Double) Value between 0 and 1 used to determine the importance of the relevant records. Defaults to 0.9.
startPosition: (Optional, Integer) Indicate the start position to take into account when doing the K-factor calculation. Defaults to 1.
precision: (Optional, Integer) Number of digits to return after the decimal point for the score value. Defaults to 4.

MongoDB

Performs different actions on MongoDB collections, either reading or writing data depending on the action.

Scan Action: scan

Seed that finds all documents in a MongoDB collection and creates a record for each document found.

{
  "type": "mongo",
  "name": "My Mongo Scan Action",
  "config": {
    "action": "scan",
    "database": "my-database",
    "collection": "my-collection"
  },
  "pipeline": <Pipeline ID>,
  "server": <Mongo Server ID>,
  ...
}

database: (Required, String) The database to connect to
collection: (Required, String) The collection whose documents are turned into records

Processor Action: bulk, hydrate

Processor that stores the records in the pipeline via Bulk Write operations on the specified MongoDB collection.

{
  "type": "mongo",
  "name": "My Mongo Processor Action",
  "config": {
    "database": "my-database",
    "collection": "my-collection",
    "allowOverride": true,
    ...
  },
  "server": <Mongo Server ID>,
  ...
}

database

(Required, String) The database to connect to

collection

(Required, String) The collection where the records are bulk written

allowOverride

(Optional, Boolean) Whether the records should be stored if there is one already with their ID. Defaults to true

data

(Optional, Object) The data to store on the collection. If not provided, it will store the data generated on a previous processor

flush

(Optional, Object) The flush configuration

Details

maxCount: (Optional, Integer) The maximum number of records in the bulk before flushing
maxWeight: (Optional, Long) The maximum weight allowed in a bulk request
flushAfter: (Optional, Duration) The time to wait before flushing a bulk request

OpenAI

Uses the OpenAI integration to send requests to OpenAI.

Processor Action: embeddings

Processor that executes embedding requests to the OpenAI API.

{
  "type": "openai",
  "name": "My OpenAI Processor Action",
  "config": {
    "action": "embeddings",
    "model": "openai-model",
    "input": "#{ data('/my/input') }",
    ...
  },
  "server": <OpenAI Server>,
  ...
}

model

(Required, String) The OpenAI model to use

input

(Required, String) The input to generate the embeddings

user

(Optional, String) A unique identifier representing the end-user

flush

(Optional, Object) The flush configuration

Details

maxCount: (Optional, Integer) The maximum number of records in the bulk before flushing
maxWeight: (Optional, Long) The maximum weight allowed in a bulk request
flushAfter: (Optional, Duration) The time to wait before flushing a bulk request

Script

Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging.

Processor Action: process

Processor that executes a script that interacts with the record data generated from a seed execution.

{
  "type": "script",
  "name": "My Script Processor Action",
  "config": {
    "action": "process",
    "script": <Script>,
    ...
  }
}

action: (Required, String) The default script action. Must be process
language: (Optional, String) The language of the script. One of the supported script languages. Defaults to groovy
script: (Required, String) The script to run

Staging

Interacts with buckets and content from Discovery Staging.

Scan Action: scan, scroll

Seed that scrolls through a bucket and creates the records to be ingested into the pipeline.

{
  "type": "staging",
  "name": "My Staging Scan Action",
  "config": {
    "action": "scroll",
    "bucket": "my-bucket",
    ...
  },
  "pipeline": <Pipeline ID>,
  ...
}

bucket: (Required, String) The bucket to scroll
metadata: (Optional, Boolean) Whether to include the metadata or not. Defaults to false
size: (Optional, Integer) The size of the contents result
filter: (Optional, DSL Filter) The filter to apply when scrolling
projection: (Optional, Projection) The projection to apply when scrolling
actions: (Optional, Array of Strings) The actions from the content to be scanned. Defaults to STORE and DELETE
parentId: (Optional, String) The parent ID to match

Hook Action: create-bucket

Hook that creates a bucket with the given configuration.

{
  "type": "staging",
  "name": "My Staging Action",
  "config": {
    "action": "create-bucket",
    ...
  },
  ...
}

bucket

(Required, String) The bucket name

config

(Optional, Object) The bucket configuration

indices

(Optional, Array of Objects) The indices for the bucket

Details

{
  "indices": [
    {
      "name": "myIndexA",
      "fields": [
        ...
      ]
    }
  ]
}

name

(Required, String) The index name

fields

(Required, Array of Objects) The index fields. Key/Value pairs with the field name, and the corresponding sort ordering, either ASC or DESC

Details

{
  "fields": [
    { "fieldA": "ASC" },
    { "fieldB": "DESC" }
  ]
}

Processor Action: store, hydrate

Processor that stores the records into the given bucket.

{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "store",
    "bucket": "my-bucket",
    ...
  }
}

bucket: (Required, String) The bucket where the documents will be stored
parentId: (Optional, String) The parent ID of the documents to store
data: (Optional, Object) The data to store on the bucket. If not provided, it will store the data generated on a previous processor

Template

Uses the Template Engine to process dynamic data provided by the user to generate a text output based on a custom template.

Processor Action: process

Processor that processes the provided template with the defined configuration.

{
  "type": "template",
  "name": "My Template Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

template

(Required, String) The template to process

bindings

(Required, Object) The bindings to replace in the template

Details

{
  "bindingA": "#{ data('/my/binding/field') }",
  ...
}

Can be later referenced in a template:

My bindingA value is ${bindingA}

outputFormat

(Optional, String) The output format of the processed template. Supported formats are: JSON and PLAIN. Defaults to PLAIN
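
As a rough analogy for how bindings are substituted into a template, the Python sketch below uses the standard library `string.Template`, which happens to share the `${name}` placeholder style shown above. This is only an illustration: the actual Template Engine used by Discovery may support a different and richer syntax, and the binding value `hello` is a made-up example.

```python
from string import Template

# The template text from the example above.
template = Template("My bindingA value is ${bindingA}")

# Hypothetical resolved bindings; in Discovery these would come from expressions
# such as "#{ data('/my/binding/field') }".
bindings = {"bindingA": "hello"}

rendered = template.substitute(bindings)
print(rendered)
```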

Vespa

Uses the Vespa integration to send HTTP requests to a Vespa service.

Action: store

Processor that upserts or deletes documents in a Vespa app using the Document API.

{
  "type": "vespa",
  "name": "My Vespa Store Action",
  "config": {
    "action": "store",
    "namespace": "my-namespace",
    "documentType": "my-document-type",
    ...
  },
  "server": <Vespa Server>,
  ...
}

namespace: (Required, String) The namespace of the vespa document
documentType: (Required, String) The document type of the vespa document. Described in the schemas.sd.
data: (Optional, Object) The fields of the vespa document. If not provided, it will store the data generated on a previous processor

Voyage AI

Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service.

Action: embeddings

Processor that, given an input string and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.

{
  "type": "voyage-ai",
  "name": "My Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "voyage-large-2",
    "input": "#{ data('/input') }",
    ...
  },
  "server": <Voyage AI Server>,
  ...
}

model

(Required, String) The model to use for the request. See models.

input

(Required, String) The input document to be embedded.

truncation

(Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputDimension

(Optional, Integer) The number of dimensions for resulting output embeddings. Defaults to null.

outputDatatype

(Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.

encodingFormat

(Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.

flush

(Optional, Object) The flush configuration

Details

maxCount: (Optional, Integer) The maximum number of records in the bulk before flushing
maxWeight: (Optional, Long) The maximum weight allowed in a bulk request
flushAfter: (Optional, Duration) The time to wait before flushing a bulk request

Action: multimodal-embeddings

Processor that, given a list of multimodal inputs consisting of text, images, or an interleaving of both modalities, and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Multimodal Embedding and the API Text multimodal embedding models endpoint.

{
  "type": "voyage-ai",
  "name": "My Multimodal Embeddings Action",
  "config": {
    "action": "multimodal-embeddings",
    "model": "voyage-multimodal-3",
    "input": "#{ data('/input') }",
    ...
  },
  "server": <Voyage AI Server>,
  ...
}

model

(Required, String) The model to use for the request. See models.

input

(Required, Object) The input object to be embedded.

Details

type

(Required, String) The type. One of: text, image_url or image_base64.

text

(Optional, String) The text if the type text is chosen.

Details

{
  "type": "text",
  "text": "This is a banana."
}

imageUrl

(Optional, String) The image URL if the type image_url is chosen.

Details

{
  "type": "image_url",
  "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
}

imageBase64

(Optional, Object) The Base64-encoded image if the type image_base64 is chosen.

Details

{
  "type": "image_base64",
  "imageBase64": {
      "mediaType": "image/jpeg",
      "base64": true,
      "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)"
  }
}

mediaType: (Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.
base64: (Required, Boolean) Whether the data is encoded in Base64.
data: (Required, String) The data itself.

truncation

(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputEncoding

(Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.

flush

(Optional, Object) The flush configuration

Details

maxCount: (Optional, Integer) The maximum number of records in the bulk before flushing
maxWeight: (Optional, Long) The maximum weight allowed in a bulk request
flushAfter: (Optional, Duration) The time to wait before flushing a bulk request
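
For the image_base64 input type of the multimodal-embeddings action, the input object has to carry the encoded image together with its media type. The Python sketch below builds such an object from raw bytes. `image_base64_input` is a hypothetical helper, and the four bytes used in the usage line are only the start of a JPEG header, not a complete image.

```python
import base64

# Media types listed as supported for the `mediaType` field.
SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


def image_base64_input(raw: bytes, media_type: str) -> dict:
    """Build an `image_base64` input object from raw image bytes."""
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"unsupported media type: {media_type}")
    return {
        "type": "image_base64",
        "imageBase64": {
            "mediaType": media_type,
            "base64": True,
            "data": base64.b64encode(raw).decode("ascii"),
        },
    }


# Usage with the first bytes of a JPEG header (not a full image).
payload = image_base64_input(b"\xff\xd8\xff\xe0", "image/jpeg")
print(payload["imageBase64"]["data"])
```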

Discovery QueryFlow

Discovery QueryFlow is a lightweight tool that allows users to create Custom REST Endpoints that interact with external services with minimum overhead. It enables features such as:

Flexibility to represent complex query processing scenarios through a finite-state machine.
On-the-fly tuning of configurations for a fast feedback loop.
Extensive component library for advanced interpretation of the HTTP Request.

Custom REST Endpoints

Endpoints API

Create a new Endpoint

$ curl --request POST 'queryflow-api:8088/v2/endpoint' --data '{ ... }'

List all Endpoints

$ curl --request GET 'queryflow-api:8088/v2/endpoint'

Get a single Endpoint

$ curl --request GET 'queryflow-api:8088/v2/endpoint/{id}'

Update an existing Endpoint

$ curl --request PUT 'queryflow-api:8088/v2/endpoint/{id}' --data '{ ... }'

Note	The type of an existing endpoint can’t be modified.

Delete an existing Endpoint

$ curl --request DELETE 'queryflow-api:8088/v2/endpoint/{id}'

Enable an existing Endpoint

$ curl --request PATCH 'queryflow-api:8088/v2/endpoint/{id}/enable'

Disable an existing Endpoint

$ curl --request PATCH 'queryflow-api:8088/v2/endpoint/{id}/disable'

Clone an existing Endpoint

$ curl --request POST 'queryflow-api:8088/v2/endpoint/{id}/clone?name=clone-new-name&uri=clone-new-uri&method=clone-new-method'

Query Parameters

method: (Required, String) The HTTP Method of the new Endpoint
uri: (Required, String) The URI of the new Endpoint
name: (Required, String) The name of the new Endpoint
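
Since the clone request takes a URI and a free-text name as query parameters, those values need to be URL-encoded. The Python sketch below builds the clone URL with the standard library. `clone_endpoint_url` is a hypothetical helper, and the `http://` scheme and the `<id>` placeholder are assumptions that the curl examples above do not specify.

```python
from urllib.parse import urlencode


def clone_endpoint_url(base_url: str, endpoint_id: str, name: str, uri: str, method: str) -> str:
    """Build the clone URL with the three required query parameters URL-encoded."""
    query = urlencode({"name": name, "uri": uri, "method": method})
    return f"{base_url}/v2/endpoint/{endpoint_id}/clone?{query}"


url = clone_endpoint_url("http://queryflow-api:8088", "<id>", "My Clone", "/my/clone", "GET")
print(url)
```

The resulting URL can then be sent with a POST request, as in the curl example above.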

Search for Endpoints using DSL Filters

$ curl --request POST 'queryflow-api:8088/v2/endpoint/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Endpoints

$ curl --request GET 'queryflow-api:8088/v2/endpoint/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

An endpoint is the definition of the finite-state machine for query processing:

{
  "uri": "/my/custom/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "initialState": "stateA",
  "states": {
    "stateA": {
      ...
    },

    "stateB": {
      ...
    }
  },
  "timeout": "60s",
  ...
}

httpMethod

(Required, String) The HTTP method for the custom endpoint. Must be one of: GET, POST, PUT, DELETE

uri

(Required, String) The URI path for the custom endpoint (e.g. /my/path)

The URI can contain variables in any of its paths (e.g. /my/{pathA}, /{pathA}/{pathB}). If present, the values for every placeholder will be available as part of the metadata of the HTTP request and can be accessed in the configuration of the processors with the help of the Expression Language

Details

{
  "uri": "/my/{pathA}/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "initialState": "stateA",
  "states": {
    ...
  },
  "timeout": "60s",
  ...
}

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ data('/httpRequest/pathVariables/pathA') }"
  },
  ...
}

type

(Required, String) The content-type of the HTTP response for the custom endpoint. Either default for application/json or stream for text/event-stream. Defaults to default

name

(Required, String) The unique name to identify the custom endpoint

description

(Optional, String) The description for the configuration

initialState

(Required, String) The state to use as starting point of the custom endpoint as defined in the states field

states

(Required, Object) The states associated to the endpoint

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors

Details

{
  "uri": "/my/custom/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "initialState": "stateA",
  "states": {
    ...
  },
  "properties": {
    "keyA": "valueA"
  },
  "timeout": "60s",
  ...
}

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ endpoint.properties.keyA }"
  },
  ...
}

timeout

(Required, Duration) The timeout for the execution of the custom endpoint

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Note	Loops are not forbidden as they might represent valid use cases depending on the configuration of the states. To avoid getting stuck in infinite loops, all endpoints are required to be configured with a timeout.
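
The URI variables described for the uri field can be illustrated with a short Python sketch: a URI template is matched against an incoming path, and the captured values are exposed under /httpRequest/pathVariables. `match_uri` and `resolve_pointer` are hypothetical helpers, and the pointer resolver is deliberately simplified (object keys only, no escape sequences). It is not Discovery's routing code.

```python
import re
from typing import Any, Optional


def match_uri(template: str, path: str) -> Optional[dict[str, str]]:
    """Return the path variables if `path` matches `template`, otherwise None."""
    # Assumes the literal parts of the template contain no regex metacharacters.
    pattern = re.sub(r"\{(\w+)\}", r"(?P<\1>[^/]+)", template)
    match = re.fullmatch(pattern, path)
    return match.groupdict() if match else None


def resolve_pointer(document: dict, pointer: str) -> Any:
    """Resolve a simplified JSON Pointer made of object keys only."""
    node: Any = document
    for token in pointer.lstrip("/").split("/"):
        node = node[token]
    return node


variables = match_uri("/my/{pathA}/endpoint", "/my/books/endpoint")
context = {"httpRequest": {"pathVariables": variables}}
print(resolve_pointer(context, "/httpRequest/pathVariables/pathA"))
```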

Processors

Processors API

Create a new Processor

$ curl --request POST 'queryflow-api:8088/v2/processor' --data '{ ... }'

List all Processors

$ curl --request GET 'queryflow-api:8088/v2/processor'

Get a single Processor

$ curl --request GET 'queryflow-api:8088/v2/processor/{id}'

Update an existing Processor

$ curl --request PUT 'queryflow-api:8088/v2/processor/{id}' --data '{ ... }'

Note	The type of an existing processor can’t be modified.

Delete an existing Processor

$ curl --request DELETE 'queryflow-api:8088/v2/processor/{id}'

Clone an existing Processor

$ curl --request POST 'queryflow-api:8088/v2/processor/{id}/clone?name=clone-new-name'

Query Parameters

name: (Required, String) The name of the new Processor

Search for Processors using DSL Filters

$ curl --request POST 'queryflow-api:8088/v2/processor/search' --data '{ ... }'

Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Processors

$ curl --request GET 'queryflow-api:8088/v2/processor/autocomplete?q=value'

Query Parameters

q: (Required, String) The query to execute the autocomplete search

Each component is stateless, and it’s driven by the configuration defined in the processor and by the context created by the current HTTP Request. This design makes the processor the main building block of Discovery QueryFlow.

They are intended to solve very specific tasks, which makes them re-usable and simple to integrate into any part of the configuration.

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    ...
  },
  ...
}

type

(Required, String) The name of the component to execute

name

(Required, String) The unique name to identify the configuration

description

(Optional, String) The description for the configuration

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language

snippets

(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language

Details

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ snippets.snippetA }"
  },
  "snippets": {
    "snippetA": {
      ...
    }
  },
  ...
}

Note	Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration

Details

{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}

id: (Required, UUID) The ID of the server configuration for the integration
credential: (Optional, UUID) The ID of the credential to override the default authentication in the external service

labels

(Optional, Array of Objects) The labels for the configuration

Details

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Query Processing with a State Machine

State Types

Processor State

Executes a single or multiple processors in sequence:

{
  "myProcessorState": {
    "type": "processor",
    "processors": [
      ...
    ]
  }
}

type

(Required, String) The type of state. Must be processor

processors

(Required, Array of Objects) The processors to execute

Details

{
  "stateA": {
    "type": "processor",
    "processors": [
      {
        "id": <Processor ID>,
        ...
      }
    ],
    ...
  }
}

id: (Required, UUID) The ID of the processor to execute
outputField: (Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component
continueOnError: (Optional, Boolean) If true and the processor execution fails, its HTTP response will be stored in its corresponding Data Channel while the other processors in the state continue with their normal execution. If false, the error will either be handled by the onError state, or be propagated to its invoker. Defaults to false
active: (Optional, Boolean) false to disable the execution of the processor

next

(Optional, String) The next state for the HTTP Request Execution after the completion of the state. If not provided, the current one will be assumed as the final state

onError

(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.

The output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:

{
  "defaultFieldName": {
    "outputKey": "outputValue"
  }
}
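
The interplay of outputField, continueOnError and active can be sketched in Python as follows. `run_processor_state` is a hypothetical function and each processor is modeled as a plain callable. Storing an `error` object for a failed processor is an assumption made for illustration, since the actual stored value is the processor's HTTP response.

```python
from typing import Any


def run_processor_state(processors: list[dict], channel: dict[str, Any]) -> dict[str, Any]:
    """Run each processor in order, storing its output under its output field."""
    for proc in processors:
        if not proc.get("active", True):
            continue  # disabled processors are skipped
        try:
            channel[proc["outputField"]] = proc["run"](channel)
        except Exception as error:
            if not proc.get("continueOnError", False):
                raise  # would be handled by onError or propagated to the invoker
            # Assumption: the failure is recorded under the same output field.
            channel[proc["outputField"]] = {"error": str(error)}
    return channel


def failing(_channel: dict) -> dict:
    raise RuntimeError("boom")


result = run_processor_state(
    [
        {"outputField": "first", "run": lambda ch: {"outputKey": "outputValue"}},
        {"outputField": "second", "run": failing, "continueOnError": True},
        {"outputField": "third", "run": lambda ch: {"unused": True}, "active": False},
    ],
    {},
)
print(result)
```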

If a processor produces a Server-Sent Event, each batch of the received chunks will be stored in the SSE Data Channel as a list wrapped in the configured outputField:

{
  "defaultFieldName": [
    {
      "name": "testFieldName",
      "data": "some "
    },
    {
      "name": "testFieldName",
      "data": "chunked d"
    },
    {
      "name": "testFieldName",
      "data": "ata"
    },
    {
      "name": "testFieldName",
      "data": "."
    }
  ]
}
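
A later consumer of the SSE Data Channel would typically need to reassemble the chunks into the full text. The Python sketch below does this for the example above. `join_chunks` is a hypothetical helper, not part of Discovery.

```python
# The SSE Data Channel content from the example above.
sse_channel = {
    "defaultFieldName": [
        {"name": "testFieldName", "data": "some "},
        {"name": "testFieldName", "data": "chunked d"},
        {"name": "testFieldName", "data": "ata"},
        {"name": "testFieldName", "data": "."},
    ]
}


def join_chunks(events: list[dict]) -> str:
    """Concatenate the `data` of each received chunk in arrival order."""
    return "".join(event["data"] for event in events)


text = join_chunks(sse_channel["defaultFieldName"])
print(text)
```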

Parallel Endpoint State

Executes a single or multiple Custom REST Endpoints in parallel:

{
  "myParallelEndpointState": {
    "type": "endpoint",
    "endpoints": {
      ...
    }
  }
}

type

(Required, String) The type of state. Must be endpoint

endpoints

(Required, Object) The endpoints to execute in parallel

Details

{
  "type": "endpoint",
  "endpoints": {
    "myEndpointA": {
      "id": <Endpoint ID>,
      ...
    },
    ...
  },
  ...
}

id

(Required, UUID) The ID of the endpoint to execute in parallel

httpRequest

(Optional, Object) The custom HTTP Request to invoke the endpoint. Defaults to the same HTTP Request as the one from the invoker. All fields in the httpRequest can be configured with the help of the Expression Language

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "method": <HTTP Method>,
    ...
  }
}

uri

(Required, String) The URI paths for the HTTP Request

method

(Required, String) The HTTP Method for the HTTP Request

headers

(Optional, Object) The headers for the HTTP Request. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    }
  }
}

queryParams

(Optional, Object) The query parameters for the HTTP Request. The value of each query parameter can be either a single String, or an Array of Strings

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "queryParams": {
      "param-a": "value-a",
      "param-b": [
        "value-b-1",
        "value-b-2"
      ]
    }
  }
}

cookies

(Optional, Array of Objects) The list of cookies for the HTTP Request

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "cookies": [
      {
        "name": "cookie-name-a",
        "path": "/some/path/a",
        "value": "cookie-value-a",
        "domain": "cookie-domain-a",
        "maxAge": 1234
      },
      {
        "name": "cookie-name-b",
        "value": "cookie-value-b",
        ...
      }
    ]
  }
}

name: (Required, String) The name of the cookie
value: (Required, String) The value of the cookie
path: (Optional, String) The path of the cookie
domain: (Optional, String) The domain of the cookie
maxAge: (Optional, Integer) The maximum age of the cookie

body

(Optional, Object) The body of the HTTP Request

continueOnError

(Optional, Boolean) If true and the endpoint execution fails, its HTTP response will be stored in its corresponding Data Channel while the other endpoints in the state continue with their normal execution. If false, the error will either be handled by the onError state, or be propagated to its invoker. Defaults to false

active

(Optional, Boolean) false to disable the execution of the endpoint. If all endpoints are disabled, the state output will be empty.

next

(Optional, String) The next state for the HTTP Request Execution after the completion of all configured endpoints. If not provided, the current one will be assumed as the final state and a 204 - No Content HTTP Response will be returned

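onError

(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.
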
Note	Circular references with endpoints in endpoint states are not allowed.

Note	Endpoints with response type `text/event-stream` are currently unsupported.

The output of the state stored in the JSON Data Channel is a collection with each HTTP Response:

{
  "myParallelEndpointState": {
    "myEndpointA": {
      "statusCode": <HTTP Status Code>,
      ...
    },
    ...
  }
}

statusCode

(Integer) The HTTP Status Code of the HTTP Response

headers

(Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpResponse": {
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    ...
  }
}

body

(Object) The body of the HTTP Response
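
The shape of the Parallel Endpoint State output can be sketched in Python with a thread pool: every active endpoint is invoked concurrently and the responses are collected by endpoint name under the state name. `run_parallel_state` is a hypothetical function, and each endpoint is modeled as a callable returning a response dictionary instead of a real HTTP invocation.

```python
from concurrent.futures import ThreadPoolExecutor


def run_parallel_state(state_name: str, endpoints: dict[str, dict]) -> dict[str, dict]:
    """Invoke every active endpoint concurrently and collect the responses by name."""
    active = {name: cfg for name, cfg in endpoints.items() if cfg.get("active", True)}
    responses: dict[str, dict] = {}
    if active:
        with ThreadPoolExecutor(max_workers=len(active)) as pool:
            futures = {name: pool.submit(cfg["invoke"]) for name, cfg in active.items()}
            responses = {name: future.result() for name, future in futures.items()}
    # With no active endpoints the state output stays empty.
    return {state_name: responses}


output = run_parallel_state(
    "myParallelEndpointState",
    {
        "myEndpointA": {"invoke": lambda: {"statusCode": 200, "body": {"hits": 3}}},
        "myEndpointB": {"invoke": lambda: {"statusCode": 404, "body": {}}},
        "myEndpointC": {"invoke": lambda: {"statusCode": 200}, "active": False},
    },
)
print(output)
```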

Switch State

Use DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution given the first matching condition:

{
  "mySwitchState": {
    "type": "switch",
    "options": [
      ...
    ],
    "default": "myDefaultState"
  }
}

type

(Required, String) The type of state. Must be switch

options

(Required, Array of Objects) The options to evaluate in the state

Details

{
  "type": "switch",
  "options": [
    {
      "condition": {
        "equals": {
          "field": "/httpRequest/queryParams/input",
          "value": "valueA"
        },
        ...
      },
      "state": "myFirstState"
    },
    ...
  ],
  ...
}

condition: (Required, Object) The predicate described as a DSL Filter over the JSON processing data
state: (Optional, String) The next state for the finite-state machine if the condition evaluates to true

default

(Optional, String) The default state for the finite-state machine if no option evaluates to true

Note	If no state for the finite-state machine is selected, the current one will be assumed as the final state.
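
The first-match behavior of the Switch State can be sketched in Python as follows. The sketch supports only the `equals` filter shown in the example above and a simplified JSON Pointer lookup. `evaluate_switch` and `resolve_pointer` are hypothetical names, and the real DSL Filter language offers more than this.

```python
from typing import Any, Optional


def resolve_pointer(document: dict, pointer: str) -> Any:
    """Resolve a simplified JSON Pointer; returns None when a key is missing."""
    node: Any = document
    for token in pointer.lstrip("/").split("/"):
        if not isinstance(node, dict) or token not in node:
            return None
        node = node[token]
    return node


def evaluate_switch(state: dict, channel: dict) -> Optional[str]:
    """Return the state of the first matching option, else the default (or None)."""
    for option in state["options"]:
        equals = option["condition"]["equals"]
        if resolve_pointer(channel, equals["field"]) == equals["value"]:
            return option.get("state")
    return state.get("default")


switch_state = {
    "type": "switch",
    "options": [
        {
            "condition": {"equals": {"field": "/httpRequest/queryParams/input", "value": "valueA"}},
            "state": "myFirstState",
        }
    ],
    "default": "myDefaultState",
}

print(evaluate_switch(switch_state, {"httpRequest": {"queryParams": {"input": "valueA"}}}))
print(evaluate_switch(switch_state, {"httpRequest": {"queryParams": {"input": "other"}}}))
```

A return value of None corresponds to the case in the note above, where no state is selected and the current one is treated as final.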

Response State

Final state that formats the expected HTTP response when the endpoint is configured to return application/json.

{
  "myResponseState": {
    "type": "response",
    "statusCode": <HTTP Status Code>,
    ...
  }
}

type

(Required, String) The type of state. Must be response

statusCode

(Required, Integer) The HTTP status code in the range of [200, 400[ for the response. Defaults to 200

headers

(Optional, Object) The HTTP headers to return as part of the response

Details

{
  "type": "response",
  "headers": {
    "Etag": "#{ data('/my/etag/value') }",
    ...
  },
  ...
}

body

(Optional, Object) The HTTP JSON body to return as part of the response

Details

{
  "type": "response",
  "body": {
    "keyA": "#{ data('/my/data') }",
    ...
  },
  ...
}

snippets

(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language

Details

{
  "type": "response",
  "body": {
    "myProperty": "#{ snippets.snippetA }"
  },
  "snippets": {
    "snippetA": {
      ...
    }
  },
  ...
}

Note	Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

Error State

Final state that returns an error message with error code 9999 and custom HTTP Status Code and message.

{
  "myErrorState": {
    "type": "error",
    "statusCode": <HTTP Status Code>,
    "message": <String>
  }
}

type: (Required, String) The type of state. Must be error
statusCode: (Required, Integer) The HTTP status code in the range of [400, 600[ for the error message. Defaults to 500
message: (Required, String) The message to display with the error. Defaults to The request has failed due to reaching a configured error endpoint state
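
Putting the state types together, an endpoint execution can be viewed as a walk over `next` transitions that stops at a final state or when the endpoint timeout expires. The Python sketch below shows only that control flow. `run_endpoint` is a hypothetical function, states are not actually executed, and switch or onError transitions are ignored.

```python
import time
from typing import Optional


def run_endpoint(states: dict[str, dict], initial_state: str, timeout_seconds: float) -> list[str]:
    """Follow `next` transitions from the initial state until a final state or the timeout."""
    deadline = time.monotonic() + timeout_seconds
    visited: list[str] = []
    current: Optional[str] = initial_state
    while current is not None:
        if time.monotonic() > deadline:
            raise TimeoutError(f"endpoint timed out after visiting {len(visited)} states")
        visited.append(current)
        current = states[current].get("next")  # a missing `next` marks the final state
    return visited


states = {
    "stateA": {"type": "processor", "next": "stateB"},
    "stateB": {"type": "response", "statusCode": 200},
}
print(run_endpoint(states, "stateA", timeout_seconds=60))
```

With a configuration such as `{"loop": {"next": "loop"}}` the walk never reaches a final state, so the deadline check is what ends it. This mirrors why a timeout is required on every endpoint.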

HTTP Request Execution

Data Channels

Some state types produce data that is available for subsequent states.

JSON Data Channel

The JSON Data Channel handles output in JSON format which can be later referenced using JSON Pointers.

The execution starts with the metadata of the invocation:

{
  "id": "55d22c60-6d61-41ce-b8b1-c0f1acd6e5e4",
  "httpRequest": {
    ...
  },
  "pageable": {
    ...
  },
  "properties": {
    ...
  }
}

id

(UUID) An auto-generated ID for the request

httpRequest

(Object) The HTTP Request that triggered the execution

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "method": <HTTP Method>,
    ...
  }
}

uri

(Required, String) The URI path of the HTTP Request

method

(Required, String) The HTTP Method for the HTTP Request

headers

(Optional, Object) The headers for the HTTP Request. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    }
  }
}

queryParams

(Optional, Object) The query parameters for the HTTP Request. The value of each query parameter can be either a single String, or an Array of Strings

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "queryParams": {
      "param-a": "value-a",
      "param-b": [
        "value-b-1",
        "value-b-2"
      ]
    }
  }
}

cookies

(Optional, Array of Objects) The list of cookies for the HTTP Request

Details

{
  "httpRequest": {
    "uri": "/my-endpoint",
    "cookies": [
      {
        "name": "cookie-name-a",
        "path": "/some/path/a",
        "value": "cookie-value-a",
        "domain": "cookie-domain-a",
        "maxAge": 1234
      },
      {
        "name": "cookie-name-b",
        "value": "cookie-value-b",
        ...
      }
    ]
  }
}

name: (Required, String) The name of the cookie
value: (Required, String) The value of the cookie
path: (Optional, String) The path of the cookie
domain: (Optional, String) The domain of the cookie
maxAge: (Optional, Integer) The maximum age of the cookie

body

(Optional, Object) The body of the HTTP Request

pageable

(Object) The pagination request parameters

Details

{
  "page": 0,
  "size": 25,
  "sort": [
    ...
  ]
}

page

(Integer) The page number

size

(Integer) The size of the page

sort

(Array of Objects) The sort definition for the page

Details

{
  "property" : "fieldA",
  "direction" : "ASC"
}

property: (String) The property on which the sort is applied
direction: (String) The direction of the sort. Either ASC or DESC

properties

(Object) The execution properties as configured in the Endpoint

Subsequent outputs are added in order. New data never overrides anything previously generated.

When searching for a path, the JSON Pointer will be evaluated against the most recent output. If it is a match, the node is returned. Otherwise, the search continues with the previous ones until reaching the original HTTP request.

Note	If multiple states generate the same data structure and an earlier one needs to be referenced, the root name of the output can be customized.
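
The lookup order described above can be sketched in Python. This is an illustration only, assuming the channel is modeled as a list of JSON documents with the invocation metadata first; the names `resolve_pointer` and `data` are hypothetical, and returning `None` when nothing matches is an assumption rather than documented behavior.

```python
from typing import Any

_MISSING = object()  # sentinel, so that a stored null/None still counts as a match


def resolve_pointer(document: Any, pointer: str) -> Any:
    """Resolve an RFC 6901 JSON Pointer, returning _MISSING when there is no match."""
    if pointer == "":
        return document
    if not pointer.startswith("/"):
        raise ValueError("a JSON Pointer must be empty or start with '/'")
    node = document
    for raw in pointer[1:].split("/"):
        token = raw.replace("~1", "/").replace("~0", "~")  # unescape per RFC 6901
        if isinstance(node, dict) and token in node:
            node = node[token]
        elif isinstance(node, list) and token.isdigit() and int(token) < len(node):
            node = node[int(token)]
        else:
            return _MISSING
    return node


def data(channel: list, pointer: str) -> Any:
    """Search from the most recent output back to the original request."""
    for output in reversed(channel):
        found = resolve_pointer(output, pointer)
        if found is not _MISSING:
            return found
    return None  # assumption: no match anywhere yields None


channel = [
    {"id": "55d22c60-6d61-41ce-b8b1-c0f1acd6e5e4",
     "httpRequest": {"uri": "/my-endpoint", "queryParams": {"q": "shoes"}}},
    {"search": {"total": 3}},
    {"search": {"total": 7}},
]

print(data(channel, "/search/total"))               # 7, the most recent match wins
print(data(channel, "/httpRequest/queryParams/q"))  # shoes, found in the original request
```

In this example the first `search` output is shadowed by the second one, which is the situation where a customized root name makes the earlier output reachable again.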

SSE Data Channel

Some components produce content of type text/event-stream. This data is later emitted to the corresponding HTTP Response for the request.

Note	It is possible to have multiple configurations producing events

Expression Language Extensions

Table 26. *Discovery QueryFlow* Expression Language Variables
Variable	Description	Example
`endpoint.id`	The ID of the endpoint in execution	`aa4a232e-8878-47ac-82d7-48ff77dbc039`
`endpoint.httpMethod`	The HTTP method of the endpoint in execution	`GET`
`endpoint.uri`	The URI paths of the endpoint in execution	`/my/custom/endpoint`
`endpoint.name`	The name of the endpoint in execution	`My Custom Endpoint`
`endpoint.description`	The description of the endpoint in execution	`Description of My Custom Endpoint`
`endpoint.properties`	The properties of the endpoint in execution	`[<myPropertyA,1>, <myPropertyB,Text Value>, <myPropertyC,>]`
`endpoint.labels`	The labels of the endpoint in execution, grouped by key	`[<keyA,[valueA,valueB]>, <keyB,valueC>]`
`processor.id`	The ID of the processor in execution during a processor state	`9aa1fec5-bf0d-4563-ade7-2cb270230bd7`
`processor.type`	The type of the processor in execution during a processor state	`my-component-type`
`processor.name`	The name of the processor in execution during a processor state	`My Component Processor`
`processor.description`	The description of the processor in execution during a processor state	`Description of My Component Processor`
`processor.labels`	The labels of the processor in execution during a processor state, grouped by key	`[<keyA,[valueA,valueB]>, <keyB,valueC>]`
`execution.id`	The unique ID of the current HTTP request	`b366282d-fa00-4411-be02-874111ee317c`
`execution.startTimestamp`	The timestamp when the current HTTP request started	`2023-12-08T10:30:00Z`

Table 27. *Discovery QueryFlow* Expression Language Functions
Function	Description	Example
`data(string)`	Finds a specific node within the JSON processing channel using a JSON Pointer	`data('/path/to/field')`

HTTP Response

By default, all Endpoints return a response of application/json content type. This HTTP response is either:

The configured response state or error state.
The most recent entry in the JSON Data Channel, where:
- If the root of the document is named httpResponse, the statusCode, headers and body will be used accordingly
  Details

  {
    "httpResponse": {
      "statusCode": 200,
      "headers": {
        "header-a": "value-a",
        "header-b": [
          "value-b-1",
          "value-b-2"
        ]
      },
      "body": {
        ...
      }
    }
  }
  
  statusCode
  
  (Integer) The HTTP Status Code of the HTTP Response
  
  headers
  
  (Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings
  
  Details

  {
    "httpResponse": {
      "headers": {
        "header-a": "value-a",
        "header-b": [
          "value-b-1",
          "value-b-2"
        ]
      },
      ...
    }
  }
  
  body
  
  (Object) The body of the HTTP Response
- If the node is any other output from a processor state, the body will be unwrapped from the outputField and the status code will be 200.
In any other case, the response will be 204 - No Content.
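
The rules for deriving a response from the JSON Data Channel can be summarized in a short Python sketch. It is an illustration only: `resolve_response` is a hypothetical name, a configured response or error state (which takes precedence) is not modeled, and falling back to 200 when an `httpResponse` node omits `statusCode` is an assumption.

```python
from typing import Optional


def resolve_response(latest: Optional[dict]) -> dict:
    """Derive the HTTP response from the most recent processor output, if any."""
    if not latest:
        # Nothing was produced: 204 - No Content.
        return {"statusCode": 204, "headers": {}, "body": None}
    if "httpResponse" in latest:
        # The root is named httpResponse: use its fields directly.
        http_response = latest["httpResponse"]
        return {
            "statusCode": http_response.get("statusCode", 200),  # assumption: 200 when omitted
            "headers": http_response.get("headers", {}),
            "body": http_response.get("body"),
        }
    # Any other processor output: unwrap the body from its root (the outputField).
    body = next(iter(latest.values()))
    return {"statusCode": 200, "headers": {}, "body": body}


print(resolve_response({"myOutput": {"hits": 2}}))
# {'statusCode': 200, 'headers': {}, 'body': {'hits': 2}}
```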

Endpoints of type stream return a 207 - Multi-Status response of type text/event-stream. Their body consists of the data of every Server-Sent Event (SSE) emitted and stored in the SSE Data Channel.

Details

name: <Output Field Name>
data: <Chunk>

name: (String) The configured outputFieldName for the processor
data: (String) The chunked data from the stream
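
A client can reassemble the streamed chunks by grouping them by name. The sketch below is a minimal Python illustration, assuming that every `name` line precedes its `data` line and that a single space follows each colon; `collect_chunks` is a hypothetical helper, not part of Discovery.

```python
from collections import defaultdict


def collect_chunks(stream_text: str) -> dict:
    """Concatenate the data chunks of a stream body, grouped by output field name."""
    chunks = defaultdict(list)
    name = None
    for line in stream_text.splitlines():
        field, separator, value = line.partition(":")
        if not separator:
            continue  # blank line or line without a field separator
        if value.startswith(" "):
            value = value[1:]  # drop only the single space after the colon
        if field == "name":
            name = value
        elif field == "data" and name is not None:
            chunks[name].append(value)
    return {key: "".join(parts) for key, parts in chunks.items()}


body = (
    "name: answer\ndata: some \n\n"
    "name: answer\ndata: chunked d\n\n"
    "name: answer\ndata: ata\n\n"
    "name: answer\ndata: .\n"
)
print(collect_chunks(body))  # {'answer': 'some chunked data.'}
```

Only one leading space is removed from each value so that whitespace inside a chunk, such as the trailing space of `some `, survives the concatenation.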

Invoking a Configured Endpoint

Once an Endpoint is fully configured, it can be invoked as any other REST API by calling its HTTP Method/URI under the /api root path:

Invoke an Endpoint

$ curl --request GET 'queryflow-api:8088/v2/api/my-endpoint?param=value'

The HTTP Response corresponds to the execution of the finite-state machine for the context created by the HTTP Request:

{
  "myResponse": {
    ...
  }
}

Given that the definition of the Endpoint can grow in complexity, the risk of something breaking increases: a condition failed, the output was not as expected, a parameter was wrong…

To identify the problem, the /debug root path offers a complete trace of the execution for the endpoint. Each of the states, their output, their errors and the overall step-by-step path followed by the finite-state machine will be displayed:

Debug an Endpoint

$ curl --request GET 'queryflow-api:8088/v2/debug/my-endpoint?param=value'

{
  "duration": 692,
  "execution": [
    {
      "state": "stateA",
      ...
    },
    ...
  ]
}

duration

(Integer) The duration of the HTTP Request Execution in milliseconds

execution

(Array of Objects) The details of the finite-state machine invocation. Each entry will depend on the state type in execution

Details

state

(String) The name of the state in execution. The same state can be invoked multiple times with different results

Processor State

{
  "state": "myProcessorState",
  "result": [
    {
      "processor": <Processor ID>,
      ...
    },
    ...
  ]
}

result

(Array of Objects) The result of the state execution

Details

processor

(UUID) The ID of the executed processor

JSON Data Channel

{
  "processor": <Processor ID>,
  "output": {
    "myOutputField": {
      ...
    }
  },
  "duration": 12
}

output: (Object) The data stored in the JSON Data Channel after wrapping the execution of the processor in the configured outputField
duration: (Integer) The duration of the execution of the state in milliseconds

SSE Data Channel

{
  "processor": <Processor ID>,
  "output": [
    {
      "name": "testFieldName",
      "data": "some "
    },
    {
      "name": "testFieldName",
      "data": "chunked d"
    },
    {
      "name": "testFieldName",
      "data": "ata"
    },
    {
      "name": "testFieldName",
      "data": "."
    }
  ],
  "duration": {
    "execution": 5,
    "stream": 20
  }
}

output

(Array of Objects) The chunks stored in the SSE Data Channel, using the configured outputField as name

Details

{
  "output": [
    {
      "name": "testFieldName",
      "data": "some "
    },
    {
      "name": "testFieldName",
      "data": "chunked d"
    },
    {
      "name": "testFieldName",
      "data": "ata"
    },
    {
      "name": "testFieldName",
      "data": "."
    },
    ...
  ],
  ...
}

name: (String) The configured outputField for the processor in the state
data: (String) The chunk of data

Parallel Endpoint State

{
  "state": "myParallelEndpointState",
  "result": {
    "tagA": {
      ...
    },
    ...
  }
}

result

(Object) The tag of each configured endpoint and their corresponding response

Details

{
  "state": "myParallelEndpointState",
  "result": {
    "tagA": {
      "status": 200,
      "headers": {
        ...
      },
      "body": {
        ...
      }
    }
  }
}

statusCode

(Integer) The HTTP Status Code of the HTTP Response

headers

(Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpResponse": {
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    ...
  }
}

body

(Object) The body of the HTTP Response

Switch State

{
  "state": "mySwitchState",
  "matchType": "MATCH",
  "condition": {
    ...
  }
}

matchType: (String) The type of match after executing the state. One of MATCH for a condition that matched, DEFAULT for the default option or NONE if no condition matched and no default option is configured
condition: (Object) The configured DSL Filter that matched during the state execution

Response State

{
  "state": "myResponseState",
  "result": {
    ...
  }
}

result

(Object) The HTTP Response as configured in the state

Details

{
  "state": "myResponseState",
  "result": {
    "status": 200,
    "headers": {
      ...
    },
    "body": {
      ...
    }
  }
}

statusCode

(Integer) The HTTP Status Code of the HTTP Response

headers

(Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpResponse": {
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    ...
  }
}

body

(Object) The body of the HTTP Response

Error State

{
  "state": "myErrorState",
  "result": {
    ...
  }
}

result

(Object) The HTTP Response as configured in the state

Details

{
  "state": "myErrorState",
  "result": {
    "status": 500,
    "headers": {
      ...
    },
    "body": {
      ...
    }
  }
}

statusCode

(Integer) The HTTP Status Code of the HTTP Response

headers

(Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings

Details

{
  "httpResponse": {
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    ...
  }
}

body

(Object) The body of the HTTP Response

Note	The debug request is exactly the same as the one sent to the `/api` path.

Note	The debug response contains the `X-Request-ID` header with the UUID of the execution.
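
Since the `/api` and `/debug` requests are identical apart from the root path, a small helper can build both URLs and condense a debug trace into one line per state. This is a sketch in Python under stated assumptions: the `http://` scheme is assumed (the curl examples omit it), and `endpoint_url` and `summarize_trace` are hypothetical names.

```python
from urllib.parse import urlencode

# Host, port and version prefix as used in the curl examples; the scheme is an assumption.
BASE_URL = "http://queryflow-api:8088/v2"


def endpoint_url(root: str, uri: str, params: dict) -> str:
    """Build the URL of an endpoint under the /api or /debug root path."""
    if root not in ("api", "debug"):
        raise ValueError("root must be 'api' or 'debug'")
    query = f"?{urlencode(params)}" if params else ""
    return f"{BASE_URL}/{root}/{uri.lstrip('/')}{query}"


def summarize_trace(trace: dict) -> list:
    """List the states visited by the finite-state machine, in execution order."""
    lines = []
    for step in trace.get("execution", []):
        detail = step.get("matchType") or ""  # only switch states report a matchType
        lines.append(f"{step['state']} {detail}".strip())
    lines.append(f"total: {trace['duration']} ms")
    return lines


trace = {
    "duration": 692,
    "execution": [
        {"state": "stateA", "result": []},
        {"state": "mySwitchState", "matchType": "MATCH", "condition": {}},
        {"state": "myResponseState", "result": {}},
    ],
}
print(endpoint_url("debug", "/my-endpoint", {"param": "value"}))
# http://queryflow-api:8088/v2/debug/my-endpoint?param=value
print(summarize_trace(trace))
# ['stateA', 'mySwitchState MATCH', 'myResponseState', 'total: 692 ms']
```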

Components

Amazon Bedrock

Send requests to Amazon Bedrock. Supports multiple actions for different endpoints of the service.

Processor Action: invoke-model

Processor that invokes the specified Amazon Bedrock model to run inference using the prompt and inference parameters provided in the configuration.

{
  "type": "amazon-bedrock",
  "name": "My Amazon Bedrock Processor Action",
  "config": {
    "action": "invoke-model",
    ...
  },
  "server": <Amazon Bedrock Server>,
  ...
}

model: (Required, String) The model to invoke
request: (Required, Object) The body of the request
stream: (Optional, Boolean) Whether to enable streaming. Defaults to false

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "amazonBedrock": {
    ...
  }
}

Elasticsearch

Uses the Elasticsearch integration to send requests to the Elasticsearch API. Supports multiple actions for common operations such as search, but also provides a mechanism to send raw Elasticsearch queries.

Action: autocomplete

Processor that executes a completion suggester query.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "autocomplete",
    "index": "my-index",
    "text": "#{ data('/my/query') }",
    "field": "content",
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index where to search
text: (Required, String) The text to search
field: (Required, String) The field where to search
skipDuplicates: (Optional, Boolean) Whether to skip duplicate suggestions
size: (Optional, Integer) The amount of suggestions

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}

Action: knn

Processor that executes a k-nearest neighbor (kNN) query using approximate kNN.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "knn",
    "index": "my-index",
    "field": "content",
    "maxResults": 5,
    "vector": "#{ data('/my/vector') }",
    "k": 5,
    "candidatesPerShard": 20,
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index where to search
field: (Required, String) The field where to search
maxResults: (Required, Integer) The maximum number of results
vector: (Required, Array of Float) The source vector to compare
k: (Required, Long) The number of nearest neighbors
candidatesPerShard: (Required, Long) The number of nearest neighbors considered per shard
query: (Optional, Object) The query to filter in addition to the kNN search

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}
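
For orientation, the parameters of this action resemble those of the `knn` option of the Elasticsearch search API. The Python sketch below shows how such a request body could be assembled from the configuration. The mapping between the Discovery field names and the Elasticsearch ones (for instance `candidatesPerShard` to `num_candidates`, or `query` to `filter`) is an assumption made for illustration, not the documented behavior of the component, and `knn_search_body` is a hypothetical name.

```python
def knn_search_body(config: dict) -> dict:
    """Build an Elasticsearch approximate kNN search body from an action config."""
    knn = {
        "field": config["field"],
        "query_vector": config["vector"],
        "k": config["k"],
        "num_candidates": config["candidatesPerShard"],  # assumed mapping
    }
    if "query" in config:
        knn["filter"] = config["query"]  # assumed mapping for the optional filter
    return {"knn": knn, "size": config["maxResults"]}


config = {
    "index": "my-index",
    "field": "content",
    "maxResults": 5,
    "vector": [0.12, -0.4, 0.33],
    "k": 5,
    "candidatesPerShard": 20,
}
print(knn_search_body(config))
```

A larger number of candidates per shard generally improves recall at the cost of slower searches, which is the usual trade-off of approximate kNN.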

Action: native

Processor that executes a native Elasticsearch query.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "native",
    "path": "/my-index/_doc/1",
    "method": "GET",
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

path: (Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection
method: (Required, String) The HTTP method for the request
queryParams: (Optional, Map of String/String) The map of query parameters for the URL
body: (Optional, Object) The JSON body to submit

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}

Action: search

Processor that executes a match query on the index.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "search",
    "index": "my-index",
    "text": "#{ data('/my/query') }",
    "field": "content",
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index where to search
text: (Required, String) The text to search
field: (Required, String) The field where to search
suggest: (Optional, Object) The suggester to apply
aggregations: (Optional, Map of String/Object) The field with the aggregations to apply
filter: (Optional, DSL Filter) The filters to apply
highlight: (Optional, Object) The highlighter to apply
pageable: (Optional, Pagination) The pagination parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}

Action: store

Processor that executes a store request to Elasticsearch.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "store",
    "index": "my-index",
    "document": {
      ...
    },
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index where to store the document
document: (Required, Object) The document to be stored
id: (Optional, String) The ID of the document to be stored. If not provided, it will be autogenerated
allowOverride: (Optional, Boolean) Whether the document can be overridden or not. Defaults to false

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}

Action: vector

Processor that executes a script score query using exact kNN.

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "vector",
    "index": "my-index",
    "field": "my_vector_field",
    "vector": "#{ data('/my/vector') }",
    "minScore": 0.92,
    "maxResults": 5,
    "query": {
      ...
    },
    ...
  },
  "server": <Elasticsearch Server>,
  ...
}

index: (Required, String) The index where to search
field: (Required, String) The field with the vector
vector: (Required, Array of Float) The source vector to compare
minScore: (Required, Double) The minimum score for results
maxResults: (Required, Integer) The maximum number of results
query: (Optional, Object) The query to apply together with the vector search
function: (Optional, String) The type of function to use. One of cosineSimilarity, dotProduct, l1norm or l2norm. Defaults to cosineSimilarity

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "elasticsearch": {
    ...
  }
}
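
Conceptually, exact kNN scores every candidate document against the source vector, unlike the approximate variant. The following Python sketch illustrates the idea with cosine similarity, a minimum score and a result limit. It is not how Elasticsearch or the component is implemented, and whether the raw similarity is adjusted before it is compared with `minScore` is not covered here; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(query, documents, min_score, max_results):
    """Score every document against the query vector and keep the best ones."""
    scored = [(doc_id, cosine_similarity(query, vector)) for doc_id, vector in documents.items()]
    kept = [(doc_id, score) for doc_id, score in scored if score >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)  # highest score first
    return kept[:max_results]


documents = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
results = exact_knn([1.0, 0.0], documents, min_score=0.5, max_results=5)
print([doc_id for doc_id, _ in results])  # ['a', 'c']
```

Because every document is scored, the cost grows with the size of the candidate set, which is why an additional query that narrows the candidates is useful with this action.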

Hugging Face

Uses the Hugging Face integration to send requests to the Inference API. Supports multiple actions for different endpoints of the service.

Action: summarization

Processor that summarizes a single or multiple texts organized by an autogenerated id. See Summarization Task

{
  "type": "hugging-face",
  "name": "My Summarization Action",
  "config": {
    "action": "summarization",
    "model": "Falconsai/text_summarization",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of Strings) The list of texts to summarize

parameters

(Optional, Object) The parameters for the request

Details

minLength: (Optional, Integer) The minimum length of the output tokens
maxLength: (Optional, Integer) The maximum length of the output tokens

topK: (Optional, Integer) The top tokens to consider to create new text
topP: (Optional, Float) Defines the tokens that are within the sample operation for the query
temperature: (Optional, Float) The temperature of the sampling operation. Defaults to 1.0
repetitionPenalty: (Optional, Float) The repetition penalty for the request.
maxTime: (Optional, Float) The maximum time that the request should take

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace": "Your summarized text"
}

For multiple text inputs:

{
  "huggingFace": [
    "Your summarized text for your first input",
    "Your summarized text for your second input",
    ...
  ]
}

Note	Please note that the order of the responses corresponds to the order of the inputs.

Action: text-generation

Processor that continues the text from a prompt. See Text Generation Task

{
  "type": "hugging-face",
  "name": "My Text Generation Action",
  "config": {
    "action": "text-generation",
    "model": "gpt2-large",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, String) The prompt from which to generate the response

parameters

(Optional, Object) The parameters for the request

Details

maxNewTokens: (Optional, Integer) The number of tokens to be generated
returnFullText: (Optional, Boolean) Whether to include the input text within the answer or not. Defaults to true
numReturnSequences: (Optional, Integer) The number of propositions to be returned
doSample: (Optional, Boolean) Whether to use sampling or not. Use greedy decoding otherwise. Defaults to true
topK: (Optional, Integer) The top tokens to consider to create new text
topP: (Optional, Float) Defines the tokens that are within the sample operation for the query
temperature: (Optional, Float) The temperature of the sampling operation. Defaults to 1.0
repetitionPenalty: (Optional, Float) The repetition penalty for the request.
maxTime: (Optional, Float) The maximum time that the request should take

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

{
  "huggingFace": "My autogenerated text"
}

Action: feature-extraction

Processor that extracts a matrix of numerical features from a single or multiple texts organized by an autogenerated id. See Feature Extraction Task

{
  "type": "hugging-face",
  "name": "My Feature Extraction Action",
  "config": {
    "action": "feature-extraction",
    "model": "facebook/bart-base",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of String) The list of texts from which to extract the numerical features

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace":
     [[ 2.2187119  , 2.7539337  , 1.0330348  , ... ],
      [ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
      [ 0.09872855 , 0.53532976 , 0.7232368  , ... ]]
}

For multiple text inputs:

{
  "huggingFace":
  [
    [
      [ 2.2187119  , 2.7539337  , 1.0330348  , ... ],
      [ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
      ...
    ],
    [
      [ 2.821799  , 2.7055995   , 1.1408421  , ... ],
      [ 1.4287674 , 0.39487326  , -3.7841866 , ... ],
      ...
    ]
  ]
}

Note	Please note that the order of the responses corresponds to the order of the inputs.
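
When a single vector per text is needed, for example to feed a vector search, a common technique is to average the rows of the returned matrix (mean pooling). The Python sketch below assumes that each row of the matrix corresponds to one token of the input; that interpretation, and the helper name `mean_pool`, are assumptions and not part of the component.

```python
def mean_pool(token_vectors):
    """Average a matrix of per-token features into a single vector."""
    if not token_vectors:
        raise ValueError("at least one token vector is required")
    count = len(token_vectors)
    # zip(*rows) iterates over the columns of the matrix.
    return [sum(column) / count for column in zip(*token_vectors)]


single = {"huggingFace": [[1.0, 2.0], [3.0, 4.0]]}
print(mean_pool(single["huggingFace"]))  # [2.0, 3.0]

multiple = {"huggingFace": [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 6.0]]]}
print([mean_pool(matrix) for matrix in multiple["huggingFace"]])  # [[2.0, 3.0], [1.0, 3.0]]
```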

Action: fill-mask

Processor that replaces a missing word in a sentence with multiple fitting possibilities. The name of the [MASK] token to be replaced is defined by the chosen model. See Fill Mask Task

{
  "type": "hugging-face",
  "name": "My Fill Mask Action",
  "config": {
    "action": "fill-mask",
    "model": "distilroberta-base",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of String) The list of texts whose masks are to be filled

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace": [
    {
      "sequence": "Paris is the capital of france",
      "score": 0.2705707,
      "token": 812,
      "tokenStr": " capital"
    },
    ...
  ]
}

For multiple text inputs:

{
  "huggingFace": [
    [
      {
        "sequence": "Paris is the capital of france",
        "score": 0.2705707,
        "token": 812,
        "tokenStr": " capital"
      },
      ...
    ],
    [
      {
        "sequence": "The Eiffel tower is one of the main tourist spots in Paris.",
        "score": 0.9013709,
        "token": 8376,
        "tokenStr": " tourist"
      },
      ...
    ],
    ...
  ]
}

Action: text-classification

Processor that classifies a text into a group of labels and provides a score for each label. These labels are determined by the model that is used. See Text Classification Task

{
  "type": "hugging-face",
  "name": "My Text Classification Action",
  "config": {
    "action": "text-classification",
    "model": "distilbert-base-uncased-finetuned-sst-2-english",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of String) The list of texts to classify

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace": [
    {"label": "LabelA", "score": 0.9998608827590942},
    ...
  ]
}

For multiple text inputs:

{
  "huggingFace": [
    [
      {
        "label": "LabelA",
        "score": 0.9998608827590942
      },
      ...
    ],
    [
      {
        "label": "LabelC",
        "score": 0.9968926310539246
      },
      ...
    ],
    ...
  ]
}

Note	Please note that the order of the responses corresponds to the order of the inputs.
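
Because the responses keep the order of the inputs, each text can be paired with its best label by position. The following Python sketch assumes the two response shapes shown above; `top_labels` is a hypothetical helper and the label names are illustrative values only.

```python
def top_labels(inputs, response):
    """Pair each input text with its highest-scoring label."""
    results = response["huggingFace"]
    if len(inputs) == 1 and results and isinstance(results[0], dict):
        results = [results]  # single-input shape: a flat list of labels
    if len(inputs) != len(results):
        raise ValueError("expected one list of labels per input")
    return [
        (text, max(labels, key=lambda item: item["score"])["label"])
        for text, labels in zip(inputs, results)
    ]


response = {
    "huggingFace": [
        [{"label": "POSITIVE", "score": 0.99}, {"label": "NEGATIVE", "score": 0.01}],
        [{"label": "NEGATIVE", "score": 0.98}, {"label": "POSITIVE", "score": 0.02}],
    ]
}
print(top_labels(["great", "awful"], response))
# [('great', 'POSITIVE'), ('awful', 'NEGATIVE')]
```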

Action: zero-shot-classification

Processor that classifies a text into a group of labels without having seen any training examples for those labels, and provides a score for each label. See Zero Shot Classification Task

{
  "type": "hugging-face",
  "name": "My Zero Shot Classification Action",
  "config": {
    "action": "zero-shot-classification",
    "model": "facebook/bart-large-mnli",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of String) The list of texts to classify

parameters

(Optional, Object) The parameters for the request

Details

candidateLabels: (Required, Array of String) The list of possible labels to classify the input
multiLabel: (Optional, Boolean) Whether classes can overlap or not. Defaults to false

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace": [
    {
      "label": "labelA",
      "score": 0.9998608827590942
    },
    ...
  ]
}

For multiple text inputs:

{
  "huggingFace": [
    [
      {
        "label": "labelA",
        "score": 0.9998608827590942
      },
      ...
    ],
    [
      {
        "label": "labelA",
        "score": 0.9968926310539246
      },
      ...
    ],
    ...
  ]
}

Note	Please note that the order of the responses corresponds to the order of the inputs.

Action: token-classification

Processor that assigns a label to the tokens from a single or multiple texts organized by an autogenerated id. See Token Classification Task

{
  "type": "hugging-face",
  "name": "My Token Classification Action",
  "config": {
    "action": "token-classification",
    "model": "dslim/bert-base-NER",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Array of String) The list of texts to classify their tokens

parameters

(Optional, Object) The parameters for the request

Details

aggregationStrategy: (Optional, String) The aggregation strategy to use in the request

Details

NONE: Every token gets classified without further aggregation

SIMPLE: Entities are grouped according to the default schema (B-, I- tags get merged when the tag is similar)

FIRST: Same as the SIMPLE strategy except words cannot end up with different tags. Words will use the tag of the first token when there is ambiguity

AVERAGE: Same as the SIMPLE strategy except words cannot end up with different tags. Scores are averaged across tokens and then the maximum label is applied

MAX: Same as the SIMPLE strategy except words cannot end up with different tags. Word entity will be the token with the maximum score

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For a single text input:

{
  "huggingFace": [
    {
      "score": 0.9990085,
      "word": "Omar",
      "start": 11,
      "end": 15,
      "entityGroup": "PER"
    },
    ...
  ]
}

For multiple text inputs:

{
  "huggingFace": [
    [
      {
        "score": 0.9990085,
        "word": "Omar",
        "start": 11,
        "end": 15,
        "entityGroup": "PER"
      },
      ...
    ],
    [
      {
        "score": 0.9949533,
        "word": "George Washington",
        "start": 0,
        "end": 17,
        "entityGroup": "PER"
      },
      ...
    ],
    ...
  ]
}

Note	Please note that the order of the responses corresponds to the order of the inputs.

Action: question-answering

Processor that answers a question based on given contexts. See Question Answering Task

{
  "type": "hugging-face",
  "name": "My Question Answering Action",
  "config": {
    "action": "question-answering",
    "model": "deepset/roberta-base-squad2",
    "input": "#{ data(\"/httpRequest/body/input\") }",
    ...
  },
  "server": <Hugging Face Server>,
  ...
}

model

(Required, String) The model to use for the request

input

(Required, Object) The input for the request

Details

question: (Required, String) The question to be answered by the model
context: (Required, Array of String) The list of contexts to use to answer the question

minScore

(Optional, Float) The min score for each answer

options

(Optional, Object) The request options

Details

useCache: (Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true
waitForModel: (Optional, Boolean) Whether to wait until the model is ready or not. If false the response will be 503 - Service Unavailable

Note	See Inference API Parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

{
  "huggingFace": [
    {
      "answer": "Clara",
      "score": 0.8979613184928894,
      "start": 11,
      "end": 16
    },
    {
      "answer": "Los Angeles",
      "score": 0.013939359225332737,
      "start": 20,
      "end": 31
    },
    ...
  ]
}

Note	Please note that the responses are sorted in descending order according to their score value.
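Because the answers arrive sorted by descending score, a consumer of this output can take the first entry that passes a threshold. The sketch below shows this in Python on a copy of the sample response. The `best_answer` helper and its `min_score` argument are hypothetical client-side code and are independent of the `minScore` parameter of the action.

```python
def best_answer(data, min_score=0.0):
    """Return the top answer at or above min_score, or None if there is none."""
    answers = data.get("huggingFace", [])
    # The list is documented as sorted by descending score, so the first
    # answer that passes the threshold is also the best one.
    for answer in answers:
        if answer["score"] >= min_score:
            return answer
    return None


response = {
    "huggingFace": [
        {"answer": "Clara", "score": 0.8979613184928894, "start": 11, "end": 16},
        {"answer": "Los Angeles", "score": 0.013939359225332737, "start": 20, "end": 31},
    ]
}

top = best_answer(response, min_score=0.5)
```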

Language Detector

The Language Detector component uses Lingua to identify the language of a specified text input. Languages are referenced by their ISO-639-1 (alpha-2) codes.

Note	Each time a language model is referenced, it will be loaded in memory. Loading too many languages increases the risk of high memory consumption issues.

Processor Action: process

Processor that detects the language of a provided text.

{
  "type": "language-detector",
  "name": "My Language Detector Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

text

(Required, String) The text to be evaluated.

defaultLanguage

(Optional, String) Default language to select in case no other is detected. Defaults to "en".

minDistance

(Optional, Double) Distance between the input and the language model. Defaults to 0.0.

supportedLanguages

(Optional, Array of Strings) List of languages supported by the detector. Defaults to [ "en" ].

Details

{
  "supportedLanguages": [
    "en",
    "pt"
  ],
  ...
}

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "language": {
    ...
  }
}

Logger

Logs any type of key/value entry assigned to the message configuration at the given level.

Note	The log component always returns an empty output in the data of the execution. The result of the component will only be available in the log file of the QueryFlow API.

Processor Action: process

Processor that logs any key/value entry provided.

{
  "type": "logger",
  "name": "My Logger Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

message: (Required, String) Message to log.
level: (Optional, String) Logging level. One of: INFO, DEBUG or ERROR. Defaults to INFO.
loggerName: (Optional, String) Name of the logger.

MongoDB

Uses the MongoDB integration to send requests to the MongoDB server.

Action: aggregate

Processor that runs a configured aggregation pipeline on a MongoDB database.

{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "aggregate",
    "database": "my-database",
    "collection": "my-collection",
    "stages": [
      ...
    ],
    ...
  },
  "server": <MongoDB Server>,
  ...
}

database: (Required, String) The database name
collection: (Required, String) The collection name
stages: (Required, Array of Objects) The list of MongoDB stages

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "mongo": {
    ...
  }
}

Action: autocomplete

Processor that uses the autocomplete operator in a compound must clause. Filters are applied in the filter clause.

{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "autocomplete",
    "database": "my-database",
    "collection": "my-collection",
    "index": "my-index",
    "path": "my-field",
    "queries": [
      ...
    ],
    ...
  },
  "server": <MongoDB Server>,
  ...
}

database: (Required, String) The database name
collection: (Required, String) The collection name
index: (Required, String) The name for the MongoDB full-text search index
path: (Required, String) The indexed field to search
queries: (Required, Array of Strings) The phrase or phrases to autocomplete
tokenOrder: (Optional, String) The order in which the tokens will be searched. One of ANY or SEQUENTIAL
filter: (Optional, DSL Filter) The filter to apply

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "mongo": {
    ...
  }
}

Action: search

Processor that uses the text operator in a compound must clause. Filters are applied in the filter clause.

{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "search",
    "database": "my-database",
    "collection": "my-collection",
    "index": "my-index",
    "paths": [
      ...
    ],
    "queries": [
      ...
    ],
    ...
  },
  "server": <MongoDB Server>,
  ...
}

database

(Required, String) The database name

collection

(Required, String) The collection name

index

(Required, String) The name for the MongoDB full-text search index

paths

(Required, Array of Strings) The path or paths to the fields to search

queries

(Required, Array of Strings) The phrase or phrases to search for

pageable

(Optional, Object) The pagination object

Details

{
  "page": 0,
  "size": 25,
  "sort": [
    ...
  ]
}

page

(Integer) The page number

size

(Integer) The size of the page

sort

(Array of Objects) The sort definition for the page

Details

{
  "property" : "fieldA",
  "direction" : "ASC"
}

property: (String) The property to sort by
direction: (String) The direction of the sort. Either ASC or DESC

filter

(Optional, DSL Filter) The filter to apply

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "mongo": {
    ...
  }
}
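The `pageable` object of this action can be assembled programmatically. The following Python sketch builds one and computes which results a page covers. It assumes zero-based pages, as the `"page": 0` sample suggests, and both helper names are hypothetical.

```python
def pageable(page, size, sort=None):
    """Build a pagination object shaped like the documented one."""
    if page < 0 or size <= 0:
        raise ValueError("page must be >= 0 and size must be > 0")
    body = {"page": page, "size": size}
    if sort:
        # Each sort entry is a (property, direction) pair.
        body["sort"] = [
            {"property": prop, "direction": direction} for prop, direction in sort
        ]
    return body


def covered_range(page, size):
    """Assumption: pages are zero-based, so page N starts at offset N * size."""
    start = page * size
    return start, start + size - 1


first_page = pageable(0, 25, sort=[("fieldA", "ASC")])
```

Under that assumption, page 2 with a size of 25 covers results 50 through 74.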

Action: vector

Processor that uses the knnBeta operator and its filter field if provided.

{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "vector",
    "database": "my-database",
    "collection": "my-collection",
    "index": "my-index",
    "vector": "#{ data('my/vector') }",
    "path": "my-field",
    "k": 5,
    "minScore": 0.92,
    ...
  },
  "server": <MongoDB Server>,
  ...
}

database: (Required, String) The database name
collection: (Required, String) The collection name
index: (Required, String) The name for the MongoDB full-text search index
vector: (Required, Array of Float) The kNN vector
path: (Required, String) The path to search for the vector in the documents
k: (Required, Integer) The number of the nearest neighbors to return
minScore: (Required, Double) The minimum score for results
filter: (Optional, Object) The search operator filter to apply in the query

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "mongo": {
    ...
  }
}

Neo4j

Executes a read query to a Neo4j server to gather search results from it.

Processor Action: process

Processor that executes a query to a Neo4j server.

{
  "type": "neo4j",
  "name": "My Neo4j Processor Action",
  "config": {
    "action": "process",
    ...
  },
  "server": <Neo4j Server>,
  ...
}

database: (Required, String) The Neo4j database to query
query: (Required, String) The query to be executed
parameters: (Required, Map of String/Object) Parameters to be used in the query

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "neo4j": {
    ...
  }
}

OpenAI

Uses the OpenAI integration to send requests to OpenAI. Supports multiple actions for different endpoints of the service.

Action: chat-completion

Processor that executes a chat completion request to the OpenAI API.

{
  "type": "openai",
  "name": "My Chat Completion Action",
  "config": {
    "action": "chat-completion",
    "model": "gpt-4",
    ...
  },
  "server": <OpenAI Server>,
  ...
}

model

(Required, String) The OpenAI model to use

user

(Required, String) The unique identifier representing the end-user

messages

(Required, Array of Objects) The list of messages for the request

Details

[
  {"role": "system", "content": "You are a helpful assistant" },
  {"role": "user", "content": "Hi!" },
  {"role": "assistant", "content": "Hi, how can I assist you today?" }
]

role: (Required, String) The role of the message. Must be one of system, user or assistant
content: (Required, String) The content of the message
name: (Optional, String) The name of the author of the message

frequencyPenalty

(Optional, Double) Positive values penalize new tokens based on their existing frequency in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0

presencePenalty

(Optional, Double) Positive values penalize new tokens based on whether they appear in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0

temperature

(Optional, Double) Sampling temperature to use. Value must be between 0 and 2. Defaults to 1

Note	It is generally recommended to alter this or topP, but not both.

topP: (Optional, Double) An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. Defaults to 1

Note	It is generally recommended to alter this or temperature, but not both.

n: (Optional, Integer) How many chat completion choices to generate for each input message. Defaults to 1
maxTokens: (Optional, Integer) The maximum number of tokens to generate in the chat completion. Defaults to 2048
stop: (Optional, Array of String) Up to 4 sequences where the API will stop generating further tokens
stream: (Optional, Boolean) Whether to enable streaming or not. Defaults to false
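The constraints listed for these parameters can be checked before a configuration is submitted. The following Python sketch is a hypothetical pre-flight check, not part of the product. It assumes the values are literals rather than `#{ ... }` expressions, and `validate_chat_config` is an illustrative name.

```python
def validate_chat_config(config):
    """Return a list of problems found in a chat-completion config dict."""
    problems = []
    # Required fields as documented for the action.
    for field in ("model", "user", "messages"):
        if field not in config:
            problems.append(f"missing required field: {field}")
    # Every message must use one of the documented roles.
    for message in config.get("messages", []):
        if message.get("role") not in ("system", "user", "assistant"):
            problems.append(f"invalid role: {message.get('role')}")
    # Documented numeric ranges (inclusive bounds are an assumption).
    ranges = {
        "frequencyPenalty": (-2.0, 2.0),
        "presencePenalty": (-2.0, 2.0),
        "temperature": (0.0, 2.0),
    }
    for field, (low, high) in ranges.items():
        if field in config and not low <= config[field] <= high:
            problems.append(f"{field} must be between {low} and {high}")
    # The stop parameter accepts up to 4 sequences.
    if len(config.get("stop", [])) > 4:
        problems.append("stop accepts up to 4 sequences")
    return problems


good = {
    "model": "gpt-4",
    "user": "user-123",
    "messages": [{"role": "user", "content": "Hi!"}],
    "temperature": 0.7,
}
bad = dict(good, temperature=3.5, stop=["a", "b", "c", "d", "e"])
```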

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

For non-streaming response:

{
  "openai": {
    "created": <Timestamp>,
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The response from the model"
        },
        "finishReason": "stop"
      }
    ],
    "model": "gpt-4-0613",
    "usage": {
      "promptTokens": 34,
      "completionTokens": 95,
      "totalTokens": 129
    }
  }
}

For streaming response:

[
  {
    "name": "openai",
    "data": "The"
  },
  {
    "name": "openai",
    "data": " response"
  },
  {
    "name": "openai",
    "data": " from"
  },
  {
    "name": "openai",
    "data": " the"
  },
  {
    "name": "openai",
    "data": " model"
  },
  ...
]
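A consumer of the streaming output can rebuild the full message by concatenating the `data` field of each chunk in order. The Python sketch below shows this on sample chunks shaped like the ones above. The `join_stream` helper is hypothetical.

```python
def join_stream(chunks, name="openai"):
    """Concatenate the data of every chunk emitted under the given name."""
    return "".join(chunk["data"] for chunk in chunks if chunk.get("name") == name)


chunks = [
    {"name": "openai", "data": "The"},
    {"name": "openai", "data": " response"},
    {"name": "openai", "data": " from"},
    {"name": "openai", "data": " the"},
    {"name": "openai", "data": " model"},
]

message = join_stream(chunks)
```

Each chunk already carries its own leading whitespace, so no separator is added when joining.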

Action: embeddings

Processor that executes an embeddings request to the OpenAI API.

{
  "type": "openai",
  "name": "My OpenAI Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "text-embedding-ada-002",
    ...
  }
}

model: (Required, String) The OpenAI model to use
user: (Required, String) The unique identifier representing the end-user
input: (Required, Array of Strings) The list of input texts to be processed

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

{
  "openai": {
    "embeddings": [
      {
        "embedding": [ ... ],
        "index": 0
      },
      {
        "embedding": [ ... ],
        "index": 1
      }
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {
      "promptTokens": 4,
      "totalTokens": 4
    }
  }
}
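Each entry in `embeddings` carries an `index` that can be used to pair the vector with its input text. The Python sketch below does this pairing. It assumes that `index` refers to the position of the text in the `input` array, and the helper name and the short vectors are made up for illustration.

```python
def embeddings_by_input(inputs, data):
    """Map each input text to its vector using the index field of each entry."""
    entries = data["openai"]["embeddings"]
    if len(entries) != len(inputs):
        raise ValueError("expected one embedding per input")
    # Assumption: index is the position of the text in the input array.
    return {inputs[entry["index"]]: entry["embedding"] for entry in entries}


inputs = ["first text", "second text"]
data = {
    "openai": {
        "embeddings": [
            {"embedding": [0.3, 0.4], "index": 1},
            {"embedding": [0.1, 0.2], "index": 0},
        ]
    }
}

vectors = embeddings_by_input(inputs, data)
```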

Opensearch

Uses the Opensearch integration to send requests to the Opensearch API. It supports multiple actions for common operations such as search, but also provides a mechanism to send raw OpenSearch queries.

Action: autocomplete

Processor that executes a completion suggester query.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "autocomplete",
    "index": "my-index",
    "text": "#{ data('my/query') }",
    "field": "content",
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to search
text: (Required, String) The text to autocomplete
field: (Required, String) The field where to search
skipDuplicates: (Optional, Boolean) Whether to skip duplicate suggestions
size: (Optional, Integer) The amount of suggestions

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: fetch

Processor that executes a GET request to retrieve a specified JSON document from an index.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "fetch",
    "index": "my-index",
    "id": "document-ID",
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to search
id: (Required, String) The ID of the document
fields: (Optional, Projection) The source fields to be included or excluded

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: knn

Processor that executes an Approximate k-NN query.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "knn",
    "index": "my-index",
    "field": "vector-field",
    "vector": "#{ data('my/vector') }",
    "minScore": 0.92,
    "maxResults": 5,
    "k": 5,
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to search
field: (Required, String) The field with the vector
vector: (Required, Array of Float) The source vector to compare
minScore: (Required, Double) The minimum score for results
maxResults: (Required, Integer) The maximum number of results
k: (Required, Integer) The number of neighbours the search of each graph will return
query: (Optional, Object) The query to filter in addition to the kNN search

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: native

Processor that executes a native Opensearch query.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "native",
    "path": "/my-index/_doc/1",
    "method": "GET",
    ...
  },
  "server": <Opensearch Server>,
  ...
}

path: (Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection
method: (Required, String) The HTTP method for the request
queryParams: (Optional, Map of String/String) The map of query parameters for the URL
body: (Optional, Object) The JSON body to submit

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: search

Processor that executes a match query on the index.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "search",
    "index": "my-index",
    "text": "#{ data('my/query') }",
    "field": "content",
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to search
text: (Required, String) The text to search
field: (Required, String) The field where to search
suggest: (Optional, Object) The suggester to apply
aggregations: (Optional, Map of String/Object) The field with the aggregations to apply
filter: (Optional, DSL Filter) The filters to apply
highlight: (Optional, Object) The highlighter to apply
pageable: (Optional, Pagination) The pagination parameters

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: store

Processor that stores or updates documents in the given index of Opensearch.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "store",
    "index": "my-index",
    "id": "document-id",
    "document": {
      ...
    },
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to store the document
id: (Required, String) The ID of the document to be stored.
document: (Required, Object) The document to be stored
allowOverride: (Optional, Boolean) Whether the document can be overridden or not. Defaults to false

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}

Action: vector

Processor that executes an Exact kNN with scoring script query.

{
  "type": "opensearch",
  "name": "My Opensearch Processor Action",
  "config": {
    "action": "vector",
    "index": "my-index",
    "field": "my_vector_field",
    "vector": "#{ data('my/vector') }",
    "minScore": 0.92,
    "maxResults": 5,
    "query": {
      ...
    },
    ...
  },
  "server": <Opensearch Server>,
  ...
}

index: (Required, String) The index where to search
field: (Required, String) The field with the vector
vector: (Required, Array of Float) The source vector to compare
minScore: (Required, Double) The minimum score for results
maxResults: (Required, Integer) The maximum number of results
function: (Required, String) The function used for the k-NN calculation. The available functions are listed in the Opensearch k-NN documentation
query: (Optional, Object) The query to apply together with the vector search

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "opensearch": {
    ...
  }
}
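Conceptually, an exact k-NN search scores every candidate document against the source vector instead of consulting an approximate index. The Python sketch below illustrates that idea together with the `minScore` and `maxResults` parameters. It is not how Opensearch computes scores: plain cosine similarity stands in for whatever `function` is configured, the order in which the threshold and the limit are applied is an assumption, and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Plain cosine similarity, used here purely for illustration."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def exact_knn(docs, field, vector, min_score, max_results):
    """Score every document, drop low scores and keep the best max_results."""
    scored = [(cosine(doc[field], vector), doc["id"]) for doc in docs if field in doc]
    kept = [(score, doc_id) for score, doc_id in scored if score >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in kept[:max_results]]


docs = [
    {"id": "a", "my_vector_field": [1.0, 0.0]},
    {"id": "b", "my_vector_field": [0.0, 1.0]},
    {"id": "c", "my_vector_field": [0.9, 0.1]},
]

hits = exact_knn(docs, "my_vector_field", [1.0, 0.0], min_score=0.92, max_results=5)
```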

Question Detector

The Question Detector component checks whether an input text contains a question. Languages are referenced by their ISO-639-1 (alpha-2) codes.

Processor Action: process

Processor that detects if the provided text is a question.

{
  "type": "question-detector",
  "name": "My Question Detector Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

text

(Required, String) The text to be evaluated.

language

(Required, String) The language to use.

questionPrefixes

(Optional, Map of String/List) Words that indicate a question. Defaults to { "en": [ "what", "who", "why", "where", "when", "how" ] }

Details

{
  "questionPrefixes": {
    "es": [
      "que",
      "quien",
      "porque",
      "donde",
      "cuando",
      "como"
    ]
  },
  ...
}

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "isQuestion": {
    ...
  }
}

Script

Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging.

Processor Action: process

Processor that executes a script to process and interact with data produced in previous states.

{
  "type": "script",
  "name": "My Script Processor Action",
  "config": {
    "action": "process",
    "script": <Script>,
    ...
  }
}

language: (Optional, String) The language of the script. One of the supported script languages. Defaults to groovy
script: (Required, String) The script to run

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "script": {
    ...
  }
}

Facet Snap

Tries to snap facet values based on a list of tokens extracted from the user query. These facet snaps are returned as a Filter (see Filters DSL) that can later be used as clauses in the query sent to the search engine.

Action: filter

Processor that creates a filter based on the facet values snapped using the query tokens provided as input.

{
  "type": "snap",
  "name": "My Snap Filter Action",
  "config": {
    "action": "filter",
    "query": "#{ data(\"/httpRequest/queryParams/q\") }",
    "tokens": "#{ data(\"/tokens\") }",
    ...
  },
  ...
}

tokens

(Required, Array of Strings) The list of tokens to snap to

query

(Required, String) The search query to use

facetStore

(Required, String) The Discovery Staging bucket to get the facets from

Details

The facets stored on the bucket are expected to have the following format:

{
  "name": "name",
  "value": "value",
  "properties": {}
}

name: (Required, String) The name of the facet
value: (Required, String) The value of the facet
properties: (Optional, Object) The facet properties. Useful to store additional information for the facet.

snapMode

(Optional, String) The mode to compare facets when snapping

Details

QUERY: The facets will be matched against the input query text

TOKENS: The facets will be matched against the input tokens, separated by whitespace. This is useful if you are applying any processing to the tokens

includeFacets

(Optional, Array of Strings) A list of facets to include when snapping

excludeFacets

(Optional, Array of Strings) A list of facets to ignore when snapping

matchAllFacets

(Optional, Boolean) If true, the returned Filter will match all facet fields using and. If false, the returned Filter will match any facet field using or. Defaults to false

greedyMatch

(Optional, Boolean) If true, snap to the biggest possible facet for each token only, preventing any overlap between matches. If false, snap to every possible facet for each token, allowing overlapped matches. Defaults to false

maxDisambiguateOffset

(Optional, Integer) The maximum offset size to check when disambiguating. If -1 checks all tokens available on both sides. Defaults to -1

Tip	Input tokens for this action can be retrieved using the Tokenizer component.

Tip	For faster query responses from the facet store, create indices for both `name` and `value` fields.

The response of the action is stored in the JSON Data Channel. Besides the filter, the Snap Filter action also outputs the snapped facet objects and the query ngrams that matched them, for later use as input to other actions.

{
  "snap": {
    "snappedFacets": [
      {
        "facet": { "name": "brand", "value": "nike", "properties": { "code": 123 } },
        "ngram": {
          "value": "nike",
          "offset": { "start": 7, "end": 11 },
          "tokens": [
            { "term": "nike", "offset": { "start": 7, "end": 11 } }
          ]
        }
      },
      {
        "facet": { "name": "size", "value": "7" },
        "ngram": {
          "value": "7",
          "offset": { "start": 5, "end": 6 },
          "tokens": [
            { "term": "size", "offset": { "start": 0, "end": 4 } },
            { "term": "7", "offset": { "start": 5, "end": 6 } }
          ]
        }
      }
    ],
    "filter": {
      "or": [
        { "in": { "field": "size", "values": [ "7" ] } },
        { "in": { "field": "brand", "values": [ "nike" ] } }
      ]
    }
  }
}

Note	The resulting snapped facets are ordered by ngram size, descending. If two ngrams have the same number of tokens they are ordered by appearance.
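The relation between `snappedFacets`, `matchAllFacets` and the resulting `filter` can be sketched in Python as follows. This is an illustration of the documented output shape only. The `build_filter` helper is hypothetical, and the order of the clauses it produces may differ from the order returned by the action.

```python
def build_filter(snapped_facets, match_all_facets=False):
    """Group snapped values by facet name and combine them with and/or."""
    values_by_field = {}
    for snapped in snapped_facets:
        facet = snapped["facet"]
        values = values_by_field.setdefault(facet["name"], [])
        if facet["value"] not in values:
            values.append(facet["value"])
    clauses = [
        {"in": {"field": field, "values": values}}
        for field, values in values_by_field.items()
    ]
    # matchAllFacets=true combines the clauses with "and", otherwise "or".
    return {"and" if match_all_facets else "or": clauses}


snapped = [
    {"facet": {"name": "brand", "value": "nike"}},
    {"facet": {"name": "size", "value": "7"}},
]

any_facet = build_filter(snapped)
all_facets = build_filter(snapped, match_all_facets=True)
```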

Action: mask

Processor that creates a masked query based on the snap results of the Snap Filter Action. It replaces facet matches (both name and value) with a given map of facet masks in the input query.

{
  "type": "snap",
  "name": "My Snap Mask Filter Action",
  "config": {
    "action": "mask",
    "query": "#{ data(\"/httpRequest/queryParams/q\") }",
    "tokens": "#{ data(\"/tokens\") }",
    "snappedFacets": "#{ data(\"/snap/snappedFacets\") }",
    ...
  },
  ...
}

Note	Note how `snappedFacets` reads from the output of a previously executed Snap Filter Action.

tokens: (Required, Array of Strings) The list of tokens to snap to
query: (Required, String) The search query to use

snappedFacets

(Required, List of Objects) The list of facets that matched an ngram value

Details

[
  {
    "facet": {
      "name": "name",
      "value": "value",
      "properties": {}
    },
    "ngram": {
      "value": "value",
      "offset": { "start": 0, "end": 7 },
      "tokens": [ { "term": "term", "offset": { "start": 0, "end": 7 } }, ... ]
    }
  }
]

facet

(Required, Object) The snapped facet value

Details

name: (Required, String) The name of the facet
value: (Required, String) The value of the facet
properties: (Optional, Object) The facet properties. Useful to store additional information for the facet.

ngram

(Required, Object) The snapped ngram value

Details

value

(Required, String) The ngram value

offset

(Required, Object) The ngram query offset

Details

start: (Required, Integer) The offset start index
end: (Required, Integer) The offset end index

tokens

(Required, Array of Objects) The tokens that are part of the ngram

Details

term

(Required, String) The term for this token

offset

(Required, Object) The token offset

Details

start: (Required, Integer) The offset start index
end: (Required, Integer) The offset end index

entityMasks

(Optional, Object) Masks to apply to the given facets

Details

Each entry of the entityMasks object must be a pair of strings:

{
  "size": "[SIZE]",
  "brand": "[BRAND]",
  ...
}

The response of the action is stored in the JSON Data Channel as:

{
  "snap": "[SIZE] [BRAND] sneakers"
}
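The masking step can be sketched in Python using the token offsets of each snapped ngram. The sketch assumes the input query was "size 7 nike sneakers", which is consistent with the offsets of the Snap Filter sample output but is not stated by it. The `mask_query` helper is hypothetical and does not handle overlapping matches.

```python
def mask_query(query, snapped_facets, entity_masks):
    """Replace each snapped span in the query with the mask of its facet."""
    spans = []
    for snapped in snapped_facets:
        mask = entity_masks.get(snapped["facet"]["name"])
        if mask is None:
            continue  # No mask configured for this facet: leave the text as is.
        # Cover every token of the ngram, so both name and value are masked.
        offsets = [token["offset"] for token in snapped["ngram"]["tokens"]]
        start = min(offset["start"] for offset in offsets)
        end = max(offset["end"] for offset in offsets)
        spans.append((start, end, mask))
    # Replace from right to left so earlier offsets stay valid.
    for start, end, mask in sorted(spans, reverse=True):
        query = query[:start] + mask + query[end:]
    return query


SNAPPED = [
    {
        "facet": {"name": "brand", "value": "nike"},
        "ngram": {"tokens": [{"term": "nike", "offset": {"start": 7, "end": 11}}]},
    },
    {
        "facet": {"name": "size", "value": "7"},
        "ngram": {
            "tokens": [
                {"term": "size", "offset": {"start": 0, "end": 4}},
                {"term": "7", "offset": {"start": 5, "end": 6}},
            ]
        },
    },
]

masked = mask_query(
    "size 7 nike sneakers", SNAPPED, {"size": "[SIZE]", "brand": "[BRAND]"}
)
```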

Action: clear

Processor that creates a simplified query based on the snap results of the Snap Filter Action. It removes facet matches (both name and value) from the input tokens and joins the remaining tokens with whitespace.

{
  "type": "snap",
  "name": "My Snap Clear Action",
  "config": {
    "action": "clear",
    "tokens": "#{ data(\"/tokens\") }",
    "snappedFacets": "#{ data(\"/snap/snappedFacets\") }"
  }
}

Note	Note how `snappedFacets` reads from the output of a previously executed Snap Filter Action.

tokens

(Required, Array of Strings) The list of tokens to snap to

snappedFacets

(Required, List of Objects) The list of facets that matched an ngram value

Details

[
  {
    "facet": {
      "name": "name",
      "value": "value",
      "properties": {}
    },
    "ngram": {
      "value": "value",
      "offset": { "start": 0, "end": 7 },
      "tokens": [ { "term": "term", "offset": { "start": 0, "end": 7 } }, ... ]
    }
  }
]

facet

(Required, Object) The snapped facet value

Details

name: (Required, String) The name of the facet
value: (Required, String) The value of the facet
properties: (Optional, Object) The facet properties. Useful to store additional information for the facet.

ngram

(Required, Object) The snapped ngram value

Details

value

(Required, String) The ngram value

offset

(Required, Object) The ngram query offset

Details

start: (Required, Integer) The offset start index
end: (Required, Integer) The offset end index

tokens

(Required, Array of Objects) The tokens that are part of the ngram

Details

term

(Required, String) The term for this token

offset

(Required, Object) The token offset

Details

start: (Required, Integer) The offset start index
end: (Required, Integer) The offset end index

The response of the action is stored in the JSON Data Channel as:

{
  "snap": "sneakers"
}
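The clearing step can be sketched in Python as dropping every token that belongs to a snapped ngram and joining the rest with whitespace. The token list below is an assumed input, and the `clear_query` helper is hypothetical. Matching by term alone is a simplification: a repeated term elsewhere in the query would be removed as well.

```python
def clear_query(tokens, snapped_facets):
    """Drop every token that belongs to a snapped ngram and join the rest."""
    snapped_terms = {
        token["term"]
        for snapped in snapped_facets
        for token in snapped["ngram"]["tokens"]
    }
    return " ".join(token for token in tokens if token not in snapped_terms)


snapped = [
    {"facet": {"name": "brand", "value": "nike"},
     "ngram": {"tokens": [{"term": "nike"}]}},
    {"facet": {"name": "size", "value": "7"},
     "ngram": {"tokens": [{"term": "size"}, {"term": "7"}]}},
]

cleared = clear_query(["size", "7", "nike", "sneakers"], snapped)
```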

Solr

Uses the Solr integration to send requests to Solr.

Action: native

Processor that executes a native indexing query.

{
  "type": "solr",
  "name": "My Solr Processor Action",
  "config": {
    "action": "native",
    "path": "/select",
    "method": "POST",
    "queryParams": {
      "q": "description:Pureinsights"
    },
    ...
  },
  "server": <Solr Server>,
  ...
}

path: (Required, String) The Solr operation path to be used for the request
method: (Required, String) The HTTP method for the request
queryParams: (Required, Map of String/String) The map of query parameters for the request
body: (Optional, Object) The JSON body to submit for the request
maxResponseMapDepth: (Optional, Integer) The maximum depth for response object deserialization. Defaults to 5

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "solr": {
    ...
  }
}

Action: search

Processor that executes a standard search query.

{
  "type": "solr",
  "name": "My Solr Processor Action",
  "config": {
    "action": "search",
    "query": "#{ data('my/query') }",
    ...
  },
  "server": <Solr Server>,
  ...
}

query: (Required, String) The search query to be executed
fields: (Optional, Array of Strings) The optional returned fields of the document. If not set, all the fields in the document are returned
highlight: (Optional, Boolean) Whether to enable highlighting in the resulting query or not
filterQueries: (Optional, String) The filter queries to be applied to the search
maxResponseMapDepth: (Optional, Integer) The maximum depth for response object deserialization. Defaults to 5

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "solr": {
    ...
  }
}

Staging

Interacts with buckets and content from Discovery Staging.

Action: fetch

Gets a document from the given bucket.

{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "fetch",
    "bucket": "my-bucket",
    "id": "my-document-id",
    ...
  }
}

bucket: (Required, String) The bucket name
id: (Required, String) The ID of the document to fetch
fields: (Optional, Projection) The projection to apply on the document

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "staging": {
    ...
  }
}

Action: store

Stores a document into the given bucket.

{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "store",
    "bucket": "my-bucket",
    "document": {
      ...
    },
    ...
  }
}

bucket: (Required, String) The bucket name
document: (Required, Object) The document to store
id: (Optional, String) The ID of the document to store. If not provided, a random UUID will be used
allowOverride: (Optional, Boolean) Whether to allow overriding an existing document or not. Defaults to false

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "staging": {
    ...
  }
}

Action: search

Searches for documents in the given bucket.

{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "search",
    "bucket": "my-bucket",
    ...
  }
}

bucket

(Required, String) The bucket name

actions

(Required, Array of Strings) The actions to filter the documents. Defaults to STORE

projection

(Optional, Projection) The projection to apply on the search

filter

(Optional, DSL Filter) The filter to apply on the search

parentId

(Optional, String) The parent ID to match

pageable

(Optional, Object) The pagination object

Details

{
  "page": 0,
  "size": 25,
  "sort": [
    ...
  ]
}

page

(Integer) The page number

size

(Integer) The size of the page

sort

(Array of Objects) The sort definition for the page

Details

{
  "property" : "fieldA",
  "direction" : "ASC"
}

property: (String) The property to sort by
direction: (String) The direction of the sort. Either ASC or DESC

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "staging": {
    ...
  }
}

Template

Uses the Template Engine to transform a standard template with contextual structured data, generating a verbalized representation of the information. It can generate various types of documents as either plain text or JSON.

Processor Action: process

Processor that processes the provided template with the defined configuration.

{
  "type": "template",
  "name": "My Template Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

template

(Required, String) The template to process

bindings

(Required, Object) The bindings to replace in the template

Details

{
  "bindingA": "#{ data('/my/binding/field') }",
  ...
}

Can be later referenced in a template:

My bindingA value is ${bindingA}

outputFormat

(Optional, String) The output format of the processed template. Supported formats are: JSON and PLAIN. Defaults to PLAIN

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "template": {
    ...
  }
}
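As a rough illustration of how bindings are substituted into a template, the following Python sketch mimics the ${bindingA} replacement shown above using the standard library's string.Template. It is only an analogy: the render helper is hypothetical, and the actual Template Engine may support far richer syntax than plain placeholder substitution.

```python
from string import Template


def render(template: str, bindings: dict) -> str:
    """Replace ${name} placeholders with the matching binding values."""
    # substitute() raises KeyError if the template references a missing binding.
    return Template(template).substitute(bindings)


result = render("My bindingA value is ${bindingA}", {"bindingA": "some value"})
print(result)  # My bindingA value is some value
```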

Tokenizer

Tokenizes a specified text input using Lucene. Supported analyzers:

Note	If no configuration is specified, every analyzer is used as it is built by default. The only exception is the custom analyzer, which always needs the tokenizer configuration. Further configuration can be specified under the field `analyzer`.

Note	Currently, the custom analyzer does not support filter parameters that require a file name. For example, the stop filter, which relies on an external file to specify the stop words, is not yet supported.

Processor Action: process

Processor that tokenizes any entry provided.

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    ...
  }
}

analyzer

(Optional, String or Map of String/Object) The analyzer to use for the tokenization. Defaults to standard.

Details

Custom analyzer configuration

tokenizer

(Required, String or Map of String/Object) Tokenizer for the custom analyzer. Params of the tokenizer can be configured.

Details

{
  "analyzer": {
    "type": "custom",
    "tokenizer": "whitespace",
    ...
  },
  ...
}

{
  "analyzer": {
    "type": "custom",
    "tokenizer": {
      "type": "standard",
      "maxTokenLength": 4
    },
    ...
  },
  ...
}

filters

(Optional, List of Strings or Objects) List of filters to be applied. Params of the token filters can be configured.

Details

{
  "analyzer": {
    "type": "custom",
    "filters": [
      "lowercase"
    ],
    ...
  },
  ...
}

{
  "analyzer": {
    "type": "custom",
    "filters": [
      "lowercase",
      {
        "type": "edgeNgram",
        "minGramSize": 2,
        "maxGramSize": 3
      }
    ],
    ...
  },
  ...
}
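To make the effect of such a filter chain more concrete, here is a small Python sketch of what a lowercase filter followed by an edge n-gram filter with minGramSize 2 and maxGramSize 3 conceptually produces. It is an approximation written for illustration only: the function names are hypothetical, it does not call Lucene, and the real filters may differ in details such as how tokens shorter than the minimum gram size are handled.

```python
def lowercase(tokens):
    """Lowercase every token, like a lowercase token filter would."""
    return [token.lower() for token in tokens]


def edge_ngrams(tokens, min_gram, max_gram):
    """Emit the leading prefixes of each token, from min_gram to max_gram characters."""
    grams = []
    for token in tokens:
        for size in range(min_gram, max_gram + 1):
            # Prefixes longer than the token itself are skipped.
            if size <= len(token):
                grams.append(token[:size])
    return grams


tokens = edge_ngrams(lowercase(["INJURED", "paw"]), 2, 3)
print(tokens)  # ['in', 'inj', 'pa', 'paw']
```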

Language analyzers configuration

stopwords

(Optional, List of Strings or Map of String/Object) A set of common words usually not useful for search.

Details

{
  "analyzer": {
    "type": "english",
    "stopwords":{
      "tokens": [
        "the"
      ],
      "ignoreCase": true
    },
    ...
  },
  ...
}

{
  "analyzer": {
    "type": "french",
    "stopwords": [
      "va"
    ],
    ...
  },
  ...
}

stemExclusion

(Optional, List of Strings or Map of String/Object) A set of words to not be stemmed.

Details

{
  "analyzer": {
    "type": "spanish",
    "stemExclusion":{
      "tokens": [
        "nunca"
      ],
      "ignoreCase": true
    },
    ...
  },
  ...
}

{
  "analyzer": {
    "type": "english",
    "stemExclusion": [
      "quick"
    ],
    ...
  },
  ...
}

Standard analyzer configuration

maxTokenLength

(Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.

stopwords

(Optional, List of Strings or Map of String/Object) A set of common words usually not useful for search.

Details

{
  "analyzer": {
    "type": "standard",
    "stopwords":{
      "tokens": [
        "the"
      ],
      "ignoreCase": true
    },
    ...
  },
  ...
}

{
  "analyzer": {
    "type": "standard",
    "stopwords": [
      "va"
    ],
    ...
  },
  ...
}

Whitespace analyzer configuration

maxTokenLength: (Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.

attributes

(Optional, List of strings) The attributes to include with each token. Supports term (adds the token itself) and offset (adds the relative start and end position of the token in the input text). Default ["term", "term"]

Note	The attributes are specified in the configuration as a list, and every attribute in that list is added to the output. If specified, the list cannot be empty.

text: (Required, String) The text to tokenize.

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "tokens": {
    ...
  }
}

Examples

Default Configuration example

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}

Whitespace Analyzer with Term Attribute

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": "whitespace",
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/body/custom/field\") }"
  }
}

Advanced Configuration Analyzer

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": {
      "type": "english",
      "stopwords":{
        "tokens": [
          "the"
        ],
        "ignoreCase": true
      },
      "stemExclusion": [
        "quick"
      ]
    },
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}

Advanced Configuration Max Token Length

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": {
      "type": "whitespace",
      "maxTokenLength": 4
    },
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}

Simple custom analyzer

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "Hi, my cat is INJURED in its paw.",
    "analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filters": [
        "lowercase"
      ]
    },
    "attributes": [
      "term"
    ]
  }
}

Custom analyzer with parameters

{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "Hi, my cat is INJURED in its paw.",
    "analyzer": {
      "type": "custom",
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 4
      },
      "filters": [
        "lowercase",
        {
          "type": "edgeNgram",
          "minGramSize": 2,
          "maxGramSize": 3
        }
      ]
    },
    "attributes": [
      "term"
    ]
  }
}
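The "Simple custom analyzer" example above (whitespace tokenizer plus lowercase filter) can be approximated in a few lines of Python to preview the terms it is likely to yield. This is a sketch, not the Lucene implementation: the whitespace_lowercase function and the shape of the returned dictionaries are assumptions made for illustration, and the actual response format of the action is not shown in this guide.

```python
import re


def whitespace_lowercase(text):
    """Split on whitespace, lowercase each token, and keep its character offsets."""
    return [
        {"term": match.group().lower(), "start": match.start(), "end": match.end()}
        for match in re.finditer(r"\S+", text)
    ]


tokens = whitespace_lowercase("Hi, my cat is INJURED in its paw.")
print([token["term"] for token in tokens])
# ['hi,', 'my', 'cat', 'is', 'injured', 'in', 'its', 'paw.']
```

Note that a whitespace tokenizer keeps punctuation attached to the tokens, which is why "hi," and "paw." appear with their trailing characters.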

Vespa

Uses the Vespa integration to send requests to a Vespa service.

Action: native

Processor that executes an HTTP request to a Vespa service.

{
  "type": "vespa",
  "name": "My Vespa Native Action",
  "config": {
    "action": "native",
    "method": "GET",
    "path": "/state/v1/health",
    ...
  },
  "server": <Vespa Server>
}

method: (Required, String) The HTTP method for the request
path: (Required, String) The endpoint of the request, excluding scheme, host, port and any path included as part of the connection
queryParams: (Optional, Map of String/String) The map of query parameters for the URL
body: (Optional, Object) The JSON body to submit

The response of the action is stored in the JSON Data Channel as returned by the invoked API:

{
  "vespa": {
    ...
  }
}
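Since path excludes the scheme, host, port and any path that is already part of the connection, it can help to picture how the final URL is put together. The following Python sketch shows one plausible composition. The build_url helper and the base URL are hypothetical, and the way the platform actually joins these parts is an assumption.

```python
from urllib.parse import urlencode


def build_url(base_url, path, query_params=None):
    """Join a connection base URL with the action path and optional query parameters."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    if query_params:
        url += "?" + urlencode(query_params)
    return url


# "http://vespa.example:8080" is a placeholder for the address configured in the server.
print(build_url("http://vespa.example:8080", "/state/v1/health"))
# http://vespa.example:8080/state/v1/health
print(build_url("http://vespa.example:8080", "/search/", {"query": "banana", "hits": "5"}))
# http://vespa.example:8080/search/?query=banana&hits=5
```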

Voyage AI

Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service.

Action: reranking

Processor that, given a query and a list of documents, returns the relevance ranking of the documents with respect to the query. See Voyage AI Rerankers and the API Rerankers endpoint.

{
  "type": "voyage-ai",
  "name": "My Reranking Action",
  "config": {
    "action": "reranking",
    "model": "rerank-lite-1",
    "query": "Sample query",
    "documents": ["Sample document 1", "Sample document 2"],
    ...
  },
  "server": <Voyage AI Server>,
  ...
}

model: (Required, String) The model to use for the request. See models.
query: (Required, String) The query as a string.
documents: (Required, List of Strings) Documents to be reranked as a list of strings.
truncation: (Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.
topK: (Optional, Integer) The number of most relevant documents to return.
returnDocuments: (Optional, Boolean) Whether to return the documents in the response. Defaults to false.

Action: embeddings

Processor that, given an input string (or a list of strings) and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.

{
  "type": "voyage-ai",
  "name": "My Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "voyage-large-2",
    "input": ["Sample text 1", "Sample text 2"],
    ...
  },
  "server": <Voyage AI Server>,
  ...
}

model: (Required, String) The model to use for the request. See models.
input: (Required, String or List of Strings) Documents to be embedded.
truncation: (Optional, Boolean) Whether to truncate the input texts to fit within the context length. Defaults to true.
inputType: (Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.
outputDimension: (Optional, Integer) The number of dimensions for resulting output embeddings. Defaults to null.
outputDatatype: (Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.
encodingFormat: (Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.

Action: multimodal-embeddings

{
  "type": "voyage-ai",
  "name": "My Multimodal Embeddings Action",
  "config": {
    "action": "multimodal-embeddings",
    "model": "voyage-multimodal-3",
    "inputs": [
      {
        "content": [
          {
            "type": "text",
            "text": "This is a banana."
          },
          {
            "type": "image_url",
            "imageUrl": "https://myimageurl.com"
          },
          ...
        ]
      },
      ...
    ],
    ...
  },
  "server": <Voyage AI Server>,
  ...
}

model

(Required, String) The model to use for the request. See models.

inputs

(Required, List of Objects) A list of multimodal inputs to be vectorized.

Details

type

(Required, String) The type. One of: text, image_url or image_base64.

text

(Optional, String) The text, if the type text is chosen.

Details

{
  "type": "text",
  "text": "This is a banana."
}

imageUrl

(Optional, String) The image URL, if the type image_url is chosen.

Details

{
  "type": "image_url",
  "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
}

imageBase64

(Optional, Object) The Base64-encoded image, if the type image_base64 is chosen.

Details

{
  "type": "image_base64",
  "imageBase64": {
      "mediaType": "image/jpeg",
      "base64": true,
      "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)"
  }
}

mediaType: (Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.
base64: (Required, Boolean) Whether the data is encoded in Base64.
data: (Required, String) The data itself.
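When the image is available as raw bytes, the image_base64 content item can be assembled programmatically. The sketch below is a minimal Python illustration under stated assumptions: the image_content helper is hypothetical, and it only builds the object described above without sending anything to the service.

```python
import base64

SUPPORTED_MEDIA_TYPES = {"image/png", "image/jpeg", "image/webp", "image/gif"}


def image_content(image_bytes: bytes, media_type: str) -> dict:
    """Build an image_base64 content item from raw image bytes."""
    if media_type not in SUPPORTED_MEDIA_TYPES:
        raise ValueError(f"Unsupported media type: {media_type}")
    return {
        "type": "image_base64",
        "imageBase64": {
            "mediaType": media_type,
            "base64": True,
            "data": base64.b64encode(image_bytes).decode("ascii"),
        },
    }


# The first four bytes of a JPEG file, used here only as stand-in data.
item = image_content(b"\xff\xd8\xff\xe0", "image/jpeg")
print(item["imageBase64"]["data"])  # /9j/4A==
```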

truncation

(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputEncoding

(Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.

Labeling a Configuration

Labels API

Create

$ curl --request POST 'core-api:8080/v2/label' --data '{ ... }'

Body

{
  "key": "My Label Key",
  "value": "My Label Value"
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Get All

$ curl --request GET 'core-api:8080/v2/label?page={page}&size={size}&sort={sort}'

Query Parameters

page: (Optional, Int) The page number. Defaults to 0.
size: (Optional, Int) The size of the page. Defaults to 25.
sort: (Optional, Array of String) The sort definition for the page.

Get One

$ curl --request GET 'core-api:8080/v2/label/{id}'

Path Parameters

id: (Required, String) The label ID.

Update

$ curl --request PUT 'core-api:8080/v2/label/{id}' --data '{ ... }'

Path Parameters

id: (Required, String) The label ID.

Body

{
  "key": "My Label Key",
  "value": "My Label Value"
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label

Delete

$ curl --request DELETE 'core-api:8080/v2/label/{id}'

Path Parameters

id: (Required, String) The label ID.

Note	Both `key` and `value` properties will be trimmed.

Labels are simple key/value pairs that help to reference user configurations. Any configuration can be tagged with labels, either previously created here or created during the CRUD process of the entity itself. Both the key and the value of a label are limited to a maximum of 45 characters.

Note	When creating multiple labels during the CRUD process of other entities (e.g. a server or a credential), duplicates will be ignored.

To create a new label directly from an entity configuration, the following property must be included as part of the body payload:

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}

key: (Required, String) The key of the label
value: (Required, String) The value of the label
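The rules stated in this section (trimming, the 45-character limit, and ignoring duplicates) can be mirrored in a client-side check before a payload is submitted. The Python sketch below is illustrative only: normalize_labels is a hypothetical helper, and it assumes that the length limit applies after trimming and that a duplicate means the same key and value, neither of which is confirmed by this guide.

```python
MAX_LENGTH = 45  # limit stated in this section for both key and value


def normalize_labels(labels):
    """Trim keys and values, reject over-long entries, and drop duplicates."""
    seen = set()
    result = []
    for label in labels:
        key, value = label["key"].strip(), label["value"].strip()
        if len(key) > MAX_LENGTH or len(value) > MAX_LENGTH:
            raise ValueError(f"Label exceeds {MAX_LENGTH} characters: {key!r}")
        # Assumption: two labels are duplicates when key and value both match.
        if (key, value) not in seen:
            seen.add((key, value))
            result.append({"key": key, "value": value})
    return result


labels = normalize_labels([
    {"key": " env ", "value": "prod"},
    {"key": "env", "value": "prod "},
])
print(labels)  # [{'key': 'env', 'value': 'prod'}]
```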

Backup & Restore

Core Backup API

Export entities

$ curl --request GET 'core-api:8080/v2/export'

Import entities

$ curl --request POST 'core-api:8080/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'

Query Parameters

onConflict: (Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Queryflow Backup API

Export entities

$ curl --request GET 'queryflow-api:8088/v2/export'

Import entities

$ curl --request POST 'queryflow-api:8088/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'

Query Parameters

onConflict: (Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Ingestion Backup API

Export entities

$ curl --request GET 'ingestion-api:8080/v2/export'

Import entities

$ curl --request POST 'ingestion-api:8080/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'

Query Parameters

onConflict: (Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Each product (Core, QueryFlow, and Ingestion) has its own backup and restore API. The entities are distributed as follows:

Core Entities

Credential
Server

Note	Labels are skipped as they will be handled during the creation of other entities.

Note	Secrets are not part of this process due to security reasons. All credentials assume their referenced secret currently exists or will be created by different means.

Queryflow Entities

Processor
Endpoint

Ingestion Entities

Processor
Pipeline
Seed
Seed-Schedule

The backup and restore of the entities is done through a single export-{timestamp}.zip file that contains a Newline Delimited JSON (ndjson) file per entity type. Each configuration is exported in the correct order, so it can be imported back without missing dependency problems.

Note	Manual modifications of the exported file might corrupt the backup.
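For read-only inspection of an export, the archive can be opened with standard tooling. The following Python sketch parses every entry of a ZIP file as Newline Delimited JSON. The read_export helper and the entry name server.ndjson are hypothetical, since the actual file names inside the archive are not listed in this guide, and the sketch builds its own in-memory archive so it can run without a real export.

```python
import io
import json
import zipfile


def read_export(zip_source):
    """Return {entry name: [parsed JSON objects]} for every file in the archive."""
    entities = {}
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            text = archive.read(name).decode("utf-8")
            # One JSON document per non-empty line.
            entities[name] = [json.loads(line) for line in text.splitlines() if line.strip()]
    return entities


# Build a tiny in-memory archive as stand-in data.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("server.ndjson", '{"name": "A"}\n{"name": "B"}\n')

print(read_export(buffer))  # {'server.ndjson': [{'name': 'A'}, {'name': 'B'}]}
```

The helper only reads the archive and never writes it back, which avoids the kind of manual modification that might corrupt a backup.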

Conflict resolution strategy

Since the ID of each exported entity is expected to remain the same after importing, conflicts might arise. The restore process has 3 different resolution strategies:

IGNORE: The input entity will be ignored, keeping the existing one unchanged.
UPDATE: The current entity will be updated with the input entity values.
FAIL: The current entity will not be modified, and an error will be thrown.
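The three strategies can be pictured with a small in-memory simulation. This Python sketch is a conceptual model, not the platform's restore logic: the restore function is hypothetical, entities are plain dictionaries keyed by ID, and any strategy other than IGNORE or UPDATE is treated as the error-raising one. Whether a real import stops at the first conflict or rolls back earlier changes is not covered here.

```python
def restore(existing, incoming, on_conflict):
    """Apply incoming entities to `existing` (a dict keyed by ID) and return it."""
    for entity in incoming:
        entity_id = entity["id"]
        if entity_id not in existing:
            existing[entity_id] = entity  # no conflict: plain insert
        elif on_conflict == "IGNORE":
            continue  # keep the stored entity unchanged
        elif on_conflict == "UPDATE":
            existing[entity_id] = entity  # overwrite with the imported values
        else:
            raise ValueError(f"Conflict on entity {entity_id}")
    return existing


stored = {"1": {"id": "1", "name": "old"}}
imported = [{"id": "1", "name": "new"}, {"id": "2", "name": "other"}]

print(restore(dict(stored), imported, "IGNORE")["1"]["name"])  # old
print(restore(dict(stored), imported, "UPDATE")["1"]["name"])  # new
```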

Appendix A: Pagination and Sorting

Any endpoint that paginates results receives the following optional query parameters:

page: (Optional, Integer) The page to retrieve. Defaults to 0

Note	If the provided value is invalid, it will be replaced by the default one

size: (Optional, Integer) The size of the page. Must be an integer between 1 and 100. Defaults to 25

Note	If the provided value is invalid or out of range, it will be replaced by the default one

sort: (Optional, String) The sorting fields, with an optional direction. Ascending by default: sort=<string>[,(asc|desc)]

Note	This parameter can be used multiple times: `sort=fieldA&sort=fieldB,desc`
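Because sort can be repeated, the query string is easiest to build from a list of pairs rather than a dictionary. The Python sketch below shows one way to do this with the standard library. The page_query helper is hypothetical, and keeping the comma unescaped is a readability choice rather than a documented requirement.

```python
from urllib.parse import urlencode


def page_query(page=0, size=25, sort=()):
    """Build the pagination query string, repeating `sort` once per entry."""
    params = [("page", page), ("size", size)]
    params += [("sort", entry) for entry in sort]
    # safe="," keeps "fieldB,desc" readable instead of encoding the comma.
    return urlencode(params, safe=",")


print(page_query(sort=["fieldA", "fieldB,desc"]))
# page=0&size=25&sort=fieldA&sort=fieldB,desc
```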

The response of a paginated request is either an empty payload with a 204 - No Content status code, or a 200 - OK with the results page:

{
  "content": [
    {
      ...
    },
    ...
  ],
  "pageable": {
    ...
  },
  "totalSize": 1,
  "totalPages": 1,
  "numberOfElements": 1,
  "pageNumber": 0,
  "empty": false,
  "size": 25,
  "offset": 0
}

content

(Array of Objects) The page content

pageable

(Object) The page request information

Details

{
  "page": 0,
  "size": 25,
  "sort": [
    ...
  ]
}

page

(Integer) The page number

size

(Integer) The size of the page

sort

(Array of Objects) The sort definition for the page

Details

{
  "property" : "fieldA",
  "direction" : "ASC"
}

property: (String) The property where the sort was applied
direction: (String) The direction where the sort was applied. Either ASC or DESC

totalSize

(Integer) The total number of records

totalPages

(Integer) The total number of pages

numberOfElements

(Integer) The number of elements on the returned slice of content

pageNumber

(Integer) The current page number

empty

(Boolean) true if the page has no content

size

(Integer) The size of the returned slice of content

offset

(Integer) The page offset
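The relationship between these fields can be sketched with simple arithmetic. The Python example below assumes that offset equals page times size and that totalPages is the total size divided by the page size, rounded up. These formulas are assumptions that happen to agree with the sample response above, not a documented contract, and page_summary is a hypothetical helper.

```python
import math


def page_summary(total_size, page, size):
    """Derive the page metadata fields from the total size and the page request."""
    offset = page * size
    number_of_elements = max(0, min(size, total_size - offset))
    return {
        "totalSize": total_size,
        "totalPages": math.ceil(total_size / size),
        "numberOfElements": number_of_elements,
        "pageNumber": page,
        "empty": number_of_elements == 0,
        "size": size,
        "offset": offset,
    }


print(page_summary(total_size=1, page=0, size=25))
```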

Appendix B: Date and Time Patterns

In some instances, string patterns may be required to represent dates. These patterns consist of a series of letters and symbols that define the structure the output date should follow. To create them, use the following table of definitions:

Table 28. Date and Time Symbols
Symbol	Meaning
G	The era (i.e. AD)
u	The year
y	The year of the era
D	The day of the year
M	The month of the year
L	The month of the year
d	The day of the month
Q	Quarter of the year
q	Quarter of the year
Y	Week based year
w	Week of the week-based year
W	The week of the month
E	The day of the week
e	The localized day of the week
c	The localized day of the week
F	The week of the month
a	The am/pm of the day
h	The clock hour (1-12)
K	The clock hour (0-11)
k	The clock hour (1-24)
H	The hour of the day (0-23)
m	Minute of the hour
s	Second of the minute
S	The fraction of the second
A	The milliseconds of the day
n	The nanoseconds of the second
N	The nanoseconds of the day
V	The time-zone ID
z	The time-zone name
O	The localized zone offset
X	The zone offset
x	The zone offset
Z	The zone offset
p	Pad the next
'	Escape for text
''	A single quote
[	Start of an optional section
]	End of an optional section

Each symbol may be repeated 'n' consecutive times (e.g. uuuu), which determines whether a short or long form of the representation is used. The definition of these forms varies with the type the symbol represents. The following list shows the basic representations by type and by the number of times 'n' the symbol is repeated:

Text
- n < 4: Abbreviation (e.g. Wed for Wednesday)
- n = 4: Full
- n = 5: Normally one letter (e.g. W for Wednesday)
Number
- n: The number with zero padding for the extra quantity (e.g. 3 -> 001)
  - c, F -> n <= 1
  - d, H, h, K, k, m, and s -> n <= 2
  - D -> n <= 3
Number and Text (Combination of both)
- n >= 3: Seen as a Text
- n < 3: Seen as a Number
Fractions
- n <= 9: The number of fraction digits to output; any remaining digits of the value are truncated
Year
- n = 2: Two numbers (e.g. 23 for 2023)
- n <= 4, n != 2: The full year
ZoneId
- n = 2: Outputs the zone id
Zone names
- 1 <= n <= 3: Short name
- n = 4: Full name
Offset for 'X' and 'x'
- n = 1: Just the hour if minute is zero, otherwise include minute
- n = 2: Hour and minute
- n = 3: Hour and minute with a colon
- n = 4: Hour, minute and second
- n = 5: Hour, minute and second with a colon
Offset for 'O'
- n = 1: Short offset (e.g. GMT+1)
- n = 4: Full offset (e.g. GMT+1:00)
Offset for 'Z'
- n <= 3: Hour and minute
- n = 4: The full offset (e.g. GMT+1:00)
- n = 5: The hour and minute with colon
Pad
- n: Number of the width

For example, the pattern "dd MM:ppppppuuuu" produces the date "30 11:  2023". Literal characters, such as the ':' and the spaces, can be combined with the symbols. Spaces can also be produced with the pad symbol: in the example, the year is padded to a width of 6, which adds 2 spaces between the ':' and the year.
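The padding behaviour in this example can be reproduced with an analogous Python format string, which may help when checking an expected output by hand. This is only an analogy written with Python's own format specifiers. It does not interpret the pattern symbols from the table, and the date used is the one from the example.

```python
from datetime import date

d = date(2023, 11, 30)

# "%d %m" plays the role of "dd MM"; ">6" right-aligns the year in a width
# of 6, which is the effect of the six pad symbols before "uuuu".
formatted = f"{d:%d %m}:{d.year:>6}"
print(repr(formatted))  # '30 11:  2023'
```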

Appendix C: Error Messages

If a request to any API produces an error, a standard response will be returned:

{
  "status": 409,
  "code": 2001,
  "messages": [
    "Duplicate entry for field(s): name"
  ],
  "timestamp": "2023-01-28T01:52:22.117244600Z"
}

Response Body

status: (Integer) The HTTP status code of the response
code: (Integer) The internal error code.

Each code is composed of 2 digits that represent the category and 2 digits that represent the specific error. The categories are listed in the Error Codes section.

messages: (Array of String) An optional list of messages describing the error
timestamp: (Timestamp) The UTC timestamp when the error happened

Error Codes

The error code gives a more specific description of the error. It extends the information provided by the HTTP status code, as the same status could be caused by different problems.

Each code is composed of 2 digits that represent the category, and 2 digits that represent the specific error, where the category can be:

10 - Resources: access to entities or endpoints
20 - Data integrity: entities referencing other entities, or other constraints such as unique keys
30 - Data validation: input data (format, missing fields…)
40 - Execution: problems while invoking an action
70 - Security: access and permissions
80 - Third-party: communication with external services
99 - Others: any other issue

Table 29. Error Codes
Code	Description
1001	The endpoint or HTTP Method is undefined or disabled
1002	The requested bucket is missing
1003	The requested resource is missing
2001	The entity already exists (same name or any other combination of fields defined as unique)
2002	The entity to delete is referenced by other entities
3001	The input data is corrupted
3002	The input data is missing or invalid
3003	The input data is too large
4001	The action could not be executed due to the current state of the system
4002	The action was terminated due to a timeout
4003	The Core DSL expression could not be executed
7001	The action could not be executed due to the permissions of the user
8001	Could not establish connection to an external service
8002	The external service returned an error
9901	Custom user error
9999	Undefined error
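Since the first two digits of a code identify the category and the last two the specific error, a client can split a code with integer division. The Python sketch below illustrates this with the categories listed above. The describe helper and the "Unknown" fallback are hypothetical.

```python
CATEGORIES = {
    10: "Resources",
    20: "Data integrity",
    30: "Data validation",
    40: "Execution",
    70: "Security",
    80: "Third-party",
    99: "Others",
}


def describe(code):
    """Split a 4-digit error code into its category name and specific error number."""
    category, specific = divmod(code, 100)
    return CATEGORIES.get(category, "Unknown"), specific


print(describe(2001))  # ('Data integrity', 1)
print(describe(9999))  # ('Others', 99)
```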

Appendix D: Metrics

Each Discovery component publishes metrics regarding health, performance, and Discovery-specific workloads. This page gives the user a head start in understanding these metrics by highlighting the most commonly used ones, their meaning, and the dimensions each of them has. Keep in mind that most of them have, by default, a time lapse of one minute.

Dimensions

Each metric may have dimensions, which can be used as metric filters. These filters help ensure that only certain published values are taken into account. For example, the component dimension, which most metrics have, can be used to narrow down metrics to a specific component or API, such as the Ingestion Script Component or the QueryFlow API. Below are some of the most commonly used dimensions. There are many more, so it’s recommended to check the dimensions available on each metric of interest.

Dimension	Description
component	The Discovery component that published the metric
cache	Found in cache metrics, it’s the specific type of cache that produced the metric, e.g. script or endpoint
result	Found in metrics that measure an operation, it refers to the operation’s result, such as a job status or a cache hit or miss
seed.id	Found in metrics related to an Ingestion Seed Execution, it indicates the seed’s ID

Common Metrics

These metrics are published by most Discovery Components and are related to more than one product.

Metric Name	Description	Dimensions
jvm.threads.throttling.monitor.count	The number of threads currently being throttled. Mostly used to monitor the throttler service used in Ingestion Components	component
cache.gets.count	The number of times a cache was called in the last time lapse	cache, component, result (can be a hit or a miss)

Ingestion Metrics

These are metrics published by Ingestion Components that help monitor a Seed Execution.

Metric Name	Description	Dimensions
ingestion.session.jobs.value	The number of jobs currently being executed	component, seed.id
ingestion.session.records.count	The number of records collected (i.e. that were processed) in the last time lapse	component, result (the failed result is useful for finding record errors), seed.id
ingestion.jobs.avg	The average time, in milliseconds, that it took to execute jobs completed in the last time lapse	component, result, seed.id
ingestion.jobs.count	The number of jobs completed in the last time lapse	component, result (the failed result can be used to find out), seed.id

More information about default metrics published by all Discovery products can be found in the Micronaut-Micrometer documentation, in the Provided Binders sub-section.