Discovery Reference Guide

Version 2.8.0

The Pureinsights Discovery Platform is a cloud-based family of products designed to help with the creation, maintenance and monitoring of production-ready AI applications.

There is no single recipe that fits all: each collection, each use case and each implementation is different in its own way, and evolving from a prototype to a live solution is not a trivial task. This is where Discovery shows its value:

  • An architecture that follows a pay-as-you-go model for only the resources required by the specific use case.

  • A no-code approach with building blocks configured through finite-state machines, providing flexibility while reducing the hassle of developer-related tasks such as error handling and orchestration of services.

  • Changes to configuration and tuning happen on the fly, without the downtime of redeploying.

  • Data processing pipelines to extract, transform and load collections from different sources (ETL).

  • Data storage as a "push" model alternative to traditional ETL solutions.

  • Custom REST endpoints with advanced capabilities that adapt to the complexities of processing a query with minimal overhead.

  • Observability with standard monitoring and alerting tools.

What’s new in 2.8.0?

MCP Server in Discovery QueryFlow

The Model Context Protocol is now available in Discovery QueryFlow, with support for custom MCP Servers exposed through entrypoints.

New Oracle Database Integration

It is now possible to create servers to connect to an Oracle Database. The new type uses the Oracle JDBC driver to interact with an Oracle database and execute SQL statements. There is also a new Oracle Database component to query tables and retrieve their values from the database.

New SMB Integration

It is now possible to create servers to connect to an SMB server. There is also a new Filesystem component, which allows crawling SMB servers to extract their content.

New Schedules API in Discovery Ingestion

A new Schedules API has been exposed to allow the automatic execution of Seeds based on a cron expression.
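As an illustrative sketch only — the exact path and payload fields of the Schedules API are not documented in this section, and the ones below are assumptions based on the conventions of the other Discovery APIs — creating a schedule that runs a Seed every night at 02:00 could look like:

```shell
# Hypothetical request: the /v2/schedule path and the payload fields
# (name, cron, seed) are assumptions, not the documented contract.
$ curl --request POST 'core-api:12010/v2/schedule' --data '{
  "name": "nightly-crawl",
  "cron": "0 2 * * *",
  "seed": "<Seed ID>"
}'
```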

Staging Component in Discovery Ingestion now supports Incremental Scan

The Ingestion Staging Component now supports Incremental Scan using the Checksum identification mechanism.

New Merge Function in Expression Language

It is now possible to use a merge() function with the Expression Language to merge two maps or arrays.
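As an illustration of the new function (the exact evaluation semantics are an assumption, since the Expression Language reference is outside this section), merging two maps combines their keys, and merging two arrays combines their elements:

```
merge({"a": 1}, {"b": 2})   // a map with the keys of both arguments
merge([1, 2], [3, 4])       // an array with the elements of both arguments
```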

New LDAP Integration

It is now possible to create servers to connect to an LDAP server. There is also a new LDAP component to run searches on an LDAP server and retrieve users and groups from the directory.

New SharePoint Online Ingestion Component

A new Ingestion SharePoint Online Component has been added to crawl Sites, Subsites, Lists and ListItems from SharePoint Online. With the scan action you can retrieve the information and even download the files and attachments from the ListItems.

Breaking changes

New root path for configuring REST Endpoints in Discovery QueryFlow

In order to prepare for new "entrypoints" in Discovery QueryFlow, the configuration of REST Endpoints has moved from the /v2/endpoint root path to /v2/entrypoint/endpoint.
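For example, a request that previously listed the configured REST Endpoints must now target the new root path (assuming the same Core API host used throughout this guide):

```shell
# Before 2.8.0
$ curl --request GET 'core-api:12010/v2/endpoint'

# From 2.8.0 onwards
$ curl --request GET 'core-api:12010/v2/entrypoint/endpoint'
```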

Ingestion Staging Component Scan now only retrieves records with the action STORE

The Ingestion Staging Component now only retrieves records with the action STORE, instead of allowing a choice of which actions to retrieve. The actions field has been removed from the configuration.

Basics

Discovery Products

Discovery is a platform composed of three products:

  • Discovery Ingestion: a distributed ETL, supported by a finite-state machine that represents the data transformation and loading.

  • Discovery Staging: an abstraction for a Document Database, where collections are represented as buckets with an HTTP interface for simple CRUD operations.

  • Discovery QueryFlow: a configurable REST API with custom endpoints, supported by a finite-state machine that represents the query processing.

All products are supported by the Discovery Core Libraries and API: a common layer for shared concerns such as retries, error handling, autoscaling and metrics/logging.

Each independent product has its own value and can exist by itself (although both Discovery Ingestion and Discovery QueryFlow require Discovery Staging for their internal operations). However, using them together brings all the tools to create an end-to-end solution.

Architecture

Discovery Architecture

One of the main goals of Discovery is to be cloud agnostic: using the native resources of each cloud provider without affecting the application itself.

This decision has a direct impact on the overall architecture, as external services must be abstracted in a way that lets them be mapped to a managed service available on each cloud provider.

  • The relational database is the main storage for configurations, metadata and the state of the multiple executions.

  • The document database is the service abstracted by Discovery Staging. The default implementation for all cloud providers is MongoDB Atlas.

  • The object storage supports the file server and data processing of binary files in Discovery Ingestion.

  • The message queue handles asynchronous communication between the components.

  • The secrets manager stores secure information such as passwords and credentials. The default implementation is an internal secrets provider.

  • The search service is an optional addition to Discovery installations that enables features such as full-text search for entities (i.e. /search and /autocomplete endpoints) and advanced solutions such as Search Analytics dashboards. It supports Elasticsearch and OpenSearch.

The Discovery Core Libraries and API is the interface every Discovery product uses to interact with these external services. However, despite this standardization, each independent product is designed around its own needs: Discovery QueryFlow and Discovery Staging are monoliths due to their need for fast responses, while Discovery Ingestion follows an event-driven architecture that targets scalability, with distributed components processing data in parallel.

AWS

Discovery can be integrated with Amazon Web Services (AWS) with managed services that natively support the installation requirements.

The application is deployed using Amazon Elastic Container Service and AWS Lambda in a private subnet, later exposed with Amazon API Gateway and Amazon Route 53.

Other services such as Amazon EventBridge and Amazon CloudWatch support the correct control, autoscaling and monitoring of all components.

Discovery for AWS

Monitoring

All Discovery products constantly publish metrics to a selected monitoring and observability tool.

Integrations

Integrations to external services are represented by Servers, optionally authenticated with a Credential that references an encrypted Secret.

They are re-usable configurations that can later be referenced in Discovery Ingestion Components and Discovery QueryFlow Components.

Integrations

Connecting to an external service

Servers API
Create a new Server
$ curl --request POST 'core-api:12010/v2/server' --data '{ ... }'
List all Servers
$ curl --request GET 'core-api:12010/v2/server'
Get a single Server
$ curl --request GET 'core-api:12010/v2/server/{id}'
Test a Server connection
$ curl --request GET 'core-api:12010/v2/server/{id}/ping'
Update an existing Server
$ curl --request PUT 'core-api:12010/v2/server/{id}' --data '{ ... }'
Note

The type of an existing server can’t be modified.

Delete an existing Server
$ curl --request DELETE 'core-api:12010/v2/server/{id}'
Clone an existing Server
$ curl --request POST 'core-api:12010/v2/server/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Server.

Search for Servers using DSL Filters
$ curl --request POST 'core-api:12010/v2/server/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Servers
$ curl --request GET 'core-api:12010/v2/server/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

A server has the properties to create an authenticated connection to an external service:

{
  "type": "my-external-service",
  "name": "My External Service Configuration",
  "config": {
    ...
  },
  ...
}
type

(Required, String) The type of supported external service.

name

(Required, String) The unique name to identify the external service.

description

(Optional, String) The description for the configuration.

config

(Required, Object) The configuration to connect to the external service.

credential

(Optional, UUID) The ID of the credential to authenticate in the external service.

proxy

(Optional, UUID) The ID of a proxy server entity used to route requests to the external service. Note that the proxy is another server entity.

certificates

(Optional, Object) The custom certificates for encrypted connection (SSL/TLS), loaded using the secret service. The value can be either the string with the secret of the certificate, or a detailed configuration.

Details
{
  "certificates": {
    "sample-a": {
      "type": "X.509",
      "value": "CERTIFICATE_SECRET"
    },
    "sample-b": {
      "value": "CERTIFICATE_SECRET"
    },
    "sample-c": "CERTIFICATE_SECRET",
    "sample-d": {
      "type": "X.509",
      "value": {
        "secret": "CERTIFICATE_SECRET",
        "key": "field"
      }
    },
    "sample-e": {
      "secret": "CERTIFICATE_SECRET",
      "key": "field"
    }
  },
  ...
}
type

(Optional, String) The type of certificate. Defaults to X.509.

value

(Required, String or Object) The secret of the certificate in the secret service. This can be a simple string with the secret name, or an object with the secret name and the key that specifies the field to read from the secret.

Note

The existence of the certificate secret will be verified.

keys

(Optional, Object) The keys with their respective certificate chain for encrypted connection (mTLS), loaded using the secret service.

Details
{
  "keys": {
    "keyA": {
      "value": "SECRET_KEY",
      "certificateChain": [
        {
          "value": "SECRET_CERT0"
        },
        {
          "type": "X.509",
          "value": "SECRET_CERT1"
        },
        {
          "type": "X.509",
          "value": {
            "secret": "SECRET_CERT2",
            "key": "field"
          }
        },
        {
          "secret": "SECRET_CERT3",
          "key": "field"
        },
        "SECRET_CERT4"
      ]
    }
  }
}
value

(Required, String) The secret of the key in the secret service. The contents are expected to be PKCS8 encoded key in PEM format. This could be a simple string with the secret name, or an object with the secret name and the key that specifies the field that will be read in the secret.

certificateChain

(Required, List) One or more certificates associated to the key.

Note

The existence of the key secret will be verified.

circuitBreaker

(Optional, Object) The circuit breaker configuration as a mechanism to handle request errors and limitations of an external service.

Details
{
  "circuitBreaker": {
    "waitInOpenState": "90s",
    "maxTestRequests": 1
  },
  ...
}
waitInOpenState

(Optional, Duration) The maximum time to wait in OPEN state before transitioning to HALF_OPEN state.

maxTestRequests

(Optional, Integer) The maximum number of requests on HALF_OPEN state.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Authentication and Credentials

Credentials API
Create a new Credential
$ curl --request POST 'core-api:12010/v2/credential' --data '{ ... }'
List all Credentials
$ curl --request GET 'core-api:12010/v2/credential'
Get a single Credential
$ curl --request GET 'core-api:12010/v2/credential/{id}'
Note

Reading credentials will never expose the referenced secret.

Update an existing Credential
$ curl --request PUT 'core-api:12010/v2/credential/{id}' --data '{ ... }'
Note

The type of an existing credential can’t be modified.

Delete an existing Credential
$ curl --request DELETE 'core-api:12010/v2/credential/{id}'
Clone an existing Credential
$ curl --request POST 'core-api:12010/v2/credential/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Credential.

Search for Credentials using DSL Filters
$ curl --request POST 'core-api:12010/v2/credential/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Credentials
$ curl --request GET 'core-api:12010/v2/credential/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

A credential references a secret with the authentication parameters required to connect to an external service:

{
  "type": "my-external-service",
  "name": "My External Service Credential",
  "secret": "MY_SECRET",
  ...
}

When the secret provider is internal, it is possible to create a secret during the creation of the credential:

{
  "type": "my-external-service",
  "name": "My External Service Credential",
  "secret": {
    "name": "MY_SECRET",
    "content": {
      "username": <username>,
      "password": <password>
    },
    ...
  },
  ...
}
Note

It is assumed that the referenced secret exists and has the correct JSON-formatted authentication information. However, this is only a soft reference, and any deletion of secret keys won’t be noticed until the next time the secret is required.

type

(Required, String) The type of credentials for the supported external service.

name

(Required, String) The unique name to identify the credentials.

description

(Optional, String) The description for the configuration.

secret

(Required, String or Object) Either the secret key to connect to the external service, or an object with the authentication details.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Secrets

Secrets API
Create a new Secret
$ curl --request POST 'core-api:12010/v2/secret' --data '{ ... }'
List all Secrets
$ curl --request GET 'core-api:12010/v2/secret'
Get a single Secret
$ curl --request GET 'core-api:12010/v2/secret/{id}'
Note

Reading secrets will never expose their encrypted data.

Update an existing Secret
$ curl --request PUT 'core-api:12010/v2/secret/{id}' --data '{ ... }'
Delete an existing Secret
$ curl --request DELETE 'core-api:12010/v2/secret/{id}'

A secret is a representation of a secure JSON document. Its contents could be anything, but the most common usage is for credentials:

{
  "name": "MY_SECRET",
  "content": {
    "username": <username>,
    "password": <password>
  },
  ...
}
Note

When secrets are backed by an external service, Discovery won’t expose any CRUD operations for their management.

name

(Required, String) The unique name to identify the secret.

description

(Optional, String) The description for the configuration.

content

(Required, Object) The JSON to securely store.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Discovery Features for Integrations

Circuit Breaker

Circuit Breaker configuration for an Integration
{
  "type": "my-external-service",
  "name": "My External Service",
  "config": {
    ...
  },
  "circuitBreaker": {
    ...
  },
  ...
}

A Circuit Breaker is a fault-tolerance technique intended to gracefully handle errors when communicating with external services.

Its main goal is to avoid problems such as resource exhaustion due to rate-limiting constraints.

DSL

The DSL is a specification that aims to provide a single, unified language for all user interactions with Discovery and its integrations. Given the "simplified" nature of the language, its use as part of any integration depends on how well it can be adapted to the capabilities of the integration itself.

Any implementation available will be described in detail in its corresponding section.

Ping

A ping is a mechanism to validate the configuration of an integration (both Server and Credentials working together), using the /ping endpoint:

$ curl --request GET 'core-api:12010/v2/server/{id}/ping'

Proxy Server

Proxy Server configuration for an Integration
{
  "type": "my-external-service",
  "name": "My External Service",
  "config": {
    ...
  },
  "proxy": <Proxy Server ID>,
  ...
}

A Proxy Server can be configured for integrations that require routing requests through an intermediate server. To enable this, the ID of the proxy server must be configured as part of the Server.

The proxy is represented as a separate server entity linked to the client server. Its own configuration object can define connection details such as host and port. Additionally, if a credential is associated with the proxy server, it can be used to authenticate with the proxy using fields like username and password.
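Putting this together, a minimal proxy Server entity could look like the following (the host and port values are placeholders):

```json
{
  "type": "proxy",
  "name": "My Proxy Server",
  "config": {
    "host": "proxy.internal.example",
    "port": 8080
  },
  "credential": <Credential ID with the proxy authentication>
}
```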

TLS/mTLS

TLS/mTLS configuration for an Integration
{
  "type": "my-external-service",
  "name": "My External Service",
  "config": {
    ...
  },
  "certificates": {
    ...
  },
  ...
}

In scenarios where security is a greater concern, it might be required to use custom certificates that validate the identity of any of the parties involved in the communication.

This handshake, either one-way (TLS) or mutual (mTLS), can be specified as part of the Server configuration.

Supported Services

Amazon Bedrock

{
  "type": "amazon-bedrock",
  "name": "My Amazon Bedrock Server",
  "config": {
    ...
  },
  "credential": <Amazon Bedrock Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 1. Discovery Features for Amazon Bedrock

Feature          Supported
Circuit Breaker  Yes
DSL              No
Ping             No
Proxy Server     No
TLS/mTLS         No

Server Configuration for Amazon Bedrock
region

(Required, String) The AWS region.

apiCallTimeout

(Optional, Duration) The complete duration of an API call.

connection

(Optional, Object) The configuration of the connection to Amazon Bedrock.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

backoffPolicy

(Optional, Object) The configuration for retries to Amazon Bedrock.

Details
type

(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.

initialDelay

(Optional, Duration) The initial delay before retrying. Defaults to 50ms.

retries

(Optional, Integer) The maximum number of retries. Defaults to 5.
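Putting the fields above together, a complete Server configuration for Amazon Bedrock might look like this (all values are illustrative, and the nesting of connection and backoffPolicy inside config follows the field descriptions above):

```json
{
  "type": "amazon-bedrock",
  "name": "My Amazon Bedrock Server",
  "config": {
    "region": "us-east-1",
    "apiCallTimeout": "30s",
    "connection": {
      "connectTimeout": "60s",
      "readTimeout": "60s",
      "pool": {
        "size": 5,
        "keepAlive": "5m"
      }
    },
    "backoffPolicy": {
      "type": "EXPONENTIAL",
      "initialDelay": "50ms",
      "retries": 5
    }
  }
}
```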

Authentication
Credentials for Amazon Bedrock
{
  "type": "aws",
  "name": "My Amazon Bedrock Credentials",
  "secret": "MY_AMAZON_BEDROCK_SECRET"
}
Secret for Amazon Bedrock with AWS Credentials
{
  "name": "MY_AMAZON_BEDROCK_SECRET",
  "content": {
    ...
  }
}
accessKeyId

(Required, String) The ID of your access key, used to identify the user.

secretAccessKey

(Required, String) The secret access key, used to authenticate the user.

sessionToken

(Optional, String) The session token from an AWS token service, used to verify that this user has received temporary permission to access some resource.

expirationTime

(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown but it may still expire at some time.

Amazon S3

{
  "type": "amazon-s3",
  "name": "My Amazon S3 Server",
  "config": {
    ...
  },
  "credential": <Amazon S3 Credential ID>, (1)
  "proxy": <Amazon S3 Proxy Server ID> (2)
}
1 Optional. See the Authentication section.
2 Optional. See the Proxy section.
Table 2. Discovery Features for Amazon S3

Feature          Supported
Circuit Breaker  No
DSL              No
Ping             No
Proxy Server     Yes
TLS/mTLS         No

Server Configuration for Amazon S3
region

(Required, String) The AWS region.

retries

(Optional, Integer) The number of retry attempts for failed requests to S3.

connection

(Optional, Object) The connection configuration.

Details
{
  "connection": {
    "timeout": "5s",
    "health": {
      "minimumThroughputInBps": 1,
      "minimumThroughputTimeout": "5s"
    }
  },
  ...
}
timeout

(Optional, Duration) The timeout duration for establishing a connection to S3.

health

(Optional, Object) The health connection configuration.

Details
minimumThroughputInBps

(Required, Long) The minimum throughput in bytes per second that is considered healthy for a connection.

minimumThroughputTimeout

(Required, Duration) The timeout duration used to evaluate if the minimum throughput has been met.

accelerate

(Optional, Boolean) Enables S3 Transfer Acceleration to speed up uploads.

checksumValidationEnabled

(Optional, Boolean) Enables validation of checksums during uploads and downloads.

crossRegionAccessEnabled

(Optional, Boolean) Allows the S3 client to access buckets in different regions from the one specified.

forcePathStyle

(Optional, Boolean) Forces the use of path-style access (s3.amazonaws.com/bucket) instead of virtual-hosted-style (bucket.s3.amazonaws.com).

maxConcurrency

(Optional, Integer) The maximum number of concurrent S3 requests allowed.

minimumPartSizeInBytes

(Optional, Long) The minimum size, in bytes, for each part in a multipart upload. Affects upload efficiency and limits.

initialReadBufferSizeInBytes

(Optional, Long) The initial buffer size, in bytes, used when reading data from S3.

thresholdInBytes

(Optional, Long) The file size threshold, in bytes, above which a multipart upload is initiated.

targetThroughputInGbps

(Optional, Double) The target throughput for the client in gigabits per second, used to optimize performance.

endpointOverride

(Optional, String) Overrides the default S3 endpoint with a custom URI. Useful for local testing or VPC endpoints.
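As an illustrative example, the endpointOverride and forcePathStyle options are commonly combined to point the client at an S3-compatible service during local testing (all values below are placeholders):

```json
{
  "type": "amazon-s3",
  "name": "My Local S3 Server",
  "config": {
    "region": "us-east-1",
    "retries": 3,
    "forcePathStyle": true,
    "endpointOverride": "http://localhost:9000"
  }
}
```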

Authentication
Credentials for Amazon S3
{
  "type": "aws",
  "name": "My Amazon S3 Credentials",
  "secret": "MY_AMAZON_S3_SECRET"
}
Secret for Amazon S3 with AWS Credentials
{
  "name": "MY_AMAZON_S3_SECRET",
  "content": {
    ...
  }
}
accessKeyId

(Required, String) The ID of your access key, used to identify the user.

secretAccessKey

(Required, String) The secret access key, used to authenticate the user.

sessionToken

(Optional, String) The session token from an AWS token service, used to verify that this user has received temporary permission to access some resource.

expirationTime

(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown but it may still expire at some time.

Proxy Server
Server Configuration for Amazon S3 through a Proxy Server
{
  "type": "amazon-s3",
  "name": "My Amazon S3 Server",
  "config": {
    ...
  },
  "proxy": <Proxy Server ID>,
  ...
}
Proxy Server Configuration for Amazon S3
{
  "type": "proxy",
  "name": "My Amazon S3 Proxy Server",
  "config": {
    ...
  },
  "credential": <Credential ID with the proxy authentication>
}
host

(Required, String) The hostname of the proxy server.

port

(Required, Integer) The port of the proxy server.

Credentials for an Amazon S3 Proxy Server
{
  "type": "proxy",
  "name": "My Amazon S3 Proxy Server Credentials",
  "secret": "MY_AMAZON_S3_PROXY_SECRET"
}
Secret for an Amazon S3 Proxy Server
{
  "name": "MY_AMAZON_S3_PROXY_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

Elasticsearch

{
  "type": "elasticsearch",
  "name": "My Elasticsearch Server",
  "config": {
    ...
  },
  "credential": <Elasticsearch Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 3. Discovery Features for Elasticsearch

Feature          Supported
Circuit Breaker  No
DSL              Yes
Ping             Yes
Proxy Server     No
TLS/mTLS         Yes

Server Configuration for Elasticsearch
servers

(Required, Array of Strings) The URIs for the Elasticsearch installation. Multiple servers will be invoked in round-robin.

pathPrefix

(Optional, String) The path prefix to add to the servers on each call.

connection

(Optional, Object) The configuration of the HTTP connection to Elasticsearch.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.
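Combining the fields above, a complete Server configuration for a two-node Elasticsearch installation might look like this (hosts and values are illustrative):

```json
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Server",
  "config": {
    "servers": ["https://es-node-1:9200", "https://es-node-2:9200"],
    "connection": {
      "connectTimeout": "30s",
      "compressRequest": true
    }
  },
  "credential": <Elasticsearch Credential ID>
}
```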

Server Configuration for Elastic Cloud
cloudId

(Required, String) The ID of the instance in Elastic Cloud.

connection

(Optional, Object) The configuration of the HTTP connection to Elasticsearch.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.
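For Elastic Cloud, the configuration replaces the servers list with the cloudId (the value below is a placeholder, and the elasticsearch server type is assumed to be shared with the self-managed variant):

```json
{
  "type": "elasticsearch",
  "name": "My Elastic Cloud Server",
  "config": {
    "cloudId": "<Elastic Cloud ID>"
  },
  "credential": <Elasticsearch Credential ID>
}
```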

Authentication
Credentials for Elasticsearch
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Credentials",
  "secret": "MY_ELASTICSEARCH_SECRET"
}
Secret for Elasticsearch with HTTP Basic Authentication
{
  "name": "MY_ELASTICSEARCH_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

Secret for Elasticsearch with HTTP Bearer Token
{
  "name": "MY_ELASTICSEARCH_SECRET",
  "content": {
    ...
  }
}
token

(Required, String) The token of the credentials.

Secret for Elasticsearch with API Key
{
  "name": "MY_ELASTICSEARCH_SECRET",
  "content": {
    ...
  }
}
apiKey

(Required, String) The API key of the credentials.

DSL
Table 4. DSL Filters for Elasticsearch

Filter                    Elasticsearch Query Operator
Equals                    Term Query when normalize is false, otherwise Match Query
Less Than                 Range Query
Less Than or Equal To     Range Query
Between                   Range Query
Greater Than              Range Query
Greater Than or Equal To  Range Query
In                        Terms Query
Exists                    Exists Query
And                       Bool Query with must clauses
Or                        Bool Query with should clauses
Regex                     Regexp Query

Hugging Face

{
  "type": "hugging-face",
  "name": "My Hugging Face Server",
  "config": {
    ...
  },
  "credential": <Hugging Face Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 5. Discovery Features for Hugging Face

Feature          Supported
Circuit Breaker  No
DSL              No
Ping             Yes
Proxy Server     No
TLS/mTLS         Yes

Server Configuration for Hugging Face
servers

(Required, Array of Strings) The URIs for the Hugging Face Inference API service. Multiple servers will be invoked in round-robin.

connection

(Optional, Object) The configuration of the HTTP connection to Hugging Face.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.
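A minimal Server configuration pointing at the hosted Hugging Face Inference API could look like this (the URL is illustrative; any compatible installation works):

```json
{
  "type": "hugging-face",
  "name": "My Hugging Face Server",
  "config": {
    "servers": ["https://api-inference.huggingface.co"]
  },
  "credential": <Hugging Face Credential ID>
}
```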

Authentication
Credentials for Hugging Face
{
  "type": "hugging-face",
  "name": "My Hugging Face Credentials",
  "secret": "MY_HUGGING_FACE_SECRET"
}
Secret for Hugging Face
{
  "name": "MY_HUGGING_FACE_SECRET",
  "content": {
    ...
  }
}
token

(Required, String) The token of the credentials.

LDAP

{
  "type": "ldap",
  "name": "My LDAP Server",
  "config": {
    ...
  }
}
Table 6. Discovery Features for LDAP

Feature          Supported
Circuit Breaker  No
DSL              No
Ping             Yes
Proxy Server     No
TLS/mTLS         Yes

Server Configuration for LDAP
servers

(Required, Array of Objects) List of LDAP server addresses. If multiple servers are provided, requests are distributed among them using a round-robin strategy.

Details
hostname

(Required, String) The host of the LDAP server.

port

(Optional, Integer) The port of the LDAP server. Defaults to 389. When using TLS, this port is usually 636.

connection

(Optional, Object) The configuration of the connection to the LDAP server.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
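Combining the fields above, a Server configuration for an LDAP directory reachable over TLS might look like this (hostname and values are illustrative; TLS itself is configured through the certificates property of the Server):

```json
{
  "type": "ldap",
  "name": "My LDAP Server",
  "config": {
    "servers": [
      { "hostname": "ldap.example.com", "port": 636 }
    ],
    "connection": {
      "connectTimeout": "30s"
    }
  }
}
```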

Authentication
Credentials for LDAP
{
  "type": "ldap",
  "name": "My LDAP Credentials",
  "secret": "MY_LDAP_SECRET"
}
Secret for LDAP
{
  "name": "MY_LDAP_SECRET",
  "content": {
    ...
  }
}
bindDN

(Required, String) The distinguished name that will be used to authenticate.

password

(Required, String) The password of the credentials.

MongoDB

{
  "type": "mongo",
  "name": "My MongoDB Server",
  "config": {
    ...
  },
  "credential": <MongoDB Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 7. Discovery Features for MongoDB

Feature          Supported
Circuit Breaker  No
DSL              Yes
Ping             Yes
Proxy Server     No
TLS/mTLS         No

Server Configuration for MongoDB/MongoDB Atlas
servers

(Required, Array of Strings) The connection strings for the MongoDB/MongoDB Atlas installation. Multiple servers represent a replica set.

connection

(Optional, Object) The configuration of the connection to MongoDB.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressors

(Optional, Array of Strings) A list of data compressors. One of SNAPPY, ZLIB or ZSTD.

tls

(Optional, Boolean) true if the connection should be done through SSL. Defaults to false.

retryWrites

(Optional, Boolean) true if the connection should retry requests. Defaults to true.
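Putting these fields together, a Server configuration for a MongoDB Atlas replica set might look like this (the connection string is a placeholder):

```json
{
  "type": "mongo",
  "name": "My MongoDB Atlas Server",
  "config": {
    "servers": ["<MongoDB Atlas connection string>"],
    "tls": true,
    "retryWrites": true
  },
  "credential": <MongoDB Credential ID>
}
```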

Authentication
Credentials for MongoDB/MongoDB Atlas
{
  "type": "mongo",
  "name": "My MongoDB Credentials",
  "secret": "MY_MONGODB_SECRET"
}
Secret for MongoDB/MongoDB Atlas using SCRAM-SHA-1
{
  "name": "MY_MONGODB_SECRET",
  "content": {
    "mechanism": "SCRAM-SHA-1",
    ...
  }
}
mechanism

(Required, String) The authentication mechanism. Must be SCRAM-SHA-1.

username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

source

(Optional, String) The database name associated with the user’s authentication data. Defaults to admin.

Secret for MongoDB/MongoDB Atlas using SCRAM-SHA-256
{
  "name": "MY_MONGODB_SECRET",
  "content": {
    "mechanism": "SCRAM-SHA-256",
    ...
  }
}
mechanism

(Required, String) The authentication mechanism. Must be SCRAM-SHA-256.

username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

source

(Optional, String) The database name associated with the user’s authentication data. Defaults to admin.

Secret for MongoDB/MongoDB Atlas using AWS IAM
{
  "name": "MY_MONGODB_SECRET",
  "content": {
    "mechanism": "MONGODB-AWS",
    ...
  }
}
mechanism

(Required, String) The authentication mechanism. Must be MONGODB-AWS.

accessKeyId

(Required, String) The AWS access key ID.

secretAccessKey

(Optional, String) The AWS secret access key.

sessionToken

(Optional, String) The AWS session token for authentication with temporary credentials when using an AssumeRole request, or when working with AWS resources that specify this value such as Lambda.
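
A complete AWS IAM secret might look like the following (placeholder values):

```json
{
  "name": "MY_MONGODB_SECRET",
  "content": {
    "mechanism": "MONGODB-AWS",
    "accessKeyId": "<your-access-key-id>",
    "secretAccessKey": "<your-secret-access-key>",
    "sessionToken": "<your-session-token>"
  }
}
```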

DSL
Table 8. DSL Filters for MongoDB
Filter Mongo Query Operator

Equals

$eq

Less Than

$lt

Less Than or Equal To

$lte

Between

An $and of $gte and $lt

Greater Than

$gt

Greater Than or Equal To

$gte

In

$in

Exists

$exists

And

$and

Or

$or

Not

$not

Regex

$regex with regular expressions
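
As an illustration of the mappings in Table 8, the following Python sketch translates a few DSL filters into their MongoDB query operator equivalents. The helper name to_mongo is hypothetical; the actual translation is performed internally by Discovery.

```python
# Hypothetical sketch of the DSL-to-MongoDB mappings listed in Table 8.
def to_mongo(dsl):
    if "equals" in dsl:
        f = dsl["equals"]
        return {f["field"]: {"$eq": f["value"]}}
    if "between" in dsl:
        f = dsl["between"]
        # "from" is inclusive ($gte), "to" is exclusive ($lt)
        return {"$and": [
            {f["field"]: {"$gte": f["from"]}},
            {f["field"]: {"$lt": f["to"]}},
        ]}
    if "and" in dsl:
        return {"$and": [to_mongo(clause) for clause in dsl["and"]]}
    raise ValueError("unsupported filter")

query = to_mongo({"between": {"field": "price", "from": 1, "to": 10}})
# query == {"$and": [{"price": {"$gte": 1}}, {"price": {"$lt": 10}}]}
```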

Table 9. DSL Filters for MongoDB Atlas Search
Filter Atlas Search Operator

Equals

Text Operator with a string query

Less Than

Range Operator with numbers or dates

Less Than or Equal To

Range Operator with numbers or dates

Between

Range Operator with numbers or dates

Greater Than

Range Operator with numbers or dates

Greater Than or Equal To

Range Operator with numbers or dates

In

In Operator

Exists

Exists Operator

And

Compound Operator with must

Or

Compound Operator with should

Regex

Regex Operator with the keyword analyzer

Neo4j

{
  "type": "neo4j",
  "name": "My Neo4j Server",
  "config": {
    ...
  },
  "credential": <Neo4j Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 10. Discovery Features for Neo4j
Feature Supported

Circuit Breaker

No

DSL

No

Ping

Yes

Proxy Server

No

TLS/mTLS

No

Server Configuration for Neo4j
server

(Required, String) The URI to connect to Neo4j, following the supported schemes.

connection

(Optional, Object) The configuration of the connection to Neo4j.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

Authentication
Credentials for Neo4j
{
  "type": "neo4j",
  "name": "My Neo4j Credentials",
  "secret": "MY_NEO4J_SECRET"
}
Secret for Neo4j
{
  "name": "MY_NEO4J_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

OpenAI

{
  "type": "openai",
  "name": "My OpenAI Server",
  "config": {
    ...
  },
  "credential": <OpenAI Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 11. Discovery Features for OpenAI
Feature Supported

Circuit Breaker

Yes

DSL

No

Ping

No

Proxy Server

No

TLS/mTLS

No

Server Configuration for OpenAI
organizationId

(Optional, String) The Organization ID to be added to the request headers.

connection

(Optional, Object) The configuration of the connection for OpenAI.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

maxRetries

(Optional, Integer) The maximum number of retries for each request. Defaults to 2.

baseUrl

(Optional, String) The custom base URL to connect to the OpenAI service. Defaults to https://api.openai.com/v1.

Authentication
Credentials for OpenAI
{
  "type": "openai",
  "name": "My OpenAI Credentials",
  "secret": "MY_OPENAI_SECRET"
}
Secret for OpenAI
{
  "name": "MY_OPENAI_SECRET",
  "content": {
    ...
  }
}
apiKey

(Required, String) The API key of the credentials.

projectId

(Optional, String) The Project ID to be added to the request headers.

OpenSearch

{
  "type": "opensearch",
  "name": "My OpenSearch Server",
  "config": {
    ...
  },
  "credential": <OpenSearch Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 12. Discovery Features for OpenSearch
Feature Supported

Circuit Breaker

No

DSL

Yes

Ping

Yes

Proxy Server

No

TLS/mTLS

Yes

Server Configuration for OpenSearch
servers

(Required, Array of Strings) The URI for the OpenSearch installation. Multiple servers will be invoked in round-robin.

pathPrefix

(Optional, String) The path prefix to add to the servers on each call.

connection

(Optional, Object) The configuration of the HTTP connection to OpenSearch.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

Server Configuration for AWS OpenSearch
endpoint

(Required, String) The host to make the request to, without the http:// prefix.

signature

(Required, Object) The signature for the requests.

Details
region

(Required, String) The AWS region of the service.

serviceName

(Required, String) The signing service name.

connection

(Optional, Object) The configuration of the HTTP connection to OpenSearch.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

Authentication
Credentials for OpenSearch
{
  "type": "opensearch",
  "name": "My OpenSearch Credentials",
  "secret": "MY_OPENSEARCH_SECRET"
}
Secret for OpenSearch with HTTP Basic Authentication
{
  "name": "MY_OPENSEARCH_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

Secret for OpenSearch with AWS Authentication
{
  "name": "MY_OPENSEARCH_SECRET",
  "content": {
    ...
  }
}
accessKeyId

(Required, String) The ID of your access key, used to identify the user.

secretAccessKey

(Required, String) The secret access key, used to authenticate the user.

sessionToken

(Optional, String) The session token from an AWS token service, used to authenticate that this user has received temporary permission to access some resource.

expirationTime

(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown, but the credentials may still expire at some point.

DSL
Table 13. DSL Filters for OpenSearch
Filter OpenSearch Query Operator

Equals

Term Query when normalize is false, otherwise Match Query

Less Than

Range Query

Less Than or Equal To

Range Query

Between

Range Query

Greater Than

Range Query

Greater Than or Equal To

Range Query

In

Terms Query

Exists

Exists Query

And

Bool Query with must clauses

Or

Bool Query with should clauses

Regex

Regexp Query

Oracle Database

{
  "type": "oracledb",
  "name": "My Oracle Database Server",
  "config": {
    ...
  },
  "credential": <Oracle Database Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 14. Discovery Features for Oracle Database
Feature Supported

Circuit Breaker

No

DSL

No

Ping

Yes

Proxy Server

No

TLS/mTLS

No

The ping feature is implemented using the SERVER validation mode. For more information, see this chart.

Server Configuration for Oracle Database
serviceName

(Required, String) The service name of the Oracle Database. Used to complete the JDBC URL.

jdbcConnector

(Required, String) The path of the file in the Object Storage that contains the Oracle JDBC driver as a .jar. This file must be present in the object storage in order to connect to an Oracle database. The supported driver is the Thin driver, which can be downloaded from the official Oracle website. For Discovery, the driver that supports JDK 17 must be used.

servers

(Required, Array of Objects) The addresses of the Oracle databases.

Details
hostname

(Required, String) The host of the Oracle database.

port

(Optional, Integer) The port of the Oracle database. Defaults to 1521.

protocol

(Optional, String) The listener protocol address. Defaults to TCP. The other supported protocols are TCPS and BEQ.

defaultSchema

(Optional, String) The default schema to be used by the user in the Oracle database.

fetchSize

(Optional, Integer) The number of rows to be fetched. Defaults to 10.

loadBalance

(Optional, String) Enables or disables client load balancing for multiple protocol addresses. Defaults to OFF. The other supported value is ON.

failOver

(Optional, String) Enables or disables connect-time failover for multiple protocol addresses. Defaults to ON. The other supported value is OFF.

serverType

(Optional, String) Sets the Oracle Database server architecture. Defaults to DEDICATED. The other supported value is SHARED.

connection

(Optional, Object) The configuration of the connection to the Oracle database.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
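
Putting the options together, an Oracle Database server configuration might look like the following (all values, including the driver path, are illustrative):

```json
{
  "type": "oracledb",
  "name": "My Oracle Database Server",
  "config": {
    "serviceName": "ORCLPDB1",
    "jdbcConnector": "drivers/ojdbc11.jar",
    "servers": [
      {
        "hostname": "oracle-host",
        "port": 1521,
        "protocol": "TCP"
      }
    ],
    "defaultSchema": "HR",
    "fetchSize": 10,
    "loadBalance": "OFF",
    "failOver": "ON",
    "serverType": "DEDICATED",
    "connection": {
      "connectTimeout": "60s",
      "readTimeout": "60s"
    }
  }
}
```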

Authentication
Credentials for Oracle Database
{
  "type": "oracledb",
  "name": "My Oracle Database Credentials",
  "secret": "MY_ORACLEDB_SECRET"
}
Secret for Oracle Database
{
  "name": "MY_ORACLEDB_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

SharePoint Online

{
  "type": "sharepoint",
  "name": "My SharePoint Online Server",
  "config": {
    ...
  }
}
Table 15. Discovery Features for SharePoint Online
Feature Supported

Circuit Breaker

Yes

DSL

No

Ping

No

Proxy Server

No

TLS/mTLS

No

Server Configuration for SharePoint Online
tenantUrl

(Required, String) The Domain URL of the SharePoint tenant to crawl.

connection

(Optional, Object) The configuration of the connection to SharePoint Online.

Details
{
  "connection": {
    "connectTimeout": "60s",
    "readTimeout": "60s",
    "pool": {
      "size": 5,
      "keepAlive": "5m"
    },
    "followRedirects": true
  },
  ...
}
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

followRedirects

(Optional, Boolean) true if redirects must be followed.

Authentication
Credentials for SharePoint Online
{
  "type": "sharepoint",
  "name": "My SharePoint Online Credentials",
  "secret": "MY_SHAREPOINT_ONLINE_CERTIFICATE"
}
Secret for SharePoint Online with Azure Entra ID Certificate
{
  "name": "MY_SHAREPOINT_ONLINE_CERTIFICATE",
  "content": {
    ...
  }
}
tenantId

(Required, String) The Azure Tenant ID.

clientId

(Required, String) The application ID.

certificate

(Required, String) The contents of your certificate configured in your Entra application.

privateKey

(Required, String) The contents of your private key associated with your certificate.

SMB

{
  "type": "smb",
  "name": "My SMB Server",
  "config": {
    ...
  },
  "credential": <SMB Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 16. Discovery Features for SMB
Feature Supported

Circuit Breaker

No

DSL

No

Ping

Yes

Proxy Server

No

TLS/mTLS

No

Server Configuration for SMB
servers

(Required, Array of Strings) The list of the SMB paths, including the share name.

Authentication
Credentials for SMB
{
  "type": "smb",
  "name": "My SMB Credentials",
  "secret": "MY_SMB_SECRET"
}
Secret for SMB
{
  "name": "MY_SMB_SECRET",
  "content": {
    ...
  }
}
username

(Optional, String) The username of the credentials. Defaults to GUEST.

password

(Optional, String) The password of the credentials. Defaults to an empty string.

domain

(Optional, String) The domain of the credentials. Defaults to ?.

Solr

{
  "type": "solr",
  "name": "My Solr Server",
  "config": {
    ...
  },
  "credential": <Solr Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 17. Discovery Features for Solr
Feature Supported

Circuit Breaker

No

DSL

Yes

Ping

Yes

Proxy Server

No

TLS/mTLS

No

Server Configuration for Solr
servers

(Required, Array of Strings) The URI for the Solr installation. Multiple servers will be invoked in round-robin.

connection

(Optional, Object) The configuration of the HTTP connection to Solr.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

Authentication
Credentials for Solr
{
  "type": "solr",
  "name": "My Solr Credentials",
  "secret": "MY_SOLR_SECRET"
}
Secret for Solr
{
  "name": "MY_SOLR_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

DSL
Table 18. DSL Filters for Solr
Filter Solr Query Operator

Equals

Equality Query

Less Than

Less Than Query

Less Than or Equal To

Less Than or Equals Query

Between

Range Query

Greater Than

Greater Than Query

Greater Than or Equal To

Greater Than or Equals Query

In

Terms Query

Exists

Exists Query

And

And Query

Or

Or Query

Vespa

{
  "type": "vespa",
  "name": "My Vespa Server",
  "config": {
    ...
  },
  "credential": <Vespa Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 19. Discovery Features for Vespa
Feature Supported

Circuit Breaker

Yes

DSL

No

Ping

Yes

Proxy Server

No

TLS/mTLS

Yes

Note

Vespa Cloud is protected with mTLS. See the Server Configuration section to configure the keys, or use token authentication.

Server Configuration for Vespa
servers

(Required, Array of Strings) The URI for the REST service. Multiple servers will be invoked in round-robin.

connection

(Optional, Object) The configuration of the HTTP connection.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

backoffPolicy

(Optional, Object) The configuration for retries to the REST service.

Details
type

(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.

initialDelay

(Optional, Duration) The initial delay before retrying. Defaults to 50ms.

retries

(Optional, Integer) The maximum number of retries. Defaults to 5.

scroll

(Optional, Object) The scroll configuration for paginated requests.

Details
{
  "scroll": {
    "size": 50
  }
}
size

(Required, Integer) The size of the scroll request.

Authentication
Credentials for Vespa
{
  "type": "vespa",
  "name": "My Vespa Credentials",
  "secret": "MY_VESPA_SECRET"
}
Secret for Vespa with HTTP Basic Authentication
{
  "name": "MY_VESPA_SECRET",
  "content": {
    ...
  }
}
username

(Required, String) The username of the credentials.

password

(Required, String) The password of the credentials.

Secret for Vespa with HTTP Bearer Token
{
  "name": "MY_VESPA_SECRET",
  "content": {
    ...
  }
}
token

(Required, String) The token of the credentials.

Secret for Vespa with API Key
{
  "name": "MY_VESPA_SECRET",
  "content": {
    ...
  }
}
apiKey

(Required, String) The API key of the credentials.

Voyage AI

{
  "type": "voyage-ai",
  "name": "My Voyage AI Server",
  "config": {
    ...
  },
  "credential": <Voyage AI Credential ID> (1)
}
1 Optional. See the Authentication section.
Table 20. Discovery Features for Voyage AI
Feature Supported

Circuit Breaker

Yes

DSL

No

Ping

No

Proxy Server

No

TLS/mTLS

Yes

Server Configuration for Voyage AI
connection

(Optional, Object) The configuration of the connection for Voyage AI.

Details
connectTimeout

(Optional, Duration) The timeout to connect to the service. Defaults to 60s.

readTimeout

(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.

pool

(Optional, Object) The configuration for the connection pool.

Details
size

(Optional, Integer) The size of the connection pool. Defaults to 5.

keepAlive

(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.

compressRequest

(Optional, Boolean) true if the requests must be compressed.

followRedirects

(Optional, Boolean) true if redirects must be followed.

backoffPolicy

(Optional, Object) The configuration of the back off policy for Voyage AI.

Details
type

(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.

initialDelay

(Optional, Duration) The initial delay before retrying. Defaults to 50ms.

retries

(Optional, Integer) The maximum number of retries. Defaults to 5.

Authentication
Credentials for Voyage AI
{
  "type": "voyage-ai",
  "name": "My Voyage AI Credentials",
  "secret": "MY_VOYAGE_AI_SECRET"
}
Secret for Voyage AI
{
  "name": "MY_VOYAGE_AI_SECRET",
  "content": {
    ...
  }
}
token

(Required, String) The token of the credentials.

DSL

The Discovery Domain-Specific Language (DSL) is a standardized way to write JSON expressions that can be applied across all Discovery products.

Filters

"Equals" Filter

The value of the field must exactly match the one provided.

{
  "equals": {
    "field": "my-field",
    "value": "my-value",
    "normalize": true
  }
}

When supported, the normalize field enables normalization as described by the filter provider. It is enabled by default.

"Less Than" Filter

The value of the field must be less than the one provided.

{
  "lt": {
    "field": "my-field",
    "value": 1
  }
}

"Less Than or Equal to" Filter

The value of the field must be less than or equal to the one provided.

{
  "lte": {
    "field": "my-field",
    "value": 1
  }
}

"Between" Filter

The value of the field must be greater than or equal to the "from" value (inclusive), and less than the "to" value (exclusive).

{
  "between": {
    "field": "my-field",
    "from": 1,
    "to": 10
  }
}

"Greater Than" Filter

The value of the field must be greater than the one provided.

{
  "gt": {
    "field": "my-field",
    "value": 1
  }
}

"Greater Than or Equal To" Filter

The value of the field must be greater than or equal to the one provided.

{
  "gte": {
    "field": "my-field",
    "value": 1
  }
}

"In" Filter

The value of the field must be one of the provided values.

{
  "in": {
    "field": "my-field",
    "values": [
      "my-value-a",
      "my-value-b"
    ]
  }
}

"Empty" Filter

Checks if a field is empty:

  • For a collection, true if its size is 0.

  • For a String, true if its length is 0.

  • For any other type, true if it is null.

{
  "empty": {
    "field": "my-field"
  }
}

"Exists" Filter

Checks if a field exists.

{
  "exists": {
    "field": "my-field"
  }
}

"Not" Filter

Negates the inner clause.

{
  "not": {
    "equals": {
      "field": "my-field",
      "value": "my-value"
    }
  }
}

"Null" Filter

Checks if a field is null. Note that while the "exists" filter checks whether the field is present or not, the "null" filter expects the field to be present but with null value.

{
  "null": {
    "field": "my-field"
  }
}

"Regex" Filter

Checks if a field matches with the given regex pattern.

{
  "regex": {
    "field": "my-field",
    "pattern": "my-pattern"
  }
}

"Boolean" Filter

  • and: All conditions in the list must be evaluated to true.

{
  "and": [
    {
      "equals": {
        "field": "my-field-a",
        "value": "my-value-a"
      }
    }, {
      "equals": {
        "field": "my-field-b",
        "value": "my-value-b"
      }
    }
  ]
}
  • or: At least one condition in the list must be evaluated to true.

{
  "or": [
    {
      "equals": {
        "field": "my-field-a",
        "value": "my-value-a"
      }
    }, {
      "equals": {
        "field": "my-field-b",
        "value": "my-value-b"
      }
    }
  ]
}

Projections

A projection allows you to select which fields (attributes) are returned from a request:

  • If no includes or excludes fields are defined, all fields are returned.

  • If only the includes fields are defined, only those fields are returned.

  • If only the excludes fields are defined, all available fields, except the ones in the exclusions are returned.

  • If both includes and excludes fields are defined, both are included in the projection.

Note

The details of how projections are processed might vary between use cases of the DSL and/or providers, especially for projections with both included and excluded fields. It’s recommended to check the documentation of the specific component or API that will be used for details such as projections that aren’t allowed.

{
  "includes": ["my-field-a", "my-field-b"],
  "excludes": ["my-field-c", "my-field-d"]
}
includes

(Array of Strings) The list of fields to include in the projection.

excludes

(Array of Strings) The list of fields to exclude from the projection.
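
As a rough illustration of the rules above, the following Python sketch applies a projection to a document. This is one possible interpretation only; as noted, the behavior when both lists are present varies by provider.

```python
# Illustrative sketch of projection semantics; combined includes/excludes
# behavior varies by provider, so this is only one interpretation.
def project(doc, includes=None, excludes=None):
    if includes:
        doc = {k: v for k, v in doc.items() if k in includes}
    if excludes:
        doc = {k: v for k, v in doc.items() if k not in excludes}
    return doc

doc = {"my-field-a": 1, "my-field-b": 2, "my-field-c": 3}
result = project(doc, includes=["my-field-a", "my-field-b"])
# result == {"my-field-a": 1, "my-field-b": 2}
```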

Sort

A collection of sort clauses to be applied in the given order, following the rules of the corresponding implementation.

[
  {
    "property": "sort-field-a",
    "direction": "ASC"
  },
  {
    "property": "sort-field-b",
    "direction": "DESC"
  }
]
property

(String) The property to sort by.

direction

(String) The direction of the sort. Either ASC or DESC.

Expression Language

The Expression Language is a flexible yet simple way to manage and handle configurations. In a JSON document, expressions let values remain dynamic so they can later be processed in context:

{
  "dynamicField": "#{ first_math_function('input') + second_math_function('input') }",
  "staticField": "value"
}

As shown in the previous example, the syntax of an expression is one or multiple constants, operators and functions wrapped between the #{ and } tokens.

Note

The Expression Language is case-sensitive and all functions are defined in snake_case.

Constants

Table 21. Basic Constants
Constant Value

NULL

null

Table 22. Mathematical Constants
Constant Value

PI

3.14159265...

E

2.71828182...

Table 23. Boolean Constants
Constant Value

TRUE

true

FALSE

false

Operators

Table 24. Mathematical Operators
Operator Token

Plus

+

Minus

-

Multiplication

*

Division

/

Power of

^

Modulo

%

Table 25. Equality and Relational Operators
Operator Token

Equals

=

Equals

==

Not equals

<>

Not equals

!=

Greater than

>

Greater than or equal to

>=

Less than

<

Less than or equal to

<=

Table 26. Boolean Operators
Operator Token

And

&&

Or

||

Not

!

Table 27. Date/Time Operators
Operator Token

Plus

+

Minus

-

Table 28. String Operators
Operator Token

Concat

+
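
Combining the operators above, a configuration can mix arithmetic, comparisons, boolean logic and string concatenation in a single expression (illustrative):

```json
{
  "total": "#{ 2 + 3 * 4 }",
  "withinRange": "#{ 5 >= 1 && 5 <= 10 }",
  "label": "#{ 'result: ' + 'ok' }"
}
```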

Functions

Table 29. Basic Functions
Function Description Example

coalesce(any, ...)

Returns the first non-null value, or null if there are none

coalesce(null, 2, 3) = 2

Table 30. Mathematical Functions
Function Description Example

abs(number)

Returns the absolute value of a value

abs(-7) = 7

ceiling(number)

Rounds a number towards positive infinity

ceiling(1.1) = 2

fact(number)

Returns the factorial of a number

fact(5) = 120

floor(number)

Rounds a number towards negative infinity

floor(1.9) = 1

log(number)

Performs the logarithm with base e on a value

log(5) = 1.609

log10(number)

Performs the logarithm with base 10 on a value

log10(5) = 0.698

max(number, ...)

Returns the highest value from all the parameters provided

max(5, 55, 6, 102) = 102

min(number, ...)

Returns the lowest value from all the parameters provided

min(5, 55, 6, 102) = 5

random()

Returns a random number between 0 and 1

random() = 0.1613...

round(number, integer)

Rounds a decimal number to a specified scale

round(0.5652, 2) = 0.57

sum(number, ...)

Returns the sum of the parameters

sum(0.5, 3, 1) = 4.5

sqrt(number)

Returns the square root of the value provided

sqrt(4) = 2

Table 31. Trigonometric Functions
Function Description Example

acos(number)

Returns the arc-cosine in degrees

acos(1) = 0

acosh(number)

Returns the hyperbolic arc-cosine in degrees

acosh(1.5) = 0.96

acosr(number)

Returns the arc-cosine in radians

acosr(0.5) = 1.04

acot(number)

Returns the arc-co-tangent in degrees

acot(1) = 45

acoth(number)

Returns the hyperbolic arc-co-tangent in degrees

acoth(1.003) = 3.141

acotr(number)

Returns the arc-co-tangent in radians

acotr(1) = 0.785

asin(number)

Returns the arc-sine in degrees

asin(1) = 90

asinh(number)

Returns the hyperbolic arc-sine in degrees

asinh(6.76) = 2.61

asinr(number)

Returns the arc-sine in radians

asinr(1) = 1.57

atan(number)

Returns the arc-tangent in degrees

atan(1) = 45

atan2(number, number)

Returns the angle of arc-tangent2 in degrees

atan2(1, 0) = 90

atan2r(number, number)

Returns the angle of arc-tangent2 in radians

atan2r(1, 0) = 1.57

atanh(number)

Returns the hyperbolic arc-tangent in degrees

atanh(0.5) = 0.54

atanr(number)

Returns the arc-tangent in radians

atanr(1) = 0.78

cos(number)

Returns the cosine in degrees

cos(180) = -1

cosh(number)

Returns the hyperbolic cosine in degrees

cosh(PI) = 11.591

cosr(number)

Returns the cosine in radians

cosr(PI) = -1

cot(number)

Returns the co-tangent in degrees

cot(45) = 1

coth(number)

Returns the hyperbolic co-tangent in degrees

coth(PI) = 1.003

cotr(number)

Returns the co-tangent in radians

cotr(0.785) = 1

csc(number)

Returns the co-secant in degrees

csc(270) = -1

csch(number)

Returns the hyperbolic co-secant in degrees

csch(3*PI/2) = 0.017

cscr(number)

Returns the co-secant in radians

cscr(3*PI/2) = -1

deg(number)

Converts an angle from radians to degrees

deg(0.785) = 45

rad(number)

Converts an angle from degrees to radians

rad(45) = 0.785

sin(number)

Returns the sine in degrees

sin(150) = 0.5

sinh(number)

Returns the hyperbolic sine in degrees

sinh(2.61) = 6.762

sinr(number)

Returns the sine in radians

sinr(2.61) = 0.5

sec(number)

Returns the secant in degrees

sec(120) = -2

sech(number)

Returns the hyperbolic secant in degrees

sech(2.09) = 0.243

secr(number)

Returns the secant in radians

secr(2.09) = -2

tan(number)

Returns the tangent in degrees

tan(360) = 0

tanh(number)

Returns the hyperbolic tangent in degrees

tanh(2*PI) = 1

tanr(number)

Returns the tangent in radians

tanr(2*PI) = 0

Table 32. Date/Time Functions
Function Description Example

format(string, string)

Formats a date/time with a given pattern as described in Date/Time

format('2023-11-28T20:46', "d MMM uuuu") = "28 Nov 2023"

now()

Gets the current datetime

now() = 2023-12-06T09:27:21.123456Z

to_date(string)

Parses the input pattern as described in Date/Time

to_date("2023-12-08T10:30:00Z")

to_date(number, number, number, number?, number?)

When providing a set of integers, year, month and day are required; hour, minute and second are optional. The parameters must be provided in that order

to_date(2023, 12, 8, 10, 30) = "2023-12-08T10:30:00Z"
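
For illustration, the integer overload of to_date behaves like the following Python analogue (an assumption based on the example above, not the actual implementation):

```python
from datetime import datetime, timezone

# Python analogue of the Expression Language's
# to_date(year, month, day, hour?, minute?, second?) overload.
def to_date(year, month, day, hour=0, minute=0, second=0):
    # Omitted time components default to zero, and the result is in UTC.
    return datetime(year, month, day, hour, minute, second, tzinfo=timezone.utc)

stamp = to_date(2023, 12, 8, 10, 30).strftime("%Y-%m-%dT%H:%M:%SZ")
# stamp == "2023-12-08T10:30:00Z"
```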

Table 33. Logic Functions
Function Description Example

if(boolean, any, any)

Conditional operation where if the boolean expression evaluates to true, the second parameter is returned. Otherwise, the third parameter is returned

if(TRUE, 5+1, 6+2) = 6

not(boolean)

Negates a boolean expression

not(TRUE) = false

Table 34. String Functions
Function Description Example

lower(string)

Converts String to lower case

lower("THIS IS A TEST") = "this is a test"

upper(string)

Converts String to upper case

upper("this is a test") = "THIS IS A TEST"

starts_with(string, string, boolean?)

Returns a boolean indicating whether a string begins with a given substring. Case sensitivity can optionally be specified as a third parameter; if omitted, it defaults to true

starts_with("This is a test", "This", true) = true

ends_with(string, string, boolean?)

Returns a boolean indicating whether a string ends with a given substring. Case sensitivity can optionally be specified as a third parameter; if omitted, it defaults to true

ends_with("This is a test", "Test", false) = true

regex(string, pattern)

Returns a boolean specifying if a string matches a given pattern

regex("This is a Test", ".t") = true

replace_all(string, string, string)

Replaces all appearances of a regex pattern in the first string with the third string. Regex metacharacters such as . must be escaped in the pattern

replace_all("This.is.a.test", "\.", "#") = This#is#a#test

is_empty(string)

Returns a boolean specifying if a String is empty

is_empty("") = true

is_blank(string)

Whether the variable is a blank String

is_blank("") = true

size(string)

Returns the length of a given string

size("Test") = 4

concat(string, ...)

Concatenates a given set of strings

concat("This", "is", "a test") = Thisisa test

split(string, pattern)

Splits a String into a List by a regex value

split("This,is,a,test", ",") = ["This", "is", "a", "test"]

strip(string)

Strips punctuation, replacing it with a space

strip("This.is.a.test") = This is a test

contains(string, string)

Returns a boolean specifying if the first string contains the second one

contains("Hello World!", "World") = true

uuid()

Generates a random UUID v4

uuid() = "7583bc66-60a4-4ce5-a64d-15f245d52027"

to_number(string)

Returns the number represented by a string

to_number("5.2") = 5.2
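Several of the string functions above have direct Python analogues. For instance, the default case-sensitivity flag of starts_with and ends_with can be sketched as follows (the function names mirror the expression language; this is an illustration, not the engine's implementation):

```python
def starts_with(value, prefix, case_sensitive=True):
    """Check a prefix; the case-sensitivity flag defaults to true."""
    if not case_sensitive:
        value, prefix = value.lower(), prefix.lower()
    return value.startswith(prefix)

def ends_with(value, suffix, case_sensitive=True):
    """Check a suffix; the case-sensitivity flag defaults to true."""
    if not case_sensitive:
        value, suffix = value.lower(), suffix.lower()
    return value.endswith(suffix)
```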

Table 35. Hash Functions
Function Description Example

md5(any)

Hashes a given object using MD5

md5("This is a Test") = "2e674a93..."

sha256(any)

Hashes a given object using SHA-256

sha256("This is a Test") = "401b022b962452749..."
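The truncated examples above are standard hex digests. A quick Python sketch illustrating the expected digest sizes (hex encoding of the output is an assumption based on the examples):

```python
import hashlib

text = "This is a Test"
md5_hex = hashlib.md5(text.encode()).hexdigest()        # 32 hex characters
sha256_hex = hashlib.sha256(text.encode()).hexdigest()  # 64 hex characters
```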

Table 36. List Functions
Function Description Example

is_empty(array)

Returns a boolean specifying if a list is empty

is_empty(array) = true, where array = []

size(array)

Returns the number of items in the list

size(array) = 3, where array = [a, b, c]

contains(array, any)

Returns a boolean specifying if the list contains the value

contains(array, 3) = true, where array = [1, 2, 3]

get(array, number)

Returns the value in the given index

get(array, 1) = b, where array = [a, b, c]

concat(array, ...)

Joins a given set of arrays into one

concat(array1, array2, array3) = [a, b, c], where array1 = [a], array2 = [b], array3 = [c]

merge(dst_array, src_array)

Merges the source array into the destination array

merge(dst_array, src_array) = ["A", 2, "A2", 2], where dst_array = ["A", 2] and src_array = ["A2", 2]

Note

An alternative syntax to access the value at position i of a given array is array[i]

Table 37. Map Functions
Function Description Example

is_empty(map)

Returns a boolean specifying if a map is empty

is_empty(map) = true, where map = [<,>]

size(map)

Returns the number of items in the map

size(map) = 3, where map = [<a,1>, <b,2>, <c,3>]

contains(map, any)

Returns a boolean specifying if the map contains the key

contains(map, a) = true, where map = [<a,1>, <b,2>, <c,3>]

get(map, any)

Returns the value with the given key

get(map, b) = 2, where map = [<a,1>, <b,2>, <c,3>]

merge(dst_map, src_map)

Merges the source map into the destination map. If a key exists in both maps, the value from the source map overwrites the destination value.

merge(dst_map, src_map) = {"name":"Bob","age":35,"email":"bob@ex.com"}, where dst_map = {"name":"Bob","age":30} and src_map = {"age":35,"email":"bob@ex.com"}
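The merge semantics match a plain dictionary merge in which the source wins on key conflicts; in Python terms:

```python
dst_map = {"name": "Bob", "age": 30}
src_map = {"age": 35, "email": "bob@ex.com"}

# The source map overwrites duplicate keys in the destination map.
merged = {**dst_map, **src_map}
# merged == {"name": "Bob", "age": 35, "email": "bob@ex.com"}
```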

Table 38. JSON Functions
Function Description Example

jsonpath(json, string)

Finds a specific value within a JSON with a JSONPath string

jsonpath(data('path/to/json'), '$.some.path')

to_json(string)

Converts a String into its corresponding JSON representation

to_json('{ "key": "value" }')

Table 39. Discovery Functions
Function Description Example

file(string, string?)

Reads a file from the storage. The file can optionally be obtained as a byte array representation by sending BYTES as the second parameter; if not, it defaults to STRING. The result of this function is cached, and the cache is refreshed whenever the file changes

file('my-file.txt', 'STRING') = "plain text"

Script Engine

The Script Engine enables the execution of scripts for advanced handling of execution data. It supports multiple scripting languages and provides tools for JSON manipulation and logging:

Bindings

Each script has bindings to interact with the execution context where it runs:

data

(Object) Allows the creation and manipulation of JsonNode instances. If the script runs as part of Discovery Ingestion or Discovery QueryFlow, the binding also exposes the corresponding context data.

Details
Method Description

ArrayNode arrayNode()

Returns a new JSON array

JsonNode get(String)

Obtains a deep copy of a node from the data generated during the execution, given the path to the value or node. The input must be the JSON Pointer of the data field, for example: data.get("/myfield")

NullNode nullNode()

Returns a new null node

ObjectNode objectNode()

Returns a new JSON object

JsonNode output()

Obtains the value that will be used as output for the component

JsonNode parseJson(String)

Takes a JSON with a String format and parses it into a JSON document

JsonNode valueToJson(Object)

Takes any object and tries to convert it into a JSON document

void set(parameter)

Sets the output field to a primitive type (integer, long, string, float, double) or a JsonNode. The method infers the type of the parameter in dynamically typed languages such as Python. If the use case needs it, an explicit cast can help control the output node

byte[] file(String)

Reads files or binary data from the context data or the File Storage and returns the contents. Returns null if not found

log

(Object) Exposes an SLF4J logger that can log messages directly into the application

Python script example
value = 5
if value <= 10:
  log.error("Example of an error")
Groovy script example
var response = data.objectNode();
response.put("intValue", 3);

log.info("Node set with value: " + data.output().get("intValue").asText());
JavaScript script example
var requestBody = data.get("/numberTest").asInt();
data.set(requestBody);

Template Engine

The Template Engine, provided by Freemarker, acts as a blueprint that uses a given input to generate various types of documents as either plain text or JSON.

Supported in both Discovery Ingestion and Discovery QueryFlow, it can take a standard template and process it with contextual structured data to create a verbalized representation of the information.

Consider the following input data:

{
  "name": "mary",
  "users": [
    {
      "name": "Jane Doe",
      "id": 0
    },
    {
      "name": "Mary",
      "id": 2
    },
    {
      "name": "Alice",
      "id": 3
    }
  ]
}

When processed with the following template:

Hello, ${name?capitalize}!

Users registered:
<#list users as user>
  <#if user.id == 0>
    name: admin, ID: ${user.id}
  <#else>
    Name: ${user.name}, ID: ${user.id}
  </#if>
</#list>

Then, the output would be:

Hello, Mary!

Users registered:
name: admin, ID: 0
Name: Mary, ID: 2
Name: Alice, ID: 3
  1. Hello, ${name?capitalize}! outputs a greeting to the name specified in the JSON data, capitalizing the first letter. Given the JSON data, it outputs Hello, Mary! because mary is now capitalized.

  2. <#list users as user> is a directive that iterates over the users array in the JSON data.

  3. <#if user.id == 0>, within the list, checks whether the user’s id is 0. If true, it outputs name: admin, ID: 0 instead of using the user’s actual name.

  4. <#else> applies to all other users (where id is not 0) and outputs Name: ${user.name}, ID: ${user.id}.

Template Language

Placeholders

Placeholders are references to the data model passed to the template.

Syntax:

${variableName}

Example data model:

{
  "name": "Mary"
}

Example:

${name}

Output: Mary

Comments

Comments are a way to add notes or explanations within your templates.

Syntax:

<#-- Comment -->

Example:

<#-- Hello this is my comment -->
Directives

These are instructions that control the processing flow of the template (like loops and conditionals). The full list of directives can be found in the Directive reference of the Freemarker documentation.

Assign

Used to define a variable.

Syntax:

<#assign name1=value1>

Example:

<#assign x=1>
Attempt, Recover

Used for error handling in templates. attempt executes code that might fail, and recover defines what to do if an error occurs in the attempt block.

Syntax:

<#attempt>
  attempt block
<#recover>
  recover block
</#attempt>

Example:

<#attempt>
  ${user.name}
<#recover>
  Unknown User
</#attempt>
Function, Return

Used to create a method variable. The return directive inside it specifies the value that the method returns.

Syntax:

<#function name param1 param2 ... paramN>
  ...
<#return returnValue>
  ...
</#function>

Example:

<#function avg x y>
  <#return (x + y) / 2>
</#function>

${avg(10, 20)}

Output: 15.

Global

Used to define a variable for all namespaces.

Syntax:

<#global name=value>

Example:

<#global x=1>
If, else, elseif

Used to conditionally skip a section of the template.

Syntax:

<#if condition>
  ...
<#elseif condition2>
  ...
<#else>
  ...
</#if>

Example:

<#if x == 1>
  x is 1
<#elseif x == 2>
  x is 2
<#else>
  x is not 1 nor 2
</#if>
Import

Used to bring all macros and functions from another template file into the current template namespace. path is the path of the template file to import, and hash is the name of the variable by which you can access the namespace.

Syntax:

<#import path as hash>

Example:

<#import "library.ftl" as lib>
Include

Used to include the content of another template file in the current template output. It does not make macros or functions from the included file available in the current namespace. path is the path of the template file to include.

Syntax:

<#include path>

Example:

<#include "header.ftl">
List

Used for iterating over a collection.

  • else: Within a list, it is used to specify output if the list is empty.

  • items: Used inside a list to mark the part that is repeated for each item.

  • sep: Used to output something between items, like a separator.

  • break: Exits the loop prematurely.

  • continue: Skips the current iteration and moves to the next item.

Syntax:

<#list sequence as item>
  Part repeated for each item
</#list>

Example:

<#list users as user>
  ${user.name}
<#else>
  No users found.
</#list>
Macro

Used to define a reusable block of template code.

Syntax:

<#macro name param1 param2 ... paramN>
  ...
</#macro>

Example:

<#macro test>
  Test text
</#macro>

<#-- call the macro: -->
<@test/>

Output: Test text

File Storage

Files API
Upload a File
$ curl --request PUT 'core-api:12010/v2/file/my/key/to/my/file' --form 'file=@"/../../test.txt"'
Retrieve a File
$ curl --request GET 'core-api:12010/v2/file/my/key/to/my/file'
List all Files
$ curl --request GET 'core-api:12010/v2/file'
Delete a file
$ curl --request DELETE 'core-api:12010/v2/file/my/key/to/my/file'

The File Storage handles files with the help of a dedicated "folder" in the Object Storage.

File names can be constructed as nested paths, where a trailing slash on each sub path marks that sub path as a parent/folder. The name must follow these rules:

  • Parent names can contain alphanumeric characters (i.e. [A-Z], [a-z], [0-9]), hyphen (i.e. -), underscore (i.e. _) and spaces.

  • Character quantity must range from 1 to 255.

  • A nested path can consist of up to 10 levels.
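A client can validate a key against these rules before uploading. A minimal sketch, assuming the character rules apply to every path segment:

```python
import re

# Alphanumerics, hyphen, underscore and spaces, 1 to 255 characters per segment.
SEGMENT = re.compile(r"^[A-Za-z0-9_\- ]{1,255}$")

def is_valid_key(key):
    """Check a File Storage key: up to 10 levels of valid segments."""
    parts = key.split("/")
    return 1 <= len(parts) <= 10 and all(SEGMENT.fullmatch(p) for p in parts)
```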

Note

When executing the endpoints to upload or delete files, the Expression Language is notified about the changes to clear its internal cache.

Discovery Staging

Discovery Staging is a REST API on top of a Document Database. Its goal is to simplify and standardize the interactions of all Discovery products with any supported provider that can handle JSON content, while enabling features for the final user.

Supported Providers

MongoDB

As the NoSQL industry standard for document storage, with a MongoDB Atlas managed service available in the marketplace of every major cloud provider, MongoDB is the default Document Database provider for Discovery Staging in all Discovery installations.

Amazon DocumentDB

Amazon DocumentDB (with MongoDB compatibility) is a good alternative Document Database provider for Discovery Staging in installations fully managed by AWS.

Azure DocumentDB

Azure DocumentDB is a MongoDB compatible engine that supports hybrid and multicloud architectures with enterprise-grade performance, availability, and easy Azure AI integration.

Content

Content API
Get a single document
$ curl --request GET 'staging-api:12020/v2/content/{bucketName}/{contentId}?action={action}&include={include}&exclude={exclude}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

contentId

(Required, String) The document ID.

Query Parameters
action

(Optional, String) The action to filter the documents by. Defaults to STORE.

include

(Optional, Array of Strings) The fields of the document’s content to include in the response.

exclude

(Optional, Array of Strings) The fields of the document’s content to exclude from the response.

Store
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/{contentId}?parentId={parentId}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

contentId

(Required, String) The document ID.

Query Parameters
parentId

(Optional, String) The parent ID of the documents.

Details
Note

This endpoint is capable of updating an existing document by using the contentId of the original document

The final content must not exceed the maximum size supported by the chosen provider. Exceeding the limit is reported as a 413 error in the Staging API’s response.

Documents are stored with metadata. This adds extra size to the final document on top of the content in the request body, so it is recommended to keep the body safely below the provider’s limit.

The following table details maximum provider limits:

Provider Limit

MongoDB

BSON size limit (~16 MB / 16793600 bytes)

Amazon DocumentDB

BSON size limit (~16 MB / 16793600 bytes)

Azure DocumentDB

BSON size limit (~16 MB / 16793600 bytes)

Delete
$ curl --request DELETE 'staging-api:12020/v2/content/{bucketName}/{contentId}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

contentId

(Required, String) The document ID.

Delete multiple documents
$ curl --request DELETE 'staging-api:12020/v2/content/{bucketName}?parentId={parentId}' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Query Parameters
parentId

(Optional, String) The parent ID of the documents.

Body

The body payload is an optional DSL Filter to apply to the delete

Note

The body and the parentId parameter are both optional, but to avoid deleting all documents, at least one of them must be included.

Scroll
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/scroll?token={token}&parentId={parentId}&size={size}&action={action}' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Query Parameters
token

(Optional, Hex String) The token to paginate the documents.

parentId

(Optional, String) The parent ID of the documents.

size

(Optional, Int) The number of documents to scroll. Defaults to 25.

action

(Optional, Array of String) The actions to filter the documents. Defaults to STORE.

Body

The body payload is an optional DSL Filter and an optional DSL Projection to apply to the scroll

{
  "fields": <Projection DSL>,
  "filters": <Filter DSL>
}
Details
Note

The scroll functionality is meant for iterating through all the documents in a bucket, based on the filters and projections applied. Sorting is not available, since order is not relevant when scrolling.
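The iteration pattern amounts to a small client-side loop. In the sketch below, fetch stands in for the POST to the scroll endpoint, and the token and items response fields are assumed names for illustration:

```python
def scroll_all(fetch, bucket, size=25):
    """Drain a bucket through the scroll endpoint, page by page.

    fetch(bucket, token, size) is expected to perform the POST and
    return the parsed JSON response (hypothetical shape).
    """
    token = None
    while True:
        page = fetch(bucket, token, size)
        items = page.get("items", [])
        if not items:
            return
        yield from items
        token = page.get("token")
```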

Search
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/search?parentId={parentId}&action={action}&page={page}&size={size}&sort={sort}' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Query Parameters
parentId

(Optional, String) The parent ID of the documents.

action

(Optional, Array of String) The actions to filter the documents. Defaults to STORE.

page

(Optional, Int) The page number. Defaults to 0.

size

(Optional, Int) The size of the page. Defaults to 20.

sort

(Optional, Array of String) The sort definition for the page.

Body

The body payload is an optional DSL Filter and an optional DSL Projection to apply to the search

{
  "fields": <Projection DSL>,
  "filters": <Filter DSL>
}
Note

The search functionality should be used when it is necessary to sort a collection and retrieve pages with the top results based on the provided sorting criteria. It also allows applying filters and defining which fields to include or exclude in the retrieved documents. It should not be used in place of the scroll endpoint: search is meant to work as a query-and-match feature, implemented with the tools offered by the provider, in order to return the most relevant results.

For both the search and scroll endpoints, all query-string parameters, as well as the body, are optional. The action parameter can be specified multiple times to contain multiple values.

The fields and filters definitions apply only to the fields within the content field of the items in the bucket.

The content of the bucket is the data, stored as JSON.

Buckets

Buckets API
Get All
$ curl --request GET 'staging-api:12020/v2/bucket'
Get information
$ curl --request GET 'staging-api:12020/v2/bucket/{bucketName}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Delete
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Purge
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}/purge'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Note

Only purges the documents with the DELETE action.

Delete Index
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}/index/{indexName}'
Path Parameters
bucketName

(Required, String) The name of the bucket.

indexName

(Required, String) The name of the index.

Create an index
$ curl --request PUT 'staging-api:12020/v2/bucket/{bucketName}/index/{indexName}' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket.

indexName

(Required, String) The name of the index.

Body
[
  { "fieldA": "ASC" },
  { "fieldB": "DESC" },
  ...
]
Note

An empty or null body is also allowed. In that case, an ascending index is created with the index name as the field name.

Create a bucket
$ curl --request POST 'staging-api:12020/v2/bucket/{bucketName}' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket.

Body
{
  "indices": [
    {
      "name": "myIndexA",
      "fields": [
        { "fieldA": "ASC" },
        { "fieldB": "DESC" },
        ...
      ]
    },
    ...
  ],
  "config":{}
}
Rename a bucket
$ curl --request POST 'staging-api:12020/v2/bucket/{bucketName}/rename' --data '{ ... }'
Path Parameters
bucketName

(Required, String) The name of the bucket to rename.

Body
{
  "name": "new-name",
  "allowOverride": false
}
name

(Required, String) The new name of the bucket.

allowOverride

(Optional, Boolean) Whether a bucket can be overridden if there is already one with the new name. Defaults to false.

Note
This endpoint is not supported on Amazon DocumentDB elastic clusters. See the Architecture section.

A bucket is a complete collection of data. Several operations can be performed on a bucket, where the results vary depending on the user input from the HTTP request.

Metadata description
{
  "name": "<Text>",
  "documentCount": {
    "STORE": "<Number>",
    "DELETE": "<Number>"
  },
  "content": {
    "oldest": "<StagingDocument>",
    "newest": "<StagingDocument>"
  },
  "indices": [
    {
      "name": "<Text>",
      "fields": [
        {
          "fieldA": "ASC|DESC"
        }
      ]
    }
  ]
}
Property Type Description

name

Text

The bucket name

documentCount

JSON Object

The total documents in the bucket, divided by action

documentCount.STORE

Number

The number of documents currently in the bucket with a STORE action type

documentCount.DELETE

Number

The number of documents currently in the bucket with a DELETE action type

content

JSON Object

The content of the bucket, including the oldest and newest documents

content.oldest

Staging Document

The oldest document in the bucket

content.newest

Staging Document

The newest document in the bucket

indices

JSON Array

Array with the name and fields of every index in the bucket

Index description

Property Type Description

name

Text

The index name

fields

Key/Value Pair Array

The fields used for the index. The key of every element is the name of the field, and the value is its sort direction (ASC or DESC). Ascending by default.

Note

All indices are over the content of the document

Note

When creating an index, if any field is duplicated, the last value specified for that field takes precedence.

Note

As the value of a field, apart from ASC or DESC for the sort direction, you can also use:

  • 0 → ASC

  • 1 → DESC
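Putting the two notes together, resolving an index body can be sketched as a last-wins merge that also maps the numeric aliases (the helper name is illustrative, not part of the API):

```python
def normalize_index_fields(specs):
    """Resolve index fields: duplicates keep the last value, and the
    numeric aliases 0/1 map to ASC/DESC."""
    aliases = {0: "ASC", 1: "DESC"}
    resolved = {}
    for spec in specs:
        for field, direction in spec.items():
            resolved[field] = aliases.get(direction, direction)
    return resolved
```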

Discovery Ingestion

Discovery Ingestion is a fully-featured extract, transform, and load (ETL) tool that orchestrates the communication with external services while applying data enrichment to the records detected in the given data source.

Data Seed

Seeds API
Create a new Seed
$ curl --request POST 'ingestion-api:12030/v2/seed' --data '{ ... }'
Start the execution of an existing Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}?scanType={scan-type}' --data '{ ... }'
Query Parameters
scanType

(Required, String) The scan type for the seed execution. Either FULL or INCREMENTAL

Body

The body payload is the execution properties, which override the ones configured in the Seed

Halt all executions of a Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/halt'
List all Seeds
$ curl --request GET 'ingestion-api:12030/v2/seed'
Get a single Seed
$ curl --request GET 'ingestion-api:12030/v2/seed/{id}'
Update an existing Seed
$ curl --request PUT 'ingestion-api:12030/v2/seed/{id}' --data '{ ... }'
Note

The type of an existing seed can’t be modified.

Reset the metadata of an existing Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/reset'
Delete an existing Seed
$ curl --request DELETE 'ingestion-api:12030/v2/seed/{id}'
Clone an existing Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Seed

Search for Seeds using DSL Filters
$ curl --request POST 'ingestion-api:12030/v2/seed/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Seeds
$ curl --request GET 'ingestion-api:12030/v2/seed/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search

A seed defines the data source configuration and the pipeline that each record follows through the finite-state machine during processing.

{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  ...
}
type

(Required, String) The name of the component to execute.

name

(Required, String) The unique name to identify the configuration.

description

(Optional, String) The description for the configuration.

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language.

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.

Details
{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}
id

(Required, UUID) The ID of the server configuration for the integration.

credential

(Optional, UUID) The ID of the credential to override the default authentication in the external service.

pipeline

(Required, UUID) The ID of the pipeline configuration for all detected records.

recordPolicy

(Optional, Object) The global configuration for the seed execution.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    ...
  },
  ...
}
timeoutPolicy

(Optional, Object) The policy for handling timeouts during the scan of records.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "timeoutPolicy": {
      ...
    },
    ...
  },
  ...
}
slice

(Optional, Duration) The timeout for the scan of each slice of records. Defaults to 1h.

errorPolicy

(Optional, String) The error policy for scanned records. Defaults to FATAL.

  • FATAL: A single failed document aborts the complete process

  • IGNORE: Ignores the record scan error and the Seed Execution continues

outboundPolicy

(Optional, Object) The policy for groups of records sent for processing within the finite-state machine. Applied by default to outbound records (i.e., records sent to the pipeline).

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "outboundPolicy": {
      ...
    },
    ...
  },
  ...
}
idPolicy

(Optional, Object) The policy for generating the record IDs. If not provided, the plain ID of the record is used.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "idPolicy": {
        ...
      }
    },
    ...
  },
  ...
}
generator

(Optional, String) The expression that represents the ID used for the scanned records.

batchPolicy

(Optional, Object) The batch policy for outbound batches of records.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        ...
      }
    },
    ...
  },
  ...
}
maxCount

(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.

flushAfter

(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.
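The interplay of maxCount and flushAfter amounts to flushing on whichever condition is met first. A sketch of that behaviour (class and method names are illustrative, not the platform API):

```python
import time

class OutboundBatch:
    """Sketch of the outbound batch policy: a batch is flushed once it
    holds max_count records, or flush_after seconds after it was opened,
    whichever happens first."""

    def __init__(self, max_count=25, flush_after=60.0, clock=time.monotonic):
        self.max_count = max_count
        self.flush_after = flush_after
        self.clock = clock
        self.records = []
        self.opened = None

    def add(self, record):
        """Add a record; return the flushed batch if a condition was met."""
        if not self.records:
            self.opened = self.clock()
        self.records.append(record)
        return self.flush_if_due()

    def flush_if_due(self):
        """Flush when the count or timeout condition is met, else None."""
        if not self.records:
            return None
        count_due = len(self.records) >= self.max_count
        time_due = self.clock() - self.opened >= self.flush_after
        if not (count_due or time_due):
            return None
        batch, self.records = self.records, []
        return batch
```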

snapshotPolicy

(Optional, Object) The configuration for the incremental scan feature.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "recordPolicy": {
    "snapshotPolicy": {
      ...
    },
    ...
  },
  ...
}
checksumExpression

(Optional, String) The expression to determine if the record has changed during an incremental scan by checksum. Defaults to all the fields in the document.

beforeHooks

(Optional, Object) The Hooks to execute before starting the record processing.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "beforeHooks": {
    "hooks": [
      ...
    ],
    "timeout": "60s",
    "errorPolicy": "IGNORE"
  },
  ...
}
hooks

(Required, Array of Objects) The list of Hooks to execute

Details
{
  "hooks": [
    {
      "id": <Hook ID>,
      ...
    }
  ],
  "timeout": "60s",
  "errorPolicy": "IGNORE"
}
id

(Required, UUID) The ID of the Hook to execute

errorPolicy

(Optional, String) Overrides the global policy for errors during the execution of the Hook

  • FATAL: A single failed hook aborts the complete process

  • IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Optional, Duration) Overrides the global timeout for the execution of the Hook

active

(Optional, Boolean) false to disable the execution of the Hook

errorPolicy

(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE

  • FATAL: A single failed hook aborts the complete process

  • IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s

afterHooks

(Optional, Object) The Hooks to execute after completing the record processing.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "afterHooks": {
    "hooks": [
      ...
    ],
    "timeout": "60s",
    "errorPolicy": "IGNORE"
  },
  ...
}
hooks

(Required, Array of Objects) The list of Hooks to execute

Details
{
  "hooks": [
    {
      "id": <Hook ID>,
      ...
    }
  ],
  "timeout": "60s",
  "errorPolicy": "IGNORE"
}
id

(Required, UUID) The ID of the Hook to execute

errorPolicy

(Optional, String) Overrides the global policy for errors during the execution of the Hook

  • FATAL: A single failed hook aborts the complete process

  • IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Optional, Duration) Overrides the global timeout for the execution of the Hook

active

(Optional, Boolean) false to disable the execution of the Hook

errorPolicy

(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE

  • FATAL: A single failed hook aborts the complete process

  • IGNORE: Ignores the hook error and the Seed Execution continues

timeout

(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the seed itself, in processors and in hooks.

Details
{
  "type": "my-component-type",
  "name": "My Component Seed",
  "config": {
    "myProperty": "#{ seed.properties.keyA }"
  },
  "properties": {
    "keyA": "valueA"
  },
  "pipeline": <Pipeline ID>,
  ...
}
{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ seed.properties.keyA }"
  },
  ...
}
labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Records

Records API
List all Records of a given Seed
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/record'
Get a Record by Seed and ID
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/record/{recordId}'
Get the summary of records from a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/record/summary'

Seed Records reflect the status and parent-child relationship of records during a specific seed’s latest execution.

Each seed record in a given seed is identifiable by combining its seed, its parent plainId (if any), and its own plainId assigned at scan time. These IDs are also used to generate a more systems-friendly ID for each record, known as the hashId: the SHA-256 algorithm is applied to the combination of the parent plainId and the record plainId, and the result is encoded with padded Base64URL to make it safe to use in URLs.
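The derivation can be reproduced in Python. For a record without a parent, the input is just the record's own plainId; the exact concatenation used when a parent exists is an assumption here:

```python
import base64
import hashlib

def hash_id(plain_id, parent_plain_id=""):
    """SHA-256 the parent plainId plus the record plainId, then apply
    padded Base64URL encoding to the digest."""
    digest = hashlib.sha256((parent_plain_id + plain_id).encode()).digest()
    return base64.urlsafe_b64encode(digest).decode()

# hash_id("1") reproduces the hash of the example record shown below.
```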

{
  "id": {
    ...
  },
  "creationTimestamp": "2025-04-22T16:51:44Z",
  "lastUpdatedTimestamp": "2025-04-22T16:51:44Z",
  "parent": "FOMM0WPHMpEuBIxMg34VxOkMBi67eVq5R9V3BuLRDdg=",
  "status": "FAILURE",
  "errors": [
    ...
  ]
}
id

(Object) The ID of the record.

Details
{
  "plain": "1",
  "hash": "a4ayc_80_OGda4BO_1o_V0etpOqiLx1JwB5S3beHW0s="
}
plain

(String) The ID of the record before hashing it.

hash

(String) The ID of the record as a Base64URL string.

creationTimestamp

(Timestamp) The timestamp when the record was created.

lastUpdatedTimestamp

(Timestamp) The timestamp when the record was last updated.

parent

(String) The parent record’s id as a Base64URL string.

status

(String) The status of the record.

Details
  • SUCCESS: The record was successfully processed.

  • FAILURE: The record reported errors during its processing.

  • QUARANTINE: The record has been processed too many times and should not be processed again.

errors

(Array of Objects) The record’s errors, if any.

Details

Each item in the errors array represents an individual error encountered during record processing.

{
  ...,
  "errors": [
    {
      "id": "a08aa41c-9d5c-436f-94b0-e3ac19b805d3",
      "error": {
        "code": 4001,
        "status": 409,
        "messages": [
            "com.pureinsights.pdp.core.CoreException: Script execution failed: java.lang.Exception: Error"
        ],
        "timestamp": "2025-06-04T05:18:42.071218Z"
      },
      "retry": 0
    },
    ...
  ],
  ...
}
id

(String) The ID of the error document.

error

(Object) Contains the detailed error information.

Details
code

(Integer) The internal code for the error, helpful for identifying the type of failure.

status

(Integer) The associated HTTP status code, indicating the general category of the failure.

messages

(List of strings) A list of one or more error messages giving a detailed description of what went wrong.

timestamp

(Timestamp) The time when the error occurred.

retry

(Integer) The number of times the system has retried processing the record after the error occurred.

Schedule

Schedules API
Create a new Schedule
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule' --data '{ ... }'
List all Schedules
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule'
Get a single Schedule
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule/{id}'
Update an existing Schedule
$ curl --request PUT 'ingestion-api:12030/v2/seed/schedule/{id}' --data '{ ... }'
Delete an existing Schedule
$ curl --request DELETE 'ingestion-api:12030/v2/seed/schedule/{id}'
Clone an existing Schedule
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Schedule.

Search for Schedules using DSL Filters
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Schedules
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

A Schedule executes a Seed automatically according to a cron-based expression.

{
  "name": "My Schedule",
  "expression": "1 * * * *",
  "seed": <Seed ID>,
  "scanType": "FULL",
  ...
}
name

(Required, String) The unique name to identify the schedule.

description

(Optional, String) The description for the configuration.

expression

(Required, String) The cron expression in UNIX format that defines when the Seed must be executed.

seed

(Required, UUID) The ID of the Seed to execute.

scanType

(Required, String) The scan type for the seed execution. Either FULL or INCREMENTAL.

properties

(Optional, Object) The execution properties, which override the ones configured in the Seed.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.
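Putting the fields together, a Schedule that runs a full scan at minute 1 of every hour could be created as follows (the seed UUID and the label are placeholders):

```shell
curl --request POST 'ingestion-api:12030/v2/seed/schedule' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "Hourly Full Scan",
    "expression": "1 * * * *",
    "seed": "966f6b3f-7066-4fd5-8885-f45fef2fd59d",
    "scanType": "FULL",
    "labels": [{ "key": "env", "value": "staging" }]
  }'
```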

Discovery Features for Data Seeds

Incremental Scan

The scan must be executed in one of two modes: FULL or INCREMENTAL.

While the former overrides any previous execution information and, given its nature, only identifies the CREATE action, the latter uses a snapshot from the previous execution to also identify UPDATE, DELETE and NO_CHANGE actions. The mechanism behind the identification varies depending on the Seed action:

  • Checksum - always scans the complete data source and selects each record’s action after a comparison based on a configurable checksum. This is always the case for the split operation.

  • Custom - the incremental scan mechanism is defined via the configuration of the Seed action.

  • None - the Seed action does not support incremental scans.

Distributed Scan

The scan phase of ingestion, as the starting point of the data processing, might need to identify a large number of records. This can cause performance problems due to the sequential nature of scanning, or failures due to timeouts.

In general, all Seed actions should handle this problem by internally slicing the full set of records into smaller pages that can be retrieved in a distributed environment. However, some implementations might not offer a proper pagination mechanism, and they will be forced to handle the complete scan as a single request.

Hierarchical Records

Some data sources might expose a parent-child relationship on their records. Whenever possible, this information will be captured and exposed through the Expression Language and the Record API.

Pipeline

Pipelines API
Create a new Pipeline
$ curl --request POST 'ingestion-api:12030/v2/pipeline' --data '{ ... }'
List all Pipelines
$ curl --request GET 'ingestion-api:12030/v2/pipeline'
Get a single Pipeline
$ curl --request GET 'ingestion-api:12030/v2/pipeline/{id}'
Update an existing Pipeline
$ curl --request PUT 'ingestion-api:12030/v2/pipeline/{id}' --data '{ ... }'
Delete an existing Pipeline
$ curl --request DELETE 'ingestion-api:12030/v2/pipeline/{id}'
Clone an existing Pipeline
$ curl --request POST 'ingestion-api:12030/v2/pipeline/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Pipeline.

Search for Pipelines using DSL Filters
$ curl --request POST 'ingestion-api:12030/v2/pipeline/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Pipelines
$ curl --request GET 'ingestion-api:12030/v2/pipeline/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

A pipeline is the definition of the finite-state machine for record processing:

{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    "stateA": {
      ...
    },

    "stateB": {
      ...
    }
  },
  ...
}
name

(Required, String) The unique name to identify the pipeline.

description

(Optional, String) The description for the configuration.

initialState

(Required, String) The state, as defined in the states field, to be used as starting point for new and updated records that need to be processed through the pipeline.

deleteState

(Optional, String) The state, as defined in the states field, to be used as starting point for deleted records that need to be processed through the pipeline. Although this field is not required, omitting it may lead to unexpected errors because deleted records are still processed by the pipeline, starting from the initialState. Since deleted records can differ in structure or content from newly created or updated records, it is recommended to process them in separate states that can handle deletion events.

states

(Required, Object) The states associated with the pipeline.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

recordPolicy

(Required, Object) The global record policies to be applied during the execution of a pipeline.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    ...
  },
  ...
}
idPolicy

(Optional, Object) The policy for handling the generation of records IDs.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "idPolicy": {
      ...
    }
  },
  ...
}
mask

(Optional, String) The expression that represents the ID of the record during its processing. If not provided, the plain ID of the record will be used.

retryPolicy

(Optional, Object) The policy for handling the records retries.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "retryPolicy": {
      ...
    }
  },
  ...
}
active

(Optional, Boolean) Whether a processor should be retried if it fails during execution. Defaults to true.

maxRetries

(Optional, Integer) The maximum number of retries for processing the record. The retries are executed from the point where the record failed. Defaults to 3.

timeoutPolicy

(Optional, Object) The policy for handling the records timeout.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "timeoutPolicy": {
      ...
    }
  },
  ...
}
record

(Optional, Duration) The timeout for each record during its processing. Defaults to 60s.

errorPolicy

(Optional, String) The error policy for records during their processing. Defaults to FAIL.

  • FATAL: A single failed document aborts the complete process

  • FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected

  • IGNORE: Ignores the record processing error and its execution continues

outboundPolicy

(Optional, Object) The custom policy for groups of records sent for processing to the next state within the finite-state machine.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "outboundPolicy": {
      ...
    }
  },
  ...
}
mode

(Optional, String) The output mode for processors that might be affected by operations like splitting. Either INLINE for a normal output where any "split" will be represented as an array, or SPLIT for children documents hierarchically related to the origin. Defaults to INLINE.

splitPolicy

(Optional, Object) The splitting policy for outbound batches of children records (where supported).

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        ...
      }
    }
  },
  ...
}
children

(Optional, Object) The configuration of the children records after the split of the parent.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        "children": {
          ...
        }
      }
    }
  },
  ...
}
idPolicy

(Optional, Object) The configuration of the ID for each of the new child records.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        "children": {
          "idPolicy": {
            ...
          }
        }
      }
    }
  },
  ...
}
generator

(Optional, Object) The expression that represents the ID of each child record. If not provided, the value assigned is composed of the parent record ID, followed by a colon, and an incremental number, i.e. <parentRecordId>:<incrementalNumber>.

batchPolicy

(Optional, Object) The batch policy for outbound batches of records, once their processor execution is completed.

Details
{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    ...
  },
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        ...
      }
    }
  },
  ...
}
maxCount

(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.

flushAfter

(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.
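As a consolidated sketch of the fields above (the processor UUIDs and state names are placeholders), a minimal two-state pipeline with an explicit record policy could be created as:

```shell
curl --request POST 'ingestion-api:12030/v2/pipeline' \
  --header 'Content-Type: application/json' \
  --data '{
    "name": "My Enrichment Pipeline",
    "initialState": "enrich",
    "states": {
      "enrich": {
        "type": "processor",
        "processors": [{ "id": "ba637726-555f-4c68-bfed-1c91f4803894" }],
        "next": "index"
      },
      "index": {
        "type": "processor",
        "processors": [{ "id": "a08aa41c-9d5c-436f-94b0-e3ac19b805d3" }]
      }
    },
    "recordPolicy": {
      "errorPolicy": "FAIL",
      "retryPolicy": { "active": true, "maxRetries": 3 },
      "outboundPolicy": { "batchPolicy": { "maxCount": 25, "flushAfter": "1m" } }
    }
  }'
```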

Processors

Processors API
Create a new Processor
$ curl --request POST 'ingestion-api:12030/v2/processor' --data '{ ... }'
List all Processors
$ curl --request GET 'ingestion-api:12030/v2/processor'
Get a single Processor
$ curl --request GET 'ingestion-api:12030/v2/processor/{id}'
Update an existing Processor
$ curl --request PUT 'ingestion-api:12030/v2/processor/{id}' --data '{ ... }'
Note

The type of an existing processor can’t be modified.

Delete an existing Processor
$ curl --request DELETE 'ingestion-api:12030/v2/processor/{id}'
Clone an existing Processor
$ curl --request POST 'ingestion-api:12030/v2/processor/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Processor

Search for Processors using DSL Filters
$ curl --request POST 'ingestion-api:12030/v2/processor/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search

Autocomplete for Processors
$ curl --request GET 'ingestion-api:12030/v2/processor/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search

Each component is stateless, and it’s driven by the configuration defined in the processor and by the context created by the current seed execution. This design makes the processor the main building block of Discovery Ingestion.

They are intended to solve very specific tasks, which makes them re-usable and simple to integrate into any part of the configuration.

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    ...
  },
  ...
}
type

(Required, String) The name of the component to execute

name

(Required, String) The unique name to identify the configuration

description

(Optional, String) The description for the configuration.

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.

Details
{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}
id

(Required, UUID) The ID of the server configuration for the integration.

credential

(Optional, UUID) The ID of the credential to override the default authentication in the external service.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.
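For example, a processor that references a server by ID could be created as follows (the component type, config property and server UUID are placeholders):

```shell
curl --request POST 'ingestion-api:12030/v2/processor' \
  --header 'Content-Type: application/json' \
  --data '{
    "type": "my-component-type",
    "name": "My Component Processor",
    "config": { "myProperty": "#{ seed.properties.keyA }" },
    "server": "ba637726-555f-4c68-bfed-1c91f4803894"
  }'
```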

Hooks

Hooks are a type of Processor, detached from record processing. They perform single pre- or post-actions associated with a long seed execution (e.g. creating indices, changing aliases…​).

There are two types of hooks: BEFORE_HOOK and AFTER_HOOK. Before hooks are executed at the beginning, and after hooks at the end, of the record processing of an execution.

Data Processing with a State Machine

State Types

Processor State

Executes a single or multiple processors in sequence:

{
  "myProcessorState": {
    "type": "processor",
    "processors": [
      ...
    ]
  }
}
type

(Required, String) The type of state. Must be processor.

processors

(Required, Array of Objects) The processors to execute.

Details
{
  "stateA": {
    "type": "processor",
    "processors": [
      {
        "id": <Processor ID>,
        ...
      }
    ],
    ...
  }
}
id

(Required, UUID) The ID of the processor to execute.

outputField

(Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component.

recordPolicy

(Optional, Object) The custom records configuration for the execution of the processor. Overrides the global one defined in the seed or pipeline. Can also be referred to as record.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    ...
  }
}
idPolicy

(Optional, Object) The policy for handling the records IDs.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "idPolicy": {
      ...
    }
  }
}
mask

(Optional, String) The expression that represents the ID of the record during its processing. If not provided, the plain ID of the record will be used.

timeoutPolicy

(Optional, Object) The policy for handling the records timeout.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "timeoutPolicy": {
      ...
    }
  }
}
record

(Optional, Duration) The timeout for each record during its processing.

errorPolicy

(Optional, String) The error policy for records during their processing. Defaults to FAIL.

  • FATAL: A single failed document aborts the complete process

  • FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected

  • IGNORE: Ignores the record processing error and its execution continues

retryPolicy

(Optional, Object) The policy for handling the records retries.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "retryPolicy": {
      ...
    }
  }
}
active

(Optional, Boolean) Whether the processor should be retried if it fails.

outboundPolicy

(Optional, Object) The custom policy for groups of records sent for processing to the next state within the finite-state machine.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      ...
    }
  }
}
mode

(Optional, String) The output mode for processors that might be affected by operations like splitting. Either INLINE for a normal output where any "split" will be represented as an array, or SPLIT for children documents hierarchically related to the origin. Defaults to INLINE.

splitPolicy

(Optional, Object) The splitting policy for outbound batches of children records (where supported).

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        ...
      }
    }
  }
}
children

(Optional, Object) The configuration of the children records after the split of the parent.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        "children": {
          ...
        }
      }
    }
  }
}
idPolicy

(Optional, Object) The configuration of the ID for each of the new child records.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        "children": {
          "idPolicy": {
            ...
          }
        }
      }
    }
  }
}
generator

(Optional, Object) The expression that represents the ID of each child record. If not provided, the value assigned is composed of the parent record ID, followed by a colon, and an incremental number, i.e. <parentRecordId>:<incrementalNumber>.

snapshotPolicy

(Optional, Object) The configuration for the incremental scan feature.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "splitPolicy": {
        "children": {
          "snapshotPolicy": {
            ...
          }
        }
      }
    }
  }
}
checksumExpression

(Optional, String) The expression to determine if the record has changed during an incremental scan. Defaults to all the fields in the document.

batchPolicy

(Optional, Object) The batch policy for outbound batches of records, once their processor execution is completed.

Details
{
  "id": <Processor ID>,
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        ...
      }
    }
  }
}
maxCount

(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.

flushAfter

(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.

active

(Optional, Boolean) false to disable the execution of the processor. Default is true.

next

(Optional, String) The next state of the finite-state machine after the completion of the current state. If not provided, the current one will be assumed as the final state.

onError

(Optional, String) The state to use as a fallback if the execution of the current state fails. If undefined, the current execution will complete with the corresponding error message.

mode

(Optional, Object) The execution mode for the configured processors.

Details
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "group",
      ...
    },
    ...
  }
}
type

(Required, String) The type of execution mode for the processors in the state. It can be group for a grouped execution of processors, split to split the records before the execution of the processors in the state, or collapse to specify an expression that represents the output of the state. Defaults to group.

Group mode
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "group"
    },
    ...
  }
}
Collapse mode
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "collapse",
      "output": {
        "title": "#{ concat(data('/title'), ' - ', data('/subtitle')) }",
        "description": "#{ data('/description') }"
      }
    },
    ...
  }
}
output

(Required, Object) The expression that represents the output of the state.

Split mode
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "split",
      "source": " #{data('/dataToSplit')} ",
      ...
    }
    ...
  }
}
source

(Required, Array of Objects) The array with the content to split.

splitPolicy

(Optional, Object) The splitting policy for outbound batches of children records.

Details
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "split",
      "source": " #{data('/dataToSplit')} ",
      "splitPolicy": {
        ...
      }
    }
    ...
  }
}
children

(Optional, Object) The configuration of the children records after the split of the parent.

Details
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "split",
      "source": " #{ data('/dataToSplit') } ",
      "splitPolicy": {
        "children": {
          ...
        }
      }
    }
    ...
  }
}
idPolicy

(Optional, Object) The configuration of the ID for each of the new child records.

Details
{
  "stateA": {
    "type": "processor",
    "mode": {
      "type": "split",
      "source": " #{data('/dataToSplit')} ",
      "splitPolicy": {
        "children": {
          "idPolicy": {
            ...
          }
        }
      }
    }
    ...
  }
}
generator

(Optional, Object) The expression that represents the ID of each child record. If not provided, the value assigned is composed of the parent record ID, followed by a colon, and an incremental number, i.e. <parentRecordId>:<incrementalNumber>.

The output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:

{
  "defaultFieldName": {
    "outputKey": "outputValue"
  }
}
Switch State

Use DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution given the first matching condition:

{
  "mySwitchState": {
    "type": "switch",
    "options": [
      ...
    ],
    "default": "myDefaultState"
  }
}
type

(Required, String) The type of state. Must be switch

options

(Required, Array of Objects) The options to evaluate in the state

Details
{
  "type": "switch",
  "options": [
    {
      "condition": {
        "equals": {
          "field": "/my/input/field",
          "value": "valueA"
        },
        ...
      },
      "state": "myFirstState"
    },
    ...
  ],
  ...
}
condition

(Required, Object) The predicate described as a DSL Filter over the JSON processing data

state

(Optional, String) The next state for the finite-state machine if the condition evaluates to true

default

(Optional, String) The default state for the finite-state machine if no option evaluates to true

Note

If no state for the finite-state machine is selected, the current one will be assumed as the final state.

Seed Execution

Seed Executions API
List all Seed Executions of a Seed
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution'
Get a single Seed Execution of a Seed
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}'
Halt an existing Seed Execution of a Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/halt'
List all Audit Log entries for an existing Seed Execution of a Seed
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/audit'
Get the Seed configuration for an existing Seed Execution of a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/seed'
Get a Pipeline configuration for an existing Seed Execution of a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/pipeline/{pipelineId}'
Get a Processor configuration for an existing Seed Execution of a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/processor/{processorId}'
Get a Server configuration for an existing Seed Execution of a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/server/{serverId}'
Get a Credential configuration for an existing Seed Execution of a Seed
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/credential/{credentialId}'
Get the summary of jobs from a seed execution
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/job/summary'
Get the summary of records from a seed execution
$ curl --request GET  'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/record/summary'
{
  "id": "1ed146d8-e5d8-49df-9b65-b9f6396183ff",
  "creationTimestamp": "2025-03-13T08:59:15Z",
  "lastUpdatedTimestamp": "2025-03-13T09:46:59Z",
  "triggerType": "MANUAL",
  "status": "DONE",
  "scanType": "FULL",
  "stages": [
      "BEFORE_HOOKS",
      "INGEST",
      "AFTER_HOOKS"
  ]
}
id

(UUID) A unique ID that identifies the seed execution

creationTimestamp

(Timestamp) The timestamp when the execution was triggered

lastUpdatedTimestamp

(Timestamp) The timestamp when the execution was last updated

triggerType

(String) The origin that triggered the execution. Currently, only MANUAL

status

(String) The status of the execution

Details
  • CREATED: The seed has been triggered, but the execution has not started

  • RUNNING: The seed is being executed

  • HALTING: The seed execution received a HALT request, but some processing might still be happening

  • HALTED: The seed execution is completely halted

  • DONE: The seed completed its execution successfully

  • FAILED: The seed failed during its execution

scanType

(String) The scan type for the execution. Either FULL or INCREMENTAL

stages

(Array of Strings) The completed stages of the execution

Details
  • BEFORE_HOOKS: The hooks before record data processing (if any).

  • INGEST: The record data processing

  • AFTER_HOOKS: The hooks after record data processing (if any).

The Seed Execution represents a single run of a Seed. It is updated as each stage of the execution completes.

When the user starts the execution of a seed, a copy of every user-created configuration related to the seed is stored for use during the entirety of the seed execution:

  • The configuration of the seed being executed

  • The configuration of any hook that’s used by the seed in execution

  • The configuration of the pipeline assigned to the seed

  • The configuration of any processor that’s used in any step of the pipeline

  • The configuration of any server that’s used either by the seed, or any of the related processors

  • The configuration of any credential that’s used either by the seed, or any of the related processors, or any of the related servers.

This means that changing the configuration of the entities used during the execution of a seed has no effect on its outcome. This avoids unexpected and inconsistent behaviors.

Note

For security reasons, when the snapshot of the configuration of a credential is stored, the associated secrets are not included in it. A reference to the underlying secret is saved instead. This means that changes applied to secrets mid-seed execution can unpredictably affect the current execution.

Every generated record is tagged with a corresponding action to apply during a specific execution:

  • CREATE: It is a new record for the seed.

  • UPDATE: The record was processed during a previous seed execution, but its content has changed.

  • DELETE: The record is marked to be deleted.

During a seed execution, every record has a status that changes as the seed is processed:

  • PROCESSING: The record was detected and is currently being processed.

  • FAILED: The processing of the record failed.

  • DONE: The record was successfully processed.

Record Data Channels

During a seed execution, records can produce data in JSON format, as well as binary files.

JSON data is stored in a dedicated bucket within Discovery Staging and can be later referenced using JSON Pointers.

Note

New data nodes never overwrite previously generated data nodes.

Note

When searching for a path, the JSON Pointer will be evaluated against the most recent output. If it is a match, the node is returned. Otherwise, the search continues with the previous one.

Binary data, such as images, videos or PDFs, is stored in a dedicated container inside the Object Storage.

Record Batches

Seeds can configure how batches are flushed through the finite-state machine.

The seed configuration, and its override in the processor state, define the boundaries of the batch: the first condition to be met triggers a flush, sending all the records in the batch to the next stage in the pipeline (such as the next processor, the next state of the state machine, or the end of the pipeline).
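For instance, a pipeline-level batch policy that flushes after 100 records, or after 30 seconds if the batch has not filled up, could look like this (values illustrative):

```json
{
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        "maxCount": 100,
        "flushAfter": "30s"
      }
    }
  }
}
```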

Expression Language Extensions
Table 40. Discovery Ingestion Expression Language Variables
Variable Description Example

seed.id

The ID of the seed in execution

966f6b3f-7066-4fd5-8885-f45fef2fd59d

seed.type

The type of the seed

my-component-type

seed.name

The name of the seed

My Component Seed

seed.description

The description of the seed

Description of My Seed

seed.labels

The labels of the seed, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

seed.properties

The properties to use during placeholders resolution

{ "keyA": "valueA" }

execution.id

The ID of the seed execution

966f6b3f-7066-4fd5-8885-f45fef2fd59d

execution.startTimestamp

The start time of the seed execution

2025-03-05T20:24:39Z

execution.scanType

The scan type of the seed execution

FULL

execution.triggerType

Trigger type of the seed execution

MANUAL

execution.properties

The properties to use during placeholders resolution

{ "keyA": "valueA" }

processor.id

The ID of the processor

966f6b3f-7066-4fd5-8885-f45fef2fd59d

processor.type

The type of the processor

my-component-type

processor.name

The name of the processor

My Component Processor

processor.description

The description of the processor

Description of My Processor

processor.labels

The labels of the processor, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

pipeline.id

The ID of the pipeline

966f6b3f-7066-4fd5-8885-f45fef2fd59d

pipeline.name

The name of the pipeline

My Component Pipeline

pipeline.description

The description of the pipeline

Description of My Pipeline

pipeline.labels

The labels of the pipeline, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

record.id

The ID of a generated record from a seed execution. If an ID mask is configured through ID policies, the masked ID will be returned

my-document-id-with-mask

record.plainId

The ID of a generated record from a seed execution

my-document-id

record.action

The action of a generated record from a seed execution

CREATED

record.parent

The parent ID of a generated record from a seed execution

my-document-parent-id

self.id

The record ID as detected from the source

NOTE: Only the generator field from the idPolicy in the Seed, Pipeline and Processor State outboundPolicy configuration supports this variable. It will not be resolved if used in any other field

my-document-id

self.data

The record content. Nested fields can be accessed as well (e.g. self.data.fieldB.nested)

NOTE: Only the generator field from the idPolicy in the Seed, Pipeline and Processor State outboundPolicy configuration supports this variable. It will not be resolved if used in any other field

{ "fieldA": "valueA", "fieldB": {"nested": "value"} }
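As an illustration of how these variables are consumed, the following Python sketch resolves #{...} placeholders (the syntax used throughout the configuration examples in this guide) against a small context map. The resolver is a hypothetical model, not the platform implementation:

```python
import re

# Hypothetical context holding a few of the variables listed above.
context = {
    "seed.name": "My Component Seed",
    "execution.id": "966f6b3f-7066-4fd5-8885-f45fef2fd59d",
    "execution.scanType": "FULL",
}

def resolve(template: str, variables: dict) -> str:
    """Substitute each #{...} placeholder with its value from the context."""
    return re.sub(r"#\{\s*([\w.]+)\s*\}",
                  lambda m: str(variables[m.group(1)]), template)

resolve("Execution #{execution.id} (#{execution.scanType}) of #{seed.name}",
        context)
# -> 'Execution 966f6b3f-7066-4fd5-8885-f45fef2fd59d (FULL) of My Component Seed'
```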

Table 41. Discovery Ingestion Expression Language Functions
Function Description Example

data(string)

Finds a specific node within the Record data channel using a JSON Pointer

data('/path/to/field')

data(integer)

References a specific node within the Record data channel using a 0-based index, where 0 is the first data node generated in the channel. Negative indexes are supported, where -1 represents the most recently generated data node. If the index is not found, the result is null

data(0), data(-1)
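The behavior of both data() overloads can be modeled in Python, assuming the data channel is an ordered list of JSON-like documents. The helper names below are hypothetical, not the platform API:

```python
def json_pointer(doc, pointer: str):
    """Resolve an RFC 6901-style JSON Pointer such as '/path/to/field'."""
    node = doc
    for token in pointer.lstrip("/").split("/"):
        node = node[token]
    return node

def data_at(channel: list, index: int):
    """0-based lookup into the data channel; negative indexes count back
    from the latest node; a missing index yields None."""
    try:
        return channel[index]
    except IndexError:
        return None

channel = [{"path": {"to": {"field": "value"}}}, {"latest": True}]
json_pointer(channel[0], "/path/to/field")  # 'value'
data_at(channel, -1)                        # {'latest': True}
data_at(channel, 5)                         # None
```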

Discovery Features for Data Processing

Record Splitting

By nature, some processors and their actions split a record into multiple children (e.g. a CSV file where each line represents a new record).

The configuration of the pipeline (and the processors within the pipeline) not only supports this SPLIT behavior, but also allows an INLINE mode where children are output as an array instead.
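The difference between the two modes can be illustrated with a CSV record. This is a conceptual Python model only; the platform's actual record envelope is not shown:

```python
import csv
import io

# A parent CSV record whose lines become child records.
csv_text = "id,name\n1,Alice\n2,Bob\n"
children = list(csv.DictReader(io.StringIO(csv_text)))

# SPLIT: every child continues through the pipeline as its own record.
split_output = children                     # two independent records

# INLINE: the children stay together as an array in a single output.
inline_output = {"children": children}      # one record carrying an array

len(split_output)             # 2
inline_output["children"][0]  # {'id': '1', 'name': 'Alice'}
```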

Components

Amazon S3

Performs different actions on Amazon S3, either downloading or uploading files depending on the action.

Scan Action: scan

Seed that retrieves the files stored in an Amazon S3 bucket.

Table 42. Discovery Features for Data Seeds

Feature              Supported
Incremental Scan     Checksum
Distributed Scan     Yes
Hierarchical Records No

Configuration example
{
  "type": "amazon-s3",
  "name": "My Amazon S3 Scan Action",
  "config": {
    "action": "scan",
    "bucket": "my-bucket", (1)
    "prefix": "/my/prefix",
    "pageSize": 100
  },
  "pipeline": <Pipeline ID>, (2)
  "server": <Amazon S3 Server ID> (3)
}
1 This configuration field is required.
2 See the ingestion pipelines section.
3 See the Amazon S3 integration section.

Each configuration field is defined as follows:

bucket

(Required, String) The name of the bucket to scan.

prefix

(Optional, String) The bucket prefix to filter documents.

pageSize

(Optional, Integer) The maximum number of elements per page.

Processor Action: download

Processor that, given the name of an Amazon S3 bucket and the file key, downloads the file and saves it in the Object Storage.

Table 43. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "amazon-s3",
  "name": "My Download Action",
  "config": {
    "action": "download", (1)
    "bucket": "my-s3-bucket", (1)
    "key": "image.jpg",
    "failOnMissing": true
  },
  "server": <Amazon S3 Server ID> (2)
}
1 These configuration fields are required.
2 See the Amazon S3 integration section.

Each configuration field is defined as follows:

bucket

(Required, String) The S3 bucket from which the document will be downloaded.

key

(Required, String) The key of the document within the bucket.

metadata

(Optional, Boolean) Whether to include metadata when downloading the document. Defaults to true.

failOnMissing

(Optional, Boolean) Whether to throw an error if the document is not found in the bucket. Defaults to true.

The resulting download information will be saved in each record’s s3 field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:

Record output example
{
  "s3": {
    "metadata": {}, // If the flag was set to true
    "file": {
      "@discovery": "object",
      "bucket": "ingestion",
      "key": "ingestion-88542806-94a1-4874-a1b2-fb4af5c7c540/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b/02519ad0-d8b6-4736-99d8-dd5744c6837c/ec48e0b4-b39a-4f6b-8856-c7104ab7e200"
    }
  }
}
Processor Action: upload

Processor that, given the name of an Amazon S3 bucket, the file key and a file from the Object Storage, uploads the file to the given bucket.

Table 44. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "amazon-s3",
  "name": "My Upload Action",
  "config": {
    "action": "upload", (1)
    "bucket": "amazon-s3-bucket", (1)
    "key": "uploadedFile.jpg",
    "file": "#{ file('fileToUpload', 'BYTES') }",
    "metadata": false
  },
  "server": <Amazon S3 Server ID> (2)
}
1 These configuration fields are required.
2 See the Amazon S3 integration section.

Each configuration field is defined as follows:

bucket

(Required, String) The S3 bucket where the document will be uploaded.

key

(Required, String) The key of the document within the bucket.

file

(Required, String) The file to upload. This file can be obtained from the object storage with the file function from the Expression Language.

metadata

(Optional, Boolean) Whether to include metadata in the response when uploading the document. Defaults to false.

The resulting upload information will be saved in each record’s s3 field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:

Record output example
{
  "s3": {
    "metadata": {} // If the flag was set to true
  }
}

Chunker

Splits large documents into smaller units that are easier for LLMs to interpret. It exposes different strategies that can be used.

Processor Action: sentence

Processor that splits the text by sentences.

Table 45. Discovery Features for Data Processing
Feature Supported

Record Splitting

Yes

Configuration example
{
  "type": "chunker",
  "name": "My Chunk-By-Sentence Action",
  "config": {
    "action": "sentence", (1)
    "text": " #{data('/text')} ", (1)
    "sentences": 4,
    "maxChars": 200,
    "overlap": "1"
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

text

(Required, String) The text to process.

sentences

(Optional, Integer) The number of sentences per chunk. Defaults to 20.

overlap

(Optional, String/Integer) The number of sentences to overlap between consecutive chunks; it can be either a percentage or an absolute number of sentences. Defaults to 10%.

maxChars

(Optional, Integer) The maximum number of chars per chunk.

The resulting information will be saved in each record’s chunker field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:

Record output example
{
  "chunker": {
    "chunks": [
      "Lorem ipsum dolor sit amet consectetur adipiscing elit. Placerat in id cursus mi pretium tellus duis. Urna tempor pulvinar vivamus fringilla lacus nec metus.",
      "Urna tempor pulvinar vivamus fringilla lacus nec metus. Integer nunc posuere ut hendrerit semper vel class. Conubia nostra inceptos himenaeos orci varius natoque penatibus.",
      "Lectus commodo augue arcu dignissim velit aliquam imperdiet. Cras eleifend turpis fames primis vulputate ornare sagittis. Libero feugiat tristique accumsan maecenas potenti ultricies habitant.",
      "Libero feugiat tristique accumsan maecenas potenti ultricies habitant. Cubilia curae hac habitasse platea dictumst lorem ipsum. Faucibus ex sapien vitae pellentesque sem placerat in.",
      "Faucibus ex sapien vitae pellentesque sem placerat in. Tempus leo eu aenean sed diam urna tempor."
    ],
    "errors": [
      {
        "index": 5,
        "text": "Mus donec rhoncus eros lobortis nulla molestie mattis Purus est efficitur laoreet mauris pharetra vestibulum fusce Sodales consequat magna ante condimentum neque at luctus Ligula congue sollicitudin erat viverra ac tincidunt nam.",
        "error": {
          "status": 400,
          "code": 3003,
          "messages": [
            "Chunk of size 229 exceeds maximum char limit of 200"
          ],
          "timestamp": "2025-09-11T15:43:42.925739900Z"
        }
      }
    ]
  }
}
Note

If the overlapped text exceeds the maxChars value, then the number of overlapped items will be reduced until there is a valid overlap. This includes no overlap at all, if needed.
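The overlap behavior, including the reduction described in the note above, can be sketched in Python. This is an illustrative model only: real sentence splitting, percentage overlaps and error reporting are omitted, and the function name is hypothetical:

```python
def chunk_sentences(sentences, per_chunk=4, overlap=1, max_chars=None):
    """Greedy sentence chunker. Each chunk starts with `overlap` trailing
    sentences from the previous chunk; the overlap shrinks toward 0 when
    it would push the chunk past max_chars."""
    chunks, start = [], 0
    while start < len(sentences):
        end = min(start + per_chunk, len(sentences))
        ov = overlap if chunks else 0        # the first chunk has no overlap
        while True:
            text = " ".join(sentences[start - ov:end])
            if max_chars is None or len(text) <= max_chars or ov == 0:
                break                        # ov == 0 with oversize text would
            ov -= 1                          # be reported as an error instead
        chunks.append(text)
        start = end
    return chunks

chunk_sentences(["one two", "three", "four five six", "seven"],
                per_chunk=2, overlap=1, max_chars=20)
# -> ['one two three', 'four five six seven']
```

With overlap 1 and maxChars 20, the second chunk drops its overlapped sentence because including it would produce 25 characters.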

Processor Action: word

Processor that splits the text by words.

Table 46. Discovery Features for Data Processing
Feature Supported

Record Splitting

Yes

Configuration example
{
  "type": "chunker",
  "name": "My Chunk-By-Word Action",
  "config": {
    "action": "word", (1)
    "text": " #{data('/text')} ", (1)
    "words": 8,
    "maxChars": 70,
    "overlap": 3
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

text

(Required, String) The text to process.

words

(Optional, Integer) The number of words per chunk. Defaults to 20.

overlap

(Optional, String/Integer) The number of words to overlap between consecutive chunks; it can be either a percentage or an absolute number of words. Defaults to 10%.

maxChars

(Optional, Integer) The maximum number of chars per chunk.

The resulting information will be saved in each record’s chunker field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:

Record output example
{
  "chunker": {
    "chunks": [
      "Lorem ipsum dolor sit amet consectetur adipiscing elit",
      "consectetur adipiscing elit Placerat in id cursus mi",
      "id cursus mi pretium tellus duis Urna tempor",
      "duis Urna tempor pulvinar vivamus fringilla lacus nec",
      "fringilla lacus nec metus Integer nunc posuere ut",
      "nunc posuere ut hendrerit semper vel class Conubia",
      "vel class Conubia nostra inceptos himenaeos orci varius",
      "himenaeos orci varius natoque penatibus. Mus donec rhoncus",
      "Mus donec rhoncus eros lobortis nulla molestie mattis.",
      "molestie mattis. Sodales consequat magna ante condimentum neque",
      "ante condimentum neque at luctus. Ligula congue sollicitudin",
      "Ligula congue sollicitudin erat viverra ac tincidunt nam.",
      "ac tincidunt nam. Lectus commodo augue arcu dignissim",
      "augue arcu dignissim velit aliquam imperdiet. Cras eleifend",
      "imperdiet. Cras eleifend turpis fames primis vulputate ornare",
      "primis vulputate ornare sagittis. Libero feugiat tristique accumsan",
      "feugiat tristique accumsan maecenas potenti ultricies habitant. Cubilia",
      "ultricies habitant. Cubilia curae hac habitasse platea dictumst",
      "habitasse platea dictumst lorem ipsum. Faucibus ex sapien",
      "Faucibus ex sapien vitae pellentesque sem placerat in.",
      "sem placerat in. Tempus leo eu aenean sed",
      "eu aenean sed diam urna tempor."
    ],
    "errors": [
      {
        "index": 48,
        "text": "Purusestefficiturlaoreetmaurispharetravestibulumfuscetravestibulumfusce.",
        "error": {
          "status": 400,
          "code": 3003,
          "messages": [
            "Chunk of size 72 exceeds maximum char limit of 70"
          ]
        }
      }
    ]
  }
}
Note

If the overlapped text exceeds the maxChars value, then the number of overlapped items will be reduced until there is a valid overlap. This includes no overlap at all, if needed.

Elasticsearch

Uses the Elasticsearch integration to invoke the Elasticsearch API.

Scan Action: scan, search-after

Seed that uses the Elasticsearch search_after parameter to retrieve all the documents from an index.

Table 47. Discovery Features for Data Seeds

Feature              Supported
Incremental Scan     Checksum
Distributed Scan     Yes
Hierarchical Records No

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Scan Action",
  "config": {
    "action": "search-after",
    "index": "my-index", (1)
    "sort": [ (1)
      { "field-a": "asc" },
      { "field-b": { "order": "desc" } }
    ],
    "size": 100,
    "metadata": false,
    "query": {
      "match": {
        "field-a": {
            "query": "value"
        }
      }
    }
  },
  "pipeline": <Pipeline ID>, (2)
  "server": <Elasticsearch Server ID> (3)
}
1 These configuration fields are required.
2 See the ingestion pipelines section.
3 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, Array of Strings or String) The list of Elasticsearch indexes to search on. Can also be configured as a single string if there’s only one index to search.

sort

(Required, Array of Objects) The list of sort options.

Sort object format
{ "<field>": "<sort_value>" }

or

{ "<field>": { "<sort_option>": "<sort_value>", ... } }

query

(Optional, Object) The query body for the search request. If not provided, a match all query will be used instead.

size

(Optional, Integer) The maximum number of hits to return. Defaults to 100.

metadata

(Optional, Boolean) Whether to include the metadata or not. Defaults to false.
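The pagination strategy behind this action can be sketched with a small in-memory model. This is illustrative only: a real seed runs the same loop against the Elasticsearch Search API, passing the sort values of the last hit as the next search_after cursor, and the helper names here are hypothetical:

```python
# An in-memory, pre-sorted "index" standing in for a live cluster.
docs = [{"id": i} for i in range(1, 8)]

def search(size, search_after=None):
    """Return up to `size` docs whose sort value follows `search_after`."""
    hits = [d for d in docs if search_after is None or d["id"] > search_after]
    return hits[:size]

pages, cursor = [], None
while True:
    page = search(size=3, search_after=cursor)
    if not page:
        break                      # an empty page ends the scan
    pages.append(page)
    cursor = page[-1]["id"]        # last hit's sort value becomes the cursor

# pages now holds three pages of 3, 3 and 1 documents
```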

Hook Action: aliases

Hook that executes a native Elasticsearch query to the Aliases API.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Hook Action",
  "config": {
    "action": "aliases",
    "actions": [ (1)
      {
        "add": {
          "index": "my-index-1",
          "alias": "my-alias-1"
        }
      },
      {
        "remove": {
          "index": "my-index-2",
          "alias": "my-alias-2"
        }
      }
    ]
  },
  "server": <Elasticsearch Server ID> (2)
}
1 This configuration field is required. The exact expected structure of each action object is defined by the Elasticsearch API; these are examples accurate at the time of writing.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

actions

(Required, Array of Objects) The request body. Each element in the array should represent an alias action.

Note

Currently, if at least one of the actions on the list is successful, the whole request is considered successful. Conversely, the request fails only if none of them succeeds.

Hook Action: create-index

Hook that executes a native Elasticsearch query to the Create Index API.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Hook Action",
  "config": {
    "action": "create-index",
    "index": "my-index", (1)
    "body": { (1) (2)
      "mappings": {
        "properties": {
          "field": { "type": "text" }
        }
      }
    },
    "waitForActiveShards": 2,
    "masterTimeout": "5m",
    "timeout": "180s"
  },
  "server": <Elasticsearch Server ID> (3)
}
1 These configuration fields are required.
2 The exact expected structure of the body object is defined by the Elasticsearch API; this is just an example accurate at the time of writing.
3 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index name.

body

(Required, Object) The request body. The body should represent an index body.

waitForActiveShards

(Optional, Integer) The number of copies of each shard that must be active before proceeding with the operation.

masterTimeout

(Optional, String) The period to wait for the master node. The string should represent a duration according to the Elasticsearch API.

timeout

(Optional, String) The period to wait for a response. The string should represent a duration according to the Elasticsearch API.

Processor Action: bulk, hydrate

Processor that executes a bulk request to the Elasticsearch Bulk API.

Table 48. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "hydrate",
    "index": "my-index", (1)
    "data": "#{ data('/my/record') }",
    "allowOverride": true,
    "bulk": <Bulk Configuration> (2)
  },
  "server": <Elasticsearch Server ID> (3)
}
1 This configuration field is required.
2 See the bulk configuration.
3 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The Elasticsearch index to perform the action.

data

(Optional, Object) The data to hydrate. If not provided, the component will use the output from the last processor that generated data for each record.

allowOverride

(Optional, Boolean) Whether to allow overriding an existing document or not. Defaults to true.

bulk

(Optional, Object) The bulk configuration.

Details
Configuration example
{
  "pipeline": "pipelineID",
  "refresh": "WAIT_FOR",
  "requireAlias": false,
  "routing": "shard-route",
  "timeout": "1m",
  "waitForActiveShards": "1",
  "flush": { (1)
    "maxOperations": 1000,
    "maxConcurrentRequests": 1,
    "maxSize": "5mb",
    "flushInterval": "5s"
  }
}

Each configuration field is defined as follows:

pipeline

(Optional, String) The ID of the Elasticsearch pipeline that’ll be used to preprocess incoming documents.

routing

(Optional, String) Used to route operations to a specific shard.

waitForActiveShards

(Optional, String) The number of copies of each shard that must be active before proceeding with the Elasticsearch operation.

timeout

(Optional, String) The period of time to wait for some operations. The string should represent a duration according to the Elasticsearch API.

requireAlias

(Optional, Boolean) Whether the request’s actions must target an index alias.

refresh

(Optional, String) The refresh type. Supported types are: TRUE, FALSE and WAIT_FOR.

Type definitions

WAIT_FOR: Waits for a refresh to make the Elasticsearch operation visible to search

TRUE: Refreshes the affected shards to make the Elasticsearch operation visible to search

FALSE: Takes no refresh action

flush

(Optional, Object) The flush configuration.

Field definitions
maxOperations

(Optional, Integer) The maximum number of operations. Defaults to 1000.

maxConcurrentRequests

(Optional, Integer) The maximum number of concurrent requests waiting to be executed by Elasticsearch. Defaults to 1.

maxSize

(Optional, String) The maximum size of the bulk request. Defaults to 5MB.

flushInterval

(Optional, Duration) The interval between flushes.
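The interplay of the flush thresholds can be modeled as follows, assuming (as is typical for Elasticsearch bulk ingesters) that reaching any single threshold triggers a flush. The class and names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class FlushPolicy:
    max_operations: int = 1000               # maxOperations
    max_size_bytes: int = 5 * 1024 * 1024    # maxSize "5mb"
    flush_interval_s: float = 5.0            # flushInterval "5s"

    def should_flush(self, ops: int, size_bytes: int, elapsed_s: float) -> bool:
        """Flush the buffered bulk request when any threshold is reached."""
        return (ops >= self.max_operations
                or size_bytes >= self.max_size_bytes
                or elapsed_s >= self.flush_interval_s)

policy = FlushPolicy()
policy.should_flush(ops=1000, size_bytes=1024, elapsed_s=0.1)  # True
policy.should_flush(ops=10, size_bytes=1024, elapsed_s=0.1)    # False
```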

Field Mapper

Provides a dynamic way to map fields and apply actions to them in order to customize how data is structured. You can leverage the Expression Language to achieve this.

Processor Action: process

Processor that formats the output data of the records.

Table 49. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "fieldmapper",
  "name": "My Field Mapper Processor Action",
  "config": {
    "action": "process",
    "output": {    (1)
      "id": "#{ concat(data('/id'), ' - ', data('/title')) }",
      "value": "#{ data('/description') }"
    }
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

output

(Required, Object) The object with the expressions that represent the output of the processor.

Record output example
{
  "fieldMapper": {
    "id": "output-Id",
    "value": "outputValue"
  }
}

Filesystem crawler

Uses the Norconex Filesystem Crawler to crawl filesystems and extract their content. Currently, it only supports the SMB protocol.

Scan Action: scan

Seed that crawls a filesystem using basic configurations.

Table 50. Discovery Features for Data Seeds

Feature              Supported
Incremental Scan     Custom
Distributed Scan     No
Hierarchical Records No

Configuration example
{
  "type": "filesystem",
  "name": "My Filesystem Crawler Scan Action",
  "config": {
    "action": "scan",
    "startPaths": ["/my/path"],
    "maxDocuments": 0,
    "metadataFilters": [
      ...
    ],
    "referenceFilters": [
      ...
    ]
  },
  "pipeline": <Pipeline ID>, (1)
  "server": <SMB server ID>  (2)
}
1 See the ingestion pipelines section.
2 See the SMB integration section.

Each configuration field is defined as follows:

startPaths

(Optional, Array of Strings) The list of starting paths to crawl. The values are appended to the servers field value of the SMB server configuration.

maxDocuments

(Optional, Integer) The maximum number of documents to successfully process. Default is unlimited.

metadataFilters

(Optional, Array of Objects) The list of filters to apply based on the documents' metadata.

Details
{
  "metadataFilters": [
    {
      "field": "Content-Type",
      "values": [
        "pdf"
      ],
      "mode": "EXCLUDE"
    }
  ]
}

Each configuration field is defined as follows:

field

(Required, String) The name of the metadata field.

values

(Required, Array of Strings) The list of regex values used to filter on the specified field.

mode

(Optional, String) The mode that defines whether matching documents are included in or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.

referenceFilters

(Optional, Array of Objects) The list of filters to apply based on the documents' reference (i.e. their URLs).

Details
{
  "referenceFilters": [
    {
      "type": "EXTENSION",
      "filter": "pdf, csv",
      "mode": "EXCLUDE",
      "caseSensitive": true
    }
  ]
}

Each configuration field is defined as follows:

type

(Required, String) The type of filter. One of:

  • EXTENSION: Filters by document extension.

  • REGEX: Filters by regular expressions.

filter

(Required, String) The value of the filter to apply. If the type is EXTENSION, the value must be a comma-separated list of extensions or a single extension. In the case of the REGEX type, the value is the regular expression.

mode

(Optional, String) The mode that defines whether matching documents are included in or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.

caseSensitive

(Optional, Boolean) Whether the filter is case-sensitive or not.
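The INCLUDE/EXCLUDE semantics of both filter types can be sketched as follows. This is an illustrative model, not the crawler's implementation, and the function names are hypothetical:

```python
import re

def metadata_filter(doc_meta, field, patterns, mode="INCLUDE"):
    """Keep (True) or drop (False) a document based on regex matches
    over a single metadata field."""
    value = str(doc_meta.get(field, ""))
    matched = any(re.search(p, value) for p in patterns)
    return matched if mode == "INCLUDE" else not matched

def reference_filter(reference, filter_value, filter_type="EXTENSION",
                     mode="INCLUDE", case_sensitive=False):
    """EXTENSION expects a comma-separated list of extensions;
    REGEX expects a regular expression."""
    if filter_type == "EXTENSION":
        ref = reference if case_sensitive else reference.lower()
        exts = [e.strip() if case_sensitive else e.strip().lower()
                for e in filter_value.split(",")]
        matched = any(ref.endswith("." + e) for e in exts)
    else:  # REGEX
        flags = 0 if case_sensitive else re.IGNORECASE
        matched = re.search(filter_value, reference, flags) is not None
    return matched if mode == "INCLUDE" else not matched

# A PDF document is dropped by the EXCLUDE filters configured above:
metadata_filter({"Content-Type": "application/pdf"}, "Content-Type",
                ["pdf"], mode="EXCLUDE")                             # False
reference_filter("/my/path/report.PDF", "pdf, csv", mode="EXCLUDE")  # False
```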

Extracted ACL data for SMB

When using the SMB protocol, ACL data is extracted for each document processed:

{
  "properties": {
    "collector": {
      "acl": {
        "smb": [
          {
            "domainSid": "",
            "ace": "",
            "sidAsText": "",
            "sid": "",
            "type": "",
            "accountName": "",
            "domainName": "",
            "typeAsText": ""
          }
        ]
      },
      ...
    },
    ...
  },
  ...
}
Details
domainSid

(String) The domain Security Identifier.

ace

(String) The Access Control Entry.

sidAsText

(String) The SID as text.

sid

(String) The Security Identifier.

type

(String) The ACL type. If the access was allowed, the value is 0. Otherwise, the value is 1.

accountName

(String) The name of the account.

domainName

(String) The name of the domain.

typeAsText

(String) The ACL type as text.

HTML

Uses Jsoup to process HTML files.

Processor Action: select

Processor that retrieves elements that match a CSS selector query from an HTML document.

Table 51. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "html",
  "name": "My HTML Processor Select Action",
  "config": {
    "action": "select",
    "file": "#{ file('my-file') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "selectors": { (1)
      "mySelector": {
        "selector": "::text:not(:blank)", (1)
        "mode": "NODES"
      }
    }
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

file

(Required, File) The HTML file to be processed. Note that the expected value is not the name of the file but its actual content, which you can retrieve with the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set of the file’s content. If not set, the charset is determined from the http-equiv meta tag when present, falling back to UTF-8 otherwise.

selectors

(Required, Map of String/Object) The set of selector configurations.

Field definitions
selector

(Required, String) The CSS selector query.

mode

(Optional, String) The output format of the selection. Either TEXT, HTML or NODES. Defaults to TEXT.

Note

The NODES mode enables the use of Node Pseudo Selectors. The output for this mode depends on the operator that is used: some operators output text, while others output HTML.

The selected text or HTML will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document <p>Hello World!</p>, the processor’s output in a record would be:

Record output example
{
  "html": {
    "mySelector": "Hello World!"
  }
}
Processor Action: extract

Processor that extracts and formats tables and description lists from an HTML document.

Table 52. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "html",
  "name": "My HTML Processor Extract Action",
  "config": {
    "action": "extract",
    "file": "#{ file('my-file') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "table": {
      "active": true,
      "titles": [
        "caption"
      ]
    },
    "descriptionList": {
      "active": true,
      "titles": [
        "h1"
      ]
    }
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

file

(Required, File) The HTML file to be processed. Note that the expected value is not the name of the file but its actual content, which you can retrieve with the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set of the file’s content. If not set, the charset is determined from the http-equiv meta tag when present, falling back to UTF-8 otherwise.

table

(Optional, Object) The configurations for extracting tables.

Field definitions
active

(Optional, Boolean) Whether the extractor is active. Defaults to true.

titles

(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.

descriptionList

(Optional, Object) The configurations for extracting description lists.

Field definitions
active

(Optional, Boolean) Whether the extractor is active. Defaults to true.

titles

(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.

The extracted tables and description lists will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document:

<div>
    <div>
        <h1>Description List</h1>
        <dl>
            <dt>Term 1</dt>
            <dd>Detail 1</dd>
            <dd>Detail 2</dd>
            <dt>Term 2</dt>
        </dl>
    </div>
    <table>
        <caption>Table</caption>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
        </tr>
        <tr>
            <td>Data 1</td>
            <td>Data 2</td>
        </tr>
    </table>
</div>

The processor’s output in a record would be:

Record output example
{
  "html": {
    "tables": [
      {
        "title": "Table",
        "table": [
          [
            {
              "tag": "header",
              "text": "Header 1"
            },
            {
              "tag": "header",
              "text": "Header 2"
            }
          ],
          [
            {
              "tag": "data",
              "text": "Data 1"
            },
            {
              "tag": "data",
              "text": "Data 2"
            }
          ]
        ]
      }
    ],
    "descriptionLists": [
      {
        "title": "Description List",
        "descriptionList": [
          {
            "term": "Term 1",
            "details": [
              "Detail 1",
              "Detail 2"
            ]
          },
          {
            "term": "Term 2",
            "details": []
          }
        ]
      }
    ]
  }
}
Processor Action: remove

Processor that removes elements that match a CSS selector query from an HTML document.

Table 53. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "html",
  "name": "My HTML Processor Remove Action",
  "config": {
    "action": "remove",
    "file": "#{ file('my-file') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "selector": "header, footer" (1)
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

file

(Required, File) The HTML file to be processed. Note that the expected value is not the name of the file but its actual content, which you can retrieve with the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set of the file’s content. If not set, the charset is determined from the http-equiv meta tag when present, falling back to UTF-8 otherwise.

selector

(Required, String) The CSS selector query.

The remaining HTML will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document:

<html>
    <head></head>
    <body>
        <header>header text</header>
        <p>body text</p>
        <footer>footer text</footer>
    </body>
</html>

The processor’s output in a record would be:

Record output example
{
  "html": "<html>\n <head></head>\n <body>\n  <p>body text</p>\n </body>\n</html>"
}

Insights

The Insights component is designed to provide various actions that generate different metrics or information for later analysis.

Processor Action: engine-score:non-contextual

Processor that calculates the query score based on the result position metadata field value only. The engine scoring action is designed to power the Engine Scoring Dashboards by evaluating the quality of a search engine’s results in terms of precision and recall.

Table 54. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "insights",
  "name": "My Engine-Score Non-Contextual Processor Action",
  "config": {
    "action": "engine-score:non-contextual",
    "resultPosition": "25", (1)
    "kfactor": "0.9",
    "startPosition": "1",
    "precision": "4"
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

resultPosition

(Required, Integer) Field containing the position of the search result to be used in the engine scoring calculation.

kfactor

(Optional, Double) Value between 0 and 1 used to determine the importance of the relevant records. Defaults to 0.9.

startPosition

(Optional, Integer) Indicates the start position to take into account when doing the K-factor calculation. Defaults to 1.

precision

(Optional, Integer) Number of digits to return after the decimal point for the score value. Defaults to 4.

The resulting score will be saved in each record’s score field by default. This field’s name can be overwritten in the processor state configuration. For example, given a resulting score of 0.1, the processor’s output in a record would be:

Record output example
{
  "score": 0.1
}

Language Detector

The Language Detector component uses Lingua to identify the language from a specified text input. The languages are referenced using ISO-639-1 (alpha-2 code).

Note

Each time a language model is referenced, it will be loaded in memory. Loading too many languages increases the risk of high memory consumption issues.

Processor Action: process

Processor that detects the language of a provided text.

Table 55. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "language",
  "name": "My Language Detector Processor Action",
  "config": {
    "action": "process",
    "text": {    (1)
      "inputA": "#{ data('/fieldA') }",
      "inputB": "#{ data('/fieldB') }"
    },
    "defaultLanguage": "en",
    "minDistance": 0.0,
    "supportedLanguages": ["en", "es"]
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

text

(Required, Object) The text to be evaluated. It can be either a String with a single input, or a Map for multi-input processing.

defaultLanguage

(Optional, String) Default language to select in case no other is detected. Defaults to en.

minDistance

(Optional, Double) The minimum distance required between the input and a language model for the language to be detected. Defaults to 0.0.

supportedLanguages

(Optional, Array of Strings) List of languages supported by the detector. At least 2 supported languages must be set. Defaults to [ "en", "es" ].

The output of the processor will be saved in the record’s language field by default. This field’s name can be overwritten in the processor state configuration. For example:

Record output example for single input
{
  "language": "es"
}
Record output example for multi-input
{
  "language": {
    "inputA": "es",
    "inputB": "en"
  }
}
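The single-input versus multi-input contract can be sketched as follows; the detect() stub below is a toy stand-in for the Lingua-based detection, not the actual library call:

```python
# Sketch of the single- vs multi-input output shape. detect() is a toy
# stand-in for the real Lingua-based detection, which is statistical.

def detect(text: str, default: str = "en") -> str:
    # Toy heuristic stand-in; real detection uses Lingua's language models.
    return "es" if any(w in text for w in ("hola", "gracias")) else default

def run_language_processor(text, default="en"):
    """Mirror the component's contract: str in -> str out, map in -> map out."""
    if isinstance(text, str):
        return detect(text, default)
    return {key: detect(value, default) for key, value in text.items()}

print(run_language_processor("hola mundo"))                       # "es"
print(run_language_processor({"inputA": "hola", "inputB": "hi"}))
```

The output mirrors the shape of the text configuration field: a plain string yields a single language code, while a map yields one code per named input.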

LDAP

Performs searches on an LDAP directory server, such as retrieving users and groups.

Scan Action: scan

Seed that runs a search in the directory and creates a record for each entry returned.

Table 56. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Checksum

Distributed Scan

Yes

Hierarchical Records

No

Configuration example
{
  "type": "ldap",
  "name": "My LDAP Scan Action",
  "config": {
    "action": "scan",
    "baseDN": "ou=Users,dc=my-domain,dc=com", (1)
    "filter":"(objectClass=inetOrgPerson)", (1)
    "projection": <DSL Projection>, (2)
    "pageSize": 500
  },
  "pipeline": <Pipeline ID>, (3)
  "server": <LDAP Server ID> (4)
}
1 These fields are required.
2 See the DSL Projection section.
3 See the ingestion pipelines section.
4 See the LDAP integration section.

Each configuration field is defined as follows:

baseDN

(Required, String) The base distinguished name for the search request.

filter

(Required, String) The string representation of the filter to use to identify matching entries.

projection

(Optional, DSL Projection) The projection to apply to the search’s attributes.

pageSize

(Optional, Integer) The number of entries that will be fetched using pagination. Defaults to 1000. Some LDAP servers do not support pagination; in that case, it is recommended to retrieve all desired entries at once if possible.

MongoDB

Performs different actions on MongoDB collections, either reading or writing data depending on the action.

Scan Action: scan

Seed that finds all documents in a MongoDB collection and creates a record for each document found.

Table 57. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Checksum

Distributed Scan

Yes

Hierarchical Records

No

Configuration example
{
  "type": "mongo",
  "name": "My Mongo Scan Action",
  "config": {
    "action": "scan",
    "database": "my-database", (1)
    "collection": "my-collection", (1)
    "filter": { ... },
    "projection": { ... },
    "sort": [ ... ],
    "size": 100, (1)
    "fields": {
      "id": "_id",
      "token": "_id"
    }
  },
  "pipeline": <Pipeline ID>, (2)
  "server": <MongoDB Server ID> (3)
}
1 These configuration fields are required.
2 See the ingestion pipelines section.
3 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database to connect to.

collection

(Required, String) The collection whose documents are turned into records.

filter

(Optional, DSL Filter) The filter to apply to the request.

projection

(Optional, DSL Projection) The projection to apply to the response.

sort

(Optional, DSL Sort) The sort to apply to the request.

size

(Required, Integer) The page size for pagination.

fields

(Optional, Object) The names of the control fields from the source collection.

Details
id

(Optional, String) The name of the field with the ID of the record. Defaults to _id.

token

(Optional, String) The name of the field with the pagination token of the collection. Defaults to _id.
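The id and token control fields drive pagination: each page is read in token order, and the next page resumes after the last token seen. A minimal stand-in over plain dictionaries (not the actual MongoDB driver calls) could look like:

```python
# Sketch of token-based pagination over an in-memory collection. The real
# component issues MongoDB queries; plain dicts stand in for documents here.

def scan_pages(documents, size, token_field="_id"):
    """Yield pages of `size` documents, resuming after the last token seen."""
    ordered = sorted(documents, key=lambda doc: doc[token_field])
    last_token = None
    while True:
        page = [d for d in ordered
                if last_token is None or d[token_field] > last_token][:size]
        if not page:
            return
        yield page
        last_token = page[-1][token_field]

docs = [{"_id": i, "value": f"doc-{i}"} for i in range(5)]
pages = list(scan_pages(docs, size=2))
print([len(p) for p in pages])  # [2, 2, 1]
```

Resuming from the last token rather than a numeric offset is what allows the scan to be distributed and incremental without re-reading earlier pages.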

Processor Action: bulk, hydrate

Processor that stores the records in the pipeline via Bulk Write operations on the specified MongoDB collection.

Table 58. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "mongo",
  "name": "My Mongo Processor Action",
  "config": {
    "action": "bulk",
    "database": "my-database", (1)
    "collection": "my-collection", (1)
    "allowOverride": true,
    "data": "#{ data('/my/record') }",
    "flush": <Bulk Flush Configuration> (2)
  },
  "server": <MongoDB Server ID>, (3)
}
1 These configuration fields are required.
2 See the flush configuration.
3 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database to connect to.

collection

(Required, String) The collection where the records are bulk written.

allowOverride

(Optional, Boolean) Whether the records should be stored if there is one already with their ID. Defaults to true.

data

(Optional, Object) The data to store on the collection. If not provided, the component will use the output from the last processor that generated data for each record.

flush

(Optional, Object) The flush configuration.

Details
Configuration example
{
  "maxCount": 25,
  "maxWeight": 25,
  "flushAfter": "PT5M"
}

Each configuration field is defined as follows:

maxCount

(Optional, Integer) The maximum number of records in the bulk before flushing.

maxWeight

(Optional, Long) The maximum weight allowed in a bulk request.

flushAfter

(Optional, Duration) The time to wait before flushing a bulk request.
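The three flush settings bound the same buffer along different dimensions: record count, accumulated weight and elapsed time. A minimal sketch of that logic (illustrative only, not Discovery's internal implementation) might be:

```python
import time

# Sketch of a bulk buffer honoring maxCount, maxWeight and flushAfter.
# Illustrative only; not Discovery's internal flush implementation.

class BulkBuffer:
    def __init__(self, max_count=25, max_weight=25, flush_after_seconds=300):
        self.max_count = max_count
        self.max_weight = max_weight
        self.flush_after = flush_after_seconds
        self.records, self.weight = [], 0
        self.opened_at = time.monotonic()

    def add(self, record, weight=1):
        """Buffer one record; return the flushed batch when a limit is hit."""
        self.records.append(record)
        self.weight += weight
        return self.flush() if self.should_flush() else None

    def should_flush(self):
        return (len(self.records) >= self.max_count
                or self.weight >= self.max_weight
                or time.monotonic() - self.opened_at >= self.flush_after)

    def flush(self):
        batch, self.records, self.weight = self.records, [], 0
        self.opened_at = time.monotonic()
        return batch

buffer = BulkBuffer(max_count=2, max_weight=100, flush_after_seconds=300)
print(buffer.add({"id": 1}))  # None: only one record buffered so far
print(buffer.add({"id": 2}))  # flushed batch of two records
```

Whichever limit is reached first triggers the flush, so a sparse stream still flushes after the flushAfter duration even if maxCount is never reached.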

OCR

Performs an Optical Character Recognition (OCR) operation on files to extract text from non-searchable or image-based PDFs.

Processor Action: process

Processor that executes the OCR action to extract text from files using the Tesseract library.

Table 59. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "ocr",
  "name": "My OCR Processor Action",
  "config": {
    "action": "process",
    "file": "#{ file('my-file', 'BYTES') }", (1)
    "languages": ["EN"],
    "pageSegmentationMode": 3
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

file

(Required, File) The file to extract the text from. Notice that the expected value is not the name of the file but its actual content; you can leverage the Expression Language to provide it.

languages

(Optional, Array of Strings) The list of languages that the OCR operation should support. The languages are two-letter codes as described by the ISO 639-1 standard. Defaults to ["EN"].

pageSegmentationMode

(Optional, Integer) The page segmentation mode. See the Tesseract Page Segmentation Modes for the list of available modes. Defaults to 3.

The recognized text will be saved in each record’s ocr field by default. This field’s name can be overwritten in the processor state configuration. For example, given a document with the Hello World! phrase in it, the processor’s output in a record would be:

Record output example
{
  "ocr": "Hello World!"
}

OpenAI

Uses the OpenAI integration to send requests to OpenAI. Additionally, supports text trimming based on OpenAI models' tokenizing and token limits, by integrating the tiktoken library.

Processor Action: chat-completion

Processor that executes a chat completion request to OpenAI API.

Table 60. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "openai",
  "name": "My Chat Completion Action",
  "config": {
    "action": "chat-completion",
    "model": "openai-model", (1)
    "messages": [ (1) (2)
      {"role": "system", "content": "You are a helpful assistant" },
      {"role": "user", "content": "Hi!" },
      {"role": "assistant", "content": "Hi, how can I assist you today?" }
    ],
    "promptCacheKey": "pureinsights",
    "frequencyPenalty": 0.0,
    "presencePenalty": 0.0,
    "temperature": 1,
    "topP": 1,
    "n": 1,
    "maxTokens": 2048,
    "stop": [],
    "responseFormat": <Response format configuration> (3)
  },
  "server": <OpenAI Server ID> (4)
}
1 These configuration fields are required.
2 See the messages configuration definition.
3 See the response format configuration.
4 See the OpenAI integration section.

Each configuration field is defined as follows:

model

(Required, String) The OpenAI model to use.

messages

(Required, Array of Objects) The list of messages for the request.

Field definitions
role

(Required, String) The role of the message. Must be one of system, user or assistant.

content

(Required, String) The content of the message.

name

(Optional, String) The name of the author of the message.

promptCacheKey

(Optional, String) Value used by OpenAI to cache responses for similar requests to optimize the cache hit rates.

frequencyPenalty

(Optional, Double) Positive values penalize new tokens based on their existing frequency in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.

presencePenalty

(Optional, Double) Positive values penalize new tokens based on whether they appear in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.

temperature

(Optional, Double) Sampling temperature to use. Value must be between 0 and 2. Defaults to 1.

Note
It’s generally recommended to alter either this or the topP field, but not both.
topP

(Optional, Double) An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. Defaults to 1.

Note
It’s generally recommended to alter either this or the temperature field, but not both.
n

(Optional, Integer) How many chat completion choices to generate for each input message. Defaults to 1.

maxTokens

(Optional, Integer) The maximum number of tokens to generate in the chat completion. Defaults to 2048.

stop

(Optional, Array of String) Up to 4 sequences where the API will stop generating further tokens.

responseFormat

(Optional, Object) An object specifying the format that the model must output. Learn more in OpenAI’s Structured Outputs guide.

Details
Configuration example
{
  "type": "json_schema", (1)
  "json_schema": { (2)
    "name":   "a name for the schema",
    "strict": true,
    "schema": { (3)
      "type": "object",
      "properties": {
        "equation": { "type": "string" },
        "answer":   { "type": "string" }
      },
      "required": ["equation", "answer"],
      "additionalProperties": false
    }
  }
}
1 The response type is always required.
2 JSON schemas can only be used and are required with json_schema types of response formats.
3 The exact expected structure of the schema object is defined by the OpenAI API; this is just an example at the time of writing.

Each configuration field is defined as follows:

type

(Required, String) The type of response format being defined. Allowed values: text, json_schema and json_object.

json_schema

(Optional, Object) Structured Outputs configuration options, including a JSON Schema. This field can only be used and is in fact required with response formats of the json_schema type. See OpenAI’s response formats definitions for more details.

Field definitions
name

(Required, String) The name of the response format. Must contain only a-z, A-Z, 0-9, underscores, and dashes, with a maximum length of 64.

description

(Optional, String) A description of what the response format is for, used by the model to determine how to respond in the format.

schema

(Optional, Object) The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

strict

(Optional, Boolean) Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Defaults to false.

The resulting chat completion will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting chat completion in a record would be:

Record output example
{
  "openai": {
    "created": "2025-07-24T15:38:59Z",
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "{\"content\":\"Let's solve the equation step by step:\\n\\n1. Start with: 8x + 31 = 2\\n2. Subtract 31 from both sides: 8x = 2 - 31\\n3. Simplify: 8x = -29\\n4. Divide both sides by 8: x = -29/8\\n\\nSo the solution is:\\n\\nx = -29/8\",\"next\":\"Do you have any other equations you want to solve, or would you like to see this as a decimal?\"}"
        },
        "finish_reason": "stop"
      }
    ],
    "model": "gpt-4.1-2025-04-14",
    "usage": {
      "prompt_tokens": 61,
      "completion_tokens": 116,
      "total_tokens": 177
    }
  }
}
Processor Action: embeddings

Processor that executes embedding requests to the OpenAI API.

Table 61. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "openai",
  "name": "My OpenAI Processor Action",
  "config": {
    "action": "embeddings",
    "model": "openai-model", (1)
    "input": "#{ data('/my/input') }", (1)
    "user": "pureinsights",
    "flush": <Bulk Flush Configuration> (2)
  },
  "server": <OpenAI Server ID> (3)
}
1 These configuration fields are required.
2 See the flush configuration.
3 See the OpenAI integration section.

Each configuration field is defined as follows:

model

(Required, String) The OpenAI model to use.

input

(Required, String) The input to generate the embeddings.

user

(Optional, String) A unique identifier representing the end-user.

flush

(Optional, Object) The flush configuration.

Details
Configuration example
{
  "maxCount": 25,
  "maxWeight": 25,
  "flushAfter": "PT5M"
}

Each configuration field is defined as follows:

maxCount

(Optional, Integer) The maximum number of records in the bulk before flushing.

maxWeight

(Optional, Long) The maximum weight allowed in a bulk request.

flushAfter

(Optional, Duration) The time to wait before flushing a bulk request.

The resulting embeddings will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:

Record output example
{
  "openai": [-0.006929283495992422, -0.005336422007530928, ...]
}
Processor Action: trim

Processor that trims a given text based on an OpenAI model’s tokenizing and either its own token limit, or a custom one.

Table 62. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "openai",
  "name": "My OpenAI Trim Action",
  "config": {
    "action": "trim",
    "text": "#{ data(\"/text\") }", (1)
    "model": "gpt-5.2", (2)
    "tokenLimit": 5 (2)
  }
}
1 This configuration field is required.
2 At least one of these configuration fields is required.

Each configuration field is defined as follows:

text

(Required, String) The text to trim.

model

(Optional, String) The OpenAI model whose encoding is used to tokenize the text and whose token limit determines whether to truncate the text. If a custom token limit is defined, it’ll override the model’s. If no model is provided, a default o200k_base encoding will be used.

Note
In order to determine the encoding and token limit for models used in chat completion requests, the processor will only take into account those models' "version" when trimming. In this context, "version" translates to the ChatGPT version used, such as gpt-5.2, gpt-4.1, o4, etc. This means that a model field configured as gpt-5.2-2025-12-11 will result in the processor only taking into account the gpt-5.2 version included, and ignore the rest. Consequently, non-existent models such as o4-mini-thismodeldoesntexist are considered valid for this action as long as the model version can be inferred.
tokenLimit

(Optional, Integer) The positive integer used as token limit when determining whether to truncate the encoded text or not. If defined, it’ll override the provided model’s token limit, if any.

The result of the trimming process will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. Given the example configuration shown above, with a text input of The brown fox jumps over the lazy dog, the trim result would be saved as:

Record output example
{
  "openai": {
    "text": "The brown fox jumps over",
    "size": 24,
    "tokens": 5,
    "truncated": true,
    "remainder": [
      " the",
      " lazy",
      " dog"
    ]
  }
}
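The trim contract can be illustrated with a toy whitespace tokenizer standing in for tiktoken's model encodings; only the output shape is mirrored, since the real processor tokenizes with the model's encoding rather than on spaces:

```python
# Sketch of the trim contract using a toy whitespace tokenizer. The real
# processor uses tiktoken encodings; only the output shape is mirrored here.

def trim(text: str, token_limit: int) -> dict:
    """Keep the first token_limit tokens and report what was cut off."""
    tokens = [" " + w if i else w for i, w in enumerate(text.split(" "))]
    kept = tokens[:token_limit]
    trimmed = "".join(kept)
    return {
        "text": trimmed,
        "size": len(trimmed),
        "tokens": len(kept),
        "truncated": len(tokens) > token_limit,
        "remainder": tokens[token_limit:],
    }

result = trim("The brown fox jumps over the lazy dog", token_limit=5)
print(result["text"])       # "The brown fox jumps over"
print(result["remainder"])  # [" the", " lazy", " dog"]
```

With a real encoding the token boundaries would differ, but the fields returned — text, size, tokens, truncated and remainder — match the record output shown above.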

OpenSearch

Uses the OpenSearch integration to invoke the OpenSearch API.

Hook Action: aliases

Hook that executes a native OpenSearch query to the Aliases API.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Hook Action",
  "config": {
    "action": "aliases",
    "actions": [ (1)
      {
        "add": {
          "index": "my-index-1",
          "alias": "my-alias-1"
        }
      },
      {
        "remove": {
          "index": "my-index-2",
          "alias": "my-alias-2"
        }
      }
    ],
    "clusterManagerTimeout": "30s",
    "timeout": "30s"
  },
  "server": <OpenSearch Server ID> (2)
}
1 This configuration field is required. The exact expected structure of the action object is defined by the OpenSearch API; this is just an example at the time of writing.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

actions

(Required, Array of Objects) Set of actions to perform on the index. Each element in the array should represent an alias action.

clusterManagerTimeout

(Optional, String) The amount of time to wait for a response from the cluster manager node. The string should contain an OpenSearch API time unit.

timeout

(Optional, String) The amount of time to wait for a response from the cluster. The string should contain an OpenSearch API time unit.

Hook Action: create-index

Hook that executes a native OpenSearch query to the Create Index API.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Hook Action",
  "config": {
    "action": "create-index",
    "index": "my-index", (1)
    "body": { (1) (2)
      "mappings": {
        "properties": {
          "age": {
            "type": "integer"
          }
        }
      }
    }
  },
  "server": <OpenSearch Server ID> (3)
}
1 These configuration fields are required.
2 The exact expected structure of the body object is defined by the OpenSearch API; this is just an example at the time of writing.
3 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index name.

body

(Required, Object) The request body. The body should represent an index body.

waitForActiveShards

(Optional, Integer) The number of active shards that must be available before OpenSearch processes the request.

clusterManagerTimeout

(Optional, String) The amount of time to wait for a response from the cluster manager node. The string should contain an OpenSearch API time unit.

timeout

(Optional, String) The amount of time to wait for a response from the cluster. The string should contain an OpenSearch API time unit.

Processor Action: bulk, hydrate

Processor that executes a bulk request to the OpenSearch Bulk API.

Table 63. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "hydrate",
    "index": "my-index", (1)
    "data": "#{ data('/my/record') }",
    "bulk": <Bulk Configuration>, (2)
    "flush": <Bulk Flush Configuration> (3)
  },
  "server": <OpenSearch Server ID> (4)
}
1 This configuration field is required.
2 See the bulk configuration.
3 See the flush configuration.
4 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The OpenSearch index on which to perform the action.

data

(Optional, Object) The data to hydrate. If not provided, the component will use the output from the last processor that generated data for each record.

bulk

(Optional, Object) The bulk configuration.

Details
Configuration example
{
  "pipeline": "pipelineID",
  "refresh": "WAIT_FOR",
  "requireAlias": false,
  "routing": "shard-route",
  "waitForActiveShards": "1",
  "timeout": "PT1M"
}

Each configuration field is defined as follows:

pipeline

(Optional, String) The ID of the OpenSearch Pipeline that’ll be used to preprocess incoming documents.

refresh

(Optional, String) The refresh type. Supported values are TRUE, FALSE and WAIT_FOR.

Type definitions

WAIT_FOR: Waits for a refresh to make the OpenSearch operation visible to search.

TRUE: Refreshes the affected shards to make the OpenSearch operation visible to search.

FALSE: Does not trigger a refresh.

requireAlias

(Optional, Boolean) Whether the request’s actions must target an index alias.

routing

(Optional, String) Used to route operations to a specific shard.

waitForActiveShards

(Optional, String) The number of copies of each shard that must be active before proceeding with the OpenSearch operation.

timeout

(Optional, String) The period of time to wait for some operations. The string should contain an OpenSearch API time unit.

flush

(Optional, Object) The flush configuration.

Details
Configuration example
{
  "maxCount": 25,
  "maxWeight": 25,
  "flushAfter": "PT5M"
}

Each configuration field is defined as follows:

maxCount

(Optional, Integer) The maximum number of records in the bulk before flushing.

maxWeight

(Optional, Long) The maximum weight allowed in a bulk request.

flushAfter

(Optional, Duration) The time to wait before flushing a bulk request.

Oracle Database

Performs different actions on an Oracle database, mainly reading from a table.

Scan Action: scan

Seed that runs an SQL query and creates a record for each row returned from the database.

Table 64. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Checksum

Distributed Scan

Yes

Hierarchical Records

No

Configuration example
{
  "type": "oracledb",
  "name": "My Oracle DB Scan Action",
  "config": {
    "action": "scan",
    "sql": "SELECT * FROM my_table", (1)
    "pageSize": 500
  },
  "pipeline": <Pipeline ID>, (2)
  "server": <Oracle Database Server ID> (3)
}
1 This field is required.
2 See the ingestion pipelines section.
3 See the Oracle Database integration section.

Each configuration field is defined as follows:

pageSize

(Optional, Integer) The number of records that will be fetched using pagination. Defaults to 1000.

sql

(Required, String) The SQL query that will be executed. This component supports processing records with pagination through the use of the offset and pageSize variables, which can be defined in the SQL query using the Mustache format. For example:

Pagination SQL example
SELECT record_id AS id, name FROM my_table WHERE field = 'condition' OFFSET {{offset}} ROWS FETCH NEXT {{pageSize}} ROWS ONLY

This SQL query retrieves the number of records specified by the pageSize variable, and the offset value ensures that previously processed records are skipped. Currently, pageSize and offset are the only supported variables, and they are not required. To fetch new documents, a new Scan job is created with updated values for offset and pageSize.
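The variable substitution and paging loop can be sketched as follows; plain string replacement stands in for the real Mustache rendering, and no actual JDBC execution takes place:

```python
# Sketch of how offset/pageSize could be rendered into the paginated SQL.
# Plain string replacement stands in for Mustache rendering; the statements
# are only built here, not executed against a database.

SQL = ("SELECT record_id AS id, name FROM my_table "
       "OFFSET {{offset}} ROWS FETCH NEXT {{pageSize}} ROWS ONLY")

def render(sql: str, offset: int, page_size: int) -> str:
    return (sql.replace("{{offset}}", str(offset))
               .replace("{{pageSize}}", str(page_size)))

def scan(total_rows: int, page_size: int):
    """Advance the offset page by page until all rows are covered."""
    offset, statements = 0, []
    while offset < total_rows:
        statements.append(render(SQL, offset, page_size))
        offset += page_size
    return statements

print(render(SQL, 0, 500))
print(len(scan(total_rows=1200, page_size=500)))  # 3 pages
```

Each Scan job corresponds to one rendered statement with its own offset, which is how the seed distributes the crawl across pages.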

The Oracle data types supported are the following:

Details
  • CHAR

  • VARCHAR2

  • NCHAR

  • NVARCHAR2

  • NUMBER

  • FLOAT

  • BINARY_FLOAT

  • BINARY_DOUBLE

  • BOOLEAN

  • DATE

  • TIMESTAMP

  • TIMESTAMP WITH TIME ZONE

  • TIMESTAMP WITH LOCAL TIME ZONE

  • INTERVAL DAY TO SECOND

  • INTERVAL YEAR TO MONTH

  • ROWID

  • RAW and BLOB: When serialized into JSON, it becomes a string encoded in Base 64.

  • CLOB

  • NCLOB

  • Collection (VARRAY or nested table)

The following data types are not supported and require a different approach:

Details
  • BFILE: The BFILE 's content should be read within the database and returned as a LOB in the results.

  • LONG and LONGRAW: The use of these data types is not recommended. They should be converted to a LOB.

  • Oracle Object and Object reference: The object’s fields should be dereferenced and included as separate columns in the results. See the DEREF function to get an object reference’s fields.

  • REF CURSOR: Not currently supported.

If a data type is not included in this list and cannot be converted to a regular Java data type, it most likely is not supported.

NOTE: By default, the record’s ID will be obtained from the field with the name "ID", case-insensitive. This is why, in the pagination example, record_id is selected with the alias id. This can be overridden with the recordPolicy.outboundPolicy.idPolicy.generator field in the seed’s configuration. See the Data Seed section.

Random

Generates records with random content.

Scan Action: plain, scan

Seed that generates a predetermined number of records with a single text field whose value is a random text of random length within the given range.

Table 65. Discovery Features for Data Seeds
Feature Supported

Distributed Scan

Yes

Hierarchical Records

No

Configuration example
{
  "type": "random",
  "name": "My Random Plain Action",
  "config": {
    "action": "plain",
    "records": 1000, (1)
    "charsPerRecord": 50 (1)
  },
  "pipeline": <Pipeline ID> (2)
}
1 These configuration fields are required.
2 See the ingestion pipelines section.

Each configuration field is defined as follows:

records

(Required, Integer) The number of records to generate.

charsPerRecord

(Required, Object or Integer) The number of random characters each record will have in its text field, if configured as a single positive integer. This value, if given, should be between 1 and 2147483647. If a randomly chosen number of characters within a range is wanted instead, this field can also be configured as an object, as shown next:

Details
{
  "charsPerRecord": {
    "min": 10,
    "max": 1000
  }
}
min

(Required, Integer) The minimum amount of characters that a record may have, inclusive. Must be higher than 1.

max

(Required, Integer) The maximum amount of characters that a record may have, inclusive. Must be higher than min and lower than 2147483647.
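The behavior described above can be sketched with the standard library's random module (an illustrative sketch, not Discovery's actual generator):

```python
import random
import string

# Sketch of the Random seed's "plain" action: N records, each with one text
# field of a fixed or ranged random length. Illustrative only.

def generate_records(records: int, chars_per_record) -> list:
    """chars_per_record: either an int, or a {'min': .., 'max': ..} range."""
    result = []
    for _ in range(records):
        if isinstance(chars_per_record, dict):
            length = random.randint(chars_per_record["min"],
                                    chars_per_record["max"])
        else:
            length = chars_per_record
        text = "".join(random.choices(string.ascii_letters + " ", k=length))
        result.append({"text": text})
    return result

records = generate_records(3, {"min": 10, "max": 1000})
print(len(records))             # 3
print(len(records[0]["text"]))  # somewhere between 10 and 1000, inclusive
```

A fixed integer produces records of identical length, while the object form draws each record's length independently from the inclusive range.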

Script

Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging.

Processor Action: process

Processor that executes a script that interacts with the record data generated from a seed execution.

Table 66. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "script",
  "name": "My Script Processor Action",
  "config": {
    "action": "process",
    "language": "groovy",
    "script": <Script> (1)
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

language

(Optional, String) The language of the script. Must be one of the supported script languages. Defaults to groovy.

script

(Required, String) The script to run.

Any output set in the output() object during the script execution will be saved in the record’s script field by default. This field’s name can be overwritten in the processor state configuration. For example, if a script runs a single output().put("field", "value") instruction, it’ll be saved in the records as:

Record output example
{
  "script": {
    "field": "value"
  }
}

SharePoint Online

Crawls information from Sites, Lists and ListItems from SharePoint Online.

Scan Action: scan

Seed that crawls the sites information from a SharePoint tenant. Crawls the parent site, its subsites, lists and list items, including the files and attachments for list items.

Note
Calendar Lists currently only crawl present and future events. Past events are not retrieved.
Table 67. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Checksum

Distributed Scan

Yes

Hierarchical Records

Yes

Configuration example
{
  "type": "sharepoint",
  "name": "My SharePoint Online Scan Action",
  "config": {
    "action": "scan",
    "sites": [ (1)
      "/sites/my-site",
      "/sites/my-other-site",
      "/sites/my-parent-site/my-subsite"
    ],
    "checkNoCrawl": true
  },
  "pipeline": <Pipeline ID>, (2)
  "server": <SharePoint Online Server ID> (3)
}
1 This configuration field is required.
2 See the ingestion pipelines section.
3 See the SharePoint Online integration section.

Each configuration field is defined as follows:

sites

(Required, Array of Strings) The list of sites to crawl. The URL must be of the form: /sites/*.

checkNoCrawl

(Optional, Boolean) If false, the noCrawl flag on sites and lists is ignored. Otherwise, any site or list marked as noCrawl will not be processed. Defaults to true.

Staging

Interacts with buckets and content from Discovery Staging.

Scan Action: scan, scroll

Seed that scrolls through a bucket and creates the records to be ingested into the pipeline.

Table 68. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Checksum

Distributed Scan

Yes

Hierarchical Records

No

Configuration example
{
  "type": "staging",
  "name": "My Staging Scan Action",
  "config": {
    "action": "scroll",
    "bucket": "my-bucket", (1)
    "metadata": false,
    "size": 35,
    "filter": <DSL Filter>, (2)
    "projection": <DSL Projection>, (3)
    "parentId": <Staging Document ID>
  },
  "pipeline": <Pipeline ID> (4)
}
1 This configuration field is required.
2 See the DSL Filter section.
3 See the DSL Projection section.
4 See the ingestion pipelines section.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket to scroll.

metadata

(Optional, Boolean) Whether to include the metadata or not. Defaults to false.

size

(Optional, Integer) The number of contents returned per scroll request.

filter

(Optional, DSL Filter) The filter to apply when scrolling.

projection

(Optional, DSL Projection) The projection to apply when scrolling.

parentId

(Optional, String) The parent ID to match.

Hook Action: create-bucket

Hook that creates a bucket with the given configuration.

Configuration example
{
  "type": "staging",
  "name": "My Staging Hook Action",
  "config": {
    "action": "create-bucket",
    "bucket": "new-bucket", (1)
    "config": {},
    "indices": [ (2)
      {
        "name": "my-index-1",
        "fields": [ (3)
          { "fieldA": "ASC" },
          { "fieldB": "DESC" }
        ]
      },
      {
        "name": "my-index-2",
        "fields": [
          { "fieldC": "DESC" }
        ]
      }
    ]
  }
}
1 This configuration field is required.
2 See the index definition.
3 See the index field definition.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket name.

config

(Optional, Object) The bucket configuration.

indices

(Optional, Array of Objects) The list of indices for the bucket.

Field definitions
name

(Required, String) The index name.

fields

(Required, Array of Objects) The index fields: key/value pairs with the field name and the corresponding sort ordering, either ASC or DESC.

Field object format
{ "<field>": "<ASC or DESC>" }
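As an illustration, the fields array maps naturally onto the (name, direction) key pairs used by most document stores. A minimal Python sketch (the function name and the 1/-1 direction encoding are assumptions for illustration, not part of Discovery):

```python
def index_keys(fields):
    """Flatten the index `fields` array into (name, direction) tuples,
    encoding ASC as 1 and DESC as -1 (an assumed, MongoDB-style encoding)."""
    return [(name, 1 if order == "ASC" else -1)
            for spec in fields
            for name, order in spec.items()]

keys = index_keys([{"fieldA": "ASC"}, {"fieldB": "DESC"}])
# keys == [("fieldA", 1), ("fieldB", -1)]
```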
Hook Action: delete-many

Hook that deletes multiple documents from a bucket.

Configuration example
{
  "type": "staging",
  "name": "My Staging Hook Action",
  "config": {
    "action": "delete-many",
    "bucket": "my-bucket", (1)
    "parentId": <Staging Document ID>,
    "filter": <DSL Filter> (2)
  }
}
1 This configuration field is required.
2 See the DSL Filter section.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket whose documents are deleted.

parentId

(Optional, String) Documents with this parent document will be deleted. If the filter field is also present, then documents that have this parent document and pass the filter are the only ones deleted.

filter

(Optional, DSL Filter) Documents that pass this filter will be deleted. If the parentId field is also present, then documents that pass the filter and have the configured parent document are the only ones deleted.

Note

Either the parentId field or the filter field must be provided, or both. Otherwise the hook will fail.
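The selection rule above can be sketched as a predicate. This is purely illustrative: should_delete is a hypothetical helper, and dsl_filter is a plain Python predicate standing in for a real DSL Filter.

```python
def should_delete(doc, parent_id=None, dsl_filter=None):
    """Return True when the document matches the delete-many selection:
    when both parentId and a filter are configured, the document must
    satisfy both; configuring neither is an error."""
    if parent_id is None and dsl_filter is None:
        raise ValueError("either parentId or filter must be provided")
    if parent_id is not None and doc.get("parentId") != parent_id:
        return False
    if dsl_filter is not None and not dsl_filter(doc):
        return False
    return True
```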

Processor Action: store, hydrate

Processor that stores the records into the given bucket.

Table 69. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "store",
    "bucket": "my-bucket", (1)
    "data": "#{ data('/my/record') }"
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket where the documents will be stored.

parentId

(Optional, String) The parent ID of the documents to store.

data

(Optional, Object) The data to store on the bucket. If not provided, the component will use the output from the last processor that generated data for each record.

metadata

(Optional, Boolean) Whether to store the metadata from the saved documents or not. If true, the metadata will be included as part of the execution data of each record. Defaults to false.

Template

Uses the Template Engine to process dynamic data provided by the user to generate a text output based on a custom template.

Processor Action: process

Processor that processes the provided template with the defined configuration.

Table 70. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "template",
  "name": "My Template Processor Action",
  "config": {
    "action": "process",
    "template": "Hello, ${name}!", (1)
    "bindings": { (1)
      "name": "John"
    },
    "outputFormat": "PLAIN"
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

template

(Required, String) The template to process.

bindings

(Required, Object) The bindings to replace in the template.

Binding object format
{
  "bindingA": "#{ data('/my/binding/field') }",
  ...
}

Each binding, defined as a key in the object, can be later referenced in a template:

My bindingA value is ${bindingA}
outputFormat

(Optional, String) The output format of the processed template. Supported formats are JSON and PLAIN. Defaults to PLAIN.

The output of the processor will be saved in the record’s template field by default. This field’s name can be overwritten in the processor state configuration. For example, given a processor with a PLAIN output format:

Record output example
{
  "template": "plain value"
}
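The ${binding} substitution shown above happens to mirror Python's string.Template placeholder syntax, which makes the semantics easy to sketch. process_template below is a hypothetical helper for illustration, not the Discovery Template Engine itself.

```python
from string import Template

def process_template(template: str, bindings: dict) -> str:
    # Replace each ${binding} placeholder with its bound value.
    return Template(template).substitute(bindings)

print(process_template("Hello, ${name}!", {"name": "John"}))  # prints "Hello, John!"
```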

Tika

Integrates with the Apache Tika library to detect and extract content and metadata from various file types. Tika supports a wide range of document formats and provides a consistent interface for text and metadata extraction.

Processor Action: parse, process

Processor that parses plain text and XHTML content from an input file using the Apache Tika Java API. It supports custom Tika configuration files and can include metadata in the output. This action is suitable for content extraction workflows that need to handle diverse file formats.

Table 71. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "tika",
  "name": "Tika Text Parser",
  "config": {
    "action": "process",
    "file": "#{ file('file.txt', 'BYTES') }", (1)
    "metadata": <Metadata>, (3)
    "config": "#{ file('tikaConfig.xml') }",
    "outputFormat": "plain" (1) (2)
  }
}
1 This configuration field is required.
2 Must be either plain or xhtml.
3 See the metadata configuration.

Each configuration field is defined as follows:

file

(Required, InputStream) The file to parse using Tika.

outputFormat

(Required, String) Defines the format of the parsed output. Must be either plain (plain text output) or xhtml (XHTML content output).

metadata

(Optional, Object)

Details
Configuration example
{
  "input": { (2)
    "keyA": "valueA"
  },
  "output": true (1)
}
1 This configuration field is required.
2 See the metadata input format.

Each configuration field is defined as follows:

input

(Required, Map of String/Object) A set of key-value pairs passed as input metadata to Tika.

Details
Configuration example
{
  "Accept-Encoding": "gzip, deflate, br"
}
output

(Required, Boolean) If set to true, Tika’s output metadata will be included in the result.

config

(Optional, InputStream) The Tika configuration XML, as defined in the official Tika configuration documentation.

The resulting parsed content and optional metadata will be saved in each record’s tika field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting Tika output in a record would be:

Record output example for plain text
{
  "tika": {
    "plain": "The Universe:\nThe universe is all of space and time and their contents.\nThe portion of the universe that can be seen by humans is approximately 93 billion light-years in diameter at present, but the total size of the universe is not known.\n",
    "metadata": {
      "keyA": "valueA",
      "X-TIKA:Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.csv.TextAndCSVParser"
      ],
      "X-TIKA:Parsed-By-Full-Set": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.csv.TextAndCSVParser"
      ],
      "Content-Encoding": "ISO-8859-1",
      "X-TIKA:detectedEncoding": "ISO-8859-1",
      "X-TIKA:encodingDetector": "UniversalEncodingDetector",
      "Content-Type": "text/plain; charset=ISO-8859-1"
    }
  }
}
Record output example for xhtml
{
  "tika": {
    "file": {
      "@discovery": "object",
      "bucket": "ingestion",
      "key": "ingestion-4049a2dd-17ca-440c-893b-cddfd817e45f/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b/f63067a4-ebc1-403b-a7a3-57b8a93a6d28/b1e92337-9ae0-427f-81e6-724e96af0970"
    },
    "metadata": {
      "keyA": "valueA",
      "X-TIKA:Parsed-By": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.html.JSoupParser"
      ],
      "dc:title": "Rooster",
      "author": "Wikipedia contributors",
      "X-TIKA:Parsed-By-Full-Set": [
        "org.apache.tika.parser.DefaultParser",
        "org.apache.tika.parser.html.JSoupParser"
      ],
      "Content-Encoding": "ISO-8859-1",
      "dc:creator": "Wikipedia contributors",
      "Content-Type-Hint": "text/html; charset=UTF-8",
      "X-TIKA:detectedEncoding": "ISO-8859-1",
      "X-TIKA:encodingDetector": "UniversalEncodingDetector",
      "Content-Type": "application/xhtml+xml; charset=ISO-8859-1"
    }
  }
}

Vespa

Uses the Vespa integration to send HTTP requests to a Vespa service.

Processor Action: store

Processor that upserts or deletes documents from a Vespa app using the Document API.

Table 72. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "vespa",
  "name": "My Vespa Store Action",
  "config": {
    "action": "store",
    "namespace": "my-namespace", (1)
    "documentType": "my-document-type", (1)
    "condition": "schema.field==value",
    "data": "#{ data('/my/record') }",
    "route": "default",
    "timeout": "180s",
    "traceLevel": 0
  },
  "server": <Vespa Server ID> (2)
}
1 These configuration fields are required.
2 See the Vespa integration section.

Each configuration field is defined as follows:

Note

Optional configuration fields that are missing from the configuration will not be included in requests.

namespace

(Required, String) The namespace of the Vespa document.

documentType

(Required, String) The document type of the Vespa document, as described in the schema (.sd) files.

condition

(Optional, String) The condition sent as query parameter in requests.

route

(Optional, String) The custom route for the requests.

timeout

(Optional, String) The timeout, in seconds, for the requests. Refer to the linked page for other available time units.

traceLevel

(Optional, Integer) The trace level for the request’s logging. Must be a whole number between 0 and 9, where higher gives more details.

data

(Optional, Object) The fields of the Vespa document. If not provided, the component will use the output from the last processor that generated data for each record.

Voyage AI

Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service.

Processor Action: embeddings

Processor that, given an input string and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.

Table 73. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "voyage-ai",
  "name": "My Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "voyage-large-2", (1)
    "input": "#{ data('/input') }", (1)
    "truncation": true,
    "inputType": "DOCUMENT",
    "outputDimension": 1536,
    "outputDatatype": "FLOAT",
    "encodingFormat": "Base64",
    "flush": <Bulk Flush Configuration> (2)
  },
  "server": <Voyage AI Server> (3)
}
1 These configuration fields are required.
2 See the flush configuration.
3 See the Voyage AI integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request. See models.

input

(Required, String) The input document to be embedded.

truncation

(Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputDimension

(Optional, Integer) The number of dimensions for resulting output embeddings. Defaults to null.

outputDatatype

(Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.

encodingFormat

(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.

flush

(Optional, Object) The flush configuration.

Details
Configuration example
{
  "maxCount": 25,
  "maxWeight": 25,
  "flushAfter": "PT5M"
}

Each configuration field is defined as follows:

maxCount

(Optional, Integer) The maximum number of records in the bulk before flushing.

maxWeight

(Optional, Long) The maximum weight allowed in a bulk request.

flushAfter

(Optional, Duration) The time to wait before flushing a bulk request.
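As an illustration, the three limits interact as follows: a bulk request is sent as soon as any one of maxCount, maxWeight, or flushAfter is exceeded. The Python sketch below uses assumed names, and simplifies flushAfter to plain seconds rather than an ISO-8601 duration such as PT5M.

```python
import time

class BulkBuffer:
    """Flush a batch when the record count, the accumulated weight,
    or the elapsed time since the last flush exceeds its limit."""

    def __init__(self, max_count=25, max_weight=25, flush_after_s=300.0):
        self.max_count = max_count
        self.max_weight = max_weight
        self.flush_after_s = flush_after_s
        self.records, self.weight = [], 0
        self.last_flush = time.monotonic()

    def add(self, record, weight=1):
        """Buffer a record; return the flushed batch if a limit was hit."""
        self.records.append(record)
        self.weight += weight
        if (len(self.records) >= self.max_count
                or self.weight >= self.max_weight
                or time.monotonic() - self.last_flush >= self.flush_after_s):
            return self.flush()
        return None

    def flush(self):
        batch, self.records, self.weight = self.records, [], 0
        self.last_flush = time.monotonic()
        return batch
```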

The resulting embeddings will be saved in each record’s voyage-ai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:

Record output example
{
  "voyage-ai": [-0.006929283495992422, -0.005336422007530928, ...]
}
Processor Action: multimodal-embeddings

Processor that, given an input list of multimodal inputs consisting of text, images, or an interleaving of both modalities, and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Multimodal Embedding and the API Multimodal embedding models endpoint.

Table 74. Discovery Features for Data Processing
Feature Supported

Record Splitting

No

Configuration example
{
  "type": "voyage-ai",
  "name": "My Multimodal Embeddings Action",
  "config": {
    "action": "multimodal-embeddings",
    "model": "voyage-multimodal-3", (1)
    "input": <Input Object>, (1) (2)
    "truncation": true,
    "inputType": "DOCUMENT",
    "outputEncoding": "Base64",
    "flush": <Bulk Flush Configuration> (3)
  },
  "server": <Voyage AI Server> (4)
}
1 These configuration fields are required.
2 There are multiple types of accepted inputs, check the input object definition for details.
3 See the embeddings action's flush configuration.
4 See the Voyage AI integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request. See models.

input

(Required, Object) The input object to be embedded.

Field definitions
type

(Required, String) The type. One of: text, image_url or image_base64.

text

(Optional, String) The text if the type text is chosen.

Text input example
{
  "type": "text",
  "text": "This is a banana."
}
imageUrl

(Optional, String) The image URL if the type image_url is chosen.

Image URL input example
{
  "type": "image_url",
  "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
}
imageBase64

(Optional, Object) The base 64 encoded image if the type image_base64 is chosen.

Image Base64 input example
{
  "type": "image_base64",
  "imageBase64": {
      "mediaType": "image/jpeg",
      "base64": true,
      "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)"
  }
}

Each configuration field is defined as follows:

mediaType

(Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.

base64

(Required, Boolean) Whether the data is encoded in Base64.

data

(Required, String) The data itself.

truncation

(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputEncoding

(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.

flush

(Optional, Object) The flush configuration. Its definition is the same as the embeddings action's flush.

The resulting embeddings will be saved in each record’s voyage-ai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:

Record output example
{
  "voyage-ai": [-0.006929283495992422, -0.005336422007530928, ...]
}

Web Crawler

Uses Norconex Web Crawler for the crawling of websites to extract their content.

Scan Action: scan

Seed that crawls a website using basic configurations.

Table 75. Discovery Features for Data Seeds
Feature Supported

Incremental Scan

Custom

Distributed Scan

No

Hierarchical Records

No

Configuration example
{
  "type": "webcrawler",
  "name": "My Web Crawler Scan Action",
  "config": {
    "action": "scan",
    "urls": ["https://pureinsights"], (1)
    "userAgent": "pureinsights-website-connector",
    "delay": "200ms",
    "maxDepth": 0,
    "maxDocuments": 0,
    "ignoreRobotsTxt": false,
    "ignoreRobotsMeta": false,
    "ignoreSiteMap": false,
    "ignoreCanonicalLinks": false,
    "metadataFilters": [
      ...
    ],
    "referenceFilters": [
      ...
    ],
    "connection": {
      ...
    }
  },
  "pipeline": <Pipeline ID> (2)
}
1 This configuration field is required.
2 See the ingestion pipelines section.

Each configuration field is defined as follows:

urls

(Required, Array of Strings) The list of starting URLs to crawl.

userAgent

(Optional, String) The user agent string that identifies the crawler to the websites being crawled. Defaults to pureinsights-website-connector.

delay

(Optional, Duration) The delay between each page download. Defaults to 200ms.

maxDepth

(Optional, Integer) The maximum number of levels deep to crawl from the starting URLs. Default is unlimited.

maxDocuments

(Optional, Integer) The maximum number of documents to successfully process. Default is unlimited.

ignoreRobotsTxt

(Optional, Boolean) Whether to ignore crawling instructions in robots.txt files. Default is false.

ignoreRobotsMeta

(Optional, Boolean) Whether to ignore in-page robot rules. Default is false.

ignoreSiteMap

(Optional, Boolean) Whether to ignore sitemap detection and resolution for the URLs to process. Default is false.

ignoreCanonicalLinks

(Optional, Boolean) Whether to ignore canonical links found in HTTP headers and in the <head> section of HTML files. Default is false.

metadataFilters

(Optional, Array of Objects) The list of filters to apply based on the documents' metadata.

Details
{
  "metadataFilters": [
    {
      "field": "Content-Type",
      "values": [
        "pdf"
      ],
      "mode": "EXCLUDE"
    }
  ]
}

Each configuration field is defined as follows:

field

(Required, String) The name of the metadata field.

values

(Required, Array of Strings) The list of values used to filter from the field specified.

mode

(Optional, String) The mode to define if the documents are either included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.
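As an illustration, one way such a metadata filter could be evaluated is sketched below. The substring matching is an assumption for the sketch; the crawler's exact matching semantics may differ.

```python
def passes_metadata_filter(metadata, field, values, mode="INCLUDE"):
    """A document matches when any configured value occurs in the named
    metadata field; INCLUDE keeps matching documents, EXCLUDE drops them."""
    matched = any(v in str(metadata.get(field, "")) for v in values)
    return matched if mode == "INCLUDE" else not matched
```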

referenceFilters

(Optional, Array of Objects) The list of filters to apply based on the documents' references (i.e. URLs).

Details
{
  "referenceFilters": [
    {
      "type": "BASIC",
      "filter": "my-page",
      "mode": "EXCLUDE",
      "caseSensitive": true
    }
  ]
}

Each configuration field is defined as follows:

type

(Required, String) The type of filter. One of:

  • WILDCARD: Filters by using wildcards with the * and ? characters.

  • BASIC: Text is matched as specified.

  • CSV: Same as having multiple BASIC filters, separated by commas.

  • REGEX: Filters by regular expressions.

filter

(Required, String) The value of the filter to apply.

mode

(Optional, String) The mode to define if the documents are either included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.

caseSensitive

(Optional, Boolean) Whether the filter is case-sensitive or not.
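The four filter types can be sketched as follows. The exact matching rules (substring versus full match) are assumptions for illustration, not the crawler's actual implementation.

```python
import re
from fnmatch import fnmatchcase

def reference_matches(filter_type, pattern, reference, case_sensitive=True):
    """Evaluate one reference filter of the given type against a URL."""
    if not case_sensitive:
        pattern, reference = pattern.lower(), reference.lower()
    if filter_type == "WILDCARD":   # * and ? wildcards
        return fnmatchcase(reference, pattern)
    if filter_type == "BASIC":      # text matched as specified
        return pattern in reference
    if filter_type == "CSV":        # multiple BASIC filters, comma-separated
        return any(p.strip() in reference for p in pattern.split(","))
    if filter_type == "REGEX":      # regular expression
        return re.search(pattern, reference) is not None
    raise ValueError(f"unknown filter type: {filter_type}")
```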

connection

(Optional, Object) The configuration of the connection for the HTTP Fetcher.

Details
{
  "connection": {
    "connectTimeout": "60s",
    "socketTimeout": "60s",
    "requestTimeout": "60s",
    "pool": {

    }
  }
}

Each configuration field is defined as follows:

connectTimeout

(Optional, Duration) The timeout to connect to the website. Defaults to 60s.

socketTimeout

(Optional, Duration) The maximum period of inactivity between two consecutive data packets. Defaults to 60s.

requestTimeout

(Optional, Duration) The timeout for a requested connection. Defaults to 60s.

pool

(Optional, Object) The configuration of the connection pool.

Details
{
  "pool": {
    "size": 200
  }
}
size

(Optional, Integer) The maximum number of connections that can be created. Defaults to 200.

Discovery QueryFlow

Discovery QueryFlow is a lightweight tool that processes external requests through configurable entrypoints with minimum overhead, while enabling:

  • Flexibility to represent complex query processing scenarios through a finite-state machine.

  • On-the-fly tuning of configurations for a fast feedback loop.

  • Extensive component library for advanced interpretation of the external request.

Discovery QueryFlow

Entrypoints

REST Endpoints

REST Endpoints API
Create a new Endpoint
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint' --data '{ ... }'
List all Endpoints
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint'
Get a single Endpoint
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint/{id}'
Update an existing Endpoint
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/endpoint/{id}' --data '{ ... }'
Note

The type of an existing endpoint can’t be modified.

Delete an existing Endpoint
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/endpoint/{id}'
Enable an existing Endpoint
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/enable'
Disable an existing Endpoint
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/disable'
Clone an existing Endpoint
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/clone?name=clone-new-name&uri=clone-new-uri&method=clone-new-method'
Query Parameters
method

(Required, String) The HTTP Method of the new Endpoint.

uri

(Required, String) The URI of the new Endpoint.

name

(Required, String) The name of the new Endpoint.

Search for Endpoints using DSL Filters
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Endpoints
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

Discovery QueryFlow enables the creation of custom RESTful APIs where each endpoint is defined by: a unique URI, an HTTP Method, the MIME type it produces, and the request pipeline with its corresponding finite-state machine for processing.

{
  "type": "default",
  "uri": "/my/custom/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "pipeline": <Pipeline ID>,
  "timeout": "60s"
  ...
}
type

(Required, String) The produced MIME type in the HTTP response, defined by the Accept HTTP Request header. Either json (or default) for application/json or stream for text/event-stream. Defaults to json.

httpMethod

(Required, String) The HTTP method for the custom endpoint. Must be one of: GET, POST, PUT, DELETE, PATCH.

uri

(Required, String) The URI path for the custom endpoint (e.g. /my/path).

The URI can contain variables in any of its path segments (e.g. /my/{pathA}, /{pathA}/{pathB}). If present, the values for every placeholder will be available as part of the metadata of the HTTP request and can be accessed in the configuration of the processors with the help of the Expression Language.

Details
{
  "uri": "/my/{pathA}/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "timeout": "60s"
  ...
}
{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ data('/httpRequest/pathVariables/pathA') }"
  }
  ...
}
name

(Required, String) The unique name to identify the custom endpoint.

description

(Optional, String) The description for the configuration.

pipeline

(Required, UUID) The ID of the pipeline configuration where the request is processed.

config

(Optional, Object) The configuration for the response after the execution of the request, based on the endpoint type.

JSON
{
  "uri": "/my/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "config": {
    ...
  },
  "timeout": "60s"
  ...
}
statusCode

(Required, Integer) The HTTP status code in the range of [200, 599[ for the response. Defaults to 200.

headers

(Optional, Object) The HTTP headers to return as part of the response.

Details
{
  "uri": "/my/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "config": {
    "headers": {
      "Etag": "#{ data('/my/etag/value') }",
      ...
    },
    ...
  },
  "timeout": "60s"
  ...
}
body

(Optional, Object) The HTTP JSON body to return as part of the response.

Details
{
  "type": "response",
  "body": {
    "keyA": "#{ data('/my/data') }",
    ...
  },
  ...
}
snippets

(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language.

Details
{
  "uri": "/my/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "config": {
    "body": {
     "myProperty": "#{ snippets.snippetA }"
    },
    "snippets": {
      "snippetA": {
       ...
      }
    },
    ...
  },
  "timeout": "60s"
  ...
}
Note

Avoid using reserved operators, such as hyphens, in the name of a snippet.

Stream (SSE)
{
  "uri": "/my/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "config": {
    ...
  },
  "timeout": "60s"
  ...
}
statusCode

(Required, Integer) The HTTP status code in the range of [200, 599[ for the response. Defaults to 207.

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors.

Details
{
  "uri": "/my/custom/endpoint",
  "httpMethod": "GET",
  "name": "My Custom Endpoint",
  "properties": {
    "keyA": "valueA"
  },
  "timeout": "60s"
  ...
}
{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ endpoint.properties.keyA }"
  },
  ...
}
timeout

(Required, Duration) The timeout for the execution of the custom endpoint.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Metadata

The execution starts with the metadata of the invocation stored in the JSON Data Channel:

{
  "id": "55d22c60-6d61-41ce-b8b1-c0f1acd6e5e4",
  "httpRequest": {
    ...
  },
  "pageable": {
    ...
  },
  "properties": {
    ...
  }
}
id

(UUID) An auto-generated ID for the execution.

httpRequest

(Object) The HTTP Request that triggered the execution.

Details
{
  "httpRequest": {
    "uri": "/my-endpoint",
    "method": "POST",
    "headers": {
      "header-a": "value-a",
      "header-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    "queryParams": {
      "param-a": "value-a",
      "param-b": [
        "value-b-1",
        "value-b-2"
      ]
    },
    "cookies": [
      {
        "name": "cookie-name-a",
        "path": "/some/path/a",
        "value": "cookie-value-a",
        "domain": "cookie-domain-a",
        "maxAge": 1234
      },
      {
        "name": "cookie-name-b",
        "value": "cookie-value-b"
      }
    ],
    "body": {
      "body-a": "value-a"
    },
    "pathVariables": {
      "variable-a": "value-a"
    }
  }
}
uri

(Required, String) The URI path for the HTTP Request.

method

(Required, String) The HTTP Method for the HTTP Request.

headers

(Optional, Object) The headers for the HTTP Request. The value of each header can be either a single String, or an Array of Strings.

queryParams

(Optional, Object) The query parameters for the HTTP Request. The value of each query parameter can be either a single String, or an Array of Strings.

cookies

(Optional, Array of Objects) The list of cookies for the HTTP Request.

Details
name

(Required, String) The name of the cookie.

value

(Required, String) The value of the cookie.

path

(Optional, String) The path of the cookie.

domain

(Optional, String) The domain of the cookie.

maxAge

(Optional, Integer) The maximum age of the cookie.

body

(Optional, Object) The body of the HTTP Request.

pathVariables

(Optional, Object) The variables of the HTTP Request’s URI path. The value of each variable must be a single String.

pageable

(Object) The pagination request parameters.

Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [
    {
      "property" : "fieldA",
      "direction" : "ASC"
    },
    {
      "property" : "fieldB",
      "direction" : "DESC"
    }
  ]
}

Each configuration field is defined as follows:

page

(Integer) The page number.

size

(Integer) The size of the page.

sort

(Array of Objects) The sort definitions for the page.

Field definitions
property

(String) The property where the sort was applied.

direction

(String) The direction of the applied sorting. Either ASC or DESC.
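As an illustration, applying a pageable request (page, size, sort) to an in-memory list of documents could look like the sketch below. apply_pageable is a hypothetical helper, not a Discovery API.

```python
from operator import itemgetter

def apply_pageable(docs, page=0, size=25, sort=None):
    """Sort by each sort definition (primary key first), then slice
    out the requested page."""
    # Applying the sort specs in reverse order with a stable sort
    # makes the first spec the primary sort key.
    for spec in reversed(sort or []):
        docs = sorted(docs, key=itemgetter(spec["property"]),
                      reverse=spec["direction"] == "DESC")
    start = page * size
    return docs[start:start + size]
```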

properties

(Object) The execution properties as configured in the Endpoint.

Expression Language Extensions
Table 76. Discovery QueryFlow Expression Language Variables
Variable Description Example

endpoint.id

The ID of the endpoint in execution

aa4a232e-8878-47ac-82d7-48ff77dbc039

endpoint.httpMethod

The HTTP method of the endpoint in execution

GET

endpoint.uri

The URI paths of the endpoint in execution

/my/custom/endpoint

endpoint.name

The name of the endpoint in execution

My Custom Endpoint

endpoint.description

The description of the endpoint in execution

Description of My Custom Endpoint

endpoint.properties

The properties of the endpoint in execution

[<myPropertyA,1>, <myPropertyB,Text Value>, <myPropertyC,>]

endpoint.labels

The labels of the endpoint in execution, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

Invoking a REST Endpoint

Once a REST Endpoint is fully configured, it can be invoked like any other REST API by calling its HTTP Method + URI + Type under the /api root path:

Invoke a REST Endpoint with JSON response
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value'
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value' --header 'Accept: application/json'

The expected HTTP response is the first match of the following rules:

  1. The predefined response as configured in the endpoint.

  2. The most recent entry in the JSON Data Channel, where:

    • If the root of the document is named httpResponse, the statusCode, headers and body will be used accordingly

      Details
      {
        "httpResponse": {
          "statusCode": 200,
          "headers": {
            "header-a": "value-a",
            "header-b": [
              "value-b-1",
              "value-b-2"
            ]
          },
          "body": {
          ...
          }
        }
      }
      statusCode

      (Integer) The HTTP Status Code of the HTTP Response.

      headers

      (Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings.

      Details
      {
        "httpResponse": {
          "headers": {
            "header-a": "value-a",
            "header-b": [
              "value-b-1",
              "value-b-2"
            ]
          },
          ...
        }
      }
      body

      (Object) The body of the HTTP Response.

    • If the root is any other node produced by a processor state, the body will be unwrapped from its outputField and the status code will be 200.

  3. In any other case, the response will be 204 - No Content.
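The resolution order above can be sketched as a small decision function. This is a hypothetical helper for illustration, assuming the latest JSON Data Channel entry is represented as a single-root JSON object.

```python
def resolve_response(predefined, last_json_entry):
    """Pick the HTTP response following the three rules in order."""
    if predefined is not None:              # rule 1: configured response
        return predefined
    if last_json_entry:                     # rule 2: last JSON Data Channel entry
        root, value = next(iter(last_json_entry.items()))
        if root == "httpResponse":          # explicit status/headers/body
            return {
                "statusCode": value.get("statusCode", 200),
                "headers": value.get("headers", {}),
                "body": value.get("body"),
            }
        # any other root node: unwrap the processor output with a 200
        return {"statusCode": 200, "headers": {}, "body": value}
    return {"statusCode": 204, "headers": {}, "body": None}  # rule 3: no content
```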

Invoke a REST Endpoint with SSE response
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value' --header 'Accept: text/event-stream'

A Server-Sent Event (SSE) response is immediately returned with its defined HTTP status code (or 207 - Multi-Status if undefined), and the events are emitted through the SSE Data Channel.

Once the execution is completed, the connection will be closed by the server.

Details
name: <Output Field Name>
data: <Data>
name

(String) The configured outputFieldName for the processor.

data

(Object) The JSON data through the channel.
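The name/data frame layout above can be parsed with a few lines. This is a minimal sketch of that frame format only, not a full SSE client, and parse_sse_events is a hypothetical helper.

```python
import json

def parse_sse_events(stream: str):
    """Split a raw event stream into frames (separated by blank lines)
    and return (name, data) pairs with the data decoded as JSON."""
    events = []
    for frame in stream.strip().split("\n\n"):
        fields = dict(line.split(": ", 1) for line in frame.splitlines())
        events.append((fields.get("name"), json.loads(fields["data"])))
    return events
```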

Debugging a REST Endpoint

Given that the definition of the Endpoint can grow in complexity, the risk of something breaking increases: a condition failed, the output was not as expected, a parameter was wrong…​

In order to identify the problem, the /debug root path offers complete tracing of the endpoint's execution. Each of the states, their output, their errors and the overall step-by-step path followed by the finite-state machine will be displayed:

Debug an Endpoint
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value'
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value' --header 'Accept: application/json'
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value' --header 'Accept: text/event-stream'
Event: Execution Start
[
  {
    "timestamp": 1769746448166,
    "event": "execution:start"
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

Event: Execution Complete
[
  {
    "timestamp": 1769746448166,
    "event": "execution:complete"
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

Event: Execution Error
[
  {
    "timestamp": 1769746448166,
    "event": "execution:error",
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

errorMessage

(Object) The error message of the event.

Event: Execution Timeout
[
  {
    "timestamp": 1769746448166,
    "event": "execution:timeout",
    "duration": 100
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

duration

(Long) The duration in milliseconds of the execution.

Event: JSON Data
[
  {
    "timestamp": 1769746448166,
    "event": "data:json",
    "data": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

data

(JSON) The JSON data generated.

Event: Server-Sent Event Data
[
  {
    "timestamp": 1769746448166,
    "event": "data:sse",
    "data": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

data

(JSON) The Server-Sent Event generated.

Event: State Start
[
  {
    "timestamp": 1769746448166,
    "event": "state:start",
    "type": ""
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

type

(String) The type of state.

Event: State Complete
[
  {
    "timestamp": 1769746448166,
    "event": "state:complete",
    "duration": 100
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

duration

(Long) The duration in milliseconds of the execution.

Event: State Error
[
  {
    "timestamp": 1769746448166,
    "event": "state:error",
    "duration": 100,
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

duration

(Long) The duration in milliseconds of the execution.

errorMessage

(Object) The error message of the event.

Event: Processor Step Start
[
  {
    "timestamp": 1769746448166,
    "event": "step:start",
    "stepIndex": 0
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

stepIndex

(Integer) The index of the step in execution.

Event: Processor Step Skip
[
  {
    "timestamp": 1769746448166,
    "event": "step:skip",
    "stepIndex": 0
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

stepIndex

(Integer) The index of the skipped step.

Event: Processor Step Complete
[
  {
    "timestamp": 1769746448166,
    "event": "step:complete",
    "stepIndex": 0
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

stepIndex

(Integer) The index of the step in execution.

Event: Processor Step Failure via Error Policy
[
  {
    "timestamp": 1769746448166,
    "event": "step:failure",
    "stepIndex": 0,
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

stepIndex

(Integer) The index of the step in execution.

errorMessage

(Object) The error message of the event.

Event: Processor Step Error
[
  {
    "timestamp": 1769746448166,
    "event": "step:error",
    "stepIndex": 0,
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

stepIndex

(Integer) The index of the step in execution.

errorMessage

(Object) The error message of the event.

Event: Switch Match
[
  {
    "timestamp": 1769746448166,
    "event": "switch:match",
    "option": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

option

(DSL Filter) The DSL Filter that matched.

Event: Switch Default
[
  {
    "timestamp": 1769746448166,
    "event": "switch:default"
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

Event: Parallel Pipeline Start
[
  {
    "timestamp": 1769746448166,
    "event": "pipeline:start",
    "tag": ""
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

tag

(String) The tag of the pipeline in execution.

Event: Parallel Pipeline Complete
[
  {
    "timestamp": 1769746448166,
    "event": "pipeline:complete",
    "tag": ""
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

tag

(String) The tag of the pipeline in execution.

Event: Parallel Pipeline Failure via Error Policy
[
  {
    "timestamp": 1769746448166,
    "event": "pipeline:failure",
    "tag": "",
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

tag

(String) The tag of the pipeline in execution.

errorMessage

(Object) The error message of the event.

Event: Parallel Pipeline Error
[
  {
    "timestamp": 1769746448166,
    "event": "pipeline:error",
    "tag": "",
    "errorMessage": {
      ...
    }
  }
]
timestamp

(Long) The epoch timestamp when the event happens.

event

(String) The name of the event.

tag

(String) The tag of the pipeline in execution.

errorMessage

(Object) The error message of the event.

Note

The debug request is exactly the same as the one sent to the /api path.

Note

The debug response contains the X-Request-ID header with the ID of the execution.
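Since the debug trace is a flat JSON array of event objects, it is easy to post-process. As an illustration (the event names match those listed above; the helper itself is not part of Discovery), a quick summary of a trace:

```python
from collections import Counter

def summarize_trace(trace):
    """Count debug events by type and sum the reported durations.
    `trace` is the JSON array returned by the /debug path."""
    counts = Counter(event["event"] for event in trace)
    # Only some events (timeouts, state/step completions, errors) carry a duration.
    total_ms = sum(event.get("duration", 0) for event in trace)
    return {"counts": dict(counts), "totalDurationMs": total_ms}
```

This makes it straightforward to spot, for example, how many `state:error` events occurred in a failing execution.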

MCP Server

MCP Servers API
Create a new MCP Server
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server' --data '{ ... }'
List all MCP Servers
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server'
Get a single MCP Server
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}'
Update an existing MCP Server
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}' --data '{ ... }'
Delete an existing MCP Server
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}'
Enable an existing MCP Server
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/enable'
Disable an existing MCP Server
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/disable'
Clone an existing MCP Server
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/clone?name=clone-new-name&uri=clone-new-uri'
Query Parameters
uri

(Optional, String) The URI of the new MCP Server.

name

(Optional, String) The name of the new MCP Server.

Search for MCP Servers using DSL Filters
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for MCP Servers
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

The Discovery QueryFlow implementation of the Model Context Protocol follows the 2025-11-25 version of the protocol and is exposed on top of the streamable HTTP transport layer.

The entrypoint connects AI applications to data sources through business rules defined with the help of the finite-state machine.

The supported capabilities (tools, logging and ping) are described below.

Note

Currently, only the POST endpoint is available. The implementation of the GET listener will be available in future versions of Discovery QueryFlow.

{
  "uri": "/my/mcp/server",
  "name": "My MCP Server",
  ...
}
uri

(Required, String) The URI path for the MCP Server.

name

(Required, String) The unique name to identify the MCP Server.

description

(Optional, String) The description for the configuration.

instructions

(Optional, String) The instructions to interact with the MCP Server.

capabilities

(Required, Object) The configuration of the capabilities exposed by the MCP Server.

Details
{
  "uri": "/my/mcp/server",
  "name": "My MCP Server",
  "pipeline": <Pipeline ID>,
  "capabilities": {
    "logging": {},
    "tools": {}
  },
  ...
}
logging

(Optional, Object) The logging capabilities of the MCP Server. If the server supports logging, an empty object must be included.

tools

(Optional, Object) The tools capabilities of the MCP Server. If the server supports tools, an empty object must be included.

serverInfo

(Required, Object) The information of the MCP Server.

Details
{
  "uri": "/my/mcp/server",
  "name": "My MCP Server",
  "serverInfo": {
    ...
  },
  ...
}
name

(Required, String) The name of the MCP Server.

version

(Required, String) The version of the MCP Server.

title

(Optional, String) The title of the MCP Server.

metadata

(Optional, Object) The metadata of the MCP Server.

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors.

requestTimeout

(Optional, Duration) The timeout for the execution of each request through the MCP Server. Defaults to 30s.

expireAfter

(Optional, Duration) The expiration timeout for idle sessions in the MCP Server.

active

(Optional, Boolean) Whether the MCP Server is active. Defaults to true.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.
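Putting the fields above together, a complete request body for creating an MCP Server might look like the following (all names and values are illustrative only):

```json
{
  "uri": "/my/mcp/server",
  "name": "My MCP Server",
  "description": "Example MCP Server",
  "instructions": "Use the exposed tools to query the collection.",
  "capabilities": {
    "logging": {},
    "tools": {}
  },
  "serverInfo": {
    "name": "my-mcp-server",
    "version": "1.0.0",
    "title": "My MCP Server"
  },
  "requestTimeout": "30s",
  "active": true,
  "labels": [
    { "key": "env", "value": "dev" }
  ]
}
```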

Capabilities
Tools
MCP Server Tools API
Create a new MCP Server Tool
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool' --data '{ ... }'
List all MCP Server Tools of an MCP Server
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool'
Get a single MCP Server Tool by ID and MCP Server
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}'
Update an existing MCP Server Tool
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}' --data '{ ... }'
Delete an existing MCP Server Tool
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}'

The tools capability allows the execution of custom QueryFlow Tools.

{
  "name": "My-MCP-Server-Tool",
  "config": {
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
name

(Required, String) The unique name to identify the MCP Server Tool. It must follow these restrictions.

description

(Optional, String) The description for the configuration.

config

(Required, Object) The MCP Server Tool configuration.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "inputSchema": {
      ...
    },
    "outputSchema": {
      ...
    },
    "annotations": {
      ...
    },
    "execution": {
      ...
    },
    "icons": [
      ...
    ],
    "response": {
      ...
    }
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
inputSchema

(Required, Object) The JSON Schema defining expected parameters for the MCP Server Tool execution.

outputSchema

(Optional, Object) The JSON Schema defining expected output structure for the MCP Server Tool execution.

annotations

(Optional, Object) Additional properties describing the MCP Server Tool.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "annotations": {
      "title": "My tool title",
      "readOnlyHint": false,
      "destructiveHint": false,
      "idempotentHint": false,
      "openWorldHint": false
    },
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
title

(Optional, String) The MCP Server Tool title.

readOnlyHint

(Optional, Boolean) If true, the tool doesn’t modify its environment.

destructiveHint

(Optional, Boolean) If true, the tool may perform destructive updates to its environment. If false, the tool performs only additive updates.

idempotentHint

(Optional, Boolean) If true, calling the tool repeatedly with the same arguments will have no additional effect on its environment.

openWorldHint

(Optional, Boolean) If true, this tool may interact with an “open world” of external entities. If false, the tool’s domain of interaction is closed.

execution

(Optional, Object) The execution-related properties for the MCP Server Tool.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "execution": {
      "taskSupport": "forbidden"
    },
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
taskSupport

(Optional, String) Indicates whether this tool supports task-augmented execution. One of:

  • forbidden: Tool does not support task-augmented execution.

  • optional: Tool may support task-augmented execution.

  • required: Tool requires task-augmented execution.

icons

(Optional, Array of Objects) Array of icons to display in user interfaces.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "icons": [
      {
        "src": "",
        "mimeType": "",
        "sizes": [
          ...
        ],
        "theme": "light"
      }
    ],
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
src

(Required, String) The URI pointing to the icon resource.

mimeType

(Optional, String) The MIME type override if the source type is missing or generic.

sizes

(Optional, Array of String) The sizes at which the icon may be used. Each size must be in WxH format (e.g. 48x48).

theme

(Optional, String) The theme for the icon. It can be either light or dark. The light theme is designed to be used with a light background, while the dark one is designed to be used with a dark background.

response

(Optional, Object) The object specifying the output of the tool execution. If not provided, the latest data node generated is used.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "response": {
      "snippets": {
        ...
      },
      "content": {
        ...
      }
    },
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
snippets

(Optional, Object) The snippets to be referenced in the content field with the help of the Expression Language.

Details
{
  "name": "My-MCP-Server-Tool",
  "config": {
    "response": {
      "snippets": {
        "snippetA": {
        ...
        }
      },
      "content": {
        ...
      }
    },
    ...
  },
  "pipeline": <Pipeline ID>,
  "timeout": "60s",
  ...
}
Note

Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

content

(Optional, Object) The object that matches the format specified in the outputSchema if provided, or the format of any Content Result.

pipeline

(Required, UUID) The ID of the pipeline configuration that is executed by the tool.

title

(Optional, String) The title of the custom tool.

timeout

(Optional, Duration) The timeout for the execution of the custom tool. Defaults to 30s.

properties

(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors.

active

(Optional, Boolean) Whether the MCP Server Tool is active. Defaults to true.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.
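As an end-to-end illustration of the fields above, a complete request body for creating an MCP Server Tool might look like the following (the tool name, schema and pipeline ID are placeholders):

```json
{
  "name": "My-MCP-Server-Tool",
  "title": "My Tool",
  "description": "Searches the collection",
  "config": {
    "inputSchema": {
      "type": "object",
      "properties": {
        "query": { "type": "string" }
      },
      "required": ["query"]
    },
    "annotations": {
      "title": "My tool title",
      "readOnlyHint": true,
      "openWorldHint": false
    }
  },
  "pipeline": "9aa1fec5-bf0d-4563-ade7-2cb270230bd7",
  "timeout": "60s",
  "active": true
}
```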

Logging

The logging capability is available in executions that return text/event-stream responses with the help of the Message State, where all messages are returned as notifications as defined in the MCP Specification.

Ping

The ping capability provides a mechanism to verify if the connection is alive.

Metadata

The execution starts with the metadata of the invocation stored in the JSON Data Channel:

{
  "id": "d8c7a2d3-3b02-4846-8cf3-0cd1af805b90",
  "session": {
    ...
  },
  "request": {
    ...
  }
}
id

(UUID) An auto-generated ID for the execution context.

session

(Object) The MCP session data.

Details
{
  "session": {
    "id": "d5a581ba-cf3f-4471-a5ce-d45b9eb63083"
  }
}
id

(UUID) An auto-generated ID for the MCP session between the MCP Server and the Client.

request

(Object) The MCP request data.

Details
{
  "request": {
    "toolName": "My-custom-tool",
    "id": "1",
    "arguments": {
      ...
    }
  }
}
toolName

(String) The name of the MCP Server Tool.

id

(String) The request ID.

arguments

(Object) The request parameters, if provided.

Expression Language Extensions
Table 77. Discovery QueryFlow Expression Language Variables
Variable Description Example

tool.id

The ID of the tool in execution

aa4a232e-8878-47ac-82d7-48ff77dbc039

tool.name

The name of the tool in execution

My-Custom-Tool

tool.description

The description of the tool in execution

Description of My Custom Tool

tool.properties

The properties of the tool in execution

[<myPropertyA,1>, <myPropertyB,Text Value>, <myPropertyC,>]

tool.labels

The labels of the tool in execution, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

Invoking an MCP Server

Once an MCP Server is fully configured, it can be invoked using the default mechanism for sending messages to the server over the Streamable HTTP transport layer, under the /mcp/{server-uri} root path:

MCP Server Path
$ curl --request POST 'queryflow-api:12040/v2/mcp/my-server-uri' \
       --header 'Accept: application/json' \
       --header 'Accept: text/event-stream'

For capabilities that execute through the finite-state machine, if not explicitly defined, the final JSON-RPC Response will be the most recent entry in the JSON Data Channel.
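The body of the POST request is a JSON-RPC 2.0 message as defined by the MCP specification. For example, a tools/call request invoking a custom tool named My-MCP-Server-Tool (the tool name and arguments are illustrative) could be:

```json
{
  "jsonrpc": "2.0",
  "id": 1,
  "method": "tools/call",
  "params": {
    "name": "My-MCP-Server-Tool",
    "arguments": {
      "query": "discovery"
    }
  }
}
```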

Pipeline

Pipelines API
Create a new Pipeline
$ curl --request POST 'queryflow-api:12030/v2/pipeline' --data '{ ... }'
List all Pipelines
$ curl --request GET 'queryflow-api:12030/v2/pipeline'
Get a single Pipeline
$ curl --request GET 'queryflow-api:12030/v2/pipeline/{id}'
Update an existing Pipeline
$ curl --request PUT 'queryflow-api:12030/v2/pipeline/{id}' --data '{ ... }'
Delete an existing Pipeline
$ curl --request DELETE 'queryflow-api:12030/v2/pipeline/{id}'
Clone an existing Pipeline
$ curl --request POST 'queryflow-api:12030/v2/pipeline/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Pipeline.

Search for Pipelines using DSL Filters
$ curl --request POST 'queryflow-api:12030/v2/pipeline/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Pipelines
$ curl --request GET 'queryflow-api:12030/v2/pipeline/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

A pipeline is the definition of the finite-state machine for processing a request:

{
  "name": "My Pipeline",
  "initialState": "stateA",
  "states": {
    "stateA": {
      ...
    },

    "stateB": {
      ...
    }
  },
  ...
}
name

(Required, String) The unique name to identify the pipeline.

description

(Optional, String) The description for the configuration.

initialState

(Required, String) The state, as defined in the states field, to be used as starting point for the request processing.

states

(Required, Object) The states associated to the pipeline.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Note

Loops are not forbidden as they might represent valid use cases depending on the configuration of the states. To avoid getting stuck in infinite loops, all entrypoints are required to be configured with a timeout.
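As an illustration of how states chain together (the state names and the processor ID are placeholders), a minimal pipeline that runs a processor state and then branches with a switch might be defined as:

```json
{
  "name": "My Pipeline",
  "initialState": "runProcessors",
  "states": {
    "runProcessors": {
      "type": "processor",
      "processors": [
        { "id": "9aa1fec5-bf0d-4563-ade7-2cb270230bd7" }
      ],
      "next": "route"
    },
    "route": {
      "type": "switch",
      "options": [
        {
          "condition": {
            "equals": {
              "field": "/httpRequest/queryParams/input",
              "value": "valueA"
            }
          },
          "state": "final"
        }
      ]
    },
    "final": {
      "type": "field-mapper",
      "mapping": {
        "result": "#{ data(-1) }"
      }
    }
  }
}
```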

Processors

Processors API
Create a new Processor
$ curl --request POST 'queryflow-api:12040/v2/processor' --data '{ ... }'
List all Processors
$ curl --request GET 'queryflow-api:12040/v2/processor'
Get a single Processor
$ curl --request GET 'queryflow-api:12040/v2/processor/{id}'
Update an existing Processor
$ curl --request PUT 'queryflow-api:12040/v2/processor/{id}' --data '{ ... }'
Note

The type of an existing processor can’t be modified.

Delete an existing Processor
$ curl --request DELETE 'queryflow-api:12040/v2/processor/{id}'
Clone an existing Processor
$ curl --request POST 'queryflow-api:12040/v2/processor/{id}/clone?name=clone-new-name'
Query Parameters
name

(Required, String) The name of the new Processor.

Test an existing processor
$ curl --request POST 'queryflow-api:12040/v2/processor/{id}/test?timeout=PT15S'
Query Parameters
timeout

(Optional, Duration) The timeout for the request execution. Defaults to 15s.

Body

The body payload is the input value for the Processor.

Search for Processors using DSL Filters
$ curl --request POST 'queryflow-api:12040/v2/processor/search' --data '{ ... }'
Body

The body payload is a DSL Filter to apply to the search.

Autocomplete for Processors
$ curl --request GET 'queryflow-api:12040/v2/processor/autocomplete?q=value'
Query Parameters
q

(Required, String) The query to execute the autocomplete search.

Each component is stateless, and it’s driven by the configuration defined in the processor and by the context created by the current HTTP Request. This design makes the processor the main building block of Discovery QueryFlow.

They are intended to solve very specific tasks, which makes them reusable and simple to integrate into any part of the configuration.

{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    ...
  },
  ...
}
type

(Required, String) The name of the component to execute.

name

(Required, String) The unique name to identify the configuration.

description

(Optional, String) The description for the configuration.

config

(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language.

snippets

(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language.

Details
{
  "type": "my-component-type",
  "name": "My Component Processor",
  "config": {
    "myProperty": "#{ snippets.snippetA }"
  },
  "snippets": {
    "snippetA": {
      ...
    }
  },
  ...
}
Note

Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

server

(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.

Details
{
  "server": {
    "id": "ba637726-555f-4c68-bfed-1c91f4803894",
    ...
  },
  ...
}
id

(Required, UUID) The ID of the server configuration for the integration.

credential

(Optional, UUID) The ID of the credential to override the default authentication in the external service.

labels

(Optional, Array of Objects) The labels for the configuration.

Details
{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Query Processing with a State Machine

State Types

Processor State

Executes a single or multiple processors in sequence:

{
  "myProcessorState": {
    "type": "processor",
    "processors": [
      ...
    ]
  }
}
type

(Required, String) The type of state. Must be processor.

processors

(Required, Array of Objects) The processors to execute.

Details
{
  "stateA": {
    "type": "processor",
    "processors": [
      {
        "id": <Processor ID>,
        ...
      }
    ],
    ...
  }
}
id

(Required, UUID) The ID of the processor to execute.

outputField

(Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component.

continueOnError

(Optional, Boolean) If true and the processor execution fails, its HTTP response will be stored in its corresponding Data Channel while the other processors in the state continue with their normal execution. If false, the error will either be handled by the onError state, or be spread to its invoker. Defaults to false.

active

(Optional, Boolean) false to disable the execution of the processor.

next

(Optional, String) The next state for the HTTP Request Execution after the completion of the state. If not provided, the current one will be assumed as the final state.

onError

(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.

The JSON output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:

{
  "defaultFieldName": {
    "outputKey": "outputValue"
  }
}

Events emitted through the SSE Data Channel are transmitted as expected by the entrypoint.

Field Mapper State

Adds a new data node on the JSON Data Channel wrapped in the name of the state:

{
  "myFieldMapperState": {
    "type": "field-mapper",
    "mapping": {},
    ...
    "next": "myNextState"
  }
}
type

(Required, String) The type of state. Must be field-mapper.

mapping

(Required, JSON) Any valid JSON (String, Number, Array, Object) created with the help of the Expression Language.

Details
{
  "myFieldMapperState": {
    "type": "field-mapper",
    "mapping": {
      "myProperty": "#{ data('/my/value') }",
      ...
    },
    ...
    "next": "myNextState"
  }
}
snippets

(Optional, Object) The snippets to be referenced in the mapping with the help of the Expression Language.

Details
{
  "myFieldMapperState": {
    "type": "field-mapper",
    "mapping": {
      "myProperty": "#{ snippets.snippetA }",
      ...
    },
    "snippets": {
      "snippetA": {
       ...
      }
    },
    "next": "myNextState"
  }
}
Note

Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

next

(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.

Message State

Adds a new data node on the SSE Data Channel with the name of the state as event type:

{
  "myMessageState": {
    "type": "message",
    "data": {},
    ...
    "next": "myNextState"
  }
}
type

(Required, String) The type of state. Must be message.

data

(Required, JSON) Any valid JSON (String, Number, Array, Object) created with the help of the Expression Language.

Details
{
  "myMessageState": {
    "type": "message",
    "data": {
      "myProperty": "#{ data('/my/value') }",
      ...
    },
    ...
    "next": "myNextState"
  }
}
snippets

(Optional, Object) The snippets to be referenced in the mapping with the help of the Expression Language.

Details
{
  "myMessageState": {
    "type": "message",
    "data": {
      "myProperty": "#{ snippets.snippetA }",
      ...
    },
    "snippets": {
      "snippetA": {
       ...
      }
    },
    "next": "myNextState"
  }
}
Note

Avoid the usage of any reserved operator such as hyphens in the name of a snippet.

next

(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.

Parallel Pipeline State

Executes a single or multiple pipelines in parallel:

{
  "myParallelPipelineState": {
    "type": "pipeline",
    "pipelines": {
      ...
    }
  }
}
type

(Required, String) The type of state. Must be pipeline.

pipelines

(Required, Object) The pipelines to execute in parallel.

Details
{
  "type": "pipeline",
  "pipelines": {
    "myPipelineA": {
      "id": <Pipeline ID>,
      ...
    },
    ...
  },
  ...
}
id

(Required, UUID) The ID of the pipeline to execute in parallel.

input

(Optional, Object) The custom metadata to be used at the start of the pipeline execution. All fields can be configured with the help of the Expression Language.

Details
{
  "type": "pipeline",
  "pipelines": {
    "myPipelineA": {
      "id": <Pipeline ID>,
      "input": {
        "myField": "#{ data('/my/value') }"
      }
    },
    ...
  },
  ...
}
output

(Optional, Object) The custom output to be used as the result of the pipeline execution. All fields can be configured with the help of the Expression Language. If not provided, the last data node generated will be considered as the output.

Details
{
  "type": "pipeline",
  "pipelines": {
    "myPipelineA": {
      "id": <Pipeline ID>,
      "output": {
        "myField": "#{ data('/my/value') }"
      }
    },
    ...
  },
  ...
}
errorPolicy

(Optional, String) If IGNORE and the pipeline execution fails, the other pipelines in the state continue with their normal execution. If FAIL and the pipeline fails, the error will either be handled by the onError state, or be spread to its invoker. The error is always stored in its corresponding tag on the JSON Data Channel. Defaults to FAIL.

active

(Optional, Boolean) false to disable the execution of the pipeline. If all pipelines are disabled, the state output will be empty.

next

(Optional, String) The next state for the HTTP Request Execution after the completion of all configured pipelines. If not provided, the current one will be assumed as the final state.

onError

(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.

The output of the state stored in the JSON Data Channel is a collection with each response:

{
  "myParallelPipelineState": {
    "myPipelineA": {
      ...
    },
    ...
  }
}

Events emitted in the SSE Data Channel are always spread to the invoker.

Switch State

Use DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution given the first matching condition:

{
  "mySwitchState": {
    "type": "switch",
    "options": [
      ...
    ],
    "default": "myDefaultState"
  }
}
type

(Required, String) The type of state. Must be switch.

options

(Required, Array of Objects) The options to evaluate in the state.

Details
{
  "type": "switch",
  "options": [
    {
      "condition": {
        "equals": {
          "field": "/httpRequest/queryParams/input",
          "value": "valueA"
        },
        ...
      },
      "state": "myFirstState"
    },
    ...
  ],
  ...
}
condition

(Required, Object) The predicate described as a DSL Filter over the JSON processing data.

state

(Optional, String) The next state for the finite-state machine if the condition evaluates to true.

default

(Optional, String) The default state for the finite-state machine if no option evaluates to true.

Note

If no state for the finite-state machine is selected, the current one will be assumed as the final state.

Logger State

Asynchronously logs a message through the logger of the entrypoint:

{
  "myLoggerState": {
    "type": "logger",
    "message": "#{ data('/my/log/message') }",
    "level": "ERROR",
    "loggerName": "my-custom-logger",
    "next": "myNextState"
  }
}
message

(Required, JSON) Any valid JSON (String, Number, Array, Object) with the message to log.

level

(Optional, String) The logging level. One of: EMERGENCY, ALERT, CRITICAL, ERROR, WARN, NOTICE, INFO, DEBUG. Defaults to INFO.

loggerName

(Optional, String) The name of the logger. Defaults to QueryFlowLogger.

next

(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.

External Request Execution

Data Channels

Some state types produce data that is available for subsequent states.

The JSON Data Channel handles application/json output which can be later referenced using JSON Pointers.

Note

New data nodes will never override data nodes previously generated.

Note

When searching for a path, the JSON Pointer will be evaluated against the most recent output. If it is a match, the node is returned. Otherwise, the search continues with the previous one.

The SSE Data Channel handles text/event-stream that gets emitted based on the entrypoint that triggers the execution.

Expression Language Extensions
Table 78. Discovery QueryFlow Expression Language Variables
Variable Description Example

processor.id

The ID of the processor in execution during a processor state

9aa1fec5-bf0d-4563-ade7-2cb270230bd7

processor.type

The type of the processor in execution during a processor state

my-component-type

processor.name

The name of the processor in execution during a processor state

My Component Processor

processor.description

The description of the processor in execution during a processor state

Description of My Component Processor

processor.labels

The labels of the processor in execution during a processor state, grouped by key

[<keyA,[valueA,valueB]>, <keyB,valueC>]

execution.id

The unique ID of the current HTTP request

b366282d-fa00-4411-be02-874111ee317c

execution.startTimestamp

The timestamp when the current HTTP request started

2023-12-08T10:30:00Z

Table 79. Discovery QueryFlow Expression Language Functions
Function Description Example

data(string)

Finds a specific node within the JSON processing channel using a JSON Pointer

data('/path/to/field')

data(integer)

References a specific node within the JSON processing channel using a 0-based index, where 0 is the first data node generated in the channel. The method supports negative numbers, where -1 represents the latest data node generated. If the index is not found, the result is null

data(0), data(-1)
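
The behavior of both data overloads can be modeled with a small sketch. The semantics below are assumed from the descriptions above (pointer lookups search the most recent node first; integer lookups are 0-based with negative indices counting from the end); this is not the actual implementation.

```python
# Hypothetical model of the JSON Data Channel: a list of data nodes where
# data(pointer) searches from the most recent node backwards, and
# data(index) addresses nodes by position (negative indices from the end).

def resolve_pointer(node, pointer):
    for part in pointer.lstrip("/").split("/"):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node

class DataChannel:
    def __init__(self):
        self.nodes = []  # oldest first

    def push(self, node):
        self.nodes.append(node)

    def data(self, ref):
        if isinstance(ref, int):
            try:
                return self.nodes[ref]  # supports negatives, e.g. -1 = latest
            except IndexError:
                return None
        # Pointer lookup: the most recent node that matches wins.
        for node in reversed(self.nodes):
            value = resolve_pointer(node, ref)
            if value is not None:
                return value
        return None

channel = DataChannel()
channel.push({"tokenizer": {"tokens": ["red", "shoes"]}})
channel.push({"elasticsearch": {"hits": 42}})
print(channel.data("/elasticsearch/hits"))  # 42
print(channel.data(-1))                     # {'elasticsearch': {'hits': 42}}
```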

Components

Amazon Bedrock

Sends requests to Amazon Bedrock. Supports multiple actions for different endpoints of the service. This component’s output field is named amazonBedrock by default.

Processor Action: invoke-model

Processor that invokes the specified Amazon Bedrock model to run inference using the prompt and the inference parameters provided in the configuration.

Configuration example
{
  "type": "amazon-bedrock",
  "name": "My Amazon Bedrock Processor Action",
  "config": {
    "action": "invoke-model",
    "model": "amazon.titan-text-premier-v1:0", (1)
    "request": { (1) (2)
      "inputText": "Write a short story about a rooster",
      "textGenerationConfig": {
        "maxTokenCount": 50,
        "stopSequences": [],
        "temperature": 0.7,
        "topP": 0.9
      }
    },
    "stream": false
  },
  "server": <Amazon Bedrock Server ID> (3)
}
1 These configuration fields are required.
2 This is just a request example; the exact structure is defined by the Amazon Bedrock API.
3 See the Amazon Bedrock integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to invoke.

request

(Required, Object) The body of the request.

stream

(Optional, Boolean) Whether to enable streaming. Defaults to false.

The response of the action is stored in the JSON Data Channel as returned by the Amazon Bedrock service:

Action output structure
{
  "amazonBedrock": {
    ...
  }
}

Chunker

Splits large documents into smaller units that are easier for LLMs to interpret. It exposes different strategies that can be used. This component’s output field is named chunker by default.

Action: sentence

Processor that splits the text by sentences.

Configuration example
{
  "type": "chunker",
  "name": "My Chunk-by-Sentence Action",
  "config": {
    "action": "sentence",
    "text": " #{data('/text')} ", (1)
    "sentences": 4,
    "overlap": 1,
    "maxChars": 200
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

text

(Required, String) The text to process.

sentences

(Optional, Integer) The number of sentences per chunk. Defaults to 20.

overlap

(Optional, String/Integer) The number of sentences to overlap; either a percentage or an absolute number of sentences. Defaults to 10%.

maxChars

(Optional, Integer) The maximum number of chars per chunk.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "chunker": {
    "chunks": [
      "Lorem ipsum dolor sit amet consectetur adipiscing elit. Placerat in id cursus mi pretium tellus duis. Urna tempor pulvinar vivamus fringilla lacus nec metus.",
      "Urna tempor pulvinar vivamus fringilla lacus nec metus. Integer nunc posuere ut hendrerit semper vel class. Conubia nostra inceptos himenaeos orci varius natoque penatibus."
    ],
    "errors": [
      {
        "index": 3,
        "text": "Mus donec rhoncus eros lobortis nulla molestie mattis Purus est efficitur laoreet mauris pharetra vestibulum fusce Sodales consequat magna ante condimentum neque at luctus Ligula congue sollicitudin erat viverra ac tincidunt nam.",
        "error": {
          "status": 400,
          "code": 3003,
          "messages": [
            "Chunk of size 229 exceeds maximum char limit of 200"
          ],
          "timestamp": "2025-09-11T15:43:42.925739900Z"
        }
      }
    ]
  }
}
Note

If the overlapped text exceeds the maxChars value, then the number of overlapped items will be reduced until there is a valid overlap. This includes no overlap at all, if needed.
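
The windowed chunking and error reporting described above can be sketched as follows. This is an illustration with simplified overlap handling, not the actual Chunker implementation; oversized chunks are reported as errors as in the output example.

```python
# Illustrative sketch of sentence chunking with overlap (not the actual
# Chunker implementation). Chunks that exceed max_chars are collected as
# errors rather than emitted, mirroring the action's output structure.

def chunk_sentences(sentences, per_chunk=4, overlap=1, max_chars=None):
    chunks, errors = [], []
    step = max(per_chunk - overlap, 1)  # always advance the window
    start = 0
    while start < len(sentences):
        end = min(start + per_chunk, len(sentences))
        chunk = " ".join(sentences[start:end])
        if max_chars is not None and len(chunk) > max_chars:
            errors.append({"index": start, "text": chunk})
        else:
            chunks.append(chunk)
        if end == len(sentences):
            break
        start += step
    return chunks, errors

chunks, errors = chunk_sentences(
    ["One.", "Two.", "Three.", "Four.", "Five."],
    per_chunk=2, overlap=1, max_chars=30)
print(chunks)  # ['One. Two.', 'Two. Three.', 'Three. Four.', 'Four. Five.']
```

The word action below follows the same pattern with words as the unit instead of sentences.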

Action: word

Processor that splits the text by words.

Configuration example
{
  "type": "chunker",
  "name": "My Chunk-by-Word Action",
  "config": {
    "action": "word",
    "text": " #{data('/text')} ", (1)
    "words": 8,
    "overlap": 3,
    "maxChars": 70
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

text

(Required, String) The text to process.

words

(Optional, Integer) The number of words per chunk. Defaults to 20.

overlap

(Optional, String/Integer) The number of words to overlap; either a percentage or an absolute number of words. Defaults to 10%.

maxChars

(Optional, Integer) The maximum number of chars per chunk.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "chunker": {
    "chunks": [
      "Lorem ipsum dolor sit amet consectetur adipiscing elit.",
      "consectetur adipiscing elit. Vitae pellentesque sem placerat in",
      "sem placerat in id cursus mi. Tempus leo",
      "mi. Tempus leo eu aenean sed diam urna",
      "sed diam urna tempor. aptent taciti sociosqu. Conubia",
      "taciti sociosqu. Conubia nostra inceptos himenaeos orci varius",
      "himenaeos orci varius natoque penatibus. Montes nascetur ridiculus",
      "Montes nascetur ridiculus mus donec rhoncus eros lobortis.",
      "rhoncus eros lobortis. Maximus eget fermentum odio phasellus",
      "fermentum odio phasellus non purus est. Vestibulum fusce",
      "est. Vestibulum fusce dictum risus blandit quis suspendisse",
      "blandit quis suspendisse aliquet. Ante condimentum neque at",
      "condimentum neque at luctus nibh finibus facilisis. Ligula",
      "finibus facilisis. Ligula congue sollicitudin erat viverra ac",
      "erat viverra ac tincidunt nam. Euismod quam justo",
      "Euismod quam justo lectus commodo augue arcu dignissim."
    ],
    "errors": [
      {
        "index": 24,
        "text": "NecmetusbibendumegestasiaculismassanislmalesuadaUthendreritsempervelclass",
        "error": {
          "status": 400,
          "code": 3003,
          "messages": [
            "Chunk of size 73 exceeds maximum char limit of 70"
          ],
          "timestamp": "2025-09-11T19:59:18.402427300Z"
        }
      }
    ]
  }
}
Note

If the overlapped text exceeds the maxChars value, then the number of overlapped items will be reduced until there is a valid overlap. This includes no overlap at all, if needed.

Elasticsearch

Uses the Elasticsearch integration to send requests to the Elasticsearch API. Supports multiple actions for common operations such as search, but also provides a mechanism to send raw Elasticsearch queries. This component’s output field is named elasticsearch by default.

Action: autocomplete

Processor that executes a completion suggester query.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "autocomplete",
    "index": "my-index", (1)
    "text": "#{ data('my/query') }", (1)
    "field": "content", (1)
    "size": 3,
    "skipDuplicates": true
  },
  "server": <Elasticsearch Server ID> (2)
}
1 These configuration fields are required.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

text

(Required, String) The text to search.

field

(Required, String) The field where to search.

skipDuplicates

(Optional, Boolean) Whether to skip duplicate suggestions.

size

(Optional, Integer) The number of suggestions to return.

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}
Action: knn

Processor that executes a k-nearest neighbor (kNN) query using approximate kNN.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "knn",
    "index": "my-index", (1)
    "field": "content", (1)
    "maxResults": 5, (1)
    "vector": "#{ data('my/vector') }", (1)
    "k": 5, (1)
    "candidatesPerShard": 20, (1)
    "query": {
      "match_all": {}
    }
  },
  "server": <Elasticsearch Server ID> (2)
}
1 These configuration fields are required.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

field

(Required, String) The field where to search.

maxResults

(Required, Integer) The maximum number of results.

vector

(Required, Array of Float) The source vector to compare.

k

(Required, Integer) The number of nearest neighbors.

candidatesPerShard

(Required, Integer) The number of nearest neighbors considered per shard.

query

(Optional, Object) The query to filter in addition to the kNN search.
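
For reference, the configuration plausibly maps onto an Elasticsearch approximate-kNN search body along the following lines. The exact mapping performed by the component is an assumption; the field names follow the Elasticsearch kNN search API.

```python
# Sketch of how the knn action's configuration plausibly maps onto an
# Elasticsearch approximate-kNN search body (the component's exact mapping
# is an assumption; field names follow the Elasticsearch kNN API).

def build_knn_body(field, vector, k, candidates_per_shard,
                   max_results, query=None):
    body = {
        "knn": {
            "field": field,
            "query_vector": vector,
            "k": k,
            "num_candidates": candidates_per_shard,
        },
        "size": max_results,
    }
    if query is not None:
        body["knn"]["filter"] = query  # optional filter applied to the kNN search
    return body

body = build_knn_body("content", [0.1, 0.2, 0.3], k=5,
                      candidates_per_shard=20, max_results=5,
                      query={"match_all": {}})
```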

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}
Action: native

Processor that executes a native Elasticsearch query.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "native",
    "path": "/my-index/_doc/1", (1)
    "method": "POST", (1)
    "queryParams": {
      "param1": "value1"
    },
    "body": {
      "field1": "value2"
    }
  },
  "server": <Elasticsearch Server ID> (2)
}
1 These configuration fields are required.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

path

(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.

method

(Required, String) The HTTP method for the request.

queryParams

(Optional, Map of String/String) The map of query parameters for the URL.

body

(Optional, Object) The JSON body to submit.

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}
Action: search

Processor that executes a match query on the index.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "search",
    "index": "my-index", (1)
    "text": "#{ data('my/query') }", (1)
    "field": "content", (1)
    "suggest": { (2)
      "completion-suggestion": {
        "prefix": "Value ",
        "completion": {
          "field": "field.completion"
        }
      }
    },
    "aggregations": { (2)
      "aggregationA": {
        "terms": {
          "field": "field.keyword"
        }
      }
    },
    "highlight": { (2)
      "fields": {
        "field": {}
      }
    },
    "filter": <DSL Filter>, (3)
    "pageable": <Pagination Parameters> (4)
  },
  "server": <Elasticsearch Server ID> (5)
}
1 These configuration fields are required.
2 The exact expected structure of these objects is defined by the Elasticsearch API; this is just an example at the time of writing.
3 See the DSL Filter section.
4 See the Pagination appendix.
5 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

text

(Required, String) The text to search.

field

(Required, String) The field where to search.

suggest

(Optional, Object) The suggester to apply. The object should represent a valid suggester according to the Elasticsearch API.

aggregations

(Optional, Map of String/Object) The field with the aggregations to apply. See the Elasticsearch API aggregation documentation for details on the structure of the map.

highlight

(Optional, Object) The highlighter to apply. The object should represent a valid highlighter according to the Elasticsearch API.

filter

(Optional, DSL Filter) The filters to apply.

pageable

(Optional, Pagination) The pagination parameters.

Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [
    {
      "property" : "fieldA",
      "direction" : "ASC"
    },
    {
      "property" : "fieldB",
      "direction" : "DESC"
    }
  ]
}

Each configuration field is defined as follows:

page

(Integer) The page number.

size

(Integer) The size of the page.

sort

(Array of Objects) The sort definitions for the page.

Field definitions
property

(String) The property where the sort was applied.

direction

(String) The direction of the applied sorting. Either ASC or DESC.
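
A sketch of how such a pageable object can translate into Elasticsearch from/size/sort parameters follows. The component's exact translation is an assumption; this only illustrates the usual page-to-offset arithmetic.

```python
# Illustrative translation of the pageable object into Elasticsearch
# from/size/sort parameters (the component's exact mapping is an assumption).

def pageable_to_search_params(pageable):
    page = pageable.get("page", 0)
    size = pageable.get("size", 10)
    sort = [
        {entry["property"]: {"order": entry["direction"].lower()}}
        for entry in pageable.get("sort", [])
    ]
    # Page numbers are 0-based, so the document offset is page * size.
    return {"from": page * size, "size": size, "sort": sort}

params = pageable_to_search_params({
    "page": 2,
    "size": 25,
    "sort": [{"property": "fieldA", "direction": "ASC"}],
})
print(params["from"])  # 50
```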

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}
Action: store

Processor that executes a store request to Elasticsearch.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "store",
    "index": "my-index", (1)
    "document": { (1)
      "field1": "value1"
    },
    "id": "documentID",
    "allowOverride": false
  },
  "server": <Elasticsearch Server ID> (2)
}
1 These configuration fields are required.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to store the document.

document

(Required, Object) The document to be stored.

id

(Optional, String) The ID of the document to be stored. If not provided, it will be autogenerated.

allowOverride

(Optional, Boolean) Whether the document can be overridden or not. Defaults to false.

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}
Action: vector

Processor that executes a script score query using exact kNN.

Configuration example
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Processor Action",
  "config": {
    "action": "vector",
    "index": "my-index", (1)
    "field": "my_vector_field", (1)
    "vector": "#{ data('my/vector') }", (1)
    "minScore": 0.92, (1)
    "maxResults": 5, (1)
    "function": "cosineSimilarity",
    "query": {
      "match_all": {}
    }
  },
  "server": <Elasticsearch Server ID> (2)
}
1 These configuration fields are required.
2 See the Elasticsearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

field

(Required, String) The field with the vector.

vector

(Required, Array of Float) The source vector to compare.

minScore

(Required, Double) The minimum score for results.

maxResults

(Required, Integer) The maximum number of results.

query

(Optional, Object) The query to apply together with the vector search.

function

(Optional, String) The type of function to use. One of cosineSimilarity, dotProduct, l1norm or l2norm. Defaults to cosineSimilarity.
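
For reference, the four functions compute the following, sketched here in plain Python. Elasticsearch evaluates them server-side via the script score query; note that cosineSimilarity and dotProduct are similarities (higher is closer), while l1norm and l2norm are distances (lower is closer).

```python
# The four similarity/distance functions named above, sketched in plain
# Python for reference (Elasticsearch computes them server-side).
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def l1norm(a, b):
    # Manhattan distance: lower means closer.
    return sum(abs(x - y) for x, y in zip(a, b))

def l2norm(a, b):
    # Euclidean distance: lower means closer.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # 1.0
```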

The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:

Action output structure
{
  "elasticsearch": {
    ...
  }
}

Facet Snap

Tries to snap facet values based on a list of tokens extracted from the user query. These facet snaps are returned as a Filter (see Filters DSL) that can be later used as clauses on the query sent to the search engine. This component’s output field is named snap by default.

Action: filter

Processor that creates a filter based on the facet values snapped using the query tokens provided as input.

Configuration example
{
  "type": "snap",
  "name": "My Snap Filter Action",
  "config": {
    "action": "filter",
    "query": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
    "tokens": "#{ data(\"/tokens\") }", (1)
    "facetStore": "facet_test", (1)
    "includeFacets": "#{ data(\"/includeFacets\") }",
    "excludeFacets": "#{ data(\"/excludeFacets\") }",
    "matchAllFacets": false,
    "snapMode": "QUERY",
    "greedyMatch": false,
    "maxDisambiguateOffset": -1
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

tokens

(Required, Array of Strings) The list of tokens to snap to.

query

(Required, String) The search query to use.

facetStore

(Required, String) The Discovery Staging bucket to get the facets from.

Bucket format

The facets stored on the bucket are expected to have the following format:

{
  "name": "name",
  "value": "value",
  "properties": {}
}
name

(Required, String) The name of the facet.

value

(Required, String) The value of the facet.

properties

(Optional, Object) The facet properties. Useful to store additional information for the facet.

snapMode

(Optional, String) The mode to compare facets when snapping.

Details

QUERY: The facets will be matched against the input query text.

TOKENS: The facets will be matched against the input tokens, separated by whitespace. This is useful if you are applying any processing to the tokens.

includeFacets

(Optional, Array of Strings) A list of facets to include when snapping.

excludeFacets

(Optional, Array of Strings) A list of facets to ignore when snapping.

matchAllFacets

(Optional, Boolean) If true, the returned Filter will match all facet fields using and. If false, the returned Filter will match any facet field using or. Defaults to false.

greedyMatch

(Optional, Boolean) If true, snap to the biggest possible facet for each token only, preventing any overlap between matches. If false, snap to every possible facet for each token, allowing overlapped matches. Defaults to false.

maxDisambiguateOffset

(Optional, Integer) The maximum offset size to check when disambiguating. If -1, all available tokens on both sides are checked. Defaults to -1.

Tip
Input tokens for this action can be retrieved using the Tokenizer component.
Tip
For faster query responses from the facet store, create indices for both name and value fields.

The response of the action is stored in the JSON Data Channel. Besides the filter itself, the Snap Filter action also outputs the snapped facet objects and the query ngrams that matched them, for later use as input to other actions.

{
  "snap": {
    "snappedFacets": [
      {
        "facet": { "name": "brand", "value": "nike", "properties": { "code": 123 } },
        "ngram": {
          "value": "nike",
          "offset": { "start": 7, "end": 11 },
          "tokens": [
            { "term": "nike", "offset": { "start": 7, "end": 11 } }
          ]
        }
      },
      {
        "facet": { "name": "size", "value": "7" },
        "ngram": {
          "value": "7",
          "offset": { "start": 5, "end": 6 },
          "tokens": [
            { "term": "size", "offset": { "start": 0, "end": 4 } },
            { "term": "7", "offset": { "start": 5, "end": 6 } }
          ]
        }
      }
    ],
    "filter": {
      "or": [
        { "in": { "field": "size", "values": [ "7" ] } },
        { "in": { "field": "brand", "values": [ "nike" ] } }
      ]
    }
  }
}
Note
The resulting snapped facets are ordered by ngram size, descending. If two ngrams have the same number of tokens they are ordered by appearance.
Action: mask

Processor that creates a masked query based on the snap results of the Snap Filter Action. It replaces facet matches (both name and value) with a given map of facet masks in the input query.

Configuration example
{
  "type": "snap",
  "name": "My Snap Mask Filter Action",
  "config": {
    "action": "mask",
    "query": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
    "snappedFacets": "#{ data(\"/snap/snappedFacets\") }", (1) (2)
    "tokens": "#{ data(\"/tokens\") }", (1)
    "entityMasks": {
      "size": "[SIZE]",
      "brand": "[BRAND]"
    }
  }
}
1 These configuration fields are required.
2 See the snapped facets configuration. Note how this field is using the Expression Language to read from the output of a previously executed Snap Filter Action.

Each configuration field is defined as follows:

tokens

(Required, Array of Strings) The list of tokens to snap to.

query

(Required, String) The search query to use.

snappedFacets

(Required, Array of Objects) The list of facets that matched an ngram value.

Facet object example
[
  {
    "facet": {
      "name": "name",
      "value": "value",
      "properties": {}
    },
    "ngram": {
      "value": "value",
      "offset": { "start": 0, "end": 7 },
      "tokens": [ { "term": "term", "offset": { "start": 0, "end": 7 } }, ... ]
    }
  }
]
facet

(Required, Object) The snapped facet value.

Field definitions
name

(Required, String) The name of the facet.

value

(Required, String) The value of the facet.

properties

(Optional, Object) The facet properties. Useful to store additional information for the facet.

ngram

(Required, Object) The snapped ngram value.

Field definitions
value

(Required, String) The ngram value.

offset

(Required, Object) The ngram query offset.

Field definitions
start

(Required, Integer) The offset start index.

end

(Required, Integer) The offset end index.

tokens

(Required, Array of Objects) The tokens that are part of the ngram.

Field definitions
term

(Required, String) The term for this token.

offset

(Required, Object) The token offset.

Details
start

(Required, Integer) The offset start index.

end

(Required, Integer) The offset end index.

entityMasks

(Optional, Map of String/String) Masks to apply to the given facets.

Field configuration
{
  "size": "[SIZE]",
  "brand": "[BRAND]",
  ...
}

The response of the action is stored in the JSON Data Channel as:

{
  "snap": "[SIZE] [BRAND] sneakers"
}
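
The replacement logic can be sketched as follows. This is an illustration built on the ngram token offsets from the Snap Filter output, not the component's actual implementation.

```python
# Sketch of the mask action: facet matches in the query are replaced by
# their configured masks, using the ngram token offsets from the Snap
# Filter output (illustrative, not the component's actual implementation).

def mask_query(query, snapped_facets, entity_masks):
    # Replace from right to left so earlier offsets stay valid.
    for snap in sorted(snapped_facets,
                       key=lambda s: s["ngram"]["offset"]["start"],
                       reverse=True):
        mask = entity_masks.get(snap["facet"]["name"])
        if mask is None:
            continue
        # The masked span covers every token of the ngram (which may
        # include the facet name itself, e.g. "size 7").
        start = min(t["offset"]["start"] for t in snap["ngram"]["tokens"])
        end = max(t["offset"]["end"] for t in snap["ngram"]["tokens"])
        query = query[:start] + mask + query[end:]
    return query

snapped = [
    {"facet": {"name": "size", "value": "7"},
     "ngram": {"value": "7", "offset": {"start": 5, "end": 6},
               "tokens": [{"term": "size", "offset": {"start": 0, "end": 4}},
                          {"term": "7", "offset": {"start": 5, "end": 6}}]}},
    {"facet": {"name": "brand", "value": "nike"},
     "ngram": {"value": "nike", "offset": {"start": 7, "end": 11},
               "tokens": [{"term": "nike", "offset": {"start": 7, "end": 11}}]}},
]
print(mask_query("size 7 nike sneakers", snapped,
                 {"size": "[SIZE]", "brand": "[BRAND]"}))
# [SIZE] [BRAND] sneakers
```
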
Action: clear

Processor that creates a simplified query based on the snap results of the Snap Filter Action. It removes facet matches (both name and value) from the input tokens and joins the remaining tokens with whitespace.

Configuration example
{
  "type": "snap",
  "name": "My Snap Clear Action",
  "config": { (1)
    "action": "clear",
    "tokens": "#{ data(\"/tokens\") }",
    "snappedFacets": "#{ data(\"/snap/snappedFacets\") }" (2)
  }
}
1 All configuration fields in this action are required.
2 See the snapped facets configuration. Note how this field is using the Expression Language to read from the output of a previously executed Snap Filter Action.

Each configuration field is defined as follows:

tokens

(Required, Array of Strings) The list of tokens to snap to.

snappedFacets

(Required, Array of Objects) The list of facets that matched an ngram value.

Facet object example
[
  {
    "facet": {
      "name": "name",
      "value": "value",
      "properties": {}
    },
    "ngram": {
      "value": "value",
      "offset": { "start": 0, "end": 7 },
      "tokens": [ { "term": "term", "offset": { "start": 0, "end": 7 } }, ... ]
    }
  }
]
facet

(Required, Object) The snapped facet value.

Field definitions
name

(Required, String) The name of the facet.

value

(Required, String) The value of the facet.

properties

(Optional, Object) The facet properties. Useful to store additional information for the facet.

ngram

(Required, Object) The snapped ngram value.

Field definitions
value

(Required, String) The ngram value.

offset

(Required, Object) The ngram query offset.

Field definitions
start

(Required, Integer) The offset start index.

end

(Required, Integer) The offset end index.

tokens

(Required, Array of Objects) The tokens that are part of the ngram.

Field definitions
term

(Required, String) The term for this token.

offset

(Required, Object) The token offset.

Details
start

(Required, Integer) The offset start index.

end

(Required, Integer) The offset end index.

The response of the action is stored in the JSON Data Channel as:

{
  "snap": "sneakers"
}
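
The token removal can be sketched as follows. This is an illustration, not the component's actual implementation; tokens that appear in any snapped ngram are dropped and the remainder is joined with whitespace.

```python
# Sketch of the clear action: tokens covered by a snapped facet are
# dropped, and the remaining tokens are joined with whitespace
# (illustrative, not the component's actual implementation).

def clear_query(tokens, snapped_facets):
    snapped_terms = {
        token["term"]
        for snap in snapped_facets
        for token in snap["ngram"]["tokens"]
    }
    return " ".join(t for t in tokens if t not in snapped_terms)

snapped = [
    {"facet": {"name": "size", "value": "7"},
     "ngram": {"value": "7",
               "tokens": [{"term": "size"}, {"term": "7"}]}},
    {"facet": {"name": "brand", "value": "nike"},
     "ngram": {"value": "nike", "tokens": [{"term": "nike"}]}},
]
print(clear_query(["size", "7", "nike", "sneakers"], snapped))  # sneakers
```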

HTML

Uses Jsoup to parse and process HTML documents.

Action: select

Processor that retrieves elements that match a CSS selector query from an HTML document.

Configuration example
{
  "type": "html",
  "name": "My Select HTML Action",
  "config": {
    "action": "select",
    "text": "#{ data('/text') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "selectors": { (1)
      "mySelector": {
        "selector": "::text:not(:blank)", (1)
        "mode": "NODES"
      }
    }
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

text

(Required, String) The content of the HTML document to be processed, as a plain text string.

Note

The content of HTML files located in the File Service can be retrieved by leveraging the file function of the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set used to encode the content before parsing. If null, determines the charset from the http-equiv meta tag if present, or falls back to UTF-8 if not.

selectors

(Required, Map of String/Object) The set of selector configurations.

Field definitions
selector

(Required, String) The CSS selector query.

mode

(Optional, String) The output format of the selection. Either TEXT, HTML or NODES. Defaults to TEXT.

Note

The NODES mode enables the use of Node Pseudo Selectors. The output for this mode depends on the operator that is used: some operators output text, while others output HTML.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "html": {
    "mySelector": "This is text that was found within a selected element"
  }
}
Action: extract

Processor that extracts and formats tables and description lists from an HTML document.

Configuration example
{
  "type": "html",
  "name": "My Extract HTML Action",
  "config": {
    "action": "extract",
    "text": "#{ data('/text') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "table": {
      "active": true,
      "titles": [
        "caption"
      ]
    },
    "descriptionList": {
      "active": true,
      "titles": [
        "h1"
      ]
    }
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

text

(Required, String) The content of the HTML document to be processed, as a plain text string.

Note

The content of HTML files located in the File Service can be retrieved by leveraging the file function of the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set used to encode the content before parsing. If null, determines the charset from the http-equiv meta tag if present, or falls back to UTF-8 if not.

table

(Optional, Object) The configurations for extracting tables.

Field definitions
active

(Optional, Boolean) Whether the extractor is active. Defaults to true.

titles

(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.

descriptionList

(Optional, Object) The configurations for extracting description lists.

Field definitions
active

(Optional, Boolean) Whether the extractor is active. Defaults to true.

titles

(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "html": {
    "tables": [
      {
        "title": "Table",
        "table": [
          [
            {
              "tag": "header",
              "text": "Header 1"
            },
            {
              "tag": "header",
              "text": "Header 2"
            }
          ],
          [
            {
              "tag": "data",
              "text": "Data 1"
            },
            {
              "tag": "data",
              "text": "Data 2"
            }
          ]
        ]
      }
    ],
    "descriptionLists": [
      {
        "title": "Description List",
        "descriptionList": [
          {
            "term": "Term 1",
            "details": [
              "Detail 1",
              "Detail 2"
            ]
          },
          {
            "term": "Term 2",
            "details": []
          }
        ]
      }
    ]
  }
}
Action: remove

Processor that removes elements that match a CSS selector query from an HTML document and outputs the remaining content.

Configuration example
{
  "type": "html",
  "name": "My Remove HTML Action",
  "config": {
    "action": "remove",
    "text": "#{ data('/text') }", (1)
    "baseUri": "",
    "charset": "UTF-8",
    "selector": "header, footer" (1)
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

text

(Required, String) The content of the HTML document to be processed, as a plain text string.

Note

The content of HTML files located in the File Service can be retrieved by leveraging the file function of the Expression Language.

baseUri

(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".

charset

(Optional, String) The character set used to encode the content before parsing. If null, determines the charset from the http-equiv meta tag if present, or falls back to UTF-8 if not.

selector

(Required, String) The CSS selector query.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "html": "<html>\n <head></head>\n <body>\n  <p>body text that wasn't removed </p>\n </body>\n</html>"
}

Hugging Face

Uses the Hugging Face integration to send requests to the Inference API. Supports multiple actions for different endpoints of the service. This component’s output field is named huggingFace by default.

Action: summarization

Processor that summarizes a single text or multiple texts organized by an autogenerated id. See Summarization Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Summarization Action",
  "config": {
    "action": "summarization",
    "model": "Falconsai/text_summarization", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "parameters": <Parameters Configuration>, (2)
    "options": <Options configuration> (3)
  },
  "server": <Hugging Face Server ID> (4)
}
1 These configuration fields are required.
2 See the parameters configuration.
3 See the options configuration.
4 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of Strings) The list of texts to summarize.

parameters

(Optional, Object) The parameters for the request.

Details
Configuration example
{
  "minLength": 10,
  "maxLength": 10,
  "topK": 5,
  "topP": null,
  "temperature": 1,
  "repetitionPenalty": 0.1,
  "maxTime": 0.1
}

Each configuration field is defined as follows:

minLength

(Optional, Integer) The minimum length of the output tokens.

maxLength

(Optional, Integer) The maximum length of the output tokens.

topK

(Optional, Integer) The number of top-probability tokens to consider when generating new text.

topP

(Optional, Float) The cumulative probability mass of tokens to consider during sampling (nucleus sampling).

temperature

(Optional, Float) The temperature of the sampling operation. Defaults to 1.0.

repetitionPenalty

(Optional, Float) The repetition penalty for the request.

maxTime

(Optional, Float) The maximum time, in seconds, that the request should take.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace": "Your summarized text"
}
Multiple text inputs response
{
  "huggingFace": [
    "Your summarized text for your first input",
    "Your summarized text for your second input",
    ...
  ]
}
Note
Please note that the order of the responses corresponds to the order of the inputs.
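The configuration above maps naturally onto the request body that the Hugging Face Inference API expects. The sketch below is illustrative only: the `inputs`/`parameters`/`options` field names follow the public Inference API documentation, but how Discovery assembles the request internally is an assumption.

```python
# Sketch of the JSON body a summarization call to the Hugging Face
# Inference API carries. Parameter names inside "parameters"/"options"
# use the API's snake_case convention.

def build_summarization_request(inputs, parameters=None, options=None):
    """Build the request body for a Hugging Face Inference API call."""
    body = {"inputs": inputs}
    if parameters:
        # e.g. {"min_length": 10, "max_length": 100, "temperature": 1.0}
        body["parameters"] = parameters
    if options:
        # e.g. {"use_cache": True, "wait_for_model": False}
        body["options"] = options
    return body

request = build_summarization_request(
    ["First text to summarize.", "Second text to summarize."],
    parameters={"min_length": 10, "max_length": 100},
    options={"use_cache": True},
)
```

The same body shape applies to the other Hugging Face actions in this section, with only the `parameters` contents changing per task.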
Action: text-generation

Processor that continues the text from a prompt. See Text Generation Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Text Generation Action",
  "config": {
    "action": "text-generation",
    "model": "gpt2-large", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "parameters": <Parameters Configuration>, (2)
    "options": <Options configuration> (3)
  },
  "server": <Hugging Face Server ID> (4)
}
1 These configuration fields are required.
2 See the parameters configuration.
3 See the options configuration.
4 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, String) The prompt from which to generate the response.

parameters

(Optional, Object) The parameters for the request.

Details
Configuration example
{
  "topK": null,
  "topP": null,
  "temperature": 1,
  "repetitionPenalty": null,
  "maxNewTokens": 20,
  "maxTime": 5,
  "returnFullText": true,
  "numReturnSequences": 1,
  "doSample": true
}

Each configuration field is defined as follows:

maxNewTokens

(Optional, Integer) The maximum number of new tokens to generate.

returnFullText

(Optional, Boolean) Whether to include the input text within the answer or not. Defaults to true.

numReturnSequences

(Optional, Integer) The number of generated sequences (propositions) to return.

doSample

(Optional, Boolean) Whether to use sampling or not. Use greedy decoding otherwise. Defaults to true.

topK

(Optional, Integer) The number of top-probability tokens to consider when generating new text.

topP

(Optional, Float) The cumulative probability mass of tokens to consider during sampling (nucleus sampling).

temperature

(Optional, Float) The temperature of the sampling operation. Defaults to 1.0.

repetitionPenalty

(Optional, Float) The repetition penalty for the request.

maxTime

(Optional, Float) The maximum time, in seconds, that the request should take.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Action output example
{
  "huggingFace": "My autogenerated text"
}
Action: feature-extraction

Processor that extracts a matrix of numerical features from a single text or a list of texts. See Feature Extraction Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Feature Extraction Action",
  "config": {
    "action": "feature-extraction",
    "model": "facebook/bart-base", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "options": <Options configuration> (2)
  },
  "server": <Hugging Face Server ID> (3)
}
1 These configuration fields are required.
2 See the options configuration.
3 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of String) The list of texts from which to extract numerical features.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace":
     [[ 2.2187119  , 2.7539337  , 1.0330348  , ... ],
      [ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
      [ 0.09872855 , 0.53532976 , 0.7232368  , ... ]]
}
Multiple text inputs response
{
  "huggingFace":
  [
    [
      [ 2.2187119  , 2.7539337  , 1.0330348  , ... ],
      [ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
      ...
    ],
    [
      [ 2.821799  , 2.7055995   , 1.1408421  , ... ],
      [ 1.4287674 , 0.39487326  , -3.7841866 , ... ],
      ...
    ]
  ]
}
Note
Please note that the order of the responses corresponds to the order of the inputs.
Action: fill-mask

Processor that replaces a missing word in a sentence with multiple fitting possibilities. The name of the [MASK] token to be replaced is defined by the chosen model. See Fill Mask Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Fill Mask Action",
  "config": {
    "action": "fill-mask",
    "model": "distilroberta-base", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "options": <Options configuration> (2)
  },
  "server": <Hugging Face Server ID> (3)
}
1 These configuration fields are required.
2 See the options configuration.
3 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of String) The list of texts whose masks will be filled.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace": [
    {
      "sequence": "Paris is the capital of france",
      "score": 0.2705707,
      "token": 812,
      "tokenStr": " capital"
    },
    ...
  ]
}
Multiple text inputs response
{
  "huggingFace": [
    [
      {
        "sequence": "Paris is the capital of france",
        "score": 0.2705707,
        "token": 812,
        "tokenStr": " capital"
      },
      ...
    ],
    [
      {
        "sequence": "The Eiffle tower is one of the main tourist spots in Paris.",
        "score": 0.9013709,
        "token": 8376,
        "tokenStr": " tourist"
      },
      ...
    ],
    ...
  ]
}
Action: text-classification

Processor that classifies a text into a group of labels and provides a score for each label. These labels are determined by the model that is used. See Text Classification Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Text Classification Action",
  "config": {
    "action": "text-classification",
    "model": "distilbert-base-uncased-finetuned-sst-2-english", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "options": <Options configuration> (2)
  },
  "server": <Hugging Face Server ID> (3)
}
1 These configuration fields are required.
2 See the options configuration.
3 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of String) The list of texts to classify.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace": [
    {"label": "LabelA", "score": 0.9998608827590942},
    ...
  ]
}
Multiple text inputs response
{
  "huggingFace": [
    [
      {
        "label": "LabelA",
        "score": 0.9998608827590942
      },
      ...
    ],
    [
      {
        "label": "LabelC",
        "score": 0.9968926310539246
      },
      ...
    ],
    ...
  ]
}
Note
Please note that the order of the responses corresponds to the order of the inputs.
Action: zero-shot-classification

Processor that classifies a text into a group of labels without having seen any training examples for those labels, and provides a score for each label. See Zero Shot Classification Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Zero Shot Classification Action",
  "config": {
    "action": "zero-shot-classification",
    "model": "facebook/bart-large-mnli", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "parameters": <Parameters Configuration>, (2)
    "options": <Options configuration> (3)
  },
  "server": <Hugging Face Server ID> (4)
}
1 These configuration fields are required.
2 See the parameters configuration.
3 See the options configuration.
4 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of String) The list of texts to classify.

parameters

(Optional, Object) The parameters for the request.

Details
Configuration example
{
  "candidateLabels": [
    "labelA",
    "labelB"
  ],
  "multiLabel": false
}

Each configuration field is defined as follows:

candidateLabels

(Required, Array of String) The list of possible labels to classify the input.

multiLabel

(Optional, Boolean) Whether classes can overlap or not. Defaults to false.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace": [
    {
      "label": "labelA",
      "score": 0.9998608827590942
    },
    ...
  ]
}
Multiple text inputs response
{
  "huggingFace": [
    [
      {
        "label": "labelA",
        "score": 0.9998608827590942
      },
      ...
    ],
    [
      {
        "label": "labelA",
        "score": 0.9968926310539246
      },
      ...
    ],
    ...
  ]
}
Note
Please note that the order of the responses corresponds to the order of the inputs.
Action: token-classification

Processor that assigns a label to each token of a single text or a list of texts. See Token Classification Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Token Classification Action",
  "config": {
    "action": "token-classification",
    "model": "dslim/bert-base-NER", (1)
    "input": "#{ data(\"/httpRequest/body/input\") }", (1)
    "parameters": {
      "aggregationStrategy": "SIMPLE"
    },
    "options": <Options configuration> (2)
  },
  "server": <Hugging Face Server ID> (3)
}
1 These configuration fields are required.
2 See the options configuration.
3 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Array of String) The list of texts whose tokens will be classified.

parameters

(Optional, Object) The parameters for the request. Should contain a single aggregationStrategy field with a string value for the aggregation strategy to use in the request. Supported aggregation strategy values are:

Type definitions

NONE: Every token gets classified without further aggregation.

SIMPLE: Entities are grouped according to the default schema (B-, I- tags get merged when the tag is similar).

FIRST: Same as the SIMPLE strategy except words cannot end up with different tags. Words will use the tag of the first token when there is ambiguity.

AVERAGE: Same as the SIMPLE strategy except words cannot end up with different tags. Scores are averaged across tokens and then the maximum label is applied.

MAX: Same as the SIMPLE strategy, except words cannot end up with different tags. The word entity will be the token with the maximum score.
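To illustrate how a grouping strategy like SIMPLE works, the sketch below merges contiguous B-/I- tagged tokens of the same entity type into one entity group. The token format and scores are made up, and taking the minimum score over a group is a simplification; the real merging is done by the Hugging Face pipeline.

```python
# Illustrative sketch of B-/I- tag aggregation (SIMPLE-style strategy):
# contiguous tokens of the same entity type are merged into one group.

def aggregate_simple(tokens):
    entities = []
    for tok in tokens:
        tag = tok["tag"]  # e.g. "B-PER", "I-PER", "O"
        if tag == "O":
            continue  # non-entity tokens are dropped
        prefix, _, group = tag.partition("-")
        if prefix == "I" and entities and entities[-1]["entityGroup"] == group:
            # continuation token: extend the previous entity
            prev = entities[-1]
            prev["word"] += " " + tok["word"]
            prev["end"] = tok["end"]
            prev["score"] = min(prev["score"], tok["score"])  # simplification
        else:
            entities.append({"word": tok["word"], "start": tok["start"],
                             "end": tok["end"], "entityGroup": group,
                             "score": tok["score"]})
    return entities

tokens = [
    {"word": "George", "tag": "B-PER", "start": 0, "end": 6, "score": 0.99},
    {"word": "Washington", "tag": "I-PER", "start": 7, "end": 17, "score": 0.98},
    {"word": "slept", "tag": "O", "start": 18, "end": 23, "score": 0.99},
]
```

Running `aggregate_simple(tokens)` yields a single `PER` entity spanning "George Washington", matching the shape of the action output shown below.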

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Single text input response
{
  "huggingFace": [
    {
      "score": 0.9990085,
      "word": "Omar",
      "start": 11,
      "end": 15,
      "entityGroup": "PER"
    },
    ...
  ]
}
Multiple text inputs response
{
  "huggingFace": [
    [
      {
        "score": 0.9990085,
        "word": "Omar",
        "start": 11,
        "end": 15,
        "entityGroup": "PER"
      },
      ...
    ],
    [
      {
        "score": 0.9949533,
        "word": "George Washington",
        "start": 0,
        "end": 17,
        "entityGroup": "PER"
      },
      ...
    ],
    ...
  ]
}
Note
Please note that the order of the responses corresponds to the order of the inputs.
Action: question-answering

Processor that answers a question based on given contexts. See Question Answering Task.

Configuration example
{
  "type": "hugging-face",
  "name": "My Question Answering Action",
  "config": {
    "action": "question-answering",
    "model": "deepset/roberta-base-squad2", (1)
    "input": <Input values>, // <1> (2)
    "options": <Options configuration> (3)
  },
  "server": <Hugging Face Server ID> (4)
}
1 These configuration fields are required.
2 See the input definition.
3 See the options configuration.
4 See the Hugging Face integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request.

input

(Required, Object) The input for the request.

Details
Configuration example
{
  "question": "#{ data(\"/httpRequest/body/question\") }",
  "context": [
    "#{ data(\"/httpRequest/body/context/0\") }",
    "#{ data(\"/httpRequest/body/context/1\") }",
  ]
}

Each configuration field is defined as follows:

question

(Required, String) The question to be answered by the model.

context

(Required, Array of String) The list of contexts used to answer the question.

minScore

(Optional, Float) The minimum score for each answer.

options

(Optional, Object) The request options.

Details
Configuration example
{
  "useCache": true,
  "waitForModel": false
}

Each configuration field is defined as follows:

useCache

(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.

waitForModel

(Optional, Boolean) Whether to wait until the model is ready. If false and the model is not ready, the response will be 503 - Service Unavailable.

The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:

Action output example
{
  "huggingFace": [
    {
      "answer": "Clara",
      "score": 0.8979613184928894,
      "start": 11,
      "end": 16
    },
    {
      "answer": "Los Angeles",
      "score": 0.013939359225332737,
      "start": 20,
      "end": 31
    },
    ...
  ]
}
Note
Please note that the responses are sorted in descending order according to their score value.
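The post-processing described above (an optional minScore threshold followed by descending sort on score) can be sketched as follows. The answer objects mirror the action output example; how the action performs this internally is an assumption.

```python
# Sketch: drop answers below the optional minScore threshold, then
# sort the survivors by descending score.

def filter_and_rank(answers, min_score=None):
    if min_score is not None:
        answers = [a for a in answers if a["score"] >= min_score]
    return sorted(answers, key=lambda a: a["score"], reverse=True)

answers = [
    {"answer": "Los Angeles", "score": 0.0139, "start": 20, "end": 31},
    {"answer": "Clara", "score": 0.8979, "start": 11, "end": 16},
]
ranked = filter_and_rank(answers, min_score=0.5)
```

With `min_score=0.5`, only the "Clara" answer survives; without it, both answers are returned, highest score first.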

Language Detector

The Language Detector component uses Lingua to identify the language from a specified text input. The languages are referenced using ISO-639-1 (alpha-2 code). This component’s output field is named language by default.

Note

Each time a language model is referenced, it will be loaded in memory. Loading too many languages increases the risk of high memory consumption issues.

Processor Action: process

Processor that detects the language of a provided text.

Configuration example
{
  "type": "language-detector",
  "name": "My Language Detector Processor Action",
  "config": {
    "action": "process",
    "text": {    (1)
      "inputA": "#{ data('/httpRequest/body/custom/fieldA') }",
      "inputB": "#{ data('/httpRequest/body/custom/fieldB') }"
    },
    "minDistance": 0.5,
    "supportedLanguages": [
      "en",
      "es"
    ],
    "defaultLanguage": "it"
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

text

(Required, String or Object) The text to be evaluated. It can be either a String with a single input, or a Map for multi-input processing.

defaultLanguage

(Optional, String) Default language to select in case no other is detected. Defaults to en.

minDistance

(Optional, Double) Distance between the input and the language model. Defaults to 0.0.

supportedLanguages

(Optional, Array of Strings) List of languages supported by the detector. At least 2 supported languages must be set. Defaults to [ "en", "es" ].

The response of the action is stored in the JSON Data Channel as returned by the Lingua engine:

Single text input response
{
  "language": "en"
}
Multiple text inputs response
{
  "language": {
    "inputA": "en",
    "inputB": "es"
  }
}
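The interplay of minDistance, supportedLanguages, and defaultLanguage can be pictured with the sketch below. The confidence values are made up and the selection logic is illustrative only; the real values and decision come from the Lingua engine.

```python
# Illustrative sketch: Lingua produces per-language confidence values.
# If the top language's lead over the runner-up is below minDistance,
# the defaultLanguage is selected instead.

def pick_language(confidences, supported, min_distance=0.0, default="en"):
    scores = sorted(
        ((lang, score) for lang, score in confidences.items() if lang in supported),
        key=lambda pair: pair[1], reverse=True,
    )
    if not scores:
        return default
    top_lang, top_score = scores[0]
    runner_up = scores[1][1] if len(scores) > 1 else 0.0
    return top_lang if top_score - runner_up >= min_distance else default

detected = pick_language({"en": 0.93, "es": 0.21}, ["en", "es"],
                         min_distance=0.5, default="it")
```

Here the lead of "en" over "es" (0.72) clears the 0.5 minDistance, so "en" is returned; a narrower margin would have fallen back to the default "it".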

MongoDB

Uses the MongoDB integration to send requests to the MongoDB server. This component’s output field is named mongo by default.

Action: aggregate

Processor that runs a configured aggregation pipeline on a MongoDB database.

Configuration example
{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": { (1)
    "action": "aggregate",
    "database": "my-database",
    "collection": "my-collection",
    "stages": [ (2)
      {
        "$count": "total"
      }
    ]
  },
  "server": <MongoDB Server ID> (3)
}
1 All configuration fields in this action are required.
2 The exact expected structure of these objects is defined by MongoDB. This is just an example at the time of writing.
3 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database name.

collection

(Required, String) The collection name.

stages

(Required, Array of Objects) List of aggregation stages. Each object in the array should represent a MongoDB aggregation stage.

The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:

Action output structure
{
  "mongo": {
    ...
  }
}
Action: autocomplete

Processor that uses the autocomplete operator in a compound must clause; filters are applied in the filter clause.

Configuration example
{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "autocomplete",
    "database": "my-database", (1)
    "collection": "my-collection", (1)
    "index": "my-index", (1)
    "path": "my-field", (1)
    "queries": [ (1)
      "A "
    ],
    "tokenOrder": "ANY",
    "filter": <DSL Filter> (2)
  },
  "server": <MongoDB Server ID> (3)
}
1 These configuration fields are required.
2 See the DSL Filter section.
3 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database name.

collection

(Required, String) The collection name.

index

(Required, String) The name for the MongoDB full-text search index.

path

(Required, String) The indexed field to search.

queries

(Required, Array of Strings) The phrase or phrases to autocomplete.

tokenOrder

(Optional, String) The order in which the tokens will be searched. Either ANY or SEQUENTIAL.

filter

(Optional, DSL Filter) The filter to apply.

The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:

Action output structure
{
  "mongo": {
    ...
  }
}
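The $search stage this action presumably builds can be sketched as below. The operator and field names (`autocomplete`, `compound`, `tokenOrder`) follow MongoDB's Atlas Search documentation; the exact pipeline Discovery generates internally is an assumption.

```python
# Hedged sketch of the Atlas Search stage behind the autocomplete action:
# an autocomplete operator per query inside a compound "must" clause,
# with the DSL filter translated into the "filter" clause.

def build_autocomplete_stage(index, path, queries, token_order="ANY", filters=None):
    must = [
        {"autocomplete": {"query": q, "path": path,
                          "tokenOrder": token_order.lower()}}
        for q in queries
    ]
    compound = {"must": must}
    if filters:
        compound["filter"] = filters
    return {"$search": {"index": index, "compound": compound}}

stage = build_autocomplete_stage("my-index", "my-field", ["A "])
```

The resulting document is the first stage of an aggregation pipeline run against the configured collection.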
Action: search

Processor that uses the text operator in a compound must clause; filters are applied in the filter clause.

Configuration example
{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "search",
    "database": "my-database", (1)
    "collection": "my-collection", (1)
    "index": "my-index", (1)
    "paths": [ (1)
      "name",
      "description"
    ],
    "queries": [
      "What is my name?",
      "What is my description?"
    ],
    "filter": <DSL Filter>, (2)
    "pageable": <Pagination Parameters>, (3)
  },
  "server": <MongoDB Server ID> (4)
}
1 These configuration fields are required.
2 See the DSL Filter section.
3 See the Pagination appendix.
4 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database name.

collection

(Required, String) The collection name.

index

(Required, String) The name for the MongoDB full-text search index.

paths

(Required, String or Array of Strings) The paths of the fields to search. Can be configured as a single String if there’s only one path.

queries

(Required, String or Array of Strings) The phrases to search in the field. Can be configured as a single String if there’s only one phrase.

pageable

(Optional, Object) The pagination object.

Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [ // (1)
    {
      "property" : "fieldA",
      "direction" : "ASC"
    },
    {
      "property" : "fieldB",
      "direction" : "DESC"
    }
  ]
}

Each configuration field is defined as follows:

page

(Integer) The page number.

size

(Integer) The size of the page.

sort

(Array of Objects) The sort definitions for the page.

Field definitions
property

(String) The property where the sort was applied.

direction

(String) The direction of the applied sorting. Either ASC or DESC.

filter

(Optional, DSL Filter) The filter to apply.
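How a pageable object shapes the result set can be sketched in a few lines: a multi-key sort (ASC/DESC per property) followed by a page-sized slice. This is purely illustrative; MongoDB applies the pagination server-side.

```python
# Sketch of applying a pageable object to a list of result documents.
from functools import cmp_to_key

def apply_pageable(rows, pageable):
    sort_defs = pageable.get("sort", [])

    def compare(a, b):
        # Compare by each sort rule in order, honoring its direction.
        for rule in sort_defs:
            prop, direction = rule["property"], rule["direction"]
            if a[prop] != b[prop]:
                order = -1 if a[prop] < b[prop] else 1
                return order if direction == "ASC" else -order
        return 0

    ordered = sorted(rows, key=cmp_to_key(compare))
    start = pageable["page"] * pageable["size"]
    return ordered[start:start + pageable["size"]]

rows = [{"fieldA": 2}, {"fieldA": 1}, {"fieldA": 3}]
page = apply_pageable(rows, {"page": 0, "size": 2,
                             "sort": [{"property": "fieldA", "direction": "ASC"}]})
```

With page 0 and size 2, the two lowest `fieldA` values are returned; page 1 would return the remaining document.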

The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:

Action output structure
{
  "mongo": {
    ...
  }
}
Action: vector

Processor that uses the vectorSearch MongoDB operator, and its filter field if provided, to perform ANN vector search on a vector-type field indexed in an Atlas Vector Search index. It also applies a minimum vector search score filter to the resulting documents.

Note

The vector search operator’s ENN search capabilities aren’t currently supported by this action. However, they may still be used in QueryFlow via the aggregate action of this component.

Configuration example
{
  "type": "mongo",
  "name": "My MongoDB Processor Action",
  "config": {
    "action": "vector",
    "database": "my-database", (1)
    "collection": "my-collection", (1)
    "index": "my-index", (1)
    "queryVector": "#{ data('my/vector') }", (1)
    "path": "my-field", (1)
    "limit": 10, (1)
    "minScore": 0.7, (1)
    "numCandidates": 200, (1)
    "filter": { (2)
      "$and": [
        {"year": {  "$gt": 1955  }},
        {"year": {  "$lt": 1975 }}
        ]
      }
  },
  "server": <MongoDB Server ID> (3)
}
1 These configuration fields are required.
2 The exact expected structure of the filter is defined by MongoDB, this is just an example at the time of writing.
3 See the MongoDB integration section.

Each configuration field is defined as follows:

database

(Required, String) The database name.

collection

(Required, String) The collection name.

index

(Required, String) The name of the Atlas Vector Search index.

queryVector

(Required, Array of Float) The vector used as query in the search.

path

(Required, String) The path to search for the vector in the documents.

numCandidates

(Required, Integer) The number of nearest neighbors to use during an ANN search. This config is ignored if the exact config value is set to true. Must be greater than or equal to limit.

limit

(Required, Integer) The number of documents to return in the vector search result. The minScore filter is then applied to those results.

minScore

(Required, Double) The minimum score for results of the vector search.

filter

(Optional, Object) The search operator filter to apply in the query. This object should represent a valid Atlas Vector Search Pre-filter.

The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:

Action output structure
{
  "mongo": {
    ...
  }
}
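The pipeline this action presumably runs can be sketched as a $vectorSearch stage followed by a score filter. The `$vectorSearch` field names and the `vectorSearchScore` metadata key follow MongoDB's documentation; how Discovery constructs and applies the minScore filter internally is an assumption.

```python
# Hedged sketch of the aggregation pipeline behind the vector action:
# a $vectorSearch stage, then a $match on the search score for minScore.

def build_vector_pipeline(index, path, query_vector, num_candidates,
                          limit, min_score, pre_filter=None):
    stage = {
        "index": index,
        "path": path,
        "queryVector": query_vector,
        "numCandidates": num_candidates,
        "limit": limit,
    }
    if pre_filter:
        stage["filter"] = pre_filter  # Atlas Vector Search pre-filter
    return [
        {"$vectorSearch": stage},
        {"$addFields": {"score": {"$meta": "vectorSearchScore"}}},
        {"$match": {"score": {"$gte": min_score}}},
    ]

pipeline = build_vector_pipeline("my-index", "my-field", [0.1, 0.2, 0.3],
                                 num_candidates=200, limit=10, min_score=0.7)
```

Note the ordering: the ANN search first narrows to `limit` documents, and only then is the minScore threshold enforced, matching the field descriptions above.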

Neo4j

Executes a read query to a Neo4j server to gather search results from it. This component’s output field is named neo4j by default.

Processor Action: process

Processor that executes a query to a Neo4j server.

Configuration example
{
  "type": "neo4j",
  "name": "My Neo4j Processor Action",
  "config": { (1)
    "action": "process",
    "database": "neo4j",
    "query": "MATCH (p:Person {name: $NAME}) RETURN p",
    "parameters": {
      "NAME": "Adam"
    }
  },
  "server": <Neo4j Server ID>, (2)
}
1 All configuration fields in this action are required.
2 See the Neo4j integration section.

Each configuration field is defined as follows:

database

(Required, String) The Neo4j database to query.

query

(Required, String) The query to be executed.

parameters

(Required, Map of String/Object) Parameters to be used in the query. See Neo4J’s parameters section for details on how to use this configuration.

The response of the action is stored in the JSON Data Channel as returned by the Neo4j server:

Action output structure
{
  "neo4j": {
    ...
  }
}

OpenAI

Uses the OpenAI integration to send requests to the OpenAI API. Supports multiple actions for different endpoints of the service. Additionally, supports text trimming based on OpenAI models' tokenizing and token limits, by integrating the tiktoken library. This component’s non-streamed output field is named openai by default.

Action: chat-completion

Processor that executes a chat completion request to the OpenAI API.

Configuration example
{
  "type": "openai",
  "name": "My Chat Completion Action",
  "config": {
    "action": "chat-completion",
    "model": "gpt-4", (1)
    "messages": [ (1) (2)
      {"role": "system", "content": "You are a helpful assistant" },
      {"role": "user", "content": "Hi!" },
      {"role": "assistant", "content": "Hi, how can assist you today?" }
    ],
    "promptCacheKey": "pureinsights",
    "frequencyPenalty": 0.0,
    "presencePenalty": 0.0,
    "temperature": 1,
    "topP": 1,
    "n": 1,
    "maxTokens": 2048,
    "stop": [],
    "stream": false,
    "responseFormat": <Response format configuration> (3)
  },
  "server": <OpenAI Server ID>, (4)
}
1 These configuration fields are required.
2 See the messages configuration definition.
3 See the response format configuration.
4 See the OpenAI integration section.

Each configuration field is defined as follows:

model

(Required, String) The OpenAI model to use.

messages

(Required, Array of Objects) The list of messages for the request.

Field definitions
role

(Required, String) The role of the message. Must be one of system, user or assistant.

content

(Required, String) The content of the message.

name

(Optional, String) The name of the author of the message.

promptCacheKey

(Optional, String) Value used by OpenAI to cache responses for similar requests to optimize the cache hit rates.

frequencyPenalty

(Optional, Double) Positive values penalize new tokens based on their existing frequency in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.

presencePenalty

(Optional, Double) Positive values penalize new tokens based on whether they appear in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.

temperature

(Optional, Double) Sampling temperature to use. Value must be between 0 and 2. Defaults to 1.

Note
It’s generally recommended to alter either this or the topP field, but not both.
topP

(Optional, Double) An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. Defaults to 1.

Note
It’s generally recommended to alter either this or the temperature field, but not both.
n

(Optional, Integer) How many chat completion choices to generate for each input message. Defaults to 1.

maxTokens

(Optional, Integer) The maximum number of tokens to generate in the chat completion. Defaults to 2048.

stop

(Optional, Array of String) Up to 4 sequences where the API will stop generating further tokens.

stream

(Optional, Boolean) Whether to enable streaming or not. Defaults to false.

responseFormat

(Optional, Object) An object specifying the format that the model must output. Learn more in OpenAI’s Structured Outputs guide.

Details
Configuration example
{
  "type": "json_schema", (1)
  "json_schema": { (2)
    "name":   "a name for the schema",
    "strict": true,
    "schema": { (3)
      "type": "object",
      "properties": {
        "equation": { "type": "string" },
        "answer":   { "type": "string" }
      },
      "required": ["equation", "answer"],
      "additionalProperties": false
    }
  }
}
1 The response type is always required.
2 JSON schemas can only be used and are required with json_schema types of response formats.
3 The exact expected structure of the schema object is defined by the OpenAI API, this is just an example at the time of writing.

Each configuration field is defined as follows:

type

(Required, String) The type of response format being defined. Allowed values: text, json_schema and json_object.

json_schema

(Optional, Object) Structured Outputs configuration options, including a JSON Schema. This field can only be used and is in fact required with response formats of the json_schema type. See OpenAI’s response formats definitions for more details.

Field definitions
name

(Required, String) The name of the response format. Must contain only a-z, A-Z, 0-9, underscores and dashes, with a maximum length of 64.

description

(Optional, String) A description of what the response format is for, used by the model to determine how to respond in the format.

schema

(Optional, Object) The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.

strict

(Optional, Boolean) Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Defaults to false.
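As a concrete illustration of what the example schema above enforces, here is a minimal check in Python. This is a sketch only; neither OpenAI nor Discovery validates schemas this way, and real JSON Schema validation covers far more than required fields and additionalProperties:

```python
def conforms(doc):
    """Minimal check for the example schema: both required string fields
    must be present, and additionalProperties: false forbids anything else."""
    required = {"equation", "answer"}
    return (
        isinstance(doc, dict)
        and set(doc) == required
        and all(isinstance(doc[k], str) for k in required)
    )
```

With strict set to true, the model's output is guaranteed to satisfy such constraints; with false, the schema acts as a best-effort guide.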

The response of the action is stored in the JSON Data Channel as returned by the OpenAI API:

Streaming disabled response
{
  "openai": {
    "created": <Timestamp>,
    "choices": [
      {
        "index": 0,
        "message": {
          "role": "assistant",
          "content": "The response from the model"
        },
        "finishReason": "stop"
      }
    ],
    "model": "gpt-4-0613",
    "usage": {
      "promptTokens": 34,
      "completionTokens": 95,
      "totalTokens": 129
    }
  }
}
Streaming enabled response
[
  {
    "name": "openai",
    "data": "The"
  },
  {
    "name": "openai",
    "data": " response"
  },
  {
    "name": "openai",
    "data": " from"
  },
  {
    "name": "openai",
    "data": " the"
  },
  {
    "name": "openai",
    "data": " model"
  },
  ...
]
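With streaming enabled, the completion arrives as an array of chunks under the component's output name, and the full text can be recovered by concatenating the data fields in order. A small Python sketch, using the chunks shown above (with the trailing ones omitted):

```python
# Streamed chunks as produced in the example above (trailing "..." omitted).
chunks = [
    {"name": "openai", "data": "The"},
    {"name": "openai", "data": " response"},
    {"name": "openai", "data": " from"},
    {"name": "openai", "data": " the"},
    {"name": "openai", "data": " model"},
]

# Joining the data fields in order reconstructs the complete completion text.
full_text = "".join(c["data"] for c in chunks if c["name"] == "openai")
```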
Action: embeddings

Processor that executes an embeddings request to the OpenAI API.

Configuration example
{
  "type": "openai",
  "name": "My OpenAI Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "text-embedding-ada-002", (1)
    "input": ["Sample text 1", "Sample text 2"], (1)
    "user": "pureinsights"
  },
  "server": <OpenAI Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenAI integration section.

Each configuration field is defined as follows:

model

(Required, String) The OpenAI model to use.

input

(Required, Array of Strings) The list of input texts to be processed.

user

(Optional, String) The unique identifier representing the end-user.

The response of the action is stored in the JSON Data Channel as returned by the OpenAI API:

Action output example
{
  "openai": {
    "embeddings": [
      {
        "embedding": [ -0.006929283495992422, -0.005336422007530928, ... ],
        "index": 0
      },
      {
        "embedding": [ -0.024047505110502243, -0.006929283495992422, ... ],
        "index": 1
      }
    ],
    "model": "text-embedding-ada-002-v2",
    "usage": {
      "promptTokens": 4,
      "totalTokens": 4
    }
  }
}
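A typical next step after the embeddings action is comparing the returned vectors, for example by cosine similarity. A self-contained sketch; the field names follow the output example above, and the similarity metric is the caller's choice, not something the action computes:

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two equal-length embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# e.g. given the action output:
# score = cosine_similarity(
#     response["openai"]["embeddings"][0]["embedding"],
#     response["openai"]["embeddings"][1]["embedding"],
# )
```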
Action: trim

Processor that trims a given text based on an OpenAI model’s tokenizing and either its own token limit, or a custom one.

Configuration example
{
  "type": "openai",
  "name": "My OpenAI Trim Action",
  "config": {
    "action": "trim",
    "text": "#{ data(\"/text\") }", (1)
    "model": "gpt-5.2", (2)
    "tokenLimit": 5 (2)
  }
}
1 This configuration field is required.
2 At least one of these configuration fields is required.

Each configuration field is defined as follows:

text

(Required, String) The text to trim.

model

(Optional, String) The OpenAI model whose encoding is used to tokenize the text and whose token limit determines whether to truncate the text. If a custom token limit is defined, it’ll override the model’s. If no model is provided, a default o200k_base encoding will be used.

Note
To determine the encoding and token limit of models used in chat completion requests, the processor only takes the model "version" into account when trimming. Here, "version" means the ChatGPT version used, such as gpt-5.2, gpt-4.1 or o4. A model field configured as gpt-5.2-2025-12-11 will therefore be treated as the gpt-5.2 version, with the rest of the name ignored. Consequently, non-existent models such as o4-mini-thismodeldoesntexist are considered valid for this action as long as the model version can be inferred.
tokenLimit

(Optional, Integer) The positive integer used as token limit when determining whether to truncate the encoded text or not. If defined, it’ll override the provided model’s token limit, if any.

The response of the action is stored in the JSON Data Channel:

Action output example
{
  "openai": {
    "text": "The brown fox jumps over",
    "size": 24,
    "tokens": 5,
    "truncated": true,
    "remainder": [
      " the",
      " lazy",
      " dog"
    ]
  }
}
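The shape of this output can be illustrated with a simplified stand-in tokenizer. Discovery uses the model's actual BPE encoding (for example o200k_base), so real token boundaries differ; the sketch below splits on words purely to mimic the output structure:

```python
import re

def trim_text(text, token_limit):
    # Stand-in tokenizer: each word keeps its leading whitespace, loosely
    # mimicking BPE tokens such as " lazy". Not the real encoding.
    tokens = re.findall(r"\s*\S+", text)
    kept, remainder = tokens[:token_limit], tokens[token_limit:]
    trimmed = "".join(kept)
    return {
        "text": trimmed,
        "size": len(trimmed),
        "tokens": len(kept),
        "truncated": bool(remainder),
        "remainder": remainder,
    }
```

With this stand-in, trim_text("The brown fox jumps over the lazy dog", 5) reproduces the example output above.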

OpenSearch

Uses the OpenSearch integration to send requests to the OpenSearch API. It supports multiple actions for common operations such as search, but also provides a mechanism to send raw OpenSearch queries. This component’s output field is named opensearch by default.

Action: autocomplete

Processor that executes a completion suggester query.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "autocomplete",
    "index": "my-index", (1)
    "text": "#{ data('my/query') }", (1)
    "field": "content", (1)
    "size": 3,
    "skipDuplicates": true
  },
  "server": <OpenSearch Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

text

(Required, String) The text to autocomplete.

field

(Required, String) The field where to search.

skipDuplicates

(Optional, Boolean) Whether to skip duplicate suggestions.

size

(Optional, Integer) The number of suggestions to return.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
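How the action builds the underlying request is internal to Discovery, but a plausible mapping of the configuration onto the public OpenSearch completion suggester body is sketched below (the suggester name autocomplete is an arbitrary choice for the sketch):

```python
def completion_suggest_body(text, field, size=None, skip_duplicates=None):
    # Maps the action's config fields onto an OpenSearch suggest request body.
    completion = {"field": field}
    if size is not None:
        completion["size"] = size
    if skip_duplicates is not None:
        completion["skip_duplicates"] = skip_duplicates
    return {"suggest": {"autocomplete": {"prefix": text, "completion": completion}}}
```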
Action: fetch

Processor that executes a GET request to retrieve a specified JSON document from an index.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "fetch",
    "index": "my-index", (1)
    "id": "document-ID", (1)
    "fields": <DSL Projection> (2)
  },
  "server": <OpenSearch Server UUID> (3)
}
1 These configuration fields are required.
2 See the DSL Projection section.
3 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

id

(Required, String) The ID of the document.

fields

(Optional, Projection) The source fields to be included or excluded.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
Action: knn

Processor that executes an Approximate k-NN query.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "knn",
    "index": "my-index", (1)
    "field": "vector-field", (1)
    "vector": "#{ data('my/vector') }", (1)
    "minScore": 0.92, (1)
    "maxResults": 5, (1)
    "k": 5, (1)
    "query": {
      "match_all": {}
    }
  },
  "server": <OpenSearch Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

field

(Required, String) The field with the vector.

vector

(Required, Array of Float) The source vector to compare.

minScore

(Required, Double) The minimum score for results.

maxResults

(Required, Integer) The maximum number of results.

k

(Required, Integer) The number of neighbors the search of each graph will return.

query

(Optional, Object) The query to filter in addition to the kNN search.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
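How the action composes these fields is internal, but one plausible mapping onto the public OpenSearch approximate k-NN query DSL is sketched below (filter support inside the knn clause depends on the OpenSearch version and engine):

```python
def knn_search_body(field, vector, k, max_results, min_score, query=None):
    # Assumed mapping of the action's config onto an OpenSearch k-NN search body.
    knn_clause = {"vector": vector, "k": k}
    if query is not None:
        knn_clause["filter"] = query  # additional query applied alongside the k-NN search
    return {
        "size": max_results,
        "min_score": min_score,
        "query": {"knn": {field: knn_clause}},
    }
```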
Action: native

Processor that executes a native OpenSearch query.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "native",
    "path": "/my-index/_doc/1", (1)
    "method": "POST", (1)
    "queryParam": {
      "param1": "value1"
    },
    "body": {
      "field1": "value2"
    }
  },
  "server": <OpenSearch Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

path

(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.

method

(Required, String) The HTTP method for the request.

queryParams

(Optional, Map of String/String) The map of query parameters for the URL.

body

(Optional, Object) The JSON body to submit.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
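Since path excludes the schema, host and port (those come from the configured server connection), assembling the final request URL amounts to joining the connection's base URL with the path and query parameters. A stdlib sketch; the base URL below is hypothetical:

```python
from urllib.parse import urlencode

def build_url(base, path, query_params=None):
    # Join the server connection's base URL with the action's path and params.
    url = base.rstrip("/") + "/" + path.lstrip("/")
    if query_params:
        url += "?" + urlencode(query_params)
    return url
```

For example, build_url("https://opensearch.example:9200", "/my-index/_doc/1", {"param1": "value1"}) yields the full request URL for the configuration above.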
Action: search

Processor that executes a match query on the index.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "search",
    "index": "my-index", (1)
    "text": "#{ data('my/query') }", (1)
    "field": "content", (1)
    "suggest": { (2)
      "completion-suggestion": {
        "prefix": "Value ",
        "completion": {
          "field": "field.completion"
        }
      }
    },
    "aggregations": { (2)
      "aggregationA": {
        "terms": {
          "field": "field.keyword"
        }
      }
    },
    "highlight": { (2)
      "fields": {
        "field": {}
      }
    },
    "filter": <DSL Filter>, (3)
    "pageable": <Pagination Parameters> (4)
  },
  "server": <OpenSearch Server UUID> (5)
}
1 These configuration fields are required.
2 The exact expected structure of these objects is defined by the OpenSearch API; this is just an example at the time of writing.
3 See the DSL Filter section.
4 See the Pagination appendix.
5 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

text

(Required, String) The text to search.

field

(Required, String) The field where to search.

suggest

(Optional, Object) The suggester to apply. The object should represent a valid suggester according to the OpenSearch API.

aggregations

(Optional, Map of String/Object) The field with the aggregations to apply. See the OpenSearch API aggregation documentation for details on the structure of the map.

highlight

(Optional, Object) The highlighter to apply. The object should represent a valid highlighter according to the OpenSearch API.

filter

(Optional, DSL Filter) The filters to apply.

pageable

(Optional, Pagination) The pagination parameters.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
Action: store

Processor that stores or updates documents in the given index of OpenSearch.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "store",
    "index": "my-index", (1)
    "document": { (1)
      "field1": "value1"
    },
    "id": "documentID",
    "allowOverride": false
  },
  "server": <OpenSearch Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to store the document.

id

(Required, String) The ID of the document to be stored.

document

(Required, Object) The document to be stored.

allowOverride

(Optional, Boolean) Whether the document can be overridden or not. Defaults to false.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}
Action: vector

Processor that executes an Exact kNN with scoring script query.

Configuration example
{
  "type": "opensearch",
  "name": "My OpenSearch Processor Action",
  "config": {
    "action": "vector",
    "index": "my-index", (1)
    "field": "my_vector_field", (1)
    "vector": "#{ data('my/vector') }", (1)
    "minScore": 0.92, (1)
    "maxResults": 5, (1)
    "function": "cosinesimil", (1)
    "query": {
      "match_all": {}
    }
  },
  "server": <OpenSearch Server UUID> (2)
}
1 These configuration fields are required.
2 See the OpenSearch integration section.

Each configuration field is defined as follows:

index

(Required, String) The index where to search.

field

(Required, String) The field with the vector.

vector

(Required, Array of Float) The source vector to compare.

minScore

(Required, Double) The minimum score for results.

maxResults

(Required, Integer) The maximum number of results.

function

(Required, String) The function used for the k-NN calculation. The available functions can be found here.

query

(Optional, Object) The query to apply together with the vector search.

The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:

Action output structure
{
  "opensearch": {
    ...
  }
}

Question Detector

The Question Detector component validates whether an input text contains a question. It uses language codes referenced using ISO 639-1 (alpha-2 codes). This component’s output field is named isQuestion by default.

Processor Action: process

Processor that detects if the provided text is a question.

Configuration example
{
  "type": "question-detector",
  "name": "My Question Detector Processor Action",
  "config": {
    "action": "process",
    "language": "es", (1)
    "text": "#{ data(\"/httpRequest/queryParams/question\") }", (1)
    "questionPrefixes": {
      "es": [
        "que",
        "quien",
        "porque",
        "donde",
        "cuando",
        "como"
      ]
    }
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

text

(Required, String) The text to be evaluated.

language

(Required, String) The language to use.

questionPrefixes

(Optional, Map of String/List) Words that indicate a question. Defaults to { "en": [ "what", "who", "why", "where", "when", "how" ] }.

The response of the action is stored in the JSON Data Channel:

{
  "isQuestion": <true or false>
}
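The component's exact detection logic is internal; a plausible sketch of a prefix-based heuristic, using the default English prefixes listed above, is:

```python
DEFAULT_PREFIXES = {"en": ["what", "who", "why", "where", "when", "how"]}

def is_question(text, language="en", question_prefixes=None):
    # Heuristic sketch only: a text counts as a question if it ends with a
    # question mark or starts with a known question word for the language.
    prefixes = (question_prefixes or DEFAULT_PREFIXES).get(language, [])
    stripped = text.strip().lower()
    first = stripped.split()[0].strip("?!.,") if stripped else ""
    return stripped.endswith("?") or first in prefixes
```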

Script

Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging. This component’s output field is named script by default.

Processor Action: process

Processor that executes a script to process and interact with data produced in previous states.

Configuration example
{
  "type": "script",
  "name": "My Script Processor Action",
  "config": {
    "action": "process",
    "language": "groovy",
    "script": <Script> (1)
  }
}
1 This configuration field is required.

Each configuration field is defined as follows:

language

(Optional, String) The language of the script. One of the supported script languages. Defaults to groovy.

script

(Required, String) The script to run.

The response of the action is stored in the JSON Data Channel according to the script’s interaction with the output() object:

Action output example
{
  "script": {
    ...
  }
}

Solr

Uses the Solr integration to send requests to Solr. This component’s output field is named solr by default.

Action: native

Processor that executes a native indexing query.

Configuration example
{
  "type": "solr",
  "name": "My Solr Processor Action",
  "config": {
    "action": "native",
    "path": "/select", (1)
    "method": "POST", (1)
    "queryParams": { (1)
      "q": "description:Pureinsights"
    },
    "body": {},
    "maxResponseMapDepth": 5
  },
  "server": <Solr Server UUID> (2)
}
1 These configuration fields are required.
2 See the Solr integration section.

Each configuration field is defined as follows:

path

(Required, String) The Solr operation path to be used for the request.

method

(Required, String) The HTTP method for the request.

queryParams

(Required, Map of String/String) The map of query parameters for the request.

body

(Optional, Object) The JSON body to submit for the request. The exact structure will depend on the Solr operation performed.

maxResponseMapDepth

(Optional, Integer) The maximum depth for response object deserialization. Defaults to 5.

The response of the action is stored in the JSON Data Channel as returned by the Solr API:

Action output structure
{
  "solr": {
    ...
  }
}
Action: search

Processor that executes a standard search query.

Configuration example
{
  "type": "solr",
  "name": "My Solr Processor Action",
  "config": {
    "action": "search",
    "query": "#{ data('my/query') }", (1)
    "filterQueries": "description:Pureinsights",
    "fields": [
      "name",
      "description"
    ],
    "highlight": false,
    "maxResponseMapDepth": 5
  },
  "server": <Solr Server UUID> (2)
}
1 These configuration fields are required.
2 See the Solr integration section.

Each configuration field is defined as follows:

query

(Required, String) The search query to be executed.

fields

(Optional, Array of Strings) The optional returned fields of the document. If not set, all the fields in the document are returned.

highlight

(Optional, Boolean) Whether to enable highlighting in the resulting query or not.

filterQueries

(Optional, String) The filter queries to be applied to the search.

maxResponseMapDepth

(Optional, Integer) The maximum depth for response object deserialization. Defaults to 5.
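How the action translates these fields is not spelled out here, but they line up naturally with standard Solr query parameters (q, fq, fl, hl). A hedged sketch of that assumed mapping:

```python
def solr_query_params(query, filter_queries=None, fields=None, highlight=False):
    # Assumed mapping onto standard Solr parameters; the action's real
    # translation may differ.
    params = {"q": query}
    if filter_queries:
        params["fq"] = filter_queries
    if fields:
        params["fl"] = ",".join(fields)
    if highlight:
        params["hl"] = "true"
    return params
```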

The response of the action is stored in the JSON Data Channel as returned by the Solr API:

Action output structure
{
  "solr": {
    ...
  }
}

Staging

Interacts with buckets and content from Discovery Staging. This component’s output field is named staging by default.

Action: fetch

Gets a document from the given bucket.

Configuration example
{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "fetch",
    "bucket": "my-bucket", (1)
    "id": "my-document-id", (1)
    "fields": <DSL Projection> (2)
  }
}
1 These configuration fields are required.
2 See the DSL Projection section.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket name.

id

(Required, String) The ID of the document to fetch.

fields

(Optional, Projection) The projection to apply on the document.

The response of the action is stored in the JSON Data Channel as returned by the Staging client:

Action output structure
{
  "staging": {
    ...
  }
}
Action: store

Stores a document into the given bucket.

Configuration example
{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "store",
    "bucket": "my-bucket", (1)
    "document": { (1)
      "field": "value"
    },
    "id": <Staging Document ID>, (1)
    "allowOverride": false
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket name.

document

(Required, Object) The document to store.

id

(Optional, String) The ID of the document to store. If not provided, a random UUID will be used.

allowOverride

(Optional, Boolean) Whether to allow overriding an existing document or not. Defaults to false.

The response of the action is stored in the JSON Data Channel as returned by the Staging client:

Action output structure
{
  "staging": {
    ...
  }
}
Action: search

Searches for documents in the given bucket.

Configuration example
{
  "type": "staging",
  "name": "My Staging Processor Action",
  "config": {
    "action": "search",
    "bucket": "my-bucket", (1)
    "actions": ["STORE"], (1)
    "filter": <DSL Filter>, (2)
    "projection": <DSL Projection>, (3)
    "parentId": <Staging Document ID>,
    "pageable": <Pagination Parameters> (4)
  }
}
1 These configuration fields are required.
2 See the DSL Filter section.
3 See the DSL Projection section.
4 See the Pagination appendix.

Each configuration field is defined as follows:

bucket

(Required, String) The bucket name.

actions

(Required, Array of Strings) The actions to filter the documents. Defaults to STORE.

projection

(Optional, Projection) The projection to apply on the search.

filter

(Optional, DSL Filter) The filter to apply on the search.

parentId

(Optional, String) The parent ID to match.

pageable

(Optional, Pagination) The pagination object.

Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [
    {
      "property" : "fieldA",
      "direction" : "ASC"
    },
    {
      "property" : "fieldB",
      "direction" : "DESC"
    }
  ]
}

Each configuration field is defined as follows:

page

(Integer) The page number.

size

(Integer) The size of the page.

sort

(Array of Objects) The sort definitions for the page.

Field definitions
property

(String) The property where the sort was applied.

direction

(String) The direction of the applied sorting. Either ASC or DESC.
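The semantics of a multi-entry sort list, where earlier entries take priority and later ones break ties, can be sketched in Python by applying the definitions in reverse with a stable sort:

```python
def apply_sort(docs, sort):
    # Later sort entries act as tie-breakers: apply them first, then rely on
    # Python's stable sort so earlier entries take priority.
    for spec in reversed(sort):
        docs = sorted(docs, key=lambda d: d[spec["property"]],
                      reverse=(spec["direction"] == "DESC"))
    return docs
```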

The response of the action is stored in the JSON Data Channel as returned by the Staging client:

Action output structure
{
  "staging": {
    ...
  }
}

Template

Uses the Template Engine to transform a standard template with contextual structured data, generating a verbalized representation of the information. It can generate various types of documents as either plain text or JSON. This component’s output field is named template by default.

Processor Action: process

Processor that processes the provided template with the defined configuration.

Configuration example
{
  "type": "template",
  "name": "My Template Processor Action",
  "config": {
    "action": "process",
    "template": "Hello, ${name}!", (1)
    "bindings": { (1)
      "name": "John"
    },
    "outputFormat": "PLAIN"
  }
}
1 These configuration fields are required.

Each configuration field is defined as follows:

template

(Required, String) The template to process.

bindings

(Required, Object) The bindings to replace in the template.

Binding object format
{
  "bindingA": "#{ data('/my/binding/field') }",
  ...
}

Each binding, defined as a key in the object, can be later referenced in a template:

My bindingA value is ${bindingA}
outputFormat

(Optional, String) The output format of the processed template. Supported formats are: JSON and PLAIN. Defaults to PLAIN.

The response of the action is stored in the JSON Data Channel as returned by the Template engine:

Action output structure
{
  "template": {
    ...
  }
}
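The ${name} placeholder syntax in the example happens to match Python's string.Template, which makes a convenient stand-in to show the substitution (Discovery's actual Template Engine is not specified here and supports more than this sketch):

```python
from string import Template

# The mapping plays the role of the action's bindings object.
rendered = Template("Hello, ${name}!").substitute({"name": "John"})
```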

Tokenizer

Tokenizes a specified text input using Lucene. This component’s output field is named tokens by default.

Processor Action: process

Processor that tokenizes any entry provided using Lucene analyzers. The supported analyzers are described under the analyzer field below.

Note

All analyzers (except the custom one, which needs the tokenizer configuration) will be used as they are built by default if no configuration is specified. Further configuration can be specified under the analyzer field.

Configuration example
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
    "attributes": ["term", "offset"],
    "analyzer": <Analyzer Configuration> (2)
  }
}
1 This configuration field is required.
2 See the analyzer configuration.

Each configuration field is defined as follows:

text

(Required, String) The text to tokenize.

attributes

(Optional, Array of Strings) The attributes to include with each token. Supports term and offset. Defaults to ["term", "offset"].

Type definitions

term: Adds the token itself as an attribute.

offset: Adds the relative start and end position of the token in the input text.

Note

The attributes are added to the configuration as a list; all of those included will be added to the output. That list, if specified, cannot be empty.

analyzer

(Optional, String or Map of String/Object) The analyzer to use for the tokenization. Defaults to standard.

Field definitions

If the analyzer field is set as a string, a default analyzer, without further configuration, will be used in the action. The value should be the name of one of the supported analyzers:

Default standard analyzer example
{
  "analyzer": "standard"
}

If the analyzer field is set as an object representing a Map of String/Object, the analyzer used in the action can be further customized, and the exact expected structure of the map’s objects, which is the custom analyzer’s configuration, will depend on the chosen type of analyzer:

Standard analyzer configuration
Standard analyzer configuration example
{
  "type": "standard",
  "stopwords": {
    "tokens": [
      "the"
    ],
    "ignoreCase": true
  },
  "maxTokenLength": 4
}

Each configuration field is defined as follows:

maxTokenLength

(Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.

stopwords

(Optional, Array of Strings or Map of String/Object) A set of common words usually not useful for search. The field can be defined as a map object to configure both the list of words via the tokens field, and whether to ignore cases when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.

Language analyzers configuration
Language analyzer configuration example
{
  "type": "spanish",
  "stopwords": {
    "tokens": [
      "va"
    ],
    "ignoreCase": true
  },
  "stemExclusion": [
      "voy"
  ]
}

Each configuration field is defined as follows:

stopwords

(Optional, Array of Strings or Map of String/Object) A set of common words usually not useful for search. The field can be defined as a map object to configure both the list of words via the tokens field, and whether to ignore cases when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.

stemExclusion

(Optional, Array of Strings or Map of String/Object) A set of words to not be stemmed. The field can be defined as a map object to configure both the list of words via the tokens field, and whether to ignore cases when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.

Whitespace analyzer configuration
Whitespace analyzer configuration example
{
  "type": "whitespace",
  "maxTokenLength": 255
}

Each configuration field is defined as follows:

maxTokenLength

(Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.

Custom analyzer configuration
Custom analyzer configuration example
{
  "type": "custom",
  "tokenizer": {
    "type": "standard",
    "maxTokenLength": 4
  },
  "filters": [
    "lowercase",
    {
      "type": "edgeNgram",
      "minGramSize": 2,
      "maxGramSize": 3
    }
  ]
}
tokenizer

(Required, String or Map of String/Object) The tokenizer for the custom analyzer. The field can be configured as a map object to set both the tokenizer type, via the type field, and its parameters, via the remaining key/value pairs of the map. It can also be configured as a single String naming the tokenizer type, in which case a default tokenizer, without further customization, is used.

filters

(Optional, Array of Objects) The list of filters to be applied. The parameters of each filter in the array may be configured in the same manner as for the tokenizer field. Alternatively, default filters may be configured by name alone, with a String element in the array.

The response of the action is stored in the JSON Data Channel as returned by the Lucene engine:

Action output structure
{
  "tokens": {
    ...
  }
}
Examples
Default Configuration example
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}
Whitespace Analyzer with Term Attribute
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": "whitespace",
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/body/custom/field\") }"
  }
}
Advanced Configuration Analyzer
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": {
      "type": "english",
      "stopwords":{
        "tokens": [
          "the"
        ],
        "ignoreCase": true
      },
      "stemExclusion": [
        "quick"
      ]
    },
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}
Advanced Configuration Max Token Length
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "analyzer": {
      "type": "whitespace",
      "maxTokenLength": 4
    },
    "attributes": [
      "term"
    ],
    "text": "#{ data(\"/httpRequest/queryParams/q\") }"
  }
}
Simple custom analyzer
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "Hi, my cat is INJURED in its paw.",
    "analyzer": {
      "type": "custom",
      "tokenizer": "whitespace",
      "filters": [
        "lowercase"
      ]
    },
    "attributes": [
      "term"
    ]
  }
}
Custom analyzer with parameters
{
  "type": "tokenizer",
  "name": "My Tokenizer Processor Action",
  "config": {
    "action": "process",
    "text": "Hi, my cat is INJURED in its paw.",
    "analyzer": {
      "type": "custom",
      "tokenizer": {
        "type": "standard",
        "maxTokenLength": 4
      },
      "filters": [
        "lowercase",
        {
          "type": "edgeNgram",
          "minGramSize": 2,
          "maxGramSize": 3
        }
      ]
    },
    "attributes": [
      "term"
    ]
  }
}
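To make the composition concrete, the "Simple custom analyzer" example above (a whitespace tokenizer followed by a lowercase filter) amounts to the following in plain Python. This is a sketch only: Lucene's tokenizers and filters handle punctuation, offsets and Unicode far more carefully:

```python
def whitespace_tokenize(text):
    # Whitespace tokenizer: split on runs of whitespace, keeping punctuation.
    return text.split()

def lowercase_filter(tokens):
    # Lowercase token filter.
    return [t.lower() for t in tokens]

tokens = lowercase_filter(whitespace_tokenize("Hi, my cat is INJURED in its paw."))
```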

Vespa

Uses the Vespa integration to send requests to a Vespa service. This component’s output field is named vespa by default.

Action: native

Processor that executes an HTTP request to a Vespa service.

Configuration example
{
  "type": "vespa",
  "name": "My Vespa Native Action",
  "config": {
    "action": "native", (1)
    "method": "POST", (1)
    "path": "/search", (1)
    "queryParams": { (2)
      "timeout": "120s"
    },
    "body": { (2)
      "yql": "select * from sources * where true"
    }
  },
  "server": <Vespa Server UUID> (3)
}
1 These configuration fields are required.
2 The exact expected structure of these objects is defined by the Vespa API; this is just an example at the time of writing.
3 See the Vespa integration section.

Each configuration field is defined as follows:

method

(Required, String) The HTTP method for the request.

path

(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.

queryParams

(Optional, Map of String/String) The map of query parameters for the URL.

body

(Optional, Object) The JSON body to submit as part of the request.

The response of the action is stored in the JSON Data Channel as returned by the Vespa API:

Action output structure
{
  "vespa": {
    ...
  }
}

Voyage AI

Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service. This component’s output field is named voyage-ai by default.

Action: reranking

Processor that, given a query and a list of documents, returns the relevance ranking between the query and each document. See Voyage AI Rerankers and the API Rerankers endpoint.

Configuration example
{
  "type": "voyage-ai",
  "name": "My Reranking Action",
  "config": {
    "action": "reranking",
    "model": "rerank-lite-1", (1)
    "query": "Sample query", (1)
    "documents": ["Sample document 1", "Sample document 2"], (1)
    "truncation": true,
    "topK": 10,
    "returnDocuments": false
  },
  "server": <Voyage AI Server> (2)
}
1 These configuration fields are required.
2 See the Voyage AI integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request. See models.

query

(Required, String) The query the reranking is based on.

documents

(Required, Array of Strings) The documents to be reranked.

truncation

(Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.

topK

(Optional, Integer) The number of most relevant documents to return.

returnDocuments

(Optional, Boolean) Whether to include the documents in the response. Defaults to false.

The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:

Action output structure
{
  "voyage-ai": {
    ...
  }
}
Action: embeddings

Processor that, given an input string (or a list of strings) and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.

Configuration example
{
  "type": "voyage-ai",
  "name": "My Embeddings Action",
  "config": {
    "action": "embeddings",
    "model": "voyage-large-2", (1)
    "input": ["Sample text 1", "Sample text 2"], (1)
    "truncation": true,
    "inputType": "DOCUMENT",
    "outputDimension": 1536,
    "outputDatatype": "FLOAT",
    "encodingFormat": "Base64",
  },
  "server": <Voyage AI Server> (2)
}
1 These configuration fields are required.
2 See the Voyage AI integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request. See models.

input

(Required, String or List of Strings) List of documents to be embedded. If there’s a single input, it can be configured as a single String.

truncation

(Optional, Boolean) Whether to truncate the input texts to fit within the context length. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputDimension

(Optional, Integer) The number of dimensions for resulting output embeddings. Defaults to null.

outputDatatype

(Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.

encodingFormat

(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.

The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:

Action output structure
{
  "voyage-ai": {
    ...
  }
}
Action: multimodal-embeddings

Processor that, given a list of multimodal inputs (consisting of text, images, or an interleaving of both modalities) and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Multimodal Embedding and the API Multimodal embedding models endpoint.

Configuration example
{
  "type": "voyage-ai",
  "name": "My Multimodal Embeddings Action",
  "config": {
    "action": "multimodal-embeddings",
    "model": "voyage-multimodal-3", (1)
    "input": [ (1) (2)
      <Input Objects>
    ],
    "truncation": true,
    "inputType": "DOCUMENT",
    "outputEncoding": "Base64"
  },
  "server": <Voyage AI Server> (3)
}
1 These configuration fields are required.
2 There are multiple types of accepted inputs, check the input object definition for details.
3 See the Voyage AI integration section.

Each configuration field is defined as follows:

model

(Required, String) The model to use for the request. See models.

input

(Required, Array of Objects) A list of multimodal inputs to be vectorized. Each object in the list represents an input.

Field definitions
type

(Required, String) The type. One of: text, image_url or image_base64.

text

(Optional, String) The text if the type text is chosen.

Text input example
{
  "type": "text",
  "text": "This is a banana."
}
imageUrl

(Optional, String) The image url if the type image_url is chosen.

Image URL input example
{
  "type": "image_url",
  "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
}
imageBase64

(Optional, Object) The base 64 encoded image if the type image_base64 is chosen.

Image Base64 input example
{
  "type": "image_base64",
  "imageBase64": {
      "mediaType": "image/jpeg",
      "base64": true,
      "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)"
  }
}
mediaType

(Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.

base64

(Required, Boolean) Whether the data is encoded in Base64.

data

(Required, String) The data itself.

truncation

(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.

inputType

(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.

outputEncoding

(Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.

The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:

Action output structure
{
  "voyage-ai": {
    ...
  }
}

Sandbox API

The Sandbox API allows the user to execute standalone processors without setting up an endpoint, and returns the corresponding response from the execution. This API must be enabled via the queryflow.sandbox.enabled property.

Execute a new Processor
$ curl --request POST 'queryflow-api:12040/v2/sandbox?timeout=PT15S'
Query Parameters
timeout

(Optional, Duration) The timeout for the request execution. Defaults to 15s.

Body
{
  "processor": {
    "type": "my-component-type",
    "config": {
      ...
    },
    "server": {
      "type": "my-server-type",
      "config": {
        ...
      },
      "credential": {
        "type": "my-credential-type",
        "secret": {
          ...
        }
      }
    }
  },
  "input": {
    ...
  }
}
processor

(Required, Object) The configuration for the processor to be executed.

Details
type

(Required, String) The type of the component to execute.

config

(Required, Object) The configuration for the corresponding action of the component.

server

(Optional, UUID/Object) Either the ID of an existing server or the type and configuration for one.

Details
type

(Required, String) The type of external supported service.

config

(Required, Object) The configuration to connect to the external service.

credential

(Optional, UUID/Object) Either the ID of an existing credential or the type and data for one.

Details
type

(Required, String) The type of credentials for the external supported service.

secret

(Required, String/Object) Either the secret key to connect to the external service, or an object with the authentication details.

input

(Required, Object) The input to be sent to the processor.
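The request body above can be assembled programmatically; here is a minimal Python sketch that builds and serializes it. The component type, configuration field, and server UUID are hypothetical placeholders, and sending the request requires a live QueryFlow deployment, so the HTTP call itself is only shown as a comment:

```python
import json

# Hypothetical Sandbox API request body, following the structure above
payload = {
    "processor": {
        "type": "my-component-type",   # hypothetical component type
        "config": {"field": "value"},  # hypothetical action configuration
        "server": "00000000-0000-0000-0000-000000000000",  # ID of an existing server
    },
    "input": {"text": "Hello world!"},
}

body = json.dumps(payload)
# Send it with, for example:
#   curl --request POST 'queryflow-api:12040/v2/sandbox?timeout=PT15S' \
#        --header 'Content-Type: application/json' --data "$body"
print(body)
```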

Discovery Sandbox SDK

The Sandbox SDK is a Python library that allows developers to programmatically access Discovery features inside a Python execution environment without the need for extensive setup. Currently, it supports sending execution requests for Discovery QueryFlow processors and obtaining their result. It requires Python 3.13 or higher.

Quickstart

from sandbox.discovery_sandbox import QueryFlowClient, Processor, QueryFlowSequence, QueryFlowSequenceProcessor

# Initialize the client
client = QueryFlowClient(url="https://your-queryflow-api:12040", api_key="YOUR_API_KEY")

# Define processors
my_processor = Processor(
    type="processor_type",
    config={ ... }
)

another_processor = Processor(
    type="processor_type",
    config={ ... }
)

# Define a streaming processor
streaming_processor = Processor(
    type="processor_type",
    config={
        "stream": true,
        ...
    }
)

# Define input
my_input = {"text": "Hello world!"}

try:
    # Execute text_to_text
    response = client.text_to_text(my_processor, my_input)
    print("Text_to_Text Response:", response)

    # Alternatively, if the processor supports streaming data
    print("Text_to_Stream Response:")
    for chunk in client.text_to_stream(streaming_processor, my_input):
        print(chunk, end="")

except Exception as e:
    print(f"An error occurred: {e}")

# Example with a Processor Sequence

first_processor = QueryFlowSequenceProcessor(my_processor, "PT15S")
second_processor = QueryFlowSequenceProcessor(another_processor, "PT30S")

sequence = QueryFlowSequence([first_processor, second_processor])

result = client.execute(sequence, my_input)

Installation

You can install the Sandbox SDK and all necessary dependencies to your Python execution environment via pip.

cd pdp-sandbox
pip install .

Core Entities

The SDK uses several classes to represent the components involved in a QueryFlow request. These mirror the configuration for QueryFlow and external integration components.

Credential

Represents a Credential entity.

Attributes
type

(Required, String) The credential type.

secret

(Required, Dict) A dictionary containing the secret data.

Server

Represents a Server entity.

Attributes
type

(Required, String) The server type.

config

(Required, Dict) The server configuration.

credential

(Optional, Credential) The credential for authentication with the server.

Processor

Represents an executable Processor entity.

Attributes
type

(Required, String) The type of the processor.

config

(Required, Dict) The configuration for the processor.

server

(Optional, Server) Server configuration to be associated with the processor.

QueryFlowClient

The QueryFlowClient is the main interface for interacting with the QueryFlow Sandbox API.

text_to_text

Executes a processor and returns its complete response as a dictionary. This method is overloaded and can accept either a full Processor object or the ID of a pre-existing processor. Returns the JSON response from the processor execution. If the server returns a 204 No Content, an empty dictionary {} is returned.

Parameters
processor

(Required, Processor) The Processor object to execute.

input

(Required, Dict) The input data to send to the processor.

timeout

(Optional, Duration) The timeout for the request execution.

text_to_stream

Executes a processor with support for streaming responses and yields response chunks as they are received. This method is overloaded similarly to text_to_text. The processor must support the stream key and have it enabled as part of its configuration.

Parameters
processor

(Required, Processor) The Processor object to execute.

input

(Required, Dict) The input data to send to the processor.

timeout

(Optional, Duration) The timeout for the request execution.

execute

Executes a sequence of processors, where the output of one processor becomes the input for the next.

Parameters
sequence

(Required, QueryFlowSequence) The sequence of processors to execute.

input_data

(Required, Dict) The initial input to be fed into the first processor of the sequence.

Sequential Execution

The Sandbox SDK additionally provides an interface to execute non-streaming processors sequentially, sending the output of a processor as the input for the next.

QueryFlowSequenceProcessor

Represents a single step in a QueryFlowSequence.

Attributes
processor

(Required, String or Processor) Either a Processor object to define a new processor for this step, or a string UUID of an existing processor.

timeout

(Optional, Duration) The timeout specifically for this processor’s execution within the sequence.

QueryFlowSequence

Represents the list of processors to be executed.

Attributes
processors

(Required, List of QueryFlowSequenceProcessors) A list of QueryFlowSequenceProcessor objects defining the sequence.

Labeling a Configuration

Labels API
Create
$ curl --request POST 'core-api:12010/v2/label' --data '{ ... }'
Body
{
  "key": "My Label Key",
  "value": "My Label Value"
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Get All
$ curl --request GET 'core-api:12010/v2/label?page={page}&size={size}&sort={sort}'
Query Parameters
page

(Optional, Int) The page number. Defaults to 0.

size

(Optional, Int) The size of the page. Defaults to 25.

sort

(Optional, Array of String) The sort definition for the page.

Get One
$ curl --request GET 'core-api:12010/v2/label/{id}'
Path Parameters
id

(Required, String) The label ID.

Update
$ curl --request PUT 'core-api:12010/v2/label/{id}' --data '{ ... }'
Path Parameters
id

(Required, String) The label ID.

Body
{
  "key": "My Label Key",
  "value": "My Label Value"
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.

Delete
$ curl --request DELETE 'core-api:12010/v2/label/{id}'
Path Parameters
id

(Required, String) The label ID.

Note

Both key and value properties will be trimmed.

Labels are simple key/value pairs that help reference user configurations. Any configuration can be tagged with labels either created beforehand through this API, or during the CRUD process of the entity itself. Labels are limited to a maximum of 45 characters for both key and value.

Note

When creating multiple labels during the CRUD process of other entities (e.g. a server or a credential), duplicates will be ignored.

To create a new label directly from an entity configuration, the following property must be included as part of the body payload:

{
  "labels": [
    {
      "key": "My Label Key",
      "value": "My Label Value"
    },
    ...
  ],
  ...
}
key

(Required, String) The key of the label.

value

(Required, String) The value of the label.
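The trimming and length rules above can be mirrored client-side before sending a request. The helper below is a hypothetical pre-check, not part of the Labels API:

```python
# Hypothetical client-side validation mirroring the label rules above:
# keys and values are trimmed, and each is limited to 45 characters.
def normalize_label(key: str, value: str) -> dict:
    key, value = key.strip(), value.strip()
    for name, text in (("key", key), ("value", value)):
        if not text or len(text) > 45:
            raise ValueError(f"label {name} must be 1-45 characters after trimming")
    return {"key": key, "value": value}

print(normalize_label("  My Label Key ", "My Label Value"))
# {'key': 'My Label Key', 'value': 'My Label Value'}
```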

Backup & Restore

Core Backup API
Export entities
$ curl --request GET 'core-api:12010/v2/export'
Import entities
$ curl --request POST 'core-api:12010/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict

(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Queryflow Backup API
Export entities
$ curl --request GET 'queryflow-api:12040/v2/export'
Import entities
$ curl --request POST 'queryflow-api:12040/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict

(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Ingestion Backup API
Export entities
$ curl --request GET 'ingestion-api:12030/v2/export'
Import entities
$ curl --request POST 'ingestion-api:12030/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict

(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.

Each product (Core, QueryFlow, and Ingestion) has its own backup and restore API. The entity distribution is as follows:

Core Entities
Note

Labels are skipped as they will be handled during the creation of other entities.

Note

Secrets are not part of this process due to security reasons. All credentials assume their referenced secret currently exists or will be created by different means.

The backup and restore for the entities is done through a single export-{timestamp}.zip ZIP file that contains a Newline Delimited JSON (NDJSON) file per entity type. Each configuration is exported in the correct order, so it can be imported back without missing-dependency problems.

Note

Manual modifications of the exported file might corrupt the backup.

Conflict resolution strategy

Since the ID of each exported entity is expected to remain the same after importing, conflicts might arise. The restore process has 3 different resolution strategies:

  • IGNORE: The input entity will be ignored, keeping the existing one unchanged.

  • UPDATE: The current entity will be updated with the input entity values.

  • FAIL: The current entity will not be modified, and an error will be thrown.
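The three strategies above can be modeled as a small function. This is only a hypothetical sketch of the semantics, not actual Discovery code:

```python
# Hypothetical model of the onConflict strategies described above.
def resolve_conflict(existing: dict, imported: dict, strategy: str = "FAIL") -> dict:
    if strategy == "IGNORE":
        return existing                   # keep the current entity unchanged
    if strategy == "UPDATE":
        return {**existing, **imported}   # overwrite with the imported values
    raise ValueError("Conflict: entity already exists")  # FAIL (the default)

print(resolve_conflict({"id": 1, "name": "old"}, {"id": 1, "name": "new"}, "UPDATE"))
# {'id': 1, 'name': 'new'}
```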

Appendix A: Pagination and Sorting

Any endpoint that paginates results receives the following optional query parameters:

page

(Optional, Integer) The page to retrieve. Defaults to 0.

Note

If the provided value is invalid, it will be replaced by the default one.

size

(Optional, Integer) The size of the page. Must be an integer between 1 and 100. Defaults to 25.

Note

If the provided value is invalid or out of range, it will be replaced by the default one.

sort

(Optional, String) The sorting fields, with an optional direction. Ascending by default: sort=<string>[,(asc|desc)].

Note

This parameter can be used multiple times: sort=fieldA&sort=fieldB,desc

The response of a paginated request will either be an empty payload with a 204 - No Content status code, or a 200 - OK with the results page:

{
  "content": [
    {
      ...
    },
    ...
  ],
  "pageable": {
    ...
  },
  "totalSize": 1,
  "totalPages": 1,
  "numberOfElements": 1,
  "pageNumber": 0,
  "empty": false,
  "size": 25,
  "offset": 0
}
content

(Array of Objects) The page content.

pageable

(Object) The page request information.

Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [ // (1)
    {
      "property" : "fieldA",
      "direction" : "ASC"
    },
    {
      "property" : "fieldB",
      "direction" : "DESC"
    }
  ]
}

Each configuration field is defined as follows:

page

(Integer) The page number.

size

(Integer) The size of the page.

sort

(Array of Objects) The sort definitions for the page.

Field definitions
property

(String) The property where the sort was applied.

direction

(String) The direction of the applied sorting. Either ASC or DESC.

totalSize

(Integer) The total number of records.

totalPages

(Integer) The total number of pages.

numberOfElements

(Integer) The number of elements on the returned slice of content.

pageNumber

(Integer) The current page number.

empty

(Boolean) true if the page has no content.

size

(Integer) The size of the returned slice of content.

offset

(Integer) The page offset.
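The fields above suggest a simple client-side pattern for walking every page of a paginated endpoint. Here is a hedged Python sketch, where fetch_page is a hypothetical stub standing in for the actual HTTP call:

```python
# Iterate every record across all pages of a paginated Discovery endpoint,
# using the page/size parameters and response fields described above.
def iterate_pages(fetch_page, size=25):
    """Yield every record; fetch_page(page=..., size=...) returns one response page."""
    page = 0
    while True:
        result = fetch_page(page=page, size=size)
        if result is None or result.get("empty", True):
            return  # 204 No Content or an empty page: stop
        yield from result["content"]
        if page + 1 >= result["totalPages"]:
            return
        page += 1

# Usage with a fake two-page response:
fake = [
    {"content": [1, 2], "empty": False, "totalPages": 2},
    {"content": [3], "empty": False, "totalPages": 2},
]
records = list(iterate_pages(lambda page, size: fake[page]))
print(records)  # [1, 2, 3]
```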

Appendix B: Date and Time Patterns

In some instances, the use of string patterns may be required to represent dates. These patterns consist of a series of letters and symbols that represent the structure the date should follow as an output. To create them, follow the table of definitions below:

Table 80. Date and Time Symbols
Symbol Meaning

G

The era (e.g. AD)

u

The year

y

The year of the era

D

The day of the year

M

The month of the year

L

The month of the year

d

The day of the month

Q

Quarter of the year

q

Quarter of the year

Y

The week-based year

w

The week of the week-based year

W

The week of the month

E

The day of the week

e

The localized day of the week

c

The localized day of the week

F

The week of the month

a

The am/pm of the day

h

The clock hour (1-12)

K

The clock hour (0-11)

k

The clock hour (1-24)

H

The hour of the day (0-23)

m

Minute of the hour

s

Second of the minute

S

The fraction of the second

A

The milliseconds

n

The nanoseconds

N

The nanoseconds of the day

V

The time-zone ID

z

The time-zone name

O

The localized zone offset

X

The zone offset

x

The zone offset

Z

The zone offset

p

Pad the next element

'

Escape for text

''

A single quote

[

Start of an optional section

]

End of an optional section

Each symbol may be used 'n' consecutive times (e.g. uuuu); this determines the use of a short or long form of the representation. The definition of these forms may vary depending on the type a symbol represents. The following list shows the basic representations depending on the type and the 'n' times a symbol is repeated:

  • Text

    • n < 4: Abbreviation (e.g. Wed for Wednesday)

    • n = 4: Full

    • n = 5: Normally one letter (e.g. W for Wednesday)

  • Number

    • n: The number with zero padding for the extra quantity (e.g. 3 -> 003 when n = 3)

      • c, F -> n <= 1

      • d, H, h, K, k, m, and s -> n <= 2

      • D -> n <= 3

  • Number and Text (Combination of both)

    • n >= 3: Seen as a Text

    • n < 3: Seen as a Number

  • Fractions

    • n <= 9: The number of digits the fraction is truncated to

  • Year

    • n = 2: Two digits (e.g. 23 for 2023)

    • n <= 4, n != 2: The full year

  • ZoneId

    • n = 2: Outputs the zone id

  • Zone names

    • 1 <= n <= 3: Short name

    • n = 4: Full name

  • Offset for 'X' and 'x'

    • n = 1: Just the hour if minute is zero, otherwise include minute

    • n = 2: Hour and minute

    • n = 3: Hour and minute with a colon

    • n = 4: Hour, minute and second

    • n = 5: Hour, minute and second with a colon

  • Offset for 'O'

    • n = 1: Short offset (e.g. GMT+1)

    • n = 4: Full offset (e.g. GMT+1:00)

  • Offset for 'Z'

    • n <= 3: Hour and minute

    • n = 4: The full offset (e.g. GMT+1:00)

    • n = 5: The hour and minute with colon

  • Pad

    • n: Number of the width

For example, the pattern "dd MM:ppppppuuuu" produces the date "30 11:  2023". Special characters, such as the ':' and the spaces, can be combined with the symbols. Spacing can also be achieved with the pad: in this example, 2 additional spaces are added between the ':' and the year.
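The arithmetic of that example can be checked by hand. The sketch below reproduces the output in plain Python string formatting; it only illustrates the pad semantics and does not use the date pattern engine itself:

```python
# Reproduce "dd MM:ppppppuuuu" for 30 November 2023 by hand:
# "dd" and "MM" are zero-padded to two digits, and "pppppp" pads the
# next element (the year, "uuuu") to a total width of 6 characters.
day, month, year = 30, 11, 2023
formatted = f"{day:02d} {month:02d}:{str(year).rjust(6)}"
print(repr(formatted))  # '30 11:  2023' (two spaces before the year)
```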

Appendix C: Error Messages

If a request to any API produces an error, a standard response will be returned:

{
  "status": 409,
  "code": 2001,
  "messages": [
    "Duplicate entry for field(s): name"
  ],
  "timestamp": "2023-01-28T01:52:22.117244600Z"
}

Response Body

status

(Integer) The HTTP status code of the response.

code

(Integer) The internal error code.

messages

(Array of String) An optional list of messages describing the error.

timestamp

(Timestamp) The UTC timestamp when the error happened.

Error Codes

The error code is a more precise description of the error. It extends the information provided by the HTTP status code, as the same status could be caused by different problems.

Each code is composed of 2 digits that represent the category, and 2 digits that represent the specific error, where the category can be:

  • 10 - Resources: access to entities or endpoints

  • 20 - Data integrity: entities referencing other entities, or other constraints such as unique keys

  • 30 - Data validation: input data (format, missing fields…​)

  • 40 - Execution: problems while invoking an action

  • 70 - Security: access and permissions

  • 80 - Third-party: communication with external services

  • 99 - Others: any other issue
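Splitting a code into its category and specific-error parts is simple integer arithmetic; a minimal sketch of the scheme above:

```python
# Decode a Discovery error code: the first 2 digits are the category,
# the last 2 digits are the specific error.
CATEGORIES = {
    10: "Resources", 20: "Data integrity", 30: "Data validation",
    40: "Execution", 70: "Security", 80: "Third-party", 99: "Others",
}

def describe(code: int) -> str:
    category, specific = divmod(code, 100)
    return f"{CATEGORIES.get(category, 'Unknown')} (error {specific:02d})"

print(describe(2001))  # "Data integrity (error 01)"
```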

Table 81. Error Codes
Code Description

1001

The endpoint or HTTP Method is undefined or disabled

1002

The requested bucket is missing

1003

The requested resource is missing

2001

The entity already exists (same name or any other combination of fields defined as unique)

2002

The entity to delete is referenced by other entities

3001

The input data is corrupted

3002

The input data is missing or invalid

3003

The input data is too large

4001

The action could not be executed due to the current state of the system

4002

The action was terminated due to a timeout

4003

The Core DSL expression could not be executed

7001

The action could not be executed due to the permissions of the user

8001

Could not establish connection to an external service

8002

The external service returned an error

9901

Custom user error

9999

Undefined error

Appendix D: Metrics

Each Discovery component publishes metrics regarding health, performance, and Discovery-specific workloads. This page aims to give the user a head start in understanding these metrics, by highlighting those most commonly used, their meaning, and the dimensions each of them has. Keep in mind that the majority of these metrics are published, by default, over a time lapse of one minute.

Dimensions

Each metric may have dimensions, which can be used as metric filters. These filters help ensure only certain published values are taken into account. As an example, the component dimension, which most metrics have, can be used to narrow down metrics to a specific component or API, like the Ingestion Script Component or the QueryFlow API. Below are some of the most commonly used dimensions; note that there are plenty more, so it is recommended to check them in each desired metric.

Dimension Description

component

The Discovery component that published the metric

cache

Found in cache metrics, it is the specific type of cache that produced the metric, e.g. script, endpoint, etc.

result

Found in metrics that measure an operation, it refers to the operation’s result, such as a job status or a cache hit or miss

seed.id

Found in metrics related to an Ingestion Seed Execution, it indicates the seed’s ID

Common Metrics

These metrics are published by most Discovery Components and are related to more than one product.

Metric Name Description Dimensions

jvm.threads.throttling.monitor.count

The number of threads currently being throttled. Mostly used to monitor the throttler service used in Ingestion Components

component

cache.gets.count

The number of times a cache was called in the last time lapse

cache
component
result (can be a hit or a miss)

Ingestion Metrics

These are metrics published by Ingestion Components that help monitor a Seed Execution.

Metric Name Description Dimensions

ingestion.session.jobs.value

The number of jobs currently being executed

component
seed.id

ingestion.session.records.count

The number of records collected (i.e. that were processed) in the last time lapse

component
result (the failed result is useful for finding record errors)
seed.id

ingestion.jobs.avg

The average time, in milliseconds, that it took to execute jobs completed in the last time lapse

component
result
seed.id

ingestion.jobs.count

The number of jobs completed in the last time lapse

component
result (the failed result can be used to find failed jobs)
seed.id

More information about default metrics published by all Discovery products can be found in the Micronaut-Micrometer documentation, in the Provided Binders sub-section.