Discovery Reference Guide
Version 2.8.0
The Pureinsights Discovery Platform is a cloud-based family of products designed to help with the creation, maintenance, and monitoring of production-ready AI applications.
There is no single recipe that fits all: each collection, each use case, and each implementation is different in its own way, and evolving from a prototype to a live solution is not a trivial task. This is where Discovery shows its value:
- An architecture that follows a pay-as-you-go model for only the resources required by the specific use case.
- A no-code approach with building blocks configured through finite-state machines, providing flexibility while reducing the hassle of developer-related tasks such as error handling and orchestration of services.
- Configuration changes and tuning happen on the fly, without the downtime of redeploying.
- Data processing pipelines to extract, transform, and load collections from different sources (ETL).
- Data storage as a "push" model alternative to traditional ETL solutions.
- Custom REST endpoints with advanced capabilities that adapt to the complexities of processing a query with minimum overhead.
- Observability with standard monitoring and alerting tools.
What’s new in 2.8.0?
MCP Server in Discovery QueryFlow
The Model Context Protocol (MCP) is now available in Discovery QueryFlow with support for custom MCP Servers exposed through entrypoints.
New Oracle Database Integration
It is now possible to create servers to connect to an Oracle Database. The new type uses the Oracle JDBC driver to interact with an Oracle database and execute SQL statements. There is also a new Oracle Database component to query tables and retrieve their values from the database.
New SMB Integration
It is now possible to create servers to connect to an SMB server. There is also a new Filesystem component, which allows crawling SMB servers to extract their content.
New Schedules API in Discovery Ingestion
A new Schedules API has been exposed to allow the automatic execution of Seeds based on a cron expression.
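As an illustrative sketch only (the payload fields shown here are assumptions, not documented in this guide), a schedule could pair a Seed with a standard five-field cron expression:

```json
{
  "name": "nightly-crawl",
  "seed": "<Seed ID>",
  "cron": "0 2 * * *"
}
```

With this hypothetical payload, the cron expression 0 2 * * * would execute the Seed every day at 02:00.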
Staging Component in Discovery Ingestion now supports Incremental Scan
The Ingestion Staging Component now supports Incremental Scan using the Checksum identification mechanism.
New Merge Function in Expression Language
It is now possible to use a merge() function with the Expression Language to merge two maps or arrays.
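As a hypothetical illustration (the precedence rules for conflicting keys are not specified in this guide), a call in the Expression Language might look like:

```
merge({"a": 1, "b": 2}, {"c": 3})
```

For maps, the expected result would contain the keys of both arguments ({"a": 1, "b": 2, "c": 3} in this case); for arrays, the elements of both arguments.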
New LDAP Integration
It is now possible to create servers to connect to an LDAP server. There is also a new LDAP component to run searches on an LDAP server and retrieve users and groups from the directory.
New SharePoint Online Ingestion Component
A new Ingestion SharePoint Online Component has been added, to crawl Sites, Subsites, Lists and ListItems from SharePoint Online. With the scan action you can retrieve the information and even download the files and attachments from the ListItems.
Breaking changes
New root path for configuring REST Endpoints in Discovery QueryFlow
In order to prepare for new "entrypoints" in Discovery QueryFlow, the configuration of REST Endpoints has moved from the /v2/endpoint root path to /v2/entrypoint/endpoint.
Ingestion Staging Component Scan now only retrieves records with the action STORE
The Ingestion Staging Component now only retrieves records with the action STORE, instead of allowing a choice of which actions to retrieve. The actions field has been removed from the configuration.
Basics
Discovery is a platform composed of three products:
- Discovery Ingestion: a distributed ETL, supported by a finite-state machine that represents the data transformation and loading.
- Discovery Staging: an abstraction for a Document Database, where collections are represented as buckets with an HTTP interface for simple CRUD operations.
- Discovery QueryFlow: a configurable REST API with custom endpoints, supported by a finite-state machine that represents the query processing.
All products are supported by the Discovery Core Libraries and API: a common layer for shared concepts for retries, error handling, autoscaling and metrics/logging.
Each independent product has its own value and can exist by itself (although both Discovery Ingestion and Discovery QueryFlow require Discovery Staging for their internal operations). However, using them together brings all the tools to create an end-to-end solution.
Architecture
One of the main goals of Discovery is to be cloud-agnostic: using the native resources of each cloud provider without affecting the application itself.
This decision has a direct impact on the overall architecture, as external services must be abstracted in a way that lets them be mapped to a managed service available on each cloud provider.
- The relational database is the main storage for configurations, metadata, and the state of the multiple executions.
- The document database is the service abstracted by Discovery Staging. The default implementation for all cloud providers is MongoDB Atlas.
- The object storage supports the file server and data processing of binary files in Discovery Ingestion.
- The message queue handles asynchronous communication between the components.
- The secrets manager stores secure information such as passwords and credentials. The default implementation is an internal secrets provider.
- The search service is an optional addition to Discovery installations that enables features such as full-text search for entities (i.e. the /search and /autocomplete endpoints) and advanced solutions such as Search Analytics dashboards. It supports Elasticsearch and OpenSearch.
The Discovery Core Libraries and API are an interface for every Discovery product to interact with these external services. However, despite this standardization, each independent product is designed around its own needs: Discovery QueryFlow and Discovery Staging are monoliths due to their need for fast responses, while Discovery Ingestion follows an event-driven architecture that targets scalability, with distributed components processing data in parallel.
AWS
Discovery can be integrated with Amazon Web Services (AWS) with managed services that natively support the installation requirements.
The application is deployed using Amazon Elastic Container Service and AWS Lambda in a private subnet, later exposed with Amazon API Gateway and Amazon Route 53.
Other services such as Amazon EventBridge and Amazon CloudWatch support the correct control, autoscaling and monitoring of all components.
Monitoring
All Discovery products constantly publish metrics to a selected monitoring and observability tool.
Integrations
Integrations to external services are represented by Servers, optionally authenticated with a Credential that references an encrypted Secret.
They are reusable configurations that can later be referenced in Discovery Ingestion Components and Discovery QueryFlow Components.
Connecting to an external service
Servers API
$ curl --request POST 'core-api:12010/v2/server' --data '{ ... }'
$ curl --request GET 'core-api:12010/v2/server'
$ curl --request GET 'core-api:12010/v2/server/{id}'
$ curl --request GET 'core-api:12010/v2/server/{id}/ping'
$ curl --request PUT 'core-api:12010/v2/server/{id}' --data '{ ... }'
Note: The type of an existing server can’t be modified.
$ curl --request DELETE 'core-api:12010/v2/server/{id}'
$ curl --request POST 'core-api:12010/v2/server/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Server
$ curl --request POST 'core-api:12010/v2/server/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'core-api:12010/v2/server/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search
A server has the properties to create an authenticated connection to an external service:
{
"type": "my-external-service",
"name": "My External Service Configuration",
"config": {
...
},
...
}
type-
(Required, String) The type of the supported external service.
name-
(Required, String) The unique name to identify the external service.
description-
(Optional, String) The description for the configuration.
config-
(Required, Object) The configuration to connect to the external service.
credential-
(Optional, UUID) The ID of the credential to authenticate in the external service.
proxy-
(Optional, UUID) The ID of a proxy server entity used to route requests to the external service. Note that the proxy is another server entity.
certificates-
(Optional, Object) The custom certificates for encrypted connection (SSL/TLS), loaded using the secret service. The value can be either the string with the secret of the certificate, or a detailed configuration.
Details

{
  "certificates": {
    "sample-a": { "type": "X.509", "value": "CERTIFICATE_SECRET" },
    "sample-b": { "value": "CERTIFICATE_SECRET" },
    "sample-c": "CERTIFICATE_SECRET",
    "sample-d": { "type": "X.509", "value": { "secret": "CERTIFICATE_SECRET", "key": "field" } },
    "sample-e": { "secret": "CERTIFICATE_SECRET", "key": "field" }
  },
  ...
}

type-
(Optional, String) The type of certificate. Defaults to X.509.
value-
(Required, String or Object) The secret of the certificate in the secret service. This can be a simple string with the secret name, or an object with the secret name and the key that specifies the field to read in the secret.
Note: The existence of the certificate secret will be verified.
keys-
(Optional, Object) The keys with their respective certificate chain for encrypted connection (mTLS), loaded using the secret service.
Details

{
  "keys": {
    "keyA": {
      "value": "SECRET_KEY",
      "certificateChain": [
        { "value": "SECRET_CERT0" },
        { "type": "X.509", "value": "SECRET_CERT1" },
        { "type": "X.509", "value": { "secret": "SECRET_CERT2", "key": "field" } },
        { "secret": "SECRET_CERT3", "key": "field" },
        "SECRET_CERT4"
      ]
    }
  }
}

value-
(Required, String or Object) The secret of the key in the secret service. The contents are expected to be a PKCS8-encoded key in PEM format. This can be a simple string with the secret name, or an object with the secret name and the key that specifies the field to read in the secret.
certificateChain-
(Required, List) One or more certificates associated to the key.
Note: The existence of the key secret will be verified.
circuitBreaker-
(Optional, Object) The circuit breaker configuration as a mechanism to handle request errors and limitations of an external service.
Details

{
  "circuitBreaker": {
    "waitInOpenState": "90s",
    "maxTestRequests": 1
  },
  ...
}

waitInOpenState-
(Optional, Duration) The maximum time to wait in OPEN state before transitioning to HALF_OPEN state.
maxTestRequests-
(Optional, Integer) The maximum number of requests allowed in HALF_OPEN state.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details

{
  "labels": [
    { "key": "My Label Key", "value": "My Label Value" },
    ...
  ],
  ...
}

key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Authentication and Credentials
Credentials API
$ curl --request POST 'core-api:12010/v2/credential' --data '{ ... }'
$ curl --request GET 'core-api:12010/v2/credential'
$ curl --request GET 'core-api:12010/v2/credential/{id}'
Note: Reading credentials will never expose the referenced secret.
$ curl --request PUT 'core-api:12010/v2/credential/{id}' --data '{ ... }'
Note: The type of an existing credential can’t be modified.
$ curl --request DELETE 'core-api:12010/v2/credential/{id}'
$ curl --request POST 'core-api:12010/v2/credential/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Credential.
$ curl --request POST 'core-api:12010/v2/credential/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'core-api:12010/v2/credential/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search.
A credential references a secret with the authentication parameters required to connect to an external service:
{
"type": "my-external-service",
"name": "My External Service Credential",
"secret": "MY_SECRET",
...
}
When the secret provider is internal, it is possible to create a secret during the creation of the credential:
{
"type": "my-external-service",
"name": "My External Service Credential",
"secret": {
"name": "MY_SECRET",
"content": {
"username": <username>,
"password": <password>,
},
...
},
...
}
Note: It is assumed that the referenced secret exists and has the correct JSON-formatted authentication information. However, this is only a soft reference, and any deletion of secret keys won’t be noticed until the next time the secret is required.
type-
(Required, String) The type of credentials for the supported external service.
name-
(Required, String) The unique name to identify the credentials.
description-
(Optional, String) The description for the configuration.
secret-
(Required, String or Object) Either the secret key to connect to the external service, or an object with the authentication details.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details

{
  "labels": [
    { "key": "My Label Key", "value": "My Label Value" },
    ...
  ],
  ...
}

key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Secrets
Secrets API
$ curl --request POST 'core-api:12010/v2/secret' --data '{ ... }'
$ curl --request GET 'core-api:12010/v2/secret'
$ curl --request GET 'core-api:12010/v2/secret/{id}'
Note: Reading secrets will never expose their encrypted data.
$ curl --request PUT 'core-api:12010/v2/secret/{id}' --data '{ ... }'
$ curl --request DELETE 'core-api:12010/v2/secret/{id}'
A secret is a representation of a secure JSON document. Its content could be anything, but its most common usage is for credentials:
{
"name": "MY_SECRET",
"content": {
"username": <username>,
"password": <password>,
},
...
}
Note: When the secrets are backed by an external service, Discovery won’t expose any CRUD operations for their management.
name-
(Required, String) The unique name to identify the secret.
description-
(Optional, String) The description for the configuration.
content-
(Required, Object) The JSON to securely store.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details

{
  "labels": [
    { "key": "My Label Key", "value": "My Label Value" },
    ...
  ],
  ...
}

key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Discovery Features for Integrations
Circuit Breaker
{
"type": "my-external-service",
"name": "My External Service",
"config": {
...
},
"circuitBreaker": {
...
},
...
}
A Circuit Breaker is a fault-tolerance technique intended to gracefully handle errors when communicating with external services.
Its main goal is to avoid problems such as resource exhaustion due to rate-limiting constraints.
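Combining the fields described in the circuitBreaker property of a Server (the values here are illustrative), a server could wait 90 seconds in OPEN state and then allow a single test request in HALF_OPEN state:

```json
{
  "circuitBreaker": {
    "waitInOpenState": "90s",
    "maxTestRequests": 1
  }
}
```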
DSL
The DSL is a specification aimed at providing a single, unified language for all user interactions with Discovery and its integrations. Given the "simplified" nature of the language, its use as part of any integration will depend on how well it can be adapted to the capabilities of the integration itself.
Any implementation available will be described in detail in its corresponding section.
Ping
A ping is a mechanism to validate the configuration of an integration (both Server and Credentials working together), using the /ping endpoint:
$ curl --request GET 'core-api:12010/v2/server/{id}/ping'
Proxy Server
{
"type": "my-external-service",
"name": "My External Service",
"config": {
...
},
"proxy": <Proxy Server ID>,
...
}
A Proxy Server can be configured for integrations that require routing requests through an intermediate server. To enable this, the ID of the proxy server must be configured as part of the Server.
The proxy is represented as a separate server entity linked to the client server. Its own configuration object can define connection details such as host and port. Additionally, if a credential is associated with the proxy server, it can be used to authenticate with the proxy using fields like username and password.
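Putting the pieces together (the host, port, and names below are placeholders), the proxy itself is just another server entity that the client server references by ID:

```json
{
  "type": "proxy",
  "name": "My Proxy Server",
  "config": {
    "host": "proxy.internal.example.com",
    "port": 8080
  },
  "credential": "<Credential ID with the proxy authentication>"
}
```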
TLS/mTLS
{
"type": "my-external-service",
"name": "My External Service",
"config": {
...
},
"certificates": {
...
},
...
}
In scenarios where security is a primary concern, it might be required to use custom certificates that validate the identity of any of the parties involved in the communication.
This handshake, either one-way (TLS) or mutual (mTLS), can be specified as part of the Server configuration.
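For example (the certificate name and secret are placeholders), a one-way TLS setup can reference a certificate secret directly through the certificates property described earlier:

```json
{
  "type": "my-external-service",
  "name": "My External Service",
  "config": { ... },
  "certificates": {
    "my-ca": { "type": "X.509", "value": "CA_CERTIFICATE_SECRET" }
  }
}
```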
Supported Services
Amazon Bedrock
{
"type": "amazon-bedrock",
"name": "My Amazon Bedrock Server",
"config": {
...
},
"credential": <Amazon Bedrock Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | Yes |
| DSL | No |
| Ping | No |
| Proxy Server | No |
| TLS/mTLS | No |
region-
(Required, String) The AWS region.
apiCallTimeout-
(Optional, Duration) The maximum duration allowed for a complete API call.
connection-
(Optional, Object) The configuration of the connection to Amazon Bedrock.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
backoffPolicy-
(Optional, Object) The configuration for retries to Amazon Bedrock.
Details
type-
(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.
initialDelay-
(Optional, Duration) The initial delay before retrying. Defaults to 50ms.
retries-
(Optional, Integer) The maximum number of retries. Defaults to 5.
Authentication
{
"type": "aws",
"name": "My Amazon Bedrock Credentials",
"secret": "MY_AMAZON_BEDROCK_SECRET"
}
{
"name": "MY_AMAZON_BEDROCK_SECRET",
"content": {
...
}
}
accessKeyId-
(Required, String) The ID of your access key, used to identify the user.
secretAccessKey-
(Required, String) The secret access key, used to authenticate the user.
sessionToken-
(Optional, String) The session token from an AWS token service, used to verify that this user has received temporary permission to access some resource.
expirationTime-
(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown, but the identity may still expire at some point.
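A filled-in secret content (all values are placeholders) would therefore look like:

```json
{
  "name": "MY_AMAZON_BEDROCK_SECRET",
  "content": {
    "accessKeyId": "<access key ID>",
    "secretAccessKey": "<secret access key>",
    "sessionToken": "<session token>"
  }
}
```

The sessionToken entry can be omitted when permanent credentials are used.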
Amazon S3
{
"type": "amazon-s3",
"name": "My Amazon S3 Server",
"config": {
...
},
"credential": <Amazon S3 Credential ID>, (1)
"proxy": <Amazon S3 Proxy Server ID> (2)
}
| 1 | Optional. See the Authentication section. |
| 2 | Optional. See the Proxy section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | No |
| Ping | No |
| Proxy Server | Yes |
| TLS/mTLS | No |
region-
(Required, String) The AWS region.
retries-
(Optional, Integer) The number of retry attempts for failed requests to S3.
connection-
(Optional, Object) The connection configuration.
Details
{
  "connection": {
    "timeout": "5s",
    "health": {
      "minimumThroughputInBps": 1,
      "minimumThroughputTimeout": "5s"
    }
  },
  ...
}

timeout-
(Optional, Duration) The timeout duration for establishing a connection to S3.
health-
(Optional, Object) The health connection configuration.
Details
minimumThroughputInBps-
(Required, Long) The minimum throughput in bytes per second that is considered healthy for a connection.
minimumThroughputTimeout-
(Required, Duration) The timeout duration used to evaluate if the minimum throughput has been met.
accelerate-
(Optional, Boolean) Enables S3 Transfer Acceleration to speed up uploads.
checksumValidationEnabled-
(Optional, Boolean) Enables validation of checksums during uploads and downloads.
crossRegionAccessEnabled-
(Optional, Boolean) Allows the S3 client to access buckets in different regions from the one specified.
forcePathStyle-
(Optional, Boolean) Forces the use of path-style access (s3.amazonaws.com/bucket) instead of virtual-hosted-style (bucket.s3.amazonaws.com).
maxConcurrency-
(Optional, Integer) The maximum number of concurrent S3 requests allowed.
minimumPartSizeInBytes-
(Optional, Long) The minimum size, in bytes, for each part in a multipart upload. Affects upload efficiency and limits.
initialReadBufferSizeInBytes-
(Optional, Long) The initial buffer size, in bytes, used when reading data from S3.
thresholdInBytes-
(Optional, Long) The file size threshold, in bytes, above which a multipart upload is initiated.
targetThroughputInGbps-
(Optional, Double) The target throughput for the client in gigabits per second, used to optimize performance.
endpointOverride-
(Optional, String) Overrides the default S3 endpoint with a custom URI. Useful for local testing or VPC endpoints.
Authentication
{
"type": "aws",
"name": "My Amazon S3 Credentials",
"secret": "MY_AMAZON_S3_SECRET"
}
{
"name": "MY_AMAZON_S3_SECRET",
"content": {
...
}
}
accessKeyId-
(Required, String) The ID of your access key, used to identify the user.
secretAccessKey-
(Required, String) The secret access key, used to authenticate the user.
sessionToken-
(Optional, String) The session token from an AWS token service, used to verify that this user has received temporary permission to access some resource.
expirationTime-
(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown, but the identity may still expire at some point.
Proxy Server
{
"type": "amazon-s3",
"name": "My Amazon S3 Server",
"config": {
...
},
"proxy": <Proxy Server ID>,
...
}
{
"type": "proxy",
"name": "My Amazon S3 Proxy Server",
"config": {
...
},
"credential": <Credential ID with the proxy authentication>
}
host-
(Required, String) The hostname of the proxy server.
port-
(Required, Integer) The port of the proxy server.
{
"type": "proxy",
"name": "My Amazon S3 Proxy Server Credentials",
"secret": "MY_AMAZON_S3_PROXY_SECRET"
}
{
"name": "MY_AMAZON_S3_PROXY_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
Elasticsearch
{
"type": "elasticsearch",
"name": "My Elasticsearch Server",
"config": {
...
},
"credential": <Elasticsearch Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | Yes |
| Ping | Yes |
| Proxy Server | No |
| TLS/mTLS | Yes |
servers-
(Required, Array of Strings) The URI for the Elasticsearch installation. Multiple servers will be invoked in round-robin.
pathPrefix-
(Optional, String) The path prefix to add to the servers on each call.
connection-
(Optional, Object) The configuration of the HTTP connection to Elasticsearch.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
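A filled-in config for the servers variant (URIs and values are placeholders) might look like:

```json
{
  "type": "elasticsearch",
  "name": "My Elasticsearch Server",
  "config": {
    "servers": ["https://es-node-1:9200", "https://es-node-2:9200"],
    "pathPrefix": "/search",
    "connection": {
      "connectTimeout": "30s",
      "pool": { "size": 10 }
    }
  }
}
```

With two entries in servers, requests would alternate between the nodes in round-robin.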
cloudId-
(Required, String) The ID of the instance in Elastic Cloud.
connection-
(Optional, Object) The configuration of the HTTP connection to Elasticsearch.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
Authentication
{
"type": "elasticsearch",
"name": "My Elasticsearch Credentials",
"secret": "MY_ELASTICSEARCH_SECRET"
}
{
"name": "MY_ELASTICSEARCH_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
{
"name": "MY_ELASTICSEARCH_SECRET",
"content": {
...
}
}
token-
(Required, String) The token of the credentials.
{
"name": "MY_ELASTICSEARCH_SECRET",
"content": {
...
}
}
apiKey-
(Required, String) The API key of the credentials.
DSL
| Filter | Elasticsearch Query Operator |
|---|---|
Term Query when |
|
Bool Query with |
|
Bool Query with |
|
Hugging Face
{
"type": "hugging-face",
"name": "My Hugging Face Server",
"config": {
...
},
"credential": <Hugging Face Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | No |
| Ping | Yes |
| Proxy Server | No |
| TLS/mTLS | Yes |
servers-
(Required, Array of Strings) The URI for the Hugging Face Inference API service. Multiple servers will be invoked in round-robin.
connection-
(Optional, Object) The configuration of the HTTP connection to Hugging Face.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
Authentication
{
"type": "hugging-face",
"name": "My Hugging Face Credentials",
"secret": "MY_HUGGING_FACE_SECRET"
}
{
"name": "MY_HUGGING_FACE_SECRET",
"content": {
...
}
}
token-
(Required, String) The token of the credentials.
LDAP
{
"type": "ldap",
"name": "My LDAP Server",
"config": {
...
}
}
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | No |
| Ping | Yes |
| Proxy Server | No |
| TLS/mTLS | Yes |
servers-
(Required, Array of Objects) List of LDAP server addresses. If multiple servers are provided, requests are distributed among them using a round-robin strategy.
Details
hostname-
(Required, String) The host of the LDAP server.
port-
(Optional, Integer) The port of the LDAP server. Defaults to 389. When using TLS, this port is usually 636.
connection-
(Optional, Object) The configuration of the connection to the LDAP server.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
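Combining the fields above (hostnames and values are placeholders), a minimal LDAP server configuration might be:

```json
{
  "type": "ldap",
  "name": "My LDAP Server",
  "config": {
    "servers": [
      { "hostname": "ldap1.example.com" },
      { "hostname": "ldap2.example.com", "port": 636 }
    ],
    "connection": { "connectTimeout": "30s" }
  }
}
```

The first entry would use the default port 389; requests are distributed between the two servers in round-robin.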
Authentication
{
"type": "ldap",
"name": "My LDAP Credentials",
"secret": "MY_LDAP_SECRET"
}
{
"name": "MY_LDAP_SECRET",
"content": {
...
}
}
bindDN-
(Required, String) The distinguished name that will be used to authenticate.
password-
(Required, String) The password of the credentials.
MongoDB
{
"type": "mongo",
"name": "My MongoDB Server",
"config": {
...
},
"credential": <MongoDB Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | Yes |
| Ping | Yes |
| Proxy Server | No |
| TLS/mTLS | No |
servers-
(Required, Array of Strings) The connection string for the MongoDB/MongoDB Atlas installation. Multiple servers represent a replica set.
connection-
(Optional, Object) The configuration of the connection to MongoDB.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressors-
(Optional, Array of Strings) A list of data compressors. One of SNAPPY, ZLIB, or ZSTD.
tls-
(Optional, Boolean) true if the connection should be done through SSL. Defaults to false.
retryWrites-
(Optional, Boolean) true if the connection should retry requests. Defaults to true.
Authentication
{
"type": "mongo",
"name": "My MongoDB Credentials",
"secret": "MY_MONGODB_SECRET"
}
{
"name": "MY_MONGODB_SECRET",
"content": {
"mechanism": "SCRAM-SHA-1",
...
}
}
mechanism-
(Required, String) The authentication mechanism. Must be SCRAM-SHA-1.
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
source-
(Required, String) The database name associated with the user’s authentication data. Defaults to admin.
{
"name": "MY_MONGODB_SECRET",
"content": {
"mechanism": "SCRAM-SHA-256",
...
}
}
mechanism-
(Required, String) The authentication mechanism. Must be SCRAM-SHA-256.
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
source-
(Required, String) The database name associated with the user’s authentication data. Defaults to admin.
{
"name": "MY_MONGODB_SECRET",
"content": {
"mechanism": "MONGODB-AWS",
...
}
}
mechanism-
(Required, String) The authentication mechanism. Must be MONGODB-AWS.
accessKeyId-
(Required, String) The AWS access key ID.
secretAccessKey-
(Optional, String) The AWS secret access key.
sessionToken-
(Optional, String) The AWS session token for authentication with temporary credentials when using an AssumeRole request, or when working with AWS resources that specify this value such as Lambda.
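A complete SCRAM-SHA-256 secret (username and password are placeholders) would therefore look like:

```json
{
  "name": "MY_MONGODB_SECRET",
  "content": {
    "mechanism": "SCRAM-SHA-256",
    "username": "discovery-user",
    "password": "<password>",
    "source": "admin"
  }
}
```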
DSL
| Filter | Mongo Query Operator |
|---|---|
Text Operator with a string query |
|
Range Operator with numbers or dates |
|
Range Operator with numbers or dates |
|
Range Operator with numbers or dates |
|
Range Operator with numbers or dates |
|
Range Operator with numbers or dates |
|
Compound Operator with |
|
Compound Operator with |
|
Regex Operator with the keyword analyzer |
Neo4j
{
"type": "neo4j",
"name": "My Neo4j Server",
"config": {
...
},
"credential": <Neo4j Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | No |
| DSL | No |
| Ping | Yes |
| Proxy Server | No |
| TLS/mTLS | No |
server-
(Required, String) The URI to connect to Neo4j, following the supported schemes.
connection-
(Optional, Object) The configuration of the connection to Neo4j.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
Authentication
{
"type": "neo4j",
"name": "My Neo4j Credentials",
"secret": "MY_NEO4J_SECRET"
}
{
"name": "MY_NEO4J_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
OpenAI
{
"type": "openai",
"name": "My OpenAI Server",
"config": {
...
},
"credential": <OpenAI Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| Circuit Breaker | Yes |
| DSL | No |
| Ping | No |
| Proxy Server | No |
| TLS/mTLS | No |
organizationId-
(Optional, String) The Organization ID to be added to the request headers.
connection-
(Optional, Object) The configuration of the connection for OpenAI.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
maxRetries-
(Optional, Integer) The maximum number of retries for each request. Defaults to 2.
baseUrl-
(Optional, String) The custom base URL to connect to the OpenAI service. Defaults to https://api.openai.com/v1.
Authentication
{
"type": "openai",
"name": "My OpenAI Credentials",
"secret": "MY_OPENAI_SECRET"
}
{
"name": "MY_OPENAI_SECRET",
"content": {
...
}
}
apiKey-
(Required, String) The API key of the credentials.
projectId-
(Optional, String) The Project ID to be added to the requests header.
OpenSearch
{
"type": "opensearch",
"name": "My OpenSearch Server",
"config": {
...
},
"credential": <OpenSearch Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | No |
| | Yes |
| | Yes |
| | No |
| | Yes |
servers-
(Required, Array of Strings) The URI for the OpenSearch installation. Multiple servers will be invoked in round-robin.
pathPrefix-
(Optional, String) The path prefix to add to the servers on each call.
connection-
(Optional, Object) The configuration of the HTTP connection to OpenSearch.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
endpoint-
(Required, String) The host to make the request to, without the http://.
signature-
(Required, Object) The signature for the requests.
Details
region-
(Required, String) The AWS region of the service.
serviceName-
(Required, String) The signing service name.
connection-
(Optional, Object) The configuration of the HTTP connection to OpenSearch.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
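To make the round-robin behavior of multiple servers concrete, here is a small illustrative sketch in Python. It is not Discovery's implementation; the class name RoundRobinPool and the example URIs are hypothetical.

```python
from itertools import cycle

class RoundRobinPool:
    """Cycles through a list of server URIs, one URI per request."""

    def __init__(self, servers):
        if not servers:
            raise ValueError("at least one server is required")
        self._servers = cycle(servers)

    def next_server(self):
        # Each call returns the next server in the list, wrapping around.
        return next(self._servers)

pool = RoundRobinPool(["https://node-a:9200", "https://node-b:9200"])
targets = [pool.next_server() for _ in range(4)]
# Requests alternate between node-a and node-b.
```

With two servers, four consecutive calls dispatch to node-a, node-b, node-a, node-b.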
Authentication
{
"type": "opensearch",
"name": "My OpenSearch Credentials",
"secret": "MY_OPENSEARCH_SECRET"
}
{
"name": "MY_OPENSEARCH_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
{
"name": "MY_OPENSEARCH_SECRET",
"content": {
...
}
}
accessKeyId-
(Required, String) The ID of your access key, used to identify the user.
secretAccessKey-
(Required, String) The secret access key, used to authenticate the user.
sessionToken-
(Optional, String) The session token from an AWS token service, used to authenticate that this user has received temporary permission to access some resource.
expirationTime-
(Optional, Duration) The time after which this identity will no longer be valid. If not provided, the expiration is unknown but it may still expire at some time.
DSL
| Filter | OpenSearch Query Operator |
|---|---|
| | Term Query when |
| | Bool Query with |
| | Bool Query with |
Oracle Database
{
"type": "oracledb",
"name": "My Oracle Database Server",
"config": {
...
},
"credential": <Oracle Database Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | No |
| | No |
| | Yes |
| | No |
| | No |
The ping feature is implemented using the SERVER validation mode. For more information, see this chart.
serviceName-
(Required, String) The service name of the Oracle Database. Used to complete the JDBC URL.
jdbcConnector-
(Required, String) The path of the file in the Object Storage that has the Oracle JDBC driver in a .jar format. This file must be in the object storage in order to connect to an Oracle database. The supported driver is the Thin driver, which can be downloaded from the official Oracle website. For Discovery, the driver that supports JDK17 must be used.
servers-
(Required, Array of Objects) The addresses of the Oracle databases.
Details
hostname-
(Required, String) The host of the Oracle database.
port-
(Optional, Integer) The port of the Oracle database. Defaults to 1521.
protocol-
(Optional, String) The listener protocol address. Defaults to TCP. The other supported protocols are TCPS and BEQ.
defaultSchema-
(Optional, String) The default schema to be used by the user in the Oracle database.
fetchSize-
(Optional, Integer) The number of rows to be fetched. Defaults to 10.
loadBalance-
(Optional, String) Enables or disables client load balancing for multiple protocol addresses. Defaults to OFF. The other supported value is ON.
failOver-
(Optional, String) Enables or disables connect-time failover for multiple protocol addresses. Defaults to ON. The other supported value is OFF.
serverType-
(Optional, String) Sets the Oracle Database server architecture. Defaults to DEDICATED. The other supported value is SHARED.
connection-
(Optional, Object) The configuration of the connection to the Oracle database.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
Authentication
{
"type": "oracledb",
"name": "My Oracle Database Credentials",
"secret": "MY_ORACLEDB_SECRET"
}
{
"name": "MY_ORACLEDB_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
SharePoint Online
{
"type": "sharepoint",
"name": "My SharePoint Online Server",
"config": {
...
}
}
| Feature | Supported |
|---|---|
| | Yes |
| | No |
| | No |
| | No |
| | No |
tenantUrl-
(Required, String) The Domain URL of the SharePoint tenant to crawl.
connection-
(Optional, Object) The configuration of the connection to SharePoint Online.
Details
{
  "connection": {
    "connectTimeout": "60s",
    "readTimeout": "60s",
    "pool": {
      "size": 5,
      "keepAlive": "5m"
    },
    "followRedirects": true
  },
  ...
}
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
Authentication
{
"type": "sharepoint",
"name": "My SharePoint Online Credentials",
"secret": "MY_SHAREPOINT_ONLINE_CERTIFICATE"
}
{
"name": "MY_SHAREPOINT_ONLINE_CERTIFICATE",
"content": {
...
}
}
tenantId-
(Required, String) The Azure Tenant ID.
clientId-
(Required, String) The application ID.
certificate-
(Required, String) The contents of your certificate configured in your Entra application.
privateKey-
(Required, String) The contents of your private-key associated to your certificate.
SMB
{
"type": "smb",
"name": "My SMB Server",
"config": {
...
},
"credential": <SMB Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | No |
| | No |
| | Yes |
| | No |
| | No |
servers-
(Required, Array of Strings) The list of the SMB paths, including the share name.
Authentication
{
"type": "smb",
"name": "My SMB Credentials",
"secret": "MY_SMB_SECRET"
}
{
"name": "MY_SMB_SECRET",
"content": {
...
}
}
username-
(Optional, String) The username of the credentials. Defaults to GUEST.
password-
(Optional, String) The password of the credentials. Defaults to an empty string.
domain-
(Optional, String) The domain of the credentials. Defaults to ?.
Solr
{
"type": "solr",
"name": "My Solr Server",
"config": {
...
},
"credential": <Solr Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | No |
| | Yes |
| | Yes |
| | No |
| | No |
servers-
(Required, Array of Strings) The URI for the Solr installation. Multiple servers will be invoked in round-robin.
connection-
(Optional, Object) The configuration of the HTTP connection to Solr.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
Authentication
{
"type": "solr",
"name": "My Solr Credentials",
"secret": "MY_SOLR_SECRET"
}
{
"name": "MY_SOLR_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
DSL
| Filter | Solr Query Operator |
|---|---|
Vespa
{
"type": "vespa",
"name": "My Vespa Server",
"config": {
...
},
"credential": <Vespa Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | Yes |
| | No |
| | Yes |
| | No |
| | Yes |
servers-
(Required, Array of Strings) The URI for the REST service. Multiple servers will be invoked in round-robin.
connection-
(Optional, Object) The configuration of the HTTP connection.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
backoffPolicy-
(Optional, Object) The configuration for retries to the REST service.
Details
type-
(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.
initialDelay-
(Optional, Duration) The initial delay before retrying. Defaults to 50ms.
retries-
(Optional, Integer) The maximum number of retries. Defaults to 5.
scroll-
(Optional, Object) The scroll configuration for paginated requests.
Details
{
  "scroll": {
    "size": 50
  }
}
size-
(Required, String) The size of the scroll request.
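As an illustration of the backoff policy options, the following Python sketch computes the sequence of retry delays for the defaults documented above (EXPONENTIAL, 50ms initial delay, 5 retries). The function name and the doubling multiplier for EXPONENTIAL are assumptions; the actual multiplier used by Discovery is not stated here.

```python
def backoff_delays(policy="EXPONENTIAL", initial_ms=50, retries=5):
    """Return the delay (in ms) to wait before each retry attempt.

    Assumes EXPONENTIAL doubles the delay on every retry; the real
    growth factor may differ.
    """
    if policy == "NONE":
        return []  # fail immediately, no retries delayed
    if policy == "CONSTANT":
        return [initial_ms] * retries
    # EXPONENTIAL: initial delay doubled on each subsequent retry
    return [initial_ms * 2 ** i for i in range(retries)]

print(backoff_delays())  # [50, 100, 200, 400, 800]
```

With the default configuration, a failing request is retried up to five times, waiting 50ms before the first retry and doubling up to 800ms before the last.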
Authentication
{
"type": "vespa",
"name": "My Vespa Credentials",
"secret": "MY_VESPA_SECRET"
}
{
"name": "MY_VESPA_SECRET",
"content": {
...
}
}
username-
(Required, String) The username of the credentials.
password-
(Required, String) The password of the credentials.
{
"name": "MY_VESPA_SECRET",
"content": {
...
}
}
token-
(Required, String) The token of the credentials.
{
"name": "MY_VESPA_SECRET",
"content": {
...
}
}
apiKey-
(Required, String) The API key of the credentials.
Voyage AI
{
"type": "voyage-ai",
"name": "My Voyage AI Server",
"config": {
...
},
"credential": <Voyage AI Credential ID> (1)
}
| 1 | Optional. See the Authentication section. |
| Feature | Supported |
|---|---|
| | Yes |
| | No |
| | No |
| | No |
| | Yes |
connection-
(Optional, Object) The configuration of the connection for Voyage AI.
Details
connectTimeout-
(Optional, Duration) The timeout to connect to the service. Defaults to 60s.
readTimeout-
(Optional, Duration) The timeout to read the first package from the service. Defaults to 60s.
pool-
(Optional, Object) The configuration for the connection pool.
Details
size-
(Optional, Integer) The size of the connection pool. Defaults to 5.
keepAlive-
(Optional, Duration) The duration before evicting a connection from the pool. Defaults to 5m.
compressRequest-
(Optional, Boolean) true if the requests must be compressed.
followRedirects-
(Optional, Boolean) true if redirects must be followed.
backoffPolicy-
(Optional, Object) The configuration of the backoff policy for Voyage AI.
Details
type-
(Optional, String) The type of backoff policy to apply. One of NONE, CONSTANT, or EXPONENTIAL. Defaults to EXPONENTIAL.
initialDelay-
(Optional, Duration) The initial delay before retrying. Defaults to 50ms.
retries-
(Optional, Integer) The maximum number of retries. Defaults to 5.
Authentication
{
"type": "voyage-ai",
"name": "My Voyage AI Credentials",
"secret": "MY_VOYAGE_AI_SECRET"
}
{
"name": "MY_VOYAGE_AI_SECRET",
"content": {
...
}
}
token-
(Required, String) The token of the credentials.
DSL
The Discovery Domain-Specific Language is a standardized definition on how to write JSON expressions that can be applied to all Discovery products.
Filters
"Equals" Filter
The value of the field must be exactly as the one provided.
{
"equals": {
"field": "my-field",
"value": "my-value",
"normalize": true
}
}
When supported, the normalize field enables normalization as described by the filter provider. It is enabled by default.
"Less Than" Filter
The value of the field must be less than the one provided.
{
"lt": {
"field": "my-field",
"value": 1
}
}
"Less Than or Equal to" Filter
The value of the field must be less than or equal to the one provided.
{
"lte": {
"field": "my-field",
"value": 1
}
}
"Between" Filter
The value of the field must be greater than or equal to the "from" value (inclusive), and less than the "to" value (exclusive).
{
"between": {
"field": "my-field",
"from": 1,
"to": 10
}
}
"Greater Than" Filter
The value of the field must be greater than the one provided.
{
"gt": {
"field": "my-field",
"value": 1
}
}
"Greater Than or Equal To" Filter
The value of the field must be greater than or equal to the one provided.
{
"gte": {
"field": "my-field",
"value": 1
}
}
"In" Filter
The value of the field must be one of the provided values.
{
"in": {
"field": "my-field",
"values": [
"my-value-a",
"my-value-b"
]
}
}
"Empty" Filter
Checks if a field is empty:
-
For a collection, true if its size is 0.
-
For a String, true if its length is 0.
-
For any other type, true if it is null.
{
"empty": {
"field": "my-field"
}
}
"Exists" Filter
Checks if a field exists.
{
"exists": {
"field": "my-field"
}
}
"Not" Filter
Negates the inner clause.
{
"not": {
"equals": {
"field": "my-field",
"value": "my-value"
}
}
}
"Null" Filter
Checks if a field is null. Note that while the "exists" filter checks whether the field is present or not, the "null" filter
expects the field to be present but with a null value.
{
"null": {
"field": "my-field"
}
}
"Regex" Filter
Checks if a field matches with the given regex pattern.
{
"regex": {
"field": "my-field",
"pattern": "my-pattern"
}
}
"Boolean" Filter
-
and: All conditions in the list must be evaluated to true.
{
"and": [
{
"equals": {
"field": "my-field-a",
"value": "my-value-a"
}
}, {
"equals": {
"field": "my-field-b",
"value": "my-value-b"
}
}
]
}
-
or: At least one condition in the list must be evaluated to true.
{
"or": [
{
"equals": {
"field": "my-field-a",
"value": "my-value-a"
}
}, {
"equals": {
"field": "my-field-b",
"value": "my-value-b"
}
}
]
}
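To make the filter semantics above concrete, here is a minimal, illustrative evaluator for a subset of the DSL filters, written in Python. It is not part of Discovery: the function name matches is hypothetical, field lookup is flat (no nested paths), and the normalize flag of the "equals" filter is ignored.

```python
def matches(doc, flt):
    """Evaluate a subset of the DSL filters against a flat document (dict)."""
    op, body = next(iter(flt.items()))
    # Boolean composition filters
    if op == "and":
        return all(matches(doc, f) for f in body)
    if op == "or":
        return any(matches(doc, f) for f in body)
    if op == "not":
        return not matches(doc, body)
    # Field-level filters
    field = body["field"]
    value = doc.get(field)
    if op == "equals":
        return value == body["value"]
    if op == "lt":
        return value is not None and value < body["value"]
    if op == "lte":
        return value is not None and value <= body["value"]
    if op == "gt":
        return value is not None and value > body["value"]
    if op == "gte":
        return value is not None and value >= body["value"]
    if op == "between":  # "from" is inclusive, "to" is exclusive
        return value is not None and body["from"] <= value < body["to"]
    if op == "in":
        return value in body["values"]
    if op == "exists":
        return field in doc
    raise ValueError(f"unsupported filter: {op}")

doc = {"my-field": 5}
assert matches(doc, {"between": {"field": "my-field", "from": 1, "to": 10}})
assert not matches(doc, {"not": {"gte": {"field": "my-field", "value": 1}}})
```

A real provider translates these filters into its own query operators (as shown in the per-provider DSL tables); this sketch only fixes the intended boolean semantics.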
Projections
A projection allows you to select which fields (attributes) of a document are returned by a request:
-
If no includes or excludes fields are defined, all fields are returned.
-
If only the includes fields are defined, only those fields are returned.
-
If only the excludes fields are defined, all available fields, except the ones in the exclusions are returned.
-
If both includes and excludes fields are defined, both are included in the projection.
|
Note
|
The details of how projections are processed might vary between use cases of the DSL and/or providers, especially when it comes to projections with both included and excluded fields. It’s recommended to check the documentation of the specific component or API that’s going to be used for details like projections that aren’t allowed. |
{
"includes": ["my-field-a", "my-field-b"],
"excludes": ["my-field-c", "my-field-d"]
}
includes-
(Array of Strings) The list of fields to include in the projection.
excludes-
(Array of Strings) The list of fields to exclude from the projection.
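The include/exclude rules above can be sketched in Python. This is an illustrative interpretation only (the function name project is hypothetical, it handles top-level fields only, and it applies includes before excludes, which, as the note above says, may differ between providers):

```python
def project(doc, includes=None, excludes=None):
    """Apply a DSL-style projection to a flat document (top-level fields)."""
    if includes:
        # Keep only the explicitly included fields.
        doc = {k: v for k, v in doc.items() if k in includes}
    if excludes:
        # Drop the explicitly excluded fields.
        doc = {k: v for k, v in doc.items() if k not in excludes}
    return doc

doc = {"my-field-a": 1, "my-field-b": 2, "my-field-c": 3}
assert project(doc, includes=["my-field-a", "my-field-b"]) == {"my-field-a": 1, "my-field-b": 2}
assert project(doc, excludes=["my-field-c"]) == {"my-field-a": 1, "my-field-b": 2}
assert project(doc) == doc  # no includes/excludes: everything is returned
```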
Sort
A collection of sort clauses to be applied in the given order, following the rules of the corresponding implementation.
[
{
"property": "sort-field-a",
"direction": "ASC"
},
{
"property": "sort-field-b",
"direction": "DESC"
}
]
property-
(String) The property the sort is applied to.
direction-
(String) The direction of the applied sorting. Either ASC or DESC.
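The "applied in the given order" rule can be reproduced with Python's stable sort: sorting by the clauses in reverse order makes the first clause the primary key. This sketch is illustrative (the name apply_sort is hypothetical), not Discovery code:

```python
def apply_sort(docs, clauses):
    """Apply sort clauses in priority order.

    Python's sort is stable, so sorting by the last clause first and the
    first clause last yields the same result as a multi-key sort.
    """
    for clause in reversed(clauses):
        docs = sorted(docs,
                      key=lambda d: d[clause["property"]],
                      reverse=clause["direction"] == "DESC")
    return docs

docs = [{"a": 1, "b": 2}, {"a": 1, "b": 9}, {"a": 0, "b": 5}]
result = apply_sort(docs, [{"property": "a", "direction": "ASC"},
                           {"property": "b", "direction": "DESC"}])
# Sorted by "a" ascending first, ties broken by "b" descending.
assert result == [{"a": 0, "b": 5}, {"a": 1, "b": 9}, {"a": 1, "b": 2}]
```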
Expression Language
The Expression Language is a flexible but simple way to manage and handle configurations. In a JSON document, expressions allow values to be left dynamic so they can later be processed in context:
{
"dynamicField": "#{ first_math_function('input') + second_math_function('input') }",
"staticField": "value"
}
As shown in the previous example, the syntax of an expression is one or multiple constants, operators and functions wrapped between the #{ and } tokens.
|
Note
|
The Expression Language is case-sensitive and all functions are defined in snake_case. |
Constants
| Constant | Value |
|---|---|
| NULL | |
| Constant | Value |
|---|---|
| PI | 3.14159265... |
| E | 2.71828182... |
| Constant | Value |
|---|---|
| TRUE | |
| FALSE | |
Operators
| Operator | Token |
|---|---|
| Plus | + |
| Minus | - |
| Multiplication | * |
| Division | / |
| Power of | ^ |
| Modulo | % |
| Operator | Token |
|---|---|
| Equals | = |
| Equals | == |
| Not equals | <> |
| Not equals | != |
| Greater than | > |
| Greater than or equal to | >= |
| Less than | < |
| Less than or equal to | <= |
| Operator | Token |
|---|---|
| And | && |
| Or | \|\| |
| Not | ! |
| Operator | Token |
|---|---|
| Plus | + |
| Minus | - |
| Operator | Token |
|---|---|
| Concat | + |
Functions
| Function | Description | Example |
|---|---|---|
|
Returns the first non-null value, or null if there are none |
|
| Function | Description | Example |
|---|---|---|
|
Returns the absolute value of a value |
|
|
Rounds a number towards positive infinity |
|
|
Returns the factorial of a number |
|
|
Rounds a number towards negative infinity |
|
|
Performs the logarithm with base e on a value |
|
|
Performs the logarithm with base 10 on a value |
|
|
Returns the highest value from all the parameters provided |
|
|
Returns the lowest value from all the parameters provided |
|
|
Returns a random number between 0 and 1 |
|
|
Rounds a decimal number to a specified scale |
|
|
Returns the sum of the parameters |
|
|
Returns the square root of the value provided |
|
| Function | Description | Example |
|---|---|---|
|
Returns the arc-cosine in degrees |
|
|
Returns the hyperbolic arc-cosine in degrees |
|
|
Returns the arc-cosine in radians |
|
|
Returns the arc-co-tangent in degrees |
|
|
Returns the hyperbolic arc-co-tangent in degrees |
|
|
Returns the arc-co-tangent in radians |
|
|
Returns the arc-sine in degrees |
|
|
Returns the hyperbolic arc-sine in degrees |
|
|
Returns the arc-sine in radians |
|
|
Returns the arc-tangent in degrees |
|
|
Returns the angle of arc-tangent2 in degrees |
|
|
Returns the angle of arc-tangent2 in radians |
|
|
Returns the hyperbolic arc-tangent in degrees |
|
|
Returns the arc-tangent in radians |
|
|
Returns the cosine in degrees |
|
|
Returns the hyperbolic cosine in degrees |
|
|
Returns the cosine in radians |
|
|
Returns the co-tangent in degrees |
|
|
Returns the hyperbolic co-tangent in degrees |
|
|
Returns the co-tangent in radians |
|
|
Returns the co-secant in degrees |
|
|
Returns the hyperbolic co-secant in degrees |
|
|
Returns the co-secant in radians |
|
|
Converts an angle from radians to degrees |
|
|
Converts an angle from degrees to radians |
|
|
Returns the sine in degrees |
|
|
Returns the hyperbolic sine in degrees |
|
|
Returns the sine in radians |
|
|
Returns the secant in degrees |
|
|
Returns the hyperbolic secant in degrees |
|
|
Returns the secant in radians |
|
|
Returns the tangent in degrees |
|
|
Returns the hyperbolic tangent in degrees |
|
|
Returns the tangent in radians |
|
| Function | Description | Example |
|---|---|---|
|
Formats a date/time with a given pattern as described in Date/Time |
|
|
Gets the current datetime |
|
to_date(string) |
Parses the input pattern as described in Date/Time |
|
to_date(number, number, number, number?, number?) |
When providing a set of integers, year, month and day are required. Hour, minute and second are all optional. The order of the parameters must be as previously mentioned |
|
| Function | Description | Example |
|---|---|---|
|
Conditional operation where if the boolean expression evaluates to |
|
|
Negates a boolean expression |
|
| Function | Description | Example |
|---|---|---|
|
Converts String to lower case |
|
|
Converts String to upper case |
|
|
Verifies with a boolean if a string begins with a given substring. Case sensitivity can optionally be specified. If the case sensitivity flag is not sent, it will be set to |
|
|
Verifies with a boolean if a string ends with a given substring. Case sensitivity can optionally be specified. If the case sensitivity flag is not sent, it will be set to |
|
|
Returns a boolean specifying if a string matches a given pattern |
|
|
Replaces all appearances of a regex pattern in the first string with the third string |
|
|
Returns a boolean specifying if a String is empty |
|
|
Whether the variable is a blank String |
|
|
Returns the length of a given string |
|
|
Concatenates a given set of strings |
|
|
Splits a String into a List by a regex value |
|
|
Strips the punctuation, replacing it by a space |
|
|
Returns a boolean specifying if the first string contains the second one |
|
|
Generates a random UUID v4 |
|
|
Returns the number represented by a string |
|
| Function | Description | Example |
|---|---|---|
|
Hashes a given object using MD5 |
|
|
Hashes a given object using SHA-256 |
|
| Function | Description | Example |
|---|---|---|
|
Returns a boolean specifying if a list is empty |
|
|
Returns the amount of items in the list |
|
|
Returns a boolean specifying if the list contains the value |
|
|
Returns the value in the given index |
|
|
Joins a given set of arrays into one |
|
|
Merges the source array into the destination array |
|
|
Note
|
An alternative syntax to access the value at the position |
| Function | Description | Example |
|---|---|---|
|
Returns a boolean specifying if a map is empty |
|
|
Returns the amount of items in the map |
|
|
Returns a boolean specifying if the map contains the key |
|
|
Returns the value with the given key |
|
|
Merges the source map into the destination map. If a key exists in both maps, the value from the source map overwrites the destination value. |
|
| Function | Description | Example |
|---|---|---|
|
Finds a specific value within a JSON with a JSONPath string |
|
|
Converts a String into its corresponding JSON representation |
|
| Function | Description | Example |
|---|---|---|
|
Reads a file from the storage. The file can optionally be obtained as a byte array representation if the |
|
Script Engine
The Script Engine enables the execution of scripts for advanced handling of execution data. Supports multiple scripting languages and provides tools for JSON manipulation and logging:
Bindings
Each script has bindings to interact with the execution context where it runs:
data-
(Object) Allows the creation and manipulation of JsonNode instances. Also, if the script runs as part of Discovery Ingestion or Discovery QueryFlow, the binding will expose the corresponding context data.
Details
| Method | Description |
|---|---|
| ArrayNode arrayNode() | Returns a new JSON array |
| JsonNode get(String) | Obtains a deep copy of the nodes from the data generated during the execution; takes the path to the value or node. The input must be the JSON Pointer of the data field, for example: data.get("/myfield") |
| NullNode nullNode() | Returns a new null node |
| ObjectNode objectNode() | Returns a new JSON object |
| JsonNode output() | Obtains the value that will be used as output for the component |
| JsonNode parseJson(String) | Takes a JSON with a String format and parses it into a JSON document |
| JsonNode valueToJson(Object) | Takes any object and tries to convert it into a JSON document |
| void set(parameter) | Sets the output field to a primitive type (integer, long, str, float, double), or a JsonNode. The method will infer the type of parameter in languages that are dynamically typed, such as Python. If the use case needs it, a cast may help in controlling the output node |
| byte[] file(String) | Reads files or binary data from the context data or the File Storage and returns the contents. Returns null if not found |
log-
(Object) Supports a SLF4J logger that can log messages directly into the application
value = 5
if (value <= 10):
log.error("Example of an error")
var response = data.objectNode();
response.put("intValue", 3);
log.info("Node set with value: " + data.output().get("intValue").asText());
var requestBody = data.get("/numberTest").asInt();
data.set(requestBody);
Template Engine
The Template Engine, provided by Freemarker, acts as a blueprint that uses a given input to generate various types of documents as either plain text or JSON.
Supported in both Discovery Ingestion and Discovery QueryFlow, it can take a standard template and process it with contextual structured data to create a verbalized representation of the information.
Consider the following input data:
{
"name": "mary",
"users": [
{
"name": "Jane Doe",
"id": 0
},
{
"name": "Mary",
"id": 2
},
{
"name": "Alice",
"id": 3
}
]
}
When processed with the following template:
Hello, ${name?capitalize}!
Users registered:
<#list users as user>
<#if user.id == 0>
name: admin, ID: ${user.id}
<#else>
Name: ${user.name}, ID: ${user.id}
</#if>
</#list>
Then, the output would be:
Hello, Mary!
Users registered:
name: admin, ID: 0
Name: Mary, ID: 2
Name: Alice, ID: 3
-
Hello, ${name?capitalize}! will output a greeting to the name specified in the JSON data and capitalize the first letter. Given the JSON data, it will output Hello, Mary! because mary is now capitalized.
-
<#list users as user> is a directive that iterates over the users array in the JSON data.
-
<#if user.id == 0> within the list, checks if the user’s id is 0. If true, it outputs name: admin, ID: 0 instead of using the user’s actual name.
-
<#else> for all other users (where id is not 0), it outputs Name: user.name, ID: user.id.
Template Language
Placeholders
Placeholders are references to the data model passed to the template.
Syntax:
${variableName}
Example data model:
{
"name": "Mary"
}
Example:
${name}
Output: Mary
Comments
Comments are a way to add notes or explanations within your templates.
Syntax:
<#-- Comment -->
Example:
<#-- Hello this is my comment -->
Directives
These are instructions that control the processing flow of the template (like loops and conditionals). The full list of the directives can be found in the Directive reference of the Freemarker documentation.
Assign
Used to define a variable.
Syntax:
<#assign name1=value1>
Example:
<#assign x=1>
Attempt, Recover
Used for error handling in templates. The attempt block executes code that might fail; the recover block defines what to do if an error occurs in the attempt block.
Syntax:
<#attempt>
attempt block
<#recover>
recover block
</#attempt>
Example:
<#attempt>
${user.name}
<#recover>
Unknown User
</#attempt>
Function, Return
Used to create a method variable. It must contain a return directive that specifies the return value of the method.
Syntax:
<#function name param1 param2 ... paramN>
...
<#return returnValue>
...
</#function>
Example:
<#function avg x y>
<#return (x + y) / 2>
</#function>
${avg(10, 20)}
Output: 15.
Global
Used to define a variable for all namespaces.
Syntax:
<#global name=value>
Example:
<#global x=1>
If, else, elseif
Used to conditionally skip a section of the template.
Syntax:
<#if condition>
...
<#elseif condition2>
...
<#else>
...
</#if>
Example:
<#if x == 1>
x is 1
<#elseif x == 2>
x is 2
<#else>
x is not 1 nor 2
</#if>
Import
Used to bring all macros and functions from another template file into the current template namespace. Path is the path of the template file to import, and hash is the name of the variable by which you can access the namespace.
Syntax:
<#import path as hash>
Example:
<#import "library.ftl" as lib>
Include
Used to include the content of another template file into the current template output. It does not make macros or functions from the included file available in the current namespace. Path is the path of the template file to include.
Syntax:
<#include path>
Example:
<#include "header.ftl">
List
Used for iterating over a collection.
-
else: Within a list, it is used to specify output if the list is empty.
-
items: It refers to the current item in the iteration.
-
sep: Used to output something between items, like a separator.
-
break: Exits the loop prematurely.
-
continue: Skips the current iteration and moves to the next item.
Syntax:
<#list sequence as item>
Part repeated for each item
</#list>
Example:
<#list users as user>
${user.name}
<#else>
No users found.
</#list>
Macro
Used to define a reusable block of template code.
Syntax:
<#macro name param1 param2 ... paramN>
...
</#macro>
Example:
<#macro test>
Test text
</#macro>
<#-- call the macro: -->
<@test/>
Output: Test text
File Storage
Files API
$ curl --request PUT 'core-api:12010/v2/file/my/key/to/my/file' --form 'file=@"/../../test.txt"'
$ curl --request GET 'core-api:12010/v2/file/my/key/to/my/file'
$ curl --request GET 'core-api:12010/v2/file'
$ curl --request DELETE 'core-api:12010/v2/file/my/key/to/my/file'
The File Storage handles files with the help of a dedicated "folder" in the Object Storage.
File names can be constructed as nested paths, where a slash at the end of each sub path denotes that sub path as a parent/folder. The name must follow these rules:
-
Parent names can contain alphanumeric characters (i.e.
[A-Z],[a-z],[0-9]), hyphen (i.e.-), underscore (i.e._) and spaces. -
Character quantity must range from 1 to 255.
-
A nested path can consist of up to 10 levels.
|
Note
|
When executing the endpoints to upload or delete files, the Expression Language is notified about the changes to clear its internal cache. |
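The naming rules above can be checked with a short Python sketch. This is illustrative only (the function name valid_file_key is hypothetical), and it makes two assumptions worth flagging: the character/length rule stated for parent names is applied to every segment including the leaf, and "up to 10 levels" is read as at most 10 path segments.

```python
import re

# Allowed: alphanumerics, hyphen, underscore, spaces; 1 to 255 characters.
SEGMENT = re.compile(r"^[A-Za-z0-9\-_ ]{1,255}$")

def valid_file_key(key):
    """Validate a File Storage key against the documented naming rules."""
    parts = key.split("/")
    if len(parts) > 10:          # at most 10 nesting levels (assumption)
        return False
    return all(SEGMENT.match(p) for p in parts)

assert valid_file_key("my/key/to/my/file")
assert not valid_file_key("a" * 256)      # segment too long
assert not valid_file_key("bad*char")     # disallowed character
```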
Discovery Staging
Discovery Staging is a REST API on top of a Document Database. Its goal is to simplify and standardize the interactions of all Discovery products with any supported provider that can handle JSON content, while offering the end user features such as:
-
A push model alternative for the ETL process in Discovery Ingestion.
-
An intermediate repository for Discovery Ingestion, reducing the time and costs of content reprocessing for each processing iteration.
-
Advanced search capabilities in Discovery QueryFlow such as facet snapping based on the user’s input.
Supported Providers
MongoDB
As the NoSQL industry standard for document storage, with the MongoDB Atlas managed service available in the marketplaces of all major cloud providers, MongoDB is the default Document Database provider for Discovery Staging in all Discovery installations.
Amazon DocumentDB
Amazon DocumentDB (with MongoDB compatibility) is a good alternative Document Database provider for Discovery Staging in installations fully-managed by AWS.
Azure DocumentDB
Azure DocumentDB is a MongoDB compatible engine that supports hybrid and multicloud architectures with enterprise-grade performance, availability, and easy Azure AI integration.
Content
Content API
$ curl --request GET 'staging-api:12020/v2/content/{bucketName}/{contentId}?action={action}&include={include}&exclude={exclude}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
contentId-
(Required, String) The document ID.
Query Parameters
action-
(Optional, String) The actions to filter the documents. Defaults to STORE.
include-
(Optional, Array of Strings) The fields of the document’s content to include in the response.
exclude-
(Optional, Array of Strings) The fields of the document’s content to exclude from the response.
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/{contentId}?parentId={parentId}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
contentId-
(Required, String) The document ID.
Query Parameters
parentId-
(Optional, String) The parent ID of the documents.
Details
Note: This endpoint is capable of updating an existing document by using the contentId of that document.
The final content must not exceed the maximum size supported by the chosen provider. Exceeding the limit results in a 413 error in the Staging API's response.
Documents are stored with metadata, which adds extra size to the final document beyond the content in the request body, so it is recommended to keep the request body below the provider's limit.
The following table details the maximum limit per provider:
| Provider | Limit |
|---|---|
| MongoDB | BSON size limit (~16Mb / 16793600 Bytes) |
| Amazon DocumentDB | BSON size limit (~16Mb / 16793600 Bytes) |
| Azure DocumentDB | BSON size limit (~16Mb / 16793600 Bytes) |
$ curl --request DELETE 'staging-api:12020/v2/content/{bucketName}/{contentId}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
contentId-
(Required, String) The document ID.
$ curl --request DELETE 'staging-api:12020/v2/content/{bucketName}?parentId={parentId}' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
Query Parameters
parentId-
(Optional, String) The parent ID of the documents.
Body
The body payload is an optional DSL Filter to apply to the delete operation.
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/scroll?token={token}&parentId={parentId}&size={size}&action={action}' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
Query Parameters
token-
(Optional, Hex String) The token to paginate the documents.
parentId-
(Optional, String) The parent ID of the documents.
size-
(Optional, Int) The number of documents to scroll. Defaults to 25.
action-
(Optional, Array of Strings) The actions to filter the documents. Defaults to STORE.
Body
The body payload is an optional DSL Filter and an optional DSL Projection to apply to the scroll operation:
{
"fields": <Projection DSL>,
"filters": <Filter DSL>
}
$ curl --request POST 'staging-api:12020/v2/content/{bucketName}/search?parentId={parentId}&action={action}&page={page}&size={size}&sort={sort}' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
Query Parameters
parentId-
(Optional, String) The parent ID of the documents.
action-
(Optional, Array of Strings) The actions to filter the documents. Defaults to STORE.
page-
(Optional, Int) The page number. Defaults to 0.
size-
(Optional, Int) The size of the page. Defaults to 20.
sort-
(Optional, Array of Strings) The sort definition for the page.
Body
The body payload is an optional DSL Filter and an optional DSL Projection to apply to the search operation:
{
"fields": <Projection DSL>,
"filters": <Filter DSL>
}
The content of the bucket is the data, stored as JSON.
Buckets
Buckets API
$ curl --request GET 'staging-api:12020/v2/bucket'
$ curl --request GET 'staging-api:12020/v2/bucket/{bucketName}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}/purge'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
Note: Only purges the documents with the DELETE action.
$ curl --request DELETE 'staging-api:12020/v2/bucket/{bucketName}/index/{indexName}'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
indexName-
(Required, String) The name of the index.
$ curl --request PUT 'staging-api:12020/v2/bucket/{bucketName}/index/{indexName}' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
indexName-
(Required, String) The name of the index.
Body
[
{ "fieldA": "ASC" },
{ "fieldB": "DESC" },
...
]
Note: An empty or null body is also allowed; in that case, an ascending index is created using the index name as the field name.
$ curl --request POST 'staging-api:12020/v2/bucket/{bucketName}' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket.
Body
{
"indices": [
{
"name": "myIndexA",
"fields": [
{ "fieldA": "ASC" },
{ "fieldB": "DESC" },
...
]
},
...
],
"config":{}
}
$ curl --request POST 'staging-api:12020/v2/bucket/{bucketName}/rename' --data '{ ... }'
Path Parameters
bucketName-
(Required, String) The name of the bucket to rename.
Body
{
"name": "new-name",
"allowOverride": false
}
name-
(Required, String) The new name of the bucket.
allowOverride-
(Optional, Boolean) Whether an existing bucket with the new name can be overridden. Defaults to false.
Note: This endpoint is not supported on Amazon DocumentDB elastic clusters. See the Architecture section.
A bucket is a complete collection of data. Several operations can be performed on a bucket, where the results vary depending on the user input from the HTTP request.
Metadata description
{
"name": "<Text>",
"documentCount": {
"STORE": "<Number>",
"DELETE": "<Number>"
},
"content": {
"oldest": "<StagingDocument>",
"newest": "<StagingDocument>"
},
"indices": [
{
"name": "<Text>",
"fields": [
{
"fieldA": "ASC|DESC"
}
]
}
]
}
| Property | Type | Description |
|---|---|---|
| name | Text | The bucket name |
| documentCount | JSON Object | The total documents in the bucket, divided by action |
| documentCount.STORE | Number | The number of documents currently in the bucket with a STORE action |
| documentCount.DELETE | Number | The number of documents currently in the bucket with a DELETE action |
| content | JSON Object | The content of the bucket, including the oldest and newest documents |
| content.oldest | Staging Document | The oldest document in the bucket |
| content.newest | Staging Document | The newest document in the bucket |
| indices | JSON Array | Array with the name and fields of every index in the bucket |
Index description
| Property | Type | Description |
|---|---|---|
| name | Text | The index name |
| fields | Key/Value Pair Array | The fields used for the index. The key of every element is the name of the field, and the value is its sort direction (ASC or DESC). |
Note: All indices are defined over the content of the document.
Note: When creating an index, if any field is duplicated, the last value specified for that field takes precedence.
Discovery Ingestion
Discovery Ingestion is a fully-featured extract, transform, and load (ETL) tool that orchestrates the communication with external services while applying data enrichment to the records detected in the given data source. It enables features such as:
-
Flexibility to represent complex data processing scenarios through a finite-state machine.
-
Distributed, auto-scalable model that only consumes resources as needed.
-
Extensive component library for data source scanning, record processing and hook triggering.
Data Seed
Seeds API
$ curl --request POST 'ingestion-api:12030/v2/seed' --data '{ ... }'
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}?scanType={scan-type}' --data '{ ... }'
Query Parameters
scanType-
(Required, String) The scan type for the seed execution. Either FULL or INCREMENTAL.
Body
The body payload contains the execution properties, which override the ones configured in the Seed.
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/halt'
$ curl --request GET 'ingestion-api:12030/v2/seed'
$ curl --request GET 'ingestion-api:12030/v2/seed/{id}'
$ curl --request PUT 'ingestion-api:12030/v2/seed/{id}' --data '{ ... }'
Note: The type of an existing seed can’t be modified.
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/reset'
$ curl --request DELETE 'ingestion-api:12030/v2/seed/{id}'
$ curl --request POST 'ingestion-api:12030/v2/seed/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Seed
$ curl --request POST 'ingestion-api:12030/v2/seed/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search
$ curl --request GET 'ingestion-api:12030/v2/seed/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search
A seed defines the data source for the configuration and the pipeline to follow during the processing of each record through the finite-state machine.
{
"type": "my-component-type",
"name": "My Component Seed",
"config": {
...
},
"pipeline": <Pipeline ID>,
...
}
type-
(Required, String) The name of the component to execute.
name-
(Required, String) The unique name to identify the configuration.
description-
(Optional, String) The description for the configuration.
config-
(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language.
server-
(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.
Details
{ "server": { "id": "ba637726-555f-4c68-bfed-1c91f4803894", ... }, ... }
id-
(Required, UUID) The ID of the server configuration for the integration.
credential-
(Optional, UUID) The ID of the credential to override the default authentication in the external service.
pipeline-
(Required, UUID) The ID of the pipeline configuration for all detected records.
recordPolicy-
(Optional, Object) The global configuration for the seed execution.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { ... }, ... }
timeoutPolicy-
(Optional, Object) The policy for handling timeouts during the scan of records.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { "timeoutPolicy": { ... }, ... }, ... }
slice-
(Optional, Duration) The timeout for the scan of each slice of records. Defaults to 1h.
errorPolicy-
(Optional, String) The error policy for scanned records. Defaults to FATAL.
-
FATAL: A single failed document aborts the complete process.
-
IGNORE: Ignores the record scan error and the Seed Execution continues.
outboundPolicy-
(Optional, Object) The policy for groups of records sent for processing within the finite-state machine. Applied by default to outbound records (i.e., records sent to the pipeline).
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { "outboundPolicy": { ... }, ... }, ... }
idPolicy-
(Optional, Object) The policy for generating the records IDs. If not provided, the plain ID of the record will be used.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { "outboundPolicy": { "idPolicy": { ... } } }, ... }
generator-
(Optional, String) The expression that represents the ID used for the scanned records.
batchPolicy-
(Optional, Object) The batch policy for outbound batches of records.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { "outboundPolicy": { "batchPolicy": { ... } } }, ... }
maxCount-
(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.
flushAfter-
(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.
snapshotPolicy-
(Optional, Object) The configuration for the incremental scan feature.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "recordPolicy": { "snapshotPolicy": { ... }, ... }, ... }
checksumExpression-
(Optional, String) The expression to determine if the record has changed during an incremental scan by checksum. Defaults to all the fields in the document.
beforeHooks-
(Optional, Object) The Hooks to execute before starting the record processing.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "beforeHooks": { "hooks": [ ... ], "timeout": "60s", "errorPolicy": "IGNORE" }, ... }
hooks-
(Required, Array of Objects) The list of Hooks to execute
Details
{ "hooks": [ { "id": <Hook ID>, ... } ], "timeout": "60s", "errorPolicy": "IGNORE" }
id-
(Required, UUID) The ID of the Hook to execute
errorPolicy-
(Optional, String) Overrides the global policy for errors during the execution of the Hook
-
FATAL: A single failed hook aborts the complete process.
-
IGNORE: Ignores the hook error and the Seed Execution continues.
timeout-
(Optional, Duration) Overrides the global timeout for the execution of the Hook
active-
(Optional, Boolean) false to disable the execution of the Hook.
errorPolicy-
(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE.
-
FATAL: A single failed hook aborts the complete process.
-
IGNORE: Ignores the hook error and the Seed Execution continues.
timeout-
(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s.
afterHooks-
(Optional, Object) The Hooks to execute after completing the record processing.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { ... }, "pipeline": <Pipeline ID>, "afterHooks": { "hooks": [ ... ], "timeout": "60s", "errorPolicy": "IGNORE" }, ... }
hooks-
(Required, Array of Objects) The list of Hooks to execute
Details
{ "hooks": [ { "id": <Hook ID>, ... } ], "timeout": "60s", "errorPolicy": "IGNORE" }
id-
(Required, UUID) The ID of the Hook to execute
errorPolicy-
(Optional, String) Overrides the global policy for errors during the execution of the Hook
-
FATAL: A single failed hook aborts the complete process.
-
IGNORE: Ignores the hook error and the Seed Execution continues.
timeout-
(Optional, Duration) Overrides the global timeout for the execution of the Hook
active-
(Optional, Boolean) false to disable the execution of the Hook.
errorPolicy-
(Required, String) The policy for errors during the execution of the Hook. Defaults to IGNORE.
-
FATAL: A single failed hook aborts the complete process.
-
IGNORE: Ignores the hook error and the Seed Execution continues.
timeout-
(Required, Duration) The timeout for the execution of the Hook. Defaults to 60s.
properties-
(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the seed itself, in processors and in hooks.
Details
{ "type": "my-component-type", "name": "My Component Seed", "config": { "myProperty": "#{ seed.properties.keyA }" }, "properties": { "keyA": "valueA" }, "pipeline": <Pipeline ID>, ... }
{ "type": "my-component-type", "name": "My Component Processor", "config": { "myProperty": "#{ seed.properties.keyA }" }, ... }
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }
key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Records
Records API
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/record'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/record/{recordId}'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/record/summary'
Seed Records reflect the status and parent-child relationship of records during a specific seed’s latest execution.
Each seed record in a given seed is identifiable by combining its seed, its parent plainId (if any) and its own plainId assigned at scan time. These IDs are used to generate a more system-friendly ID for each record, known as the hashId. This ID is produced by applying the SHA-256 algorithm to the combination of the parent plainId plus the record plainId, and then encoding the result with padded Base64URL to make it safe to use in URLs.
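The hashId derivation can be sketched in a few lines. The helper below is illustrative (the exact concatenation used when a parent is present is an assumption), but the no-parent case reproduces the sample hash shown in this section:

```python
import base64
import hashlib

def hash_id(plain_id: str, parent_plain_id: str = "") -> str:
    """SHA-256 over (parent plainId + record plainId), padded Base64URL-encoded."""
    digest = hashlib.sha256((parent_plain_id + plain_id).encode("utf-8")).digest()
    return base64.urlsafe_b64encode(digest).decode("ascii")

print(hash_id("1"))  # a4ayc_80_OGda4BO_1o_V0etpOqiLx1JwB5S3beHW0s=
```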
{
"id": {
...
},
"creationTimestamp": "2025-04-22T16:51:44Z",
"lastUpdatedTimestamp": "2025-04-22T16:51:44Z",
"parent": "FOMM0WPHMpEuBIxMg34VxOkMBi67eVq5R9V3BuLRDdg=",
"status": "FAILURE",
"errors": [
...
]
}
id-
(Object) The ID of the record.
Details
{ "plain": "1", "hash": "a4ayc_80_OGda4BO_1o_V0etpOqiLx1JwB5S3beHW0s=" }
plain-
(String) The ID of the record before hashing it.
hash-
(String) The ID of the record as a Base64URL string.
creationTimestamp-
(Timestamp) The timestamp when the record was created.
lastUpdatedTimestamp-
(Timestamp) The timestamp when the record was last updated.
parent-
(String) The parent record’s id as a Base64URL string.
status-
(String) The status of the record.
Details
-
SUCCESS: The record was successfully processed.
-
FAILURE: The record reported errors during its processing.
-
QUARANTINE: The record has been processed too many times, and it should not be processed again.
errors-
(Array of Objects) The record’s errors, if any.
Details
Each item in the errors array represents an individual error encountered during record processing.
{ ..., "errors": [ { "id": "a08aa41c-9d5c-436f-94b0-e3ac19b805d3", "error": { "code": 4001, "status": 409, "messages": [ "com.pureinsights.pdp.core.CoreException: Script execution failed: java.lang.Exception: Error" ], "timestamp": "2025-06-04T05:18:42.071218Z" }, "retry": 0 }, ... ], ... }
id-
(String) The ID of the error document.
error-
(Object) Contains the detailed error information.
Details
code-
(Integer) The internal code for the error, helpful for identifying the type of failure.
status-
(Integer) The associated HTTP status code, indicating the general category of the failure.
messages-
(List of strings) A list of one or more error messages giving a detailed description of what went wrong.
timestamp-
(Timestamp) The time when the error occurred.
retry-
(Integer) The number of times the system has retried processing the record after the error occurred.
Schedule
Schedules API
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule' --data '{ ... }'
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule'
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule/{id}'
$ curl --request PUT 'ingestion-api:12030/v2/seed/schedule/{id}' --data '{ ... }'
$ curl --request DELETE 'ingestion-api:12030/v2/seed/schedule/{id}'
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Schedule.
$ curl --request POST 'ingestion-api:12030/v2/seed/schedule/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'ingestion-api:12030/v2/seed/schedule/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search.
A Schedule allows a Seed to be executed automatically according to a cron expression.
{
"name": "My Schedule",
"expression": "1 * * * *",
"seed": <Seed ID>,
"scanType": "FULL",
...
}
name-
(Required, String) The unique name to identify the schedule.
description-
(Optional, String) The description for the configuration.
expression-
(Required, String) The cron expression in UNIX format that defines when the Seed must be executed.
seed-
(Required, UUID) The ID of the Seed to execute.
scanType-
(Required, String) The scan type for the seed execution. Either FULL or INCREMENTAL.
properties-
(Optional, Object) The execution properties, which override the ones configured in the Seed.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }
key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Discovery Features for Data Seeds
Incremental Scan
The scan must be executed in one of two modes: FULL or INCREMENTAL.
While the former overrides any previous execution information and, given its nature, only identifies the CREATE action, the latter uses a snapshot from the previous execution to also identify UPDATE, DELETE and NO_CHANGE actions. The mechanism behind the identification varies depending on the Seed action:
-
Checksum - always scans the complete data source and selects each record’s action after a comparison based on a configurable checksum. This is always the case for the split operation.
-
Custom - the incremental scan mechanism is defined via the configuration of the Seed action.
-
None - the Seed action does not support incremental scans.
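As an illustration of the Checksum mode, the sketch below classifies records by comparing a checksum of each scanned record against a snapshot kept from the previous execution. The function names and the snapshot shape are hypothetical, not the actual Ingestion implementation:

```python
import hashlib
import json

def checksum(record: dict) -> str:
    """A stable checksum over all fields of the record (illustrative)."""
    return hashlib.sha256(json.dumps(record, sort_keys=True).encode("utf-8")).hexdigest()

def classify(snapshot: dict, scanned: dict) -> dict:
    """Compare a {id: checksum} snapshot with freshly scanned {id: record} data."""
    actions = {}
    for rid, record in scanned.items():
        if rid not in snapshot:
            actions[rid] = "CREATE"
        elif snapshot[rid] != checksum(record):
            actions[rid] = "UPDATE"
        else:
            actions[rid] = "NO_CHANGE"
    for rid in snapshot.keys() - scanned.keys():
        actions[rid] = "DELETE"  # present in the previous execution, missing now
    return actions

snapshot = {"1": checksum({"title": "old"}), "2": checksum({"title": "same"})}
scanned = {"2": {"title": "same"}, "3": {"title": "new"}}
print(classify(snapshot, scanned))  # {'2': 'NO_CHANGE', '3': 'CREATE', '1': 'DELETE'}
```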
Distributed Scan
The scan phase of ingestion, as the starting point of data processing, might need to identify a large number of records. This can cause performance problems due to the sequential nature of scanning, or failures due to timeouts.
In general, all Seed actions should handle this by internally slicing the total set of records into smaller pages that can be retrieved in a distributed environment. However, some implementations might not offer a proper mechanism for this pagination, and they will be forced to handle the complete scan as a single request.
Hierarchical Records
Some data sources might expose a parent-children relationship on their records. Whenever possible, this information will be captured and exposed through the Expression Language and the Record API.
Pipeline
Pipelines API
$ curl --request POST 'ingestion-api:12030/v2/pipeline' --data '{ ... }'
$ curl --request GET 'ingestion-api:12030/v2/pipeline'
$ curl --request GET 'ingestion-api:12030/v2/pipeline/{id}'
$ curl --request PUT 'ingestion-api:12030/v2/pipeline/{id}' --data '{ ... }'
$ curl --request DELETE 'ingestion-api:12030/v2/pipeline/{id}'
$ curl --request POST 'ingestion-api:12030/v2/pipeline/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Pipeline.
$ curl --request POST 'ingestion-api:12030/v2/pipeline/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'ingestion-api:12030/v2/pipeline/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search.
A pipeline is the definition of the finite-state machine for records processing:
{
"name": "My Pipeline",
"initialState": "stateA",
"states": {
"stateA": {
...
},
"stateB": {
...
}
},
...
}
name-
(Required, String) The unique name to identify the pipeline.
description-
(Optional, String) The description for the configuration.
initialState-
(Required, String) The state, as defined in the states field, to be used as the starting point for new and updated records that need to be processed through the pipeline.
deleteState-
(Optional, String) The state, as defined in the states field, to be used as the starting point for deleted records that need to be processed through the pipeline. Although this field is not required, omitting it may lead to unexpected errors because deleted records are still processed by the pipeline, starting from the initialState. Since deleted records can differ in structure or content from newly created or updated records, it is recommended to process them in separate states that can handle deletion events.
states-
(Required, Object) The states associated with the pipeline.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }
key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
recordPolicy-
(Required, Object) The global record policies to be applied during the execution of a pipeline.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { ... }, ... }
idPolicy-
(Optional, Object) The policy for handling the generation of records IDs.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "idPolicy": { ... } }, ... }
mask-
(Optional, String) The expression that represents the ID of the record during its processing. If not provided, the plain ID of the record will be used.
retryPolicy-
(Optional, Object) The policy for handling the records retries.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "retryPolicy": { ... } }, ... }
active-
(Optional, Boolean) Whether a processor should be retried if it fails during execution. Defaults to true.
maxRetries-
(Optional, Integer) The maximum number of retries for processing the record. The retries are executed from the point where the record failed. Defaults to 3.
timeoutPolicy-
(Optional, Object) The policy for handling the records timeout.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "timeoutPolicy": { ... } }, ... }
record-
(Optional, Duration) The timeout for each record during its processing. Defaults to 60s.
errorPolicy-
(Optional, String) The error policy for records during their processing. Defaults to FAIL.
-
FATAL: A single failed document aborts the complete process.
-
FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected.
-
IGNORE: Ignores the record processing error and its execution continues.
outboundPolicy-
(Optional, Object) The custom policy for groups of records sent for processing to the next state within the finite-state machine.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "outboundPolicy": { ... } }, ... }
mode-
(Optional, String) The output mode for processors that might be affected by operations like splitting. Either INLINE for a normal output where any "split" will be represented as an array, or SPLIT for children documents hierarchically related to the origin. Defaults to INLINE.
splitPolicy-
(Optional, Object) The splitting policy for outbound batches of children records (where supported).
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "outboundPolicy": { "splitPolicy": { ... } } }, ... }
children-
(Optional, Object) The configuration of the children records after the split of the parent.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "outboundPolicy": { "splitPolicy": { "children": { ... } } } }, ... }
idPolicy-
(Optional, Object) The configuration of the ID for each of the new child record.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "outboundPolicy": { "splitPolicy": { "children": { "idPolicy": { ... } } } } }, ... }
generator-
(Optional, String) The expression that represents the ID of each child record. If not provided, the assigned value is composed of the parent record ID, followed by a colon and an incremental number, i.e. <parentRecordId>:<incrementalNumber>.
batchPolicy-
(Optional, Object) The batch policy for outbound batches of records, once their processor execution is completed.
Details
{ "name": "My Pipeline", "initialState": "stateA", "states": { ... }, "recordPolicy": { "outboundPolicy": { "batchPolicy": { ... } } }, ... }
maxCount-
(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.
flushAfter-
(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.
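The batchPolicy semantics (flush when maxCount records accumulate, or once flushAfter has elapsed) can be sketched as below. The Batcher class is purely illustrative, not a Discovery API:

```python
import time

class Batcher:
    """Collects records; flushes on max_count or after flush_after seconds (illustrative)."""

    def __init__(self, on_flush, max_count=25, flush_after=60.0):
        self.on_flush = on_flush          # callback receiving each flushed batch
        self.max_count = max_count
        self.flush_after = flush_after
        self.batch = []
        self.started = time.monotonic()

    def add(self, record):
        self.batch.append(record)
        # flush when the count limit is reached or the time window has elapsed
        if len(self.batch) >= self.max_count or time.monotonic() - self.started >= self.flush_after:
            self.flush()

    def flush(self):
        if self.batch:
            self.on_flush(self.batch)
        self.batch = []
        self.started = time.monotonic()

flushed = []
b = Batcher(flushed.append, max_count=3, flush_after=60.0)
for i in range(7):
    b.add(i)
b.flush()  # flush the trailing partial batch
print(flushed)  # [[0, 1, 2], [3, 4, 5], [6]]
```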
Processors
Processors API
$ curl --request POST 'ingestion-api:12030/v2/processor' --data '{ ... }'
$ curl --request GET 'ingestion-api:12030/v2/processor'
$ curl --request GET 'ingestion-api:12030/v2/processor/{id}'
$ curl --request PUT 'ingestion-api:12030/v2/processor/{id}' --data '{ ... }'
Note: The type of an existing processor can’t be modified.
$ curl --request DELETE 'ingestion-api:12030/v2/processor/{id}'
$ curl --request POST 'ingestion-api:12030/v2/processor/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Processor
$ curl --request POST 'ingestion-api:12030/v2/processor/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search
$ curl --request GET 'ingestion-api:12030/v2/processor/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search
Each processor component is stateless and driven by the configuration defined in the processor and by the context created by the current seed execution. This design makes the processor the main building block of Discovery Ingestion.
Processors are intended to solve very specific tasks, which makes them reusable and simple to integrate into any part of the configuration.
{
"type": "my-component-type",
"name": "My Component Processor",
"config": {
...
},
...
}
type-
(Required, String) The name of the component to execute
name-
(Required, String) The unique name to identify the configuration
description-
(Optional, String) The description for the configuration.
config-
(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language
server-
(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.
Details
{ "server": { "id": "ba637726-555f-4c68-bfed-1c91f4803894", ... }, ... }
id-
(Required, UUID) The ID of the server configuration for the integration.
credential-
(Optional, UUID) The ID of the credential to override the default authentication in the external service.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }
key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Hooks
Hooks are a type of Processor, detached from the record processing. They execute single pre- or post-actions associated with a long seed execution (e.g. creating indices, changing aliases…).
There are two types of hooks: BEFORE_HOOK and AFTER_HOOK. Before hooks are executed at the beginning, and after hooks at the end, of the record processing of an execution.
Data Processing with a State Machine
State Types
Processor State
Executes a single or multiple processors in sequence:
{
"myProcessorState": {
"type": "processor",
"processors": [
...
]
}
}
type-
(Required, String) The type of state. Must be processor.
processors-
(Required, Array of Objects) The processors to execute.
Details
{ "stateA": { "type": "processor", "processors": [ { "id": <Processor ID>, ... } ], ... } }
id-
(Required, UUID) The ID of the processor to execute.
outputField-
(Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component.
recordPolicy-
(Optional, Object) The custom records configuration for the execution of the processor. Overrides the global one defined in the seed or pipeline. Can also be referred to as record.
Details
{ "id": <Processor ID>, "recordPolicy": { ... } }
idPolicy-
(Optional, Object) The policy for handling the records IDs.
Details
{ "id": <Processor ID>, "recordPolicy": { "idPolicy": { ... } } }
mask-
(Optional, String) The expression that represents the ID of the record during its processing. If not provided, the plain ID of the record will be used.
timeoutPolicy-
(Optional, Object) The policy for handling the records timeout.
Details
{ "id": <Processor ID>, "recordPolicy": { "timeoutPolicy": { ... } } }
record-
(Optional, Duration) The timeout for each record during its processing.
errorPolicy-
(Optional, String) The error policy for records during their processing. Defaults to FAIL.
- FATAL: A single failed document aborts the complete process
- FAIL: Either marks the document as failed, or sends it to a configured error handling state (if any). Other records continue their execution as expected
- IGNORE: Ignores the record processing error and its execution continues
retryPolicy-
(Optional, Object) The policy for handling the records retries.
Details
{ "id": <Processor ID>, "recordPolicy": { "retryPolicy": { ... } } }
active-
(Optional, Boolean) Whether the processor should be retried if failed.
outboundPolicy-
(Optional, Object) The custom policy for groups of records sent for processing to the next state within the finite-state machine.
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { ... } } }
mode-
(Optional, String) The output mode for processors that might be affected by operations like splitting. Either INLINE for a normal output where any "split" will be represented as an array, or SPLIT for children documents hierarchically related to the origin. Defaults to INLINE.
splitPolicy-
(Optional, Object) The splitting policy for outbound batches of children records (where supported).
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { "splitPolicy": { ... } } } }
children-
(Optional, Object) The configuration of the children records after the split of the parent.
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { "splitPolicy": { "children": { ... } } } } }
idPolicy-
(Optional, Object) The configuration of the ID for each of the new child records.
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { "splitPolicy": { "children": { "idPolicy": { ... } } } } } }
generator-
(Optional, Object) The expression that represents the ID of each child record. If not provided, the value assigned is composed of the parent record ID, followed by a colon, and an incremental number, i.e.
<parentRecordId>:<incrementalNumber>.
snapshotPolicy-
(Optional, Object) The configuration for the incremental scan feature.
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { "splitPolicy": { "children": { "snapshotPolicy": { ... } } } } } }
checksumExpression-
(Optional, String) The expression to determine if the record has changed during an incremental scan. Defaults to all the fields in the document.
batchPolicy-
(Optional, Object) The batch policy for outbound batches of records, once their processor execution is completed.
Details
{ "id": <Processor ID>, "recordPolicy": { "outboundPolicy": { "batchPolicy": { ... } } } }
maxCount-
(Optional, Integer) The maximum record count in a batch before flushing. Defaults to 25.
flushAfter-
(Optional, Duration) The timeout to flush if no other condition has been met. Defaults to 1m.
active-
(Optional, Boolean) false to disable the execution of the processor. Default is true.
next-
(Optional, String) The next state of the finite-state machine after the completion of the state. If not provided, the current one will be assumed as the final state.
onError-
(Optional, String) The state of the finite-state machine to use as fallback if the execution of the current state fails. If undefined, the current execution will complete with the corresponding error message.
mode-
(Optional, Object) The execution mode for the configured processors.
Details
{ "stateA": { "type": "processor", "mode": { "type": "group", ... }, ... } }
type-
(Required, String) The type of execution mode for the processors in the state. It can be group for a grouped execution of processors, split to split the records before the execution of the processors in the state, or collapse to specify an expression that represents the output of the state. Defaults to group.
Group mode
{ "stateA": { "type": "processor", "mode": { "type": "group" }, ... } }
Collapse mode
{ "stateA": { "type": "processor", "mode": { "type": "collapse", "output": { "title": "#{ concat(data('/title'), ' - ', data('/subtitle')) }", "description": "#{ data('/description') }" } }, ... } }
output-
(Required, Object) The expression that represents the output of the state.
Split mode
{ "stateA": { "type": "processor", "mode": { "type": "split", "source": " #{data('/dataToSplit')} ", ... } ... } }
source-
(Required, Array of Objects) The array with the content to split.
splitPolicy-
(Optional, Object) The splitting policy for outbound batches of children records.
Details
{ "stateA": { "type": "processor", "mode": { "type": "split", "source": " #{data('/dataToSplit')} ", "splitPolicy": { ... } } ... } }
children-
(Optional, Object) The configuration of the children records after the split of the parent.
Details
{ "stateA": { "type": "processor", "mode": { "type": "split", "source": " #{ data('/dataToSplit') } ", "splitPolicy": { "children": { ... } } } ... } }
idPolicy-
(Optional, Object) The configuration of the ID for each of the new child records.
Details
{ "stateA": { "type": "processor", "mode": { "type": "split", "source": " #{data('/dataToSplit')} ", "splitPolicy": { "children": { "idPolicy": { ... } } } } ... } }
generator-
(Optional, Object) The expression that represents the ID of each child record. If not provided, the value assigned is composed of the parent record ID, followed by a colon, and an incremental number, i.e.
<parentRecordId>:<incrementalNumber>.
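Combining the fields above, a processor state with per-record overrides might be sketched as follows (state names, values and the output field are illustrative):

```json
{
  "myProcessorState": {
    "type": "processor",
    "processors": [
      {
        "id": "<Processor ID>",
        "outputField": "enriched",
        "recordPolicy": {
          "errorPolicy": "IGNORE",
          "timeoutPolicy": { "record": "30s" },
          "outboundPolicy": {
            "batchPolicy": { "maxCount": 50, "flushAfter": "2m" }
          }
        }
      }
    ],
    "next": "mySwitchState",
    "onError": "myErrorState"
  }
}
```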
The output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:
{
"defaultFieldName": {
"outputKey": "outputValue"
}
}
Switch State
Use DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution given the first matching condition:
{
"mySwitchState": {
"type": "switch",
"options": [
...
],
"default": "myDefaultState"
}
}
type-
(Required, String) The type of state. Must be switch.
options-
(Required, Array of Objects) The options to evaluate in the state
Details
{ "type": "switch", "options": [ { "condition": { "equals": { "field": "/my/input/field", "value": "valueA" }, ... }, "state": "myFirstState" }, ... ], ... }
condition-
(Required, Object) The predicate described as a DSL Filter over the JSON processing data
state-
(Optional, String) The next state for the finite-state machine if the condition evaluates to true.
default-
(Optional, String) The default state for the finite-state machine if no option evaluates to true.
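A complete switch state, assembled from the fields above, could look like this sketch (field paths, values and state names are illustrative):

```json
{
  "mySwitchState": {
    "type": "switch",
    "options": [
      {
        "condition": { "equals": { "field": "/my/input/field", "value": "valueA" } },
        "state": "myFirstState"
      },
      {
        "condition": { "equals": { "field": "/my/input/field", "value": "valueB" } },
        "state": "mySecondState"
      }
    ],
    "default": "myDefaultState"
  }
}
```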
Note: If no state for the finite-state machine is selected, the current one will be assumed as the final state.
Seed Execution
Seed Executions API
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}'
$ curl --request POST 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/halt'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/audit'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/seed'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/pipeline/{pipelineId}'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/processor/{processorId}'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/server/{serverId}'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/config/credential/{credentialId}'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/job/summary'
$ curl --request GET 'ingestion-api:12030/v2/seed/{seedId}/execution/{executionId}/record/summary'
{
"id": "1ed146d8-e5d8-49df-9b65-b9f6396183ff",
"creationTimestamp": "2025-03-13T08:59:15Z",
"lastUpdatedTimestamp": "2025-03-13T09:46:59Z",
"triggerType": "MANUAL",
"status": "DONE",
"scanType": "FULL",
"stages": [
"BEFORE_HOOKS",
"INGEST",
"AFTER_HOOKS"
]
}
id-
(UUID) A unique ID that identifies the seed execution
creationTimestamp-
(Timestamp) The timestamp when the execution was triggered
lastUpdatedTimestamp-
(Timestamp) The timestamp when the execution was last updated
triggerType-
(String) The origin that triggered the execution. Currently, only MANUAL.
status-
(String) The status of the execution
Details
- CREATED: The seed has been triggered, but the execution has not started
- RUNNING: The seed is being executed
- HALTING: The seed execution received a HALT request, but some processing might still be happening
- HALTED: The seed execution is completely halted
- DONE: The seed completed its execution successfully
- FAILED: The seed failed during its execution
scanType-
(String) The scan type for the execution. Currently, only FULL.
stages-
(Array of Strings) The completed stages of the execution
Details
- BEFORE_HOOKS: The hooks before record data processing (if any)
- INGEST: The record data processing
- AFTER_HOOKS: The hooks after record data processing (if any)
The Seed Execution represents the currently active Seed. It updates with each complete stage of the execution.
When the user starts the execution of a seed, a copy of every user-created configuration related to the seed is stored for use during the entirety of the seed execution:
- The configuration of the seed being executed
- The configuration of any hook that’s used by the seed in execution
- The configuration of the pipeline assigned to the seed
- The configuration of any processor that’s used in any step of the pipeline
- The configuration of any server that’s used either by the seed, or any of the related processors
- The configuration of any credential that’s used either by the seed, or any of the related processors, or any of the related servers
This means that changing the configuration of the entities used during the execution of a seed will have no effect on its outcome. This is done to avoid unexpected and inconsistent behaviors.
Note: For security reasons, when the snapshot of the configuration of a credential is stored, the associated secrets are not included in it. A reference to the underlying secret is saved instead. This means that changes applied to secrets mid-seed execution can unpredictably affect the current execution.
Every generated record is tagged with a corresponding action to apply during a specific execution:
- CREATE: It is a new record for the seed
- UPDATE: The record was processed during a previous seed execution, but its content has changed
- DELETE: The record is marked to be deleted
During a seed execution, every record has a status that changes as the seed is processed:
- PROCESSING: The record was detected and is currently being processed
- FAILED: The processing of the record failed
- DONE: The record was successfully processed
Record Data Channels
During a seed execution, records can produce data in JSON format, as well as binary files.
JSON data is stored in a dedicated bucket within Discovery Staging and can be later referenced using JSON Pointers.
Note: New data nodes will never override data nodes previously generated.
Note: When searching for a path, the JSON Pointer will be evaluated against the most recent output. If it is a match, the node is returned. Otherwise, the search continues with the previous one.
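As a sketch of that resolution order (node contents are illustrative): each processor appends a new data node, and JSON Pointers are matched against the newest node first.

```json
// Data nodes accumulated for one record, newest last (contents illustrative)
[
  { "html": { "text": "raw page text" } },
  { "chunker": { "chunks": ["raw page text"] } }
]
// "/chunker/chunks" matches the newest node and is returned from there;
// "/html/text" misses in the newest node, so the search falls back to the earlier one.
```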
Binary data, such as images, videos or PDFs, is stored in a dedicated container inside the Object Storage.
Record Batches
Seeds can configure how batches are flushed through the finite-state machine.
The seed configuration, and its override in the processor state, define the boundaries of the batch: the first condition to be met triggers the flush, sending all the records in the batch to the next stage in the pipeline (the next processor, the next state of the state machine, or the end of the pipeline).
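For instance, a batch override on a processor state, using the batchPolicy fields described earlier (values are illustrative), flushes as soon as either condition is met:

```json
{
  "id": "<Processor ID>",
  "recordPolicy": {
    "outboundPolicy": {
      "batchPolicy": {
        "maxCount": 100,
        "flushAfter": "30s"
      }
    }
  }
}
```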
Expression Language Extensions
| Variable | Description | Example |
|---|---|---|
| | The ID of the seed in execution | |
| | The name of the seed | |
| | The description of the seed | |
| | The labels of the seed, grouped by key | |
| | The properties to use during placeholders resolution | |
| | The ID of the seed execution | |
| | The start time of the seed execution | |
| | The scan type of the seed execution | |
| | The trigger type of the seed execution | |
| | The properties to use during placeholders resolution | |
| | The ID of the processor | |
| | The name of the processor | |
| | The description of the processor | |
| | The labels of the processor, grouped by key | |
| | The ID of the pipeline | |
| | The name of the pipeline | |
| | The description of the pipeline | |
| | The labels of the pipeline, grouped by key | |
| | The ID of a generated record from a seed execution. If an ID mask is configured through ID policies, the masked ID will be returned | |
| | The ID of a generated record from a seed execution | |
| | The action of a generated record from a seed execution | |
| | The parent ID of a generated record from a seed execution | |
| | The record ID as detected from the source | |
| | The record content. Nested fields can be accessed as well | |

| Function | Description | Example |
|---|---|---|
| | Finds a specific node within the Record data channel using a JSON Pointer | |
| | References a specific node within the Record data channel using a 0-based index for the first data node generated in the channel. The method supports negative numbers | |
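These variables and functions are embedded in configuration values through #{ ... } expressions, as seen throughout the component examples in this guide; for example (field names illustrative):

```json
{
  "text": "#{ data('/text') }",
  "combined": "#{ concat(data('/title'), ' - ', data('/subtitle')) }"
}
```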
Discovery Features for Data Processing
Record Splitting
Some processors and their actions naturally split a record into multiple children (e.g. a CSV file where each line represents a new record).
The configuration of the pipeline (and the processors within the pipeline) not only supports this SPLIT behavior, but also allows an INLINE mode where children are output as an array instead.
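For example, a processor reference can opt into the SPLIT mode through the outboundPolicy described earlier; when no idPolicy generator is configured, children receive the default <parentRecordId>:<incrementalNumber> IDs (a sketch, values illustrative):

```json
{
  "id": "<Processor ID>",
  "recordPolicy": {
    "outboundPolicy": {
      "mode": "SPLIT"
    }
  }
}
```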
Components
Amazon S3
Performs different actions on Amazon S3, either downloading or uploading files depending on the action.
Scan Action: scan
Seed that retrieves the files stored in a bucket of Amazon S3.
| Feature | Supported |
|---|---|
| Checksum | Yes |
| | No |
{
"type": "amazon-s3",
"name": "My Amazon S3 Scan Action",
"config": {
"action": "scan",
"bucket": "my-bucket", (1)
"prefix": "/my/prefix",
"pageSize": 100
},
"pipeline": <Pipeline ID>, (2)
"server": <Amazon S3 Server ID> (3)
}
| 1 | This configuration field is required. |
| 2 | See the ingestion pipelines section. |
| 3 | See the Amazon S3 integration section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The name of the bucket to scan.
prefix-
(Optional, String) The bucket prefix to filter documents.
pageSize-
(Optional, Integer) The maximum number of elements per page.
Processor Action: download
Processor that, given the name of an Amazon S3 bucket and the file key, downloads the file and saves it in the Object Storage.
| Feature | Supported |
|---|---|
| | No |
{
"type": "amazon-s3",
"name": "My Download Action",
"config": {
"action": "download", (1)
"bucket": "my-s3-bucket", (1)
"key": "image.jpg",
"failOnMissing": true
},
"server": <Amazon S3 Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Amazon S3 integration section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The S3 bucket from which the document will be downloaded.
key-
(Required, String) The key of the document within the bucket.
metadata-
(Optional, Boolean) Whether to include metadata when downloading the document. Defaults to true.
failOnMissing-
(Optional, Boolean) Whether to throw an error if the document is not found in the bucket. Defaults to true.
The resulting download information will be saved in each record’s s3 field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:
{
"s3": {
"metadata": {}, // If the flag was set to true
"file": {
"@discovery": "object",
"bucket": "ingestion",
"key": "ingestion-88542806-94a1-4874-a1b2-fb4af5c7c540/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b/02519ad0-d8b6-4736-99d8-dd5744c6837c/ec48e0b4-b39a-4f6b-8856-c7104ab7e200"
}
}
}
Processor Action: upload
Processor that, given the name of an Amazon S3 bucket, the file key and a file from the Object Storage, uploads the file to the given bucket.
| Feature | Supported |
|---|---|
| | No |
{
"type": "amazon-s3",
"name": "My Upload Action",
"config": {
"action": "upload", (1)
"bucket": "amazon-s3-bucket", (1)
"key": "uploadedFile.jpg",
"file": "#{ file('fileToUpload', 'BYTES') }",
"metadata": false
},
"server": <Amazon S3 Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Amazon S3 integration section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The S3 bucket to which the document will be uploaded.
key-
(Required, String) The key of the document within the bucket.
file-
(Required, String) The file to upload. This file can be obtained from the object storage with the file function from the Expression Language.
metadata-
(Optional, Boolean) Whether to include metadata in the response when uploading the document. Defaults to false.
The resulting upload information will be saved in each record’s s3 field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:
{
"s3": {
"metadata": {}, // If the flag was set to true
}
}
Chunker
Splits big documents into smaller units that are easier for LLMs to interpret. It exposes different strategies that can be used.
Processor Action: sentence
Processor that splits the text by sentences.
| Feature | Supported |
|---|---|
| | Yes |
{
"type": "chunker",
"name": "My Chunk-By-Sentence Action",
"config": {
"action": "sentence", (1)
"text": " #{data('/text')} ", (1)
"sentences": 4,
"maxChars": 200,
"overlap": "1"
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to process.
sentences-
(Optional, Integer) The number of sentences per chunk. Defaults to 20.
overlap-
(Optional, String/Integer) The number of sentences to overlap; either a percentage or a number of sentences. Defaults to 10%.
maxChars-
(Optional, Integer) The maximum number of chars per chunk.
The resulting information will be saved in each record’s chunker field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:
{
"chunker": {
"chunks": [
"Lorem ipsum dolor sit amet consectetur adipiscing elit. Placerat in id cursus mi pretium tellus duis. Urna tempor pulvinar vivamus fringilla lacus nec metus.",
"Urna tempor pulvinar vivamus fringilla lacus nec metus. Integer nunc posuere ut hendrerit semper vel class. Conubia nostra inceptos himenaeos orci varius natoque penatibus.",
"Lectus commodo augue arcu dignissim velit aliquam imperdiet. Cras eleifend turpis fames primis vulputate ornare sagittis. Libero feugiat tristique accumsan maecenas potenti ultricies habitant.",
"Libero feugiat tristique accumsan maecenas potenti ultricies habitant. Cubilia curae hac habitasse platea dictumst lorem ipsum. Faucibus ex sapien vitae pellentesque sem placerat in.",
"Faucibus ex sapien vitae pellentesque sem placerat in. Tempus leo eu aenean sed diam urna tempor."
],
"errors": [
{
"index": 5,
"text": "Mus donec rhoncus eros lobortis nulla molestie mattis Purus est efficitur laoreet mauris pharetra vestibulum fusce Sodales consequat magna ante condimentum neque at luctus Ligula congue sollicitudin erat viverra ac tincidunt nam.",
"error": {
"status": 400,
"code": 3003,
"messages": [
"Chunk of size 229 exceeds maximum char limit of 200"
],
"timestamp": "2025-09-11T15:43:42.925739900Z"
}
}
]
}
}
Note: If the overlapped text exceeds the configured maxChars, the affected chunk is reported in the errors array.
Processor Action: word
Processor that splits the text by words.
| Feature | Supported |
|---|---|
| | Yes |
{
"type": "chunker",
"name": "My Chunk-By-Word Action",
"config": {
"action": "word", (1)
"text": " #{data('/text')} ", (1)
"words": 8,
"maxChars": 70,
"overlap": 3
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to process.
words-
(Optional, Integer) The number of words per chunk. Defaults to 20.
overlap-
(Optional, String/Integer) The number of words to overlap; either a percentage or a number of words. Defaults to 10%.
maxChars-
(Optional, Integer) The maximum number of chars per chunk.
The resulting information will be saved in each record’s chunker field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:
{
"chunker": {
"chunks": [
"Lorem ipsum dolor sit amet consectetur adipiscing elit",
"consectetur adipiscing elit Placerat in id cursus mi",
"id cursus mi pretium tellus duis Urna tempor",
"duis Urna tempor pulvinar vivamus fringilla lacus nec",
"fringilla lacus nec metus Integer nunc posuere ut",
"nunc posuere ut hendrerit semper vel class Conubia",
"vel class Conubia nostra inceptos himenaeos orci varius",
"himenaeos orci varius natoque penatibus. Mus donec rhoncus",
"Mus donec rhoncus eros lobortis nulla molestie mattis.",
"molestie mattis. Sodales consequat magna ante condimentum neque",
"ante condimentum neque at luctus. Ligula congue sollicitudin",
"Ligula congue sollicitudin erat viverra ac tincidunt nam.",
"ac tincidunt nam. Lectus commodo augue arcu dignissim",
"augue arcu dignissim velit aliquam imperdiet. Cras eleifend",
"imperdiet. Cras eleifend turpis fames primis vulputate ornare",
"primis vulputate ornare sagittis. Libero feugiat tristique accumsan",
"feugiat tristique accumsan maecenas potenti ultricies habitant. Cubilia",
"ultricies habitant. Cubilia curae hac habitasse platea dictumst",
"habitasse platea dictumst lorem ipsum. Faucibus ex sapien",
"Faucibus ex sapien vitae pellentesque sem placerat in.",
"sem placerat in. Tempus leo eu aenean sed",
"eu aenean sed diam urna tempor."
],
"errors": [
{
"index": 48,
"text": "Purusestefficiturlaoreetmaurispharetravestibulumfuscetravestibulumfusce.",
"error": {
"status": 400,
"code": 3003,
"messages": [
"Chunk of size 72 exceeds maximum char limit of 70"
]
}
}
]
}
}
Note: If the overlapped text exceeds the configured maxChars, the affected chunk is reported in the errors array.
Elasticsearch
Uses the Elasticsearch integration to invoke the Elasticsearch API.
Scan Action: scan, search-after
Seed that uses the Search after parameter to retrieve all the documents from an index.
| Feature | Supported |
|---|---|
| Checksum | Yes |
| | No |
{
"type": "elasticsearch",
"name": "My Elasticsearch Scan Action",
"config": {
"action": "search-after",
"index": "my-index", (1)
"sort": [ (1)
{ "field-a": "asc" },
{ "field-b": { "order": "desc" } }
],
"size": 100,
"metadata": false,
"query": {
"match": {
"field-a": {
"query": "value"
}
}
}
},
"pipeline": <Pipeline ID>, (2)
"server": <Elasticsearch Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the ingestion pipelines section. |
| 3 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, Array of Strings or String) The list of Elasticsearch indexes to search on. Can also be configured as a single string if there’s only one index to search.
sort-
(Required, Array of Objects) The list of sort options.
Sort object format
{ "<field>": "<sort_value>" }, or
{ "<field>": { "<sort_option>": "<sort_value>", ... } }
query-
(Optional, Object) The query body for the search request. If not provided, a match all query will be used instead.
size-
(Optional, Integer) The maximum number of hits to return. Defaults to 100.
metadata-
(Optional, Boolean) Whether to include the metadata or not. Defaults to false.
Hook Action: aliases
Hook that executes a native Elasticsearch query to the Aliases API.
{
"type": "elasticsearch",
"name": "My Elasticsearch Hook Action",
"config": {
"action": "aliases",
"actions": [ (1)
{
"add": {
"index": "my-index-1",
"alias": "my-alias-1"
}
},
{
"remove": {
"index": "my-index-2",
"alias": "my-alias-2"
}
}
]
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | This configuration field is required. The exact expected structure of the action object is defined by the Elasticsearch API; these are examples at the time of writing. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
actions-
(Required, Array of Objects) The request body. Each element in the array should represent an alias action.
Note: Currently, if at least one of the actions in the list is successful, the whole request is successful. Conversely, the request fails only if none of them succeeds.
Hook Action: create-index
Hook that executes a native Elasticsearch query to the Create Index API.
{
"type": "elasticsearch",
"name": "My Elasticsearch Hook Action",
"config": {
"action": "create-index",
"index": "my-index", (1)
"body": { (1) (2)
"mappings": {
"properties": {
"field": { "type": "text" }
}
}
},
"waitForActiveShards": 2,
"masterTimeout": "5m",
"timeout": "180s"
},
"server": <Elasticsearch Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | The exact expected structure of the body object is defined by the Elasticsearch API, this is just an example at the time of writing. |
| 3 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index name.
body-
(Required, Object) The request body. The body should represent an index body.
waitForActiveShards-
(Optional, Integer) The number of copies of each shard that must be active before proceeding with the operation.
masterTimeout-
(Optional, String) The period to wait for the master node. The string should represent a duration according to the Elasticsearch API.
timeout-
(Optional, String) The period to wait for a response. The string should represent a duration according to the Elasticsearch API.
Processor Action: bulk, hydrate
Processor that executes a bulk request to the Elasticsearch Bulk API.
| Feature | Supported |
|---|---|
No |
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "hydrate",
"index": "my-index", (1)
"data": "#{ data('/my/record') }",
"allowOverride": true,
"bulk": <Bulk Configuration> (2)
},
"server": <Elasticsearch Server ID> (3)
}
| 1 | This configuration field is required. |
| 2 | See the bulk configuration. |
| 3 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The Elasticsearch index to perform the action.
data-
(Optional, Object) The data to hydrate. If not provided, the component will use the output from the last processor that generated data for each record.
allowOverride-
(Optional, Boolean) Whether to allow overriding an existing document or not. Defaults to true.
bulk-
(Optional, Object) The bulk configuration.
Details
Configuration example
{
  "pipeline": "pipelineID",
  "refresh": "WAIT_FOR",
  "requireAlias": false,
  "routing": "shard-route",
  "timeout": "1m",
  "waitForActiveShards": "1",
  "flush": { (1)
    "maxOperations": 1000,
    "maxConcurrentRequests": 1,
    "maxSize": "5mb",
    "flushInterval": "5s"
  }
}
| 1 | See the flush configuration definition. |
Each configuration field is defined as follows:
pipeline-
(Optional, String) The ID of the Elasticsearch pipeline that’ll be used to preprocess incoming documents.
routing-
(Optional, String) Used to route operations to a specific shard.
waitForActiveShards-
(Optional, String) The number of copies of each shard that must be active before proceeding with the Elasticsearch operation.
timeout-
(Optional, String) The period of time to wait for some operations. The string should represent a duration according to the Elasticsearch API.
requireAlias-
(Optional, Boolean) Whether the request’s actions must target an index alias.
refresh-
(Optional, String) The refresh type. Supported types are: TRUE, FALSE and WAIT_FOR.
Type definitions
- WAIT_FOR: Waits for a refresh to make the Elasticsearch operation visible to search
- TRUE: Refreshes the affected shards to make the Elasticsearch operation visible to search
- FALSE: Do nothing with the refreshes
flush-
(Optional, Object) The flush configuration.
Field definitions
maxOperations-
(Optional, Integer) The maximum number of operations. Defaults to 1000.
maxConcurrentRequests-
(Optional, Integer) The maximum number of concurrent requests waiting to be executed by Elasticsearch. Default is 1.
maxSize-
(Optional, String) The maximum size of the bulk request. Defaults to 5MB.
flushInterval-
(Optional, Duration) The interval between flushes.
Field Mapper
Provides a dynamic way to map fields and apply actions to them in order to customize how data is structured. You can leverage the Expression Language to achieve this.
Processor Action: process
Processor that formats the output data of the records.
| Feature | Supported |
|---|---|
| | No |
{
"type": "fieldmapper",
"name": "My Field Mapper Processor Action",
"config": {
"action": "process",
"output": { (1)
"id": "#{ concat(data('/id'), ' - ', data('/title')) }",
"value": "#{ data('/description') }"
}
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
output-
(Required, Object) The object with the expressions that represent the output of the processor.
The resulting information will be saved in each record’s fieldMapper field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting information in a record would be:
{
"fieldMapper": {
"id": "output-Id",
"value": "outputValue"
}
}
Filesystem crawler
Uses Norconex Filesystem Crawler for the crawling of filesystems to extract their content. Currently, it only supports the SMB protocol.
Scan Action: scan
Seed that crawls a filesystem using basic configurations.
| Feature | Supported |
|---|---|
| Checksum | Custom |
| | No |
| | No |
{
"type": "filesystem",
"name": "My Filesystem Crawler Scan Action",
"config": {
"action": "scan",
"startPaths": ["/my/path"],
"maxDocuments": 0,
"metadataFilters": [
...
],
"referenceFilters": [
...
]
},
"pipeline": <Pipeline ID>, (1)
"server": <SMB server ID> (2)
}
| 1 | See the ingestion pipelines section. |
| 2 | See the SMB integration section. |
Each configuration field is defined as follows:
startPaths-
(Optional, Array of Strings) The list of starting paths to crawl. The values are appended to the servers field value of the SMB server configuration.
maxDocuments-
(Optional, Integer) The maximum number of documents to successfully process. Default is unlimited.
metadataFilters-
(Optional, Array of Objects) The list of filters to apply based on the documents' metadata.
Details
{
"metadataFilters": [
{
"field": "Content-Type",
"values": [
"pdf"
],
"mode": "EXCLUDE"
}
]
}
Each configuration field is defined as follows:
field-
(Required, String) The name of the metadata field.
values-
(Required, Array of Strings) The list of regex values used to filter from the field specified.
mode-
(Optional, String) The mode to define whether the documents are included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.
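The INCLUDE/EXCLUDE semantics can be sketched in Python. This is an illustrative reading of the behavior, not the crawler's code: a document matches when any regex matches the field's value, and the mode decides whether matching documents are kept or dropped.

```python
import re

# Illustrative sketch (not the crawler's actual implementation) of
# metadata filter semantics for the configuration shown above.
def passes_metadata_filter(metadata, flt):
    matched = any(re.search(pattern, metadata.get(flt["field"], ""))
                  for pattern in flt["values"])
    mode = flt.get("mode", "INCLUDE")  # defaults to INCLUDE
    return matched if mode == "INCLUDE" else not matched

flt = {"field": "Content-Type", "values": ["pdf"], "mode": "EXCLUDE"}
print(passes_metadata_filter({"Content-Type": "application/pdf"}, flt))  # False
print(passes_metadata_filter({"Content-Type": "text/html"}, flt))        # True
```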
referenceFilters-
(Optional, Array of Objects) The list of filters to apply based on the documents' reference (i.e. URLs).
Details
{
"referenceFilters": [
{
"type": "EXTENSION",
"filter": "pdf, csv",
"mode": "EXCLUDE",
"caseSensitive": true
}
]
}
Each configuration field is defined as follows:
type-
(Required, String) The type of filter. One of:
-
EXTENSION: Filters by document extension.
-
REGEX: Filters by regular expressions.
filter-
(Required, String) The value of the filter to apply. If the type is EXTENSION, the value must be a comma-separated list of extensions or a single extension. In the case of the REGEX type, the value is the regular expression.
mode-
(Optional, String) The mode to define whether the documents are included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.
caseSensitive-
(Optional, Boolean) Whether the filter is case-sensitive or not.
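Reference filters can be sketched the same way. The matching logic below is an assumption for illustration (including the default of `caseSensitive` being false), not the crawler's code:

```python
import re

# Illustrative sketch of reference filter semantics. Field names follow the
# configuration above; the matching details are assumptions.
def passes_reference_filter(reference, flt):
    value, ref = flt["filter"], reference
    if not flt.get("caseSensitive", False):  # assumed default
        value, ref = value.lower(), ref.lower()
    if flt["type"] == "EXTENSION":
        extensions = [e.strip() for e in value.split(",")]
        matched = any(ref.endswith("." + e) for e in extensions)
    else:  # REGEX
        matched = re.search(value, ref) is not None
    mode = flt.get("mode", "INCLUDE")
    return matched if mode == "INCLUDE" else not matched

flt = {"type": "EXTENSION", "filter": "pdf, csv",
       "mode": "EXCLUDE", "caseSensitive": True}
print(passes_reference_filter("/docs/report.pdf", flt))  # False (excluded)
print(passes_reference_filter("/docs/readme.txt", flt))  # True
```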
When using the SMB protocol, ACL data is extracted for each document processed:
{
"properties": {
"collector": {
"acl": {
"smb": [
{
"domainSid": "",
"ace": "",
"sidAsText": "",
"sid": "",
"type": "",
"accountName": "",
"domainName": "",
"typeAsText": ""
}
]
},
...
},
...
},
...
}
Details
domainSid-
(String) The domain Security Identifier.
ace-
(String) The Access Control Entry.
sidAsText-
(String) The SID as text.
sid-
(String) The Security Identifier.
type-
(String) The ACL type. If the access was allowed, the value is 0. Otherwise, the value is 1.
accountName-
(String) The name of the account.
domainName-
(String) The name of the domain.
typeAsText-
(String) The ACL type as text.
HTML
Uses Jsoup to process HTML files.
Processor Action: select
Processor that retrieves elements that match a CSS selector query from an HTML document.
| Feature | Supported |
|---|---|
No |
{
"type": "html",
"name": "My HTML Processor Select Action",
"config": {
"action": "select",
"file": "#{ file('my-file') }", (1)
"baseUri": "",
"charset": "UTF-8",
"selectors": { (1)
"mySelector": {
"selector": "::text:not(:blank)", (1)
"mode": "NODES"
}
}
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
file-
(Required, File) The HTML file to be processed. Notice that the expected value is not the name of the file but the actual content. You can leverage the Expression Language to do that.
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".
charset-
(Optional, String) The character set of the file’s content. If null, the charset is determined from the http-equiv meta tag if present, or falls back to UTF-8 if not.
selectors-
(Required, Map of String/Object) The set of selector configurations.
Field definitions
selector-
(Required, String) The CSS selector query.
mode-
(Optional, String) The output format of the selection. Either TEXT, HTML or NODES. Defaults to TEXT.
Note: The NODES mode enables the use of Node Pseudo Selectors. The output for this mode depends on the operator used; some operators output text while others output HTML.
The selected text or html will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document <p>Hello World!</p>, the processor’s output in a record would be:
{
"html": {
"mySelector": "Hello World!"
}
}
Processor Action: extract
Processor that extracts and formats tables and description lists from an HTML document.
| Feature | Supported |
|---|---|
No |
{
"type": "html",
"name": "My HTML Processor Extract Action",
"config": {
"action": "extract",
"file": "#{ file('my-file') }", (1)
"baseUri": "",
"charset": "UTF-8",
"table": {
"active": true,
"titles": [
"caption"
]
},
"descriptionList": {
"active": true,
"titles": [
"h1"
]
}
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
file-
(Required, File) The HTML file to be processed. Notice that the expected value is not the name of the file but the actual content. You can leverage the Expression Language to do that.
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".
charset-
(Optional, String) The character set of the file’s content. If null, the charset is determined from the http-equiv meta tag if present, or falls back to UTF-8 if not.
table-
(Optional, Object) The configurations for extracting tables.
Field definitions
active-
(Optional, Boolean) Whether the extractor is active. Defaults to true.
titles-
(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.
descriptionList-
(Optional, Object) The configurations for extracting description lists.
Field definitions
active-
(Optional, Boolean) Whether the extractor is active. Defaults to true.
titles-
(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.
The extracted tables and description lists will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document:
<div>
<div>
<h1>Description List</h1>
<dl>
<dt>Term 1</dt>
<dd>Detail 1</dd>
<dd>Detail 2</dd>
<dt>Term 2</dt>
</dl>
</div>
<table>
<caption>Table</caption>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
<tr>
<td>Data 1</td>
<td>Data 2</td>
</tr>
</table>
</div>
The processor’s output in a record would be:
{
"html": {
"tables": [
{
"title": "Table",
"table": [
[
{
"tag": "header",
"text": "Header 1"
},
{
"tag": "header",
"text": "Header 2"
}
],
[
{
"tag": "data",
"text": "Data 1"
},
{
"tag": "data",
"text": "Data 2"
}
]
]
}
],
"descriptionLists": [
{
"title": "Description List",
"descriptionList": [
{
"term": "Term 1",
"details": [
"Detail 1",
"Detail 2"
]
},
{
"term": "Term 2",
"details": []
}
]
}
]
}
}
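As an illustration of the description-list structure shown above, the term/details pairing can be reproduced with Python's standard-library HTML parser. The real component uses Jsoup on the JVM; this sketch is not the platform's code and skips title detection.

```python
from html.parser import HTMLParser

# Minimal stand-in for the extract action's description-list handling.
class DLExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.entries = []   # [{"term": ..., "details": [...]}]
        self._tag = None    # currently open dt/dd tag, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self._tag = tag
            if tag == "dt":
                # A <dt> starts a new term with an empty details list.
                self.entries.append({"term": "", "details": []})

    def handle_data(self, text):
        if self._tag == "dt" and self.entries:
            self.entries[-1]["term"] += text.strip()
        elif self._tag == "dd" and self.entries:
            self.entries[-1]["details"].append(text.strip())

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._tag = None

parser = DLExtractor()
parser.feed("<dl><dt>Term 1</dt><dd>Detail 1</dd><dd>Detail 2</dd>"
            "<dt>Term 2</dt></dl>")
print(parser.entries)
# [{'term': 'Term 1', 'details': ['Detail 1', 'Detail 2']},
#  {'term': 'Term 2', 'details': []}]
```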
Processor Action: remove
Processor that removes elements that match a CSS selector query from an HTML document.
| Feature | Supported |
|---|---|
No |
{
"type": "html",
"name": "My HTML Processor Remove Action",
"config": {
"action": "remove",
"file": "#{ file('my-file') }", (1)
"baseUri": "",
"charset": "UTF-8",
"selector": "header, footer" (1)
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
file-
(Required, File) The HTML file to be processed. Notice that the expected value is not the name of the file but the actual content. You can leverage the Expression Language to do that.
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".
charset-
(Optional, String) The character set of the file’s content. If null, the charset is determined from the http-equiv meta tag if present, or falls back to UTF-8 if not.
selector-
(Required, String) The CSS selector query.
The remaining HTML will be saved in each record’s html field by default. This field’s name can be overwritten in the processor state configuration. For example, for the processor configured above and an HTML document:
<html>
<head></head>
<body>
<header>header text</header>
<p>body text</p>
<footer>footer text</footer>
</body>
</html>
The processor’s output in a record would be:
{
"html": "<html>\n <head></head>\n <body>\n <p>body text</p>\n </body>\n</html>"
}
Insights
The Insights component is designed to provide various actions that generate different metrics or information for later analysis.
Processor Action: engine-score:non-contextual
Processor that calculates the query score based on the result position metadata field value only.
The engine scoring action is designed to power the Engine Scoring Dashboards by evaluating the quality of a search engine’s results in terms of precision and recall.
| Feature | Supported |
|---|---|
No |
{
"type": "insights",
"name": "My Engine-Score Non-Contextual Processor Action",
"config": {
"action": "engine-score:non-contextual",
"resultPosition": "25", (1)
"kfactor": "0.9",
"startPosition": "1",
"precision": "4"
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
resultPosition-
(Required, Integer) Field containing the position of the search result to be used in the engine scoring calculation.
kfactor-
(Optional, Double) Value between 0 and 1 used to determine the importance of the relevant records. Defaults to 0.9.
startPosition-
(Optional, Integer) Indicates the start position to take into account when doing the K-factor calculation. Defaults to 1.
precision-
(Optional, Integer) Number of digits to return after the decimal point for the score value. Defaults to 4.
The resulting score will be saved in each record’s score field by default. This field’s name can be overwritten in the processor state configuration. For example, given a resulting score of 0.1, the processor’s output in a record would be:
{
"score": 0.1
}
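The reference defines the parameters but not the formula itself. As a purely hypothetical illustration of how resultPosition, kfactor, startPosition and precision could combine, here is a position-discounted score in Python. This is an assumption for intuition only, not the platform's actual precision/recall calculation.

```python
# Hypothetical illustration only: one plausible way the parameters could
# combine. The platform's actual engine-score formula is not documented here.
def engine_score(result_position, kfactor=0.9, start_position=1, precision=4):
    # Discount the score exponentially by how far the result sits
    # beyond the start position, then round to the requested precision.
    return round(kfactor ** (result_position - start_position), precision)

print(engine_score(25))  # a result at position 25 scores far below 1.0
print(engine_score(1))   # a result at the start position scores 1.0
```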
Language Detector
The Language Detector component uses Lingua to identify the language from a specified text input. The languages are referenced using ISO-639-1 (alpha-2 code).
Note: Each time a language model is referenced, it will be loaded in memory. Loading too many languages increases the risk of high memory consumption issues.
Processor Action: process
Processor that detects the language of a provided text.
| Feature | Supported |
|---|---|
No |
{
"type": "language",
"name": "My Language Detector Processor Action",
"config": {
"action": "process",
"text": { (1)
"inputA": "#{ data('/fieldA') }",
"inputB": "#{ data('/fieldB') }"
},
"defaultLanguage": "en",
"minDistance": 0.0,
"supportedLanguages": ["en", "es"]
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
text-
(Required, Object) The text to be evaluated. It can be either a String with a single input, or a Map for multi-input processing.
defaultLanguage-
(Optional, String) Default language to select in case no other is detected. Defaults to en.
minDistance-
(Optional, Double) Distance between the input and the language model. Defaults to 0.0.
supportedLanguages-
(Optional, Array of Strings) List of languages supported by the detector. At least 2 supported languages must be set. Defaults to ["en", "es"].
The output of the processor will be saved in the record’s language field by default. This field’s name can be overwritten in the processor state configuration. For example, a single String input produces a plain language code, while a multi-input Map produces an object keyed by input name:
{
"language": "es"
}
{
"language": {
"inputA": "es",
"inputB": "en"
}
}
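The single-input and multi-input behaviors, including the defaultLanguage fallback, can be sketched with a stub detector standing in for Lingua. The heuristic below is a placeholder, not real language detection:

```python
# Sketch of the action's input handling: a String input yields a single
# code, a Map yields one code per entry; detection falls back to
# defaultLanguage. The stub detector is a stand-in for Lingua.
def stub_detect(text):
    # Toy heuristic standing in for a real language model.
    return "es" if "hola" in text.lower() else None

def detect(text, default_language="en"):
    if isinstance(text, dict):
        return {name: detect(value, default_language)
                for name, value in text.items()}
    return stub_detect(text) or default_language

print(detect("hola mundo"))                           # es
print(detect({"inputA": "hola", "inputB": "hello"}))  # {'inputA': 'es', 'inputB': 'en'}
```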
LDAP
Performs a search on an LDAP directory server, such as retrieving users and groups.
Scan Action: scan
Seed that runs a search in the directory and creates a record for each entry returned.
| Feature | Supported |
|---|---|
Checksum |
|
Yes |
|
No |
{
"type": "ldap",
"name": "My LDAP Scan Action",
"config": {
"action": "scan",
"baseDN": "ou=Users,dc=my-domain,dc=com", (1)
"filter":"(objectClass=inetOrgPerson)", (1)
"projection": <DSL Projection>, (2)
"pageSize": 500
},
"pipeline": <Pipeline ID>, (3)
"server": <LDAP Server ID> (4)
}
| 1 | These fields are required. |
| 2 | See the DSL Projection section. |
| 3 | See the ingestion pipelines section. |
| 4 | See the LDAP integration section. |
Each configuration field is defined as follows:
baseDN-
(Required, String) The base distinguished name for the search request.
filter-
(Required, String) The string representation of the filter to use to identify matching entries.
projection-
(Optional, DSL Projection) The projection to apply to the search’s attributes.
pageSize-
(Optional, Integer) The number of entries that will be fetched using pagination. Defaults to 1000. Some LDAP servers do not support pagination; in that case, it is recommended to retrieve every desired entry at once if possible.
MongoDB
Performs different actions on MongoDB collections, either reading or writing data depending on the action.
Scan Action: scan
Seed that finds all documents in a MongoDB collection and creates a record for each document found.
| Feature | Supported |
|---|---|
Checksum |
|
Yes |
|
No |
{
"type": "mongo",
"name": "My Mongo Scan Action",
"config": {
"action": "scan",
"database": "my-database", (1)
"collection": "my-collection", (1)
"filter": { ... },
"projection": { ... },
"sort": [ ... ],
"size": 100, (1)
"fields": {
"id": "_id",
"token": "_id"
}
},
"pipeline": <Pipeline ID>, (2)
"server": <MongoDB Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the ingestion pipelines section. |
| 3 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database to connect to.
collection-
(Required, String) The collection whose documents are turned into records.
filter-
(Optional, DSL Filter) The filter to apply to the request.
projection-
(Optional, DSL Projection) The projection to apply to the response.
sort-
(Optional, DSL Sort) The sort to apply to the request.
size-
(Required, Integer) The size for pagination. Defaults to 100.
fields-
(Optional, Object) The name of the control fields from the source collection.
Details
id-
(Optional, String) The name of the field with the ID of the record. Defaults to _id.
token-
(Optional, String) The name of the field with the pagination token of the collection. Defaults to _id.
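The token field drives incremental pagination: each page resumes after the last token value seen. A minimal in-memory sketch of that idea (assumed semantics; MongoDB itself would apply the sort and filter server-side):

```python
# Illustration of token-based pagination: each page is fetched sorted by
# the token field, resuming after the last token seen. An in-memory list
# stands in for the MongoDB collection.
def scan(collection, token_field="_id", size=2):
    last = None
    while True:
        page = sorted(
            (d for d in collection if last is None or d[token_field] > last),
            key=lambda d: d[token_field],
        )[:size]
        if not page:
            break
        yield from page
        last = page[-1][token_field]  # pagination token for the next page

docs = [{"_id": i, "name": f"doc-{i}"} for i in (3, 1, 2, 5, 4)]
print([d["_id"] for d in scan(docs)])  # [1, 2, 3, 4, 5]
```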
Processor Action: bulk, hydrate
Processor that stores the records in the pipeline via Bulk Write operations on the specified MongoDB collection.
| Feature | Supported |
|---|---|
No |
{
"type": "mongo",
"name": "My Mongo Processor Action",
"config": {
"action": "bulk",
"database": "my-database", (1)
"collection": "my-collection", (1)
"allowOverride": true,
"data": "#{ data('/my/record') }",
"flush": <Bulk Flush Configuration> (2)
},
"server": <MongoDB Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the flush configuration. |
| 3 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database to connect to.
collection-
(Required, String) The collection where the records are bulk written.
allowOverride-
(Optional, Boolean) Whether a record should be stored if one already exists with its ID. Defaults to true.
data-
(Optional, Object) The data to store on the collection. If not provided, the component will use the output from the last processor that generated data for each record.
flush-
(Optional, Object) The flush configuration.
Details
Configuration example
{
"maxCount": 25,
"maxWeight": 25,
"flushAfter": "PT5M"
}
Each configuration field is defined as follows:
maxCount-
(Optional, Integer) The maximum number of records in the bulk before flushing.
maxWeight-
(Optional, Long) The maximum weight allowed in a bulk request.
flushAfter-
(Optional, Duration) The time to wait before flushing a bulk request.
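The three flush thresholds can be pictured as a small buffer that sends a bulk write whenever any of them is crossed. The triggering logic below is an illustrative assumption, not the component's implementation; flushAfter is an ISO-8601 duration (PT5M is 300 seconds).

```python
import time

# Sketch of the bulk flush semantics: buffer records and flush when any
# threshold (count, weight, or elapsed time) is crossed.
class BulkBuffer:
    def __init__(self, max_count=25, max_weight=25_000, flush_after=300.0):
        self.max_count, self.max_weight = max_count, max_weight
        self.flush_after = flush_after        # seconds (PT5M = 300s)
        self.records, self.weight = [], 0
        self.started = time.monotonic()
        self.flushed = []                     # batches "sent" so far

    def add(self, record, weight):
        self.records.append(record)
        self.weight += weight
        if (len(self.records) >= self.max_count
                or self.weight >= self.max_weight
                or time.monotonic() - self.started >= self.flush_after):
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(self.records)  # stand-in for a bulk write
        self.records, self.weight = [], 0
        self.started = time.monotonic()

buf = BulkBuffer(max_count=3, max_weight=10_000)
for i in range(7):
    buf.add({"id": i}, weight=100)
buf.flush()  # flush the remainder at end of input
print([len(batch) for batch in buf.flushed])  # [3, 3, 1]
```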
OCR
Performs an Optical Character Recognition (OCR) operation over files to extract text from non-searchable or image-based PDFs.
Processor Action: process
| Feature | Supported |
|---|---|
No |
{
"type": "ocr",
"name": "My OCR Processor Action",
"config": {
"action": "process",
"file": "#{ file('my-file', 'BYTES') }", (1)
"languages": ["EN"],
"pageSegmentationMode": 3
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
file-
(Required, File) The file to extract the text from. Notice that the expected value is not the name of the file but the actual content. You can leverage the Expression Language to do that.
languages-
(Optional, Array of Strings) The list of languages that the OCR operation should support. The languages are two-letter codes as described by the ISO 639-1 standard. Defaults to ["EN"].
pageSegmentationMode-
(Optional, Integer) The page segmentation mode. See Tesseract Page Segmentation Method for the list of available modes. Defaults to 3.
The recognized text will be saved in each record’s ocr field by default. This field’s name can be overwritten in the processor state configuration. For example, given a document with the Hello World! phrase in it, the processor’s output in a record would be:
{
"ocr": "Hello World!"
}
OpenAI
Uses the OpenAI integration to send requests to OpenAI. Additionally, supports text trimming based on OpenAI models' tokenizing and token limits, by integrating the tiktoken library.
Processor Action: chat-completion
Processor that executes a chat completion request to OpenAI API.
| Feature | Supported |
|---|---|
No |
{
"type": "openai",
"name": "My Chat Completion Action",
"config": {
"action": "chat-completion",
"model": "openai-model", (1)
"messages": [ (1) (2)
{"role": "system", "content": "You are a helpful assistant" },
{"role": "user", "content": "Hi!" },
{"role": "assistant", "content": "Hi, how can I assist you today?" }
],
"promptCacheKey": "pureinsights",
"frequencyPenalty": 0.0,
"presencePenalty": 0.0,
"temperature": 1,
"topP": 1,
"n": 1,
"maxTokens": 2048,
"stop": [],
"responseFormat": <Response format configuration> (3)
},
"server": <OpenAI Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the messages configuration definition. |
| 3 | See the response format configuration. |
| 4 | See the OpenAI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The OpenAI model to use.
messages-
(Required, Array of Objects) The list of messages for the request.
Field definitions
role-
(Required, String) The role of the message. Must be one of system, user or assistant.
content-
(Required, String) The content of the message.
name-
(Optional, String) The name of the author of the message.
promptCacheKey-
(Optional, String) Value used by OpenAI to cache responses for similar requests to optimize the cache hit rates.
frequencyPenalty-
(Optional, Double) Positive values penalize new tokens based on their existing frequency in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.
presencePenalty-
(Optional, Double) Positive values penalize new tokens based on whether they appear in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.
temperature-
(Optional, Double) Sampling temperature to use. Value must be between 0 and 2. Defaults to 1.
Note: It’s generally recommended to alter either this or the topP field, but not both.
topP-
(Optional, Double) An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. Defaults to 1.
Note: It’s generally recommended to alter either this or the temperature field, but not both.
n-
(Optional, Integer) How many chat completion choices to generate for each input message. Defaults to 1.
maxTokens-
(Optional, Integer) The maximum number of tokens to generate in the chat completion. Defaults to 2048.
stop-
(Optional, Array of String) Up to 4 sequences where the API will stop generating further tokens.
responseFormat-
(Optional, Object) An object specifying the format that the model must output. Learn more in OpenAI’s Structured Outputs guide.
Details
Configuration example
{
"type": "json_schema", (1)
"json_schema": { (2)
"name": "a name for the schema",
"strict": true,
"schema": { (3)
"type": "object",
"properties": {
"equation": { "type": "string" },
"answer": { "type": "string" }
},
"required": ["equation", "answer"],
"additionalProperties": false
}
}
}
| 1 | The response type is always required. |
| 2 | JSON schemas can only be used, and are required, with json_schema types of response formats. |
| 3 | The exact expected structure of the schema object is defined by the OpenAI API; this is just an example at the time of writing. |
Each configuration field is defined as follows:
type-
(Required, String) The type of response format being defined. Allowed values: text, json_schema and json_object.
json_schema-
(Optional, Object) Structured Outputs configuration options, including a JSON Schema. This field can only be used, and is in fact required, with response formats of the json_schema type. See OpenAI’s response formats definitions for more details.
Field definitions
name-
(Required, String) The name of the response format. Must contain only a-z, A-Z, 0-9, underscores and dashes, with a maximum length of 64.
description-
(Optional, String) A description of what the response format is for, used by the model to determine how to respond in the format.
schema-
(Optional, Object) The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.
strict-
(Optional, Boolean) Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Defaults to false.
The resulting chat completion will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting chat completion in a record would be:
{
"openai": {
"created": "2025-07-24T15:38:59Z",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "{\"content\":\"Let's solve the equation step by step:\\n\\n1. Start with: 8x + 31 = 2\\n2. Subtract 31 from both sides: 8x = 2 - 31\\n3. Simplify: 8x = -29\\n4. Divide both sides by 8: x = -29/8\\n\\nSo the solution is:\\n\\nx = -29/8\",\"next\":\"Do you have any other equations you want to solve, or would you like to see this as a decimal?\"}"
},
"finish_reason": "stop"
}
],
"model": "gpt-4.1-2025-04-14",
"usage": {
"prompt_tokens": 61,
"completion_tokens": 116,
"total_tokens": 177
}
}
}
Processor Action: embeddings
Processor that executes embedding requests to the OpenAI API.
| Feature | Supported |
|---|---|
No |
{
"type": "openai",
"name": "My OpenAI Processor Action",
"config": {
"action": "embeddings",
"model": "openai-model", (1)
"input": "#{ data('/my/input') }", (1)
"user": "pureinsights",
"flush": <Bulk Flush Configuration> (2)
},
"server": <OpenAI Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the flush configuration. |
| 3 | See the OpenAI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The OpenAI model to use.
input-
(Required, String) The input to generate the embeddings.
user-
(Optional, String) A unique identifier representing the end-user.
flush-
(Optional, Object) The flush configuration.
Details
Configuration example
{
"maxCount": 25,
"maxWeight": 25,
"flushAfter": "PT5M"
}
Each configuration field is defined as follows:
maxCount-
(Optional, Integer) The maximum number of records in the bulk before flushing.
maxWeight-
(Optional, Long) The maximum weight allowed in a bulk request.
flushAfter-
(Optional, Duration) The time to wait before flushing a bulk request.
The resulting embeddings will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:
{
"openai": [-0.006929283495992422, -0.005336422007530928, ...]
}
Processor Action: trim
Processor that trims a given text based on an OpenAI model’s tokenizing and either its own token limit, or a custom one.
| Feature | Supported |
|---|---|
No |
{
"type": "openai",
"name": "My OpenAI Trim Action",
"config": {
"action": "trim",
"text": "#{ data(\"/text\") }", (1)
"model": "gpt-5.2", (2)
"tokenLimit": 5 (2)
}
}
| 1 | This configuration field is required. |
| 2 | At least one of these configuration fields is required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to trim.
model-
(Optional, String) The OpenAI model whose encoding is used to tokenize the text and whose token limit determines whether to truncate the text. If a custom token limit is defined, it’ll override the model’s. If no model is provided, a default o200k_base encoding will be used.
Note: In order to determine the encoding and token limit for models used in chat completion requests, the processor will only take into account the models' "version" when trimming. In this context, "version" translates to the ChatGPT version used, such as gpt-5.2, gpt-4.1, o4, etc. This means that a model field configured as gpt-5.2-2025-12-11 will result in the processor only taking into account the gpt-5.2 version and ignoring the rest. Consequently, non-existent models such as o4-mini-thismodeldoesntexist are considered valid for this action as long as the model version can be inferred.
tokenLimit-
(Optional, Integer) The positive integer used as token limit when determining whether to truncate the encoded text or not. If defined, it’ll override the provided model’s token limit, if any.
The result of the trimming process will be saved in each record’s openai field by default. This field’s name can be overwritten in the processor state configuration. Given the example configuration shown above, with a text input of The brown fox jumps over the lazy dog, the trim result would be saved as:
{
"openai": {
"text": "The brown fox jumps over",
"size": 24,
"tokens": 5,
"truncated": true,
"remainder": [
" the",
" lazy",
" dog"
]
}
}
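The trim output shape above can be reproduced with a toy whitespace-preserving tokenizer standing in for tiktoken's encoders. Real model encodings split text differently; this sketch only mirrors the documented example:

```python
import re

# Toy stand-in for the trim action: tiktoken's encoder is replaced with a
# whitespace-preserving word splitter, which happens to reproduce the
# documented example output.
def trim(text, token_limit):
    tokens = re.findall(r"\s*\S+", text)          # words keep leading spaces
    kept, remainder = tokens[:token_limit], tokens[token_limit:]
    trimmed = "".join(kept)
    return {
        "text": trimmed,
        "size": len(trimmed),
        "tokens": len(kept),
        "truncated": bool(remainder),
        "remainder": remainder,
    }

result = trim("The brown fox jumps over the lazy dog", 5)
print(result["text"])       # The brown fox jumps over
print(result["remainder"])  # [' the', ' lazy', ' dog']
```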
OpenSearch
Uses the OpenSearch integration to invoke the OpenSearch API.
Hook Action: aliases
Hook that executes a native OpenSearch query to the Aliases API.
{
"type": "opensearch",
"name": "My OpenSearch Hook Action",
"config": {
"action": "aliases",
"actions": [ (1)
{
"add": {
"index": "my-index-1",
"alias": "my-alias-1"
}
},
{
"remove": {
"index": "my-index-2",
"alias": "my-alias-2"
}
}
],
"clusterManagerTimeout": "30s",
"timeout": "30s"
},
"server": <OpenSearch Server ID> (2)
}
| 1 | This configuration field is required. The exact expected structure of the action object is defined by the OpenSearch API; this is just an example at the time of writing. |
| 2 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
actions-
(Required, Array of Objects) Set of actions to perform on the index. Each element in the array should represent an alias action.
clusterManagerTimeout-
(Optional, String) The amount of time to wait for a response from the cluster manager node. The string should contain an OpenSearch API time unit.
timeout-
(Optional, String) The amount of time to wait for a response from the cluster. The string should contain an OpenSearch API time unit.
Hook Action: create-index
Hook that executes a native OpenSearch query to the Create Index API.
{
"type": "opensearch",
"name": "My OpenSearch Hook Action",
"config": {
"action": "create-index",
"index": "my-index", (1)
"body": { (1) (2)
"mappings": {
"properties": {
"age": {
"type": "integer"
}
}
}
}
},
"server": <OpenSearch Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | The exact expected structure of the body object is defined by the OpenSearch API; this is just an example at the time of writing. |
| 3 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index name.
body-
(Required, Object) The request body. The body should represent an index body.
waitForActiveShards-
(Optional, Integer) The number of active shards that must be available before OpenSearch processes the request.
clusterManagerTimeout-
(Optional, String) The amount of time to wait for a response from the cluster manager node. The string should contain an OpenSearch API time unit.
timeout-
(Optional, String) The amount of time to wait for a response from the cluster. The string should contain an OpenSearch API time unit.
Processor Action: bulk, hydrate
Processor that executes a bulk request to the OpenSearch Bulk API.
| Feature | Supported |
|---|---|
No |
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "hydrate",
"index": "my-index", (1)
"data": "#{ data('/my/record') }",
"bulk": <Bulk Configuration>, (2)
"flush": <Bulk Flush Configuration> (3)
},
"server": <OpenSearch Server ID> (4)
}
| 1 | This configuration field is required. |
| 2 | See the bulk configuration. |
| 3 | See the flush configuration. |
| 4 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The OpenSearch index to perform the action.
data-
(Optional, Object) The data to hydrate. If not provided, the component will use the output from the last processor that generated data for each record.
bulk-
(Optional, Object) The bulk configuration.
Details
Configuration example
{
"pipeline": "pipelineID",
"refresh": "WAIT_FOR",
"requireAlias": false,
"routing": "shard-route",
"waitForActiveShards": "1",
"timeout": "PT1M"
}
Each configuration field is defined as follows:
pipeline-
(Optional, String) The ID of the OpenSearch Pipeline that’ll be used to preprocess incoming documents.
refresh-
(Optional, String) The refresh type. Supported values are TRUE, FALSE and WAIT_FOR.
Type definitions
WAIT_FOR: Waits for a refresh to make the OpenSearch operation visible to search.
TRUE: Refreshes the affected shards to make the OpenSearch operation visible to search.
FALSE: Does nothing with the refreshes.
requireAlias-
(Optional, Boolean) Whether the request’s actions must target an index alias.
routing-
(Optional, String) Used to route operations to a specific shard.
waitForActiveShards-
(Optional, String) The number of copies of each shard that must be active before proceeding with the OpenSearch operation.
timeout-
(Optional, String) The period of time to wait for some operations. The string should contain an OpenSearch API time unit.
flush-
(Optional, Object) The flush configuration.
Details
Configuration example
{
"maxCount": 25,
"maxWeight": 25,
"flushAfter": "PT5M"
}
Each configuration field is defined as follows:
maxCount-
(Optional, Integer) The maximum number of records in the bulk before flushing.
maxWeight-
(Optional, Long) The maximum weight allowed in a bulk request.
flushAfter-
(Optional, Duration) The time to wait before flushing a bulk request.
Oracle Database
Performs different actions on an Oracle database, mainly reading from a table.
Scan Action: scan
Seed that runs an SQL query and creates a record for each row returned from the database.
| Feature | Supported |
|---|---|
Checksum |
|
Yes |
|
No |
{
"type": "oracledb",
"name": "My Oracle DB Scan Action",
"config": {
"action": "scan",
"sql": "SELECT * FROM my_table", (1)
"pageSize": 500
},
"pipeline": <Pipeline ID>, (2)
"server": <Oracle Database Server ID> (3)
}
| 1 | This field is required. |
| 2 | See the ingestion pipelines section. |
| 3 | See the Oracle Database integration section. |
Each configuration field is defined as follows:
pageSize-
(Optional, Integer) The number of records that will be fetched using pagination. Defaults to 1000.
sql-
(Required, String) The SQL query that will be executed. This component supports processing records with pagination through the use of the offset and pageSize variables, which can be defined in the SQL query using the Mustache format. For example:
SELECT record_id AS id, name FROM my_table WHERE field = 'condition' OFFSET {{offset}} ROWS FETCH NEXT {{pageSize}} ROWS ONLY
This SQL query retrieves the number of records specified by the pageSize variable, and the offset value ensures that previously processed records are skipped. Currently, pageSize and offset are the only supported variables, and they are not required. To fetch new documents, a new Scan job is created with the updated values for offset and pageSize.
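The Mustache-style substitution the component performs on each scan page can be sketched as plain string replacement (a simplification of a real Mustache renderer):

```python
# Sketch of the pagination rendering described above: {{offset}} and
# {{pageSize}} are substituted before each scan page runs, and successive
# scan jobs advance the offset by the page size.
SQL = ("SELECT record_id AS id, name FROM my_table "
       "OFFSET {{offset}} ROWS FETCH NEXT {{pageSize}} ROWS ONLY")

def render(sql, offset, page_size):
    return (sql.replace("{{offset}}", str(offset))
               .replace("{{pageSize}}", str(page_size)))

for page in range(2):
    print(render(SQL, offset=page * 500, page_size=500))
```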
The Oracle data types supported are the following:
Details
-
CHAR -
VARCHAR2 -
NCHAR -
NVARCHAR2 -
NUMBER -
FLOAT -
BINARY_FLOAT -
BINARY_DOUBLE -
BOOLEAN -
DATE -
TIMESTAMP -
TIMESTAMP WITH TIME ZONE -
TIMESTAMP WITH LOCAL TIME ZONE -
INTERVAL DAY TO SECOND -
INTERVAL YEAR TO MONTH -
ROWID -
RAW and BLOB: When serialized into JSON, these become a string encoded in Base64. -
CLOB -
NCLOB -
Collection (VARRAY or nested table)
The following data types are not supported and require a different approach:
Details
-
BFILE: The BFILE's content should be read within the database and returned as a LOB in the results. -
LONG and LONG RAW: The use of these data types is not recommended. They should be converted to a LOB. -
Oracle Object and Object reference: The object’s fields should be dereferenced and included as separate columns in the results. See the DEREF function to get an object reference’s fields.
-
REF CURSOR: Not currently supported.
If a data type is not included in this list and cannot be converted to a regular Java data type, it is most likely not supported.
NOTE: By default, the record’s ID will be obtained from the field named "ID", case-insensitive. This is why, in the pagination example, record_id is selected with the alias id. This can be overridden with the field recordPolicy.outboundPolicy.idPolicy.generator in the seed’s configuration. See the Data Seed section.
Random
Generates records with random content.
Scan Action: plain, scan
Seed that generates a predetermined amount of records with a single text field whose value is random text of random length within the given range.
| Feature | Supported |
|---|---|
|  | Yes |
|  | No |
{
"type": "random",
"name": "My Random Plain Action",
"config": {
"action": "plain",
"records": 1000, (1)
"charsPerRecord": 50 (1)
},
"pipeline": <Pipeline ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the ingestion pipelines section. |
Each configuration field is defined as follows:
records-
(Required, Integer) The amount of records to generate.
charsPerRecord-
(Required, Object or Integer) If configured as a single positive integer, the set amount of random characters each record will have in its text field. This value, if given, should be between 1 and 2147483647. If a randomly chosen amount of characters within a range is wanted instead, this field can also be configured as an object, as shown next:
Details
{ "charsPerRecord": { "min": 10, "max": 1000 } }
min-
(Required, Integer) The minimum amount of characters that a record may have, inclusive. Must be higher than 1.
max-
(Required, Integer) The maximum amount of characters that a record may have, inclusive. Must be higher than min and lower than 2147483647.
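The two accepted shapes of charsPerRecord can be mirrored in a short sketch (an illustrative model of the seed's semantics, not its implementation):

```python
import random
import string

def generate_records(records, chars_per_record):
    """Illustrative mirror of the random seed: chars_per_record is either an
    int (fixed length) or a {"min": ..., "max": ...} object (inclusive range)."""
    out = []
    for _ in range(records):
        if isinstance(chars_per_record, dict):
            length = random.randint(chars_per_record["min"], chars_per_record["max"])
        else:
            length = chars_per_record
        text = "".join(random.choices(string.ascii_letters + " ", k=length))
        out.append({"text": text})
    return out

fixed = generate_records(1000, 50)                      # every record: 50 chars
ranged = generate_records(10, {"min": 10, "max": 1000}) # random length per record
print(len(fixed), len(fixed[0]["text"]))  # 1000 50
```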
Script
Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging.
Processor Action: process
Processor that executes a script that interacts with the record data generated from a seed execution.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "script",
"name": "My Script Processor Action",
"config": {
"action": "process",
"language": "groovy",
"script": <Script> (1)
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
language-
(Optional, String) The language of the script. Must be one of the supported script languages. Defaults to groovy.
script-
(Required, String) The script to run.
Any output set in the output() object during the script execution will be saved in the record’s script field by default. This field’s name can be overwritten in the processor state configuration. For example, if a script runs a single output().put("field", "value") instruction, it’ll be saved in the records as:
{
"script": {
"field": "value"
}
}
SharePoint Online
Crawls information from Sites, Lists and ListItems from SharePoint Online.
Scan Action: scan
Seed that crawls the sites information from a SharePoint tenant. It crawls the parent site, its subsites, lists and list items, including the files and attachments for list items.
NOTE: Calendar Lists currently only crawl present and future events. Past events are not retrieved.
| Feature | Supported |
|---|---|
| Checksum | Yes |
|  | Yes |
{
"type": "sharepoint",
"name": "My SharePoint Online Scan Action",
"config": {
"action": "scan",
"sites": [ (1)
"/sites/my-site",
"/sites/my-other-site",
"/sites/my-parent-site/my-subsite"
],
"checkNoCrawl": true
},
"pipeline": <Pipeline ID>, (2)
"server": <SharePoint Online Server ID> (3)
}
| 1 | This configuration field is required. |
| 2 | See the ingestion pipelines section. |
| 3 | See the SharePoint Online integration section. |
Each configuration field is defined as follows:
sites-
(Required, Array of Strings) The list of sites to crawl. The URL must be of the form: /sites/*.
checkNoCrawl-
(Optional, Boolean) If false, the noCrawl flag on sites and lists is ignored. Otherwise, any site or list marked as noCrawl will not be processed. Defaults to true.
Staging
Interacts with buckets and content from Discovery Staging.
Scan Action: scan, scroll
Seed that scrolls through a bucket and creates the records to be ingested into the pipeline.
| Feature | Supported |
|---|---|
| Checksum | Yes |
|  | No |
{
"type": "staging",
"name": "My Staging Scan Action",
"config": {
"action": "scroll",
"bucket": "my-bucket", (1)
"metadata": false,
"size": 35,
"filter": <DSL Filter>, (2)
"projection": <DSL Projection>, (3)
"parentId": <Staging Document ID>
},
"pipeline": <Pipeline ID> (4)
}
| 1 | This configuration field is required. |
| 2 | See the DSL Filter section. |
| 3 | See the DSL Projection section. |
| 4 | See the ingestion pipelines section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket to scroll.
metadata-
(Optional, Boolean) Whether to include the metadata or not. Defaults to false.
size-
(Optional, Integer) The size of the contents result.
filter-
(Optional, DSL Filter) The filter to apply when scrolling.
projection-
(Optional, DSL Projection) The projection to apply when scrolling.
parentId-
(Optional, String) The parent ID to match.
Hook Action: create-bucket
Hook that creates a bucket with the given configuration.
{
"type": "staging",
"name": "My Staging Hook Action",
"config": {
"action": "create-bucket",
"bucket": "new-bucket", (1)
"config": {},
"indices": [ (2)
{
"name": "my-index-1",
"fields": [ (3)
{ "fieldA": "ASC" },
{ "fieldB": "DESC" }
]
},
{
"name": "my-index-2",
"fields": [
{ "fieldC": "DESC" }
]
}
]
}
}
| 1 | This configuration field is required. |
| 2 | See the index definition. |
| 3 | See the index field definition. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket name.
config-
(Optional, Object) The bucket configuration.
indices-
(Optional, Array of Objects) The list of indices for the bucket.
Field definitions
name-
(Required, String) The index name.
fields-
(Required, Array of Objects) The index fields. Key/value pairs with the field name and the corresponding sort ordering, either ASC or DESC.
Field object format:
{ "<field>": "<ASC or DESC>" }
Hook Action: delete-many
Hook that deletes multiple documents from a bucket.
{
"type": "staging",
"name": "My Staging Hook Action",
"config": {
"action": "delete-many",
"bucket": "my-bucket", (1)
"parentId": <Staging Document ID>,
"filter": <DSL Filter> (2)
}
}
| 1 | This configuration field is required. |
| 2 | See the DSL Filter section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket whose documents are deleted.
parentId-
(Optional, String) Documents with this parent document will be deleted. If the filter field is also present, then only documents that have this parent document and pass the filter are deleted.
filter-
(Optional, DSL Filter) Documents that pass this filter will be deleted. If the parentId field is also present, then only documents that pass the filter and have the configured parent document are deleted.
NOTE: Either the parentId or the filter field must be provided.
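The way parentId and filter combine can be modeled with a toy predicate; the helper below is hypothetical, and only the conjunction semantics come from the definitions above:

```python
def should_delete(doc, parent_id=None, filter_fn=None):
    """Toy model of delete-many: with both parentId and filter configured,
    a document must satisfy both conditions to be deleted. With neither
    configured, nothing is deleted."""
    if parent_id is None and filter_fn is None:
        return False
    if parent_id is not None and doc.get("parentId") != parent_id:
        return False
    if filter_fn is not None and not filter_fn(doc):
        return False
    return True

docs = [
    {"id": 1, "parentId": "p1", "status": "stale"},
    {"id": 2, "parentId": "p1", "status": "fresh"},
    {"id": 3, "parentId": "p2", "status": "stale"},
]
is_stale = lambda d: d["status"] == "stale"
deleted = [d["id"] for d in docs if should_delete(d, parent_id="p1", filter_fn=is_stale)]
print(deleted)  # [1] — only document 1 matches both conditions
```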
Processor Action: store, hydrate
Processor that stores the records into the given bucket.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "staging",
"name": "My Staging Processor Action",
"config": {
"action": "store",
"bucket": "my-bucket", (1)
"data": "#{ data('/my/record') }"
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket where the documents will be stored.
parentId-
(Optional, String) The parent ID of the documents to store.
data-
(Optional, Object) The data to store on the bucket. If not provided, the component will use the output from the last processor that generated data for each record.
metadata-
(Optional, Boolean) Whether to store the metadata from the saved documents or not. If
true, the metadata will be included as part of the execution data of each record. Defaults tofalse.
Template
Uses the Template Engine to process dynamic data provided by the user to generate a text output based on a custom template.
Processor Action: process
Processor that processes the provided template with the defined configuration.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "template",
"name": "My Template Processor Action",
"config": {
"action": "process",
"template": "Hello, ${name}!", (1)
"bindings": { (1)
"name": "John"
},
"outputFormat": "PLAIN"
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
template-
(Required, String) The template to process.
bindings-
(Required, Object) The bindings to replace in the template.
Binding object format
{ "bindingA": "#{ data('/my/binding/field') }", ... }
Each binding, defined as a key in the object, can be later referenced in a template:
My bindingA value is ${bindingA}
outputFormat-
(Optional, String) The output format of the processed template. Supported formats are JSON and PLAIN. Defaults to PLAIN.
The output of the processor will be saved in the record’s template field by default. This field’s name can be overwritten in the processor state configuration. For example, given a processor with a PLAIN output format:
{
"template": "plain value"
}
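The binding substitution can be approximated with Python's string.Template, which happens to share the ${...} placeholder syntax; this is an analogy only, not the Template Engine itself:

```python
from string import Template

template = "Hello, ${name}!"
bindings = {"name": "John"}

# With outputFormat PLAIN, the rendered text lands under the record's
# default output field, "template".
record = {"template": Template(template).substitute(bindings)}
print(record)  # {'template': 'Hello, John!'}
```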
Tika
Integrates with the Apache Tika library to detect and extract content and metadata from various file types. Tika supports a wide range of document formats and provides a consistent interface for text and metadata extraction.
Processor Action: parse, process
Processor that parses plain text and XHTML content from an input file using the Apache Tika Java API. It supports custom Tika configuration files and can include metadata in the output. This action is suitable for content extraction workflows that need to handle diverse file formats.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "tika",
"name": "Tika Text Parser",
"config": {
"action": "process",
"file": "#{ file('file.txt', 'BYTES') }", (1)
"metadata": <Metadata>, (3)
"config": "#{ file('tikaConfig.xml') }",
"outputFormat": "plain" (1) (2)
}
}
| 1 | This configuration field is required. |
| 2 | Must be either plain or xhtml. |
| 3 | See the metadata configuration. |
Each configuration field is defined as follows:
file-
(Required, InputStream) The file to parse using Tika.
outputFormat-
(Required, String) Defines the format of the parsed output. Must be either
plain(plain text output) orxhtml(XHTML content output).
metadata-
(Optional, Object)
Details
Configuration example:
{ "input": { (2) "keyA": "valueA" }, "output": true (1) }
| 1 | This configuration field is required. |
| 2 | See the metadata input format. |
Each configuration field is defined as follows:
input-
(Required, Map of String/Object) A set of key-value pairs passed as input metadata to Tika.
Details
Configuration example: { "Accept-Encoding": "gzip, deflate, br" }
output-
(Required, Boolean) If set to true, Tika’s output metadata will be included in the result.
config-
(Optional, InputStream) The Tika configuration XML, as defined in the official Tika configuration documentation.
The resulting parsed content and optional metadata will be saved in each record’s tika field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting Tika output in a record would be:
{
"tika": {
"plain": "The Universe:\nThe universe is all of space and time and their contents.\nThe portion of the universe that can be seen by humans is approximately 93 billion light-years in diameter at present, but the total size of the universe is not known.\n",
"metadata": {
"keyA": "valueA",
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.csv.TextAndCSVParser"
],
"X-TIKA:Parsed-By-Full-Set": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.csv.TextAndCSVParser"
],
"Content-Encoding": "ISO-8859-1",
"X-TIKA:detectedEncoding": "ISO-8859-1",
"X-TIKA:encodingDetector": "UniversalEncodingDetector",
"Content-Type": "text/plain; charset=ISO-8859-1"
}
}
}
{
"tika": {
"file": {
"@discovery": "object",
"bucket": "ingestion",
"key": "ingestion-4049a2dd-17ca-440c-893b-cddfd817e45f/6b86b273ff34fce19d6b804eff5a3f5747ada4eaa22f1d49c01e52ddb7875b4b/f63067a4-ebc1-403b-a7a3-57b8a93a6d28/b1e92337-9ae0-427f-81e6-724e96af0970"
},
"metadata": {
"keyA": "valueA",
"X-TIKA:Parsed-By": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.JSoupParser"
],
"dc:title": "Rooster",
"author": "Wikipedia contributors",
"X-TIKA:Parsed-By-Full-Set": [
"org.apache.tika.parser.DefaultParser",
"org.apache.tika.parser.html.JSoupParser"
],
"Content-Encoding": "ISO-8859-1",
"dc:creator": "Wikipedia contributors",
"Content-Type-Hint": "text/html; charset=UTF-8",
"X-TIKA:detectedEncoding": "ISO-8859-1",
"X-TIKA:encodingDetector": "UniversalEncodingDetector",
"Content-Type": "application/xhtml+xml; charset=ISO-8859-1"
}
}
}
Vespa
Uses the Vespa integration to send HTTP requests to a Vespa service.
Processor Action: store
Processor that upserts or deletes documents from a Vespa app using the Document API.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "vespa",
"name": "My Vespa Store Action",
"config": {
"action": "store",
"namespace": "my-namespace", (1)
"documentType": "my-document-type", (1)
"condition": "schema.field==value",
"data": "#{ data('/my/record') }",
"route": "default",
"timeout": "180s",
"traceLevel": 0
},
"server": <Vespa Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Vespa integration section. |
Each configuration field is defined as follows:
NOTE: Optional configuration fields that are missing from the configuration will not be specified in requests.
namespace-
(Required, String) The namespace of the Vespa document.
documentType-
(Required, String) The document type of the Vespa document, as described in the schema's .sd file.
condition-
(Optional, String) The condition sent as query parameter in requests.
route-
(Optional, String) The custom route for the requests.
timeout-
(Optional, String) The timeout, in seconds, for the requests. Refer to the linked page for other available time units.
traceLevel-
(Optional, Integer) The trace level for the request’s logging. Must be a whole number between 0 and 9, where higher gives more details.
data-
(Optional, Object) The fields of the Vespa document. If not provided, the component will use the output from the last processor that generated data for each record.
Voyage AI
Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service.
Processor Action: embeddings
Processor that, given an input string and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "voyage-ai",
"name": "My Embeddings Action",
"config": {
"action": "embeddings",
"model": "voyage-large-2", (1)
"input": "#{ data('/input') }", (1)
"truncation": true,
"inputType": "DOCUMENT",
"outputDimension": 1536,
"outputDatatype": "FLOAT",
"encodingFormat": "Base64",
"flush": <Bulk Flush Configuration> (2)
},
"server": <Voyage AI Server> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the flush configuration. |
| 3 | See the Voyage AI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request. See models.
input-
(Required, String) The input document to be embedded.
truncation-
(Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.
inputType-
(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.
outputDimension-
(Optional, Integer) The number of dimensions for the resulting output embeddings. Defaults to null.
outputDatatype-
(Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.
encodingFormat-
(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.
flush-
(Optional, Object) The flush configuration.
Details
Configuration example:
{ "maxCount": 25, "maxWeight": 25, "flushAfter": "PT5M" }
Each configuration field is defined as follows:
maxCount-
(Optional, Integer) The maximum number of records in the bulk before flushing.
maxWeight-
(Optional, Long) The maximum weight allowed in a bulk request.
flushAfter-
(Optional, Duration) The time to wait before flushing a bulk request.
The resulting embeddings will be saved in each record’s voyage-ai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:
{
"voyage-ai": [-0.006929283495992422, -0.005336422007530928, ...]
}
Processor Action: multimodal-embeddings
Processor that, given an input list of multimodal inputs consisting of text, images, or an interleaving of both modalities, and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Multimodal Embedding and the API multimodal embedding models endpoint.
| Feature | Supported |
|---|---|
|  | No |
{
"type": "voyage-ai",
"name": "My Multimodal Embeddings Action",
"config": {
"action": "multimodal-embeddings",
"model": "voyage-multimodal-3", (1)
"input": <Input Object>, (1) (2)
"truncation": true,
"inputType": "DOCUMENT",
"outputEncoding": "Base64",
"flush": <Bulk Flush Configuration> (3)
},
"server": <Voyage AI Server> (4)
}
| 1 | These configuration fields are required. |
| 2 | There are multiple types of accepted inputs, check the input object definition for details. |
| 3 | See the embedding’s action flush configuration. |
| 4 | See the Voyage AI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request. See models.
input-
(Required, Object) The input object to be embedded.
Field definitions
type-
(Required, String) The type. One of: text, image_url or image_base64.
text-
(Optional, String) The text, if the type text is chosen.
Text input example: { "type": "text", "text": "This is a banana." }
imageUrl-
(Optional, String) The image URL, if the type image_url is chosen.
Image URL input example: { "type": "image_url", "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg" }
imageBase64-
(Optional, Object) The Base64-encoded image, if the type image_base64 is chosen.
Image Base64 input example: { "type": "image_base64", "imageBase64": { "mediaType": "image/jpeg", "base64": true, "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)" } }
Each configuration field is defined as follows:
mediaType-
(Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.
base64-
(Required, Boolean) Whether the data is encoded in Base64.
data-
(Required, String) The data itself.
truncation-
(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.
inputType-
(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.
outputEncoding-
(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.
flush-
(Optional, Object) The flush configuration. Its definition is the same as for the embeddings action's flush.
The resulting embeddings will be saved in each record’s voyage-ai field by default. This field’s name can be overwritten in the processor state configuration. An example of the resulting embeddings in a record would be:
{
"voyage-ai": [-0.006929283495992422, -0.005336422007530928, ...]
}
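The three accepted input variants can be assembled with small helper functions; the helpers are hypothetical, but the field names match the input object examples above:

```python
def text_input(text):
    return {"type": "text", "text": text}

def image_url_input(url):
    return {"type": "image_url", "imageUrl": url}

def image_base64_input(media_type, data):
    # data carries the Base64-encoded image payload itself
    return {"type": "image_base64",
            "imageBase64": {"mediaType": media_type, "base64": True, "data": data}}

# An interleaved text/image input list for a single multimodal request:
inputs = [
    text_input("This is a banana."),
    image_url_input("https://raw.githubusercontent.com/voyage-ai/"
                    "voyage-multimodal-3/refs/heads/main/images/banana.jpg"),
]
print([i["type"] for i in inputs])  # ['text', 'image_url']
```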
Web Crawler
Uses the Norconex Web Crawler to crawl websites and extract their content.
Scan Action: scan
Seed that crawls a website using basic configurations.
| Feature | Supported |
|---|---|
|  | Custom |
|  | No |
|  | No |
{
"type": "webcrawler",
"name": "My Web Crawler Scan Action",
"config": {
"action": "scan",
"urls": ["https://pureinsights"], (1)
"userAgent": "pureinsights-website-connector",
"delay": "200ms",
"maxDepth": 0,
"maxDocuments": 0,
"ignoreRobotsTxt": false,
"ignoreRobotsMeta": false,
"ignoreSiteMap": false,
"ignoreCanonicalLinks": false,
"metadataFilters": [
...
],
"referenceFilters": [
...
],
"connection": {
...
}
},
"pipeline": <Pipeline ID> (2)
}
| 1 | This configuration field is required. |
| 2 | See the ingestion pipelines section. |
Each configuration field is defined as follows:
urls-
(Required, Array of Strings) The list of starting URLs to crawl.
userAgent-
(Optional, String) The identifier the crawler presents to websites. Defaults to pureinsights-website-connector.
delay-
(Optional, Duration) The interval to wait between each page download. Defaults to 200ms.
maxDepth-
(Optional, Integer) The maximum number of levels deep to crawl from the starting URLs. Default is unlimited.
maxDocuments-
(Optional, Integer) The maximum number of documents to successfully process. Default is unlimited.
ignoreRobotsTxt-
(Optional, Boolean) Whether to ignore crawling instructions in robots.txt files. Default is false.
ignoreRobotsMeta-
(Optional, Boolean) Whether to ignore in-page robot rules. Default is false.
ignoreSiteMap-
(Optional, Boolean) Whether to ignore sitemap detection and resolution for the URLs to process. Default is false.
ignoreCanonicalLinks-
(Optional, Boolean) Whether to ignore canonical links found in HTTP headers and in the head section of HTML files. Default is false.
metadataFilters-
(Optional, Array of Objects) The list of filters to apply based on the documents' metadata.
Details
{
"metadataFilters": [
{
"field": "Content-Type",
"values": [
"pdf"
],
"mode": "EXCLUDE"
}
]
}
Each configuration field is defined as follows:
field-
(Required, String) The name of the metadata field.
values-
(Required, Array of Strings) The list of values used to filter from the field specified.
mode-
(Optional, String) The mode to define whether the documents are included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.
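The INCLUDE/EXCLUDE semantics of a metadata filter can be sketched with a toy predicate, assuming a filter matches when any configured value occurs in the document's metadata field:

```python
def passes_metadata_filter(doc_metadata, field, values, mode="INCLUDE"):
    """Toy model of a metadata filter: a document matches when any configured
    value occurs in the named metadata field; mode decides whether matching
    documents are kept (INCLUDE) or dropped (EXCLUDE)."""
    matched = any(v in str(doc_metadata.get(field, "")) for v in values)
    return matched if mode == "INCLUDE" else not matched

doc = {"Content-Type": "application/pdf"}
# The example configuration above excludes PDFs, so this document is dropped.
print(passes_metadata_filter(doc, "Content-Type", ["pdf"], mode="EXCLUDE"))  # False
```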
referenceFilters-
(Optional, Array of Objects) The list of filters to apply based on the documents' references (i.e. URLs).
Details
{
"referenceFilters": [
{
"type": "BASIC",
"filter": "my-page",
"mode": "EXCLUDE",
"caseSensitive": true
}
]
}
Each configuration field is defined as follows:
type-
(Required, String) The type of filter. One of:
-
WILDCARD: Filters by using wildcards with the * and ? characters. -
BASIC: Text is matched as specified. -
CSV: Same as having multiple BASIC filters, separated by commas. -
REGEX: Filters by regular expressions.
filter-
(Required, String) The value of the filter to apply.
mode-
(Optional, String) The mode to define whether the documents are included or excluded from the result. One of INCLUDE or EXCLUDE. Defaults to INCLUDE.
caseSensitive-
(Optional, Boolean) Whether the filter is case-sensitive or not.
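The four reference filter types can be modeled roughly as below. This is not the Norconex implementation, and treating BASIC as a literal substring match is an assumption:

```python
import fnmatch
import re

def matches(reference, type_, filter_, case_sensitive=True):
    """Toy model of the four reference filter types."""
    ref = reference if case_sensitive else reference.lower()
    flt = filter_ if case_sensitive else filter_.lower()
    if type_ == "WILDCARD":   # * and ? wildcards
        return fnmatch.fnmatchcase(ref, flt)
    if type_ == "BASIC":      # literal match (assumed: substring)
        return flt in ref
    if type_ == "CSV":        # several BASIC filters, comma-separated
        return any(part.strip() in ref for part in flt.split(","))
    if type_ == "REGEX":      # regular expression
        return re.search(flt, ref) is not None
    raise ValueError(f"unknown filter type: {type_}")

url = "https://example.com/my-page/about"
print(matches(url, "BASIC", "my-page"))           # True
print(matches(url, "WILDCARD", "*example.com*"))  # True
```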
connection-
(Optional, Object) The configuration of the connection for the HTTP Fetcher.
Details
{
"connection": {
"connectTimeout": "60s",
"socketTimeout": "60s",
"requestTimeout": "60s",
"pool": {
}
}
}
Each configuration field is defined as follows:
connectTimeout-
(Optional, Duration) The timeout to connect to the website. Defaults to 60s.
socketTimeout-
(Optional, Duration) The maximum period of inactivity between two consecutive data packets. Defaults to 60s.
requestTimeout-
(Optional, Duration) The timeout for a requested connection. Defaults to 60s.
pool-
(Optional, Object) The configuration of the connection pool.
Details
{
"pool": {
"size": 200
}
}
size-
(Optional, Integer) The maximum number of connections that can be created. Defaults to 200.
Discovery QueryFlow
Discovery QueryFlow is a lightweight tool that processes external requests through configurable entrypoints with minimum overhead, while enabling:
-
Flexibility to represent complex query processing scenarios through a finite-state machine.
-
On-the-fly tuning of configurations for a fast feedback loop.
-
Extensive component library for advanced interpretation of the external request.
Entrypoints
REST Endpoints
REST Endpoints API
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint' --data '{ ... }'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint/{id}'
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/endpoint/{id}' --data '{ ... }'
NOTE: The type of an existing endpoint can’t be modified.
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/endpoint/{id}'
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/enable'
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/disable'
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint/{id}/clone?name=clone-new-name&uri=clone-new-uri&method=clone-new-method'
Query Parameters
method-
(Required, String) The HTTP Method of the new Endpoint
uri-
(Required, String) The URI of the new Endpoint
name-
(Required, String) The name of the new Endpoint
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/endpoint/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/endpoint/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search
Discovery QueryFlow enables the creation of custom RESTful APIs where each endpoint is defined by: a unique URI, an HTTP Method, the MIME type it produces, and the request pipeline with its corresponding finite-state machine for processing.
{
"type": "default",
"uri": "/my/custom/endpoint",
"httpMethod": "GET",
"name": "My Custom Endpoint",
"pipeline": <Pipeline ID>,
"timeout": "60s"
...
}
type-
(Required, String) The produced MIME type in the HTTP response, defined by the Accept HTTP Request header. Either json (or default) for application/json, or stream for text/event-stream. Defaults to json.
httpMethod-
(Required, String) The HTTP method for the custom endpoint. Must be one of:
GET,POST,PUT,DELETE,PATCH. uri-
(Required, String) The URI path for the custom endpoint (e.g. /my/path). The URI can contain variables in any of its paths (e.g. /my/{pathA}, /{pathA}/{pathB}). If present, the values for every placeholder will be available as part of the metadata of the HTTP request and can be accessed in the configuration of the processors with the help of the Expression Language.
Details
{ "uri": "/my/{pathA}/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "timeout": "60s" ... }
{ "type": "my-component-type", "name": "My Component Processor", "config": { "myProperty": "#{ data('/httpRequest/pathVariables/pathA') }" } ... }
name-
(Required, String) The unique name to identify the custom endpoint
description-
(Optional, String) The description for the configuration.
pipeline-
(Required, UUID) The ID of the pipeline configuration where the request is processed.
config-
(Optional, Object) The configuration for the response after the execution of the request, based on the endpoint
type.JSON
{ "uri": "/my/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "config": { ... }, "timeout": "60s" ... }
statusCode-
(Required, Integer) The HTTP status code in the range of [200, 599[ for the response. Defaults to 200.
headers-
(Optional, Object) The HTTP headers to return as part of the response.
Details
{ "uri": "/my/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "config": { "headers": { "Etag": "#{ data('/my/etag/value') }", ... }, ... }, "timeout": "60s" ... } body-
(Optional, Object) The HTTP JSON body to return as part of the response.
Details
{ "type": "response", "body": { "keyA": "#{ data('/my/data') }", ... }, ... } snippets-
(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language.
Details
{ "uri": "/my/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "config": { "body": { "myProperty": "#{ snippets.snippetA }" }, "snippets": { "snippetA": { ... } }, ... }, "timeout": "60s" ... }
NOTE: Avoid the usage of any reserved operator, such as hyphens, in the name of a snippet.
Stream (SSE)
{ "uri": "/my/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "config": { ... }, "timeout": "60s" ... }statusCode-
(Required, Integer) The HTTP status code in the range of [200, 599[ for the response. Defaults to 207.
properties-
(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors
Details
{ "uri": "/my/custom/endpoint", "httpMethod": "GET", "name": "My Custom Endpoint", "properties": { "keyA": "valueA" }, "timeout": "60s" ... }
{ "type": "my-component-type", "name": "My Component Processor", "config": { "myProperty": "#{ endpoint.properties.keyA }" }, ... }
timeout-
(Required, Duration) The timeout for the execution of the custom endpoint
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Metadata
The execution starts with the metadata of the invocation stored in the JSON Data Channel:
{
"id": "55d22c60-6d61-41ce-b8b1-c0f1acd6e5e4",
"httpRequest": {
...
},
"pageable": {
...
},
"properties": {
...
}
}
id-
(UUID) An auto-generated ID for the execution.
httpRequest-
(Object) The HTTP Request that triggered the execution.
Details
{ "httpRequest": { "uri": "/my-endpoint", "method": "POST", "headers": { "header-a": "value-a", "header-b": [ "value-b-1", "value-b-2" ] }, "queryParams": { "param-a": "value-a", "param-b": [ "value-b-1", "value-b-2" ] }, "cookies": [ { "name": "cookie-name-a", "path": "/some/path/a", "value": "cookie-value-a", "domain": "cookie-domain-a", "maxAge": 1234 }, { "name": "cookie-name-b", "value": "cookie-value-b" } ], "body": { "body-a": "value-a" }, "pathVariables": { "variable-a": "value-a" } } }uri-
(Required, String) The URI path for the HTTP Request.
method-
(Required, String) The HTTP Method for the HTTP Request.
headers-
(Optional, Object) The headers for the HTTP Request. The value of each header can be either a single String, or an Array of Strings.
queryParams-
(Optional, Object) The query parameters for the HTTP Request. The value of each query parameter can be either a single String, or an Array of Strings.
cookies-
(Optional, Array of Objects) The list of cookies for the HTTP Request.
Details
name-
(Required, String) The name of the cookie.
value-
(Required, String) The value of the cookie.
path-
(Optional, String) The path of the cookie.
domain-
(Optional, String) The domain of the cookie.
maxAge-
(Optional, Integer) The maximum age of the cookie.
body-
(Optional, Object) The body of the HTTP Request.
pathVariables-
(Optional, Object) The variables of the HTTP Request’s URI path. The value of each variable must be a single String.
pageable-
(Object) The pagination request parameters.
Details
Page configuration example{ "page": 0, "size": 25, "sort": [ // (1) { "property" : "fieldA", "direction" : "ASC" }, { "property" : "fieldB", "direction" : "DESC" } ] }Each configuration field is defined as follows:
page-
(Integer) The page number.
size-
(Integer) The size of the page.
sort-
(Array of Objects) The sort definitions for the page.
Field definitions
property-
(String) The property where the sort was applied.
direction-
(String) The direction of the applied sorting. Either
ASCorDESC.
properties-
(Object) The execution properties as configured in the Endpoint.
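Putting the fields above together, a hypothetical GET request to /my-endpoint?param=value would start the execution with metadata along these lines (all values are illustrative):

```json
{
  "id": "55d22c60-6d61-41ce-b8b1-c0f1acd6e5e4",
  "httpRequest": {
    "uri": "/my-endpoint",
    "method": "GET",
    "headers": { "accept": "application/json" },
    "queryParams": { "param": "value" }
  },
  "pageable": { "page": 0, "size": 25, "sort": [] },
  "properties": { "myProperty": "myValue" }
}
```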
Expression Language Extensions
| Variable | Description | Example |
|---|---|---|
|
The ID of the endpoint in execution |
|
|
The HTTP method of the endpoint in execution |
|
|
The URI paths of the endpoint in execution |
|
|
The name of the endpoint in execution |
|
|
The description of the endpoint in execution |
|
|
The properties of the endpoint in execution |
|
|
The labels of the endpoint in execution, grouped by key |
|
Invoking a REST Endpoint
Once a REST Endpoint is fully configured, it can be invoked as any other REST API by calling its HTTP Method + URI + Type under the /api root path:
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value'
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value' --header 'Accept: application/json'
The expected HTTP response is the first match of the following rules:
-
The predefined response as configured in the endpoint.
-
The most recent entry in the JSON Data Channel, where:
-
If the root of the document is named
httpResponse, the statusCode, headers and body will be used accordingly.Details
{ "httpResponse": { "statusCode": 200, "headers": { "header-a": "value-a", "header-b": [ "value-b-1", "value-b-2" ] }, "body": { ... } } }statusCode-
(Integer) The HTTP Status Code of the HTTP Response.
headers-
(Object) The headers of the HTTP Response. The value of each header can be either a single String, or an Array of Strings.
Details
{ "httpResponse": { "headers": { "header-a": "value-a", "header-b": [ "value-b-1", "value-b-2" ] }, ... } } body-
(Object) The body of the HTTP Response.
-
If the node is any other output from a processor state, the body will be unwrapped from the
outputField and the status code will be 200.
-
-
In any other case, the response will be
204 - No Content.
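To illustrate the unwrapping rule for processor states, suppose (hypothetically) the most recent JSON Data Channel entry was produced by a processor configured with outputField set to myResult:

```json
{
  "myResult": {
    "outputKey": "outputValue"
  }
}
```

Since the root is not named httpResponse, the HTTP response body would be the unwrapped object { "outputKey": "outputValue" } with status code 200.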
$ curl --request GET 'queryflow-api:12040/v2/api/my-endpoint?param=value' --header 'Accept: text/event-stream'
A Server-Sent Event (SSE) response is immediately returned with its defined HTTP status code (or 207 - Multi-Status if undefined), streaming the events emitted through the SSE Data Channel.
Once the execution is completed, the connection will be closed by the server.
Details
name: <Output Field Name>
data: <Data>
name-
(String) The configured
outputFieldNamefor the processor. data-
(Object) The JSON data through the channel.
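A client consuming this stream can be sketched as follows, assuming the name:/data: framing shown above with events separated by blank lines (an illustrative sketch, not an official client):

```python
import json

def parse_sse(stream_text):
    """Parse a text/event-stream payload into (name, data) pairs.

    Assumes the name:/data: framing shown above, with events
    separated by blank lines (illustrative sketch only).
    """
    events = []
    for block in stream_text.strip().split("\n\n"):
        name, data = None, None
        for line in block.splitlines():
            if line.startswith("name:"):
                name = line[len("name:"):].strip()
            elif line.startswith("data:"):
                data = json.loads(line[len("data:"):].strip())
        events.append((name, data))
    return events

sample = 'name: myProcessor\ndata: {"outputKey": "outputValue"}\n\nname: done\ndata: {}'
print(parse_sse(sample))
```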
Debugging a REST Endpoint
Given that the definition of the Endpoint can grow in complexity, the risk of something breaking increases: a condition failed, the output was not as expected, a parameter was wrong…
In order to identify the problem, the /debug root path offers a complete trace of the execution for the endpoint. Each state, its output, its errors, and the overall step-by-step path followed by the finite-state machine will be displayed:
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value'
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value' --header 'Accept: application/json'
$ curl --request GET 'queryflow-api:12040/v2/debug/my-endpoint?param=value' --header 'Accept: text/event-stream'
Event: Execution Start
[
{
"timestamp": 1769746448166,
"event": "execution:start"
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
Event: Execution Complete
[
{
"timestamp": 1769746448166,
"event": "execution:complete"
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
Event: Execution Error
[
{
"timestamp": 1769746448166,
"event": "execution:error",
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
errorMessage-
(Object) The error message of the event.
Event: Execution Timeout
[
{
"timestamp": 1769746448166,
"event": "execution:timeout",
"duration": 100
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
duration-
(Long) The duration in milliseconds of the execution.
Event: JSON Data
[
{
"timestamp": 1769746448166,
"event": "data:json",
"data": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
data-
(JSON) The JSON data generated.
Event: Server-Sent Event Data
[
{
"timestamp": 1769746448166,
"event": "data:sse",
"data": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
data-
(JSON) The Server-Sent Event generated.
Event: State Start
[
{
"timestamp": 1769746448166,
"event": "state:start",
"type": ""
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
type-
(String) The type of state.
Event: State Complete
[
{
"timestamp": 1769746448166,
"event": "state:complete",
"duration": 100
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
duration-
(Long) The duration in milliseconds of the execution.
Event: State Error
[
{
"timestamp": 1769746448166,
"event": "state:error",
"duration": 100,
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
duration-
(Long) The duration in milliseconds of the execution.
errorMessage-
(Object) The error message of the event.
Event: Processor Step Start
[
{
"timestamp": 1769746448166,
"event": "step:start",
"stepIndex": 0
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
stepIndex-
(Integer) The index of the step in execution.
Event: Processor Step Skip
[
{
"timestamp": 1769746448166,
"event": "step:skip",
"stepIndex": 0
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
stepIndex-
(Integer) The index of the skipped step.
Event: Processor Step Complete
[
{
"timestamp": 1769746448166,
"event": "step:complete",
"stepIndex": 0
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
stepIndex-
(Integer) The index of the step in execution.
Event: Processor Step Failure via Error Policy
[
{
"timestamp": 1769746448166,
"event": "step:failure",
"stepIndex": 0,
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
stepIndex-
(Integer) The index of the step in execution.
errorMessage-
(Object) The error message of the event.
Event: Processor Step Error
[
{
"timestamp": 1769746448166,
"event": "step:error",
"stepIndex": 0,
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
stepIndex-
(Integer) The index of the step in execution.
errorMessage-
(Object) The error message of the event.
Event: Switch Match
[
{
"timestamp": 1769746448166,
"event": "switch:match",
"option": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
option-
(DSL Filter) The DSL Filter that matched.
Event: Switch Default
[
{
"timestamp": 1769746448166,
"event": "switch:default"
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
Event: Parallel Pipeline Start
[
{
"timestamp": 1769746448166,
"event": "pipeline:start",
"tag": ""
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
tag-
(String) The tag of the pipeline in execution.
Event: Parallel Pipeline Complete
[
{
"timestamp": 1769746448166,
"event": "pipeline:complete",
"tag": ""
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
tag-
(String) The tag of the pipeline in execution.
Event: Parallel Pipeline Failure via Error Policy
[
{
"timestamp": 1769746448166,
"event": "pipeline:failure",
"tag": "",
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
tag-
(String) The tag of the pipeline in execution.
errorMessage-
(Object) The error message of the event.
Event: Parallel Pipeline Error
[
{
"timestamp": 1769746448166,
"event": "pipeline:error",
"tag": "",
"errorMessage": {
...
}
}
]
timestamp-
(Long) The epoch timestamp when the event happens.
event-
(String) The name of the event.
tag-
(String) The tag of the pipeline in execution.
errorMessage-
(Object) The error message of the event.
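As an illustrative sketch, a debug trace can be post-processed to aggregate durations and collect errors from the events documented above (a hypothetical helper, not part of the platform):

```python
def summarize_trace(events):
    """Summarize a /debug trace: total time per event type plus any errors.

    `events` is the list of event objects documented above
    (illustrative sketch only).
    """
    durations = {}
    errors = []
    for e in events:
        if "duration" in e:
            durations[e["event"]] = durations.get(e["event"], 0) + e["duration"]
        if "errorMessage" in e:
            errors.append((e["event"], e["errorMessage"]))
    return durations, errors

trace = [
    {"timestamp": 1769746448166, "event": "execution:start"},
    {"timestamp": 1769746448170, "event": "state:start", "type": "processor"},
    {"timestamp": 1769746448270, "event": "state:complete", "duration": 100},
    {"timestamp": 1769746448300, "event": "execution:complete"},
]
print(summarize_trace(trace))
```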
|
Note
|
The debug request is exactly the same as the one sent to the |
|
Note
|
The debug response contains the |
MCP Server
MCP Servers API
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server' --data '{ ... }'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}'
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}' --data '{ ... }'
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}'
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/enable'
$ curl --request PATCH 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/disable'
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/{id}/clone?name=clone-new-name&uri=clone-new-uri'
Query Parameters
uri-
(Optional, String) The URI of the new MCP Server.
name-
(Optional, String) The name of the new MCP Server.
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search.
The Discovery QueryFlow implementation of the Model Context Protocol follows the 2025-11-25 version of the protocol and is exposed on top of the Streamable HTTP transport layer.
This entrypoint connects AI applications to data sources through business rules defined with the help of the finite-state machine.
The following capabilities are supported:
|
Note
|
{
"uri": "/my/mcp/server",
"name": "My MCP Server",
...
}
uri-
(Required, String) The URI path for the MCP Server.
name-
(Required, String) The unique name to identify the MCP Server.
description-
(Optional, String) The description for the configuration.
instructions-
(Optional, String) The instructions to interact with the MCP Server.
capabilities-
(Required, Object) The configuration of the capabilities exposed by the MCP Server.
Details
{ "uri": "/my/mcp/server", "name": "My MCP Server", "pipeline": <Pipeline ID>, "capabilities": { "logging": {}, "tools": {} }, ... }logging-
(Optional, Object) The logging capabilities of the MCP Server. If the server supports logging, an empty object must be included.
tools-
(Optional, Object) The tools capabilities of the MCP Server. If the server supports tools, an empty object must be included.
serverInfo-
(Required, Object) The information of the MCP Server.
Details
{ "uri": "/my/mcp/server", "name": "My MCP Server", "serverInfo": { ... }, ... }name-
(Required, String) The name of the MCP Server.
version-
(Required, String) The version of the MCP Server.
title-
(Optional, String) The title of the MCP Server.
metadata-
(Optional, Object) The metadata of the MCP Server.
properties-
(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors.
requestTimeout-
(Optional, Duration) The timeout for the execution of each request through the MCP Server. Defaults to
30s. expireAfter-
(Optional, Duration) The expiration timeout for idle sessions in the MCP Server.
active-
(Optional, Boolean) Whether the MCP Server is active. Defaults to
true. labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Capabilities
Tools
MCP Server Tools API
$ curl --request POST 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool' --data '{ ... }'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool'
$ curl --request GET 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}'
$ curl --request PUT 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}' --data '{ ... }'
$ curl --request DELETE 'queryflow-api:12040/v2/entrypoint/mcp-server/{mcpServerId}/tool/{mcpToolId}'
The tools capability allows the execution of custom QueryFlow Tools.
{
"name": "My-MCP-Server-Tool",
"config": {
...
},
"pipeline": <Pipeline ID>,
"timeout": "60s",
...
}
name-
(Required, String) The unique name to identify the MCP Server Tool. It must follow these restrictions.
description-
(Optional, String) The description for the configuration.
config-
(Required, Object) The MCP Server Tool configuration.
Details
{ "name": "My-MCP-Server-Tool", "config": { "inputSchema": { ... }, "outputSchema": { ... }, "annotations": { ... }, "execution": { ... }, "icons": [ ... ], "response": { ... } }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }inputSchema-
(Required, Object) The JSON Schema defining expected parameters for the MCP Server Tool execution.
outputSchema-
(Optional, Object) The JSON Schema defining expected output structure for the MCP Server Tool execution.
annotations-
(Optional, Object) Additional properties describing the MCP Server Tool.
Details
{ "name": "My-MCP-Server-Tool", "config": { "annotations": { "title": "My tool title", "readOnlyHint": false, "destructiveHint": false, "idempotentHint": false, "openWorldHint": false }, ... }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }title-
(Optional, String) The MCP Server Tool title.
readOnlyHint-
(Optional, Boolean) If
true, the tool doesn’t modify its environment. destructiveHint-
(Optional, Boolean) If
true, the tool may perform destructive updates to its environment. Iffalse, the tool performs only additive updates. idempotentHint-
(Optional, Boolean) If
true, calling the tool repeatedly with the same arguments will have no additional effect on its environment. openWorldHint-
(Optional, Boolean) If
true, this tool may interact with an "open world" of external entities. If false, the tool’s domain of interaction is closed.
execution-
(Optional, Object) The execution-related properties for the MCP Server Tool.
Details
{ "name": "My-MCP-Server-Tool", "config": { "execution": { "taskSupport": "forbidden" }, ... }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }taskSupport-
(Optional, String) Indicates whether this tool supports task-augmented execution. One of:
-
forbidden: Tool does not support task-augmented execution -
optional: Tool may support task-augmented execution. -
required: Tool requires task-augmented execution.
-
icons-
(Optional, Array of Objects) Array of icons to display in user interfaces.
Details
{ "name": "My-MCP-Server-Tool", "config": { "icons": [ { "src": "", "mimeType": "", "sizes": [ ... ], "theme": "light" } ], ... }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }src-
(Required, String) The URI pointing to the icon resource.
mimeType-
(Optional, String) The MIME type override if the source type is missing or generic.
sizes-
(Optional, Array of String) The sizes at which the icon is available. Each size must be in
WxH format. theme-
(Optional, String) The theme for the icon. It can be either
lightordark. Thelighttheme is designed to be used with a light background, while thedarkone is designed to be used with a dark background.
response-
(Optional, Object) The object specifying the output of the tool execution. If not provided, the latest data node generated is used.
Details
{ "name": "My-MCP-Server-Tool", "config": { "response": { "snippets": { ... }, "content": { ... } }, ... }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }snippets-
(Optional, Object) The snippets to be referenced in the content field with the help of the Expression Language.
Details
{ "name": "My-MCP-Server-Tool", "config": { "response": { "snippets": { "snippetA": { ... } }, "content": { ... } }, ... }, "pipeline": <Pipeline ID>, "timeout": "60s", ... }NoteAvoid the usage of any reserved operator such as hyphens in the name of a snippet.
content-
(Optional, Object) The object that matches the format specified in the
outputSchemaif provided, or the format of any Content Result.
pipeline-
(Required, UUID) The ID of the pipeline configuration that is executed by the tool.
title-
(Optional, String) The title of the custom tool.
timeout-
(Optional, Duration) The timeout for the execution of the custom tool. Defaults to
30s. properties-
(Optional, Object) The properties to be referenced with the help of the Expression Language in the configuration of the processors.
active-
(Optional, Boolean) Whether the MCP Server Tool is active. Defaults to
true. labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Logging
The logging capability is available in executions that return text/event-stream responses with the help of the Message State, where all messages are returned as notifications as defined in the MCP Specification.
Ping
The ping capability provides a mechanism to verify if the connection is alive.
Metadata
The execution starts with the metadata of the invocation stored in the JSON Data Channel:
{
"id": "d8c7a2d3-3b02-4846-8cf3-0cd1af805b90",
"session": {
...
},
"request": {
...
}
}
id-
(UUID) An auto-generated ID for the execution context.
session-
(Object) The MCP session data.
Details
{
"session": {
"id": "d5a581ba-cf3f-4471-a5ce-d45b9eb63083"
}
}
id-
(UUID) An auto-generated ID for the MCP session between the MCP Server and the Client.
request-
(Object) The MCP request data.
Details
{
"request": {
"toolName": "My-custom-tool",
"id": "1",
"arguments": {
...
}
}
}
toolName-
(String) The name of the MCP Server Tool.
id-
(String) The request ID.
arguments-
(Object) The request parameters, if provided.
Expression Language Extensions
| Variable | Description | Example |
|---|---|---|
|
The ID of the tool in execution |
|
|
The name of the tool in execution |
|
|
The description of the tool in execution |
|
|
The properties of the tool in execution |
|
|
The labels of the tool in execution, grouped by key |
|
Invoking an MCP Server
Once an MCP Server is fully configured, it can be invoked using the default mechanism for sending messages to the server through the Streamable HTTP transport layer, under the /mcp/{server-uri} root path:
$ curl --request POST 'queryflow-api:12040/v2/mcp/my-server-uri' \
--header 'Accept: application/json' \
--header 'Accept: text/event-stream'
For capabilities that execute through the finite-state machine, the final JSON-RPC Response (if not explicitly defined) will be the most recent entry in the JSON Data Channel.
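As an illustrative sketch, the JSON-RPC body for invoking a tool can be built as follows. The jsonrpc/method/params shape and the tools/call method name come from the MCP specification; the tool name and arguments here are hypothetical:

```python
import json

def tools_call_request(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request body for invoking an MCP tool.

    The "tools/call" method name is defined by the MCP specification;
    the tool name and arguments below are hypothetical.
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

body = tools_call_request("1", "My-MCP-Server-Tool", {"query": "hello"})
print(json.dumps(body))
```

A body like this would then be POSTed to the /v2/mcp/{server-uri} path shown above.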
Pipeline
Pipelines API
$ curl --request POST 'queryflow-api:12030/v2/pipeline' --data '{ ... }'
$ curl --request GET 'queryflow-api:12030/v2/pipeline'
$ curl --request GET 'queryflow-api:12030/v2/pipeline/{id}'
$ curl --request PUT 'queryflow-api:12030/v2/pipeline/{id}' --data '{ ... }'
$ curl --request DELETE 'queryflow-api:12030/v2/pipeline/{id}'
$ curl --request POST 'queryflow-api:12030/v2/pipeline/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Pipeline.
$ curl --request POST 'queryflow-api:12030/v2/pipeline/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search.
$ curl --request GET 'queryflow-api:12030/v2/pipeline/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search.
A pipeline is the definition of the finite-state machine for processing a request:
{
"name": "My Pipeline",
"initialState": "stateA",
"states": {
"stateA": {
...
},
"stateB": {
...
}
},
...
}
name-
(Required, String) The unique name to identify the pipeline.
description-
(Optional, String) The description for the configuration.
initialState-
(Required, String) The state, as defined in the
states field, to be used as the starting point for the request processing. states-
(Required, Object) The states associated to the pipeline.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
|
Note
|
Loops are not forbidden as they might represent valid use cases depending on the configuration of the states. To avoid getting stuck in infinite loops, all entrypoints are required to be configured with a timeout. |
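As a minimal sketch that ties these fields together, the following pipeline maps a query parameter and then runs a processor. The state names and the processor ID placeholder are illustrative; the state types are described under State Types:

```json
{
  "name": "My Pipeline",
  "initialState": "mapInput",
  "states": {
    "mapInput": {
      "type": "field-mapper",
      "mapping": { "query": "#{ data('/httpRequest/queryParams/param') }" },
      "next": "runSearch"
    },
    "runSearch": {
      "type": "processor",
      "processors": [ { "id": "<Processor ID>" } ]
    }
  }
}
```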
Processors
Processors API
$ curl --request POST 'queryflow-api:12040/v2/processor' --data '{ ... }'
$ curl --request GET 'queryflow-api:12040/v2/processor'
$ curl --request GET 'queryflow-api:12040/v2/processor/{id}'
$ curl --request PUT 'queryflow-api:12040/v2/processor/{id}' --data '{ ... }'
|
Note
|
The type of an existing processor can’t be modified. |
$ curl --request DELETE 'queryflow-api:12040/v2/processor/{id}'
$ curl --request POST 'queryflow-api:12040/v2/processor/{id}/clone?name=clone-new-name'
Query Parameters
name-
(Required, String) The name of the new Processor
$ curl --request POST 'queryflow-api:12040/v2/processor/{id}/test?timeout=PT15S'
Query Parameters
timeout-
(Optional, Duration) The timeout for the request execution. Defaults to
15s.
Body
The body payload is the input value for the Processor.
$ curl --request POST 'queryflow-api:12040/v2/processor/search' --data '{ ... }'
Body
The body payload is a DSL Filter to apply to the search
$ curl --request GET 'queryflow-api:12040/v2/processor/autocomplete?q=value'
Query Parameters
q-
(Required, String) The query to execute the autocomplete search
Each component is stateless, and it’s driven by the configuration defined in the processor and by the context created by the current HTTP Request. This design makes the processor the main building block of Discovery QueryFlow.
They are intended to solve very specific tasks, which makes them reusable and simple to integrate into any part of the configuration.
{
"type": "my-component-type",
"name": "My Component Processor",
"config": {
...
},
...
}
type-
(Required, String) The name of the component to execute
name-
(Required, String) The unique name to identify the configuration
description-
(Optional, String) The description for the configuration.
config-
(Required, Object) The configuration for the corresponding action of the component. All configurations will be affected by the Expression Language
snippets-
(Optional, Object) The snippets to be referenced in the configuration with the help of the Expression Language
Details
{ "type": "my-component-type", "name": "My Component Processor", "config": { "myProperty": "#{ snippets.snippetA }" }, "snippets": { "snippetA": { ... } }, ... }NoteAvoid the usage of any reserved operator such as hyphens in the name of a snippet.
server-
(Optional, UUID/Object) Either the ID of the server configuration for the integration or an object with the detailed configuration.
Details
{ "server": { "id": "ba637726-555f-4c68-bfed-1c91f4803894", ... }, ... }id-
(Required, UUID) The ID of the server configuration for the integration.
credential-
(Optional, UUID) The ID of the credential to override the default authentication in the external service.
labels-
(Optional, Array of Objects) The labels for the configuration.
Details
{ "labels": [ { "key": "My Label Key", "value": "My Label Value" }, ... ], ... }key-
(Required, String) The key of the label.
value-
(Required, String) The value of the label.
Query Processing with a State Machine
State Types
Processor State
Executes one or more processors in sequence:
{
"myProcessorState": {
"type": "processor",
"processors": [
...
]
}
}
type-
(Required, String) The type of state. Must be
processor processors-
(Required, Array of Objects) The processors to execute
Details
{ "stateA": { "type": "processor", "processors": [ { "id": <Processor ID>, ... } ], ... } }id-
(Required, UUID) The ID of the processor to execute
outputField-
(Optional, String) The output field that wraps the result of the processor execution. Defaults to the one defined in the component
continueOnError-
(Optional, Boolean) If
trueand the processor execution fails, its HTTP response will be stored in its corresponding Data Channel while the other processors in the state continue with their normal execution. Iffalse, the error will either be handled by theonErrorstate, or be spread to its invoker. Defaults tofalse active-
(Optional, Boolean)
falseto disable the execution of the processor
next-
(Optional, String) The next state for the HTTP Request Execution after the completion of the state. If not provided, the current one will be assumed as the final state
onError-
(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.
The JSON output of each processor will be stored in the JSON Data Channel wrapped in the configured outputField:
{
"defaultFieldName": {
"outputKey": "outputValue"
}
}
Events emitted through the SSE Data Channel are transmitted as expected by the entrypoint.
Field Mapper State
Adds a new data node on the JSON Data Channel wrapped in the name of the state:
{
"myFieldMapperState": {
"type": "field-mapper",
"mapping": {},
...
"next": "myNextState"
}
}
type-
(Required, String) The type of state. Must be
field-mapper. mapping-
(Required, JSON) Any valid JSON (String, Number, Array, Object) created with the help of the Expression Language.
Details
{ "myFieldMapperState": { "type": "field-mapper", "mapping": { "myProperty": "#{ data('/my/value') }", ... }, ... "next": "myNextState" } } snippets-
(Optional, Object) The snippets to be referenced in the mapping with the help of the Expression Language.
Details
{ "myFieldMapperState": { "type": "field-mapper", "mapping": { "myProperty": "#{ snippets.snippetA }", ... }, "snippets": { "snippetA": { ... } }, "next": "myNextState" } }NoteAvoid the usage of any reserved operator such as hyphens in the name of a snippet.
next-
(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.
Message State
Adds a new data node on the SSE Data Channel with the name of the state as event type:
{
"myMessageState": {
"type": "message",
"data": {},
...
"next": "myNextState"
}
}
type-
(Required, String) The type of state. Must be
message. data-
(Required, JSON) Any valid JSON (String, Number, Array, Object) created with the help of the Expression Language.
Details
{ "myMessageState": { "type": "message", "data": { "myProperty": "#{ data('/my/value') }", ... }, ... "next": "myNextState" } } snippets-
(Optional, Object) The snippets to be referenced in the mapping with the help of the Expression Language.
Details
{ "myMessageState": { "type": "message", "data": { "myProperty": "#{ snippets.snippetA }", ... }, "snippets": { "snippetA": { ... } }, "next": "myNextState" } }NoteAvoid the usage of any reserved operator such as hyphens in the name of a snippet.
next-
(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.
Parallel Pipeline State
Executes one or more pipelines in parallel:
{
"myParallelPipelineState": {
"type": "pipeline",
"pipelines": {
...
}
}
}
type-
(Required, String) The type of state. Must be
pipeline. pipelines-
(Required, Object) The pipelines to execute in parallel.
Details
{ "type": "pipeline", "pipelines": { "myPipelineA": { "id": <Pipeline ID>, ... }, ... }, ... }id-
(Required, UUID) The ID of the pipeline to execute in parallel.
input-
(Optional, Object) The custom metadata to be used at the start of the pipeline execution. All fields can be configured with the help of the Expression Language.
Details
{ "type": "pipeline", "pipelines": { "myPipelineA": { "id": <Pipeline ID>, "input": { "myField": "#{ data('/my/value') }" } }, ... }, ... } output-
(Optional, Object) The custom output to be used as the result of the pipeline execution. All fields can be configured with the help of the Expression Language. If not provided, the last data node generated will be considered as the output.
Details
{ "type": "pipeline", "pipelines": { "myPipelineA": { "id": <Pipeline ID>, "output": { "myField": "#{ data('/my/value') }" } }, ... }, ... } errorPolicy-
(Optional, String) If
IGNORE and the pipeline execution fails, the other pipelines in the state continue with their normal execution. If FAIL and the pipeline fails, the error will either be handled by the onError state, or be spread to its invoker. The error is always stored in its corresponding tag on the JSON Data Channel. Defaults to FAIL. active-
(Optional, Boolean)
falseto disable the execution of the pipeline. If all pipelines are disabled, the state output will be empty.
next-
(Optional, String) The next state for the HTTP Request Execution after the completion of all configured pipelines. If not provided, the current one will be assumed as the final state.
onError-
(Optional, String) The state to use as fallback if the execution of the current state fails. If undefined, the current HTTP Request Execution will complete with the corresponding error message.
The output of the state stored in the JSON Data Channel is a collection with each response:
{
"myParallelPipelineState": {
"myPipelineA": {
...
},
...
}
}
Events emitted in the SSE Data Channel are always spread to the invoker.
Switch State
Uses DSL Filters and JSON Pointers over the JSON Data Channel to control the flow of the execution based on the first matching condition:
{
"mySwitchState": {
"type": "switch",
"options": [
...
],
"default": "myDefaultState"
}
}
type-
(Required, String) The type of state. Must be
switch. options-
(Required, Array of Objects) The options to evaluate in the state.
Details
{ "type": "switch", "options": [ { "condition": { "equals": { "field": "/httpRequest/queryParams/input", "value": "valueA" }, ... }, "state": "myFirstState" }, ... ], ... }condition-
(Required, Object) The predicate described as a DSL Filter over the JSON processing data.
state-
(Optional, String) The next state for the finite-state machine if the
conditionevaluates totrue.
default-
(Optional, String) The default state for the finite-state machine if no option evaluates to
true.
|
Note
|
If no state for the finite-state machine is selected, the current one will be assumed as the final state. |
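The first-match semantics can be sketched as follows. The condition evaluator is stubbed here, since the real one evaluates a DSL Filter against the JSON Data Channel (illustrative only):

```python
def evaluate_switch(options, default_state, matches):
    """Return the next state for a switch state.

    `options` is a list of (condition, state) pairs and `matches(condition)`
    stands in for evaluating the DSL Filter against the JSON Data Channel.
    Returns `default_state` when no option matches, mirroring the
    first-matching-condition behavior described above.
    """
    for condition, state in options:
        if matches(condition):
            return state
    return default_state

options = [
    ({"equals": {"field": "/x", "value": "a"}}, "stateA"),
    ({"equals": {"field": "/x", "value": "b"}}, "stateB"),
]
# Stub evaluator: pretend only the second condition matches.
matches = lambda cond: cond["equals"]["value"] == "b"
print(evaluate_switch(options, "myDefaultState", matches))
```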
Logger State
Asynchronously logs a message through the logger of the entrypoint:
{
"myLoggerState": {
"type": "logger",
"message": "#{ data('/my/log/message') }",
"level": "ERROR",
"loggerName": "my-custom-logger",
"next": "myNextState"
}
}
message-
(Required, JSON) Any valid JSON (String, Number, Array, Object) with the message to log.
level-
(Optional, String) Logging level. One of:
EMERGENCY,ALERTorCRITICAL,ERROR,WARN,NOTICE,INFO,DEBUG. Defaults toINFO. loggerName-
(Optional, String) The name of the logger. Defaults to QueryFlowLogger.
next-
(Optional, String) The next state for the HTTP Request Execution. If not provided, the current one will be assumed as the final state.
External Request Execution
Data Channels
Some state types produce data that is available for subsequent states.
The JSON Data Channel handles application/json output which can be later referenced using JSON Pointers.
Note: New data nodes will never override data nodes previously generated.
Note: When searching for a path, the JSON Pointer will be evaluated against the most recent output. If it is a match, the node is returned. Otherwise, the search continues with the previous one.
The SSE Data Channel handles text/event-stream that gets emitted based on the entrypoint that triggers the execution.
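A simplified model of the JSON Data Channel lookup, sketched in Python. The channel is represented as a list of data nodes, newest last; the helper names are illustrative, not part of the product:

```python
def resolve_pointer(node, pointer):
    """Resolve an RFC 6901-style JSON Pointer against a single data node."""
    for part in pointer.lstrip("/").split("/"):
        if isinstance(node, dict) and part in node:
            node = node[part]
        else:
            return None
    return node

def data(channel, pointer):
    """Search the data nodes from most recent to oldest; first match wins."""
    for node in reversed(channel):
        value = resolve_pointer(node, pointer)
        if value is not None:
            return value
    return None

channel = [
    {"tokenizer": {"tokens": ["red", "shoes"]}},  # older node
    {"elasticsearch": {"hits": 2}},               # most recent node
]
print(data(channel, "/tokenizer/tokens"))  # ['red', 'shoes'] — found in the older node
```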
Expression Language Extensions
| Variable | Description | Example |
|---|---|---|
| | The ID of the processor in execution during a processor state | |
| | The type of the processor in execution during a processor state | |
| | The name of the processor in execution during a processor state | |
| | The description of the processor in execution during a processor state | |
| | The labels of the processor in execution during a processor state, grouped by key | |
| | The unique ID of the current HTTP request | |
| | The timestamp when the current HTTP request started | |

| Function | Description | Example |
|---|---|---|
| | Finds a specific node within the JSON processing channel using a JSON Pointer | |
| | References a specific node within the JSON processing channel using a 0-based index for the first data node generated in the channel. The method supports negative numbers, where […] | |
Components
Amazon Bedrock
Sends requests to Amazon Bedrock. Supports multiple actions for different endpoints of the service. This component’s output field is named amazonBedrock by default.
Processor Action: invoke-model
Processor that invokes the specified Amazon Bedrock model to run inference using the prompt and the inference parameters provided in the configuration.
{
"type": "amazon-bedrock",
"name": "My Amazon Bedrock Processor Action",
"config": {
"action": "invoke-model",
"model": "amazon.titan-text-premier-v1:0", (1)
"request": { (1) (2)
"inputText": "Write a short story about a rooster",
"textGenerationConfig": {
"maxTokenCount": 50,
"stopSequences": [],
"temperature": 0.7,
"topP": 0.9
}
},
"stream": false
},
"server": <Amazon Bedrock Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | This is just a request example, the exact structure is defined by the Amazon Bedrock API. |
| 3 | See the Amazon Bedrock integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to invoke.
request-
(Required, Object) The body of the request.
stream-
(Optional, Boolean) Whether to enable streaming. Defaults to
false.
The response of the action is stored in the JSON Data Channel as returned by the Amazon Bedrock service:
{
"amazonBedrock": {
...
}
}
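Outside Discovery, the same invocation maps onto the AWS bedrock-runtime invoke_model API. A hedged sketch of how the configuration fields translate into call arguments; the boto3 call itself is shown commented out, since it needs credentials:

```python
import json

def build_invoke_model_call(model, request):
    """Shape the invoke-model configuration into bedrock-runtime arguments."""
    return {
        "modelId": model,
        "body": json.dumps(request),
        "contentType": "application/json",
        "accept": "application/json",
    }

args = build_invoke_model_call(
    "amazon.titan-text-premier-v1:0",
    {"inputText": "Write a short story about a rooster",
     "textGenerationConfig": {"maxTokenCount": 50, "temperature": 0.7}},
)
# With boto3 (not executed here):
# client = boto3.client("bedrock-runtime")
# response = client.invoke_model(**args)
```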
Chunker
Splits large documents into smaller units that are easier for LLMs to interpret. It exposes different splitting strategies. This component’s output field is named chunker by default.
Action: sentence
Processor that splits the text by sentences.
{
"type": "chunker",
"name": "My Chunk-by-Sentence Action",
"config": {
"action": "sentence",
"text": " #{data('/text')} ", (1)
"sentences": 4,
"overlap": 1,
"maxChars": 200
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to process.
sentences-
(Optional, Integer) The number of sentences per chunk. Defaults to 20.
overlap-
(Optional, String/Integer) The number of sentences to overlap; either a percentage or an absolute number of sentences. Defaults to 10%.
maxChars-
(Optional, Integer) The maximum number of chars per chunk.
The response of the action is stored in the JSON Data Channel:
{
"chunker": {
"chunks": [
"Lorem ipsum dolor sit amet consectetur adipiscing elit. Placerat in id cursus mi pretium tellus duis. Urna tempor pulvinar vivamus fringilla lacus nec metus.",
"Urna tempor pulvinar vivamus fringilla lacus nec metus. Integer nunc posuere ut hendrerit semper vel class. Conubia nostra inceptos himenaeos orci varius natoque penatibus."
],
"errors": [
{
"index": 3,
"text": "Mus donec rhoncus eros lobortis nulla molestie mattis Purus est efficitur laoreet mauris pharetra vestibulum fusce Sodales consequat magna ante condimentum neque at luctus Ligula congue sollicitudin erat viverra ac tincidunt nam.",
"error": {
"status": 400,
"code": 3003,
"messages": [
"Chunk of size 229 exceeds maximum char limit of 200"
],
"timestamp": "2025-09-11T15:43:42.925739900Z"
}
}
]
}
}
Note: If the overlapped text exceeds the maxChars limit, the chunk is reported in the errors array (see the response example above).
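The overlap and maxChars behavior can be approximated with a short sketch. This is a deliberately naive model that splits on periods; the component's actual sentence detection is more sophisticated:

```python
def chunk_sentences(text, sentences=20, overlap=1, max_chars=None):
    """Group sentences into chunks, repeating `overlap` sentences between
    consecutive chunks; oversized chunks are reported as errors."""
    parts = [s.strip() + "." for s in text.split(".") if s.strip()]
    chunks, errors = [], []
    step = max(sentences - overlap, 1)
    for i in range(0, len(parts), step):
        chunk = " ".join(parts[i:i + sentences])
        if max_chars is not None and len(chunk) > max_chars:
            errors.append({
                "index": i,
                "text": chunk,
                "message": f"Chunk of size {len(chunk)} exceeds maximum char limit of {max_chars}",
            })
        else:
            chunks.append(chunk)
    return {"chunks": chunks, "errors": errors}

result = chunk_sentences("A one. B two. C three.", sentences=2, overlap=1)
print(result["chunks"])  # ['A one. B two.', 'B two. C three.', 'C three.']
```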
Action: word
Processor that splits the text by words.
{
"type": "chunker",
"name": "My Chunk-by-Word Action",
"config": {
"action": "word",
"text": " #{data('/text')} ", (1)
"words": 8,
"overlap": 3,
"maxChars": 70
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to process.
words-
(Optional, Integer) The number of words per chunk. Defaults to 20.
overlap-
(Optional, String/Integer) The number of words to overlap; either a percentage or an absolute number of words. Defaults to 10%.
maxChars-
(Optional, Integer) The maximum number of chars per chunk.
The response of the action is stored in the JSON Data Channel:
{
"chunker": {
"chunks": [
"Lorem ipsum dolor sit amet consectetur adipiscing elit.",
"consectetur adipiscing elit. Vitae pellentesque sem placerat in",
"sem placerat in id cursus mi. Tempus leo",
"mi. Tempus leo eu aenean sed diam urna",
"sed diam urna tempor. aptent taciti sociosqu. Conubia",
"taciti sociosqu. Conubia nostra inceptos himenaeos orci varius",
"himenaeos orci varius natoque penatibus. Montes nascetur ridiculus",
"Montes nascetur ridiculus mus donec rhoncus eros lobortis.",
"rhoncus eros lobortis. Maximus eget fermentum odio phasellus",
"fermentum odio phasellus non purus est. Vestibulum fusce",
"est. Vestibulum fusce dictum risus blandit quis suspendisse",
"blandit quis suspendisse aliquet. Ante condimentum neque at",
"condimentum neque at luctus nibh finibus facilisis. Ligula",
"finibus facilisis. Ligula congue sollicitudin erat viverra ac",
"erat viverra ac tincidunt nam. Euismod quam justo",
"Euismod quam justo lectus commodo augue arcu dignissim."
],
"errors": [
{
"index": 24,
"text": "NecmetusbibendumegestasiaculismassanislmalesuadaUthendreritsempervelclass",
"error": {
"status": 400,
"code": 3003,
"messages": [
"Chunk of size 73 exceeds maximum char limit of 70"
],
"timestamp": "2025-09-11T19:59:18.402427300Z"
}
}
]
}
}
Note: If the overlapped text exceeds the maxChars limit, the chunk is reported in the errors array (see the response example above).
Elasticsearch
Uses the Elasticsearch integration to send requests to the Elasticsearch API. Supports multiple actions for common operations such as search, and also provides a mechanism to send raw Elasticsearch queries. This component’s output field is named elasticsearch by default.
Action: autocomplete
Processor that executes a completion suggester query.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "autocomplete",
"index": "my-index", (1)
"text": "#{ data('my/query') }", (1)
"field": "content", (1)
"size": 3,
"skipDuplicates": true
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index to search.
text-
(Required, String) The text to search.
field-
(Required, String) The field to search.
skipDuplicates-
(Optional, Boolean) Whether to skip duplicate suggestions.
size-
(Optional, Integer) The number of suggestions to return.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
Action: knn
Processor that executes a k-nearest neighbor (kNN) query using approximate kNN.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "knn",
"index": "my-index", (1)
"field": "content", (1)
"maxResults": 5, (1)
"vector": "#{ data('my/vector') }", (1)
"k": 5, (1)
"candidatesPerShard": 20, (1)
"query": {
"match_all": {}
}
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index to search.
field-
(Required, String) The field to search.
maxResults-
(Required, Integer) The maximum number of results.
vector-
(Required, Array of Float) The source vector to compare.
k-
(Required, Integer) The number of nearest neighbors.
candidatesPerShard-
(Required, Integer) The number of nearest neighbors considered per shard.
query-
(Optional, Object) The query to filter in addition to the kNN search.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
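For reference, this configuration roughly corresponds to an Elasticsearch 8.x knn search body. A sketch of that translation; how Discovery composes the request internally is not documented here, and applying query as a knn filter is an assumption:

```python
def build_knn_body(field, vector, k, candidates_per_shard, max_results, query=None):
    """Shape the knn action configuration into an Elasticsearch search body."""
    body = {
        "knn": {
            "field": field,
            "query_vector": vector,
            "k": k,
            "num_candidates": candidates_per_shard,
        },
        "size": max_results,
    }
    if query is not None:
        body["knn"]["filter"] = query
    return body

body = build_knn_body("content", [0.12, 0.87], k=5,
                      candidates_per_shard=20, max_results=5,
                      query={"match_all": {}})
```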
Action: native
Processor that executes a native Elasticsearch query.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "native",
"path": "/my-index/_doc/1", (1)
"method": "POST", (1)
"queryParams": {
"param1": "value1"
},
"body": {
"field1": "value2"
}
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
path-
(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.
method-
(Required, String) The HTTP method for the request.
queryParams-
(Optional, Map of String/String) The map of query parameters for the URL.
body-
(Optional, Object) The JSON body to submit.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
Action: search
Processor that executes a match query on the index.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "search",
"index": "my-index", (1)
"text": "#{ data('my/query') }", (1)
"field": "content", (1)
"suggest": { (2)
"completion-suggestion": {
"prefix": "Value ",
"completion": {
"field": "field.completion"
}
}
},
"aggregations": { (2)
"aggregationA": {
"terms": {
"field": "field.keyword"
}
}
},
"highlight": { (2)
"fields": {
"field": {}
}
},
"filter": <DSL Filter>, (3)
"pageable": <Pagination Parameters> (4)
},
"server": <Elasticsearch Server ID> (5)
}
| 1 | These configuration fields are required. |
| 2 | The exact expected structure of these objects is defined by the Elasticsearch API, this is just an example at the time of writing. |
| 3 | See the DSL Filter section. |
| 4 | See the Pagination appendix. |
| 5 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index to search.
text-
(Required, String) The text to search.
field-
(Required, String) The field to search.
suggest-
(Optional, Object) The suggester to apply. The object should represent a valid suggester according to the Elasticsearch API.
aggregations-
(Optional, Map of String/Object) The field with the aggregations to apply. See the Elasticsearch API aggregation documentation for details on the structure of the map.
highlight-
(Optional, Object) The highlighter to apply. The object should represent a valid highlighter according to the Elasticsearch API.
filter-
(Optional, DSL Filter) The filters to apply.
pageable-
(Optional, Pagination) The pagination parameters.
Details
Page configuration example:
{
"page": 0,
"size": 25,
"sort": [
{ "property": "fieldA", "direction": "ASC" },
{ "property": "fieldB", "direction": "DESC" }
]
}
Each configuration field is defined as follows:
page-
(Integer) The page number.
size-
(Integer) The size of the page.
sort-
(Array of Objects) The sort definitions for the page.
Field definitions
property-
(String) The property where the sort was applied.
direction-
(String) The direction of the applied sorting. Either ASC or DESC.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
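The pagination parameters map naturally onto Elasticsearch's from/size/sort request parameters. A sketch of that mapping; the helper is illustrative and assumes 0-based page numbers, as in the example above:

```python
def to_es_pagination(pageable):
    """Translate a Discovery pageable object into from/size/sort parameters."""
    page = pageable.get("page", 0)
    size = pageable.get("size", 10)
    body = {"from": page * size, "size": size}
    if pageable.get("sort"):
        body["sort"] = [{s["property"]: {"order": s["direction"].lower()}}
                        for s in pageable["sort"]]
    return body

print(to_es_pagination({"page": 2, "size": 25,
                        "sort": [{"property": "fieldA", "direction": "ASC"}]}))
# {'from': 50, 'size': 25, 'sort': [{'fieldA': {'order': 'asc'}}]}
```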
Action: store
Processor that executes a store request to Elasticsearch.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "store",
"index": "my-index", (1)
"document": { (1)
"field1": "value1"
},
"id": "documentID",
"allowOverride": false
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index where to store the document.
document-
(Required, Object) The document to be stored.
id-
(Optional, String) The ID of the document to be stored. If not provided, it will be autogenerated.
allowOverride-
(Optional, Boolean) Whether the document can be overridden or not. Defaults to
false.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
Action: vector
Processor that executes a script score query using exact kNN.
{
"type": "elasticsearch",
"name": "My Elasticsearch Processor Action",
"config": {
"action": "vector",
"index": "my-index", (1)
"field": "my_vector_field", (1)
"vector": "#{ data('my/vector') }", (1)
"minScore": 0.92, (1)
"maxResults": 5, (1)
"function": "cosineSimilarity",
"query": {
"match_all": {}
}
},
"server": <Elasticsearch Server ID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Elasticsearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index to search.
field-
(Required, String) The field with the vector.
vector-
(Required, Array of Float) The source vector to compare.
minScore-
(Required, Double) The minimum score for results.
maxResults-
(Required, Integer) The maximum number of results.
query-
(Optional, Object) The query to apply together with the vector search.
function-
(Optional, String) The type of function to use. One of cosineSimilarity, dotProduct, l1norm or l2norm. Defaults to cosineSimilarity.
The response of the action is stored in the JSON Data Channel as returned by the Elasticsearch API:
{
"elasticsearch": {
...
}
}
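The vector action corresponds to an Elasticsearch script_score query over a dense_vector field. A sketch of the body it implies; the script sources follow the public Elasticsearch examples (cosineSimilarity is shifted by +1.0 because scores must be non-negative), and the exact scripts Discovery uses are an assumption:

```python
def build_vector_body(field, vector, min_score, max_results,
                      function="cosineSimilarity", query=None):
    """Shape the vector action configuration into a script_score search body."""
    scripts = {
        "cosineSimilarity": f"cosineSimilarity(params.vector, '{field}') + 1.0",
        "dotProduct": f"dotProduct(params.vector, '{field}')",
        "l1norm": f"1 / (1 + l1norm(params.vector, '{field}'))",
        "l2norm": f"1 / (1 + l2norm(params.vector, '{field}'))",
    }
    return {
        "size": max_results,
        "min_score": min_score,
        "query": {
            "script_score": {
                "query": query or {"match_all": {}},
                "script": {"source": scripts[function],
                           "params": {"vector": vector}},
            }
        },
    }

body = build_vector_body("my_vector_field", [0.1, 0.9],
                         min_score=0.92, max_results=5)
```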
Facet Snap
Tries to snap facet values based on a list of tokens extracted from the user query. These facet snaps are returned as a Filter (see Filters DSL) that can be later used as clauses on the query sent to the search engine. This component’s output field is named snap by default.
Action: filter
Processor that creates a filter based on the facet values snapped using the query tokens provided as input.
{
"type": "snap",
"name": "My Snap Filter Action",
"config": {
"action": "filter",
"query": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
"tokens": "#{ data(\"/tokens\") }", (1)
"facetStore": "facet_test", (1)
"includeFacets": "#{ data(\"/includeFacets\") }",
"excludeFacets": "#{ data(\"/excludeFacets\") }",
"matchAllFacets": false,
"snapMode": "QUERY",
"greedyMatch": false,
"maxDisambiguateOffset": -1
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
tokens-
(Required, Array of Strings) The list of tokens to snap to.
query-
(Required, String) The search query to use.
facetStore-
(Required, String) The Discovery Staging bucket to get the facets from.
Bucket format
The facets stored on the bucket are expected to have the following format:
{ "name": "name", "value": "value", "properties": {} }
name-
(Required, String) The name of the facet.
value-
(Required, String) The value of the facet.
properties-
(Optional, Object) The facet properties. Useful to store additional information for the facet.
snapMode-
(Optional, String) The mode to compare facets when snapping.
Details
QUERY: The facets will be matched against the input query text.
TOKENS: The facets will be matched against the input tokens, separated by whitespace. This is useful if you are applying any processing to the tokens.
includeFacets-
(Optional, Array of Strings) A list of facets to include when snapping.
excludeFacets-
(Optional, Array of Strings) A list of facets to ignore when snapping.
matchAllFacets-
(Optional, Boolean) If true, the returned Filter will match all facet fields using and. If false, the returned Filter will match any facet field using or. Defaults to false.
greedyMatch-
(Optional, Boolean) If true, snap to the biggest possible facet for each token only, preventing any overlap between matches. If false, snap to every possible facet for each token, allowing overlapped matches. Defaults to false.
maxDisambiguateOffset-
(Optional, Integer) The maximum offset size to check when disambiguating. If -1, checks all tokens available on both sides. Defaults to -1.
Tip: Input tokens for this action can be retrieved using the Tokenizer component.
Tip: For faster query responses from the facet store, create indices for both name and value fields.
The response of the action is stored in the JSON Data Channel. Besides the filter, the Snap Filter action also provides the snapped facet objects and the query ngrams that matched them, for later use as input to other actions.
{
"snap": {
"snappedFacets": [
{
"facet": { "name": "brand", "value": "nike", "properties": { "code": 123 } },
"ngram": {
"value": "nike",
"offset": { "start": 7, "end": 11 },
"tokens": [
{ "term": "nike", "offset": { "start": 7, "end": 11 } }
]
}
},
{
"facet": { "name": "size", "value": "7" },
"ngram": {
"value": "7",
"offset": { "start": 5, "end": 6 },
"tokens": [
{ "term": "size", "offset": { "start": 0, "end": 4 } },
{ "term": "7", "offset": { "start": 5, "end": 6 } }
]
}
}
],
"filter": {
"or": [
{ "in": { "field": "size", "values": [ "7" ] } },
{ "in": { "field": "brand", "values": [ "nike" ] } }
]
}
}
}
Note: The resulting snapped facets are ordered by ngram size, descending. If two ngrams have the same number of tokens, they are ordered by appearance.
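A simplified model of the snap itself, matching facet values only against an in-memory facet list; the real component also matches facet names, reads from the facet store, and performs disambiguation:

```python
def snap_facets(tokens, facets, greedy=True):
    """Match facet values against token ngrams, trying the longest ngrams first."""
    matches, used = [], set()
    by_value = {f["value"]: f for f in facets}
    for size in range(len(tokens), 0, -1):
        for start in range(len(tokens) - size + 1):
            span = set(range(start, start + size))
            if greedy and span & used:
                continue  # greedy mode forbids overlapping matches
            ngram = " ".join(tokens[start:start + size])
            if ngram in by_value:
                matches.append({"facet": by_value[ngram], "ngram": ngram})
                used |= span
    return matches

def to_filter(matches, match_all_facets=False):
    """Build the returned Filter: `and` over all facets, or `or` over any."""
    clauses = [{"in": {"field": m["facet"]["name"],
                       "values": [m["facet"]["value"]]}} for m in matches]
    return {"and" if match_all_facets else "or": clauses}

facets = [{"name": "brand", "value": "nike"}, {"name": "size", "value": "7"}]
matches = snap_facets(["size", "7", "nike", "sneakers"], facets)
print(to_filter(matches))
```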
Action: mask
Processor that creates a masked query based on the snap results of the Snap Filter Action. It replaces facet matches (both name and value) with a given map of facet masks in the input query.
{
"type": "snap",
"name": "My Snap Mask Filter Action",
"config": {
"action": "mask",
"query": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
"snappedFacets": "#{ data(\"/snap/snappedFacets\") }", (1) (2)
"tokens": "#{ data(\"/tokens\") }", (1)
"entityMasks": {
"size": "[SIZE]",
"brand": "[BRAND]"
}
}
}
| 1 | These configuration fields are required. |
| 2 | See the snapped facets configuration. Note how this field is using the Expression Language to read from the output of a previously executed Snap Filter Action. |
Each configuration field is defined as follows:
tokens-
(Required, Array of Strings) The list of tokens to snap to.
query-
(Required, String) The search query to use.
snappedFacets-
(Required, Array of Objects) The list of facets that matched an ngram value.
Facet object example
[
{
"facet": { "name": "name", "value": "value", "properties": {} },
"ngram": {
"value": "value",
"offset": { "start": 0, "end": 7 },
"tokens": [
{ "term": "term", "offset": { "start": 0, "end": 7 } },
...
]
}
}
]
facet-
(Required, Object) The snapped facet value.
Field definitions
name-
(Required, String) The name of the facet.
value-
(Required, String) The value of the facet.
properties-
(Optional, Object) The facet properties. Useful to store additional information for the facet.
ngram-
(Required, Object) The snapped ngram value.
Field definitions
value-
(Required, String) The ngram value.
offset-
(Required, Object) The ngram query offset.
Field definitions
start-
(Required, Integer) The offset start index.
end-
(Required, Integer) The offset end index.
tokens-
(Required, Array of Objects) The tokens that are part of the ngram.
Field definitions
term-
(Required, String) The term for this token.
offset-
(Required, Object) The token offset.
Details
start-
(Required, Integer) The offset start index.
end-
(Required, Integer) The offset end index.
entityMasks-
(Optional, Map of String/String) Masks to apply to the given facets.
Field configuration
{ "size": "[SIZE]", "brand": "[BRAND]", ... }
The response of the action is stored in the JSON Data Channel as:
{
"snap": "[SIZE] [BRAND] sneakers"
}
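The replacement can be sketched over the snapped token offsets. This is a simplified model that assumes the offsets index into the original query string and that snapped spans do not overlap:

```python
def token_span(snapped):
    """The full character span covered by a snapped ngram's tokens."""
    tokens = snapped["ngram"]["tokens"]
    return (min(t["offset"]["start"] for t in tokens),
            max(t["offset"]["end"] for t in tokens))

def mask_query(query, snapped_facets, entity_masks):
    """Replace each snapped span with its facet's mask, right to left
    so earlier offsets stay valid while the string shrinks or grows."""
    for snapped in sorted(snapped_facets, key=lambda s: token_span(s)[0],
                          reverse=True):
        mask = entity_masks.get(snapped["facet"]["name"])
        if mask is None:
            continue
        start, end = token_span(snapped)
        query = query[:start] + mask + query[end:]
    return query

snapped = [
    {"facet": {"name": "brand", "value": "nike"},
     "ngram": {"value": "nike", "offset": {"start": 7, "end": 11},
               "tokens": [{"term": "nike", "offset": {"start": 7, "end": 11}}]}},
    {"facet": {"name": "size", "value": "7"},
     "ngram": {"value": "7", "offset": {"start": 5, "end": 6},
               "tokens": [{"term": "size", "offset": {"start": 0, "end": 4}},
                          {"term": "7", "offset": {"start": 5, "end": 6}}]}},
]
print(mask_query("size 7 nike sneakers", snapped,
                 {"size": "[SIZE]", "brand": "[BRAND]"}))  # [SIZE] [BRAND] sneakers
```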
Action: clear
Processor that creates a simplified query based on the snap results of the Snap Filter Action. It removes facet matches (both name and value) from the input tokens and joins the remaining tokens with whitespace.
{
"type": "snap",
"name": "My Snap Clear Action",
"config": { (1)
"action": "clear",
"tokens": "#{ data(\"/tokens\") }",
"snappedFacets": "#{ data(\"/snap/snappedFacets\") }" (2)
}
}
| 1 | All configuration fields in this action are required. |
| 2 | See the snapped facets configuration. Note how this field is using the Expression Language to read from the output of a previously executed Snap Filter Action. |
Each configuration field is defined as follows:
tokens-
(Required, Array of Strings) The list of tokens to snap to.
snappedFacets-
(Required, Array of Objects) The list of facets that matched an ngram value.
Facet object example
[
{
"facet": { "name": "name", "value": "value", "properties": {} },
"ngram": {
"value": "value",
"offset": { "start": 0, "end": 7 },
"tokens": [
{ "term": "term", "offset": { "start": 0, "end": 7 } },
...
]
}
}
]
facet-
(Required, Object) The snapped facet value.
Field definitions
name-
(Required, String) The name of the facet.
value-
(Required, String) The value of the facet.
properties-
(Optional, Object) The facet properties. Useful to store additional information for the facet.
ngram-
(Required, Object) The snapped ngram value.
Field definitions
value-
(Required, String) The ngram value.
offset-
(Required, Object) The ngram query offset.
Field definitions
start-
(Required, Integer) The offset start index.
end-
(Required, Integer) The offset end index.
tokens-
(Required, Array of Objects) The tokens that are part of the ngram.
Field definitions
term-
(Required, String) The term for this token.
offset-
(Required, Object) The token offset.
Details
start-
(Required, Integer) The offset start index.
end-
(Required, Integer) The offset end index.
The response of the action is stored in the JSON Data Channel as:
{
"snap": "sneakers"
}
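The clear action can likewise be sketched by dropping every token covered by a snapped ngram. This is simplified: it matches by term rather than by offset, which could also drop legitimate repeated tokens:

```python
def clear_query(tokens, snapped_facets):
    """Remove snapped facet tokens and join the remaining tokens with whitespace."""
    matched = {t["term"]
               for s in snapped_facets
               for t in s["ngram"]["tokens"]}
    return " ".join(t for t in tokens if t not in matched)

snapped = [
    {"facet": {"name": "brand", "value": "nike"},
     "ngram": {"value": "nike",
               "tokens": [{"term": "nike", "offset": {"start": 7, "end": 11}}]}},
    {"facet": {"name": "size", "value": "7"},
     "ngram": {"value": "7",
               "tokens": [{"term": "size", "offset": {"start": 0, "end": 4}},
                          {"term": "7", "offset": {"start": 5, "end": 6}}]}},
]
print(clear_query(["size", "7", "nike", "sneakers"], snapped))  # sneakers
```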
HTML
Uses Jsoup to parse and process HTML documents.
Action: select
Processor that retrieves elements that match a CSS selector query from an HTML document.
{
"type": "html",
"name": "My Select HTML Action",
"config": {
"action": "select",
"text": "#{ data('/text') }", (1)
"baseUri": "",
"charset": "UTF-8",
"selectors": { (1)
"mySelector": {
"selector": "::text:not(:blank)", (1)
"mode": "NODES"
}
}
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
text-
(Required, String) The content of the HTML document to be processed, as a plain text string.
Note: The content of HTML files located in the File Service can be retrieved by leveraging the […]
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".
charset-
(Optional, String) The character set used to encode the content before parsing. If null, determines the charset from the http-equiv meta tag if present, or falls back to UTF-8 if not.
selectors-
(Required, Map of String/Object) The set of selector configurations.
Field definitions
selector-
(Required, String) The CSS selector query.
mode-
(Optional, String) The output format of the selection. Either TEXT, HTML or NODES. Defaults to TEXT.
Note: The NODES mode enables the use of Node Pseudo Selectors. The output for this mode depends on the operator that is used; some operators output text while others output HTML.
The response of the action is stored in the JSON Data Channel:
{
"html": {
"mySelector": "This is text that was found within a selected element"
}
}
Action: extract
Processor that extracts and formats tables and description lists from an HTML document.
{
"type": "html",
"name": "My Extract HTML Action",
"config": {
"action": "extract",
"text": "#{ data('/text') }", (1)
"baseUri": "",
"charset": "UTF-8",
"table": {
"active": true,
"titles": [
"caption"
]
},
"descriptionList": {
"active": true,
"titles": [
"h1"
]
}
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
text-
(Required, String) The content of the HTML document to be processed, as a plain text string.
Note: The content of HTML files located in the File Service can be retrieved by leveraging the […]
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to
"". charset-
(Optional, String) The character set used to encode the content before parsing. If
null, determines the charset from thehttp-equivmeta tag if present, or falls back toUTF-8if not. table-
(Optional, Object) The configurations for extracting tables.
Field definitions
active-
(Optional, Boolean) Whether the extractor is active. Defaults to true.
titles-
(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.
descriptionList-
(Optional, Object) The configurations for extracting description lists.
Field definitions
active-
(Optional, Boolean) Whether the extractor is active. Defaults to true.
titles-
(Optional, Array of String) The list of HTML tags to be considered as the title for each element. A title is selected when either the first child or the previous sibling of the element matches any of the given tags.
The response of the action is stored in the JSON Data Channel:
{
"html": {
"tables": [
{
"title": "Table",
"table": [
[
{
"tag": "header",
"text": "Header 1"
},
{
"tag": "header",
"text": "Header 2"
}
],
[
{
"tag": "data",
"text": "Data 1"
},
{
"tag": "data",
"text": "Data 2"
}
]
]
}
],
"descriptionLists": [
{
"title": "Description List",
"descriptionList": [
{
"term": "Term 1",
"details": [
"Detail 1",
"Detail 2"
]
},
{
"term": "Term 2",
"details": []
}
]
}
]
}
}
Action: remove
Processor that removes elements that match a CSS selector query from an HTML document and outputs the remaining content.
{
"type": "html",
"name": "My Remove HTML Action",
"config": {
"action": "remove",
"text": "#{ data('/text') }", (1)
"baseUri": "",
"charset": "UTF-8",
"selector": "header, footer" (1)
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
text-
(Required, String) The content of the HTML document to be processed, as a plain text string.
Note: The content of HTML files located in the File Service can be retrieved by leveraging the […]
baseUri-
(Optional, String) The URL of the source, to resolve relative links against. Defaults to "".
charset-
(Optional, String) The character set used to encode the content before parsing. If null, determines the charset from the http-equiv meta tag if present, or falls back to UTF-8 if not.
selector-
(Required, String) The CSS selector query.
The response of the action is stored in the JSON Data Channel:
{
"html": "<html>\n <head></head>\n <body>\n <p>body text that wasn't removed </p>\n </body>\n</html>"
}
Hugging Face
Uses the Hugging Face integration to send requests to the Inference API. Supports multiple actions for different endpoints of the service. This component’s output field is named huggingFace by default.
Action: summarization
Processor that summarizes one or multiple texts, organized by an autogenerated ID. See Summarization Task.
{
"type": "hugging-face",
"name": "My Summarization Action",
"config": {
"action": "summarization",
"model": "Falconsai/text_summarization", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"parameters": <Parameters Configuration>, (2)
"options": <Options configuration> (3)
},
"server": <Hugging Face Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the parameters configuration. |
| 3 | See the options configuration. |
| 4 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of Strings) The list of texts to summarize.
parameters-
(Optional, Object) The parameters for the request.
Details
Configuration example:
{
"minLength": 10,
"maxLength": 10,
"topK": 5,
"topP": null,
"temperature": 1,
"repetitionPenalty": 0.1,
"maxTime": 0.1
}
Each configuration field is defined as follows:
minLength-
(Optional, Integer) The minimum length of the output tokens.
maxLength-
(Optional, Integer) The maximum length of the output tokens.
topK-
(Optional, Integer) The top tokens to consider to create new text.
topP-
(Optional, Float) Defines the tokens that are within the sample operation for the query.
temperature-
(Optional, Float) The temperature of the sampling operation. Defaults to 1.0.
repetitionPenalty-
(Optional, Float) The repetition penalty for the request.
maxTime-
(Optional, Float) The maximum time that the request should take.
options-
(Optional, Object) The request options.
Details
Configuration example:
{
"useCache": true,
"waitForModel": false
}
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": "Your summarized text"
}
{
"huggingFace": [
"Your summarized text for your first input",
"Your summarized text for your second input",
...
]
}
Note: The order of the responses corresponds to the order of the inputs.
Action: text-generation
Processor that continues the text from a prompt. See Text Generation Task.
{
"type": "hugging-face",
"name": "My Text Generation Action",
"config": {
"action": "text-generation",
"model": "gpt2-large", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"parameters": <Parameters Configuration>, (2)
"options": <Options configuration> (3)
},
"server": <Hugging Face Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the parameters configuration. |
| 3 | See the options configuration. |
| 4 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, String) The prompt from which to generate the response.
parameters-
(Optional, Object) The parameters for the request.
Details
Configuration example:
{
"topK": null,
"topP": null,
"temperature": 1,
"repetitionPenalty": null,
"maxNewTokens": 20,
"maxTime": 5,
"returnFullText": true,
"numReturnSequences": 1,
"doSample": true
}
Each configuration field is defined as follows:
maxNewTokens-
(Optional, Integer) The number of tokens to be generated.
returnFullText-
(Optional, Boolean) Whether to include the input text within the answer or not. Defaults to true.
numReturnSequences-
(Optional, Integer) The number of propositions to be returned.
doSample-
(Optional, Boolean) Whether to use sampling or not. Uses greedy decoding otherwise. Defaults to true.
topK-
(Optional, Integer) The top tokens to consider to create new text.
topP-
(Optional, Float) Defines the tokens that are within the sample operation for the query.
temperature-
(Optional, Float) The temperature of the sampling operation. Defaults to 1.0.
repetitionPenalty-
(Optional, Float) The repetition penalty for the request.
maxTime-
(Optional, Float) The maximum time that the request should take.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": "My autogenerated text"
}
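The sampling parameters above (topK, topP, temperature, doSample) follow the usual text-generation semantics. As a rough illustration only, and not the Hugging Face implementation, here is a toy Python sampler showing how they interact; the logits values are hypothetical:

```python
import math
import random

def sample_next_token(logits, top_k=None, top_p=None, temperature=1.0, rng=None):
    """Toy sampler showing how topK, topP and temperature interact."""
    rng = rng or random.Random(0)
    # temperature rescales logits before softmax: <1 sharpens, >1 flattens.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    total = sum(math.exp(v) for v in scaled.values())
    probs = sorted(((tok, math.exp(v) / total) for tok, v in scaled.items()),
                   key=lambda kv: kv[1], reverse=True)
    if top_k is not None:  # keep only the k most likely tokens
        probs = probs[:top_k]
    if top_p is not None:  # keep the smallest prefix whose probability mass >= top_p
        kept, mass = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            mass += p
            if mass >= top_p:
                break
        probs = kept
    norm = sum(p for _, p in probs)  # renormalize the surviving candidates
    r, acc = rng.random() * norm, 0.0
    for tok, p in probs:
        acc += p
        if r <= acc:
            return tok
    return probs[-1][0]

# With top_k=1 decoding is effectively greedy:
print(sample_next_token({"dog": 3.0, "cat": 2.0, "car": 0.5}, top_k=1))  # dog
```

With topK set to 1 (or sampling disabled), decoding becomes deterministic, which is often preferable for reproducible pipelines.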
Action: feature-extraction
Processor that extracts a matrix of numerical features from a single text or from multiple texts organized by an autogenerated ID. See Feature Extraction Task.
{
"type": "hugging-face",
"name": "My Feature Extraction Action",
"config": {
"action": "feature-extraction",
"model": "facebook/bart-base", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"options": <Options configuration> (2)
},
"server": <Hugging Face Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the options configuration. |
| 3 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of String) The list of texts from which to extract the numerical features.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace":
[[ 2.2187119 , 2.7539337 , 1.0330348 , ... ],
[ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
[ 0.09872855 , 0.53532976 , 0.7232368 , ... ]]
}
{
"huggingFace":
[
[
[ 2.2187119 , 2.7539337 , 1.0330348 , ... ],
[ -0.2937546 , 0.29999846 , -1.7008113 , ... ],
...
],
[
[ 2.821799 , 2.7055995 , 1.1408421 , ... ],
[ 1.4287674 , 0.39487326 , -3.7841866 , ... ],
...
]
]
}
|
Note
|
Please note that the order of the responses corresponds to the order of the inputs. |
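Feature matrices like the ones above are typically consumed downstream, for example for semantic similarity. A minimal sketch, assuming one mean-pooled vector per input text (the pooling strategy and the 3-dimensional values are illustrative assumptions, not something this action performs):

```python
import math

def mean_pool(token_vectors):
    """Average token-level vectors into a single vector per input text."""
    dim = len(token_vectors[0])
    return [sum(vec[i] for vec in token_vectors) / len(token_vectors) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical multi-input output shaped like the response above (3-dim for brevity):
hugging_face = [
    [[2.2, 2.7, 1.0], [-0.29, 0.30, -1.7]],
    [[2.8, 2.7, 1.1], [1.4, 0.39, -3.8]],
]
doc_a, doc_b = (mean_pool(matrix) for matrix in hugging_face)
print(round(cosine(doc_a, doc_b), 3))  # high similarity, close to 1
```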
Action: fill-mask
Processor that replaces a missing word in a sentence with multiple fitting possibilities. The name of the [MASK] token to be replaced is defined by the chosen model. See Fill Mask Task.
{
"type": "hugging-face",
"name": "My Fill Mask Action",
"config": {
"action": "fill-mask",
"model": "distilroberta-base", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"options": <Options configuration> (2)
},
"server": <Hugging Face Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the options configuration. |
| 3 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of String) The list of texts whose masks will be filled.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": [
{
"sequence": "Paris is the capital of france",
"score": 0.2705707,
"token": 812,
"tokenStr": " capital"
},
...
]
}
{
"huggingFace": [
[
{
"sequence": "Paris is the capital of france",
"score": 0.2705707,
"token": 812,
"tokenStr": " capital"
},
...
],
[
{
"sequence": "The Eiffel tower is one of the main tourist spots in Paris.",
"score": 0.9013709,
"token": 8376,
"tokenStr": " tourist"
},
...
],
...
]
}
Action: text-classification
Processor that classifies a text into a group of labels and provides a score for each label. The labels are determined by the model that is used. See Text Classification Task.
{
"type": "hugging-face",
"name": "My Text Classification Action",
"config": {
"action": "text-classification",
"model": "distilbert-base-uncased-finetuned-sst-2-english", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"options": <Options configuration> (2)
},
"server": <Hugging Face Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the options configuration. |
| 3 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of String) The list of texts to classify.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": [
{"label": "LabelA", "score": 0.9998608827590942},
...
]
}
{
"huggingFace": [
[
{
"label": "LabelA",
"score": 0.9998608827590942
},
...
],
[
{
"label": "LabelC",
"score": 0.9968926310539246
},
...
],
...
]
}
|
Note
|
Please note that the order of the responses corresponds to the order of the inputs. |
Action: zero-shot-classification
Processor that classifies a text into a group of labels without having seen any training examples for those labels, and provides a score for each label. See Zero Shot Classification Task.
{
"type": "hugging-face",
"name": "My Zero Shot Classification Action",
"config": {
"action": "zero-shot-classification",
"model": "facebook/bart-large-mnli", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"parameters": <Parameters Configuration>, (2)
"options": <Options configuration> (3)
},
"server": <Hugging Face Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the parameters configuration. |
| 3 | See the options configuration. |
| 4 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of String) The list of texts to classify.
parameters-
(Optional, Object) The parameters for the request.
Details
Configuration example:
{ "candidateLabels": [ "labelA", "labelB" ], "multiLabel": false }
Each configuration field is defined as follows:
candidateLabels-
(Required, Array of String) The list of possible labels to classify the input.
multiLabel-
(Optional, Boolean) Whether classes can overlap or not. Defaults to false.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": [
{
"label": "labelA",
"score": 0.9998608827590942
},
...
]
}
{
"huggingFace": [
[
{
"label": "labelA",
"score": 0.9998608827590942
},
...
],
[
{
"label": "labelA",
"score": 0.9968926310539246
},
...
],
...
]
}
|
Note
|
Please note that the order of the responses corresponds to the order of the inputs. |
Action: token-classification
Processor that assigns a label to the tokens from a single text or from multiple texts organized by an autogenerated ID. See Token Classification Task.
{
"type": "hugging-face",
"name": "My Token Classification Action",
"config": {
"action": "token-classification",
"model": "dslim/bert-base-NER", (1)
"input": "#{ data(\"/httpRequest/body/input\") }", (1)
"parameters": {
"aggregationStrategy": "SIMPLE"
},
"options": <Options configuration> (2)
},
"server": <Hugging Face Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the options configuration. |
| 3 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Array of String) The list of texts whose tokens will be classified.
parameters-
(Optional, Object) The parameters for the request. Should contain a single aggregationStrategy field with a string value for the aggregation strategy to use in the request. Supported aggregation strategy values are:
NONE: Every token gets classified without further aggregation.
SIMPLE: Entities are grouped according to the default schema (B-, I- tags get merged when the tag is similar).
FIRST: Same as the SIMPLE strategy, except words cannot end up with different tags. Words will use the tag of the first token when there is ambiguity.
AVERAGE: Same as the SIMPLE strategy, except words cannot end up with different tags. Scores are averaged across tokens, and then the maximum label is applied.
MAX: Same as the SIMPLE strategy, except words cannot end up with different tags. The word entity will be the token with the maximum score.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": [
{
"score": 0.9990085,
"word": "Omar",
"start": 11,
"end": 15,
"entityGroup": "PER"
},
...
]
}
{
"huggingFace": [
[
{
"score": 0.9990085,
"word": "Omar",
"start": 11,
"end": 15,
"entityGroup": "PER"
},
...
],
[
{
"score": 0.9949533,
"word": "George Washington",
"start": 0,
"end": 17,
"entityGroup": "PER"
},
...
],
...
]
}
|
Note
|
Please note that the order of the responses corresponds to the order of the inputs. |
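To make the aggregation strategies concrete, here is a toy sketch of SIMPLE-style grouping over BIO-tagged tokens. This is an approximation for illustration, not Hugging Face's exact algorithm; the token fields and the score-merging choice are assumptions:

```python
def aggregate_simple(tokens):
    """Merge consecutive B-/I- tokens of the same entity type into one group."""
    groups, current = [], None
    for tok in tokens:
        tag = tok["entity"]  # e.g. "B-PER", "I-PER", "O"
        if tag == "O":
            current = None  # non-entity token breaks the current group
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "B" or current is None or current["entityGroup"] != etype:
            # B- tag, or a type change, starts a new entity group
            current = {"word": tok["word"], "entityGroup": etype,
                       "start": tok["start"], "end": tok["end"],
                       "score": tok["score"]}
            groups.append(current)
        else:
            # I- tag continues the current group
            current["word"] += " " + tok["word"]
            current["end"] = tok["end"]
            current["score"] = min(current["score"], tok["score"])
    return groups

tokens = [
    {"entity": "B-PER", "word": "George", "start": 0, "end": 6, "score": 0.99},
    {"entity": "I-PER", "word": "Washington", "start": 7, "end": 17, "score": 0.99},
    {"entity": "O", "word": "slept", "start": 18, "end": 23, "score": 0.99},
]
print(aggregate_simple(tokens))  # one PER group: "George Washington"
```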
Action: question-answering
Processor that answers a question based on given contexts. See Question Answering Task.
{
"type": "hugging-face",
"name": "My Question Answering Action",
"config": {
"action": "question-answering",
"model": "deepset/roberta-base-squad2", (1)
"input": <Input values>, (1) (2)
"options": <Options configuration> (3)
},
"server": <Hugging Face Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the input definition. |
| 3 | See the options configuration. |
| 4 | See the Hugging Face integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request.
input-
(Required, Object) The input for the request.
Details
Configuration example:
{
"question": "#{ data(\"/httpRequest/body/question\") }",
"context": [
"#{ data(\"/httpRequest/body/context/0\") }",
"#{ data(\"/httpRequest/body/context/1\") }"
]
}
Each configuration field is defined as follows:
question-
(Required, String) The question to be answered by the model.
context-
(Required, Array of String) The list of contexts used to answer the question.
minScore-
(Optional, Float) The minimum score for each answer.
options-
(Optional, Object) The request options.
Details
Configuration example:
{ "useCache": true, "waitForModel": false }
Each configuration field is defined as follows:
useCache-
(Optional, Boolean) Whether to cache the results. Useful with deterministic models. Defaults to true.
waitForModel-
(Optional, Boolean) Whether to wait until the model is ready or not. If false, the response will be 503 - Service Unavailable.
Note
The response of the action is stored in the JSON Data Channel as returned by the invoked endpoint:
{
"huggingFace": [
{
"answer": "Clara",
"score": 0.8979613184928894,
"start": 11,
"end": 16
},
{
"answer": "Los Angeles",
"score": 0.013939359225332737,
"start": 20,
"end": 31
},
...
]
}
|
Note
|
Please note that the responses are sorted in descending order according to their score value. |
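Since the answers arrive sorted by descending score, the filtering semantics described by the minScore field are straightforward. A minimal sketch of that behavior, for illustration only:

```python
def filter_answers(answers, min_score=None):
    """Drop answers under min_score, keeping the score-descending order (sketch)."""
    kept = [a for a in answers if min_score is None or a["score"] >= min_score]
    return sorted(kept, key=lambda a: a["score"], reverse=True)

answers = [
    {"answer": "Clara", "score": 0.8979613184928894, "start": 11, "end": 16},
    {"answer": "Los Angeles", "score": 0.013939359225332737, "start": 20, "end": 31},
]
print(filter_answers(answers, min_score=0.5))  # only "Clara" survives
```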
Language Detector
The Language Detector component uses Lingua to identify the language of a specified text input. Languages are referenced using ISO 639-1 (alpha-2) codes. This component’s output field is named language by default.
|
Note
|
Each time a language model is referenced, it will be loaded in memory. Loading too many languages increases the risk of high memory consumption issues. |
Processor Action: process
Processor that detects the language of a provided text.
{
"type": "language-detector",
"name": "My Language Detector Processor Action",
"config": {
"action": "process",
"text": { (1)
"inputA": "#{ data('/httpRequest/body/custom/fieldA') }",
"inputB": "#{ data('/httpRequest/body/custom/fieldB') }"
},
"minDistance": 0.5,
"supportedLanguages": [
"en",
"es"
],
"defaultLanguage": "it"
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
text-
(Required, Object) The text to be evaluated. It can be either a String with a single input, or a Map for multi-input processing.
defaultLanguage-
(Optional, String) Default language to select in case no other is detected. Defaults to en.
minDistance-
(Optional, Double) Distance between the input and the language model. Defaults to 0.0.
supportedLanguages-
(Optional, Array of Strings) List of languages supported by the detector. At least 2 supported languages must be set. Defaults to [ "en", "es" ].
The response of the action is stored in the JSON Data Channel as returned by the Lingua engine:
{
"language": "en"
}
{
"language": {
"inputA": "en",
"inputB": "es"
}
}
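The interplay of minDistance, supportedLanguages and defaultLanguage can be pictured with a selection sketch. This illustrates the thresholding idea only, not Lingua's actual implementation, and the confidence values are hypothetical:

```python
def pick_language(confidences, supported, min_distance=0.0, default="en"):
    """Keep supported candidates, then require the top candidate to beat the
    runner-up by at least min_distance; otherwise fall back to the default."""
    ranked = sorted(((lang, c) for lang, c in confidences.items() if lang in supported),
                    key=lambda kv: kv[1], reverse=True)
    if not ranked:
        return default
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < min_distance:
        return default  # detection too ambiguous
    return ranked[0][0]

# "en" and "es" score too close together, so the configured default ("it") wins:
print(pick_language({"en": 0.93, "es": 0.88}, ["en", "es"],
                    min_distance=0.5, default="it"))  # it
```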
MongoDB
Uses the MongoDB integration to send requests to the MongoDB server. This component’s output field is named mongo by default.
Action: aggregate
Processor that runs a configured aggregation pipeline on a MongoDB database.
{
"type": "mongo",
"name": "My MongoDB Processor Action",
"config": { (1)
"action": "aggregate",
"database": "my-database",
"collection": "my-collection",
"stages": [ (2)
{
"$count": "total"
}
]
},
"server": <MongoDB Server ID> (3)
}
| 1 | All configuration fields in this action are required. |
| 2 | The exact expected structure of these objects is defined by MongoDB. This is just an example at the time of writing. |
| 3 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database name.
collection-
(Required, String) The collection name.
stages-
(Required, Array of Objects) List of aggregation stages. Each object in the array should represent a MongoDB aggregation stage.
The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:
{
"mongo": {
...
}
}
Action: autocomplete
Processor that uses the autocomplete operator in a compound must clause; filters are applied in the filter clause.
{
"type": "mongo",
"name": "My MongoDB Processor Action",
"config": {
"action": "autocomplete",
"database": "my-database", (1)
"collection": "my-collection", (1)
"index": "my-index", (1)
"path": "my-field", (1)
"queries": [ (1)
"A "
],
"tokenOrder": "ANY",
"filter": <DSL Filter> (2)
},
"server": <MongoDB Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the DSL Filter section. |
| 3 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database name.
collection-
(Required, String) The collection name.
index-
(Required, String) The name for the MongoDB full-text search index.
path-
(Required, String) The indexed field to search.
queries-
(Required, Array of Strings) The phrase or phrases to autocomplete.
tokenOrder-
(Optional, String) The order in which the tokens will be searched. Either ANY or SEQUENTIAL.
filter-
(Optional, DSL Filter) The filter to apply.
The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:
{
"mongo": {
...
}
}
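For reference, an Atlas Search $search stage with an autocomplete operator inside a compound must clause, which is the shape the description above implies, looks roughly as follows. The pipeline Discovery actually builds is internal to the platform; this Python sketch only shows the query shape:

```python
def build_autocomplete_stage(index, query, path, token_order="ANY", filter_clause=None):
    """Hypothetical $search stage resembling what this action's description implies."""
    compound = {"must": [{"autocomplete": {
        "query": query, "path": path, "tokenOrder": token_order.lower()}}]}
    if filter_clause is not None:
        compound["filter"] = [filter_clause]
    return {"$search": {"index": index, "compound": compound}}

stage = build_autocomplete_stage("my-index", "A ", "my-field")
print(stage)
```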
Action: search
Processor that uses the text operator in a compound must clause; filters are applied in the filter clause.
{
"type": "mongo",
"name": "My MongoDB Processor Action",
"config": {
"action": "search",
"database": "my-database", (1)
"collection": "my-collection", (1)
"index": "my-index", (1)
"paths": [ (1)
"name",
"description"
],
"queries": [
"What is my name?",
"What is my description?"
],
"filter": <DSL Filter>, (2)
"pageable": <Pagination Parameters>, (3)
},
"server": <MongoDB Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the DSL Filter section. |
| 3 | See the Pagination appendix. |
| 4 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database name.
collection-
(Required, String) The collection name.
index-
(Required, String) The name for the MongoDB full-text search index.
paths-
(Required, String or Array of Strings) The paths of the fields to search. Can be configured as a single String if there’s only one path.
queries-
(Required, String or Array of Strings) The phrases to search in the field. Can be configured as a single String if there’s only one phrase.
pageable-
(Optional, Object) The pagination object.
Details
Page configuration example:
{
"page": 0,
"size": 25,
"sort": [
{ "property" : "fieldA", "direction" : "ASC" },
{ "property" : "fieldB", "direction" : "DESC" }
]
}
Each configuration field is defined as follows:
page-
(Integer) The page number.
size-
(Integer) The size of the page.
sort-
(Array of Objects) The sort definitions for the page.
Field definitions
property-
(String) The property where the sort was applied.
direction-
(String) The direction of the applied sorting. Either ASC or DESC.
filter-
(Optional, DSL Filter) The filter to apply.
The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:
{
"mongo": {
...
}
}
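The pageable object's semantics (page, size, and sort entries with property/direction) can be illustrated with a toy in-memory sketch; the real pagination and sorting are performed by MongoDB:

```python
def paginate(items, page=0, size=25, sort=None):
    """Toy in-memory version of the pageable semantics."""
    # Apply the last sort key first: stable sorts compose into a multi-key sort.
    for spec in reversed(sort or []):
        items = sorted(items, key=lambda d: d[spec["property"]],
                       reverse=spec["direction"] == "DESC")
    start = page * size
    return items[start:start + size]

rows = [{"fieldA": 2}, {"fieldA": 1}, {"fieldA": 3}]
page0 = paginate(rows, page=0, size=2,
                 sort=[{"property": "fieldA", "direction": "ASC"}])
print(page0)  # [{'fieldA': 1}, {'fieldA': 2}]
```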
Action: vector
Processor that uses the vectorSearch MongoDB operator, and its filter field if provided, to perform ANN vector search on a vector type field indexed in an Atlas Vector Search index. It also adds a minimum vector search result score filter for the resulting documents.
|
Note
|
The vector search operator’s ENN search capabilities
aren’t currently supported by this action. However, they may still be used in QueryFlow via the |
{
"type": "mongo",
"name": "My MongoDB Processor Action",
"config": {
"action": "vector",
"database": "my-database", (1)
"collection": "my-collection", (1)
"index": "my-index", (1)
"queryVector": "#{ data('my/vector') }", (1)
"path": "my-field", (1)
"limit": 10, (1)
"minScore": 0.7, (1)
"numCandidates": 200, (1)
"filter": { (2)
"$and": [
{"year": { "$gt": 1955 }},
{"year": { "$lt": 1975 }}
]
}
},
"server": <MongoDB Server ID> (3)
}
| 1 | These configuration fields are required. |
| 2 | The exact expected structure of the filter is defined by MongoDB; this is just an example at the time of writing. |
| 3 | See the MongoDB integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The database name.
collection-
(Required, String) The collection name.
index-
(Required, String) The name of the Atlas Vector Search index.
queryVector-
(Required, Array of Float) The vector used as query in the search.
path-
(Required, String) The path to search for the vector in the documents.
numCandidates-
(Required, Integer) The number of nearest neighbors to use during an ANN search. This config will be ignored if the exact config value is set to true. Must be higher than or equal to limit.
limit-
(Required, Integer) The number of documents to return in the vector search result. The minScore filter is then applied to those results.
minScore-
(Required, Double) The minimum score for results of the vector search.
filter-
(Optional, Object) The search operator filter to apply in the query. This object should represent a valid Atlas Vector Search Pre-filter.
The response of the action is stored in the JSON Data Channel as returned by the MongoDB server:
{
"mongo": {
...
}
}
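The constraints above (all fields required, numCandidates at least limit) can be sanity-checked before a pipeline is deployed. A hedged sketch of client-side validation; Discovery performs its own validation regardless:

```python
def validate_vector_config(cfg):
    """Check the documented constraints on the 'vector' action config (sketch)."""
    required = ["database", "collection", "index", "queryVector",
                "path", "limit", "minScore", "numCandidates"]
    missing = [k for k in required if k not in cfg]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    if cfg["numCandidates"] < cfg["limit"]:
        raise ValueError("numCandidates must be higher than or equal to limit")
    return True

cfg = {"database": "my-database", "collection": "my-collection", "index": "my-index",
       "queryVector": [0.1, 0.2], "path": "my-field",
       "limit": 10, "minScore": 0.7, "numCandidates": 200}
print(validate_vector_config(cfg))  # True
```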
Neo4j
Executes a read query to a Neo4j server to gather search results from it. This component’s output field is named neo4j by default.
Processor Action: process
Processor that executes a query to a Neo4j server.
{
"type": "neo4j",
"name": "My Neo4j Processor Action",
"config": { (1)
"action": "process",
"database": "neo4j",
"query": "MATCH (p:Person {name: $NAME}) RETURN p",
"parameters": {
"NAME": "Adam"
}
},
"server": <Neo4j Server ID> (2)
}
| 1 | All configuration fields in this action are required. |
| 2 | See the Neo4j integration section. |
Each configuration field is defined as follows:
database-
(Required, String) The Neo4j database to query.
query-
(Required, String) The query to be executed.
parameters-
(Required, Map of String/Object) Parameters to be used in the query. See Neo4J’s parameters section for details on how to use this configuration.
The response of the action is stored in the JSON Data Channel as returned by the Neo4j server:
{
"neo4j": {
...
}
}
OpenAI
Uses the OpenAI integration to send requests to the OpenAI API. It supports multiple actions for different endpoints of the service. Additionally, it supports text trimming based on OpenAI models' tokenizing and token limits, by integrating the tiktoken library. This component’s non-streamed output field is named openai by default.
Action: chat-completion
Processor that executes a chat completion request to the OpenAI API.
{
"type": "openai",
"name": "My Chat Completion Action",
"config": {
"action": "chat-completion",
"model": "gpt-4", (1)
"messages": [ (1) (2)
{"role": "system", "content": "You are a helpful assistant" },
{"role": "user", "content": "Hi!" },
{"role": "assistant", "content": "Hi, how can I assist you today?" }
],
"promptCacheKey": "pureinsights",
"frequencyPenalty": 0.0,
"presencePenalty": 0.0,
"temperature": 1,
"topP": 1,
"n": 1,
"maxTokens": 2048,
"stop": [],
"stream": false,
"responseFormat": <Response format configuration> (3)
},
"server": <OpenAI Server ID> (4)
}
| 1 | These configuration fields are required. |
| 2 | See the messages configuration definition. |
| 3 | See the response format configuration. |
| 4 | See the OpenAI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The OpenAI model to use.
messages-
(Required, Array of Objects) The list of messages for the request
Field definitions
role-
(Required, String) The role of the message. Must be one of system, user or assistant.
content-
(Required, String) The content of the message.
name-
(Optional, String) The name of the author of the message.
promptCacheKey-
(Optional, String) Value used by OpenAI to cache responses for similar requests to optimize the cache hit rates.
frequencyPenalty-
(Optional, Double) Positive values penalize new tokens based on their existing frequency in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.
presencePenalty-
(Optional, Double) Positive values penalize new tokens based on whether they appear in the text so far. Value must be between -2.0 and 2.0. Defaults to 0.0.
temperature-
(Optional, Double) Sampling temperature to use. Value must be between 0 and 2. Defaults to 1.
|
Note
|
It’s generally recommended to alter either this or the topP field, but not both.
|
topP-
(Optional, Double) An alternative to sampling with temperature, where the model considers the results of the tokens with top_p probability mass. Defaults to
1
|
Note
|
It’s generally recommended to alter either this or the temperature field, but not both.
|
n-
(Optional, Integer) How many chat completion choices to generate for each input message. Defaults to 1.
maxTokens-
(Optional, Integer) The maximum number of tokens to generate in the chat completion. Defaults to 2048.
stop-
(Optional, Array of String) Up to 4 sequences where the API will stop generating further tokens.
stream-
(Optional, Boolean) Whether to enable streaming or not. Defaults to false.
responseFormat-
(Optional, Object) An object specifying the format that the model must output. Learn more in OpenAI’s Structured Outputs guide.
Details
Configuration example:
{
"type": "json_schema", (1)
"json_schema": { (2)
"name": "a name for the schema",
"strict": true,
"schema": { (3)
"type": "object",
"properties": {
"equation": { "type": "string" },
"answer": { "type": "string" }
},
"required": ["equation", "answer"],
"additionalProperties": false
}
}
}
| 1 | The response type is always required. |
| 2 | JSON schemas can only be used, and are required, with json_schema types of response formats. |
| 3 | The exact expected structure of the schema object is defined by the OpenAI API; this is just an example at the time of writing. |
Each configuration field is defined as follows:
type-
(Required, String) The type of response format being defined. Allowed values: text, json_schema and json_object.
json_schema-
(Optional, Object) Structured Outputs configuration options, including a JSON Schema. This field can only be used, and is in fact required, with response formats of the json_schema type. See OpenAI’s response formats definitions for more details.
Field definitions
name-
(Required, String) The name of the response format. Must be a-z, A-Z, 0-9, or contain underscores and dashes, with a maximum length of 64.
description-
(Optional, String) A description of what the response format is for, used by the model to determine how to respond in the format.
schema-
(Optional, Object) The schema for the response format, described as a JSON Schema object. Learn how to build JSON schemas here.
strict-
(Optional, Boolean) Whether to enable strict schema adherence when generating the output. If set to true, the model will always follow the exact schema defined in the schema field. Defaults to false.
The response of the action is stored in the JSON Data Channel as returned by the OpenAI API:
{
"openai": {
"created": <Timestamp>,
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "The response from the model"
},
"finishReason": "stop"
}
],
"model": "gpt-4-0613",
"usage": {
"promptTokens": 34,
"completionTokens": 95,
"totalTokens": 129
}
}
}
[
{
"name": "openai",
"data": "The"
},
{
"name": "openai",
"data": " response"
},
{
"name": "openai",
"data": " from"
},
{
"name": "openai",
"data": " the"
},
{
"name": "openai",
"data": " model"
},
...
]
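When a json_schema response format with strict enabled is used, the assistant message content should parse as JSON conforming to the configured schema. A minimal client-side sanity check (deliberately not full JSON Schema validation) might look like this; the schema and content values are the illustrative ones from the example above:

```python
import json

def check_required(payload, schema):
    """Minimal structural check: required keys present, and no extra keys
    when additionalProperties is false."""
    ok = all(k in payload for k in schema.get("required", []))
    if schema.get("additionalProperties") is False:
        ok = ok and set(payload) <= set(schema.get("properties", {}))
    return ok

schema = {"type": "object",
          "properties": {"equation": {"type": "string"}, "answer": {"type": "string"}},
          "required": ["equation", "answer"], "additionalProperties": False}
content = '{"equation": "2 + 2", "answer": "4"}'
print(check_required(json.loads(content), schema))  # True
```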
Action: embeddings
Processor that executes an embeddings request to the OpenAI API.
{
"type": "openai",
"name": "My OpenAI Embeddings Action",
"config": {
"action": "embeddings",
"model": "text-embedding-ada-002", (1)
"input": ["Sample text 1", "Sample text 2"], (1)
"user": "pureinsights"
},
"server": <OpenAI Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the OpenAI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The OpenAI model to use.
input-
(Required, Array of Strings) The list of input texts to be processed.
user-
(Optional, String) The unique identifier representing the end-user.
The response of the action is stored in the JSON Data Channel as returned by the OpenAI API:
{
"openai": {
"embeddings": [
{
"embedding": [ -0.006929283495992422, -0.005336422007530928, ... ],
"index": 0
},
{
"embedding": [ -0.024047505110502243, -0.006929283495992422, ... ],
"index": 1
}
],
"model": "text-embedding-ada-002-v2",
"usage": {
"promptTokens": 4,
"totalTokens": 4
}
}
}
Action: trim
Processor that trims a given text based on an OpenAI model’s tokenizing and either its own token limit, or a custom one.
{
"type": "openai",
"name": "My OpenAI Trim Action",
"config": {
"action": "trim",
"text": "#{ data(\"/text\") }", (1)
"model": "gpt-5.2", (2)
"tokenLimit": 5 (2)
}
}
| 1 | This configuration field is required |
| 2 | At least one of these configuration fields are required |
Each configuration field is defined as follows:
text-
(Required, String) The text to trim.
model-
(Optional, String) The OpenAI model whose encoding is used to tokenize the text and whose token limit determines whether to truncate the text. If a custom token limit is defined, it’ll override the model’s. If no model is provided, the default o200k_base encoding will be used.
|
Note
|
In order to determine the encoding and token limit for models used in chat completion requests, the processor will only take into account those models' "version"
when trimming. In this context, "version" translates to the ChatGPT version used, such as gpt-5.2, gpt-4.1, o4, etc. This means that a model field configured as gpt-5.2-2025-12-11 will result in the processor only taking into account the gpt-5.2 version included, and ignore the rest. Consequently, non-existent models such as o4-mini-thismodeldoesntexist are considered valid for this action as long as the model version can be inferred.
|
tokenLimit-
(Optional, Integer) The positive integer used as token limit when determining whether to truncate the encoded text or not. If defined, it’ll override the provided model’s token limit, if any.
The response of the action is stored in the JSON Data Channel:
{
"openai": {
"text": "The brown fox jumps over",
"size": 24,
"tokens": 5,
"truncated": true,
"remainder": [
" the",
" lazy",
" dog"
]
}
}
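The response shape above can be reproduced with a toy tokenizer. The real action tokenizes with the model's tiktoken-style encoding; the whitespace split below is only a stand-in to show how text, size, tokens, truncated and remainder relate:

```python
def trim(text, token_limit):
    """Toy version of the trim action using a whitespace tokenizer as a
    stand-in for the model's real encoding."""
    # Tokens keep their leading space so the kept text can be rebuilt exactly.
    words = text.split(" ")
    tokens = [words[0]] + [" " + w for w in words[1:]]
    kept, remainder = tokens[:token_limit], tokens[token_limit:]
    trimmed = "".join(kept)
    return {
        "text": trimmed,
        "size": len(trimmed),
        "tokens": len(kept),
        "truncated": bool(remainder),
        "remainder": remainder,
    }

print(trim("The brown fox jumps over the lazy dog", 5))
```

With a limit of 5 this reproduces the documented example: the kept text is "The brown fox jumps over" and the remainder holds the three dropped tokens.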
OpenSearch
Uses the OpenSearch integration to send requests to the OpenSearch API. It supports multiple actions for common operations such as search, and also provides a mechanism to send raw OpenSearch queries. This component’s output field is named opensearch by default.
Action: autocomplete
Processor that executes a completion suggester query.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "autocomplete",
"index": "my-index", (1)
"text": "#{ data('my/query') }", (1)
"field": "content", (1)
"size": 3,
"skipDuplicates": true
},
"server": <OpenSearch Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index where to search.
text-
(Required, String) The text to autocomplete.
field-
(Required, String) The field where to search.
skipDuplicates-
(Optional, Boolean) Whether to skip duplicate suggestions.
size-
(Optional, Integer) The number of suggestions to return.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
Action: fetch
Processor that executes a GET request to retrieve a specified JSON document from an index.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "fetch",
"index": "my-index", (1)
"id": "document-ID", (1)
"fields": <DSL Projection> (2)
},
"server": <OpenSearch Server UUID> (3)
}
| 1 | These configuration fields are required. |
| 2 | See the DSL Projection section. |
| 3 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index where to search.
id-
(Required, String) The ID of the document.
fields-
(Optional, Projection) The source fields to be included or excluded.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
Action: knn
Processor that executes an Approximate k-NN query.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "knn",
"index": "my-index", (1)
"field": "vector-field", (1)
"vector": "#{ data('my/vector') }", (1)
"minScore": 0.92, (1)
"maxResults": 5, (1)
"k": 5, (1)
"query": {
"match_all": {}
}
},
"server": <OpenSearch Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
index-
(Required, String) The index where to search.
field-
(Required, String) The field with the vector.
vector-
(Required, Array of Float) The source vector to compare.
minScore-
(Required, Double) The minimum score for results.
maxResults-
(Required, Integer) The maximum number of results.
k-
(Required, Integer) The number of nearest neighbors each graph search will return.
query-
(Optional, Object) The query to filter in addition to the kNN search.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
Action: native
Processor that executes a native OpenSearch query.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "native",
"path": "/my-index/_doc/1", (1)
"method": "POST", (1)
"queryParams": {
"param1": "value1"
},
"body": {
"field1": "value2"
}
},
"server": <OpenSearch Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the OpenSearch integration section. |
Each configuration field is defined as follows:
path-
(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.
method-
(Required, String) The HTTP method for the request.
queryParams-
(Optional, Map of String/String) The map of query parameters for the URL.
body-
(Optional, Object) The JSON body to submit.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
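How the native action's pieces combine into a request can be sketched as follows (the base URL is hypothetical; in Discovery it comes from the configured server connection):

```python
from urllib.parse import urlencode

def build_native_url(base_url, path, query_params=None):
    # "path" excludes schema, host, and port; those come from the server connection.
    url = base_url.rstrip("/") + path
    if query_params:
        url += "?" + urlencode(query_params)
    return url

# Hypothetical server base URL, for illustration only.
url = build_native_url("https://opensearch.example.com:9200",
                       "/my-index/_doc/1", {"param1": "value1"})
```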
Action: search
Processor that executes a match query on the index.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "search",
"index": "my-index", // (1)
"text": "#{ data('my/query') }", // (1)
"field": "content", // (1)
"suggest": { // (2)
"completion-suggestion": {
"prefix": "Value ",
"completion": {
"field": "field.completion"
}
}
},
"aggregations": { // (2)
"aggregationA": {
"terms": {
"field": "field.keyword"
}
}
},
"highlight": { // (2)
"fields": {
"field": {}
}
},
"filter": <DSL Filter>, // (3)
"pageable": <Pagination Parameters> // (4)
},
"server": <OpenSearch Server UUID> // (5)
}
-
These configuration fields are required.
-
The exact expected structure of these objects is defined by the OpenSearch API; this is just an example at the time of writing.
-
See the DSL Filter section.
-
See the Pagination appendix.
-
See the OpenSearch integration section.
index-
(Required, String) The index where to search.
text-
(Required, String) The text to search.
field-
(Required, String) The field where to search.
suggest-
(Optional, Object) The suggester to apply. The object should represent a valid suggester according to the OpenSearch API.
aggregations-
(Optional, Map of String/Object) The field with the aggregations to apply. See the OpenSearch API aggregation documentation for details on the structure of the map.
highlight-
(Optional, Object) The highlighter to apply. The object should represent a valid highlighter according to the OpenSearch API.
filter-
(Optional, DSL Filter) The filters to apply.
pageable-
(Optional, Pagination) The pagination parameters.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
Action: store
Processor that stores or updates documents in the given index of OpenSearch.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "store",
"index": "my-index", // (1)
"document": { // (1)
"field1": "value1"
},
"id": "documentID",
"allowOverride": false
},
"server": <OpenSearch Server UUID> // (2)
}
-
These configuration fields are required.
-
See the OpenSearch integration section.
Each configuration field is defined as follows:
index-
(Required, String) The index where to store the document.
id-
(Required, String) The ID of the document to be stored.
document-
(Required, Object) The document to be stored.
allowOverride-
(Optional, Boolean) Whether the document can be overridden or not. Defaults to
false.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
Action: vector
Processor that executes an Exact kNN with scoring script query.
{
"type": "opensearch",
"name": "My OpenSearch Processor Action",
"config": {
"action": "vector",
"index": "my-index", // (1)
"field": "my_vector_field", // (1)
"vector": "#{ data('my/vector') }", // (1)
"minScore": 0.92, // (1)
"maxResults": 5, // (1)
"function": "cosinesimil", // (1)
"query": {
"match_all": {}
}
},
"server": <OpenSearch Server UUID> // (2)
}
-
These configuration fields are required.
-
See the OpenSearch integration section.
Each configuration field is defined as follows:
index-
(Required, String) The index where to search.
field-
(Required, String) The field with the vector.
vector-
(Required, Array of Float) The source vector to compare.
minScore-
(Required, Double) The minimum score for results.
maxResults-
(Required, Integer) The maximum number of results.
function-
(Required, String) The function used for the k-NN calculation. The available functions are listed in the OpenSearch k-NN documentation.
query-
(Optional, Object) The query to apply together with the vector search.
The response of the action is stored in the JSON Data Channel as returned by the OpenSearch API:
{
"opensearch": {
...
}
}
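The functions available for the scoring script are similarity measures; cosinesimil, used in the example above, is based on cosine similarity (the final document score normalization is defined by the OpenSearch scoring-script API). As an illustration:

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

cosine_similarity([1.0, 0.0], [1.0, 0.0])  # identical direction -> 1.0
```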
Question Detector
The Question Detector component validates whether an input text contains a question. Languages are referenced using ISO-639-1 (alpha-2) codes. This component’s output field is named isQuestion by default.
Processor Action: process
Processor that detects if the provided text is a question.
{
"type": "question-detector",
"name": "My Question Detector Processor Action",
"config": {
"action": "process",
"language": "es", (1)
"text": "#{ data(\"/httpRequest/queryParams/question\") }", (1)
"questionPrefixes": {
"es": [
"que",
"quien",
"porque",
"donde",
"cuando",
"como"
]
}
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
text-
(Required, String) The text to be evaluated.
language-
(Required, String) The language to use.
questionPrefixes-
(Optional, Map of String/List) Words that indicate a question. Defaults to
{ "en": [ "what", "who", "why", "where", "when", "how" ] }.
The response of the action is stored in the JSON Data Channel:
{
"isQuestion": <`True` or `False`>
}
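A minimal sketch of prefix-based detection as configured above (illustrative only; the component's actual logic may be more involved):

```python
def is_question(text, prefixes):
    # Simplified illustration: does the first word match a known question prefix?
    words = text.lower().strip("¿?!. ").split()
    return bool(words) and words[0] in prefixes

# The Spanish prefixes from the configuration example above.
prefixes_es = ["que", "quien", "porque", "donde", "cuando", "como"]
is_question("donde esta el gato", prefixes_es)  # True
```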
Script
Uses the Script Engine to execute a script for advanced handling of the execution data. Supports multiple scripting languages and provides tools for JSON manipulation and for logging. This component’s output field is named script by default.
Processor Action: process
Processor that executes a script to process and interact with data produced in previous states.
{
"type": "script",
"name": "My Script Processor Action",
"config": {
"action": "process",
"language": "groovy",
"script": <Script> (1)
}
}
| 1 | This configuration field is required. |
Each configuration field is defined as follows:
language-
(Optional, String) The language of the script. One of the supported script languages. Defaults to groovy.
script-
(Required, String) The script to run.
The response of the action is stored in the JSON Data Channel according to the script’s interaction with the output() object:
{
"script": {
...
}
}
Solr
Uses the Solr integration to send requests to Solr. This component’s output field is named solr by default.
Action: native
Processor that executes a native request against the Solr API.
{
"type": "solr",
"name": "My Solr Processor Action",
"config": {
"action": "native",
"path": "/select", (1)
"method": "POST", (1)
"queryParams": { (1)
"q": "description:Pureinsights"
},
"body": {},
"maxResponseMapDepth": 5
},
"server": <Solr Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Solr integration section. |
Each configuration field is defined as follows:
path-
(Required, String) The Solr operation path to be used for the request.
method-
(Required, String) The HTTP method for the request.
queryParams-
(Required, Map of String/String) The map of query parameters for the request.
body-
(Optional, Object) The JSON body to submit for the request. The exact structure will depend on the Solr operation performed.
maxResponseMapDepth-
(Optional, Integer) The maximum depth for response object deserialization. Defaults to
5.
The response of the action is stored in the JSON Data Channel as returned by the Solr API:
{
"solr": {
...
}
}
Action: search
Processor that executes a standard search query.
{
"type": "solr",
"name": "My Solr Processor Action",
"config": {
"action": "search",
"query": "#{ data('my/query') }", (1)
"filterQueries": "description:Pureinsights",
"fields": [
"name",
"description"
],
"highlight": false,
"maxResponseMapDepth": 5
},
"server": <Solr Server UUID> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Solr integration section. |
Each configuration field is defined as follows:
query-
(Required, String) The search query to be executed.
fields-
(Optional, Array of Strings) The optional returned fields of the document. If not set, all the fields in the document are returned.
highlight-
(Optional, Boolean) Whether to enable highlighting in the resulting query or not.
filterQueries-
(Optional, String) The filter queries to be applied to the search.
maxResponseMapDepth-
(Optional, Integer) The maximum depth for response object deserialization. Defaults to
5.
The response of the action is stored in the JSON Data Channel as returned by the Solr API:
{
"solr": {
...
}
}
Staging
Interacts with buckets and content from Discovery Staging. This component’s output field is named staging by default.
Action: fetch
Gets a document from the given bucket.
{
"type": "staging",
"name": "My Staging Processor Action",
"config": {
"action": "fetch",
"bucket": "my-bucket", (1)
"id": "my-document-id", (1)
"fields": <DSL Projection> (2)
}
}
| 1 | These configuration fields are required. |
| 2 | See the DSL Projection section. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket name.
id-
(Required, String) The ID of the document to fetch.
fields-
(Optional, Projection) The projection to apply on the document.
The response of the action is stored in the JSON Data Channel as returned by the Staging client:
{
"staging": {
...
}
}
Action: store
Stores a document into the given bucket.
{
"type": "staging",
"name": "My Staging Processor Action",
"config": {
"action": "store",
"bucket": "my-bucket", (1)
"document": { (1)
"field": "value"
},
"id": <Staging Document ID>, (1)
"allowOverride": false
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket name.
document-
(Required, Object) The document to store.
id-
(Optional, String) The ID of the document to store. If not provided, a random UUID will be used.
allowOverride-
(Optional, Boolean) Whether to allow overriding an existing document. Defaults to
false.
The response of the action is stored in the JSON Data Channel as returned by the Staging client:
{
"staging": {
...
}
}
Action: search
Searches for documents in the given bucket.
{
"type": "staging",
"name": "My Staging Processor Action",
"config": {
"action": "search",
"bucket": "my-bucket", (1)
"actions": ["STORE"], (1)
"filter": <DSL Filter>, (2)
"projection": <DSL Projection>, (3)
"parentId": <Staging Document ID>,
"pageable": <Pagination Parameters> (4)
}
}
| 1 | These configuration fields are required. |
| 2 | See the DSL Filter section. |
| 3 | See the DSL Projection section. |
| 4 | See the Pagination appendix. |
Each configuration field is defined as follows:
bucket-
(Required, String) The bucket name.
actions-
(Required, Array of Strings) The actions to filter the documents. Defaults to STORE.
projection-
(Optional, Projection) The projection to apply on the search.
filter-
(Optional, DSL Filter) The filter to apply on the search.
parentId-
(Optional, String) The parent ID to match.
pageable-
(Optional, Pagination) The pagination object.
Details
Page configuration example
{
  "page": 0,
  "size": 25,
  "sort": [
    {
      "property": "fieldA",
      "direction": "ASC"
    },
    {
      "property": "fieldB",
      "direction": "DESC"
    }
  ]
}
Each configuration field is defined as follows:
page-
(Integer) The page number.
size-
(Integer) The size of the page.
sort-
(Array of Objects) The sort definitions for the page.
Field definitions
property-
(String) The property where the sort was applied.
direction-
(String) The direction of the applied sorting. Either ASC or DESC.
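The pagination behavior described by these fields can be sketched in Python (a simplified in-memory model for illustration, not the Staging implementation):

```python
def paginate(items, page, size, sort=None):
    # Apply sort definitions in reverse so the first definition has highest precedence.
    for s in reversed(sort or []):
        items = sorted(items, key=lambda d: d[s["property"]],
                       reverse=(s["direction"] == "DESC"))
    # Pages are zero-based: page 0 holds the first "size" items.
    start = page * size
    return items[start:start + size]
```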
The response of the action is stored in the JSON Data Channel as returned by the Staging client:
{
"staging": {
...
}
}
Template
Uses the Template Engine to transform a standard template with contextual structured data, generating a verbalized representation of the information. It can generate various types of documents as either plain text or JSON. This component’s output field is named template by default.
Processor Action: process
Processor that processes the provided template with the defined configuration.
{
"type": "template",
"name": "My Template Processor Action",
"config": {
"action": "process",
"template": "Hello, ${name}!", (1)
"bindings": { (1)
"name": "John"
},
"outputFormat": "PLAIN"
}
}
| 1 | These configuration fields are required. |
Each configuration field is defined as follows:
template-
(Required, String) The template to process.
bindings-
(Required, Object) The bindings to replace in the template.
Binding object format
{
  "bindingA": "#{ data('/my/binding/field') }",
  ...
}
Each binding, defined as a key in the object, can later be referenced in a template:
My bindingA value is ${bindingA}
outputFormat-
(Optional, String) The output format of the processed template. Supported formats are: JSON and PLAIN. Defaults to PLAIN.
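The substitution semantics of template and bindings can be illustrated with Python's string.Template, which happens to share the ${name} placeholder syntax (the actual Template Engine supports richer templates than simple substitution):

```python
from string import Template

# Mirrors the configuration example: template "Hello, ${name}!"
# with bindings {"name": "John"}.
result = Template("Hello, ${name}!").substitute({"name": "John"})
```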
The response of the action is stored in the JSON Data Channel as returned by the Template engine:
{
"template": {
...
}
}
Tokenizer
Tokenizes a specified text input using Lucene. This component’s output field is named tokens by default.
Processor Action: process
Processor that tokenizes any entry provided using Lucene analyzers. The supported analyzers are: standard, the language-specific analyzers, whitespace, and custom.
|
Note
|
All analyzers (except the custom one, which requires the tokenizer configuration) will be used as built by default if no configuration is specified. Further configuration is available under the analyzer field. |
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"text": "#{ data(\"/httpRequest/queryParams/q\") }", (1)
"attributes": ["term", "offset"],
"analyzer": <Analyzer Configuration> (2)
}
}
| 1 | This configuration field is required. |
| 2 | See the analyzer configuration. |
Each configuration field is defined as follows:
text-
(Required, String) The text to tokenize.
attributes-
(Optional, Array of Strings) The attributes to include with each token. Supports term and offset. Default is ["term", "offset"].
Type definitions
term-
Adds the token itself as an attribute.
offset-
Adds the relative start and end position of the token in the input text.
|
Note
|
The attributes are added to the configuration as a list; all of those included will be added to the output. That list, if specified, cannot be empty. |
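What the term and offset attributes carry can be sketched with a simple whitespace split (for illustration only; the actual tokens depend on the configured analyzer):

```python
import re

def tokenize_with_offsets(text):
    # Emit each token's term plus its start/end offsets in the input text.
    return [{"term": m.group(), "offset": {"start": m.start(), "end": m.end()}}
            for m in re.finditer(r"\S+", text)]

tokenize_with_offsets("hello world")
```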
analyzer-
(Optional, String or Map of String/Object) The analyzer to use for the tokenization. Defaults to standard.
Field definitions
If the analyzer field is set as a string, a default analyzer, without further configuration, will be used in the action. The value should be the name of one of the supported analyzers:
Default standard analyzer example
{
  "analyzer": "standard"
}
If the analyzer field is set as an object representing a Map of String/Object, the analyzer used in the action can be further customized. The exact expected structure of the map’s objects, which is the custom analyzer’s configuration, will depend on the chosen type of analyzer:
Standard analyzer configuration
Standard analyzer configuration example
{
  "type": "standard",
  "stopwords": {
    "tokens": [
      "the"
    ],
    "ignoreCase": true
  },
  "maxTokenLength": 4
}
Each configuration field is defined as follows:
maxTokenLength-
(Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.
stopwords-
(Optional, Array of Strings or Map of String/Object) A set of common words usually not useful for search. The field can be defined as a map object to configure both the list of words, via the tokens field, and whether to ignore case when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.
Language analyzers configuration
Language analyzer configuration example
{
  "type": "spanish",
  "stopwords": {
    "tokens": [
      "va"
    ],
    "ignoreCase": true
  },
  "stemExclusion": [
    "voy"
  ]
}
Each configuration field is defined as follows:
stopwords-
(Optional, Array of Strings or Map of String/Object) A set of common words usually not useful for search. The field can be defined as a map object to configure both the list of words, via the tokens field, and whether to ignore case when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.
stemExclusion-
(Optional, Array of Strings or Map of String/Object) A set of words not to be stemmed. The field can be defined as a map object to configure both the list of words, via the tokens field, and whether to ignore case when identifying the words, via the ignoreCase field. The words can also be defined directly as a list of Strings, in which case the ignoreCase field defaults to false.
Whitespace analyzer configuration
Whitespace analyzer configuration example
{
  "type": "whitespace",
  "maxTokenLength": 255
}
Each configuration field is defined as follows:
maxTokenLength-
(Optional, Int) The maximum token length the analyzer will emit. Defaults to 255.
Custom analyzer configuration
Custom analyzer configuration example
{
  "type": "custom",
  "tokenizer": {
    "type": "standard",
    "maxTokenLength": 4
  },
  "filters": [
    "lowercase",
    {
      "type": "edgeNgram",
      "minGramSize": 2,
      "maxGramSize": 3
    }
  ]
}
tokenizer-
(Required, String or Map of String/Object) Tokenizer for the custom analyzer. The field can be configured as a map object to configure both the tokenizer type, via the type field, and its parameters, the latter by setting the values in the remaining key/value pairs of the map object. This field can also be configured as a single String that represents the tokenizer type, in which case a default tokenizer, without further customization, is used.
filters-
(Optional, Array of Objects) List of filters to be applied. The parameters of each element (filter) of the array may be configured in the same manner as for the tokenizer field. Alternatively, default filters may also be configured only by name with a String element in the array.
The response of the action is stored in the JSON Data Channel as returned by the Lucene engine:
{
"tokens": {
...
}
}
Examples
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"text": "#{ data(\"/httpRequest/queryParams/q\") }"
}
}
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"analyzer": "whitespace",
"attributes": [
"term"
],
"text": "#{ data(\"/httpRequest/body/custom/field\") }"
}
}
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"analyzer": {
"type": "english",
"stopwords":{
"tokens": [
"the"
],
"ignoreCase": true
},
"stemExclusion": [
"quick"
]
},
"attributes": [
"term"
],
"text": "#{ data(\"/httpRequest/queryParams/q\") }"
}
}
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"analyzer": {
"type": "whitespace",
"maxTokenLength": 4
},
"attributes": [
"term"
],
"text": "#{ data(\"/httpRequest/queryParams/q\") }"
}
}
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"text": "Hi, my cat is INJURED in its paw.",
"analyzer": {
"type": "custom",
"tokenizer": "whitespace",
"filters": [
"lowercase"
]
},
"attributes": [
"term"
]
}
}
{
"type": "tokenizer",
"name": "My Tokenizer Processor Action",
"config": {
"action": "process",
"text": "Hi, my cat is INJURED in its paw.",
"analyzer": {
"type": "custom",
"tokenizer": {
"type": "standard",
"maxTokenLength": 4
},
"filters": [
"lowercase",
{
"type": "edgeNgram",
"minGramSize": 2,
"maxGramSize": 3
}
]
},
"attributes": [
"term"
]
}
}
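The edgeNgram filter in the last example emits leading substrings between minGramSize and maxGramSize characters; a minimal sketch of that behavior:

```python
def edge_ngrams(token, min_gram, max_gram):
    # Emit leading substrings from min_gram up to max_gram characters,
    # bounded by the token length.
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]

# Lowercase first, then n-gram, mirroring the filter order in the example.
[g for t in ["Hi,", "my", "cat"] for g in edge_ngrams(t.lower(), 2, 3)]
```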
Vespa
Uses the Vespa integration to send requests to a Vespa service. This component’s output field is named vespa by default.
Action: native
Processor that executes an HTTP request to a Vespa service.
{
"type": "vespa",
"name": "My Vespa Native Action",
"config": {
"action": "native", (1)
"method": "POST", (1)
"path": "/search", (1)
"queryParams": { (2)
"timeout": "120s"
},
"body": { (2)
"yql": "select * from sources * where true"
}
},
"server": <Vespa Server UUID> (3)
}
| 1 | These configuration fields are required. |
| 2 | The exact expected structure of these objects is defined by the Vespa API; this is just an example at the time of writing. |
| 3 | See the Vespa integration section. |
Each configuration field is defined as follows:
method-
(Required, String) The HTTP method for the request.
path-
(Required, String) The endpoint of the request, excluding schema, host, port and any path included as part of the connection.
queryParams-
(Optional, Map of String/String) The map of query parameters for the URL.
body-
(Optional, Object) The JSON body to submit as part of the request.
The response of the action is stored in the JSON Data Channel as returned by the Vespa API:
{
"vespa": {
...
}
}
Voyage AI
Uses the Voyage AI integration to send requests to the Voyage AI API. Supports multiple actions for different endpoints of the service. This component’s output field is named voyage-ai by default.
Action: reranking
Processor that, given a query and a list of documents, returns the relevancy ranking between the query and each document. See Voyage AI Rerankers and the API Rerankers endpoint.
{
"type": "voyage-ai",
"name": "My Reranking Action",
"config": {
"action": "reranking",
"model": "rerank-lite-1", (1)
"query": "Sample query", (1)
"documents": ["Sample document 1", "Sample document 2"], (1)
"truncation": true,
"topK": 10,
"returnDocuments": false
},
"server": <Voyage AI Server> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Voyage AI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request. See models.
query-
(Required, String) The query the rerank is based on, as a String.
documents-
(Required, Array of Strings) Documents to be reranked as a list of String.
truncation-
(Optional, Boolean) Whether to truncate the input to satisfy the context length limit on the query and the documents. Defaults to true.
topK-
(Optional, Integer) The number of most relevant documents to return.
returnDocuments-
(Optional, Boolean) Whether to include the documents in the response. Defaults to false.
The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:
{
"voyage-ai": {
...
}
}
Action: embeddings
Processor that, given an input string (or a list of strings) and other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Embeddings and the API Text embedding models endpoint.
{
"type": "voyage-ai",
"name": "My Embeddings Action",
"config": {
"action": "embeddings",
"model": "voyage-large-2", (1)
"input": ["Sample text 1", "Sample text 2"], (1)
"truncation": true,
"inputType": "DOCUMENT",
"outputDimension": 1536,
"outputDatatype": "FLOAT",
"encodingFormat": "Base64"
},
"server": <Voyage AI Server> (2)
}
| 1 | These configuration fields are required. |
| 2 | See the Voyage AI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request. See models.
input-
(Required, String or List of Strings) List of documents to be embedded. If there’s a single input, it can be configured as a single String.
truncation-
(Optional, Boolean) Whether to truncate the input texts to fit within the context length. Defaults to true.
inputType-
(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.
outputDimension-
(Optional, Integer) The number of dimensions for the resulting output embeddings. Defaults to null.
outputDatatype-
(Optional, String) The data type for the embeddings to be returned. One of: FLOAT, INT8, UINT8, BINARY or UBINARY. Defaults to FLOAT.
encodingFormat-
(Optional, String) Format in which the embeddings are encoded. Defaults to null, but can be set to Base64.
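When encodingFormat is set to Base64, the embeddings arrive base64-encoded rather than as JSON arrays of numbers. A decoding sketch, assuming little-endian float32 packing (an assumption; confirm the exact layout in the Voyage AI documentation):

```python
import base64
import struct

def decode_embedding(b64_data, dim):
    # Assumes little-endian float32 packing; verify against the Voyage AI docs.
    raw = base64.b64decode(b64_data)
    return list(struct.unpack(f"<{dim}f", raw))

# Round-trip illustration with a hypothetical 3-dimensional embedding.
encoded = base64.b64encode(struct.pack("<3f", 0.1, 0.2, 0.3)).decode()
vector = decode_embedding(encoded, 3)
```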
The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:
{
"voyage-ai": {
...
}
}
Action: multimodal-embeddings
Processor that, given a list of multimodal inputs consisting of text, images, or an interleaving of both, plus other arguments such as the preferred model name, returns a response containing a list of embeddings. See Voyage AI Multimodal Embedding and the API multimodal embedding models endpoint.
{
"type": "voyage-ai",
"name": "My Multimodal Embeddings Action",
"config": {
"action": "multimodal-embeddings",
"model": "voyage-multimodal-3", (1)
"input": [ (1) (2)
<Input Objects>
],
"truncation": true,
"inputType": "DOCUMENT",
"outputEncoding": "Base64"
},
"server": <Voyage AI Server> (3)
}
| 1 | These configuration fields are required. |
| 2 | There are multiple types of accepted inputs, check the input object definition for details. |
| 3 | See the Voyage AI integration section. |
Each configuration field is defined as follows:
model-
(Required, String) The model to use for the request. See models.
input-
(Required, Array of Objects) A list of multimodal inputs to be vectorized. Each object in the list represents an input.
Field definitions
type-
(Required, String) The type. One of: text, image_url or image_base64.
text-
(Optional, String) The text, if the type text is chosen.
Text input example
{
  "type": "text",
  "text": "This is a banana."
}
imageUrl-
(Optional, String) The image URL, if the type image_url is chosen.
Image URL input example
{
  "type": "image_url",
  "imageUrl": "https://raw.githubusercontent.com/voyage-ai/voyage-multimodal-3/refs/heads/main/images/banana.jpg"
}
imageBase64-
(Optional, Object) The Base64-encoded image, if the type image_base64 is chosen.
Image Base64 input example
{
  "type": "image_base64",
  "imageBase64": {
    "mediaType": "image/jpeg",
    "base64": true,
    "data": "/9j/4AAQSkZJRgABAQEAYABgAAD(...)"
  }
}
mediaType-
(Required, String) The data media type. Supported media types are: image/png, image/jpeg, image/webp, and image/gif.
base64-
(Required, Boolean) Whether the data is encoded in Base64.
data-
(Required, String) The data itself.
truncation-
(Optional, Boolean) Whether to truncate the inputs to fit within the context length. Defaults to true.
inputType-
(Optional, String) Type of the input text. One of: QUERY or DOCUMENT. Defaults to null.
outputEncoding-
(Optional, String) Format in which the embeddings are encoded. One of: base64. Defaults to null.
The response of the action is stored in the JSON Data Channel as returned by the Voyage AI API:
{
"voyage-ai": {
...
}
}
Sandbox API
The Sandbox API allows the user to execute standalone processors without setting up an endpoint, and returns the corresponding response from the execution. This API must be enabled via the queryflow.sandbox.enabled property.
$ curl --request POST 'queryflow-api:12040/v2/sandbox?timeout=PT15S'
Query Parameters
timeout-
(Optional, Duration) The timeout for the request execution. Defaults to
15s.
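The timeout parameter is an ISO-8601 duration (PT15S in the example above). A minimal parser sketch for the single-unit forms used in these examples:

```python
import re

def parse_simple_duration(value):
    # Handles only the simple PT<n>S / PT<n>M / PT<n>H forms, e.g. "PT15S".
    match = re.fullmatch(r"PT(\d+)([HMS])", value)
    if not match:
        raise ValueError(f"unsupported duration: {value}")
    amount, unit = int(match.group(1)), match.group(2)
    return amount * {"S": 1, "M": 60, "H": 3600}[unit]

parse_simple_duration("PT15S")  # 15 seconds
```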
Body
{
"processor": {
"type": "my-component-type",
"config": {
...
},
"server": {
"type": "my-server-type",
"config": {
...
},
"credential": {
"type": "my-credential-type",
"secret": {
...
}
}
}
},
"input": {
...
}
}
processor-
(Required, Object) The configuration for the processor to be executed.
Details
type-
(Required, String) The type of the component to execute.
config-
(Required, Object) The configuration for the corresponding action of the component.
server-
(Optional, UUID/Object) Either the ID of an existing server or the type and configuration for one.
Details
type-
(Required, String) The type of external supported service.
config-
(Required, Object) The configuration to connect to the external service.
credential-
(Optional, UUID/Object) Either the ID of an existing credential or the type and data for one.
Details
type-
(Required, String) The type of credentials for the external supported service.
secret-
(Required, String/Object) Either the secret key to connect to the external service, or an object with the authentication details.
input-
(Required, Object) The input to be sent to the processor.
Discovery Sandbox SDK
The Sandbox SDK is a Python library that allows developers to programmatically access Discovery features inside of a Python execution environment without the need for extensive setup. Currently, it supports sending execution requests for Discovery QueryFlow processors and obtaining their result. It requires Python 3.13 or higher.
Quickstart
from sandbox.discovery_sandbox import QueryFlowClient, Processor, QueryFlowSequence, QueryFlowSequenceProcessor
# Initialize the client
client = QueryFlowClient(url="https://your-queryflow-api:12040", api_key="YOUR_API_KEY")
# Define processors
my_processor = Processor(
type="processor_type",
config={ ... }
)
another_processor = Processor(
type="processor_type",
config={ ... }
)
# Define a streaming processor
streaming_processor = Processor(
type="processor_type",
config={
"stream": True,
...
}
)
# Define input
my_input = {"text": "Hello world!"}
try:
# Execute text_to_text
response = client.text_to_text(my_processor, my_input)
print("Text_to_Text Response:", response)
# Alternatively, if the processor supports streaming data
print("Text_to_Stream Response:")
for chunk in client.text_to_stream(streaming_processor, my_input):
print(chunk, end="")
except Exception as e:
print(f"An error occurred: {e}")
# Example with a Processor Sequence
first_processor = QueryFlowSequenceProcessor(my_processor, "PT15S")
second_processor = QueryFlowSequenceProcessor(another_processor, "PT30S")
sequence = QueryFlowSequence([first_processor, second_processor])
result = client.execute(sequence, my_input)
Installation
You can install the Sandbox SDK and all necessary dependencies into your Python execution environment via pip.
cd pdp-sandbox
pip install .
Core Entities
The SDK uses several classes to represent the components involved in a QueryFlow request. These mirror the configuration for QueryFlow and external integration components.
Credential
Represents a Credential entity.
Attributes
type-
(Required, String) The credential type.
secret-
(Required, Dict) A dictionary containing the secret data.
Server
Represents a Server entity.
Attributes
type-
(Required, String) The server type.
config-
(Required, Dict) The server configuration.
credential-
(Optional, Credential) The credential for authentication with the server.
Processor
Represents an executable Processor entity.
Attributes
type-
(Required, String) The type of the processor.
config-
(Required, Dict) The configuration for the processor.
server-
(Optional, Server) Server configuration to be associated with the processor.
QueryFlowClient
The QueryFlowClient is the main interface for interacting with the QueryFlow Sandbox API.
text_to_text
Executes a processor and returns its complete response as a dictionary. This method is overloaded and can accept either a full Processor object or the ID of a pre-existing processor. Returns the JSON response from the processor execution. If the server returns a 204 No Content, an empty dictionary {} is returned.
Parameters
processor-
(Required, Processor) The Processor object to execute.
input-
(Required, Dict) The input data to send to the processor.
timeout-
(Optional, Duration) The timeout for the request execution.
text_to_stream
Executes a processor with support for streaming responses, yielding response chunks as they are received. This method is overloaded similarly to text_to_text. The processor must support the stream key and enable it as part of its configuration.
Parameters
processor-
(Required, Processor) The Processor object to execute.
input-
(Required, Dict) The input data to send to the processor.
timeout-
(Optional, Duration) The timeout for the request execution.
execute
Executes a sequence of processors, where the output of one processor becomes the input for the next.
Parameters
sequence-
(Required, QueryFlowSequence) The sequence of processors to execute.
input_data-
(Required, Dict) The initial input to be fed into the first processor of the sequence.
Sequential Execution
The Sandbox SDK additionally provides an interface to execute non-streaming processors sequentially, sending the output of a processor as the input for the next.
QueryFlowSequenceProcessor
Represents a single step in a QueryFlowSequence.
Attributes
processor-
(Required, String or Processor) Either a Processor object to define a new processor for this step, or the string UUID of an existing processor.
timeout-
(Optional, Duration) The timeout specifically for this processor’s execution within the sequence.
QueryFlowSequence
Represents the list of processors to be executed.
Attributes
processors-
(Required, List of QueryFlowSequenceProcessor) A list of QueryFlowSequenceProcessor objects defining the sequence.
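The chaining semantics (each processor's output becomes the next one's input) can be sketched with plain callables standing in for processors; this is illustrative only, not the SDK's implementation:

```python
def run_sequence(steps, input_data):
    """Feed input_data through each step; each step's output
    becomes the next step's input, mirroring QueryFlowSequence."""
    data = input_data
    for step in steps:
        data = step(data)
    return data

# Two toy "processors": one uppercases the text, one wraps the result
steps = [
    lambda d: {"text": d["text"].upper()},
    lambda d: {"result": f"<<{d['text']}>>"},
]
print(run_sequence(steps, {"text": "hello"}))  # {'result': '<<HELLO>>'}
```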
Labeling a Configuration
Labels API
$ curl --request POST 'core-api:12010/v2/label' --data '{ ... }'
Body
{
"key": "My Label Key",
"value": "My Label Value"
}
key-
(Required, String) The key of the label
value-
(Required, String) The value of the label
$ curl --request GET 'core-api:12010/v2/label?page={page}&size={size}&sort={sort}'
Query Parameters
page-
(Optional, Int) The page number. Defaults to 0.
size-
(Optional, Int) The size of the page. Defaults to 25.
sort-
(Optional, Array of String) The sort definition for the page.
$ curl --request GET 'core-api:12010/v2/label/{id}'
Path Parameters
id-
(Required, String) The label ID.
$ curl --request PUT 'core-api:12010/v2/label/{id}' --data '{ ... }'
Path Parameters
id-
(Required, String) The label ID.
Body
{
"key": "My Label Key",
"value": "My Label Value"
}
key-
(Required, String) The key of the label
value-
(Required, String) The value of the label
$ curl --request DELETE 'core-api:12010/v2/label/{id}'
Path Parameters
id-
(Required, String) The label ID.
Labels are simple key/value pairs that help reference user configurations. Any configuration can be tagged with labels, either created here beforehand or during the CRUD process of the entity itself. Labels are limited to a maximum of 45 characters, for both key and value.
Note: When creating multiple labels during the CRUD process of other entities (e.g. a server or a credential), duplicates will be ignored.
To create a new label directly from an entity configuration, the following property must be included as part of the body payload:
{
"labels": [
{
"key": "My Label Key",
"value": "My Label Value"
},
...
],
...
}
key-
(Required, String) The key of the label
value-
(Required, String) The value of the label
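Because both key and value are capped at 45 characters, a client can validate labels before building the payload. A minimal, hypothetical helper (not part of any Discovery API):

```python
MAX_LABEL_LENGTH = 45  # limit for both key and value

def validate_label(key: str, value: str) -> dict:
    """Return a label dict, rejecting empty or over-long keys/values."""
    for name, text in (("key", key), ("value", value)):
        if not text:
            raise ValueError(f"label {name} must not be empty")
        if len(text) > MAX_LABEL_LENGTH:
            raise ValueError(f"label {name} exceeds {MAX_LABEL_LENGTH} characters")
    return {"key": key, "value": value}

print(validate_label("env", "production"))  # {'key': 'env', 'value': 'production'}
```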
Backup & Restore
Core Backup API
$ curl --request GET 'core-api:12010/v2/export'
$ curl --request POST 'core-api:12010/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict-
(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.
QueryFlow Backup API
$ curl --request GET 'queryflow-api:12040/v2/export'
$ curl --request POST 'queryflow-api:12040/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict-
(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.
Ingestion Backup API
$ curl --request GET 'ingestion-api:12030/v2/export'
$ curl --request POST 'ingestion-api:12030/v2/import?onConflict={onConflict}' --form 'file=@/../../export-20240319T1030.zip'
Query Parameters
onConflict-
(Optional, String) The action to execute when there is a conflict with imported entities. Defaults to FAIL. Supported actions are: IGNORE, UPDATE and FAIL.
Each product (Core, QueryFlow, and Ingestion) has its own backup and restore API. The entity distribution is as follows:
Note: Labels are skipped, as they will be handled during the creation of other entities.
Note: Secrets are not part of this process for security reasons. All credentials assume their referenced secret already exists or will be created by other means.
The backup and restore of the entities is done through a single export-{timestamp}.zip file that contains one Newline Delimited JSON (ndjson) file per entity type. Each configuration is exported in the correct order, so it can be imported back without missing-dependency problems.
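The archive layout (one ndjson file per entity type inside a single ZIP) can be reproduced for inspection. A standalone sketch with made-up entity data; the real archive's file names and fields may differ:

```python
import io
import json
import zipfile

# Build an in-memory archive shaped like an export: one ndjson file per entity type
servers = [{"id": "s1", "type": "mongodb"}, {"id": "s2", "type": "oracle"}]
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr("servers.ndjson", "\n".join(json.dumps(e) for e in servers))

# Read it back: one JSON document per line
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as zf:
    lines = zf.read("servers.ndjson").decode().splitlines()
    entities = [json.loads(line) for line in lines]
print(entities)  # [{'id': 's1', 'type': 'mongodb'}, {'id': 's2', 'type': 'oracle'}]
```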
Note: Manual modifications of the exported file might corrupt the backup.
Since the ID of each exported entity is expected to remain the same after importing, conflicts might arise. The restore process has 3 resolution strategies:
-
IGNORE: The imported entity will be ignored, keeping the existing one unchanged.
-
UPDATE: The existing entity will be updated with the imported entity's values.
-
FAIL: The existing entity will not be modified, and an error will be thrown.
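The three strategies can be summarized as a small decision function. This is an illustrative sketch only; the real restore logic is internal to the APIs:

```python
def resolve_conflict(existing: dict, imported: dict, on_conflict: str = "FAIL") -> dict:
    """Decide what an entity looks like after importing over an existing one."""
    if on_conflict == "IGNORE":
        return existing                  # keep the current entity unchanged
    if on_conflict == "UPDATE":
        return {**existing, **imported}  # overwrite with the imported values
    raise ValueError(f"conflict on entity {existing.get('id')!r}")  # FAIL

existing = {"id": "abc", "name": "old"}
imported = {"id": "abc", "name": "new"}
print(resolve_conflict(existing, imported, "UPDATE"))  # {'id': 'abc', 'name': 'new'}
```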
Appendix A: Pagination and Sorting
Any endpoint that paginates results receives the following optional query parameters:
page-
(Optional, Integer) The page to retrieve. Defaults to 0.
Note: If the provided value is invalid, it will be replaced by the default one.
size-
(Optional, Integer) The size of the page. Must be an integer between 1 and 100. Defaults to 25.
Note: If the provided value is invalid or out of range, it will be replaced by the default one.
sort-
(Optional, String) The sorting fields, with an optional direction. Ascending by default: sort=<string>[,(asc|desc)].
Note: This parameter can be used multiple times (e.g. sort=fieldA&sort=fieldB,desc).
The response of a paginated request will either be an empty payload with a 204 - No Content status code, or a 200 - OK with the results page:
{
"content": [
{
...
},
...
],
"pageable": {
...
},
"totalSize": 1,
"totalPages": 1,
"numberOfElements": 1,
"pageNumber": 0,
"empty": false,
"size": 25,
"offset": 0
}
content-
(Array of Objects) The page content.
pageable-
(Object) The page request information.
Details
Page configuration example:
{
"page": 0,
"size": 25,
"sort": [
{ "property": "fieldA", "direction": "ASC" },
{ "property": "fieldB", "direction": "DESC" }
]
}
Each configuration field is defined as follows:
page-
(Integer) The page number.
size-
(Integer) The size of the page.
sort-
(Array of Objects) The sort definitions for the page.
Field definitions
property-
(String) The property where the sort was applied.
direction-
(String) The direction of the applied sorting. Either ASC or DESC.
totalSize-
(Integer) The total number of records.
totalPages-
(Integer) The total number of pages.
numberOfElements-
(Integer) The number of elements on the returned slice of content.
pageNumber-
(Integer) The current page number.
empty-
(Boolean) true if the page has no content.
size-
(Integer) The size of the returned slice of content.
offset-
(Integer) The page offset.
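The numeric fields of the page are related by simple arithmetic. A sketch under the assumption that offset = pageNumber * size and totalPages is the ceiling of totalSize / size:

```python
import math

def page_stats(total_size: int, page: int = 0, size: int = 25) -> dict:
    """Derive the paging fields a response would carry for a given request."""
    total_pages = math.ceil(total_size / size) if total_size else 0
    offset = page * size
    number_of_elements = max(0, min(size, total_size - offset))
    return {
        "totalSize": total_size,
        "totalPages": total_pages,
        "numberOfElements": number_of_elements,
        "pageNumber": page,
        "empty": number_of_elements == 0,
        "size": size,
        "offset": offset,
    }

print(page_stats(60, page=2, size=25))  # the last page holds the remaining 10 records
```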
Appendix B: Date and Time Patterns
In some instances, string patterns may be required to represent dates. These patterns consist of a series of letters and symbols that describe the structure the formatted date should follow. To create them, use the following table of definitions:
| Symbol | Meaning |
|---|---|
| G | The era (e.g. AD) |
| u | The year |
| y | The year of the era |
| D | The day of the year |
| M | The month of the year |
| L | The month of the year |
| d | The day of the month |
| Q | The quarter of the year |
| q | The quarter of the year |
| Y | The week-based year |
| w | The week of the week-based year |
| W | The week of the month |
| E | The day of the week |
| e | The localized day of the week |
| c | The localized day of the week |
| F | The week of the month |
| a | The am/pm of the day |
| h | The clock hour (1-12) |
| K | The clock hour (0-11) |
| k | The clock hour (1-24) |
| H | The hour of the day (0-23) |
| m | The minute of the hour |
| s | The second of the minute |
| S | The fraction of the second |
| A | The milliseconds |
| n | The nanoseconds |
| N | The nanoseconds of the day |
| V | The time-zone ID |
| z | The time-zone name |
| O | The localized zone offset |
| X | The zone offset |
| x | The zone offset |
| Z | The zone offset |
| p | Pad the next element |
| ' | Escape for text |
| '' | A single quote |
| [ | Start of an optional section |
| ] | End of an optional section |
Each symbol may be used 'n' consecutive times (e.g. uuuu); this determines whether a short or long form of the representation is used. The definition of these forms may vary depending on the type a symbol represents. The following list shows the basic representations depending on the type and the number of times 'n' a symbol is repeated:
- Text
  - n < 4: Abbreviation (e.g. Wed for Wednesday)
  - n = 4: Full form
  - n = 5: Normally one letter (e.g. W for Wednesday)
- Number
  - n: The number, zero-padded to the requested width (e.g. 3 -> 001). The maximum repetitions depend on the symbol:
    - c, F: n <= 1
    - d, H, h, K, k, m, and s: n <= 2
    - D: n <= 3
- Number and Text (combination of both)
  - n >= 3: Treated as Text
  - n < 3: Treated as a Number
- Fractions
  - n <= 9: The number of digits the fraction is truncated to
- Year
  - n = 2: Two digits (e.g. 23 for 2023)
  - n <= 4, n != 2: The full year
- ZoneId
  - n = 2: Outputs the zone ID
- Zone names
  - 1 <= n <= 3: Short name
  - n = 4: Full name
- Offset for 'X' and 'x'
  - n = 1: Just the hour if the minute is zero, otherwise hour and minute
  - n = 2: Hour and minute
  - n = 3: Hour and minute with a colon
  - n = 4: Hour, minute and second
  - n = 5: Hour, minute and second with a colon
- Offset for 'O'
  - n = 1: Short offset (e.g. GMT+1)
  - n = 4: Full offset (e.g. GMT+1:00)
- Offset for 'Z'
  - n <= 3: Hour and minute
  - n = 4: The full offset (e.g. GMT+1:00)
  - n = 5: Hour and minute with a colon
- Pad
  - n: The width to pad to
For example, the pattern "dd MM:ppppppuuuu" formats a date as "30 11:  2023". Special characters such as ':' and spaces can be combined with the symbols. Spacing can also be produced with the pad modifier: in the example, the year is padded to a width of 6, which adds 2 spaces between the ':' and the year.
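The padding behaviour from that example can be mimicked with Python string formatting, right-aligning the year in a field of width 6. This reproduces only this one pattern, not a general pattern engine:

```python
from datetime import date

def format_example(d: date) -> str:
    """Mimic the pattern "dd MM:ppppppuuuu": zero-padded day and month,
    then the year right-aligned in a width-6 field (the pppppp pad)."""
    return f"{d.day:02d} {d.month:02d}:{d.year:>6}"

print(format_example(date(2023, 11, 30)))  # '30 11:  2023'
```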
Appendix C: Error Messages
If a request to any API produces an error, a standard response will be returned:
{
"status": 409,
"code": 2001,
"messages": [
"Duplicate entry for field(s): name"
],
"timestamp": "2023-01-28T01:52:22.117244600Z"
}
Response Body
status-
(Integer) The HTTP status code of the response.
code-
(Integer) The internal error code.
messages-
(Array of String) An optional list of messages describing the error.
timestamp-
(Timestamp) The UTC timestamp of when the error happened.
Error Codes
The error code describes the error in more detail: it extends the information provided by the HTTP status code, as the same status could be caused by different problems.
Each code is composed of 2 digits that represent the category and 2 digits that represent the specific error, where the category can be:
-
10 - Resources: access to entities or endpoints
-
20 - Data integrity: entities referencing other entities, or other constraints such as unique keys
-
30 - Data validation: input data (format, missing fields…)
-
40 - Execution: problems while invoking an action
-
70 - Security: access and permissions
-
80 - Third-party: communication with external services
-
99 - Others: any other issue
| Code | Description |
|---|---|
| 1001 | The endpoint or HTTP method is undefined or disabled |
| 1002 | The requested bucket is missing |
| 1003 | The requested resource is missing |
| 2001 | The entity already exists (same name or any other combination of fields defined as unique) |
| 2002 | The entity to delete is referenced by other entities |
| 3001 | The input data is corrupted |
| 3002 | The input data is missing or invalid |
| 3003 | The input data is too large |
| 4001 | The action could not be executed due to the current state of the system |
| 4002 | The action was terminated due to a timeout |
| 4003 | The Core DSL expression could not be executed |
| 7001 | The action could not be executed due to the permissions of the user |
| 8001 | Could not establish connection to an external service |
| 8002 | The external service returned an error |
| 9901 | Custom user error |
| 9999 | Undefined error |
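Given the two-digit category and two-digit error layout, a client can split a code with integer division. A hypothetical helper, with category names taken from the list above:

```python
CATEGORIES = {
    10: "Resources", 20: "Data integrity", 30: "Data validation",
    40: "Execution", 70: "Security", 80: "Third-party", 99: "Others",
}

def describe_code(code: int) -> tuple:
    """Split an internal error code into its category name and specific error."""
    category, specific = divmod(code, 100)
    return CATEGORIES.get(category, "Unknown"), specific

print(describe_code(2001))  # ('Data integrity', 1)
```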
Appendix D: Metrics
Each Discovery component publishes metrics regarding health, performance, and Discovery-specific workloads. This page aims to give the user a head start in understanding these metrics by highlighting the most commonly used ones, their meaning, and the dimensions each of them has. Keep in mind that the majority of these metrics are published, by default, over a time lapse of one minute.
Dimensions
Each metric may have dimensions, which can be used as metric filters. These filters help ensure that only certain published values are taken into account. For example, the component dimension, which most metrics have, can be used to narrow down metrics to a specific component or API, like the Ingestion Script Component or the QueryFlow API. Below are some of the most commonly used dimensions; note that there are plenty more, so it is recommended to check them on each desired metric.
| Dimension | Description |
|---|---|
| component | The Discovery component that published the metric |
| | Found in cache metrics: the specific type of cache that produced the metric |
| | Found in metrics that measure an operation: the operation's result, such as a job status or a cache |
| | Found in metrics related to an Ingestion Seed Execution: the seed's ID |
Common Metrics
These metrics are published by most Discovery Components and are related to more than one product.
| Metric Name | Description | Dimensions |
|---|---|---|
| | The number of threads currently being throttled. Mostly used to monitor the throttler service used in Ingestion components | |
| | The number of times a cache was called in the last time lapse | |
Ingestion Metrics
These are metrics published by Ingestion Components that help monitor a Seed Execution.
| Metric Name | Description | Dimensions |
|---|---|---|
| | The number of jobs currently being executed | |
| | The number of records collected (i.e. that were processed) in the last time lapse | |
| | The average time, in milliseconds, taken to execute the jobs completed in the last time lapse | |
| | The number of jobs completed in the last time lapse | |
More information about the default metrics published by all Discovery products can be found in the Micronaut Micrometer documentation, under the Provided Binders sub-section.