Elastic Pipelines: Skip Import On Failure - elasticsearch

Individual processors in an Elasticsearch ingest pipeline have an on_failure attribute, which allows you to handle a failure/error in the pipeline. The example in the docs shows setting an additional field on your document:
{
"description" : "my first pipeline with handled exceptions",
"processors" : [
{
"rename" : {
"field" : "foo",
"to" : "bar",
"on_failure" : [
{
"set" : {
"field" : "error.message",
"value" : "{{ _ingest.on_failure_message }}"
}
}
]
}
}
]
}
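(For reference, a pipeline like this is stored via the ingest API and then referenced when indexing; a minimal sketch, where the pipeline id my-pipeline and the index my-index are made-up names:)

PUT _ingest/pipeline/my-pipeline
{
  "description": "my first pipeline with handled exceptions",
  "processors": [ ... ]
}

PUT my-index/_doc/1?pipeline=my-pipeline
{
  "foo": "some value"
}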
Is it possible to tell the pipeline to SKIP importing a document if any processor in the pipeline fails?

You can "hijack" the drop processor to skip either directly in the on_failure step (no need for _ingest.on_failure_message if you're aborting anyways):
{
"description": "my first pipeline with handled exceptions",
"processors": [
{
"rename" : {
"field" : "foo",
"target_field": "bar",
"on_failure" : [
{
"drop" : {
"if" : "true"
}
}
]
}
}
]
}
or use it as a separate processor, perhaps at the very end, after ctx.error has been set by any of the processors' on_failure handlers:
{
"description": "my first pipeline with handled exceptions",
"processors": [
{
"rename" : {
"field" : "foo",
"to" : "bar",
"on_failure" : [
{
"set" : {
"field" : "error.message",
"value" : "{{ _ingest.on_failure_message }}"
}
}
]
}
},
{
"drop": {
"if": "ctx.error.size() != null"
}
}
]
}
Both of these will result in a noop when the pipeline is applied.
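You can check this behaviour without indexing anything by running the pipeline through the simulate API. A sketch using the second variant above, with a made-up test document that is missing the foo field so that rename fails:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "my first pipeline with handled exceptions",
    "processors": [
      {
        "rename": {
          "field": "foo",
          "target_field": "bar",
          "on_failure": [
            { "set": { "field": "error.message", "value": "{{ _ingest.on_failure_message }}" } }
          ]
        }
      },
      { "drop": { "if": "ctx.error != null" } }
    ]
  },
  "docs": [
    { "_source": { "something_else": "no foo here, so the document gets dropped" } }
  ]
}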

Related

Filter in projection aggregation in SpringMongoDb

I have this aggregation that I want to transform into a Spring Data MongoDB aggregation.
A document has several named data entries in a list, but I want to project only two of them for a couple of documents with given IDs, as follows.
How do I put a filter in the Spring aggregation?
db.getCollection("data").aggregate(
[
{
"$match" : {
"_id" : {
"$in" : [
1,
2
]
}
}
},
{
"$project" : {
"datas" : {
"$filter" : {
"input" : "$datas",
"as" : "item",
"cond" : {
"$or" : [
{
"$eq" : [
"$$item.name",
"data1"
]
},
{
"$eq" : [
"$$item.name",
"data2"
]
}
]
}
}
}
}
}
]
);
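One way to express this with Spring Data MongoDB is an ArrayOperators.Filter expression inside the projection stage. This is only a sketch of the idea, not a verified answer (the collection name "data" comes from the shell query above; mongoTemplate is an assumed injected bean):

import org.bson.Document;
import org.springframework.data.mongodb.core.MongoTemplate;
import org.springframework.data.mongodb.core.aggregation.*;
import org.springframework.data.mongodb.core.query.Criteria;

import java.util.Arrays;
import java.util.List;

public class DataFilterAggregation {

    // $match on _id, then $project with a $filter over the "datas" array.
    public static List<Document> filteredDatas(MongoTemplate mongoTemplate) {
        MatchOperation match = Aggregation.match(Criteria.where("_id").in(Arrays.asList(1, 2)));

        // Keep only the array items whose name is "data1" or "data2".
        AggregationExpression onlyWantedDatas = ArrayOperators.Filter.filter("datas")
                .as("item")
                .by(BooleanOperators.Or.or(
                        ComparisonOperators.Eq.valueOf("item.name").equalToValue("data1"),
                        ComparisonOperators.Eq.valueOf("item.name").equalToValue("data2")));

        ProjectionOperation project = Aggregation.project().and(onlyWantedDatas).as("datas");

        Aggregation aggregation = Aggregation.newAggregation(match, project);
        return mongoTemplate.aggregate(aggregation, "data", Document.class).getMappedResults();
    }
}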

creating data stream through logstash

I have installed an Elasticsearch cluster v7.14.
I have created an ILM policy and an index template. However, the data stream parameters used in the Logstash pipeline file are giving an error.
ILM policy -
{
"testpolicy" : {
"version" : 1,
"modified_date" : "2021-08-28T02:58:25.942Z",
"policy" : {
"phases" : {
"hot" : {
"min_age" : "0ms",
"actions" : {
"rollover" : {
"max_primary_shard_size" : "900mb",
"max_age" : "2d"
},
"set_priority" : {
"priority" : 100
}
}
},
"delete" : {
"min_age" : "2d",
"actions" : {
"delete" : {
"delete_searchable_snapshot" : true
}
}
}
}
},
"in_use_by" : {
"indices" : [ ],
"data_streams" : [ ],
"composable_templates" : [ ]
}
}
}
Index template -
{
"index_templates" : [
{
"name" : "access_template",
"index_template" : {
"index_patterns" : [
"test-data-stream*"
],
"template" : {
"settings" : {
"index" : {
"number_of_shards" : "1",
"number_of_replicas" : "0"
}
},
"mappings" : {
"_routing" : {
"required" : false
},
"dynamic_date_formats" : [
"strict_date_optional_time",
"yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
],
"numeric_detection" : true,
"_source" : {
"excludes" : [ ],
"includes" : [ ],
"enabled" : true
},
"dynamic" : true,
"dynamic_templates" : [ ],
"date_detection" : true
}
},
"composed_of" : [ ],
"priority" : 500,
"version" : 1,
"data_stream" : {
"hidden" : false
}
}
}
]
}
Logstash pipeline config file -
input {
beats {
port => 5044
}
}
filter {
if [log_type] == "access_server" and [app_id] == "pa"
{
grok {
match => {
"message" => "%{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:%{MINUTE}(?::?%{SECOND})\| %{USERNAME:exchangeId}\| %{DATA:trackingId}\| %{NUMBER:RoundTrip:int}%{SPACE}ms\| %{NUMBER:ProxyRoundTrip:int}%{SPACE}ms\| %{NUMBER:UserInfoRoundTrip:int}%{SPACE}ms\| %{DATA:Resource}\| %{DATA:subject}\| %{DATA:authmech}\| %{DATA:scopes}\| %{IPV4:Client}\| %{WORD:method}\| %{DATA:Request_URI}\| %{INT:response_code}\| %{DATA:failedRuleType}\| %{DATA:failedRuleName}\| %{DATA:APP_Name}\| %{DATA:Resource_Name}\| %{DATA:Path_Prefix}"
}
}
mutate {
replace => {
"[type]" => "access_server"
}
}
}
}
output {
if [log_type] == "access_server" {
elasticsearch {
hosts => ['http://10.10.10.76:9200']
user => elastic
password => xxx
data_stream => "true"
data_stream_type => "logs"
data_stream_dataset => "access"
data_stream_namespace => "default"
ilm_rollover_alias => "access"
ilm_pattern => "000001"
ilm_policy => "testpolicy"
template => "/tmp/access_template"
template_name => "access_template"
}
}
elasticsearch {
hosts => ['http://10.10.10.76:9200']
index => "%{[#metadata][beat]}-%{[#metadata][version]}-%{+YYYY.MM.dd}"
user => elastic
password => xxx
}
}
After the deployment was done, I can only see system indices; the data stream is not created.
[2021-08-28T12:42:50,103][ERROR][logstash.outputs.elasticsearch][main] Invalid data stream configuration, following parameters are not supported: {"template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy"}
[2021-08-28T12:42:50,547][ERROR][logstash.javapipeline ][main] Pipeline error {:pipeline_id=>"main", :exception=>#<LogStash::ConfigurationError: Invalid data stream configuration: ["template", "ilm_pattern", "template_name", "ilm_rollover_alias", "ilm_policy"]>, :backtrace=>["/usr/share/logstash/vendor/bundle/jruby/2.5.0/gems/logstash-output-elasticsearch-11.0.2-java/lib/logstash/outputs/elasticsearch/data_stream_support.rb:57:in `check_data_stream_config!'"
[2021-08-28T12:42:50,702][ERROR][logstash.agent ] Failed to execute action {:id=>:main, :action_type=>LogStash::ConvergeResult::FailedAction, :message=>"Could not execute action: PipelineAction::Create<main>, action_result: false", :backtrace=>nil}
The error says parameters like "template"=>"/tmp/pingaccess_template", "ilm_pattern"=>"000001", "template_name"=>"pingaccess_template", "ilm_rollover_alias"=>"pingaccess", "ilm_policy"=>"testpolicy" are not valid, but they are mentioned in the link below:
https://www.elastic.co/guide/en/logstash/current/plugins-outputs-elasticsearch.html#plugins-outputs-elasticsearch-data-streams
The solution is to use Logstash without it being "aware" of the data_stream.
First of all (before running Logstash), create your ILM policy and index template, but add "index.lifecycle.name" to the settings. That way, you link the template and the ILM policy. Also, don't forget the data_stream section in the index template.
{
"index_templates" : [
{
"name" : "access_template",
"index_template" : {
"index_patterns" : [
"test-data-stream*"
],
"template" : {
"settings" : {
"index" : {
"number_of_shards" : "1",
"number_of_replicas" : "0",
"index.lifecycle.name": "testpolicy"
}
},
"mappings" : {
...
}
},
"composed_of" : [ ],
"priority" : 500,
"version" : 1,
"data_stream" : {
"hidden" : false
}
}
}
]
}
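For completeness, this is roughly what creating those two pieces up front could look like (a sketch of the REST calls, with the policy and template names taken from the question and the mappings omitted):

PUT _ilm/policy/testpolicy
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "900mb", "max_age": "2d" },
          "set_priority": { "priority": 100 }
        }
      },
      "delete": {
        "min_age": "2d",
        "actions": { "delete": { "delete_searchable_snapshot": true } }
      }
    }
  }
}

PUT _index_template/access_template
{
  "index_patterns": ["test-data-stream*"],
  "data_stream": {},
  "priority": 500,
  "template": {
    "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "index.lifecycle.name": "testpolicy"
    }
  }
}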
Keep the Logstash output as if data_stream didn't exist, but add action => "create". This is because you can't use the "index" API with data streams; you need the _create API call.
output {
elasticsearch {
hosts => ['http://10.10.10.76:9200']
index => "test-data-stream"
user => elastic
password => xxx
action => "create"
}
}
That way, Logstash will output to ES, and the index template will be applied automatically (because of the pattern match), and with it the ILM policy and the data_stream.
Important: to make it work, you need to start from scratch. If the index "test-data-stream" already exists in ES (as a traditional index), then the data stream will NOT be created. Test with another index name to make sure it works.
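Once Logstash is sending events, you can confirm that a data stream (and not a regular index) was created, for example:

GET _data_stream/test-data-stream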
The documentation is unclear, but the plugin does not support those options when datastream output is enabled. The plugin is logging the options returned by the invalid_data_stream_params function, which allows action, routing, data_stream, anything else that starts with data_stream_, the shared options defined by the mixin, and the common options defined by the output plugin base.

Partial update overwriting whole structure

I'm indexing a new document with the following content
{
"lastUpdate" : "20180114144020452",
"name" : "My Process",
"startDate" : "20180114162356585",
"endData" : "",
"tasks" : [
{
"1" : {
"lastUpdate" : "20180114144020452",
"taskId" : "123",
"subject" : "Terceira Atividade",
"status" : "Active",
"type" : "userTask",
"assign" : [
{
"date" : "20180114144020452",
"type" : "role",
"name" : "Time 3",
"id" : "Team3_345"
}
],
"receivedDate" : "",
"readDate" : "",
"finishDate" : ""
}
}
]
}
And then I'm trying to change the tasks.1.status value with the following update content:
{
"doc" : {
"tasks" : [
{
"1" : {
"status" : "Closed"
}
}
]
}
}
But it's overwriting the whole tasks.1 structure, deleting the other values and leaving only status set to Closed, instead of keeping the other values and changing only the status value.
How can I solve this? Thanks
You need to do it via a scripted partial update, like this:
POST updates/update/1/_update
{
"script": {
"source": "ctx._source.tasks[0].1.status = 'Closed'"
}
}
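A slightly more reusable variant passes the new status as a script parameter instead of hardcoding it (same index, type and id as in the example above):

POST updates/update/1/_update
{
  "script": {
    "source": "ctx._source.tasks[0]['1'].status = params.newStatus",
    "params": {
      "newStatus": "Closed"
    }
  }
}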

Update multi level nested document in elasticsearch

Using Elasticsearch 1.7.1, I have the following document structure
"_source" : {
"questions" : {
"defaultQuestion" : {
"tag" : 0,
"gid" : 0,
"rid" : 0,
"caption" : "SRID",
},
"tableQuestion" : {
"rows" : [{
"ids" : {
"answerList" : ["3547", "3548"],
"tag" : "0",
"caption" : "Accounts",
},
"name" : {
"answerList" : ["Some Name"],
"tag" : "0",
"caption" : "Name",
}
}
],
"caption" : "BPI 1500541753537",
"id" : 644251570,
"tag" : ""
}
},
"id" : "447722821"
}
I want to add a new object to questions.tableQuestion.rows. My current script is replacing the existing object with the new one. Kindly suggest how to append it instead. The following is my update script.
{ "update": {"_id": "935663867", "_retry_on_conflict" : 3} }
{ "script" : "ctx._source.questions += param1", "params" : {"param1" : {"tableQuestion": {"rows" : [ NEWROWOBJECT ]} } }}
You can build the path through the nested fields, right down to the rows property, and then use the += operator. It's also good to check whether the rows array is null and initialize it in that case.
Checked with ES 2.4, but should be similar for earlier versions:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) ctx._source.questions.tableQuestion.rows = new ArrayList(); ctx._source.questions.tableQuestion.rows += param1;",
"params" : {
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
For ES 5.x and Painless language the script is a bit different:
POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
"script": {
"inline": "if (ctx._source.questions.tableQuestion.rows == null) { ctx._source.questions.tableQuestion.rows = new ArrayList();} ctx._source.questions.tableQuestion.rows.add(params.param1);",
"params" : {
"param1" : {
...
}
}
}
}
Update to the additional comment
If some part of the path is dynamic, you can also use parameters to build the path, with the get(param_name) method. Try this syntax (I removed the null check for simplicity):
{
"script": {
"inline": "ctx._source.questions.get(param2).rows += param1;",
"params" : {
"param2" : "6105243",
"param1" : {
"ids": {
"answerList": [
"478",
"255"
],
"tag": "2",
"caption": "My Test"
},
"name": {
"answerList": [
"My Name"
],
"tag": "1",
"caption": "My Demo"
}
}
}
}
}
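In Painless (as in the ES 5.x example earlier), the same dynamic segment would typically be addressed with map-style subscript access rather than get(); a sketch along those lines:

POST http://127.0.0.1:9200/sample/demo/{document_id}/_update
{
  "script": {
    "inline": "if (ctx._source.questions[params.param2].rows == null) { ctx._source.questions[params.param2].rows = new ArrayList(); } ctx._source.questions[params.param2].rows.add(params.param1);",
    "params" : {
      "param2" : "6105243",
      "param1" : {
        ...
      }
    }
  }
}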

Count Documents Matching Multiple Array Criteria

Schema is:
{
"_id" : ObjectId("594b7e86f59ccd05bb8a90b5"),
"_class" : "com.notification.model.entity.Notification",
"notificationReferenceId" : "7917a5365ba246d1bb3664092c59032a",
"notificationReceivedAt" : ISODate("2017-06-22T08:23:34.382+0000"),
"sendTo" : [
{
"userReferenceId" : "check",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "UNREAD"
}
]
}
]
}
{
"_id" : ObjectId("594b8045f59ccd076dd86063"),
"_class" : "com.notification.model.entity.Notification",
"notificationReferenceId" : "6990329330294cbc950ef2b38f6d1a4f",
"notificationReceivedAt" : ISODate("2017-06-22T08:31:01.299+0000"),
"sendTo" : [
{
"userReferenceId" : "check",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "UNREAD"
}
]
}
]
}
{
"_id" : ObjectId("594b813ef59ccd076dd86064"),
"_class" : "com.notification.model.entity.Notification",
"notificationReferenceId" : "3c910cf5fcec42d6bfb78a9baa393efa",
"notificationReceivedAt" : ISODate("2017-06-22T08:35:10.474+0000"),
"sendTo" : [
{
"userReferenceId" : "check",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "UNREAD"
}
]
},
{
"userReferenceId" : "hello",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "READ"
}
]
}
]
}
I want to count a user's notifications based on statusList, which is a List. I used mongoOperations to build a query:
Query query = new Query();
query.addCriteria(Criteria.where("sendTo.userReferenceId").is(userReferenceId)
.andOperator(Criteria.where("sendTo.mediumAndDestination.status").in(statusList)));
long count = mongoOperations.count(query, Notification.class);
I realise I'm doing it wrong because I get a count of 1 when I query for the user with reference ID hello and a statusList whose single element is UNREAD.
How do I perform an aggregated query on the array elements?
The query needs $elemMatch in order to actually match "within" the array element that matches both criteria:
Query query = new Query(Criteria.where("sendTo")
.elemMatch(
Criteria.where("userReferenceId").is("hello")
.and("mediumAndDestination.status").is("UNREAD")
));
Which essentially serializes to:
{
"sendTo": {
"$elemMatch": {
"userReferenceId": "hello",
"mediumAndDestination.status": "UNREAD"
}
}
}
Note that in your question there is no such document; the only entry with "hello" actually has a "status" of "READ". If I supply those criteria instead:
{
"sendTo": {
"$elemMatch": {
"userReferenceId": "hello",
"mediumAndDestination.status": "READ"
}
}
}
Then I get the last document:
{
"_id" : ObjectId("594b813ef59ccd076dd86064"),
"_class" : "com.notification.model.entity.Notification",
"notificationReferenceId" : "3c910cf5fcec42d6bfb78a9baa393efa",
"notificationReceivedAt" : ISODate("2017-06-22T08:35:10.474Z"),
"sendTo" : [
{
"userReferenceId" : "check",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "UNREAD"
}
]
},
{
"userReferenceId" : "hello",
"mediumAndDestination" : [
{
"medium" : "API",
"status" : "READ"
}
]
}
]
}
But with "UNREAD" the count is actually 0 for this sample.
