How to split a field into words with an ingest pipeline in Kibana - elasticsearch

I have created an ingest pipeline as below to split a field into words:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "String cutting processing",
    "processors": [
      {
        "split": {
          "field": "foo",
          "separator": "|"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "foo": "apple|time"
      }
    }
  ]
}
but it splits the field into characters:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "foo" : [
            "a",
            "p",
            "p",
            "l",
            "e",
            "|",
            "t",
            "i",
            "m",
            "e"
          ]
        }
      }
    }
  ]
}
If I replace the separator with a comma, the same pipeline splits the field into words:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "String cutting processing",
    "processors": [
      {
        "split": {
          "field": "foo",
          "separator": ","
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "foo": "apple,time"
      }
    }
  ]
}
then the output would be:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "foo" : [
            "apple",
            "time"
          ]
        }
      }
    }
  ]
}
How can I split the field into words when the separator is "|"?
My next question is: how can I apply this ingest pipeline to an existing index?
I tried this solution, but it doesn't work for me.
Edit
Here is the whole pipeline with a test document; it should assign the two parts to two separate fields:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain "|" to separate two fields""",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.2}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}
That generates this response:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "dv_metric_prod" : "m",
          "dv_m_splited" : [
            "a",
            "m",
            "a",
            "z",
            "e",
            "_",
            "i",
            "n",
            "c",
            "|",
            "U",
            "n",
            "d",
            "e",
            "r",
            "s",
            "t",
            "a",
            "n",
            "d",
            "i",
            "n",
            "g"
          ],
          "dv_metric_section" : "a",
          "dv_m" : "amaze_inc|Understanding"
        },
        "_ingest" : {
          "timestamp" : "2021-08-02T08:33:58.2234143Z"
        }
      }
    }
  ]
}
If I set "separator": "\\|", then I will get this error:
{
  "docs" : [
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "general_script_exception",
            "reason" : "Error running com.github.mustachejava.codes.DefaultMustache#776f8239"
          }
        ],
        "type" : "general_script_exception",
        "reason" : "Error running com.github.mustachejava.codes.DefaultMustache#776f8239",
        "caused_by" : {
          "type" : "mustache_exception",
          "reason" : "Failed to get value for dv_m_splited.2 #[query-template:1]",
          "caused_by" : {
            "type" : "mustache_exception",
            "reason" : "2 #[query-template:1]",
            "caused_by" : {
              "type" : "index_out_of_bounds_exception",
              "reason" : "2"
            }
          }
        }
      }
    }
  ]
}

The solution is fairly simple: just escape your separator.
Since the separator field in the split processor is a regular expression, you need to escape special characters such as |.
Because the backslash itself must also be escaped inside a JSON string, the separator becomes "\\|".
So your code only lacks this double escaping:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "String cutting processing",
    "processors": [
      {
        "split": {
          "field": "foo",
          "separator": "\\|"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "foo": "apple|time"
      }
    }
  ]
}
UPDATE
You did not mention (or I missed) the part where you wanted to assign the values to two separate fields.
In this case, you should use dissect instead of split: it is shorter, simpler, and cleaner. See the documentation here.
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain "|" to separate two fields""",
    "processors": [
      {
        "dissect": {
          "field": "dv_m",
          "pattern": "%{dv_metric_prod}|%{dv_metric_section}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}
Result
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "dv_metric_prod" : "amaze_inc",
          "dv_metric_section" : "Understanding",
          "dv_m" : "amaze_inc|Understanding"
        },
        "_ingest" : {
          "timestamp" : "2021-08-18T07:39:12.84910326Z"
        }
      }
    }
  ]
}
ADDENDUM
If using split instead of dissect
You got your array indices wrong: there is no {{dv_m_splited.2}}, because array indices start from 0 and the split produces only two elements ({{dv_m_splited.0}} and {{dv_m_splited.1}}).
This is the correct pipeline when using the split processor:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": """combined fields are text that contain "|" to separate two fields""",
    "processors": [
      {
        "split": {
          "field": "dv_m",
          "separator": "\\|",
          "target_field": "dv_m_splited"
        }
      },
      {
        "set": {
          "field": "dv_metric_prod",
          "value": "{{dv_m_splited.0}}",
          "override": false
        }
      },
      {
        "set": {
          "field": "dv_metric_section",
          "value": "{{dv_m_splited.1}}",
          "override": false
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "dv_m": "amaze_inc|Understanding"
      }
    }
  ]
}
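As for your second question, applying this pipeline to an existing index: one approach (a sketch; my_split_pipeline and my_index are placeholder names, not from your post) is to store the pipeline and run it over the index with _update_by_query:
PUT _ingest/pipeline/my_split_pipeline
{
  "description": "split dv_m into two fields",
  "processors": [
    {
      "dissect": {
        "field": "dv_m",
        "pattern": "%{dv_metric_prod}|%{dv_metric_section}"
      }
    }
  ]
}

POST my_index/_update_by_query?pipeline=my_split_pipeline
Alternatively, _reindex into a fresh index with the same pipeline attached, if you prefer not to rewrite documents in place.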

Related

Is there a way to reference the field 'path.virtual' as part of this split processor?

The field I am interested in from my ES doc below is "virtual":
"path" : {
"root" : "cda42f809526c222ebb54e5887117139",
"virtual" : "/tests/3.pdf",
"real" : "/tmp/es/tests/3.pdf"
}
My simulated ingest pipeline:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "split words on line_number field",
    "processors": [
      {
        "split": {
          "field": "path.virtual",
          "separator": "/",
          "target_field": "temporary_field"
        }
      },
      {
        "set": {
          "field": "caseno",
          "value": "{{temporary_field.1}}"
        }
      },
      {
        "set": {
          "field": "file",
          "value": "{{temporary_field.2}}"
        }
      },
      {
        "remove": {
          "field": "temporary_field"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "path.virtual": "/test/3.pdf"
      }
    }
  ]
}
If I change the actual field 'path.virtual' to 'path' or 'virtual', I get the desired result, but if I use the actual field name I get the following error:
{
  "docs" : [
    {
      "error" : {
        "root_cause" : [
          {
            "type" : "illegal_argument_exception",
            "reason" : "field [[path] not present as part of path [[path.virtual]]"
          }
        ],
        "type" : "illegal_argument_exception",
        "reason" : "field [[path] not present as part of path [[path.virtual]]"
      }
    }
  ]
}
What can I do to avoid this?
Try this in simulate (the processor resolves path.virtual as a path, so it expects path to be an object containing a virtual sub-field, not a single field literally named "path.virtual"):
"docs": [
{
"_source": {
"path": {
"virtual": "/test/3.pdf"
}
}
}
]
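If your documents really do arrive with a literal dotted field name, another option (a sketch, not part of the original answer) is to expand the dotted name into an object first with the dot_expander processor, and then run the split processor as before:
{
  "dot_expander": {
    "field": "path.virtual"
  }
}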

How to convert a particular item in the filebeat message to lowercase using an Elasticsearch processor

I am simulating the code below in Elasticsearch. How can I convert event.action from "Query" to lowercase "query", as expected in the output?
The simulation below is run in the Elastic Dev Tools console:
POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "_description",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp}\t%{->} %{process.thread.id} %{event.action}\t%{message}"
        },
        "set": {
          "field": "event.category",
          "value": "database"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": "2020-10-22T20:28:26.267397Z\t 9 Query\tset session"
      }
    }
  ]
}
Expected output
{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_id" : "id",
        "_source" : {
          "process" : {
            "thread" : {
              "id" : "9"
            }
          },
          "@timestamp" : "2020-10-22T20:28:26.267397Z",
          "message" : "set session",
          "event" : {
            "category" : "database",
            "action" : "query"
          }
        },
        "_ingest" : {
          "timestamp" : "2022-08-17T09:27:34.587465824Z"
        }
      }
    }
  ]
}
You can use the lowercase processor in the same ingest pipeline, as shown below. Note that dissect, set, and lowercase must each be a separate object in the processors array:
{
  "pipeline": {
    "description": "_description",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp}\t%{->} %{process.thread.id} %{event.action}\t%{message}"
        }
      },
      {
        "set": {
          "field": "event.category",
          "value": "database"
        }
      },
      {
        "lowercase": {
          "field": "event.action"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": "2020-10-22T20:28:26.267397Z\t 9 Query\tset session"
      }
    }
  ]
}
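If some messages might not produce an event.action field (for example, when the dissect pattern fails to match), the lowercase processor also accepts an ignore_missing option so the pipeline does not fail on such documents; a minimal sketch:
{
  "lowercase": {
    "field": "event.action",
    "ignore_missing": true
  }
}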

Should and Filter combination in ElasticSearch

I have this query, which returns the correct result:
GET /person/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "fuzzy": {
            "nameDetails.name.nameValue.surname": {
              "value": "Pibba",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "fuzzy": {
            "nameDetails.nameValue.firstName": {
              "value": "Fawsu",
              "fuzziness": "AUTO"
            }
          }
        }
      ]
    }
  }
}
and the result is below:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 3.6012557,
    "hits" : [
      {
        "_index" : "person",
        "_type" : "_doc",
        "_id" : "70002",
        "_score" : 3.6012557,
        "_source" : {
          "gender" : "Male",
          "activeStatus" : "Inactive",
          "deceased" : "No",
          "nameDetails" : {
            "name" : [
              {
                "nameValue" : {
                  "firstName" : "Fawsu",
                  "middleName" : "L.",
                  "surname" : "Pibba"
                },
                "nameType" : "Primary Name"
              },
              {
                "nameValue" : {
                  "firstName" : "Fausu",
                  "middleName" : "L.",
                  "surname" : "Pibba"
                },
                "nameType" : "Spelling Variation"
              }
            ]
          }
        }
      }
    ]
  }
}
But when I add a filter for gender, it returns no results:
GET /person/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "fuzzy": {
            "nameDetails.name.nameValue.surname": {
              "value": "Pibba",
              "fuzziness": "AUTO"
            }
          }
        },
        {
          "fuzzy": {
            "nameDetails.nameValue.firstName": {
              "value": "Fawsu",
              "fuzziness": "AUTO"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "gender": "Male"
          }
        }
      ]
    }
  }
}
Even if I just use the filter, it returns no results:
GET /person/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "gender": "Male"
          }
        }
      ]
    }
  }
}
You are not getting any search results because you are using a term query in the filter clause. A term query returns a document only if it contains an exact match.
When no analyzer is specified, the standard analyzer is used, which tokenizes Male to male. So you can either search for male instead of Male, or use one of the solutions below.
If you have not defined any explicit index mapping, you need to add .keyword to the gender field. This targets the keyword sub-field that dynamic mapping creates, which stores the value verbatim instead of analyzing it (notice the ".keyword" after the gender field). Try the query below:
{
  "query": {
    "bool": {
      "filter": [
        {
          "term": {
            "gender.keyword": "Male"
          }
        }
      ]
    }
  }
}
Search Result:
"hits": [
{
"_index": "66879128",
"_type": "_doc",
"_id": "1",
"_score": 0.0,
"_source": {
"gender": "Male",
"activeStatus": "Inactive",
"deceased": "No",
"nameDetails": {
"name": [
{
"nameValue": {
"firstName": "Fawsu",
"middleName": "L.",
"surname": "Pibba"
},
"nameType": "Primary Name"
},
{
"nameValue": {
"firstName": "Fausu",
"middleName": "L.",
"surname": "Pibba"
},
"nameType": "Spelling Variation"
}
]
}
}
}
]
If you have defined an explicit index mapping, then define the gender field as keyword, as shown below:
{
  "mappings": {
    "properties": {
      "gender": {
        "type": "keyword"
      }
    }
  }
}
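You can verify the tokenization described above with the _analyze API:
GET /_analyze
{
  "analyzer": "standard",
  "text": "Male"
}
The response contains the single token male, which is why a term query for Male finds nothing in the analyzed text field.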

How to view trace logs from OpenTelemetry in Elastic APM

I receive logs from the opentelemetry-collector in Elastic APM.
The log structure is:
"{Timestamp:HH:mm:ss} {Level:u3} trace.id={TraceId} transaction.id={SpanId}{NewLine}{Message:lj}{NewLine}{Exception}"
Example:
08:27:47 INF trace.id=898a7716358b25408d4f193f1cd17831 transaction.id=4f7590e4ba80b64b SOME MSG
I tried this pipeline:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{TIMESTAMP_ISO8601:logtime} %{LOGLEVEL:loglevel} \\[trace.id=%{TRACE_ID:trace.id}(?: transaction.id=%{SPAN_ID:transaction.id})?\\] %{GREEDYDATA:message}"],
          "pattern_definitions": {
            "TRACE_ID": "[0-9A-Fa-f]{32}",
            "SPAN_ID": "[0-9A-Fa-f]{16}"
          }
        },
        "date": {
          "field": "logtime",
          "target_field": "@timestamp",
          "formats": ["HH:mm:ss"]
        }
      }
    ]
  }
}
My goal is to see the logs in Elastic APM like this:
{
  "@timestamp": "2021-01-05T10:10:10",
  "message": "Protocol Port MIs-Match",
  "trace": {
    "traceId": "898a7716358b25408d4f193f1cd17831",
    "spanId": "4f7590e4ba80b64b"
  }
}
Good job so far. Your pipeline is almost correct; however, the grok pattern needs some fixing and you have some orphan curly braces. Here is a working example:
POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            """%{TIME:logtime} %{WORD:loglevel} trace.id=%{TRACE_ID:trace.id}(?: transaction.id=%{SPAN_ID:transaction.id})? %{GREEDYDATA:message}"""
          ],
          "pattern_definitions": {
            "TRACE_ID": "[0-9A-Fa-f]{32}",
            "SPAN_ID": "[0-9A-Fa-f]{16}"
          }
        }
      },
      {
        "date": {
          "field": "logtime",
          "target_field": "@timestamp",
          "formats": [
            "HH:mm:ss"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "08:27:47 INF trace.id=898a7716358b25408d4f193f1cd17831 transaction.id=4f7590e4ba80b64b SOME MSG"
      }
    }
  ]
}
Response:
{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "trace" : {
            "id" : "898a7716358b25408d4f193f1cd17831"
          },
          "@timestamp" : "2021-01-01T08:27:47.000Z",
          "loglevel" : "INF",
          "message" : "SOME MSG",
          "logtime" : "08:27:47",
          "transaction" : {
            "id" : "4f7590e4ba80b64b"
          }
        },
        "_ingest" : {
          "timestamp" : "2021-03-30T11:07:52.067275598Z"
        }
      }
    }
  ]
}
Just note that the exact date is missing from the log line, so the @timestamp field resolves to January 1st of the current year.
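If you can assume the log is ingested on the same day it was produced (an assumption, not part of the original answer), one workaround is to take the ingest timestamp instead, using a set processor:
{
  "set": {
    "field": "@timestamp",
    "value": "{{_ingest.timestamp}}"
  }
}
Note this records the processing time rather than the time parsed from the log line, so it is only a rough substitute.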

Elasticsearch nested sort based on minimum values of child of child arrays

I have two orders; each order has multiple shipments, and each shipment contains multiple products.
How can I sort the orders based on the minimum product.quantity in a shipment?
For example, when sorting in ascending order, orderNo = 2 should be listed first because it has a shipment containing a product with quantity = 1, which is the minimum among all product.quantity values (productName doesn't matter).
{
  "orders": [
    {
      "orderNo": "1",
      "shipments": [
        {
          "products": [
            {
              "productName": "AAA",
              "quantity": "2"
            },
            {
              "productName": "AAA",
              "quantity": "2"
            }
          ]
        },
        {
          "products": [
            {
              "productName": "AAA",
              "quantity": "3"
            },
            {
              "productName": "AAA",
              "quantity": "6"
            }
          ]
        }
      ]
    },
    {
      "orderNo": "2",
      "shipments": [
        {
          "products": [
            {
              "productName": "AAA",
              "quantity": "1"
            },
            {
              "productName": "AAA",
              "quantity": "6"
            }
          ]
        },
        {
          "products": [
            {
              "productName": "AAA",
              "quantity": "4"
            },
            {
              "productName": "AAA",
              "quantity": "5"
            }
          ]
        }
      ]
    }
  ]
}
Assuming that each order is a separate document, you could create an order-focused index where both shipments and products are nested fields to prevent array flattening.
The minimal index mapping could then look like:
PUT orders
{
  "mappings": {
    "properties": {
      "shipments": {
        "type": "nested",
        "properties": {
          "products": {
            "type": "nested"
          }
        }
      }
    }
  }
}
The next step is to ensure the quantity is always numeric, not a string. One way to enforce this (a sketch extending the mapping above; the integer type is an assumption about your data) is to declare it explicitly:
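PUT orders
{
  "mappings": {
    "properties": {
      "shipments": {
        "type": "nested",
        "properties": {
          "products": {
            "type": "nested",
            "properties": {
              "quantity": { "type": "integer" }
            }
          }
        }
      }
    }
  }
}
When that's done, insert said docs: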
POST orders/_doc
{"orderNo":"1","shipments":[{"products":[{"productName":"AAA","quantity":2},{"productName":"AAA","quantity":2}]},{"products":[{"productName":"AAA","quantity":3},{"productName":"AAA","quantity":6}]}]}
POST orders/_doc
{"orderNo":"2","shipments":[{"products":[{"productName":"AAA","quantity":1},{"productName":"AAA","quantity":6}]},{"products":[{"productName":"AAA","quantity":4},{"productName":"AAA","quantity":5}]}]}
Finally, you can use nested sorting:
POST orders/_search
{
  "sort": [
    {
      "shipments.products.quantity": {
        "nested": {
          "path": "shipments.products"
        },
        "order": "asc"
      }
    }
  ]
}
Tip: To make the query even more useful, you could introduce sorted inner_hits to sort not only the top-level orders but also the individual products within each order. These inner hits need a nested query, so you could simply add a non-negative condition on shipments.products.quantity.
When you combine this query with the sort above and restrict the response to only the relevant attributes with filter_path:
POST orders/_search?filter_path=hits.hits._id,hits.hits._source.orderNo,hits.hits.inner_hits.*.hits.hits._source
{
  "_source": ["orderNo", "non_negative_quantities"],
  "query": {
    "nested": {
      "path": "shipments.products",
      "inner_hits": {
        "name": "non_negative_quantities",
        "sort": {
          "shipments.products.quantity": "asc"
        }
      },
      "query": {
        "range": {
          "shipments.products.quantity": {
            "gte": 0
          }
        }
      }
    }
  },
  "sort": [
    {
      "shipments.products.quantity": {
        "nested": {
          "path": "shipments.products"
        },
        "order": "asc"
      }
    }
  ]
}
you'll end up with both sorted orders AND sorted products:
{
  "hits" : {
    "hits" : [
      {
        "_id" : "gVc0BHgBly0XYOUcZ4vd",
        "_source" : {
          "orderNo" : "2"    <---
        },
        "inner_hits" : {
          "non_negative_quantities" : {
            "hits" : {
              "hits" : [
                {
                  "_source" : {
                    "quantity" : 1,    <---
                    "productName" : "AAA"
                  }
                },
                {
                  "_source" : {
                    "quantity" : 4,    <---
                    "productName" : "AAA"
                  }
                },
                {
                  "_source" : {
                    "quantity" : 5,    <---
                    "productName" : "AAA"
                  }
                }
              ]
            }
          }
        }
      },
      {
        "_id" : "gFc0BHgBly0XYOUcYosz",
        "_source" : {
          "orderNo" : "1"
        },
        "inner_hits" : {
          "non_negative_quantities" : {
            "hits" : {
              "hits" : [
                {
                  "_source" : {
                    "quantity" : 2,
                    "productName" : "AAA"
                  }
                },
                {
                  "_source" : {
                    "quantity" : 2,
                    "productName" : "AAA"
                  }
                },
                {
                  "_source" : {
                    "quantity" : 3,
                    "productName" : "AAA"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}
