How to enrich data in Elasticsearch when the data to be enriched is inside an array

I have used information from the link below to create a pipeline for enriching the data using a lookup from other indices:
Enriching the Data in Elastic Search
The problem that I am facing is that my payload has this structure:
{
  field1: value1,
  field2: value2,
  field3: [
    {
      field3.1.1: value1,
      field3.1.2: value2
    },
    {
      field3.2.1: value1,
      field3.2.2: value2
    }
  ]
}
I created an ingest pipeline for this, and I am able to enrich the data correctly at the parent level, i.e. field1 and field2.
However, since field3 is an array, enrichment doesn't work straight away, so I applied a foreach processor in the pipeline whose inner processor is an enrich processor.
PUT _ingest/pipeline/test-data-lookup
{
  "processors": [
    {
      "foreach": {
        "field": "field3",
        "processor": {
          "enrich": {
            "policy_name": "field3-policy",
            "field": "_ingest._value.field3.1.1",
            "target_field": "{{{_ingest._value.field3.1.1}}}"
          }
        }
      }
    }
  ]
}
The target field is generated correctly; however, to set field3.1.1 to the lookup value defined in target_field, I have to use a set processor like this:
{
  "set": {
    "if": "_ingest._value.field3.1.1 != null",
    "field": "field1",
    "value": "{{_ingest._value.field3.1.1.codevalue}}"
  }
}
The problem is that the if condition here doesn't accept _ingest._value and gives a compilation error. Because of this I am not able to compare the value of the target field with the incoming value, so all the elements of the array end up with the same codevalue.
I am new to Elasticsearch and have read almost all the documentation that I can understand right now. Is what I am trying to do even possible?

Related

Elasticsearch field with different types in a single index

We have a scenario in a service that accepts multiple types of data, and we want to store it in Elasticsearch so we can benefit from its search capabilities.
The data could be a string, a number, an object, or an array of objects, as in the following:
POST my-index/_doc/1
{
  "additionalData": [
    {
      "values": {
        "some-field": "some-value",
        "some-other-field": "some-value"
      }
    }
  ]
}
POST my-index/_doc/1
{
  "additionalData": [
    {
      "values": [12345, 9875]
    }
  ]
}
POST my-index/_doc/1
{
  "additionalData": [
    {
      "values": "Some text"
    }
  ]
}
Is there a way to store that in Elasticsearch? Or is it better to store it in another NoSQL database like MongoDB?
PS: we are using ES 7.x and would like to keep using ES.
If you don't need to search on those values, it's possible with a disabled field (i.e. not indexed, not stored).
However, if you want to search on those values, it's not possible. Each field must have a specific type (object, numeric, text, etc.), and then you can only store values of that type in the field.
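For reference, a minimal sketch of such a mapping (index and field names borrowed from the example above): with "enabled": false, the contents of additionalData are kept in _source and returned as-is, but never parsed, indexed, or searchable.
PUT my-index
{
  "mappings": {
    "properties": {
      "additionalData": {
        "type": "object",
        "enabled": false
      }
    }
  }
}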

Use an ingest pipeline to split between two indexes

I have documents containing the field "status", which can have three values: "Draft", "In Progress", or "Approved". I am trying to pass these documents through an ingest pipeline, and if the status is equal to "Approved" then the document should be added to index B, whereas by default it should be indexed into index A irrespective of the status value.
For example:
1.
{
  "id": "123",
  "status": "Draft"
}
2.
{
  "id": "1234",
  "status": "InProgress"
}
3.
{
  "id": "12345",
  "status": "Approved"
}
Documents 1, 2, and 3 should go to index A, and only document 3 should go to index B.
Is it possible to do this via an ingest pipeline?
In your ingest pipeline, you can change the _index field very easily like this:
{
  "set": {
    "if": "ctx.status == 'Approved'",
    "field": "_index",
    "value": "index-b"
  }
},
{
  "set": {
    "if": "ctx.status != 'Approved'",
    "field": "_index",
    "value": "index-a"
  }
}
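For reference, a minimal sketch of these two processors wrapped in a complete pipeline (the pipeline name route-by-status is an assumed example); you would then reference it when indexing, e.g. with ?pipeline=route-by-status or via the index's default_pipeline setting:
PUT _ingest/pipeline/route-by-status
{
  "processors": [
    {
      "set": {
        "if": "ctx.status == 'Approved'",
        "field": "_index",
        "value": "index-b"
      }
    },
    {
      "set": {
        "if": "ctx.status != 'Approved'",
        "field": "_index",
        "value": "index-a"
      }
    }
  ]
}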
It is worth noting, though, that you cannot send a document to two different indexes within the same pipeline: it's either index-a or index-b, but not both.
However, this can easily be solved by querying both indexes through an alias that spans both index-a and index-b.
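For example, a minimal sketch of such an alias (the alias name index-all is an assumed example):
POST _aliases
{
  "actions": [
    { "add": { "index": "index-a", "alias": "index-all" } },
    { "add": { "index": "index-b", "alias": "index-all" } }
  ]
}
Searching against index-all then returns documents from both indexes.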

Elasticsearch query returns wrong results

I'm relatively new to Elasticsearch and encountered this issue which I can't figure out.
For this particular field, Elasticsearch seems to be treating all the values as zero, even though the individual records have non-zero values. This only seems to happen to this number field and not to other similar fields (such as cpu pct, mem pct, etc.).
The records only show when I query for records that have 'system.filesystem.used.pct == 0', whereas none of them show when I query for something like 'system.filesystem.used.pct > 0'.
I also ran the query in the Dev Tools in Kibana like so, yet I don't get any results:
GET metricbeat-*/_search
{
  "query": {
    "range": {
      "system.filesystem.used.pct": {
        "gt": 0
      }
    }
  }
}
However, if I do this, I get all the non-zero records, just like in Discover:
GET metricbeat-*/_search
{
  "query": {
    "term": {
      "system.filesytem.used.pct": 0
    }
  }
}
As pointed out by @Ron Serruya, there is a mapping issue. The mapping for system.filesytem.used.pct was detected to be of an integer type. Since you are getting the expected search results for the cpu.pct field, the mapping of cpu.pct must have been of float type.
CASE 1:
If you index the two sample documents in this order:
{
  "count": 0.45
}
{
  "count": 0
}
then the float data type is detected by Elasticsearch (if you are using dynamic mapping). This is because the detection of the field type depends on the first value that you insert into the field.
CASE 2:
Now, if you index the data in this order:
{
  "count": 0
}
{
  "count": 0.45
}
then Elasticsearch will detect count to be of the long data type.
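You can verify which type was actually detected by inspecting the field mapping, for example (index pattern and field name taken from the question):
GET metricbeat-*/_mapping/field/system.filesytem.used.pct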
You need to recreate the index with the new index mapping, reindex the data, and then run the search query on system.filesytem.used.pct.
The modified index mapping will be:
{
  "mappings": {
    "properties": {
      "system": {
        "properties": {
          "filesytem": {
            "properties": {
              "used": {
                "properties": {
                  "pct": {
                    "type": "float"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
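After creating a new index with this corrected mapping, the existing data can be copied over with the reindex API; a minimal sketch (both index names are assumed examples):
POST _reindex
{
  "source": { "index": "metricbeat-old" },
  "dest": { "index": "metricbeat-new" }
}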

Elasticsearch: apply boost based on nested field value

Below is my indexed document
{
  "defaultBoostValue": 1.01,
  "boostDetails": [
    {
      "Type": "Type1",
      "value": 1.0001
    },
    {
      "Type": "Type2",
      "value": 1.002
    },
    {
      "Type": "Type3",
      "value": 1.0005
    }
  ]
}
I want to apply a boost based on the value passed, so suppose I pass Type1, then the boost applied will be 1.0001, and if Type1 does not exist then it will use defaultBoostValue.
Below is my query, which works but is quite slow; is there any way to optimize it further?
Original question
The query below works but is slow because we are using _source:
{
  "query": {
    "function_score": {
      "boost_mode": "multiply",
      "functions": [
        {
          "script_score": {
            "script": {
              "source": """
                double findBoost(Map params_copy) {
                  for (def group : params_copy._source.boostDetails) {
                    if (group['Type'] == params_copy.preferredBoostType) {
                      return group['value'];
                    }
                  }
                  return params_copy._source['defaultBoostValue'];
                }
                return findBoost(params);
              """,
              "params": {
                "preferredBoostType": "Type1"
              }
            }
          }
        }
      ]
    }
  }
}
I have removed the condition of not having dynamic mapping. If changing the structure of the boostDetails mapping can help, then I am OK with that, but please explain how it helps and why it is faster to query; also, please give the mapping types and the modified structure if the answer involves modifying the mapping.
Using dynamic mappings (lots of fields)
It looks like you adjusted the doc structure compared to your original question.
The query above was designed for nested fields, which cannot be easily iterated in a script for performance reasons. Having said that, it is an even slower workaround because it accesses the docs' _source and iterates over its contents. Keep in mind that accessing the _source in scripts is not recommended!
If your docs aren't nested anymore, you can access the so-called doc values, which are much better optimized for query-time access:
{
  "query": {
    "function_score": {
      ...
      "functions": [
        {
          ...
          "script_score": {
            "script": {
              "lang": "painless",
              "source": """
                try {
                  if (doc['boost.boostType.keyword'].value == params.preferredBoostType) {
                    return doc['boost.boostFactor'].value;
                  } else {
                    throw new Exception();
                  }
                } catch(Exception e) {
                  return doc['fallbackBoostFactor'].value;
                }
              """,
              "params": {
                "preferredBoostType": "Type1"
              }
            }
          }
        }
      ]
    }
  }
}
thus speeding up your function score query.
Alternative using an ordered list of values
Since the nested iteration is slow and dynamic mappings are blowing up your index, you could store your boosts in a standardized ordered list in each document:
"boostValues": [1.0001, 1.002, 1.0005, ..., 1.1]
and keep track of the corresponding boost types' order in the backend where you construct the queries:
var boostTypes = ["Type1", "Type2", "Type3", ..., "TypeN"]
So something like n-hot vectors.
Then, as you construct the Elasticsearch query, you'd look up the array index of the boostValues entry based on the boostType and pass this array index to the script query from above, which would access the corresponding boostValues doc value.
This is guaranteed to be faster than _source access, but it requires that you always keep boostTypes and boostValues in sync, preferably append-only (as you add new boostTypes, the list grows in one dimension).

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster set up with an EFK stack for logging, as described here. I have never worked with any of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the number of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). And even if this regexp matched, I would only find the log entries that contain it, not the specific value itself. How do I proceed from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It would be possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345"
}
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
  "index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
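Note that to plot the location field on a Kibana map, it generally needs to be mapped as geo_point; a minimal sketch of such a mapping (the index name plz is taken from the example above, and the mapping must exist before the documents are indexed):
PUT plz
{
  "mappings": {
    "properties": {
      "location": {
        "type": "geo_point"
      }
    }
  }
}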
To sum it all up, it boils down to two things:
1. Create an ingest pipeline to parse documents as they get indexed.
2. Create an index template for all project* indexes whose settings include the pipeline created in step 1.
