Field-specific versioning in Elasticsearch

There is a good deal of documentation about how Elasticsearch supports document-level external versioning. However, if one wants to do a partial update (say, to a specific field), it'd be useful to have this type of version checking at the field level.
For instance, say I have an object field "name" with primitive sub-fields "value" and "timestamp". I only want a partial update to succeed if the given timestamp is greater than the timestamp currently stored in Elasticsearch.
Is there an easy way to do this? Can it be done with a script? Or is there a more standard way of doing it?

Yes, it's straightforward with a script. See https://www.elastic.co/guide/en/elasticsearch/reference/2.0/docs-update.html.
The example below updates the "value" field only if the timestamp passed in the update_time parameter is greater than the document's current "timestamp" field; if the stored timestamp is already greater, the update is skipped.
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : {
    "inline": "if (ctx._source.name.timestamp > update_time) { ctx.op = \"none\" } else { ctx._source.name.value = value; ctx._source.name.timestamp = update_time }",
    "params" : {
      "update_time" : 432422,
      "value": "My new value"
    }
  }
}'
You can also get the current time inside the script, rather than passing it as a parameter, e.g.:
update_time = DateTime.now().getMillis()
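For what it's worth, on recent Elasticsearch versions Groovy has been removed and the same guard would be written in Painless, with parameters accessed through params. A rough sketch (assuming a 7.x-style typeless _update endpoint; the index and field names follow the example above):
POST test/_update/1
{
  "script": {
    "lang": "painless",
    "source": "if (ctx._source.name.timestamp > params.update_time) { ctx.op = 'noop' } else { ctx._source.name.value = params.value; ctx._source.name.timestamp = params.update_time }",
    "params": {
      "update_time": 432422,
      "value": "My new value"
    }
  }
}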

Related

Get date value in update query elasticsearch painless

I'm trying to get the millis value of two dates and subtract one from the other.
When I use ctx._source.begin_time.toInstant().toEpochMilli() (the same way I would use doc['begin_time'].value.toInstant().toEpochMilli()), it gives me a runtime error.
And ctx._source.begin_time.date.getYear() (as in Update all documents of Elastic Search using existing column value) gives me a runtime error with the message
"ctx._source.work_time = ctx._source.begin_time.date.getYear()",
" ^---- HERE"
What type do I get from ctx._source, given that doc['begin_time'].value.toInstant().toEpochMilli() works correctly?
I can't find in the Painless documentation how to read the values correctly. begin_time is definitely a date field.
So, how can I write a script that gets the difference between two dates and writes it to another integer field?
If you look closely, the script in the linked question is written in Groovy, which is not supported anymore. What we use nowadays (2021) is called Painless.
The main point here is that the ctx._source attributes are the original JSON -- meaning the dates will be strings or integers (depending on the format) and not java.util.Date or any other data type that you could call .getDate() on. This means we'll have to parse the value first.
So, assuming your begin_time is of the format yyyy/MM/dd, you can do the following:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      ctx._source.work_time = begin_date.getYear()
    """
  }
}
BTW the _update_by_query script context (what's accessible and what's not) and working with datetimes in Painless are both nicely covered in the official docs.
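And since the original goal was the difference between two dates rather than just the year, here is a sketch of the same approach extended to two fields. It assumes a second source field named end_time in the same yyyy/MM/dd format (both the field name and the format are assumptions; adjust to your mapping):
POST myindex/_update_by_query
{
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      LocalDate end_date = LocalDate.parse(ctx._source.end_time, dtf);
      // store the difference in milliseconds in the work_time field
      ctx._source.work_time = ChronoUnit.MILLIS.between(begin_date.atStartOfDay(), end_date.atStartOfDay());
    """
  }
}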

Kibana scripted field which loops through an array

I am trying to use the metricbeat http module to monitor F5 pools.
I make a request to the f5 api and bring back json, which is saved to kibana. But the json contains an array of pool members and I want to count the number which are up.
The advice seems to be that this can be done with a scripted field. However, I can't get the script to retrieve the array, e.g.
doc['http.f5pools.items.monitor'].value.length()
returns the following in the preview results (with the same 'Additional Field' added for comparison):
[
  {
    "_id": "rT7wdGsBXQSGm_pQoH6Y",
    "http": {
      "f5pools": {
        "items": [
          {
            "monitor": "default"
          },
          {
            "monitor": "default"
          }
        ]
      }
    },
    "pool.MemberCount": [
      7
    ]
  },
If I try
doc['http.f5pools.items']
or similar, I just get an error:
"reason": "No field found for [http.f5pools.items] in mapping with types []"
Googling suggests that the doc construct does not contain arrays?
Is it possible to make a scripted field which can access the set of values? I.e. is my code, or the way I'm indexing the data, wrong?
If not, is there an alternative approach within Metricbeat? I don't want to have to build a whole new API just to do the calculation and add a separate field.
-- update.
Weirdly, it seems that the number values in the array do return the expected results, i.e.
doc['http.f5pools.items.ratio']
returns
{
  "_id": "BT6WdWsBXQSGm_pQBbCa",
  "pool.MemberCount": [
    1,
    1
  ]
},
-- update 2
OK, so if the strings in the field have different values then you get all the values. If they are the same you just get one. Wtf?
I'm adding another answer instead of deleting my previous one, which doesn't address the actual question but may still be helpful for someone else in the future.
I found a hint in the same documentation:
Doc values are a columnar field value store
Upon googling this further I found this Doc Values intro, which says that doc values are essentially an "uninverted index" useful for operations like sorting; my hypothesis is that while sorting you essentially don't want the same values repeated, hence the data structure removes those duplicates. That still did not answer why it works differently for strings than for numbers: numbers are preserved but strings are filtered down to unique values.
This “uninverted” structure is often called a “column-store” in other
systems. Essentially, it stores all the values for a single field
together in a single column of data, which makes it very efficient for
operations like sorting.
In Elasticsearch, this column-store is known as doc values, and is
enabled by default. Doc values are created at index-time: when a field
is indexed, Elasticsearch adds the tokens to the inverted index for
search. But it also extracts the terms and adds them to the columnar
doc values.
Some more deep-diving into doc values revealed it is a compression technique which de-duplicates the values for efficient and memory-friendly operations.
Here's a NOTE given on the link above which answers the question:
You may be thinking "Well that’s great for numbers, but what about
strings?" Strings are encoded similarly, with the help of an ordinal
table. The strings are de-duplicated and sorted into a table, assigned
an ID, and then those ID’s are used as numeric doc values. Which means
strings enjoy many of the same compression benefits that numerics do.
The ordinal table itself has some compression tricks, such as using
fixed, variable or prefix-encoded strings.
Also, if you don't want this behavior then you can disable doc-values.
OK, solved it.
https://discuss.elastic.co/t/problem-looping-through-array-in-each-doc-with-painless/90648
So, as I discovered, arrays are pre-filtered to only return distinct values (except in the case of ints, apparently?).
The solution is to use params._source instead of doc[].
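For reference, this is roughly what the scripted field ends up looking like with params._source. A sketch only, assuming the http.f5pools.items structure from the question and counting members whose monitor string equals 'up' (swap in whatever value actually marks a member as up):
int up = 0;
if (params._source.http != null && params._source.http.f5pools != null && params._source.http.f5pools.items != null) {
  // params._source is a map-of-maps, so the array comes back as a real list of objects
  for (def item : params._source.http.f5pools.items) {
    if (item.monitor == 'up') {
      up++;
    }
  }
}
return up;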
The answer for why doc doesn't work
Quoting below:
Doc values are a columnar field value store, enabled by default on all
fields except for analyzed text fields.
Doc-values can only return "simple" field values like numbers, dates,
geo- points, terms, etc, or arrays of these values if the field is
multi-valued. It cannot return JSON objects
Also, it's important to add a null check, as mentioned below:
Missing fields
The doc['field'] will throw an error if field is
missing from the mappings. In painless, a check can first be done with
doc.containsKey('field') to guard accessing the doc map.
Unfortunately, there is no way to check for the existence of the field
in mappings in an expression script.
Also, here is why _source works
Quoting below:
The document _source, which is really just a special stored field, can
be accessed using the _source.field_name syntax. The _source is loaded
as a map-of-maps, so properties within object fields can be accessed
as, for example, _source.name.first.
Responding to your comment with an example:
The keyword here is: It cannot return JSON objects. The field doc['http.f5pools.items'] is a JSON object.
Try running below and see the mapping it creates:
PUT t5/doc/2
{
  "items": [
    {
      "monitor": "default"
    },
    {
      "monitor": "default"
    }
  ]
}
GET t5/_mapping
{
  "t5" : {
    "mappings" : {
      "doc" : {
        "properties" : {
          "items" : {
            "properties" : {
              "monitor" : {          <-- monitor is a property of the items property (an object)
                "type" : "text",
                "fields" : {
                  "keyword" : {
                    "type" : "keyword",
                    "ignore_above" : 256
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

ElasticSearch painless script to remove all the keys except for a list of keys

I want to execute an atomic update operation on an Elasticsearch (6.1) document, removing everything in the document except for some keys (at the top level, not nested).
I know that to remove a specific key from a document (something in the example below) I can do the following:
curl -XPOST 'localhost:9200/index/type/id/_update' -H 'Content-Type: application/json' -d '{
  "script" : {
    "source": "ctx._source.remove(params.field)",
    "params": {
      "field": "something"
    }
  }
}'
But what if I want to remove every field except for a field called a and a field called b?
I found a way to make it work. I'm posting it here since it might be useful for someone else:
POST /index/type/id/_update
{
  "script" : {
    "source" : "Object var0 = ctx._source.get(\"a\"); Object var1 = ctx._source.get(\"b\"); ctx._source = params.value; if(var0 != null) ctx._source.put(\"a\", var0); if(var1 != null) ctx._source.put(\"b\", var1);",
    "params": {
      "value": {
        "newKey" : "newValue"
      }
    }
  }
}
This script updates the document with the content of params.value while keeping the keys a and b from the previous version of the document. This approach is simpler for my use case since the list of keys to keep is small compared to the number of keys present in the existing document.
If you would like only to keep the keys a and b, you would first store the values in variables, then call ctx._source.clear(), and then add the keys back.
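A slightly more general variant of the same idea is to drive the kept keys from a parameter, so the script does not need to be edited when the list changes. This is just a sketch (the keep parameter name is made up for the example):
POST /index/type/id/_update
{
  "script": {
    "source": "Map kept = new HashMap(); for (def k : params.keep) { if (ctx._source.containsKey(k)) { kept.put(k, ctx._source.get(k)) } } ctx._source.clear(); ctx._source.putAll(kept);",
    "params": {
      "keep": ["a", "b"]
    }
  }
}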

Field [] used in expression does not exist in mappings

The feature I'm trying to build is a metric in Kibana that displays the number of "unvalidated" users.
I send a log when a user registers, then another log when the user is validated.
So the count I want is the difference between the number of registered users and the number of validated users.
In Kibana I cannot do such a math operation, so I found a workaround:
I added a "scripted field" named "unvalidated" which is equal to 1 when a user registers and -1 when a user validates his account.
The sum of the "unvalidated" field should be the number of unvalidated users.
This is the script I defined in my scripted field:
doc['ctxt_code'].value == 1 ? 1 : doc['ctxt_code'].value == 2 ? -1 : 0
with:
ctxt_code 1 as the register log
ctxt_code 2 as the validated log
This setup works well when all my logs have a "ctxt_code", but when a log without this field is pushed, Kibana throws the following error:
Field [ctxt_code] used in expression does not exist in mappings
I can't understand this error because Kibana says:
If a field is sparse (only some documents contain a value), documents missing the field will have a value of 0
which is the case.
Does anyone have a clue?
It's OK to have logs without the ctxt_code field... but you have to have a mapping for this field in your indices. I see you're querying multiple indices with logstash-*, so you are probably hitting one that does not have it.
You can include a mapping for your field in all indices. Just go into Sense and use this:
PUT logstash-*/_mappings/[your_mapping_name]
{
  "properties": {
    "ctxt_code": {
      "type": "short",          // or any other numeric type, including dates
      "index": "not_analyzed"   // only works for non-analyzed fields
    }
  }
}
If you prefer, you can do it from the command line: curl -XPUT 'http://[elastic_server]/logstash-*/_mappings/[your_mapping_name]' -d '{ ... same JSON ... }'
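Another option, if your Kibana version lets you write the scripted field in Painless instead of a Lucene expression, is to guard the access with doc.containsKey and a size check so documents without the field simply contribute 0. A sketch, reusing the ctxt_code logic from the question:
if (doc.containsKey('ctxt_code') && doc['ctxt_code'].size() != 0) {
  if (doc['ctxt_code'].value == 1) { return 1; }
  if (doc['ctxt_code'].value == 2) { return -1; }
}
return 0;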

ElasticSearch index unix timestamp

I have to index documents containing a 'time' field whose value is an integer representing the number of seconds since epoch (aka unix timestamp).
I've been reading ES docs and have found this:
http://www.elasticsearch.org/guide/reference/mapping/date-format.html
But it seems that if I want to submit unix timestamps and want them stored in a 'date' field (integer field is not useful for me) I have only two options:
Implement my own date format
Convert to a supported format at the sender
Is there any other option I missed?
Thanks!
If you supply a mapping that tells ES the field is a date, it can use epoch millis as an input. If you want ES to auto-detect the dates you'll have to provide ISO 8601 or another discoverable format.
Update: I should also note that you can influence what strings ES will recognize as dates in your mapping. http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-date-format.html
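For what it's worth, more recent Elasticsearch releases ship a built-in epoch_second date format, so seconds-since-epoch can be mapped directly without converting at the sender. A minimal sketch (the index and field names are just examples, and the typeless mapping syntax assumes 7.x or later):
PUT myindex
{
  "mappings": {
    "properties": {
      "time": {
        "type": "date",
        "format": "epoch_second"
      }
    }
  }
}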
In case you want to use Kibana, which I expect, and visualize according to the time of a log entry, you will need at least one field to be a date field.
Please note that you have to set the field to the date type BEFORE you put any data into the /index/type. Otherwise it will be stored as a long and cannot be changed.
Simple example that can be pasted into the Marvel/Sense plugin:
# Make sure the index isn't there
DELETE /logger
# Create the index
PUT /logger
# Add the mapping of properties to the document type `mem`
PUT /logger/_mapping/mem
{
  "mem": {
    "properties": {
      "timestamp": {
        "type": "date"
      },
      "free": {
        "type": "long"
      }
    }
  }
}
# Inspect the newly created mapping
GET /logger/_mapping/mem
Run each of these commands in sequence.
Generate free mem logs
Here is a simple script that echoes to your terminal and logs to your local Elasticsearch:
while (( 1==1 )); do memfree=`free -b|tail -n 1|tr -s ' ' ' '|cut -d ' ' -f4`; echo $memfree; curl -XPOST "localhost:9200/logger/mem" -d "{ \"timestamp\": `date +%s%3N`, \"free\": $memfree }"; sleep 1; done
Inspect the data in Elasticsearch
Paste this into your Marvel/Sense console:
GET /logger/mem/_search
Now you can move to Kibana and do some graphs. Kibana will autodetect your date field.
