Get date value in an update query with Elasticsearch Painless

I'm trying to get the millisecond values of two dates and subtract one from the other.
When I use ctx._source.begin_time.toInstant().toEpochMilli() (by analogy with doc['begin_time'].value.toInstant().toEpochMilli()) it gives me a runtime error.
And ctx._source.begin_time.date.getYear() (as in Update all documents of Elastic Search using existing column value) gives me a runtime error with the message
"ctx._source.work_time = ctx._source.begin_time.date.getYear()",
" ^---- HERE"
What type do I get from ctx._source, given that doc['begin_time'].value.toInstant().toEpochMilli() works correctly?
I can't find in the Painless documentation how to read these values correctly. begin_time is definitely a date field.
So, how can I write a script that takes the difference between two dates and writes it to another integer field?

If you look closely, the scripting language in the linked question is Groovy, which is not supported anymore. What we use nowadays (2021) is called Painless.
The main point here is that the ctx._source attributes are the original JSON -- meaning the dates will be strings or integers (depending on the format) and not java.util.Date or any other data type that you could call .getDate() on. This means we'll have to parse the value first.
So, assuming your begin_time is of the format yyyy/MM/dd, you can do the following:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      ctx._source.work_time = begin_date.getYear();
    """
  }
}
BTW the _update_by_query script context (what's accessible and what's not) is documented here, and working with datetime in Painless is nicely documented here as well.
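To get the actual difference the question asks for, the same parse-first approach works. Here is a minimal sketch, assuming a second string field end_time exists and that both fields use a yyyy/MM/dd HH:mm:ss format (adjust the pattern and target field to your mapping):

POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      // assumption: both fields are stored as strings in this exact format
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");
      LocalDateTime begin = LocalDateTime.parse(ctx._source.begin_time, dtf);
      LocalDateTime end = LocalDateTime.parse(ctx._source.end_time, dtf);
      // difference in milliseconds, written to the integer field from the question
      ctx._source.work_time = ChronoUnit.MILLIS.between(begin, end);
    """
  }
}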

Related

Elasticsearch manipulate existing field value to add new field

I'm trying to add a new field whose value comes from hashing an existing field's value. So, I want to do:
my_index.hashedusername (new field) = crc32(my_index.username) (existing field)
For example:
POST _update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": "ctx._source.hashedusername = crc32(ctx._source.username);"
  }
}
Please give me an idea of how to do this.
java.util.zip.CRC32 is not available in the shared Painless API, so mocking that package would be non-trivial -- perhaps even unreasonable.
I'd suggest computing the CRC32 hashes beforehand and only then sending the docs to ES. Alternatively, scroll through all your documents, compute the hash, and bulk-update them.
The Painless API was designed to perform comparatively simple tasks, and CRC32 is certainly outside of its purpose.

Bulk inject doc to elastic search with nanoseconds timestamp

I'm trying to use the nanosecond timestamp support provided by Elasticsearch 7.1 (actually available since 7.0), and I'm not sure how to do this correctly.
Before 7.0, Elasticsearch only supported millisecond timestamps; I use the _bulk API to inject documents.
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.httpsession import URLLib3Session

# bulk post docs to Elasticsearch
def es_bulk_insert(log_lines, batch_size=1000):
    headers = {'Content-Type': 'application/x-ndjson'}
    while log_lines:
        batch, log_lines = log_lines[:batch_size], log_lines[batch_size:]
        batch = '\n'.join([x.es_post_payload for x in batch]) + '\n'
        request = AWSRequest(method='POST', url=f'{ES_HOST}/_bulk', data=batch, headers=headers)
        SigV4Auth(boto3.Session().get_credentials(), 'es', 'eu-west-1').add_auth(request)
        session = URLLib3Session()
        r = session.send(request.prepare())
        if r.status_code > 299:
            raise Exception(f'Received a bad response from Elasticsearch: {r.text}')
The log index is generated per day:
# ex:
# log-20190804
# log-20190805
def es_index(self):
    current_date = datetime.strftime(datetime.now(), '%Y%m%d')
    return f'{self.name}-{current_date}'
The timestamp is in nanoseconds, e.g. "2019-08-07T23:59:01.193379911Z", and it was automatically mapped to the date type by Elasticsearch before 7.0:
"timestamp": {
  "type": "date"
},
Now I want to map the timestamp field to the "date_nanos" type. From here, I think I need to create the ES index with the correct mapping before calling the es_bulk_insert() function to upload the docs.
GET https://{es_url}/log-20190823

If it does not exist (returns 404):

PUT https://{es_url}/log-20190823/_mapping
{
  "properties": {
    "timestamp": {
      "type": "date_nanos"
    }
  }
}

...
call es_bulk_insert()
...
My questions are:
1. If I do not remap the old data (e.g. log-20190804), the timestamp field will have two mappings (date vs date_nanos). Will there be a conflict when I use Kibana to search the logs?
2. I didn't see many posts about this new feature. Will it hurt performance a lot? Has anyone used it in prod?
3. Kibana did not support nanosecond search before 7.3, and I'm not sure it can sort by nanoseconds correctly; I will try.
Thanks!
You are right: For date_nanos you need to create the mapping explicitly — otherwise the dynamic mapping will fall back to date.
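For example, a minimal sketch of creating one of the daily indices with an explicit date_nanos mapping up front (index name taken from the question, everything else left at defaults):

PUT log-20190823
{
  "mappings": {
    "properties": {
      "timestamp": {
        "type": "date_nanos"
      }
    }
  }
}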
And you are also correct that Kibana supports date_nanos in general in 7.3; though the relevant ticket is IMO https://github.com/elastic/kibana/issues/31424.
However, sorting doesn't work correctly yet. That is because both date (millisecond precision) and date_nanos (nanosecond precision) are represented as a long since the start of the epoch. So the first one will have a value of 1546344630124 and the second one of 1546344630123456789 — this doesn't give you the expected sort order.
In Elasticsearch there is a search parameter, "numeric_type": "date_nanos", that will cast both to nanosecond precision and thus order correctly (added in 7.2). However, that parameter isn't used in Kibana yet; I've raised an issue for that now.
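For reference, a minimal sketch of a search sorted this way across mixed indices (index pattern and field name assumed from the question):

GET log-*/_search
{
  "sort": [
    {
      "timestamp": {
        "order": "desc",
        "numeric_type": "date_nanos"
      }
    }
  ]
}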
For performance: The release blog post has some details. Obviously there is overhead (including document size), so I would only use the higher precision if you really need it.

ElasticSearch painless scripts - Way to output variable values besides the final score?

I am using a Painless script to implement a custom scoring function while querying the ES index that serves as the basis for our recommendation engine. While calculating the final score in the Painless script, I use a product of intermediate variables such as recency and uniqueness, calculated within the script.
Now, it is trivial to get the final scores of the top documents as they are returned within the query response. However, for detailed analysis, I'm trying to find a way to also get the intermediate variables' values (recency and uniqueness in the above example). I understand these Painless variables only exist within the context of the Painless script, which does not have a standard REPL setup. So is there really no way to access these variables? Has anyone found a workaround? Thanks!
E.g. if I have the following simplified Painless script:
def recency = 1 / doc['date'].value;
def uniqueness = doc['ctr'].value;
return recency * uniqueness;
In the final ES response, I get the score, i.e. recency * uniqueness. However, I also want to know the values of the intermediate variables, i.e. recency and uniqueness.
You can try a modular approach with multiple scripted fields:
recency -- get the recency field
uniqueness -- get the uniqueness field
Then access the fields like normal ES fields in your final Painless script:
if (doc.containsKey('recency.keyword') && doc.containsKey('uniqueness.keyword')) {
  def val1 = doc['recency.keyword'].value;
  def val2 = doc['uniqueness.keyword'].value;
}
Hope it helps.
There is no direct way of printing it anywhere, I suppose.
But here is something you can try to check the intermediate output of any variable:
create another scripted field which returns only the value of that variable.
For example, in your case:
"script_fields": {
"derivedRecency": {
"script": {
"lang": "painless",
"source": """
return doc['recency'].value;
"""
}
}
}
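Tying that back to the simplified script from the question, a full search request could expose both intermediates as script fields next to the scored results. A minimal sketch, assuming doc values exist for the date and ctr fields, with match_all standing in for the actual scoring query and the index name as a placeholder:

GET myindex/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "recency": {
      "script": {
        "lang": "painless",
        "source": "1.0 / doc['date'].value.toInstant().toEpochMilli()"
      }
    },
    "uniqueness": {
      "script": {
        "lang": "painless",
        "source": "doc['ctr'].value"
      }
    }
  }
}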

Field-specific versioning in Elasticsearch

There is a good deal of documentation about how Elasticsearch supports document level external versioning. However, if one wants to do a partial update (say, to a specific field), it'd be useful to have this type of version checking at the field level.
For instance, say I have an object field name, with primitive fields value and timestamp. I only want the partial updates to succeed if the timestamp value is greater than the value currently in Elasticsearch.
Is there an easy way to do this? Can it be done with a script? Or is there a more standard way of doing it?
Yes, it's very easy using a script; see https://www.elastic.co/guide/en/elasticsearch/reference/2.0/docs-update.html.
I've written an example that updates the "value" field if and only if the specified timestamp (given in the parameter update_time) is greater than the "timestamp" field. If the stored timestamp is less than the update_time parameter, the document is updated; otherwise the update is not performed.
curl -XPOST 'localhost:9200/test/type1/1/_update' -d '{
  "script" : {
    "inline": "if (ctx._source.name.timestamp > update_time) { ctx.op = \"none\" }; ctx._source.name.value = value; ctx._source.name.timestamp = update_time;",
    "params" : {
      "update_time" : 432422,
      "value": "My new value"
    }
  }
}'
You can get the current time in the script if desired, rather than passing it as a parameter, e.g.:
update_time = DateTime.now().getMillis()
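Note that this example uses the 2.x-era Groovy syntax. On current versions with Painless, roughly the same guard could look like the following sketch (index name, field layout and parameter values carried over from the example above):

POST test/_update/1
{
  "script": {
    "lang": "painless",
    "source": """
      if (ctx._source.name.timestamp > params.update_time) {
        // the stored timestamp is newer: skip this update
        ctx.op = 'none';
      } else {
        ctx._source.name.value = params.value;
        ctx._source.name.timestamp = params.update_time;
      }
    """,
    "params": {
      "update_time": 432422,
      "value": "My new value"
    }
  }
}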

Kibana 4.1 - use JSON input to create an Hour Of Day field from @timestamp for histogram

Edit: I found the answer; see below for Logstash <= 2.0.
Plugin created for Logstash 2.0:
For whoever is interested in this with Logstash 2.0 or above, I created a plugin that makes this dead simple.
The GEM is here:
https://rubygems.org/gems/logstash-filter-dateparts
Here is the documentation and source code:
https://github.com/mikebski/logstash-datepart-plugin
I've got a bunch of data in Logstash with an @timestamp for a range of a couple of weeks. I have a duration field that is a number field, and I can do a date histogram. I would like to do a histogram over hour of day, rather than a linear histogram from x -> y dates; that is, I would like the x axis to be 0 -> 23 instead of date x -> date y.
I think I can use the "JSON Input" advanced text input to add a field to the result set which is the hour of day of the @timestamp. The help text says:
"Any JSON formatted properties you add here will be merged with the elasticsearch aggregation definition for this section. For example shard_size on a terms aggregation"
which leads me to believe it can be done, but it does not give any examples.
Edited to add:
I have tried setting up an entry in scripted fields based on the link below, but it will not work like the examples on their blog with 4.1. The following script gives an error when trying to add a field with format number and name test_day_of_week: Integer.parseInt("1234")
The problem looks like the scripting is not very robust. Oddly enough, I want to do exactly what they are doing in the examples (add fields for day of month, day of week, etc.). I can get the field to work if the script is doc['@timestamp'], but I cannot manipulate the timestamp.
The docs say Lucene expressions are allowed and show some trig and GCD examples for GIS-type stuff, but nothing for dates...
There is this update to the blog:
"UPDATE: As a security precaution, starting with version 4.0.0-RC1, Kibana scripted fields default to Lucene Expressions, not Groovy, as the scripting language. Since Lucene Expressions only support operations on numerical fields, the example below dealing with date math does not work in Kibana 4.0.0-RC1+ versions."
There is no suggestion for how to actually do this now. I guess I could go off and enable the Groovy plugin...
Any ideas?
EDIT - THE SOLUTION:
I added a filter using Ruby to do this, and it was pretty simple.
Basically, in a Ruby script you can access event['field'] and you can create new ones. I use the Ruby time methods to create new fields based on the @timestamp of the event:
ruby {
  code => "ts = event['@timestamp']; event['weekday'] = ts.wday; event['hour'] = ts.hour; event['minute'] = ts.min; event['second'] = ts.sec; event['mday'] = ts.day; event['yday'] = ts.yday; event['month'] = ts.month;"
}
This no longer appears to work in Logstash 1.5.4 - the Ruby date elements appear to be unavailable, and this then throws a "rubyexception" and does not add the fields to the logstash events.
I've spent some time searching for a way to recover the functionality we had with Groovy scripted fields, which are no longer available for dynamic scripting, to provide fields such as "hourofday", "dayofweek", et cetera. What I've done is add these as Groovy script files directly on the Elasticsearch nodes themselves, like so:
/etc/elasticsearch/scripts/
hourofday.groovy
dayofweek.groovy
weekofyear.groovy
... and so on.
Those script files each contain a single line of Groovy, like so:
Integer.parseInt(new Date(doc["@timestamp"].value).format("d"))   (dayofmonth)
Integer.parseInt(new Date(doc["@timestamp"].value).format("u"))   (dayofweek)
To reference these in Kibana, first create a new search and save it, or choose one of your existing saved searches (take a copy of the existing JSON before you change it, just in case) on the "Settings -> Saved Objects -> Searches" page. Then modify the query to add "script_fields", so you get something like this:
{
  "query" : {
    ...
  },
  "script_fields": {
    "minuteofhour": {
      "script_file": "minuteofhour"
    },
    "hourofday": {
      "script_file": "hourofday"
    },
    "dayofweek": {
      "script_file": "dayofweek"
    },
    "dayofmonth": {
      "script_file": "dayofmonth"
    },
    "dayofyear": {
      "script_file": "dayofyear"
    },
    "weekofmonth": {
      "script_file": "weekofmonth"
    },
    "weekofyear": {
      "script_file": "weekofyear"
    },
    "monthofyear": {
      "script_file": "monthofyear"
    }
  }
}
As shown, the "script_fields" line should fall outside the "query" itself, or you will get an error. Also ensure the script files are available to all your Elasticsearch nodes.
