Bulk insert docs to Elasticsearch with nanosecond timestamps - elasticsearch

I'm trying to use the nanosecond timestamp support provided by Elasticsearch 7.1 (actually available since 7.0), and I'm not sure how to do this correctly.
Before 7.0, Elasticsearch only supported millisecond timestamps; I use the _bulk API to insert documents.
# bulk post docs to Elasticsearch
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.httpsession import URLLib3Session

def es_bulk_insert(log_lines, batch_size=1000):
    headers = {'Content-Type': 'application/x-ndjson'}
    while log_lines:
        batch, log_lines = log_lines[:batch_size], log_lines[batch_size:]
        batch = '\n'.join([x.es_post_payload for x in batch]) + '\n'
        request = AWSRequest(method='POST', url=f'{ES_HOST}/_bulk', data=batch, headers=headers)
        SigV4Auth(boto3.Session().get_credentials(), 'es', 'eu-west-1').add_auth(request)
        session = URLLib3Session()
        r = session.send(request.prepare())
        if r.status_code > 299:
            raise Exception(f'Received a bad response from Elasticsearch: {r.text}')
The log index is generated per day:
# ex:
#   log-20190804
#   log-20190805
def es_index(self):
    current_date = datetime.strftime(datetime.now(), '%Y%m%d')
    return f'{self.name}-{current_date}'
The timestamp has nanosecond precision, e.g. "2019-08-07T23:59:01.193379911Z", and it was automatically mapped to a date type by Elasticsearch before 7.0:
"timestamp": {
"type": "date"
},
Now I want to map the timestamp field to the "date_nanos" type. From here, I think I need to create the ES index with correct mapping before I call the es_bulk_insert() function to upload docs.
GET https://{es_url}/log-20190823

If it does not exist (returns 404):

PUT https://{es_url}/log-20190823/_mapping
{
    "properties": {
        "timestamp": {
            "type": "date_nanos"
        }
    }
}

...
call es_bulk_insert()
...
My questions are:
1. If I do not remap the old data (e.g. log-20190804), the timestamp field will have two different mappings (date vs date_nanos). Will there be a conflict when I use Kibana to search the logs?
2. I haven't seen many posts about using this new feature. Will it hurt performance a lot? Has anyone used it in production?
3. Kibana does not support nanosecond search before 7.3, and I'm not sure whether it can sort by nanoseconds correctly; I will try.
Thanks!

You are right: For date_nanos you need to create the mapping explicitly — otherwise the dynamic mapping will fall back to date.
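As an illustration only (not the poster's actual code), here is a minimal sketch of creating the daily index with an explicit date_nanos mapping before the first bulk call, reusing the SigV4-signed request style from the question; es_signed_request and ensure_index are made-up helper names, and ES_HOST is assumed to be defined as in the question:

import json
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest
from botocore.httpsession import URLLib3Session

def es_signed_request(method, url, data=None):
    # sign an arbitrary request against the AWS ES domain, as in the question
    headers = {'Content-Type': 'application/json'}
    request = AWSRequest(method=method, url=url, data=data, headers=headers)
    SigV4Auth(boto3.Session().get_credentials(), 'es', 'eu-west-1').add_auth(request)
    return URLLib3Session().send(request.prepare())

def ensure_index(index_name):
    # HEAD returns 200 if the index already exists, 404 otherwise
    if es_signed_request('HEAD', f'{ES_HOST}/{index_name}').status_code == 404:
        mapping = {'mappings': {'properties': {'timestamp': {'type': 'date_nanos'}}}}
        r = es_signed_request('PUT', f'{ES_HOST}/{index_name}', data=json.dumps(mapping))
        if r.status_code > 299:
            raise Exception(f'Could not create index {index_name}: {r.text}')

Calling something like ensure_index(es_index()) before es_bulk_insert() would make the explicit mapping win over dynamic mapping.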
And you are also correct that Kibana supports date_nanos in general in 7.3; though the relevant ticket is IMO https://github.com/elastic/kibana/issues/31424.
However, sorting doesn't work correctly yet. That is because both date (millisecond precision) and date_nanos (nanosecond precision) are represented as a long counting from the start of the epoch. So the first one will have a value of 1546344630124 and the second one a value of 1546344630123456789, which doesn't give you the expected sort order.
In Elasticsearch there is a parameter for search "numeric_type": "date_nanos" that will cast both to nanosecond precision and thus order correctly (added in 7.2). However, that parameter isn't yet used in Kibana. I've raised an issue for that now.
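For completeness, a sketch of what such a sort could look like from the Python side once you are on 7.2+, reusing the hypothetical es_signed_request helper from above (the log-* index pattern is an assumption):

import json

query = {
    'query': {'match_all': {}},
    'sort': [
        # cast both date and date_nanos values to nanosecond resolution
        {'timestamp': {'order': 'desc', 'numeric_type': 'date_nanos'}}
    ]
}
r = es_signed_request('POST', f'{ES_HOST}/log-*/_search', data=json.dumps(query))
print(r.text)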
For performance: The release blog post has some details. Obviously there is overhead (including document size), so I would only use the higher precision if you really need it.

Related

Get date value in update query elasticsearch painless

I'm trying to get the millis values of two dates and subtract one from the other, writing the result to another field.
When I use ctx._source.begin_time.toInstant().toEpochMilli() (like doc['begin_time'].value.toInstant().toEpochMilli()) it gives me a runtime error.
And ctx._source.begin_time.date.getYear() (like in Update all documents of Elastic Search using existing column value) gives me a runtime error with the message
"ctx._source.work_time = ctx._source.begin_time.date.getYear()",
" ^---- HERE"
What type do I get from ctx._source, given that doc['begin_time'].value.toInstant().toEpochMilli() works correctly?
I can't find in the Painless documentation how to get the values correctly. begin_time is definitely a date.
So, how can I write a script that gets the difference between two dates and writes it to another integer field?
If you look closely, the scripting language from the linked question is Groovy, but it isn't supported anymore. What we use nowadays (2021) is called Painless.
The main point here is that the ctx._source attributes are the original JSON -- meaning the dates will be strings or integers (depending on the format) and not java.util.Date or any other data type that you could call .getDate() on. This means we'll have to parse the value first.
So, assuming your begin_time is of the format yyyy/MM/dd, you can do the following:
POST myindex/_update_by_query
{
  "query": {
    "match_all": {}
  },
  "script": {
    "source": """
      DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
      LocalDate begin_date = LocalDate.parse(ctx._source.begin_time, dtf);
      ctx._source.work_time = begin_date.getYear()
    """
  }
}
BTW the _update_by_query script context (what's accessible and what's not) is documented here and working with datetime in painless is nicely documented here.
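To get at the difference between two dates that the question actually asks for, a possible variant is sketched below. It assumes a second field end_time in the same yyyy/MM/dd format and an unsecured local cluster reachable with the plain requests library, so treat it as a starting point rather than a drop-in answer:

import requests

body = {
    "query": {"match_all": {}},
    "script": {
        "source": """
            DateTimeFormatter dtf = DateTimeFormatter.ofPattern("yyyy/MM/dd");
            LocalDate begin = LocalDate.parse(ctx._source.begin_time, dtf);
            LocalDate end = LocalDate.parse(ctx._source.end_time, dtf);
            // store the difference in days as an integer field
            ctx._source.work_time = (int) ChronoUnit.DAYS.between(begin, end);
        """
    }
}
r = requests.post("http://localhost:9200/myindex/_update_by_query", json=body)
r.raise_for_status()
print(r.json())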

Discover historical trends in Elasticsearch (not visual)

I have some experience with Elastic as log storage, but I'm stuck on basic trend recognition (where I need to compare found documents to each other) over time periods.
An easy query would answer the following question:
Find all occurrences of document rows (a row is identified by a growing/continuous @timestamp value) where a specific field (e.g. threads_count) keeps growing for a fixed count of documents, or over a time period.
So say I have the thread_count of some application, logged every minute over a day, including a timestamp. If I specify that I'm looking for a growing trend over 10 minutes, the result should return documents or document sets where thread_count was greater than in the document from the minute before, for at least 10 consecutive documents.
It is a task very similar to looking at a line graph and identifying the growing parts by eye.
Maybe I'm just missing the proper function name to search for. I'm not interested in visualization; I would like to find similar situations over the API and take the needed actions.
Any reference to documentation or a simple example is welcome!
Well, a script cannot be used across documents, so you will have to use a payload.
In your query, sort the results by date.
https://www.elastic.co/guide/en/elastic-stack-overview/6.3/how-watcher-works.html
A script in the payload transform could tell you whether a field is increasing (something like the sketch below; I don't have access to an ES index right now):
"transform": {
"script": {
"source": "ctx.payload.transform = []; def current_score = -1;
def current = []; for (int j=0;j<ctx.payload.hits.hits;j++){
//check in the loop if current_score increasing using ctx.payload.hits.hits[j]._source.message], if not return "FALSE"
} ; return "TRUE",
"lang": "painless"
}
}
If you use Logstash to index your documents, take a look at the elapsed filter, which could be nice too: https://www.elastic.co/guide/en/logstash/current/plugins-filters-elapsed.html

Unable to loop through array field ES 6.1

I'm facing a problem in Elasticsearch 6.1 that I cannot solve and I don't know why. I have read the docs several times and maybe I'm missing something.
I have a scripted query that needs to do some calculation before deciding whether a record is available or not.
Here is the script:
https://gist.github.com/dunice/a3a8a431140ec004fdc6969f77356fdf
What I'm doing is trying to loop through an array field with the following source:
"unavailability": [
{
"starts_at": "2018-11-27T18:00:00+00:00",
"local_ends_at": "2018-11-27T15:04:00",
"local_starts_at": "2018-11-27T13:00:00",
"ends_at": "2018-11-27T20:04:00+00:00"
},
{
"starts_at": "2018-12-04T18:00:00+00:00",
"local_ends_at": "2018-12-04T15:04:00",
"local_starts_at": "2018-12-04T13:00:00",
"ends_at": "2018-12-04T20:04:00+00:00"
},
]
When the script is executed it throws the error: No field found for [unavailability] in mapping with types [aircraft]
Is there any clue to make it work?
Thanks
UPDATE
Query:
https://gist.github.com/dunice/3ccd7d83ca6ddaa63c11013b84e659aa
UPDATE 2
Mapping:
https://gist.github.com/dunice/f8caee114bbd917115a21b8b9175a439
Data example:
https://gist.github.com/dunice/8ad0602bc282b4ca19bce8ae849117ad
You cannot access an array present in the source document via doc_values (i.e. doc). You need to directly access the source document via the _source variable instead, like this:
for(int i = 0; i < params._source['unavailability'].length; i++) {
Note that depending on your ES version, you might want to try ctx._source or just _source instead of params._source
I solved my use case with a different approach.
Instead of having a field that is an array of objects, like unavailability was, I decided to create two fields that are arrays of datetimes:
unavailable_from
unavailable_to
My script walks through the first field and then checks the second one at the same position.
UPDATE
The direct access to _source is disabled by default:
https://github.com/elastic/elasticsearch/issues/17558

Elasticsearch 2.1: Result window is too large (index.max_result_window)

We retrieve information from Elasticsearch 2.1 and allow the user to page through the results. When the user requests a high page number we get the following error message:
Result window is too large, from + size must be less than or equal
to: [10000] but was [10020]. See the scroll api for a more efficient
way to request large data sets. This limit can be set by changing the
[index.max_result_window] index level parameter
The Elasticsearch docs say that this is because of high memory consumption and recommend using the scrolling API:
Values higher than that can consume significant chunks of heap memory per
search and per shard executing the search. It’s safest to leave this
value as it is and use the scroll api for any deep scrolling https://www.elastic.co/guide/en/elasticsearch/reference/2.x/breaking_21_search_changes.html#_from_size_limits
The thing is that I do not want to retrieve large data sets. I only want to retrieve a slice of the data set which is very high up in the result set. Also, the scrolling docs say:
Scrolling is not intended for real time user requests https://www.elastic.co/guide/en/elasticsearch/reference/2.2/search-request-scroll.html
This leaves me with some questions:
1) Would the memory consumption really be lower (and if so, why) if I used the scrolling API to scroll up to result 10020 (and disregarded everything below 10000) instead of doing a "normal" search request for results 10000-10020?
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
3) Are there any other options to solve my problem?
If you need deep pagination, one possible solution is to increase the value of max_result_window. You can use curl to do this from your shell command line:
curl -XPUT "http://localhost:9200/my_index/_settings" -H 'Content-Type: application/json' -d '{ "index" : { "max_result_window" : 500000 } }'
I did not notice increased memory usage for values of ~100k.
The right solution would be to use scrolling.
However, if you want to extend the number of results a search returns beyond 10,000, you can do it easily with Kibana:
Go to Dev Tools and just post the following to your index (your_index_name), specifying what the new max result window should be:
PUT your_index_name/_settings
{
"max_result_window" : 500000
}
If all goes well, you should see the following success response:
{
"acknowledged": true
}
The following pages in the elastic documentation talk about deep paging:
https://www.elastic.co/guide/en/elasticsearch/guide/current/pagination.html
https://www.elastic.co/guide/en/elasticsearch/guide/current/_fetch_phase.html
Depending on the size of your documents, the number of shards, and the
hardware you are using, paging 10,000 to 50,000 results (1,000 to
5,000 pages) deep should be perfectly doable. But with big-enough from
values, the sorting process can become very heavy indeed, using vast
amounts of CPU, memory, and bandwidth. For this reason, we strongly
advise against deep paging.
Use the Scroll API to get more than 10000 results.
Scroll example in ElasticSearch NEST API
I have used it like this:
private static Customer[] GetCustomers(IElasticClient elasticClient)
{
    var customers = new List<Customer>();
    var searchResult = elasticClient.Search<Customer>(s => s.Index(IndexAlias.ForCustomers())
                          .Size(10000).SearchType(SearchType.Scan).Scroll("1m"));

    do
    {
        var result = searchResult;
        searchResult = elasticClient.Scroll<Customer>("1m", result.ScrollId);
        customers.AddRange(searchResult.Documents);
    } while (searchResult.IsValid && searchResult.Documents.Any());

    return customers.ToArray();
}
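For anyone doing the same from Python, a rough equivalent sketch with the official elasticsearch client (7.x-style API; host, index name and page size are assumptions) could look like this:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def fetch_all(index):
    # page through every hit with the scroll API, 1000 docs per page
    docs = []
    resp = es.search(index=index, scroll="1m", size=1000,
                     body={"query": {"match_all": {}}})
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]
    while hits:
        docs.extend(h["_source"] for h in hits)
        resp = es.scroll(scroll_id=scroll_id, scroll="1m")
        scroll_id = resp["_scroll_id"]
        hits = resp["hits"]["hits"]
    es.clear_scroll(scroll_id=scroll_id)
    return docs

Newer Elasticsearch versions also offer search_after with a point in time as the recommended alternative to scroll for deep pagination.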
If you want more than 10,000 results, then memory usage on all the data nodes will be very high, because they have to return more results for each query request. If you then have more data and more shards, merging those results will be inefficient. ES also caches the filter context, hence again more memory. You have to find out by trial and error how much exactly you can take. If you are getting many requests in a small window, you should run multiple queries for more than 10k results and merge them yourself in the code, which should take less application memory than increasing the window size.
2) It does not seem that the scrolling API is an option for me but that I have to increase "index.max_result_window". Does anyone have any experience with this?
--> You can define this value in index templates , es template will be applicable for new indexes only ,so you either have to delete old indexes after creating template or wait for new data to be ingested in elasticsearch .
{
  "order": 1,
  "template": "index_template*",
  "settings": {
    "index.number_of_replicas": "0",
    "index.number_of_shards": "1",
    "index.max_result_window": 2147483647
  }
}
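As a sketch, installing such a legacy template from Python could look like the snippet below (the host and the template name wide_result_window are made up; recent versions use the _index_template endpoint and index_patterns instead of template):

import requests

template = {
    "order": 1,
    "template": "index_template*",
    "settings": {
        "index.number_of_replicas": "0",
        "index.number_of_shards": "1",
        "index.max_result_window": 2147483647
    }
}
# PUT the template; only indexes created afterwards pick up the setting
r = requests.put("http://localhost:9200/_template/wide_result_window", json=template)
r.raise_for_status()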
In my case it looks like reducing the results via the from & size parameters of the query will remove the error, as we don't need all the results:
GET widgets_development/_search
{
  "from": 0,
  "size": 5,
  "query": {
    "bool": {}
  },
  "sort": {
    "col_one": "asc"
  }
}

Kibana 4.1 - use JSON input to create an Hour Of Day field from @timestamp for histogram

Edit: I found the answer, see below for Logstash <= 2.0 ===>
Plugin created for Logstash 2.0
Whoever is interested in this with Logstash 2.0 or above: I created a plugin that makes this dead simple:
The GEM is here:
https://rubygems.org/gems/logstash-filter-dateparts
Here is the documentation and source code:
https://github.com/mikebski/logstash-datepart-plugin
I've got a bunch of data in Logstash with an @timestamp spanning a range of a couple of weeks. I have a duration field that is a number field, and I can do a date histogram. I would like to do a histogram over hour of day, rather than a linear histogram from x -> y dates. I would like the x axis to be 0 -> 23 instead of date x -> date y.
I think I can use the JSON Input advanced text input to add a field to the result set which is the hour of day of the @timestamp. The help text says:
"Any JSON formatted properties you add here will be merged with the elasticsearch aggregation definition for this section. For example shard_size on a terms aggregation", which leads me to believe it can be done, but it does not give any examples.
Edited to add:
I have tried setting up an entry in the scripted fields based on the link below, but it will not work like the examples on their blog with 4.1. The following script gives an error when trying to add a field with format number and name test_day_of_week: Integer.parseInt("1234")
The problem looks like the scripting is not very robust. Oddly enough, I want to do exactly what they are doing in the examples (add fields for day of month, day of week, etc...). I can get the field to work if the script is doc['@timestamp'], but I cannot manipulate the timestamp.
The docs say Lucene expressions are allowed and show some trig and GCD examples for GIS type stuff, but nothing for date...
There is this update to the BLOG:
UPDATE: As a security precaution, starting with version 4.0.0-RC1,
Kibana scripted fields default to Lucene Expressions, not Groovy, as
the scripting language. Since Lucene Expressions only support
operations on numerical fields, the example below dealing with date
math does not work in Kibana 4.0.0-RC1+ versions.
There is no suggestion for how to actually do this now. I guess I could go off and enable the Groovy plugin...
Any ideas?
EDIT - THE SOLUTION:
I added a filter using Ruby to do this, and it was pretty simple:
Basically, in a Ruby script you can access event['field'] and you can create new ones. I use the Ruby time methods to create new fields based on the @timestamp for the event.
ruby {
  code => "ts = event['@timestamp']; event['weekday'] = ts.wday; event['hour'] = ts.hour; event['minute'] = ts.min; event['second'] = ts.sec; event['mday'] = ts.day; event['yday'] = ts.yday; event['month'] = ts.month;"
}
This no longer appears to work in Logstash 1.5.4 - the Ruby date elements appear to be unavailable, and this then throws a "rubyexception" and does not add the fields to the logstash events.
I've spent some time searching for a way to recover the functionality we had in the Groovy scripted fields, which are unavailable for scripting dynamically, to provide me with fields such as "hourofday", "dayofweek", et cetera. What I've done is to add these as groovy script files directly on the Elasticsearch nodes themselves, like so:
/etc/elasticsearch/scripts/
hourofday.groovy
dayofweek.groovy
weekofyear.groovy
... and so on.
Those script files each contain a single line of Groovy, like so:
Integer.parseInt(new Date(doc["@timestamp"].value).format("d"))   (dayofmonth)
Integer.parseInt(new Date(doc["@timestamp"].value).format("u"))   (dayofweek)
To reference these in Kibana, firstly create a new search and save it, or choose one of your existing saved searches (Please take a copy of the existing JSON before you change it, just in case) in the "Settings -> Saved Objects -> Searches" page. You then modify the query to add "Script Fields" in, so you get something like this:
{
"query" : {
...
},
"script_fields": {
"minuteofhour": {
"script_file": "minuteofhour"
},
"hourofday": {
"script_file": "hourofday"
},
"dayofweek": {
"script_file": "dayofweek"
},
"dayofmonth": {
"script_file": "dayofmonth"
},
"dayofyear": {
"script_file": "dayofyear"
},
"weekofmonth": {
"script_file": "weekofmonth"
},
"weekofyear": {
"script_file": "weekofyear"
},
"monthofyear": {
"script_file": "monthofyear"
}
}
}
As shown, the "script_fields" line should fall outside the "query" itself, or you will get an error. Also ensure the script files are available to all your Elasticsearch nodes.
