How to get newest data from Elasticsearch based on a date field

I implemented a scheduled script that injects data into my Elasticsearch. The script doesn't check whether the data already exists in Elasticsearch, so it inserts duplicates. What I want is to get all events that have the latest timestamp field value (the insertion dateTime).
Note: I don't have an id or any other unique field that I could group by (and set size to 1) to get the latest.
So can you give me some other options?

You could aggregate by the latest available timestamp and get the top, potentially duplicate docs like so:
GET index/_search
{
  "size": 0,
  "aggs": {
    "latest": {
      "terms": {
        "field": "timestamp",
        "order": {
          "_key": "desc"
        },
        "size": 1
      },
      "aggs": {
        "latest_docs": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}
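If you then need to drop the duplicates client-side, here is a minimal sketch (assuming the Python elasticsearch client; since there is no unique field, it treats the whole _source as the identity of an event, i.e. it assumes duplicates are exact re-insertions):

import json

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster, 8.x-style client

resp = es.search(
    index="index",  # the placeholder index name from the query above
    size=0,
    aggs={
        "latest": {
            "terms": {"field": "timestamp", "order": {"_key": "desc"}, "size": 1},
            "aggs": {"latest_docs": {"top_hits": {"size": 100}}},
        }
    },
)

seen = set()
unique_events = []
for bucket in resp["aggregations"]["latest"]["buckets"]:
    for hit in bucket["latest_docs"]["hits"]["hits"]:
        src = hit["_source"]
        # No unique field exists, so use the whole document as the identity;
        # this assumes duplicates are exact re-insertions of the same data.
        key = json.dumps(src, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique_events.append(src)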

Related

Pagination with aggregation in Elasticsearch

Elasticsearch version 8.5.
I have a cron job inside a Java class which transfers data from one Elasticsearch index to another. To fetch data from the first index I use an aggregation query. After some time I expect a single request to return a large amount of data. Can I use some kind of pagination together with the aggregation so that my backend can handle this amount of data? Updates to the first index can occur at any time, so options like search_after are not suitable because of consistency.
Example request to get the number of employees in each department:
{ "size": 0, "aggs": { "group_by_company_id": { "terms": { "field": "company_id" }, "aggs": { "group_by_department_id": { "terms": { "field": "department_id" }, "aggs": { "group_by_department_name": { "terms": { "field": "department_name" } } } } } } } }
I tried to find information in the official documentation but didn't find anything on how to combine aggregation and pagination.
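One approach that is usually suggested (a sketch, not from the original post): flatten the nested terms aggregations into a single composite aggregation over the same fields and page through it with the after_key cursor. The host, the index name employees, the page size, and the use of the Python elasticsearch client for illustration are all assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

after_key = None
while True:
    composite = {
        "size": 500,  # buckets per page
        "sources": [
            {"company_id": {"terms": {"field": "company_id"}}},
            {"department_id": {"terms": {"field": "department_id"}}},
            {"department_name": {"terms": {"field": "department_name"}}},
        ],
    }
    if after_key:
        composite["after"] = after_key  # resume from the previous page

    resp = es.search(
        index="employees",  # assumption: source index name
        size=0,
        aggs={"group_by_department": {"composite": composite}},
    )
    agg = resp["aggregations"]["group_by_department"]
    for bucket in agg["buckets"]:
        print(bucket["key"], bucket["doc_count"])  # employees per department

    after_key = agg.get("after_key")
    if not after_key:
        break  # no more pages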

Paginate an aggregation sorted by hits on Elastic index

I have an Elastic index (say file) where I append a document every time the file is downloaded by a client. Each document is quite basic: it contains a field filename and a date field when that indicates the time of the download.
What I want to achieve is to get, for each file, the number of times it has been downloaded in the last 3 months. Thanks to another question, I have a query that returns all the results:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads": {
      "terms": {
        "field": "filename.keyword",
        "size": 1000
      }
    }
  },
  "size": 0
}
Now, I want to have a paginated result. The terms aggregation cannot be paginated, so I use a composite aggregation. Of course, if there is a better aggregation, it can be used here...
So for the moment, I have something like this:
{
  "query": {
    "range": {
      "when": {
        "gte": "now-3M"
      }
    }
  },
  "aggs": {
    "downloads_agg": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "downloads": {
              "terms": {
                "field": "filename.keyword"
              }
            }
          }
        ]
      }
    }
  },
  "size": 0
}
This aggregation allows me to paginate (thanks to the after_key value in the response), but it is not sorted by the number of downloads - it is sorted by the filename.
How can I sort that composite aggregation on the number of documents for each filename in my index?
Thanks.
The composite aggregation doesn't allow sorting based on a value field.
Excerpt from the discussion on the Elastic forum:
it's designed as a memory-friendly way to paginate over aggregations.
Part of the tradeoff is that you lose things like ordering by doc
count, since that isn't known until after all the docs have been
collected.
I have no experience with Transforms (part of X-Pack and licensed), but you can try that out. Apart from this, I don't see a way to get the expected output.
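If Transforms are not an option, one workaround (a sketch only, assuming the Python elasticsearch client and that the full list of filename buckets fits in application memory) is to page through the composite aggregation as above and then sort and paginate the buckets client-side:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

buckets = []
after_key = None
while True:
    composite = {
        "size": 1000,  # composite page size, not the final page size
        "sources": [{"downloads": {"terms": {"field": "filename.keyword"}}}],
    }
    if after_key:
        composite["after"] = after_key

    resp = es.search(
        index="file",
        size=0,
        query={"range": {"when": {"gte": "now-3M"}}},
        aggs={"downloads_agg": {"composite": composite}},
    )
    agg = resp["aggregations"]["downloads_agg"]
    buckets.extend(agg["buckets"])
    after_key = agg.get("after_key")
    if not after_key:
        break

# Sort by download count and paginate in the application layer.
buckets.sort(key=lambda b: b["doc_count"], reverse=True)
page_size = 100
first_page = buckets[:page_size]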

Elasticsearch: Need average per week of some value

I have simple data such as
sales, date_of_sales
What I need is the average per week, i.e. sum(sales) / number of weeks.
Please help.
What I have till now is:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sales",
        "interval": "week"
      }
    },
    "TotalSales": {
      "sum": {
        "field": "sales"
      }
    },
    "myValue": {
      "bucket_script": {
        "buckets_path": {
          "myGP": "TotalSales",
          "myCount": "WeekAggergation._bucket_count"
        },
        "script": "params.myGP/params.myCount"
      }
    }
  }
}
I get the error:
Invalid pipeline aggregation named [myValue] of type [bucket_script].
Only sibling pipeline aggregations are allowed at the top level.
I think this may help:
{
  "size": 0,
  "aggs": {
    "WeekAggergation": {
      "date_histogram": {
        "field": "date_of_sale",
        "interval": "week",
        "format": "yyyy-MM-dd"
      },
      "aggs": {
        "TotalSales": {
          "sum": {
            "field": "sales"
          }
        },
        "AvgSales": {
          "avg": {
            "field": "sales"
          }
        }
      }
    },
    "avg_all_weekly_sales": {
      "avg_bucket": {
        "buckets_path": "WeekAggergation>TotalSales"
      }
    }
  }
}
Note that the TotalSales aggregation is now nested under the weekly histogram aggregation (I believe there was a typo in the code provided - the simple schema indicated the field name date_of_sale while the aggregation used the plural form date_of_sales). This gives you the total of all sales in the weekly bucket.
Additionally, AvgSales provides a similar nested aggregation under the weekly histogram aggregation so you can see the average of all sales specific to that week.
Finally, the pipeline aggregation avg_all_weekly_sales will give the average of weekly sales based on the TotalSales bucket and the number of non-empty buckets - if you want to include empty buckets, add the gap_policy parameter like so:
...
  "avg_all_weekly_sales": {
    "avg_bucket": {
      "buckets_path": "WeekAggergation>TotalSales",
      "gap_policy": "insert_zeros"
    }
  }
...
(see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline-avg-bucket-aggregation.html).
This pipeline aggregation may or may not be what you're actually looking for, so please check the math to ensure the result is what you expect, but it should provide the correct output based on the original script.
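To read the result back, here is a minimal sketch (assuming the Python elasticsearch client and an index named sales; both are assumptions, and newer Elasticsearch versions use calendar_interval instead of interval):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

resp = es.search(
    index="sales",  # assumption: index name
    size=0,
    aggs={
        "WeekAggergation": {
            "date_histogram": {
                "field": "date_of_sale",
                "interval": "week",  # use "calendar_interval" on newer versions
                "format": "yyyy-MM-dd",
            },
            "aggs": {"TotalSales": {"sum": {"field": "sales"}}},
        },
        "avg_all_weekly_sales": {
            "avg_bucket": {"buckets_path": "WeekAggergation>TotalSales"}
        },
    },
)

# Weekly totals.
for bucket in resp["aggregations"]["WeekAggergation"]["buckets"]:
    print(bucket["key_as_string"], bucket["TotalSales"]["value"])

# Average of the weekly totals, i.e. sum(sales) / number of (non-empty) weeks.
print(resp["aggregations"]["avg_all_weekly_sales"]["value"])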

Unexpected results when using min sub-aggregation in Elasticsearch

My documents include the fields name and date_year, and my goal is to find the most recently added names (e.g. the ten last added names with their first year of appearance and the total number of documents). I therefore have a terms aggregation on name, which is ordered by a min sub-aggregation on date_year:
{
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name",
        "order": {
          "start_year": "desc"
        }
      },
      "aggs": {
        "start_year": {
          "min": {
            "field": "date_year"
          }
        }
      }
    }
  }
}
This returns unexpected results when size is not set under terms. For example, the first bucket has doc_count 1 and start_year 2015, while I'm sure that there are tens of documents with this name and that the earliest date_year is 1870. When I add a large enough size, the results are accurate. For example:
{
  "aggs": {
    "group_by_name": {
      "terms": {
        "field": "name",
        "size": 10000,      <------ large enough value
        "order": {
          "start_year": "desc"
        }
      },
      "aggs": {
        "start_year": {
          "min": {
            "field": "date_year"
          }
        }
      }
    }
  }
}
Can anyone explain to me what is causing this, and how I can limit the number of buckets returned? What I need would look something like this in SQL:
select name, min(year), count(*) from documents group by name order by min(year) desc limit 10
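A note on the likely cause and the usual mitigation (a sketch, not from the original thread): when a terms aggregation is ordered by a sub-aggregation, each shard first picks its own top candidate terms using shard-local values, so with a small size the merged buckets (and their doc_count and start_year) can be computed from a partial view of the data. Keeping size at 10 but raising shard_size usually makes the top buckets accurate. The shard_size value, the index name, and the use of the Python elasticsearch client are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumption: local cluster

resp = es.search(
    index="documents",  # assumption: index name
    size=0,
    aggs={
        "group_by_name": {
            "terms": {
                "field": "name",
                "size": 10,           # the ten most recently added names
                "shard_size": 10000,  # assumption: large enough per shard
                "order": {"start_year": "desc"},
            },
            "aggs": {"start_year": {"min": {"field": "date_year"}}},
        }
    },
)

# Roughly the SQL above: name, min(year), count(*) ordered by min(year) desc.
for bucket in resp["aggregations"]["group_by_name"]["buckets"]:
    print(bucket["key"], bucket["start_year"]["value"], bucket["doc_count"])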

Elasticsearch: get top nested doc per month without top level duplicates

I have some time-based, nested data from which I would like to get the biggest changes, positive and negative, in plugins per month. I work with Elasticsearch 5.3 (and Kibana 5.3).
A document is structured as follows:
{
  _id: "xxx",
  #timestamp: 1508244365987,
  siteURL: "www.foo.bar",
  plugins: [
    {
      name: "foo",
      version: "3.1.4"
    },
    {
      name: "baz",
      version: "13.37"
    }
  ]
}
However, per id (siteURL), I have multiple entries per month, and I would like to use only the latest per time bucket to avoid unfair weighting.
I tried to solve this by using the following aggregation:
{
  "aggs": {
    "normal_dates": {
      "date_range": {
        "field": "#timestamp",
        "ranges": [
          {
            "from": "now-1y/d",
            "to": "now"
          }
        ]
      },
      "aggs": {
        "date_histo": {
          "date_histogram": {
            "field": "#timestamp",
            "interval": "month"
          },
          "aggs": {
            "top_sites": {
              "terms": {
                "field": "siteURL.keyword",
                "size": 50000
              },
              "aggs": {
                "top_plugin_hits": {
                  "top_hits": {
                    "sort": [
                      {
                        "#timestamp": {
                          "order": "desc"
                        }
                      }
                    ],
                    "_source": {
                      "includes": [
                        "plugins.name"
                      ]
                    },
                    "size": 1
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Now I get, per month and per site, the latest document and its plugins. Next I would like to turn the data inside out and get the plugins present per month and a count of the occurrences. Then I would use a serial_diff to compare months.
However, I don't know how to go from my aggregation to the serial diff, i.e. turn the data inside out.
Any help would be most welcome.
PS: extra kudos if I can get it in a Kibana 5.3 table...
It turns out it is not possible to aggregate further on the output of a top_hits aggregation.
I ended up loading the results of the posted query into Python and used Python for further processing and visualization.
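For completeness, a minimal sketch of that post-processing step (the field names follow the aggregation above; the response is assumed to already be parsed into a dict, e.g. by the Python elasticsearch client):

from collections import Counter, defaultdict

def plugin_counts_per_month(resp):
    """Turn the aggregation response "inside out": per month, count how many
    sites report each plugin, using only the latest document per site."""
    counts = defaultdict(Counter)  # month -> plugin name -> number of sites
    date_range = resp["aggregations"]["normal_dates"]["buckets"][0]
    for month_bucket in date_range["date_histo"]["buckets"]:
        month = month_bucket["key_as_string"]
        for site_bucket in month_bucket["top_sites"]["buckets"]:
            top_hit = site_bucket["top_plugin_hits"]["hits"]["hits"][0]
            for plugin in top_hit["_source"].get("plugins", []):
                counts[month][plugin["name"]] += 1
    return counts

# The per-month counters can then be diffed month over month to find the
# biggest positive and negative changes per plugin.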
