Date_histogram and top_hits from unique values only - elasticsearch

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
"EntryTimestamp": 1567029600000,
"Username": "johndoe",
"UpdateTimestamp": 1567029600000,
"Duration": 10,
"EntryID": "ASDF1234"
}
The following query works well, but sometimes multiple documents share the same EntryID. Ideally I would add a top_hits aggregation ordered by UpdateTimestamp, because I need only the most recently updated document for each unique EntryID. I am not sure how to add this to my query.
{
"size": 0,
"query": {
"bool": {
"filter": [{
"range": {
"EntryTimestamp": {
"gte": "1567029600000",
"lte": "1567065599999",
"format": "epoch_millis"
}
}
}, {
"query_string": {
"analyze_wildcard": true,
"query": "Username.keyword:johndoe"
}
}
]
}
},
"aggs": {
"2": {
"date_histogram": {
"interval": "1h",
"field": "EntryTimestamp",
"min_doc_count": 0,
"extended_bounds": {
"min": "1567029600000",
"max": "1567065599999"
},
"format": "epoch_millis"
},
"aggs": {
"1": {
"sum": {
"field": "Duration"
}
}
}
}
}
}

I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation returns one bucket per distinct EntryID, and the top_hits aggregation inside it keeps only the most recent document (based on UpdateTimestamp) in each of those buckets.
I don't have exact syntax adapted to your context, and I believe you might run into issues with the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful to you.
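The idea above can be sketched as follows, using Python dictionaries so the shape can be checked without a cluster. This is a sketch under assumptions, not a confirmed solution: it assumes EntryID is aggregatable under the hypothetical name `EntryID.keyword`, that no hour holds more than `max_ids` distinct EntryIDs, and it leaves out the Username filter from the question for brevity. Since a sum aggregation cannot read from top_hits, the Duration values are added up client-side. `build_query` and `sum_latest_durations` are illustrative names, not part of any Elasticsearch client.

```python
def build_query(start_ms, end_ms, max_ids=1000):
    """Hourly histogram -> one bucket per EntryID -> latest document only."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"range": {"EntryTimestamp": {
                "gte": start_ms, "lte": end_ms, "format": "epoch_millis"}}},
        ]}},
        "aggs": {"per_hour": {
            "date_histogram": {
                "field": "EntryTimestamp",
                "interval": "1h",
                "min_doc_count": 0,
                "extended_bounds": {"min": start_ms, "max": end_ms},
            },
            "aggs": {"per_entry": {
                # Assumption: EntryID has a keyword sub-field.
                "terms": {"field": "EntryID.keyword", "size": max_ids},
                "aggs": {"latest": {"top_hits": {
                    "size": 1,
                    "sort": [{"UpdateTimestamp": {"order": "desc"}}],
                    "_source": ["Duration"],
                }}},
            }},
        }},
    }


def sum_latest_durations(response):
    """Map each hour key to the summed Duration of the latest doc per EntryID."""
    totals = {}
    for hour in response["aggregations"]["per_hour"]["buckets"]:
        total = 0
        for entry in hour["per_entry"]["buckets"]:
            hits = entry["latest"]["hits"]["hits"]
            if hits:
                total += hits[0]["_source"]["Duration"]
        totals[hour["key"]] = total
    return totals
```

Note that the deduplication happens per hour bucket: if two versions of the same EntryID carry EntryTimestamp values in different hours, each hour would count its own latest version.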

Related

Elasticsearch Pipelined search?

I've been using Elasticsearch for a while at my company, and it seems to have worked well so far for our searches.
We've been seeing more complex use cases from our customers that need more "ad-hoc/advanced" query capabilities and inter-document relationships (or joins in the traditional sense).
I understand that ES isn't built for joins and that denormalisation is the recommended way. We have been denormalising the documents to support every use case so far, and that in itself has become overly complex and expensive, as our customers have to wait a long time for each code change to be rolled out.
We are increasingly criticized by our business: "Hey, your data model isn't right. It isn't suited for smarter queries". It gets painfully harder for the team every time to make them understand why denormalisation is required.
A few examples of the problems:
"Find me all the persons having the same birthdays"
"Find me all the persons travelling to the same cities within the same time frame"
Imagine every event document is a person record with their travel details.
So is there a concept of a pipeline search where I can break the search into multiple search queries and pass the output of one as an input to another?
Or is there any other recommended way to solve these types of problems without having to boil the ocean?
The two queries above can be solved with aggregations.
I'm assuming the following sample document/schema:
{
"firstName": "John",
"lastName": "Doe",
"birthDate": "1998-04-02",
"travelDate": "2019-10-31",
"city": "London"
}
The first one can be solved with a terms aggregation on the birthDate field (reduced to month and day by a script) and min_doc_count: 2, e.g.:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
The second one by aggregating with a terms aggregation on the city field and constrained with a range query on the travelDate field for the desired time frame:
{
"size": 0,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
The second query can also be done with field collapsing:
{
"_source": false,
"query": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"collapse": {
"field": "city.keyword",
"inner_hits": {
"name": "people"
}
}
}
If you need both aggregations at the same time, it is definitely possible to do so:
{
"size": 0,
"aggs": {
"birthdays": {
"terms": {
"script": "return LocalDate.parse(params._source.birthDate).format(DateTimeFormatter.ofPattern('MM/dd'))",
"min_doc_count": 2
},
"aggs": {
"persons": {
"top_hits": {}
}
}
},
"travels": {
"filter": {
"range": {
"travelDate": {
"gte": "2019-10-01",
"lt": "2019-11-01"
}
}
},
"aggs": {
"cities": {
"terms": {
"field": "city.keyword"
},
"aggs": {
"persons": {
"top_hits": {}
}
}
}
}
}
}
}

How to group by month in Elastic search

I am using Elasticsearch version 6.0.0.
To group by month, I am using a date histogram aggregation.
Here is what I've tried:
{
"from":0,
"size":2000,
"_source":{
"includes":[
"cost",
"date"
],
"excludes":[
]
},
"aggregations":{
"date_hist_agg":{
"date_histogram":{
"field":"date",
"interval":"month",
"format":"M",
"order":{
"_key":"asc"
},
"min_doc_count":1
},
"aggregations":{
"cost":{
"sum":{
"field":"cost"
}
}
}
}
}
}
As a result I got 1 (January) multiple times.
I have data for January 2016, January 2017 and January 2018, so January is returned 3 times. But I want January only once, containing the sum over all years.
Instead of using a date_histogram aggregation you could use a terms aggregation with a script that extracts the month from the date.
{
"from": 0,
"size": 2000,
"_source": {"includes": ["cost","date"],"excludes": []},
"aggregations": {
"date_hist_agg": {
"terms": {
"script": "doc['date'].date.monthOfYear",
"order": {
"_key": "asc"
},
"min_doc_count": 1
},
"aggregations": {
"cost": {
"sum": {
"field": "cost"
}
}
}
}
}
}
Note that scripting is not optimal. If you know you'll need the month information, create another field that holds it, so you can run a simple terms aggregation on that field without any scripting.
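A minimal sketch of the extra-field approach, assuming the date arrives as an ISO 8601 string and that the new field is called `month` (both are assumptions for illustration; `with_month_field` is a hypothetical helper you would run before indexing):

```python
from datetime import datetime


def with_month_field(doc, date_field="date", month_field="month"):
    """Return a copy of doc with the month number (1-12) stored separately."""
    enriched = dict(doc)
    # Assumption: the date is an ISO 8601 string such as "2017-01-15".
    enriched[month_field] = datetime.fromisoformat(doc[date_field]).month
    return enriched


# Plain terms aggregation on the precomputed field, no script needed.
MONTH_AGG = {
    "size": 0,
    "aggs": {"by_month": {
        "terms": {"field": "month", "size": 12, "order": {"_key": "asc"}},
        "aggs": {"cost": {"sum": {"field": "cost"}}},
    }},
}
```

With the month stored at index time, every January document lands in the same bucket regardless of its year.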
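On Elasticsearch 7.2 and later we can use calendar_interval with the value month. Note that this returns one bucket per calendar month of each year, so January 2016 and January 2017 stay separate: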
Documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html#calendar_interval_examples
GET my_index/_search
{
"size": 0,
"query": {},
"aggs": {
"over_time": {
"date_histogram": {
"field": "yourDateAttribute",
"calendar_interval": "month",
"format": "yyyy-MM" // <--- control the output format
}
}
}
}

How can I add additional terms in the ElasticSearch Aggregation with Datetime Buckets?

Using the Elasticsearch 5.3 aggregation API, I am unable to write a query that calculates a measure per weekly date bucket, split by a dimension (a term/field). I can build the date buckets and calculate the measure for each bucket, but I cannot break it down further by a term such as application or transaction. Elasticsearch 5+ has deprecated a lot of APIs from previous versions. Here is what I have so far; it currently aggregates the measure across all terms for each date bucket. I need to split it by some fields/terms. How do I go about doing that?
POST /index_name/_search?size=0
{
"aggs": {
"myname_Summary": {
"date_histogram": {
"field": "#timestamp",
"interval": "week",
"format": "yyyy-MM-dd",
"time_zone": "-04:00"
},
"aggs": {
"total_volume": {
"sum": {
"field": "volume"
}
}
}
}
}
}
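You can nest terms aggregations under the date_histogram, so each date bucket is split by application and then by transaction: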
{
"size": 0,
"aggs": {
"myname_Summary": {
"date_histogram": {
"field": "#timestamp",
"interval": "week",
"format": "yyyy-MM-dd",
"time_zone": "-04:00"
},
"aggs": {
"split": {
"terms": {
"field": "application",
"size": 10
},
"aggs": {
"transaction": {
"terms": {
"field": "transaction",
"size": 10
},
"aggs": {
"total_volume": {
"sum": {
"field": "volume"
}
}
}
}
}
}
}
}
}
}
Hope this helps
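To read the nested buckets back as flat rows, something like the following could be used. It is a sketch that assumes the response follows the aggregation names in the query above (`myname_Summary`, `split`, `transaction`, `total_volume`); `flatten_volume` is an illustrative helper, not a client API.

```python
def flatten_volume(response):
    """Return a list of (week, application, transaction, total_volume) tuples."""
    rows = []
    for week in response["aggregations"]["myname_Summary"]["buckets"]:
        for app in week["split"]["buckets"]:
            for txn in app["transaction"]["buckets"]:
                rows.append((
                    week["key_as_string"],  # formatted with "yyyy-MM-dd"
                    app["key"],
                    txn["key"],
                    txn["total_volume"]["value"],
                ))
    return rows
```

Keep in mind that each terms level only returns its top 10 buckets with `"size": 10`, so rows for less frequent applications or transactions would be missing.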

How to make Nested Aggregations under parent's datetime histogram

I'm following this example from the official docs: https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html
GET /my_index/blogpost/_search
{
"size" : 0,
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"by_month": {
"date_histogram": {
"field": "comments.date",
"interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"avg_stars": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}
My question: I need the date_histogram to use the blogpost's date, not the comment's date, i.e. change:
"field": "comments.date",
to:
"field": "date",
Because the histogram sits under the "nested" aggregation, this modification didn't work. How can I make this work?
Thanks!
This is not yet possible in NEST (the .NET client).
From the NEST website, on the reverse nested aggregation:
A special single bucket aggregation that enables aggregating on parent
docs from nested documents.
Not implemented yet
https://nest.azurewebsites.net/nest/aggregations/reverse-nested.html
After searching the documentation some more, I found that .NET NEST can handle this:
https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/date-histogram-aggregation-usage.html
There are also some clues on GitHub that may help someone else.
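Independently of the client used, one possible way to express this, offered as a sketch rather than an answer confirmed in this thread, is to put the date_histogram on the blogpost's own field at the top level and only step into the nested comments underneath it. The body is shown as a Python dictionary; `stars_by_post_month` is an illustrative name, and `date` is assumed to be the parent document's date field as in the question.

```python
def stars_by_post_month(date_field="date"):
    """Histogram on the parent date; nested comments are averaged per bucket."""
    return {
        "size": 0,
        "aggs": {"by_month": {
            # Runs on the parent (blogpost) documents.
            "date_histogram": {
                "field": date_field,
                "interval": "month",
                "format": "yyyy-MM",
            },
            "aggs": {"comments": {
                # Switch to the nested comments only inside each month bucket.
                "nested": {"path": "comments"},
                "aggs": {"avg_stars": {"avg": {"field": "comments.stars"}}},
            }},
        }},
    }
```

The resulting dictionary can be serialized with `json.dumps` and sent as the search body.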

How to limit a date histogram aggregation of nested documents to a specific date range?

Version
Using Elasticsearch 1.7.2
Objective
I would like to create a graph of the number of predictions made by users per day for the last n days. In this case, 10 days.
Current query
{
"size": 0,
"aggs": {
"predictions": {
"nested": {
"path": "user_answers"
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
Issue
This query will return a histogram but will return buckets for all available dates across all documents. It doesn't restrict to a specific date range.
What have I tried?
I've tried a number of approaches to solving this, all of which have failed.
* Range filter, then histogram that
* Date range aggregation, then histogram the buckets
* Using extended_bounds with full dates, now-10d, and also timestamps
* Trying a range filter inside the histogram aggregation
Any guidance would be appreciated! Thanks.
A query didn't work for me in that situation; what I used instead is a third level of aggs:
{
"size": 0,
"aggs": {
"user_answers": {
"nested": { "path": "user_answers" },
"aggs": {
"timed_user_answers": {
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
},
"aggs": {
"predictions_over_time": {
"date_histogram": {
"field": "user_answers.created",
"interval": "day",
"format": "yyyy-MM-dd",
"min_doc_count": 0
}
}
}
}
}
}
}
}
One aggs level specifies nested, one specifies the filter, and the last specifies the actual aggregation. I don't know why this syntax makes sense, but you don't seem to be able to use two of them on the same aggs level.
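Since the question asks for the last n days, the three levels can be wrapped in a small builder. This is a sketch in Python dictionaries; `predictions_last_n_days` is an illustrative name, and the date math puts the lower bound at now-Nd and the upper bound at now.

```python
def predictions_last_n_days(n):
    """Nested -> filter -> date_histogram, limited to the last n days."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return {
        "size": 0,
        "aggs": {"user_answers": {
            "nested": {"path": "user_answers"},
            "aggs": {"timed_user_answers": {
                # Lower bound is the older date, upper bound is now.
                "filter": {"range": {"user_answers.created": {
                    "gte": f"now-{n}d",
                    "lte": "now",
                }}},
                "aggs": {"predictions_over_time": {
                    "date_histogram": {
                        "field": "user_answers.created",
                        "interval": "day",
                        "format": "yyyy-MM-dd",
                        "min_doc_count": 0,
                    },
                }},
            }},
        }},
    }
```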
You need to add a query. It can be any query, but not a post_filter. It should be a nested query that contains a date range. One way is to use a constant_score query that wraps a nested filter, which in turn wraps a range filter.
{
"query": {
"constant_score": {
"filter": {
"nested": {
"path": "user_answers",
"filter": {
"range": {
"user_answers.created": {
"gte": "now-10d",
"lte": "now"
}
}
}
}
}
}
}
}
Confirm if this works for you.

Resources