How to make Nested Aggregations under parent's datetime histogram - elasticsearch

I'm following this example from the official docs: https://www.elastic.co/guide/en/elasticsearch/guide/current/nested-aggregation.html
GET /my_index/blogpost/_search
{
"size" : 0,
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"by_month": {
"date_histogram": {
"field": "comments.date",
"interval": "month",
"format": "yyyy-MM"
},
"aggs": {
"avg_stars": {
"avg": {
"field": "comments.stars"
}
}
}
}
}
}
}
}
My question: I need the date_histogram to use the blogpost's date rather than the comments' date, i.e. change
"field": "comments.date",
to:
"field": "date",
Because the histogram sits under the "nested" aggregation, this change alone doesn't work. How can I make it work?
Thanks!

This is not possible in NEST yet.
From the NEST website:
A special single bucket aggregation that enables aggregating on parent
docs from nested documents.
Not implemented yet
https://nest.azurewebsites.net/nest/aggregations/reverse-nested.html

Well, I searched the documentation some more later, and .NET NEST can handle this:
https://www.elastic.co/guide/en/elasticsearch/client/net-api/current/date-histogram-aggregation-usage.html
There are also some clues on GitHub that may help someone else.
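A minimal sketch of one way to get there, assuming the blogpost documents carry a top-level `date` field: put the date_histogram at the root level and open the nested context inside each month bucket. The Python helper and its parameter names are illustrative and not from the thread; it only builds the request body.

```python
import json


def avg_stars_by_post_month(parent_date_field="date",
                            nested_path="comments",
                            stars_field="comments.stars"):
    """Bucket blogposts by their own date, then step into the nested comments."""
    return {
        "size": 0,
        "aggs": {
            "by_month": {
                # Runs on the parent documents, so it can see the parent date.
                "date_histogram": {
                    "field": parent_date_field,
                    "interval": "month",
                    "format": "yyyy-MM",
                },
                "aggs": {
                    "comments": {
                        # Switch to the nested comments of the posts in this bucket.
                        "nested": {"path": nested_path},
                        "aggs": {
                            "avg_stars": {"avg": {"field": stars_field}}
                        },
                    }
                },
            }
        },
    }


body = avg_stars_by_post_month()
print(json.dumps(body, indent=2))
```

The printed body can be sent to the same `_search` endpoint as the original request. The only structural change is that "nested" moves below the histogram instead of above it.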

Related

Date_histogram and top_hits from unique values only

I am trying to do a date_histogram aggregation to show a sum of Duration for each hour.
I have the following documents:
{
"EntryTimestamp": 1567029600000,
"Username": "johndoe",
"UpdateTimestamp": 1567029600000,
"Duration": 10,
"EntryID": "ASDF1234"
}
The following query works very well, but sometimes multiple documents appear with the same EntryID. Ideally I would add a top_hits somehow, ordered by UpdateTimestamp, because I need the last updated document for each unique EntryID. I'm not sure how to add this to my query.
{
"size": 0,
"query": {
"bool": {
"filter": [{
"range": {
"EntryTimestamp": {
"gte": "1567029600000",
"lte": "1567065599999",
"format": "epoch_millis"
}
}
}, {
"query_string": {
"analyze_wildcard": true,
"query": "Username.keyword=johndoe"
}
}
]
}
},
"aggs": {
"2": {
"date_histogram": {
"interval": "1h",
"field": "EntryTimestamp",
"min_doc_count": 0,
"extended_bounds": {
"min": "1567029600000",
"max": "1567065599999"
},
"format": "epoch_millis"
},
"aggs": {
"1": {
"sum": {
"field": "Duration"
}
}
}
}
}
}
I think you'll need a top_hits aggregation inside a terms aggregation.
The terms aggregation collects the distinct EntryIDs, and the top_hits aggregation inside it keeps only the most recent document (by UpdateTimestamp) for each bucket, i.e. each distinct EntryID.
I don't have exact syntax adapted to your context, and I believe you might run into issues with the number of sub-aggregations (I ran into some limitations with advanced aggregations in the past).
You can see this post for more info on that; I hope it proves helpful.
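A rough sketch of that idea in Python, under a few assumptions that are not in the thread: EntryID has a `keyword` sub-field, 1000 distinct entries per hour is enough, and the Duration sum is done client-side from the top_hits results. The function and aggregation names are made up for illustration.

```python
def latest_entry_aggs(max_entries=1000):
    """Hourly buckets -> one bucket per EntryID -> newest document in each."""
    return {
        "per_hour": {
            "date_histogram": {
                "field": "EntryTimestamp",
                "interval": "1h",
                "min_doc_count": 0,
            },
            "aggs": {
                "per_entry": {
                    # Assumption: EntryID is indexed with a keyword sub-field.
                    "terms": {"field": "EntryID.keyword", "size": max_entries},
                    "aggs": {
                        "latest": {
                            "top_hits": {
                                "size": 1,
                                "sort": [{"UpdateTimestamp": {"order": "desc"}}],
                                "_source": ["Duration"],
                            }
                        }
                    },
                }
            },
        }
    }


def sum_latest_durations(response):
    """Add up Duration of the newest document per EntryID, per hourly bucket."""
    totals = {}
    for hour in response["aggregations"]["per_hour"]["buckets"]:
        total = 0
        for entry in hour["per_entry"]["buckets"]:
            hits = entry["latest"]["hits"]["hits"]
            if hits:
                total += hits[0]["_source"]["Duration"]
        totals[hour["key"]] = total
    return totals
```

Use the dict from `latest_entry_aggs()` as the "aggs" section of the existing request (keeping the same "query"), then pass the parsed JSON response to `sum_latest_durations`.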

Result filter and pagination in Elasticsearch

I need some help or an idea for the correct procedure.
I have already indexed a large number of documents. Now I have found that some documents have almost the same content, e.g.
{
"title": "myDocument",
"date": "2017-09-18",
"page": 1
}
{
"title": "myDocument",
"date": "2017-09-18",
"page": 2
}
The title field is mapped as text, date as date, and page as integer. As you can see, the only difference is the page value.
Now I want to run a query and filter out these duplicates. Field collapsing seems a good way to do it, but then I can't get the correct count of results, and that's important for me.
Another way would be to fetch all results first and then filter them "manually", but then I have a problem with pagination.
Try something like this.
GET index/type/_search
{
"aggs": {
"count_by_title_date_page":{
"terms": {
"field": "title.keyword",
"size": 100
},
"aggs": {
"date": {
"terms": {
"field": "date",
"size": 100
},
"aggs": {
"page": {
"terms": {
"field": "page",
"size": 100
}
}
}
}
}
}
}
}
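For the count and pagination concern in the question, here is a hedged sketch that pairs field collapsing with a cardinality aggregation. It assumes a `title.keyword` sub-field exists and that the title alone identifies a group of duplicates; neither is confirmed in the thread. Keep in mind that cardinality returns an approximate count.

```python
import math


def collapsed_search(query, page=0, size=10, collapse_field="title.keyword"):
    """One hit per collapse_field value, plus an approximate count of distinct values."""
    return {
        "query": query,
        "collapse": {"field": collapse_field},
        "aggs": {
            "distinct_groups": {"cardinality": {"field": collapse_field}}
        },
        "from": page * size,
        "size": size,
    }


def page_count(distinct_groups, size=10):
    """Number of result pages for the collapsed result set."""
    return math.ceil(distinct_groups / size)


body = collapsed_search({"match_all": {}}, page=2, size=10)
```

The value to feed into `page_count` would come from `aggregations.distinct_groups.value` in the response.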

Elasticsearch derivative of a deep metric

I have a web crawler that collects data and stores snapshots several times a day. My query has some aggregations that group the snapshots together per day and return the last snapshot of each day using top_hits.
The documents look like this:
"_source": {
"taken_at": "2016-02-01T11:27:09.184-03:00",
... ,
"my_metric": 113
}
I'd like to be able to calculate the derivative of a certain metric, say my_metric, of the documents returned by top_hits (i.e., the derivative of the last snapshots of each day's my_metric).
Here's what I have so far:
{
"aggs": {
"filtered_snapshots": {
"filter": {
// ...
},
"aggs" : {
"grouped_data": {
"date_histogram": {
"field": "taken_at",
"interval": "day",
"format": "YYYY-MM-dd",
"order": { "_key" : "asc" }
},
"aggs": {
"resource_by_date": {
"terms": { "field": "remote_id" },
"aggs": {
"latest_snapshots": {
"top_hits": {
"sort": { "taken_at": { "order": "desc" }},
"size" : 1
}
}
}
},
"my_metric_deriv": {
"derivative": {
"buckets_path": "resource_by_date>latest_snapshots>my_metric"
}
}
}
}
}
}
}
}
I get a "No aggregation [my_metric] found for path ..." error with the query above.
Am I using a wrong buckets_path? I've read through the buckets_path and derivative documentation and haven't found much that could help.
The documentation briefly mentions "deep metrics", stating that they can be limited in some ways, which I couldn't quite understand. I'm not sure how or whether the limitations affect my case.
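One possible workaround, not taken from the thread, is to drop the derivative aggregation and compute the differences client-side from the top_hits results. The sketch below assumes the response keeps the aggregation names used in the query above and that the daily buckets arrive in ascending order; the function name is hypothetical.

```python
from collections import defaultdict


def metric_derivatives(response, metric="my_metric"):
    """Per remote_id, the day-over-day change of the snapshot kept by top_hits."""
    series = defaultdict(list)
    days = response["aggregations"]["filtered_snapshots"]["grouped_data"]["buckets"]
    for day in days:  # assumed to be in ascending date order
        for resource in day["resource_by_date"]["buckets"]:
            hits = resource["latest_snapshots"]["hits"]["hits"]
            if hits:
                value = hits[0]["_source"][metric]
                series[resource["key"]].append((day["key_as_string"], value))
    # Difference between each day's value and the previous day's value.
    return {
        remote_id: [(d2, v2 - v1) for (_, v1), (d2, v2) in zip(points, points[1:])]
        for remote_id, points in series.items()
    }
```

Each resource yields one entry per day after its first, labelled with the later day's date string.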

Elasticsearch: how to scope aggregations to your query and filter?

I have been playing around with elasticsearch query and filter for some time now but never worked with aggregations before. The idea that we can scope the aggregations with our query seems quite amazing to me but I want to understand how to do it properly so that I do not make any mistakes. Currently all my search queries are designed this way:
{
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
Now, when I added some aggregation buckets, the structure became this:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
},
"filter": {
},
"from": 0,
"size": 60
}
However, the results of the aggregation are always the same irrespective of what info I provide in filter.
Now, when I changed the query structure to something like this, it started showing different results:
{
"aggs": {
"all_colors": {
"terms": {
"field": "color.name"
}
},
"all_brands": {
"terms": {
"field": "brand_slug"
}
},
"all_sizes": {
"terms": {
"field": "sizes"
}
}
},
"query": {
"filtered": {
"query": {
},
"filter": {
}
}
},
"from": 0,
"size": 60
}
Does this mean I have to change the structure of my search queries everywhere to this new filtered structure? Is there any workaround that gives the desired results without changing that much code?
Also, another thing I observed is that if my brand_slug field contains multiple words, like "peter england", each word is returned in a separate bucket like this:
{
"buckets": [
{
"key": "england",
"doc_count": 368
},
{
"key": "peter",
"doc_count": 368
}
]
}
How can I ensure that both end up in the same bucket, like this:
{
"buckets": [
{
"key": "peter england",
"doc_count": 368
}
]
}
UPDATE: I have been able to accomplish this second part by indexing brand, color and sizes differently, like this:
"sizes": {
"type": "string",
"fields": {
"raw": {
"type": "string",
"index": "not_analyzed"
}
}
}
What you've noticed is by design. Have a look at my answer to a similar question on SO. Basically, the input to both the aggregation and filter sections is the output of the query section, so a top-level filter never affects the aggregations. A filtered query, as you've suggested, is the best way to achieve the results you want. There is another way too: you can use a filter aggregation. Then you would not need to change your query and filter sections, only copy the filter section inside each aggregation, but in my opinion that would be overkill and a violation of the DRY principle in general.
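To make the filter aggregation option concrete, here is a small sketch that wraps each existing aggregation in a filter aggregation carrying the same filter clause. The helper name and the sample filter are illustrative only. Generating the wrapper in code also avoids hand-copying the filter into every aggregation.

```python
def scope_aggs(aggs, filter_clause):
    """Wrap every aggregation in a filter aggregation using the same filter."""
    return {
        name: {"filter": filter_clause, "aggs": {name: body}}
        for name, body in aggs.items()
    }


aggs = {
    "all_colors": {"terms": {"field": "color.name"}},
    "all_brands": {"terms": {"field": "brand_slug"}},
    "all_sizes": {"terms": {"field": "sizes"}},
}
# Hypothetical filter; in practice this is whatever sits in the top-level "filter".
scoped = scope_aggs(aggs, {"term": {"sizes": "m"}})
```

In the response, each bucket list then sits one level deeper, for example under `aggregations.all_colors.all_colors.buckets`.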

Calculating sum of nested fields with date_histogram aggregation in Elasticsearch

I'm having trouble getting the sum of a nested field in Elasticsearch using a date_histogram, and I'm hoping somebody can lend me a hand.
I have a mapping that looks like this:
"client" : {
// various irrelevant stuff here...
"associated_transactions" : {
"type" : "nested",
"include_in_parent" : true,
"properties" : {
"amount" : {
"type" : "double"
},
"effective_at" : {
"type" : "date",
"format" : "dateOptionalTime"
}
}
}
}
I'm trying to get a date_histogram that shows total revenue by month across all clients, i.e. a time series showing the sum of associated_transactions.amount in a histogram determined by associated_transactions.effective_at. I tried running this query:
{
"query": {
// ...
},
"aggregations": {
"revenue": {
"date_histogram": {
"interval": "month",
"min_doc_count": 0,
"field": "associated_transactions.effective_at"
},
"aggs": {
"monthly_revenue": {
"sum": {
"field": "associated_transactions.amount"
}
}
}
}
}
}
But the sum it's giving me isn't right. It seems that what ES is doing is finding all clients who have any transaction in a given month, then summing all of the transactions (from any time) for those clients. That is, it's a sum of the amount spent in the lifetime of a client who made a purchase in a given month, not the sum of purchases in a given month.
Is there any way to get the data I'm looking for, or is this a limitation in how ES handles nested fields?
Thanks very much in advance for your help!
David
Try this?
{
"query": {
// ...
},
"aggregations": {
"revenue": {
"date_histogram": {
"interval": "month",
"min_doc_count": 0,
"field": "associated_transactions.effective_at",
"aggs": {
"monthly_revenue": {
"sum": {
"field": "associated_transactions.amount"
}
}
}
}
}
}
}
i.e. move the "aggs" key into the "date_histogram" field.
I stumbled upon this question while trying to solve a similar problem with my implementation of ES.
It seems that Elasticsearch currently looks at the position of an aggregation in the JSON request body tree, not at the inheritance of its objects and fields. So you should not put your sum aggregation "inside" "date_histogram", but place it outside, on the same JSON tree level.
This worked for me:
{
"size": 0,
"aggs": {
"histogram_aggregation": {
"date_histogram": {
"field": "date_field",
"calendar_interval": "day"
},
"aggs": {
"views": {
"sum": {
"field": "the_field_i_want_to_sum"
}
}
}
}
},
"query": {
// some query
}
}
The OP made the mistake of placing the sum aggregation inside the date_histogram aggregation.
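Since associated_transactions is mapped as "nested", another option worth trying is to wrap the histogram in a nested aggregation, so that both the bucketing and the sum run on individual transactions rather than on whole client documents. This is a sketch under that assumption and was not confirmed by the OP; the helper name is made up, and newer Elasticsearch versions may expect "calendar_interval" in place of "interval".

```python
def monthly_revenue_body(query=None):
    """Sum nested transaction amounts per month of their effective_at date."""
    body = {
        "size": 0,
        "aggs": {
            "transactions": {
                # Enter the nested documents first ...
                "nested": {"path": "associated_transactions"},
                "aggs": {
                    "revenue": {
                        # ... so each transaction lands in its own month ...
                        "date_histogram": {
                            "field": "associated_transactions.effective_at",
                            "interval": "month",
                            "min_doc_count": 0,
                        },
                        "aggs": {
                            # ... and only that month's amounts are summed.
                            "monthly_revenue": {
                                "sum": {"field": "associated_transactions.amount"}
                            }
                        },
                    }
                },
            }
        },
    }
    if query is not None:
        body["query"] = query
    return body


body = monthly_revenue_body({"match_all": {}})
```

This mirrors the nested, then date_histogram, then metric layering used in the first example on this page.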
