How to get the last Elasticsearch document for each unique value of a field?

I have a data structure in Elasticsearch that looks like:
{
"name": "abc",
"date": "2022-10-08T21:30:40.000Z",
"rank": 3
}
I want to get, for each unique name, the rank of the document (or the whole document) with the most recent date.
I currently have this:
"aggs": {
"group-by-name": {
"terms": {
"field": "name"
},
"aggs": {
"max-date": {
"max": {
"field": "date"
}
}
}
}
}
How can I get the rank (or the whole document) for each result, if possible in a single request?

You can use either of the options below.
Collapse
"collapse": {
"field": "name"
},
"sort": [
{
"date": {
"order": "desc"
}
}
]
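The collapse fragment above has to sit inside a full search body. Here is a minimal Python sketch of such a body, together with a pure-Python emulation of what collapse plus a descending date sort returns (the first hit per name). The helper name latest_per_name and the sample documents are illustrative assumptions, and name is assumed to be mapped as a keyword field, since collapse needs a single-valued keyword or numeric field with doc values.

```python
# Full request body for the collapse option (the fragment above wrapped in a search).
collapse_body = {
    "query": {"match_all": {}},
    "collapse": {"field": "name"},
    "sort": [{"date": {"order": "desc"}}],
}


def latest_per_name(docs):
    """Emulate collapse + sort: keep the most recent document per name."""
    latest = {}
    # ISO-8601 UTC timestamps in one fixed format sort correctly as strings.
    for doc in sorted(docs, key=lambda d: d["date"], reverse=True):
        latest.setdefault(doc["name"], doc)
    return latest


# Hypothetical sample documents in the shape given in the question.
docs = [
    {"name": "abc", "date": "2022-10-08T21:30:40.000Z", "rank": 3},
    {"name": "abc", "date": "2022-10-09T08:00:00.000Z", "rank": 1},
    {"name": "xyz", "date": "2022-10-01T00:00:00.000Z", "rank": 7},
]
result = latest_per_name(docs)
print({name: doc["rank"] for name, doc in result.items()})
```

With collapse, the collapsed documents come back as ordinary hits, so the rank is read from each hit's _source.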
Top hits aggregation
{
"aggs": {
"group-by-name": {
"terms": {
"field": "name",
"size": 100
},
"aggs": {
"top_doc": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
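With the top hits option, the document sits inside each bucket rather than in the regular hits. The sketch below shows one way to read the rank per name out of such a response. The function name rank_by_name and the trimmed sample response are hypothetical; only the nesting (buckets, then top_doc.hits.hits, then _source) follows the aggregation names used above.

```python
def rank_by_name(response):
    """Map each bucket key to the rank of its single top hit."""
    buckets = response["aggregations"]["group-by-name"]["buckets"]
    return {
        bucket["key"]: bucket["top_doc"]["hits"]["hits"][0]["_source"]["rank"]
        for bucket in buckets
    }


# Hypothetical, trimmed response for the terms + top_hits request above.
sample_response = {
    "aggregations": {
        "group-by-name": {
            "buckets": [
                {
                    "key": "abc",
                    "doc_count": 2,
                    "top_doc": {
                        "hits": {
                            "hits": [
                                {
                                    "_source": {
                                        "name": "abc",
                                        "date": "2022-10-08T21:30:40.000Z",
                                        "rank": 3,
                                    }
                                }
                            ]
                        }
                    },
                }
            ]
        }
    }
}
ranks = rank_by_name(sample_response)
print(ranks)
```

Note that the terms aggregation above returns at most 100 names because of its size setting, and adding "size": 0 at the top level of the request suppresses the regular hits if only the buckets are needed.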

Related

Elasticsearch aggregation with unique counting

My documents consist of a history of orders and their state, here a minimal example:
{
"orderNumber" : "xyz",
"state" : "shipping",
"day" : "2022-07-20",
"timestamp" : "2022-07-20T15:06:44.290Z"
}
The state can be a string such as shipping, processing, redo, ...
For every possible state, I need to count the number of orders that had this state at some point during a day, without counting a state twice for the same orderNumber that day (which can happen if there is a problem and it needs to start from the beginning that same day).
My aggregation looks like this:
GET order-history/_search
{
"aggs": {
"countDays": {
"terms": {
"field": "day",
"order": {
"_key": "desc"
},
"size": 20
},
"aggs": {
"countStates": {
"terms": {
"field": "state.keyword",
"size": 10
}
}
}
}
},
"size": 1
}
However, this will count a state for a given orderNumber twice if it reappears that same day. How would I prevent it from counting a state twice for each orderNumber, if it is on the same day?
TL;DR
I don't think there is a simple, flexible solution.
But if you know the set of possible states in advance (for example, from a separate aggregation query that lists every state), you can add one filter aggregation per state, each with a cardinality sub-aggregation on orderNumber.
You could do the following:
POST /_bulk
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"redo","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"xyz","state":"shipping","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"processing","day":"2022-07-20"}
{"index":{"_index":"73138766"}}
{"orderNumber":"bbb","state":"shipping","day":"2022-07-20"}
GET 73138766/_search
{
"size": 0,
"aggs": {
"per_day": {
"date_histogram": {
"field": "day",
"calendar_interval": "day"
},
"aggs": {
"shipping": {
"filter": { "term": { "state.keyword": "shipping" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"processing": {
"filter": { "term": { "state.keyword": "processing" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
},
"redo": {
"filter": { "term": { "state.keyword": "redo" }
},
"aggs": {
"orders": {
"cardinality": {
"field": "orderNumber.keyword"
}
}
}
}
}
}
}
}
You will obtain the following results
{
"aggregations": {
"per_day": {
"buckets": [
{
"key_as_string": "2022-07-20T00:00:00.000Z",
"key": 1658275200000,
"doc_count": 5,
"shipping": {
"doc_count": 3,
"orders": {
"value": 2
}
},
"processing": {
"doc_count": 1,
"orders": {
"value": 1
}
},
"redo": {
"doc_count": 1,
"orders": {
"value": 1
}
}
}
]
}
}
}
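Because the request above repeats the same filter block once per state, it can be generated from a list of states. The following is a minimal Python sketch under that assumption. The function names build_state_count_body and distinct_orders are hypothetical, and the second function is only an exact client-side emulation used to cross-check the numbers in the response above.

```python
from collections import defaultdict


def build_state_count_body(states):
    """Build one filter + cardinality sub-aggregation per known state."""
    per_state = {
        state: {
            "filter": {"term": {"state.keyword": state}},
            "aggs": {"orders": {"cardinality": {"field": "orderNumber.keyword"}}},
        }
        for state in states
    }
    return {
        "size": 0,
        "aggs": {
            "per_day": {
                "date_histogram": {"field": "day", "calendar_interval": "day"},
                "aggs": per_state,
            }
        },
    }


def distinct_orders(docs):
    """Count distinct orderNumbers per (day, state) on the client side."""
    seen = defaultdict(set)
    for doc in docs:
        seen[(doc["day"], doc["state"])].add(doc["orderNumber"])
    return {key: len(orders) for key, orders in seen.items()}


# The five documents from the bulk request above.
docs = [
    {"orderNumber": "xyz", "state": "shipping", "day": "2022-07-20"},
    {"orderNumber": "xyz", "state": "redo", "day": "2022-07-20"},
    {"orderNumber": "xyz", "state": "shipping", "day": "2022-07-20"},
    {"orderNumber": "bbb", "state": "processing", "day": "2022-07-20"},
    {"orderNumber": "bbb", "state": "shipping", "day": "2022-07-20"},
]
body = build_state_count_body(["shipping", "processing", "redo"])
counts = distinct_orders(docs)
print(counts)
```

Keep in mind that the cardinality aggregation is approximate (it is based on HyperLogLog++), so on large data sets its values can deviate slightly from an exact distinct count like the one emulated here.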

Elasticsearch sort terms agg by arbitrary order

I have a terms aggregation and want some specific values to always be at the top.
Like:
POST _search
{ "size": 0,
"aggs": {
"pets": {
"terms": {
"field": "species",
"order": "Dogs, Cats"
}
}
}
}
Where the results would be like "Dog", "Cat", "Iguana".
Dog and Cat at the top and everything else below.
Is this possible without scripting?
Thanks!
One way to do it is by filtering values in the terms aggregation. You'd create two terms aggregations, one with the desired terms and another with all other terms.
{
"size": 0,
"aggs": {
"top_terms": {
"terms": {
"field": "species",
"include": ["Dogs", "Cats"],
"order": { "_key" : "desc" }
}
},
"other_terms": {
"terms": {
"field": "species",
"exclude": ["Dogs", "Cats"]
}
}
}
}
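The two aggregations come back as two separate bucket lists, so the final ordering is assembled on the client. A minimal Python sketch of that merge follows. The function name pinned_first and the sample response are hypothetical. It also reorders the pinned terms explicitly, so the result does not depend on the _key ordering used in the request.

```python
def pinned_first(response, pinned):
    """Return pinned buckets in the requested order, then all other buckets."""
    aggs = response["aggregations"]
    top = {bucket["key"]: bucket for bucket in aggs["top_terms"]["buckets"]}
    # Skip pinned terms that have no matching documents.
    ordered = [top[key] for key in pinned if key in top]
    return ordered + aggs["other_terms"]["buckets"]


# Hypothetical, trimmed response for the request above.
sample_response = {
    "aggregations": {
        "top_terms": {
            "buckets": [
                {"key": "Dogs", "doc_count": 4},
                {"key": "Cats", "doc_count": 9},
            ]
        },
        "other_terms": {"buckets": [{"key": "Iguanas", "doc_count": 2}]},
    }
}
keys = [b["key"] for b in pinned_first(sample_response, ["Dogs", "Cats"])]
print(keys)
```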
A scripted approach wouldn't be too complicated either: boost the two species in the query, then order the buckets by max score first and by _count second:
GET pets/_search
{
"size": 0,
"query": {
"bool": {
"should": [
{
"terms": {
"species": [
"dog",
"cat"
],
"boost": 10
}
},
{
"match_all": {}
}
]
}
},
"aggs": {
"pets": {
"terms": {
"field": "species.keyword",
"order": [
{
"max_score": "desc"
},
{
"_count": "desc"
}
]
},
"aggs": {
"max_score": {
"max": {
"script": "_score"
}
}
}
}
}
}

How to take more fields when grouping

I'm trying to group data and return all of the other fields along with each group.
GET /testnews/default/_search
{
"size": 10,
"from":50,
"query":{
"multi_match": {
"query": "serenay",
"fields": ["Data.Title", "Data.Description", "Data.Tags.Title", "Data.MentionTitle", "Data.Program.title", "Data.Program.description", "Data.Program.original_title"]
}
},
"sort":[{
"Data.CreatedAt": {
"order": "desc"
},
"Data.ViewCount": {
"order": "desc"
}
}],
"aggs": {
"group_by_state": {
"terms": {
"field": "Data.Program.title.keyword"
}
}
}
}
But when I run it, each group in the result contains only the program title.
Like this:
{
"key": "Kocamın Ailesi",
"doc_count": 3
}
But I just want it like:
{
"key": "Kocamın Ailesi",
"description": "blabla",
"image": "blabla.jpg",
"date": "YYYY-mm-dd",
"doc_count": 3
}
just like in SQL:
select * from x group by field
Regarding the SQL example, to get the behaviour of
select a, b, count(*) from x group by a, b
you can aggregate on a, then b like this:
"aggs": {
"group_by_a": {
"terms": {
"field": "a"
},
"aggs": {
"group_by_b": {
"terms": {
"field":"b"
}
}
}
}
}
But I don't think that is what you're looking for?
If you want the full documents in aggregations you can use the "top_hits" aggregation to select the top n hits within each aggregation:
{
"aggs": {
"group_by_state": {
"terms": {
"field": "Data.Program.title.keyword"
},
"aggs": {
"state_top_hits": {
"top_hits": {
"sort": [
{ "Data.CreatedAt": { "order": "desc" } },
{ "Data.ViewCount": { "order": "desc" } }
],
"_source": {
"includes": [ "key", "description", "image", "date" ]
},
"size": 10 // returns the top 10 hits in each bucket, ordered by the sort above
}
}
}
}
}
}
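To get rows shaped like the ones asked for in the question, the top hit of each bucket can be merged with the bucket's key and doc_count on the client. The sketch below assumes the aggregation names used above. The function name groups_with_fields and the _source fields in the sample response are hypothetical.

```python
def groups_with_fields(response):
    """Flatten each bucket into one row: top hit fields plus key and doc_count."""
    rows = []
    for bucket in response["aggregations"]["group_by_state"]["buckets"]:
        hits = bucket["state_top_hits"]["hits"]["hits"]
        # Start from the first (best sorted) hit, if the bucket has one.
        row = dict(hits[0]["_source"]) if hits else {}
        row["key"] = bucket["key"]
        row["doc_count"] = bucket["doc_count"]
        rows.append(row)
    return rows


# Hypothetical, trimmed response; the _source fields are placeholders.
sample_response = {
    "aggregations": {
        "group_by_state": {
            "buckets": [
                {
                    "key": "Kocamın Ailesi",
                    "doc_count": 3,
                    "state_top_hits": {
                        "hits": {
                            "hits": [
                                {"_source": {"description": "blabla", "image": "blabla.jpg"}}
                            ]
                        }
                    },
                }
            ]
        }
    }
}
rows = groups_with_fields(sample_response)
print(rows)
```

If only one representative document per group is needed, setting the top_hits size to 1 keeps the response smaller.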

How to bucket empty and non-empty fields in a nested aggregation in Elasticsearch?

I have the following set of nested subaggregations in elasticsearch (field2 is a subaggregation of field1 and field3 is a subaggregation of field2).
It turns out, however, that the terms aggregation for field3 will not bucket documents that don't have field3.
My understanding is that I have to use a missing sub-aggregation to bucket those, in addition to the terms aggregation for field3.
But I am not sure how to add it to the query below so that both are bucketed.
{
"size": 0,
"aggregations": {
"f1": {
"terms": {
"field": "field1",
"size": 0,
"order": {
"_count": "asc"
},
"include": [
"123"
]
},
"aggregations": {
"field2": {
"terms": {
"field": "f2",
"size": 0,
"order": {
"_count": "asc"
},
"include": [
"tr"
]
},
"aggregations": {
"field3": {
"terms": {
"field": "f3",
"order": {
"_count": "asc"
},
"size": 0
},
"aggregations": {
"aggTopHits": {
"top_hits": {
"size": 1
}
}
}
}
}
}
}
}
}
}
In version 2.1.2 and later, you can use the missing parameter of the terms aggregation, which allows you to specify a default value for documents that are missing that field. (FYI, the missing parameter was available starting 2.0, but there was a bug which prevented it from working on sub-aggregations, which is how you would use it here.)
...
"aggregations": {
"field3": {
"terms": {
"field": "f3",
"order": {
"_count": "asc"
},
"size": 0,
"missing": "n/a" <----- provide a default here
},
"aggregations": {
"aggTopHits": {
"top_hits": {
"size": 1
}
}
}
}
}
However, if you are working with a pre-2.x ES cluster, you can use the missing aggregation at the same depth as your field3 aggregation to bucket the documents that are missing "f3" like this:
...
"aggregations": {
"field3": {
"terms": {
"field": "f3",
"order": {
"_count": "asc"
},
"size": 0
},
"aggregations": {
"aggTopHits": {
"top_hits": {
"size": 1
}
}
}
},
"missing_field3": {
"missing" : {
"field": "f3"
},
"aggregations": {
"aggTopMissingHit": {
"top_hits": {
"size": 1
}
}
}
}
}
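With the separate missing aggregation, the documents without f3 arrive as a single bucket next to the terms buckets rather than inside them. A minimal Python sketch of folding the two together on the client is shown here. The function name buckets_with_missing, the "n/a" label, and the sample bucket are illustrative assumptions.

```python
def buckets_with_missing(field2_bucket, missing_key="n/a"):
    """Append the missing_field3 bucket to the field3 terms buckets."""
    buckets = list(field2_bucket["field3"]["buckets"])
    missing = field2_bucket["missing_field3"]
    if missing["doc_count"] > 0:
        # Keep doc_count and any sub-aggregations, and add a synthetic key.
        buckets.append({"key": missing_key, **missing})
    return buckets


# Hypothetical, trimmed field2 bucket from the response to the request above.
sample_bucket = {
    "key": "tr",
    "doc_count": 5,
    "field3": {"buckets": [{"key": "x", "doc_count": 2}]},
    "missing_field3": {"doc_count": 3},
}
merged = buckets_with_missing(sample_bucket)
print(merged)
```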

Elasticsearch minBy

Is there a way in elasticsearch to get a field from a document containing the maximum value? (Basically working similarly to maxBy from scala)
For example (mocked):
{
"aggregations": {
"grouped": {
"terms": {
"field": "grouping",
"order": {
"docWithMin": "asc"
}
},
"aggregations": {
"withMax": {
"max": {
"maxByField": "a",
"field": "b"
}
}
}
}
}
}
For which {"grouping":1,"a":2,"b":5},{"grouping":1,"a":1,"b":10}
would return (something like): {"grouped":1,"withMax":5}, where the max comes from the first object due to "a" being higher there.
Assuming you just want the document back for which a is maximum, you can do this:
{
"size": 0,
"aggs": {
"grouped": {
"terms": {
"field": "grouping"
},
"aggs": {
"maxByA": {
"top_hits": {
"sort": [
{"a": {"order": "desc"}}
],
"size": 1
}
}
}
}
}
}
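To make the intended semantics concrete, here is a small pure-Python emulation of what terms plus top_hits sorted by a descending computes: for each group, the document with the highest a, from which b is then read. The function name max_by is a hypothetical helper, and the input is the two example documents from the question.

```python
def max_by(docs, group_field, by_field, value_field):
    """For each group, return value_field of the doc with the largest by_field."""
    best = {}
    for doc in docs:
        current = best.get(doc[group_field])
        if current is None or doc[by_field] > current[by_field]:
            best[doc[group_field]] = doc
    return {group: doc[value_field] for group, doc in best.items()}


docs = [
    {"grouping": 1, "a": 2, "b": 5},
    {"grouping": 1, "a": 1, "b": 10},
]
result = max_by(docs, "grouping", "a", "b")
print(result)  # {1: 5}
```

In the Elasticsearch response, the same value would be read from the _source of the single hit under maxByA in each bucket.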
