Latest document for each category? - elasticsearch

I have documents in Elasticsearch with a structure like this:
{
"created_on": [timestamp],
"source_id": [a string ID for the source],
"type": [a term],
... other fields
}
Obviously, I can select these documents in Kibana, show them in "discover", produce (for example) a pie chart showing type terms, and so on.
However, the requirement I've been given is to use only the most recent document for each source_id.
The approach I've tried is to map the documents into one bucket per source_id, then for each bucket, reduce to remove all but the document with the latest created_on.
However, when I used the terms aggregator, the result only contained counts, not whole documents I could further process:
"aggs" : {
"sources" : {
"terms" : { "field" : "source_id" }
}
}
How can I make this query?

If I understand correctly what you're trying to do, one way to accomplish it is to nest a top_hits aggregation under the terms aggregation. top_hits returns the top matching documents for each bucket of its parent aggregation, sorted by whatever criteria you choose. Following your example, you could do something like:
{
"aggs": {
"by_source_id": {
"terms": {
"field": "source_id"
},
"aggs": {
"most_recent": {
"top_hits": {
"sort": {
"created_on": "desc"
},
"size": 1
}
}
}
}
}
}
So you are grouping by source_id, which will create a bucket for each one, and then you'll get the top hits for each bucket according to the sorting criteria set in the top_hits agg, in this case the created_on field.
The result you should expect would be something like
....
"buckets": [
{
"key": 3,
"doc_count": 2,
"most_recent": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "so_sample02",
"_type": "items",
"_id": "2",
"_score": null,
"_source": {
"created_on": "2018-05-01 07:00:01",
"source_id": 3,
"type": "a"
},
"sort": [
1525158001000
]
}
]
}
}
},
{
"key": 5,
"doc_count": 2, .... and so on
Notice how each bucket contains the corresponding hits under most_recent. You can also limit the fields returned by adding "_source": { "includes": ["fieldA", "fieldB", ...] } to your top_hits agg.
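If you post-process the response in client code, pulling the latest document out of each bucket could look roughly like the Python sketch below. It assumes the aggregation names by_source_id and most_recent from the query above; the latest_per_source helper and the sample_response dict are made up for illustration and are not real cluster output.

```python
from typing import Any, Dict, List


def latest_per_source(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return the newest document's _source from each by_source_id bucket."""
    buckets = response["aggregations"]["by_source_id"]["buckets"]
    latest = []
    for bucket in buckets:
        hits = bucket["most_recent"]["hits"]["hits"]
        if hits:  # top_hits was requested with size 1, so at most one hit
            latest.append(hits[0]["_source"])
    return latest


# Hand-written sample in the shape of the response shown above.
sample_response = {
    "aggregations": {
        "by_source_id": {
            "buckets": [
                {
                    "key": 3,
                    "doc_count": 2,
                    "most_recent": {
                        "hits": {
                            "total": 2,
                            "hits": [
                                {
                                    "_id": "2",
                                    "_source": {
                                        "created_on": "2018-05-01 07:00:01",
                                        "source_id": 3,
                                        "type": "a",
                                    },
                                }
                            ],
                        }
                    },
                }
            ]
        }
    }
}

print(latest_per_source(sample_response))
```

Keep in mind that the terms aggregation returns only 10 buckets unless you set a size, so a sketch like this sees at most that many source_ids per request.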
Hope that helps.

Related

Limit to max records to be searched in Elastic Search Group by query

We have a strange issue: one of our customers has a very large number of records for a certain field x. When a user triggers a group-by query on field x, the Elasticsearch cluster falls over and restarts with an out-of-memory (OOM) error.
Is there a way to limit the maximum number of records that Elasticsearch looks at while aggregating on a certain field, so that the cluster is protected from going OOM?
PS: The group by can span multiple fields such as x, y, x, and w, and the user is searching only the last 30 days of data.
Use a sampler aggregation wrapped around the terms aggregation if you want to restrict the number of documents taken into account by an aggregation (a terms aggregation, in this case).
Index Data:
{
"role": "example",
"number": 1
}
{
"role": "example1",
"number": 2
}
{
"role": "example2",
"number": 3
}
Search Query:
{
"size": 0,
"aggs": {
"sample": {
"sampler": {
"shard_size": 2 // Max number of top-scoring documents sampled on each shard
},
"aggs": {
"unique_roles": {
"terms": {
"field": "role.keyword"
}
}
}
}
}
}
Search Result:
"hits": {
"total": {
"value": 3,
"relation": "eq"
},
"max_score": null,
"hits": []
},
"aggregations": {
"sample": {
"doc_count": 2, // Note this
"unique_roles": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "example",
"doc_count": 1
},
{
"key": "example1",
"doc_count": 1
}
]
}
}
}

Terms Aggregation return multiple fields (min_doc_count: 0)

I'm making a Terms Aggregation but I want to return multiple fields. I want a user to select buckets via "slug" (my-name), but show the actual "name" (My Name).
At this moment I'm making a TopHits SubAggregation like this:
"organisation": {
"aggregations": {
"label": {
"top_hits": {
"_source": {
"includes": [
"organisations.name"
]
},
"size": 1
}
}
},
"terms": {
"field": "organisations.slug",
"min_doc_count": 0,
"size": 20
}
}
This gives the desired result when my whole query actually finds some buckets/results.
As you can see, I've set min_doc_count to 0, which returns buckets with a doc count of 0. The problem is that for those buckets the TopHits response is empty, so I cannot render the proper name to the client.
Example response:
"organisation": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "my-name",
"doc_count": 27,
"label": {
"hits": {
"total": 27,
"max_score": 1,
"hits": [
{
"_index": "users",
"_type": "doc",
"_id": "4475",
"_score": 1,
"_source": {
"organisations": [
{
"name": "My name"
}]
}
}]
}
}
},
{
"key": "my-name-2",
"doc_count": 0,
"label": {
"hits": {
"total": 0,
"max_score": null,
"hits": []
}
}
},
.....
Has anyone accomplished this? I feel like TopHits won't help me here. It should always fetch the name.
What I've also tried:
Working with a terms sub aggregation. (same result)
Working with a significant terms sub aggregation. (same result)
What I think could be a solution, but feels dirty:
Index a new field with "organisations.slug___organisations.name" and work the magic via this.
Manually query the name field where the count is 0 (i.e. where TopHits is empty)
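For what it's worth, the first idea could be handled on the client roughly as in this Python sketch. It assumes a hypothetical combined keyword field whose values look like "my-name___My name", that the terms aggregation runs on that field, and that "___" never occurs inside a slug; the helper names and the sample buckets are invented for illustration.

```python
from typing import Any, Dict, List, Tuple

SEPARATOR = "___"  # assumption: this sequence never appears inside a slug


def split_bucket_key(key: str) -> Tuple[str, str]:
    """Split a combined 'slug___name' bucket key at the first separator."""
    slug, _, name = key.partition(SEPARATOR)
    return slug, name


def labels_from_buckets(buckets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Turn terms buckets on the combined field into slug/name/doc_count rows."""
    rows = []
    for bucket in buckets:
        slug, name = split_bucket_key(bucket["key"])
        rows.append({"slug": slug, "name": name, "doc_count": bucket["doc_count"]})
    return rows


# Hypothetical buckets, including one with doc_count 0 and therefore no hits.
sample_buckets = [
    {"key": "my-name___My name", "doc_count": 27},
    {"key": "my-name-2___My name 2", "doc_count": 0},
]

print(labels_from_buckets(sample_buckets))
```

Because the name travels inside the bucket key itself, it would still be available for buckets whose doc_count is 0.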
Kind regards,
Thanks in advance

Why does the Elasticsearch terms bucket size affect the doc_count of inner reverse_nested aggregations?

I've been trying to track down missing doc counts from a reverse nested aggregation.
My query
"aggs": {
"mainGrouping": {
"nested": {
"path": "parent.child"
},
"aggs": {
"uniqueCount": {
"cardinality": {
"field": "parent.child.id"
}
},
"groupBy": {
"terms": {
"field": "parent.child.id",
"size": 20, <- If I change this, my doc count for noOfParents changes
"order": [
{
"noOfParents": "desc"
}
]
},
"aggs": {
"noOfParents": {
"reverse_nested": {}
}
}
}
}
}
}
So I was running it at size:20. I had one bucket that returned noOfParents of 7 when I know there should be 9 matches. I noticed by accident if I change the size of the terms aggregation to 50 the noOfParents was correctly showing 9 for this bucket.
Why would the size of the terms aggregation affect the doc_count of a reverse aggregation? Is this expected behaviour or a bug? I'm using elasticsearch 5.6.
What you are observing is most likely normal behavior of the terms aggregation, because its document counts are approximate. It is related neither to reverse_nested nor to nested aggregations.
In short, since the data is spread over the shards, Elasticsearch makes its best guess locally on each shard first and then combines the results across the shards. For a better, more detailed explanation, please check out the terms aggregation documentation on approximate document counts.
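To see why a larger size can change a bucket's count, here is a toy Python model of that shard-then-merge step. It is only a sketch under simplifying assumptions, not Elasticsearch's actual implementation: each hypothetical shard reports just its own top shard_size terms, and the coordinating node sums whatever it receives. The shard contents and the merged_terms_counts helper are made up for illustration.

```python
from collections import Counter
from typing import Dict, List


def merged_terms_counts(shards: List[Counter], shard_size: int) -> Dict[str, int]:
    """Sum the counts of only the top `shard_size` terms reported by each shard."""
    merged: Counter = Counter()
    for shard in shards:
        # Each shard only sends back its locally most frequent terms.
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return dict(merged)


# Two hypothetical shards holding per-term document counts.
shards = [
    Counter({"a": 5, "b": 4, "c": 3}),
    Counter({"a": 6, "c": 4, "b": 2}),
]

print(merged_terms_counts(shards, shard_size=2))
print(merged_terms_counts(shards, shard_size=3))
```

With shard_size=2, term "b" ends up with 4 instead of 6 because the second shard never reported it; with shard_size=3 every count is exact. Raising size in the real query has a similar effect, since each shard is then asked for more terms.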
To make sure this is actually the case you may add a top_hits aggregation with explain enabled:
"aggs": {
"noOfParents": {
"reverse_nested": {},
"aggs": {
"top hits": {
"top_hits": {
"size": 10,
"explain": true
}
}
}
}
}
This will give you the list of the matched parent documents with their shard ids. Something like this:
"aggregations": {
"mainGrouping": {
...
"groupBy": {
...
"buckets": [
{
"key": "1",
"doc_count": 5,
"noOfParents": {
"doc_count": 5,
"top hits": {
"hits": {
"total": 5,
"max_score": 1,
"hits": [
{
"_shard": "[my-index-2018-12][0]", <-- this is the shard
"_node": "7JNqOhTtROqzQR9QBUENcg",
"_index": "my-index-2018-12",
"_type": "doc",
"_id": "AWdpyZ4Y3HZjlM-Ibd7O",
"_score": 1,
"_source": {
"parent": "A",
"child": {
"id": "1"
}
},
"_explanation": ...
},
Another way to prove that this is the source of the problem is to isolate the query within one shard. To do so it is enough to add routing to the search request: ?routing=0
This will make your terms buckets counts stable within one shard. Then compare the noOfParents with the expected amount of parents (again, within the same shard).
Hope that helps!

Elastic Search Unique Field Values

I am trying to get groups of only unique values in Elastic Search for the searches. I can't figure out why this doesn't behave.
I have gone through many StackOverflow questions, and read the Documentation for most of the day. Nothing seems to work for me, below I provided what I tried doing last.
Is there any reason someone would want to have the same results repeatedly returned? Maybe for differing versions of a Document?
In this example I would like a listing of all mfr_id's, and their mfr_desc as well. I am running this over a type to search document field values only. It seems that Agg Terms is the way to accomplish this, does anyone see anything I am doing wrong?
1: API Call
GET /inventory/item/_search
{
"size": 0,
"_source": ["mfr_id", "mfr_desc"],
"aggs": {
"unique_vals": {
"terms": {
"field": "mfr_id.keyword"
/** I have to use .keyword, seems like my mappings isn't working */
}
}
}
}
2: Mapping File
The Mapping I run after doing a Bulk import is quite simple. I read to not analyze the keys if you want a unique query:
{
"index": "inventory",
"body": {
"settings": {
"number_of_shards": 1
},
"mappings": {
"_default_": {
"properties": {
"mfr_id": {
"type": "string",
"index": "not_analyzed"
}
}
}
}
}
}
3: My Results
Aggregation has ~10 records when there are about 100. I would really like to be able to get the _source fields of more than just a key if this is possible.
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0
},
"hits": {
"total": 49341,
"max_score": 0,
"hits": []
},
"aggregations": {
"unique_vals": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 6815,
"buckets": [
{
"key": "14",
"doc_count": 24292
},
{
"key": "33",
"doc_count": 5508
},
...
"I would really like to be able to get the _source fields of more than just a key if this is possible."
I think you have only one option; I have faced the same problem. Try this:
{
"aggregations": {
"byId": {
"terms": {
"field": "mfr_id"
},
"aggs": {
"byDesc": {
"terms": {
"field": "mfr_desc"
}
}
}
}
}
}
Now you can get both the id and the desc while iterating through the results with the Elasticsearch Java API:
Terms aTerms = aAggregations.get("byId");
aTerms.getBuckets().stream().forEach(aBucketById-> {
Terms aTermsDesc = aBucketById.getAggregations().get("byDesc");
aTermsDesc.getBuckets().stream().forEach(aBucketByDesc -> {
//store id and desc
});
});
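If you work with the raw JSON response rather than the Java client, the same nested loop could be sketched in Python as follows. It assumes the aggregation names byId and byDesc from the query above; the id_desc_pairs helper and the sample response (including the "Acme" description) are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def id_desc_pairs(response: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Collect (mfr_id, mfr_desc) pairs from the nested terms buckets."""
    pairs = []
    for id_bucket in response["aggregations"]["byId"]["buckets"]:
        for desc_bucket in id_bucket["byDesc"]["buckets"]:
            pairs.append((id_bucket["key"], desc_bucket["key"]))
    return pairs


# Hand-written sample response; the description value is invented.
sample_response = {
    "aggregations": {
        "byId": {
            "buckets": [
                {
                    "key": "14",
                    "doc_count": 24292,
                    "byDesc": {"buckets": [{"key": "Acme", "doc_count": 24292}]},
                }
            ]
        }
    }
}

print(id_desc_pairs(sample_response))
```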
I would use a filter; it has better performance than an aggregation.
With an aggregation you get all of the documents and only then apply the aggregation. If you use a filter, you get only the documents which match the filter, and filters can also be cached.
{
"query": {
"constant_score": {
"filter": {
"exists": {
"field": "mfr_id"
}
}
}
}
}

Elastic search - how to get aggregate nested _source value

We are using elastic search to get some data.
Please tell me how to get data aggregated (grouped) by _source.eventName,
like this SQL:
select eventName, count(eventName) from events group by eventName;
Here is my aggs query and current response data structure.
{
type: 'event',
size: 3,
aggs: {
event_group: {
terms: {
field: 'eventName'
}
}
}
}
 
"hits": {
"hits": [
{
"_type": "event",
"_source": {
"eventName": "event1"
}
},
{
"_type": "event",
"_source": {
"eventName": "event1"
}
},
{
"_type": "event",
"_source": {
"eventName": "event2"
}
}
]
}
※ Ideal case (this is the result I would like to get):
{
"eventName": "event1",
"count": 2
},
{
"eventName": "event2",
"count": 1
}
Elasticsearch doesn't support this filtering, but you can use the REST API filter_path parameter, which will return something like:
GET .../_search?pretty&filter_path=hits.hits._source.*
"hits": {
"hits": [
{ "_source": {...} },
{ "_source": {...} },
{ "_source": {...} }
]
}
See the Elasticsearch documentation on common options.
Your query is almost correct. If you only want the count of eventNames, use size: 0 there instead of 3. It will tell ES to not return hits.
The response should have an aggregations property like so:
{
"aggregations":
{
"event_group": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets" : [
{
"key" : "event1",
"doc_count" : 10
},
{
"key" : "event2",
"doc_count" : 10
},
{
"key" : "event3",
"doc_count" : 10
}
]
}
}
}
The doc_count property there is the count you're looking for.
Note: by default, ES returns only the top 10 eventName buckets. If you want all unique eventNames, you need to specify a larger size in your terms aggregation; how to request all of them depends on the ES version you're using. Read the ES docs for more info.
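If you need the exact shape from the question, reshaping the buckets on the client is a small step. Here is a minimal Python sketch, assuming the aggregation is named event_group as in the query; the event_counts helper and the sample response are made up for illustration.

```python
from typing import Any, Dict, List


def event_counts(response: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Map each terms bucket to the {eventName, count} shape from the question."""
    buckets = response["aggregations"]["event_group"]["buckets"]
    return [{"eventName": b["key"], "count": b["doc_count"]} for b in buckets]


# Hand-written sample matching the three hits shown in the question.
sample_response = {
    "aggregations": {
        "event_group": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {"key": "event1", "doc_count": 2},
                {"key": "event2", "doc_count": 1},
            ],
        }
    }
}

print(event_counts(sample_response))
```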
