Limit to max records to be searched in Elastic Search Group by query - elasticsearch

We have a strange issue where the data for one of our customers has a lot of records for a certain field x. When the user triggers a group-by query on that x field, the Elasticsearch cluster goes for a toss and restarts with an OOM.
Is there a way to limit the max records that Elasticsearch should look at while aggregating the result for a certain field, so that the cluster can be saved from going OOM?
PS: The group by can go on multiple fields such as x, y, z, and w, and the user is searching the last 30 days of data only.

Use a sampler aggregation to restrict the number of documents taken into account by a sub-aggregation (in this case, a terms aggregation).
Index Data:
{
  "role": "example",
  "number": 1
}
{
  "role": "example1",
  "number": 2
}
{
  "role": "example2",
  "number": 3
}
Search Query:
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 2   // Max documents you need to have for the aggregation
      },
      "aggs": {
        "unique_roles": {
          "terms": {
            "field": "role.keyword"
          }
        }
      }
    }
  }
}
Search Result:
"hits": {
  "total": {
    "value": 3,
    "relation": "eq"
  },
  "max_score": null,
  "hits": []
},
"aggregations": {
  "sample": {
    "doc_count": 2,   // Note this
    "unique_roles": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "example",
          "doc_count": 1
        },
        {
          "key": "example1",
          "doc_count": 1
        }
      ]
    }
  }
}
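One subtlety worth keeping in mind: the sampler's shard_size limit is applied per shard, so the cluster-wide number of documents considered can be up to shard_size multiplied by the number of shards. A trivial sketch of that upper bound (plain Python; the numbers are hypothetical):

```python
def max_sampled_docs(shard_size: int, number_of_shards: int) -> int:
    """Upper bound on documents the sampler agg will consider cluster-wide.

    The sampler collects at most shard_size documents on each shard,
    so the bound is simply the product of the two values.
    """
    return shard_size * number_of_shards

# e.g. shard_size=2 on a 5-shard index can sample up to 10 documents
print(max_sampled_docs(2, 5))  # prints: 10
```

So to cap memory use predictably, size shard_size with your index's shard count in mind.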


Latest document for each category?

I have documents in ElasticSearch with structure like this:
{
  "created_on": [timestamp],
  "source_id": [a string ID for the source],
  "type": [a term],
  ... other fields
}
Obviously, I can select these documents in Kibana, show them in "discover", produce (for example) a pie chart showing type terms, and so on.
However, the requirement I've been given is to use only the most recent document for each source_id.
The approach I've tried is to map the documents into one bucket per source_id, then for each bucket, reduce to remove all but the document with the latest created_on.
However, when I used the terms aggregator, the result only contained counts, not whole documents I could further process:
"aggs" : {
  "sources" : {
    "terms" : { "field" : "source_id" }
  }
}
How can I make this query?
If I understood correctly what you're trying to do, one way to accomplish it is to use a top_hits aggregation under the terms aggregation; it is useful for grouping results, by whatever criteria you'd like, for each bucket of its parent aggregation. Following your example, you could do something like:
{
  "aggs": {
    "by_source_id": {
      "terms": {
        "field": "source_id"
      },
      "aggs": {
        "most_recent": {
          "top_hits": {
            "sort": {
              "created_on": "desc"
            },
            "size": 1
          }
        }
      }
    }
  }
}
So you are grouping by source_id, which will create a bucket for each one, and then you'll get the top hits for each bucket according to the sorting criteria set in the top_hits agg, in this case the created_on field.
The result you should expect would be something like
....
"buckets": [
  {
    "key": 3,
    "doc_count": 2,
    "most_recent": {
      "hits": {
        "total": 2,
        "max_score": null,
        "hits": [
          {
            "_index": "so_sample02",
            "_type": "items",
            "_id": "2",
            "_score": null,
            "_source": {
              "created_on": "2018-05-01 07:00:01",
              "source_id": 3,
              "type": "a"
            },
            "sort": [
              1525158001000
            ]
          }
        ]
      }
    }
  },
  {
    "key": 5,
    "doc_count": 2, .... and so on
Notice how, within each bucket, under most_recent, we get the corresponding hits. You can furthermore limit the fields returned by specifying "_source": { "includes": ["fieldA", "fieldB", ...] } in your top_hits agg.
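Client-side, pulling the single most recent document out of each bucket is then straightforward. A minimal sketch in Python, assuming a response shaped like the sample above:

```python
def latest_per_source(response: dict) -> dict:
    """Map each source_id bucket key to the _source of its most recent hit."""
    latest = {}
    for bucket in response["aggregations"]["by_source_id"]["buckets"]:
        hits = bucket["most_recent"]["hits"]["hits"]
        if hits:  # top_hits with size 1 returns at most one hit per bucket
            latest[bucket["key"]] = hits[0]["_source"]
    return latest

# Example using the bucket shown above:
response = {"aggregations": {"by_source_id": {"buckets": [
    {"key": 3, "doc_count": 2, "most_recent": {"hits": {"hits": [
        {"_id": "2", "_source": {"created_on": "2018-05-01 07:00:01",
                                 "source_id": 3, "type": "a"}}
    ]}}}
]}}}
print(latest_per_source(response)[3]["type"])  # prints: a
```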
Hope that helps.

Group results returned by elasticsearch query based on query terms

I am very new to Elasticsearch, and I am facing an issue building a query. My document structure is like:
{
  latlng: {
    lat: '<some-latitude>',
    lon: '<some-longitude>'
  },
  gmap_result: {<Some object>}
}
I am doing a search on a list of lat-longs. For each coordinate, I am fetching results that are within 100m. I have been able to do this part. But the tricky part is that I do not know which results in the output correspond to which query term. I think this requires using aggregations at some level, but I am currently clueless on how to proceed.
An aggregate query is the correct approach. You can learn about them here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html
An example is below. In this example, I am using a match query to find all instances of the word test in the title field, and then aggregating on the status field to count how many of the matching results are in each status.
GET /my_index/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "*test*"
          }
        }
      ]
    }
  },
  "aggs": {
    "count_by_status": {
      "terms": {
        "field": "status"
      }
    }
  },
  "size": 0
}
The results look like this:
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 346,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "count_by_status": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Open",
          "doc_count": 283
        },
        {
          "key": "Completed",
          "doc_count": 36
        },
        {
          "key": "On Hold",
          "doc_count": 12
        },
        {
          "key": "Withdrawn",
          "doc_count": 10
        },
        {
          "key": "Declined",
          "doc_count": 5
        }
      ]
    }
  }
}
If you provide your query, it would help us give a more specific aggregate query for you to use.
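For what it's worth, collapsing the count_by_status buckets into a plain mapping on the client side is a one-liner. A small sketch in Python, assuming the response shape above:

```python
def buckets_to_counts(response: dict, agg_name: str) -> dict:
    """Collapse a terms aggregation into a {key: doc_count} mapping."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}

# Example with a truncated version of the response above:
response = {"aggregations": {"count_by_status": {"buckets": [
    {"key": "Open", "doc_count": 283},
    {"key": "Completed", "doc_count": 36},
]}}}
print(buckets_to_counts(response, "count_by_status"))
# prints: {'Open': 283, 'Completed': 36}
```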

Elastic Search Unique Field Values

I am trying to get groups of only unique values in Elasticsearch for my searches, and I can't figure out why this doesn't behave as expected.
I have gone through many StackOverflow questions and read the documentation for most of the day. Nothing seems to work for me; below I have provided what I tried last.
Is there any reason someone would want to have the same results repeatedly returned? Maybe for differing versions of a document?
In this example I would like a listing of all mfr_ids, and their mfr_desc as well. I am running this over a type to search document field values only. It seems that a terms aggregation is the way to accomplish this; does anyone see anything I am doing wrong?
1: API Call
GET /inventory/item/_search
{
  "size": 0,
  "_source": ["mfr_id", "mfr_desc"],
  "aggs": {
    "unique_vals": {
      "terms": {
        "field": "mfr_id.keyword"
        /** I have to use .keyword, seems like my mapping isn't working */
      }
    }
  }
}
2: Mapping File
The mapping I apply after doing a bulk import is quite simple. I read that the keys should not be analyzed if you want to aggregate on unique values:
{
  "index": "inventory",
  "body": {
    "settings": {
      "number_of_shards": 1
    },
    "mappings": {
      "_default_": {
        "properties": {
          "mfr_id": {
            "type": "string",
            "index": "not_analyzed"
          }
        }
      }
    }
  }
}
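As an aside: the `string` / `not_analyzed` mapping syntax above is for Elasticsearch 2.x. On 5.x and later, the equivalent is a `keyword` field type, which is also why the `.keyword` suffix is needed in the query: when no explicit mapping applies, dynamic mapping creates a `text` field with a `keyword` sub-field. A sketch of the 5.x+ equivalent mapping (field names taken from the question; adjust for your index):

```json
{
  "mappings": {
    "properties": {
      "mfr_id":   { "type": "keyword" },
      "mfr_desc": { "type": "keyword" }
    }
  }
}
```

With an explicit `keyword` mapping like this, the aggregation can target `mfr_id` directly instead of `mfr_id.keyword`.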
3: My Results
The aggregation returns ~10 buckets when there are about 100 distinct values. I would really like to be able to get the _source fields of more than just a key, if this is possible.
{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 49341,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "unique_vals": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 6815,
      "buckets": [
        {
          "key": "14",
          "doc_count": 24292
        },
        {
          "key": "33",
          "doc_count": 5508
        },
        ...
I would really like to be able to get the _source fields of more than just a key if this is possible.
I think you have only one option; I have faced the same problem. Try this:
{
  "aggregations": {
    "byId": {
      "terms": {
        "field": "mfr_id"
      },
      "aggs": {
        "byDesc": {
          "terms": {
            "field": "mfr_desc"
          }
        }
      }
    }
  }
}
Now you will get both the id and the desc while iterating through the Elasticsearch Java API:
Terms aTerms = aAggregations.get("byId");
aTerms.getBuckets().stream().forEach(aBucketById -> {
    Terms aTermsDesc = aBucketById.getAggregations().get("byDesc");
    aTermsDesc.getBuckets().stream().forEach(aBucketByDesc -> {
        // store id and desc
    });
});
I would use a filter; it has better performance than an aggregation.
With an aggregation you fetch all of the documents and only then apply the aggregation. If you use a filter, you get only the documents which match the filter, and filters can also be cached.
{
  "query": {
    "constant_score": {
      "filter": {
        "exists": {
          "field": "mfr_id"
        }
      }
    }
  }
}

ElasticSearch 2.1.0 - Deep 'children' aggregation with 'sum' metric returning empty results

I have a hierarchy of document types two levels deep. The documents are related by parent-child relationships as follows: category > sub_category > item, i.e. each sub_category has a _parent field referring to a category id, and each item has a _parent field referring to a sub_category id.
Each item has a price field. Given a query for categories, which includes conditions for sub-categories and items, I want to calculate a total price for each sub_category.
My query looks something like this:
{
  "query": {
    "has_child": {
      "child_type": "sub_category",
      "query": {
        "has_child": {
          "child_type": "item",
          "query": {
            "range": {
              "price": {
                "gte": 100,
                "lte": 150
              }
            }
          }
        }
      }
    }
  }
}
My aggregation to calculate the price for each sub-category looks like this:
{
  "aggs": {
    "categories": {
      "terms": {
        "field": "id"
      },
      "aggs": {
        "sub_categories": {
          "children": {
            "type": "sub_category"
          },
          "aggs": {
            "sub_category_ids": {
              "terms": {
                "field": "id"
              },
              "aggs": {
                "items": {
                  "children": {
                    "type": "item"
                  },
                  "aggs": {
                    "price": {
                      "sum": {
                        "field": "price"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}
Despite the query response listing matching results, the aggregation response doesn't match any items:
{
  "aggregations": {
    "categories": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "category1",
          "doc_count": 1,
          "sub_categories": {
            "doc_count": 3,
            "sub_category_ids": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "subcat1",
                  "doc_count": 1,
                  "items": {
                    "doc_count": 0,
                    "price": {
                      "value": 0
                    }
                  }
                },
                {
                  "key": "subcat2",
                  "doc_count": 1,
                  "items": {
                    "doc_count": 0,
                    "price": {
                      "value": 0
                    }
                  }
                },
                {
                  "key": "subcat3",
                  "doc_count": 1,
                  "items": {
                    "doc_count": 0,
                    "price": {
                      "value": 0
                    }
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
}
However, omitting the sub_category_ids aggregation does cause the items to appear and for prices to be summed at the level of the categories aggregation. I would expect including the sub_category_ids aggregation to simply change the level at which the prices are summed.
Am I misunderstanding how the aggregation is evaluated, and if so how could I modify it to display the summed prices for each sub-category?
I opened issue #15413 regarding the children aggregation, as I and other folks were facing similar issues in ES 2.0.
Apparently the problem, according to ES developer martijnvg, was that:
The children agg makes an assumption (that all segments are being seen by children agg) that was true in 1.x but not in 2.x
PR #15457 fixed this issue; again from martijnvg:
Before we only evaluated segments that yielded matches in parent aggs, which caused us to miss to evaluate child docs in segments we didn't have parent matches for. The fix for this is to stop remembering in what segments we have matches and simply evaluate all segments. This makes the code simpler and we can still quickly see if a segment doesn't hold child docs, like we did before.
This pull request has been merged, and it has also been backported to the 2.x, 2.1 and 2.0 branches.

How to query for field values that all documents have in common?

I've got the following simple ElasticSearch query:
{
  "query": {
    "term": {
      "categories": "1234"
    }
  }
}
Which returns a number of documents containing a structure like this:
{
  "properties": [
    {
      "name": "foo",
      "value": 20
    },
    {
      "name": "bar",
      "value": 30
    }
  ]
}
How do I have to alter the above query so ElasticSearch returns a set of values in properties.name that all result documents have in common?
You can't do this with a simple query. One solution is to use a terms aggregation, like this one:
{
  "query": {
    "term": {
      "categories": "1234"
    }
  },
  "aggs": {
    "properties_name": {
      "terms": {
        "field": "properties.name"
      }
    }
  }
}
You will get a response similar to this:
{
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 10,
    "hits": [{...}]
  },
  "aggregations": {
    "properties_name": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "foo",
          "doc_count": 10
        },
        {
          "key": "bar",
          "doc_count": 4
        }
      ]
    }
  }
}
Your usual results will be available under hits, and the aggregation results under aggregations.
Then you can use hits.total (10) to find the property names which are present in all documents. You simply need to iterate over the buckets and keep the ones with doc_count == hits.total.
In this example, only the "foo" property is present in all documents.
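That check can be sketched client-side like this (Python, assuming the response shape shown above, where hits.total is a plain number as in older ES versions):

```python
def names_in_all_docs(response: dict) -> list:
    """Keep the terms whose doc_count equals the total hit count."""
    total = response["hits"]["total"]
    buckets = response["aggregations"]["properties_name"]["buckets"]
    return [b["key"] for b in buckets if b["doc_count"] == total]

# Example mirroring the response above:
response = {
    "hits": {"total": 10},
    "aggregations": {"properties_name": {"buckets": [
        {"key": "foo", "doc_count": 10},
        {"key": "bar", "doc_count": 4},
    ]}},
}
print(names_in_all_docs(response))  # prints: ['foo']
```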
