Elastic Search Filter on the result of terms aggregation - elasticsearch

How can I apply something like a match phrase prefix query to the result of a terms aggregation in Elasticsearch?
I have a terms aggregation, and the result looks something like this:
"buckets": [
{
"key": "KEY",
"count": 20
},
{
"key": "LOCK",
"count": 30
}
]
Now the requirement is to keep only those buckets whose key starts with a certain prefix, similar to what match phrase prefix does. For example, if the input is "LOC", then only one bucket should be returned (the second one). So effectively it's a filter on the terms aggregation. Thanks for your thoughts.

You could use the include parameter on your terms aggregation to keep only the values that match a regular expression.
Something like this should work:
GET stackoverflow/_search
{
"_source": false,
"aggs": {
"groups": {
"terms": {
"field": "text.keyword",
"include": "LOC.*"
}
}
}
}
Example: Let's say you have three documents with three different terms (LOCK, KEY and LOL) in an index. If you perform the following request:
GET stackoverflow/_search
{
"_source": false,
"aggs": {
"groups": {
"terms": {
"field": "text.keyword",
"include": "L.*"
}
}
}
}
You'll get the following buckets:
"buckets" : [
{
"key" : "LOCK",
"doc_count" : 1
},
{
"key" : "LOL",
"doc_count" : 1
}
]
Hope it is helpful.
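The effect of the include pattern can be checked locally. The sketch below is only an illustration, not Elasticsearch itself: it builds the request body from the answer and mimics the filter with Python's re.fullmatch, on the assumption that include has to match the whole term (which is why the pattern ends in .*). The helper names prefix_terms_agg and simulate_include are made up for this example, and the backslash escaping of the prefix is an assumption about the regular expression syntax, so verify it against your Elasticsearch version.

```python
import re


def prefix_terms_agg(prefix, field="text.keyword"):
    """Build a terms aggregation body that keeps only keys starting with prefix."""
    # Assumption: a backslash in front of a non-alphanumeric character makes it
    # literal in the regular expression syntax used by "include".
    escaped = "".join(c if c.isalnum() else "\\" + c for c in prefix)
    return {
        "_source": False,
        "aggs": {"groups": {"terms": {"field": field, "include": escaped + ".*"}}},
    }


def simulate_include(keys, pattern):
    """Local stand-in for the filter: keep keys that match the whole pattern."""
    return [k for k in keys if re.fullmatch(pattern, k)]


body = prefix_terms_agg("LOC")
pattern = body["aggs"]["groups"]["terms"]["include"]
print(pattern)
print(simulate_include(["KEY", "LOCK", "LOL"], pattern))
print(simulate_include(["KEY", "LOCK", "LOL"], "L.*"))
```

The first simulated call corresponds to the "LOC" case from the question and the second to the "L.*" example in the answer.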

Related

Sort multi-bucket aggregation by source fields inside inner multi-bucket aggregation

TL;DR: Using an inner multi-bucket aggregation (top_hits with size: 1) inside an outer multi-bucket aggregation, is it possible to sort the buckets of the outer aggregation by the data in the inner buckets?
I have the following index mappings
{
"parent": {
"properties": {
"children": {
"type": "nested",
"properties": {
"child_id": { "type": "keyword" }
}
}
}
}
}
and each child (in data) has also the properties last_modified: Date and other_property: String.
I need to fetch a list of children (of all the parents but without the parents), but only the one with the latest last_modified per each child_id. Then I need to sort and paginate those results to return manageable amounts of data.
I'm able to get the data and paginate over it with a combination of nested, terms, top_hits, and bucket_sort aggregations (and also get the total count with cardinality)
{
"query": {
"match_all": {}
},
"size": 0,
"aggs": {
"children": {
"nested": {
"path": "children"
},
"aggs": {
"totalCount": {
"cardinality": {
"field": "children.child_id"
}
},
"oneChildPerId": {
"terms": {
"field": "children.child_id",
"order": { "_term": "asc" },
"size": 1000000
},
"aggs": {
"lastModified": {
"top_hits": {
"_source": [
"children.other_property"
],
"sort": {
"children.last_modified": {
"order": "desc"
}
},
"size": 1
}
},
"paginate": {
"bucket_sort": {
"from": 36,
"size": 3
}
}
}
}
}
}
}
}
but after more than a solid day of going through the docs and experimenting, I seem to be no closer to figuring out how to sort the buckets of my oneChildPerId aggregation by the other_property of the single child retrieved by the lastModified aggregation.
Is there a way to sort a multi-bucket aggregation by results in a nested multi-bucket aggregation?
What I've tried:
I thought I could use bucket_sort for that too, but apparently its sort can only be used with paths containing other single-bucket aggregations and ending in a metric one.
I've tried to find a way to somehow transform the 1-result multi-bucket of lastModified into a single bucket, but haven't found any.
I'm using ElasticSearch 6.8.6 (the bucket_sort and similar tools weren't available in ES 5.x and older).
I had the same problem. I needed a terms aggregation with a nested top_hits, and wanted to sort by a specific field inside the nested aggregation.
I'm not sure how performant my solution is, but the desired behaviour can be achieved with a single-value metric aggregation on the same level as the top_hits. You can then sort by this new aggregation in the terms aggregation with the order field.
Here is an example:
POST books/_doc
{ "genre": "action", "title": "bookA", "pages": 200 }
POST books/_doc
{ "genre": "action", "title": "bookB", "pages": 35 }
POST books/_doc
{ "genre": "action", "title": "bookC", "pages": 170 }
POST books/_doc
{ "genre": "comedy", "title": "bookD", "pages": 80 }
POST books/_doc
{ "genre": "comedy", "title": "bookE", "pages": 90 }
GET books/_search
{
"size": 0,
"aggs": {
"by_genre": {
"terms": {
"field": "genre.keyword",
"order": {"max_pages": "asc"}
},
"aggs": {
"top_book": {
"top_hits": {
"size": 1,
"sort": [{"pages": {"order": "desc"}}]
}
},
"max_pages": {"max": {"field": "pages"}}
}
}
}
}
by_genre has the order field, which sorts by a sub-aggregation called max_pages. max_pages has been added only for this purpose: it produces a single-value metric that order can sort by.
The query above returns (I've shortened the output for clarity):
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
If you change "order": {"max_pages": "asc"} to "order": {"max_pages": "desc"}, the output becomes:
{ "genre" : "action", "title" : "bookA", "pages" : 200 }
{ "genre" : "comedy", "title" : "bookE", "pages" : 90 }
The type of the max_pages aggregation can be changed as needed, as long as it is a single-value metric aggregation (e.g. sum, avg, etc.).
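To make the mechanics concrete, here is a rough local simulation of what the request above computes, using the five books from the example. It is a sketch in plain Python, not Elasticsearch code, and the function name top_book_per_genre is invented for illustration. Each bucket gets a helper metric (max_pages) next to its single top hit, and the buckets are then ordered by that metric.

```python
from collections import defaultdict

books = [
    {"genre": "action", "title": "bookA", "pages": 200},
    {"genre": "action", "title": "bookB", "pages": 35},
    {"genre": "action", "title": "bookC", "pages": 170},
    {"genre": "comedy", "title": "bookD", "pages": 80},
    {"genre": "comedy", "title": "bookE", "pages": 90},
]


def top_book_per_genre(docs, descending=False):
    """Group by genre, keep the longest book, order buckets by max pages."""
    by_genre = defaultdict(list)
    for doc in docs:
        by_genre[doc["genre"]].append(doc)

    buckets = []
    for genre, members in by_genre.items():
        buckets.append({
            "key": genre,
            # Plays the role of the single-value "max_pages" metric.
            "max_pages": max(d["pages"] for d in members),
            # Plays the role of top_hits with size 1, sorted by pages desc.
            "top_book": max(members, key=lambda d: d["pages"]),
        })
    # Plays the role of "order": {"max_pages": "asc" | "desc"}.
    return sorted(buckets, key=lambda b: b["max_pages"], reverse=descending)


for bucket in top_book_per_genre(books):
    print(bucket["top_book"])
```

Because the helper metric uses the same field as the top_hits sort, the bucket order follows the value of the single hit in each bucket. Sorting by a different field would need a metric that reflects that field instead.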

Elasticsearch ranking aggregation with multiple terms query

tl;dr: Want to rank aggregations based on whether bucket key has used either of the search terms.
I have two indices documents and recommendations with the following mappings:
Documents:
{
"id": string,
"document_text" : string,
"author" : { "name": string }
...other fields
}
Recommendations:
{
"id": string,
"recommendation_text" : string,
"author" : { "name": string }
...other fields
}
The problem I am solving is finding the top authors for a set of query terms.
This works quite well with multi_match for a single query term, like this:
{
"size": 0,
"query": {
"multi_match": {
"query": "science",
"fields": [
"document_text",
"recommendation_text"
],
"type": "phrase"
}
},
"aggs": {
"search-authors": {
"terms": {
"field": "author.name.keyword",
"size": 50
},
"aggs": {
"top-docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But when I have multiple keywords, say zoology and botany, I want the aggregation ranking to place authors who have talked about both zoology and botany higher than those who have used only one of them.
Having multiple multi_match clauses inside a bool query doesn't help, since this isn't exactly an and/or situation.

Latest document for each category?

I have documents in ElasticSearch with structure like this:
{
"created_on": [timestamp],
"source_id": [a string ID for the source],
"type": [a term],
... other fields
}
Obviously, I can select these documents in Kibana, show them in "discover", produce (for example) a pie chart showing type terms, and so on.
However, the requirement I've been given is to use only the most recent document for each source_id.
The approach I've tried is to map the documents into one bucket per source_id, then for each bucket, reduce to remove all but the document with the latest created_on.
However, when I used the terms aggregator, the result only contained counts, not whole documents I could further process:
"aggs" : {
"sources" : {
"terms" : { "field" : "source_id" }
}
}
How can I make this query?
If I understood correctly what you're trying to do, one way to accomplish it is to use a top_hits aggregation under the terms aggregation. top_hits returns the top matching documents for each bucket of its parent aggregation, according to whatever sort criteria you choose. Following your example, you could do something like
{
"aggs": {
"by_source_id": {
"terms": {
"field": "source_id"
},
"aggs": {
"most_recent": {
"top_hits": {
"sort": {
"created_on": "desc"
},
"size": 1
}
}
}
}
}
}
So you are grouping by source_id, which will create a bucket for each one, and then you'll get the top hits for each bucket according to the sorting criteria set in the top_hits agg, in this case the created_on field.
The result you should expect would be something like
....
"buckets": [
{
"key": 3,
"doc_count": 2,
"most_recent": {
"hits": {
"total": 2,
"max_score": null,
"hits": [
{
"_index": "so_sample02",
"_type": "items",
"_id": "2",
"_score": null,
"_source": {
"created_on": "2018-05-01 07:00:01",
"source_id": 3,
"type": "a"
},
"sort": [
1525158001000
]
}
]
}
}
},
{
"key": 5,
"doc_count": 2, .... and so on
Notice how, within each bucket, most_recent contains the corresponding hits. You can also limit the fields returned by adding source filtering to the top_hits aggregation, for example "_source": { "includes": ["fieldA", "fieldB"] }.
Hope that helps.
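As a rough illustration of what terms plus top_hits with size 1 computes, the following plain-Python sketch keeps the most recent document per source_id and then counts the type values of the survivors, which is what the pie chart would need. Apart from the one document shown in the sample response, the data is made up, and the helper name latest_per_source is hypothetical. It also assumes created_on strings in a format that sorts chronologically as text.

```python
from collections import Counter

# Made-up sample documents; only the first one appears in the response above.
docs = [
    {"created_on": "2018-05-01 07:00:01", "source_id": 3, "type": "a"},
    {"created_on": "2018-05-01 07:00:00", "source_id": 3, "type": "b"},
    {"created_on": "2018-05-01 07:00:00", "source_id": 5, "type": "a"},
    {"created_on": "2018-05-01 07:00:02", "source_id": 5, "type": "c"},
]


def latest_per_source(documents):
    """Return {source_id: most recent document}, like top_hits with size 1."""
    latest = {}
    for doc in documents:
        current = latest.get(doc["source_id"])
        # Zero-padded "YYYY-MM-DD HH:MM:SS" strings compare chronologically.
        if current is None or doc["created_on"] > current["created_on"]:
            latest[doc["source_id"]] = doc
    return latest


latest = latest_per_source(docs)
type_counts = Counter(doc["type"] for doc in latest.values())
print(latest)
print(type_counts)
```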

Elasticsearch prioritize specific _ids but don't filter?

I'm trying to sort my Elasticsearch query so that documents with specific _ids appear first. The query should not be filtered down to those _ids; they should only be prioritized.
Here's an example of what I've tried as an attempt:
{"query":{"constant_score":{"filter":{"terms":{"_id":[2,3,4]}},"boost":2}}}
So the above would be included along with other queries however the query just returns the exact matches and not the rest of the results.
Any ideas as to how this can be done so that it just prioritizes the documents with the ids but doesn't filter the entire query?
Try this (instead of the match_all there, you can use a query that actually filters the results):
{
"query": {
"function_score": {
"query": {
"match_all": {}
},
"functions": [
{
"filter": {
"terms": {
"_id": [
2,
3,
4
]
}
},
"weight": 2
}
]
}
}
}
If you need the documents returned in an exact order, go with
"sort": [
{
"_script": {
"script": "doc['id'] != null ? sortOrder.indexOf(doc['id'].value.toInteger()) : 0",
"type": "number",
"params": {
"sortOrder": [
2,3,4
]
},
"order": "desc"
}
},
"_score"
]
P.S. As @Val mentioned, this will not work with _id, so you would need to store the id as a separate field.
If you only need to move documents to the top, look at function_score.
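For reference, the function_score request from the first answer can be generated for any list of ids. The helper below is a hypothetical convenience (the names prioritize_ids and simulated_score are not part of any client library) that only builds the request body as a Python dict. The small scoring function reflects my assumption that function_score multiplies the query score by the weight by default, and that documents matching no function keep their original score.

```python
import json


def prioritize_ids(ids, base_query=None, weight=2):
    """Build a function_score body that boosts the given _ids without filtering."""
    if base_query is None:
        base_query = {"match_all": {}}
    return {
        "query": {
            "function_score": {
                "query": base_query,
                "functions": [
                    {"filter": {"terms": {"_id": list(ids)}}, "weight": weight}
                ],
            }
        }
    }


def simulated_score(doc_id, base_score, ids, weight=2):
    """Assumed scoring: boosted ids get base_score * weight, others are unchanged."""
    return base_score * weight if doc_id in ids else base_score


body = prioritize_ids([2, 3, 4])
print(json.dumps(body, indent=2))
print(simulated_score(3, 1.0, [2, 3, 4]), simulated_score(9, 1.0, [2, 3, 4]))
```

Under that assumption the listed documents outrank the rest whenever their base scores are comparable, while every other document matching the base query is still returned.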

elastic search sort aggregation by selected field

How can I sort the output from an aggregation by a field that is in the source data, but not part of the output of the aggregation?
In my source data I have a date field that I would like the output of the aggregation to be sorted by date.
Is that possible? I've looked at using "order" within the aggregation, but I don't think it can see that date field to use it for sorting?
I've also tried adding a sub aggregation which includes the date field, but again, I cannot get it to sort on this field.
I'm calculating a hash for each document in my ETL on the way in to elastic. My data set contains a lot of duplication, so I'm trying to use the aggregation on the hash field to filter out duplicates and that works fine. I need the output from the aggregation to retain a date sort order so that I can work with the output in angular.
The documents are like this:
{
"_id": 123,
"_source": {
"hash": "01010101010101",
"user": "1",
"dateTime": "2001/2/20 09:12:21",
"action": "Login"
}
}
{
"_id": 124,
"_source": {
"hash": "01010101010101",
"user": "1",
"dateTime": "2001/2/20 09:12:21",
"action": "Login"
}
}
{
"_id": 132,
"_source": {
"hash": "0202020202020",
"user": "1",
"dateTime": "2001/2/20 09:20:43",
"action": "Logout"
}
}
{
"_id": 200,
"_source": {
"hash": "0303030303030303",
"user": "2",
"dateTime": "2001/2/22 09:32:14",
"action": "Login"
}
}
So I want to use an aggregation on the hash value to remove duplicates from my set and then render the response in date order.
My query:
{
"query": {
"filtered": {
"filter": {
"bool": {
"must": [
{
"term": {
"action": "Login"
}
}
]
}
}
}
},
"size": 0,
"aggs": {
"md5": {
"terms": {
"field": "hash",
"size": 0
},
"aggs": {
"byDate": {
"terms": {
"field": "dateTime",
"size": 0
}
}
}
}
}
}
Currently the output is ordered on the hash and I need it ordered on the date field within each hash bucket. Is that possible?
If the aggregation on "hash" is just for removing duplicates, it might work for you to simply aggregate on "dateTime" first, followed by the terms aggregation on "hash". For example:
GET my_index/test/_search
{
"query" : {
"filtered" : {
"filter" : {
"bool": {
"must" : [
{ "term": {"action":"Login"} }
]
}
}
}
},
"size": 0,
"aggs": {
"byDate" : {
"terms": {
"field" : "dateTime",
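"order": { "_term": "asc" }
},
"aggs": {
"byHash": {
"terms": {
"field": "hash"
}
}
}
}
}
}
This way, your results are sorted by "dateTime" first. Note (added in an edit) that the order must be specified explicitly on the byDate terms aggregation, because terms buckets are otherwise ordered by document count.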
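The effect of nesting the hash aggregation under the date aggregation can be sketched locally with the four sample documents from the question. This is a plain-Python approximation, not Elasticsearch code: the function name hashes_by_date is invented, and the date format string is an assumption based on how the sample dateTime values look.

```python
from datetime import datetime

docs = [
    {"hash": "01010101010101", "user": "1", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"hash": "01010101010101", "user": "1", "dateTime": "2001/2/20 09:12:21", "action": "Login"},
    {"hash": "0202020202020", "user": "1", "dateTime": "2001/2/20 09:20:43", "action": "Logout"},
    {"hash": "0303030303030303", "user": "2", "dateTime": "2001/2/22 09:32:14", "action": "Login"},
]


def hashes_by_date(documents, action):
    """Bucket by dateTime (ascending), then collect the distinct hashes per bucket."""
    buckets = {}
    for doc in documents:
        if doc["action"] != action:  # stands in for the term filter
            continue
        when = datetime.strptime(doc["dateTime"], "%Y/%m/%d %H:%M:%S")
        buckets.setdefault(when, set()).add(doc["hash"])
    # Outer buckets in ascending date order, duplicates collapsed per hash.
    return [(when, sorted(hashes))
            for when, hashes in sorted(buckets.items(), key=lambda kv: kv[0])]


result = hashes_by_date(docs, "Login")
for when, hashes in result:
    print(when.isoformat(), hashes)
```

The two duplicate Login documents collapse into a single hash under the first date bucket, and the buckets come out in date order.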

Resources