Elasticsearch include other fields in top level aggregation - elasticsearch

My indexed documents are as follows:
{
"user": {
"email": "test#test.com",
"firstName": "test",
"lastName": "test"
},
...
"category": "test_category"
}
Currently I have an aggregation which counts documents by the user's email and then a sub aggregation to count categories for each user:
"aggs": {
"users": {
"terms": {
"field": "user.email",
"order": {
"_count": "desc"
}
},
"aggs": {
"categories": {
"terms": {
"field": "category",
"order": {
"_count": "desc"
}
}
}
}
}
}
I am trying to include the user's first and last name in the buckets generated by the top aggregation, while still getting the same results from the categories sub aggregation. I've tried including the top_hits aggregation, but I didn't have any luck getting the results I want.
Any advice? Thanks!
EDIT:
Let me rephrase. I actually did get the desired result in terms of user data with the top_hits aggregation, I just don't know how to properly include it in my original aggregation so that the categories sub aggregation still gives me the same result. I tried the following top_hits aggregation:
"aggs": {
"user": {
"top_hits": {
"size": 1,
"_source": {
"include": ["user"]
}
}
}
}
I want to have the user data in the top level agg buckets and then still have the aggregation by category below that.

If I understand correctly, each email maps to exactly one first name and last name (a bijection).
So you could retrieve them with a custom script on these fields, and split the bucket keys on the client side using "_" or whatever separator you choose:
aggs: {
users: {
terms: {
script: 'doc["user.email"].value + "_" + doc["user.firstName"].value + "_" + doc["user.lastName"].value'
}
}
}
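Another option that keeps the categories sub aggregation untouched is to put the top_hits aggregation next to it, as a second sub aggregation of users. Below is a minimal sketch in Python that builds such a request body and reads the names back out of a response. The aggregation name user_info, the helper names and the sample response are made up for illustration. The question spells the source filter key include; as far as I know newer Elasticsearch versions expect includes, so adjust it to your version.

```python
def build_users_aggregation():
    """Terms aggregation on user.email with two sibling sub aggregations."""
    return {
        "size": 0,
        "aggs": {
            "users": {
                "terms": {"field": "user.email", "order": {"_count": "desc"}},
                "aggs": {
                    # Unchanged from the question.
                    "categories": {
                        "terms": {"field": "category", "order": {"_count": "desc"}}
                    },
                    # Hypothetical name; one sample document per user bucket.
                    "user_info": {
                        "top_hits": {"size": 1, "_source": {"includes": ["user"]}}
                    },
                },
            }
        },
    }


def summarize_users(response):
    """Flatten each user bucket into one row with names and category counts."""
    rows = []
    for bucket in response["aggregations"]["users"]["buckets"]:
        user = bucket["user_info"]["hits"]["hits"][0]["_source"]["user"]
        rows.append({
            "email": bucket["key"],
            "firstName": user["firstName"],
            "lastName": user["lastName"],
            "count": bucket["doc_count"],
            "categories": {
                c["key"]: c["doc_count"] for c in bucket["categories"]["buckets"]
            },
        })
    return rows


# Hand-written sample in the shape of a terms + top_hits response.
sample_response = {
    "aggregations": {"users": {"buckets": [{
        "key": "test#test.com",
        "doc_count": 3,
        "user_info": {"hits": {"hits": [{"_source": {"user": {
            "email": "test#test.com", "firstName": "test", "lastName": "test"}}}]}},
        "categories": {"buckets": [{"key": "test_category", "doc_count": 3}]},
    }]}}
}
rows = summarize_users(sample_response)
```

Because both sub aggregations sit under the same users bucket, the categories counts are computed exactly as before.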

Related

Elasticsearch ranking aggregation with multiple terms query

tl;dr: Want to rank aggregations based on whether bucket key has used either of the search terms.
I have two indices, documents and recommendations, with the following mappings:
Documents:
{
"id": string,
"document_text" : string,
"author" : { "name": string }
...other fields
}
Recommendations:
{
"id": string,
"recommendation_text" : string,
"author" : { "name": string }
...other fields
}
The problem I am solving is to have top authors for query terms.
This works quite well with multimatch for a single query term like this:
{
"size": 0,
"query": {
"multi_match": {
"query": "science",
"fields": [
"document_text",
"recommendation_text"
],
"type": "phrase"
}
},
"aggs": {
"search-authors": {
"terms": {
"field": "author.name.keyword",
"size": 50
},
"aggs": {
"top-docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But when I have multiple keywords, say zoology and botany, I want the aggregation ranking to place authors who have talked about both zoology and botany higher than those who have used only one of them.
Having multiple multi_match queries inside a bool query doesn't help, since this isn't exactly an and/or situation.
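One possible direction, sketched here under assumptions rather than offered as a tested solution: add a filters sub aggregation with one named filter per keyword under each author bucket, then re-rank the buckets on the client by how many keywords each author matched. The aggregation name per-term, the helper names and the sample response are hypothetical. Note that this only re-ranks the authors that the terms aggregation already returned (the top 50 by document count in the query above).

```python
FIELDS = ["document_text", "recommendation_text"]


def term_query(term):
    """Same phrase multi_match as in the question, for a single keyword."""
    return {"multi_match": {"query": term, "fields": FIELDS, "type": "phrase"}}


def build_request(terms, size=50):
    """Match any keyword, then count matches per keyword inside each author."""
    return {
        "size": 0,
        "query": {"bool": {
            "should": [term_query(t) for t in terms],
            "minimum_should_match": 1,
        }},
        "aggs": {
            "search-authors": {
                "terms": {"field": "author.name.keyword", "size": size},
                "aggs": {
                    # Hypothetical name; one named filter bucket per keyword.
                    "per-term": {
                        "filters": {"filters": {t: term_query(t) for t in terms}}
                    }
                },
            }
        },
    }


def rank_authors(response):
    """Sort authors by number of keywords matched, then by document count."""
    ranked = []
    for bucket in response["aggregations"]["search-authors"]["buckets"]:
        per_term = bucket["per-term"]["buckets"]
        matched = sum(1 for b in per_term.values() if b["doc_count"] > 0)
        ranked.append((bucket["key"], matched, bucket["doc_count"]))
    ranked.sort(key=lambda r: (-r[1], -r[2]))
    return ranked


# Hand-written sample: bob has more documents, alice covers both keywords.
sample_response = {"aggregations": {"search-authors": {"buckets": [
    {"key": "bob", "doc_count": 5, "per-term": {"buckets": {
        "zoology": {"doc_count": 5}, "botany": {"doc_count": 0}}}},
    {"key": "alice", "doc_count": 2, "per-term": {"buckets": {
        "zoology": {"doc_count": 1}, "botany": {"doc_count": 1}}}},
]}}}
ranking = rank_authors(sample_response)
```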

Excluding inner hits from top hits aggregation with source filter

In my query, I am using the inner_hits to return the list of nested objects that match my query.
I then add a terms aggregation on the categoryId of my documents, and then a top_hits aggregation to get the display name for that category.
"aggs": {
"category": {
"terms": {
"field": "categoryId",
"size": 100
},
"aggs": {
"category_value": {
"top_hits": {
"size": 1,
"_source": {
"includes": "categoryName"
}
}
}
}
}
}
Now, when I look at the aggregation buckets, I do get a _source document with only the categoryName property, but I also get the entire inner_hits collection:
{
...
"_source": {
"categoryName": "Armchairs"
},
"inner_hits": {
"my_inner_hits": {
"hits": {
"total": 260,
"max_score": null,
"hits": [{
...
"_source": {
//nested document here
}
}
]
}
}
}
}
Is there a way to not include the inner_hits data in a top_hits aggregation?
Since you only need a single field, I suggest getting rid of the top_hits aggregation and using another terms aggregation for the name:
{
...
"aggs": {
"category": {
"terms": {
"field": "categoryId",
"size": 100
},
"aggs": {
"category_value": {
"terms": {
"field": "categoryName",
"size": 1
}
}
}
}
}
}
That will also be a little bit more efficient.
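With the terms-inside-terms variant, the display name is simply the key of the single inner bucket. A small sketch of reading it on the client side, in Python; the helper name and the sample response (including the category id 12) are made up for illustration:

```python
def category_names(response):
    """Map each categoryId bucket key to the key of its single inner bucket."""
    names = {}
    for bucket in response["aggregations"]["category"]["buckets"]:
        inner = bucket["category_value"]["buckets"]
        # The inner terms aggregation has size 1, so at most one bucket exists.
        names[bucket["key"]] = inner[0]["key"] if inner else None
    return names


# Hand-written sample in the shape of a nested terms response.
sample_response = {"aggregations": {"category": {"buckets": [
    {"key": 12, "doc_count": 260, "category_value": {"buckets": [
        {"key": "Armchairs", "doc_count": 260}]}},
]}}}
names = category_names(sample_response)
```

This assumes categoryName is mapped as a field that a terms aggregation can run on (for example a keyword or not_analyzed string).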
UPDATE:
Another way, if you want to keep using terms/top_hits, is to leverage response filtering and only return what you need. For instance, appending this to your URL will make sure that you won't find any inner hits inside your aggregation:
?filter_path=hits.hits,aggregations.**.key,aggregations.**.doc_count,aggregations.**.hits.hits.hits._source

Get all documents from elastic search with a field having same value

Say I have documents of type Order and they have a field bulkOrderId. The bulkOrderId represents a group (or bulk) of orders issued at once. All orders in a group have the same id, like this:
Order {
"bulkOrderId": "bulkOrder:12345678"
}
The id is unique and is generated using UUID.
How do I find groups of orders with the same bulkOrderId from elasticsearch when the bulkOrderId is not known? Is it possible?
You can achieve that using a terms aggregation and a top_hits sub-aggregation, like this:
{
"query": {
"match_all": {}
},
"aggs": {
"bulks": {
"terms": {
"field": "bulkOrderId",
"size": 10
},
"aggs": {
"orders": {
"top_hits": {
"size": 10
}
}
}
}
}
}
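Each bucket of the bulks aggregation is then one group, and the orders sub aggregation holds its documents. A minimal sketch of collecting the groups from the response, in Python; the helper name and the sample response are made up for illustration. Keep in mind that the two "size": 10 settings cap the result at 10 groups and 10 orders per group, so raise them if your groups are larger.

```python
def orders_by_bulk(response):
    """Return {bulkOrderId: [order _source, ...]} from the aggregation response."""
    groups = {}
    for bucket in response["aggregations"]["bulks"]["buckets"]:
        hits = bucket["orders"]["hits"]["hits"]
        groups[bucket["key"]] = [hit["_source"] for hit in hits]
    return groups


# Hand-written sample in the shape of a terms + top_hits response.
sample_response = {"aggregations": {"bulks": {"buckets": [
    {"key": "bulkOrder:12345678", "doc_count": 2, "orders": {"hits": {"hits": [
        {"_source": {"bulkOrderId": "bulkOrder:12345678"}},
        {"_source": {"bulkOrderId": "bulkOrder:12345678"}},
    ]}}},
]}}}
groups = orders_by_bulk(sample_response)
```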

elasticsearch 1.7 group by bucket with multi field concat

I have an aggregate statement that groups by firstname and buckets them with necessary fields. But I want to group by concatenation of firstname+lastname. I do not want to use nested aggregates like group by firstname and then group by lastname. How do I change the field to include a string concatenation of multiple fields?
"aggs": {
"by_name": {
"terms": {
"field": "firstname"
},
"aggs": {
"source": {
"top_hits": {
"_source": {
"include": [
"id","name"
]
}
}
}
}
}
}
In ES 1.7, you can use a script with the terms aggregation:
GET _search
{
"size": 20,
"aggs": {
"con": {
"terms": {
"script": "doc['firstName'].value + doc['lastName'].value"
}
}
}
}
For the current version, i.e. ES 5.2, there is the bucket script aggregation for the same purpose.
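One caveat with the plain concatenation in the script above: without a separator, two different name pairs can collapse into the same bucket key. The snippet below is plain Python, not Elasticsearch script code, and only illustrates the collision; the names and the "|" separator are arbitrary choices, and the separator is assumed not to occur in the names themselves.

```python
def key_without_separator(first, last):
    """Mirrors firstName + lastName with nothing in between."""
    return first + last


def key_with_separator(first, last, sep="|"):
    """Same concatenation, but with a separator between the two parts."""
    return first + sep + last


a = ("Ann", "Alee")
b = ("AnnA", "lee")

# Both pairs produce "AnnAlee" when no separator is used.
collides = key_without_separator(*a) == key_without_separator(*b)
# With a separator the keys stay distinct.
distinct = key_with_separator(*a) != key_with_separator(*b)
```

The same idea carries over to the script: put a separator between the two field values and split on it when reading the bucket keys.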

Filter elasticsearch results to contain only unique documents based on one field value

All my documents have a uid field with an ID that links the document to a user. There are multiple documents with the same uid.
I want to perform a search over all the documents returning only the highest scoring document per unique uid.
The query selecting the relevant documents is a simple multi_match query.
You need a top_hits aggregation.
And for your specific case:
{
"query": {
"multi_match": {
...
}
},
"aggs": {
"top-uids": {
"terms": {
"field": "uid"
},
"aggs": {
"top_uids_hits": {
"top_hits": {
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
The query above runs your multi_match query and aggregates the results by uid. For each uid bucket it returns only one result: the top document after the documents in that bucket have been sorted by _score in descending order.
Elasticsearch 5.3 added support for field collapsing. You should be able to do something like:
GET /_search
{
"query": {
"multi_match" : {
"query": "this is a test",
"fields": [ "subject", "message", "uid" ]
}
},
"collapse" : {
"field" : "uid"
},
"size": 20,
"from": 100
}
The benefit of using field collapsing instead of a top hits aggregation is that you can use pagination with field collapsing.
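To make the pagination point concrete, here is a small Python sketch that builds the collapse request body for a given page. The function name and the zero-based page convention are assumptions for illustration; only the query, collapse, size and from keys come from the example above.

```python
def collapsed_search_body(query, field, page, page_size=20):
    """Build a field collapsing request for a zero-based page number."""
    return {
        "query": query,
        "collapse": {"field": field},
        "size": page_size,
        # Offset into the collapsed result list.
        "from": page * page_size,
    }


query = {
    "multi_match": {
        "query": "this is a test",
        "fields": ["subject", "message", "uid"],
    }
}
# Page 5 with 20 results per page gives the same size/from as the example.
body = collapsed_search_body(query, "uid", page=5)
```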

Resources