Filter elasticsearch results to contain only unique documents based on one field value - elasticsearch

All my documents have a uid field with an ID that links the document to a user. There are multiple documents with the same uid.
I want to perform a search over all the documents returning only the highest scoring document per unique uid.
The query selecting the relevant documents is a simple multi_match query.

You need a top_hits aggregation.
And for your specific case:
{
"query": {
"multi_match": {
...
}
},
"aggs": {
"top-uids": {
"terms": {
"field": "uid"
},
"aggs": {
"top_uids_hits": {
"top_hits": {
"sort": [
{
"_score": {
"order": "desc"
}
}
],
"size": 1
}
}
}
}
}
}
The query above runs your multi_match query and groups the results by uid with a terms aggregation. For each uid bucket, top_hits sorts the bucket's documents by _score in descending order and returns only the first one.
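With this approach the documents come back inside the aggregation buckets rather than in the usual hits list, so the client has to walk the buckets. A minimal Python sketch, assuming the response has already been parsed into a dict and that the aggregation names match the request above (top-uids, top_uids_hits). The helper name best_hit_per_uid and the sample data are made up for illustration:

```python
def best_hit_per_uid(response):
    """Map each uid bucket key to its single top-scoring hit."""
    best = {}
    for bucket in response["aggregations"]["top-uids"]["buckets"]:
        hits = bucket["top_uids_hits"]["hits"]["hits"]
        if hits:  # top_hits was requested with size 1, so take the first entry
            best[bucket["key"]] = hits[0]
    return best


# Hand-written sample in the shape of a search response (not real output).
sample_response = {
    "aggregations": {
        "top-uids": {
            "buckets": [
                {
                    "key": "u1",
                    "doc_count": 3,
                    "top_uids_hits": {
                        "hits": {"hits": [{"_id": "a", "_score": 2.5}]}
                    },
                },
                {
                    "key": "u2",
                    "doc_count": 1,
                    "top_uids_hits": {
                        "hits": {"hits": [{"_id": "b", "_score": 1.1}]}
                    },
                },
            ]
        }
    }
}

for uid, hit in best_hit_per_uid(sample_response).items():
    print(uid, hit["_id"], hit["_score"])
```

Keep in mind that a terms aggregation returns only 10 buckets unless you set its own "size", so with many distinct uid values you would need to raise it.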

Elasticsearch 5.3 added support for field collapsing. You should be able to do something like:
GET /_search
{
"query": {
"multi_match" : {
"query": "this is a test",
"fields": [ "subject", "message", "uid" ]
}
},
"collapse" : {
"field" : "uid"
},
"size": 20,
"from": 100
}
The benefit of using field collapsing instead of a top hits aggregation is that you can use pagination with field collapsing.
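To make the pagination point concrete, here is a small Python sketch that builds the same request body for an arbitrary page. The function name collapse_search_body and its parameters are hypothetical; only the body keys (query, collapse, from, size) come from the request above:

```python
def collapse_search_body(text, fields, page, page_size=20, collapse_field="uid"):
    """Build a search body that returns one hit per collapse_field value."""
    if page < 0 or page_size <= 0:
        raise ValueError("page must be >= 0 and page_size must be > 0")
    return {
        "query": {"multi_match": {"query": text, "fields": list(fields)}},
        "collapse": {"field": collapse_field},
        # Pages are zero-based here: page 5 with 20 per page starts at 100.
        "from": page * page_size,
        "size": page_size,
    }


body = collapse_search_body("this is a test", ["subject", "message"], page=5)
print(body["from"], body["size"])
```

One caveat: the total hit count in a collapsed response still counts the matching documents before collapsing, so it cannot be used directly to work out how many pages of unique uid values exist.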

Related

How to paginate sorted data (with terms aggregation) using composite aggregation?

How can I write a pipeline aggregation to paginate sorted data, where the sorting is done by a terms aggregation ordered on its sub-aggregation?
GET index_name/_search
{
"query":{<some querying>},
"aggs": {
"pagination": {
"composite": {
"sources": [
{
"grouping": {
"terms": {
"field": "field_name.keyword",
"order": "desc"
}
}
}
]
},
"aggs": {
"results": {
"terms": {
"field": "field_name.keyword",
"order": {
"sub_aggregation": "desc"
}
},
"aggs": {
"sub_aggregation": {
"filter": {
"term": {
"field_name": "value"
}
}
}
}
}
}
}
}
}
The main problem is combining the following two sub-problems:
P1. Sorting the data by a selected key. For this I used a terms aggregation whose order refers to its sub-aggregation.
P2. Paginating the sorted data. For this I used a composite aggregation with a terms source.
When the composite aggregation is a sub-aggregation of the terms aggregation, I get the following error:
[composite] aggregation cannot be used with a parent aggregation of type: [TermsAggregatorFactory]
When I nest them the other way round, I get the paginated terms data (P2) and the sorted data (P1) in separate buckets.
How can I combine the two?
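One client-side workaround, offered only as a rough sketch and not as a server-side answer: run the terms aggregation on its own (without the composite wrapper), fetch all buckets, then sort and slice them in the application. The Python below assumes the terms aggregation sits at the top level under the name results, that its "size" is large enough to return every bucket, and that sub_aggregation is the filter aggregation from the question. The function name and sample data are invented for illustration:

```python
def page_of_buckets(response, page, page_size,
                    agg_name="results", sub_agg_name="sub_aggregation"):
    """Sort terms buckets by the sub-aggregation's doc_count, then slice a page."""
    buckets = response["aggregations"][agg_name]["buckets"]
    ordered = sorted(
        buckets,
        key=lambda bucket: bucket[sub_agg_name]["doc_count"],
        reverse=True,  # highest sub-aggregation count first (P1)
    )
    start = page * page_size  # zero-based page index (P2)
    return ordered[start:start + page_size]


# Hand-written sample in the shape of a terms + filter response (not real output).
sample_response = {
    "aggregations": {
        "results": {
            "buckets": [
                {"key": "a", "doc_count": 9, "sub_aggregation": {"doc_count": 2}},
                {"key": "b", "doc_count": 7, "sub_aggregation": {"doc_count": 5}},
                {"key": "c", "doc_count": 4, "sub_aggregation": {"doc_count": 3}},
            ]
        }
    }
}

print([b["key"] for b in page_of_buckets(sample_response, page=0, page_size=2)])
```

This only scales while the number of distinct keys is small enough to fetch in one response.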

Elasticsearch: how do I query (search) within a single document?

Assuming the index is named index and document 1 has the id "1",
how can I query within that single document?
Something like this:
GET index/_search
{
"query": {
"id": "1",
"terms": ["is this text in document 1?"]
}
}
or
GET index/_doc/1/_search
{
...
}
As far as I have found,
GET test/_doc/_search
{
"query": {
"terms" : {
"_id" : ["1"]
}
}
}
This returns the document with id "1", but I cannot run any further queries against it.
The reason I want to query inside a single document is that my app has a live-news view:
once a news item is retrieved from the server, I want to search it in Elasticsearch for keyword highlighting and spam filtering.
You have to compose your query with a bool query.
The best approach is to put the id query under filter, because filter clauses have no effect on scoring. You can then add queries under must, must_not and should as needed:
GET index/_search
{
"from": 0,
"size": 10,
"query": {
"bool": {
"must": [
{
"term": {
"field": "value"
}
}
],
"must_not": [],
"should": [],
"filter": [
{
"terms": {"_id": ["1"]}
}
]
}
}
}
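The same pattern can be wrapped in a small helper so that any query can be restricted to one document. This is a Python sketch under assumptions: the function name search_within_document, the match query and the text field are placeholders, and the optional highlight section is included only because the question mentions keyword highlighting:

```python
def search_within_document(doc_id, scoring_query, highlight_field=None):
    """Restrict scoring_query to a single document via a non-scoring _id filter."""
    body = {
        "query": {
            "bool": {
                "must": [scoring_query],
                # filter clauses narrow the result set without affecting _score
                "filter": [{"terms": {"_id": [doc_id]}}],
            }
        }
    }
    if highlight_field is not None:
        body["highlight"] = {"fields": {highlight_field: {}}}
    return body


# "text" is a hypothetical field name used only for this example.
body = search_within_document(
    "1", {"match": {"text": "breaking news"}}, highlight_field="text"
)
print(body)
```

If the response contains no hits, the document with that id did not match the inner query.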

Elasticsearch ranking aggregation with multiple terms query

tl;dr: Want to rank aggregations based on whether bucket key has used either of the search terms.
I have two indices, documents and recommendations, with the following mappings:
Documents:
{
"id": string,
"document_text" : string,
"author" : { "name": string }
...other fields
}
Recommendations:
{
"id": string,
"recommendation_text" : string,
"author" : { "name": string }
...other fields
}
The problem I am solving is to have top authors for query terms.
This works quite well with multimatch for a single query term like this:
{
"size": 0,
"query": {
"multi_match": {
"query": "science",
"fields": [
"document_text",
"recommendation_text"
],
"type": "phrase"
}
},
"aggs": {
"search-authors": {
"terms": {
"field": "author.name.keyword",
"size": 50
},
"aggs": {
"top-docs": {
"top_hits": {
"size": 100
}
}
}
}
}
}
But when I have multiple keywords, say zoology and botany, I want the aggregation ranking to place authors who have talked about both zoology and botany higher than those who have used only one of them.
Combining multiple multi_match queries in a bool query doesn't help, since this isn't exactly an and/or situation.

How to sort elasticsearch results based on number of collapsed items?

I'm using a query with collapse in order to gather some documents under a certain person, yet I wish to sort the results based on the number of documents in which the search found a match. This is my query:
GET documents/_search
{
"_source": {
"includes": [
"text"
]
},
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"collapse": {
"field": "person_id",
"inner_hits": {
"name": "top_mathing_docs",
"_source": {
"includes": [
"doc_year",
"text"
]
}
}
}
}
Any suggestions?
Thanks
If I understand correctly, you want to sort the parent documents by the number of inner_hits, i.e. the number of matching documents per person_id.
That means the _score of the parent documents in the result doesn't matter.
The only way I've found to do this is the Top Hits Aggregation field collapse example; below is what your query would look like.
Aggregation Query Field Collapse Example:
POST <your_index_name>/_search
{
"size":0,
"query": {
"query_string": {
"fields": [
"text"
],
"query": "some text"
}
},
"aggs": {
"top_person_ids": {
"terms": {
"field": "person_id"
},
"aggs": {
"top_tags_hits": {
"top_hits": {
"size": 10
}
}
}
}
}
}
Note that I'm assuming person_id is of type keyword or a numeric type.
Also, if you look at the query closely, I've set "size": 0, which means only the aggregation result is returned.
Another note: the above aggregation has nothing to do with the Field Collapse feature of the search request that you posted in the question. It's just that, with this aggregation, your result can be formatted in a similar way.
Let me know if this helps!
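A terms aggregation orders its buckets by doc_count, descending, by default, so the bucket order in the response is already the ranking by number of matching documents. A short Python sketch for reading it, assuming a parsed response dict and the aggregation name top_person_ids from the query above; the function name and the sample values are made up:

```python
def persons_by_match_count(response, agg_name="top_person_ids"):
    """Return (person_id, number of matching documents) in bucket order."""
    buckets = response["aggregations"][agg_name]["buckets"]
    return [(bucket["key"], bucket["doc_count"]) for bucket in buckets]


# Hand-written sample in the shape of the aggregation response (not real output).
sample_response = {
    "aggregations": {
        "top_person_ids": {
            "buckets": [
                {"key": 42, "doc_count": 7, "top_tags_hits": {"hits": {"hits": []}}},
                {"key": 17, "doc_count": 3, "top_tags_hits": {"hits": {"hits": []}}},
            ]
        }
    }
}

for person_id, count in persons_by_match_count(sample_response):
    print(person_id, count)
```

The matching documents for each person are available under top_tags_hits in the same bucket.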

Elasticsearch top_hits aggregation vs latest document

I am trying to get a list of users who have "connect" as their last activity. Ideally, I want this as a metric viz or a data table in Kibana, showing the number of users whose last activity was a connect and the list of them, respectively. I have, however, given up on being able to do this in Kibana. I can get something similar directly from Elasticsearch using a terms aggregation followed by top_hits, as below. But the problem is that, even though I am sorting the top_hits by #timestamp, the resulting document is NOT the most recent.
{
"size" : 0,
"sort": { "#timestamp": {"order": "desc"} },
"aggs" : {
"by_user" : {
"terms" : {
"field" : "fields.username.keyword",
"size" : 1
},
"aggs": {
"last_message": {
"top_hits": {
"sort": [
{
"#timestamp": {
"order": "desc"
}
}
],
"_source": {
"includes": ["fields.username.keyword", "#timestamp", "status"]
},
"size": 1
}
}
}
}
}
}
Is there a way to do this directly in Kibana?
How can I make sure top_hits gives me the latest results, rather than the "most relevant"?
I think what you want is field collapsing, which is faster than an aggregation.
Something like this should work for your use case:
GET my-index/_search
{
"query": {
"match_all": {}
},
"collapse": {
"field": "fields.username.keyword"
},
"sort": [
{
"#timestamp": {
"order": "desc"
}
}
]
}
I might be missing something, but I don't think Kibana supports this at the moment.
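Since the collapsed search returns the latest document per user, the remaining step (keeping only users whose latest status is "connect") can be done on the client. A minimal Python sketch, assuming the response is a parsed dict and that each _source holds the username under fields.username and the status under status. That layout is inferred from the field names in the question and may differ in your index:

```python
def users_whose_last_status_is(response, wanted="connect"):
    """Return usernames whose collapsed (latest) document has the wanted status."""
    users = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        if source.get("status") == wanted:
            users.append(source["fields"]["username"])
    return users


# Hand-written sample in the shape of a collapsed search response (not real output).
sample_response = {
    "hits": {
        "hits": [
            {"_source": {"fields": {"username": "alice"}, "status": "connect"}},
            {"_source": {"fields": {"username": "bob"}, "status": "disconnect"}},
            {"_source": {"fields": {"username": "carol"}, "status": "connect"}},
        ]
    }
}

connected = users_whose_last_status_is(sample_response)
print(len(connected), connected)
```

A search returns 10 hits by default, so set "size" in the request to cover the number of users you expect.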
