Elasticsearch group sorted combined query - sorting

I am faced with the following issue. I have a few sorted queries against a specific group of records in my index, similar to the one below, where the term1 matching values vary per query, while for term2 they remain static for all queries.
{
"query": {
"bool": {
"must": {
"terms": {
"term1": [ "val1", "val2" ]
}
},
"must_not": {
"terms": {
"term2": [ "val3", "val4" ]
}
}
}
},
"sort": [
{ "sort_term": "desc" }
],
"from": 0,
"size": 10
}
Right now, I'm performing all these queries separately and then combining and shuffling their results in code, which, as you can probably tell, is not ideal. I was wondering if there's a way to combine these queries in Elasticsearch while maintaining the group-based sorting.
I want to maintain the sort order of each individual query because the sort values are not uniform across groups, and I don't want results from some groups to be buried further down the result set.
The only solution I could think of would be to somehow re-process all records and compute a relative sorting value based on the sorting values of all the records in a given group, but these values change very regularly and the index has a lot of records, so that would probably be overkill.
Any ideas would be greatly appreciated!

You can use multiple terms in the sort array. If you combine the queries, I would sort by _type first, which prevents results from different groups from being mixed together. The sort section of your query would look like:
"sort": [
{ "_type": "desc" },
{ "sort_term_query1": "desc" },
{ "sort_term_query2": "desc" }
],
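As a rough illustration of the answer above, the per-group queries could be folded into one request body whose sort array starts with the grouping field. This is only a sketch: the function name `build_combined_query` and its parameters are hypothetical, and the field names are the placeholders used in the question and answer.

```python
import json


def build_combined_query(include_values, exclude_values, sort_fields, group_field="_type"):
    """Build one search body that sorts by the grouping field first.

    All names here are illustrative placeholders taken from the thread;
    adapt them to the real mapping.
    """
    # The grouping field comes first so results of one group stay together;
    # the per-group sort fields follow in the order given.
    sort = [{group_field: "desc"}] + [{field: "desc"} for field in sort_fields]
    return {
        "query": {
            "bool": {
                "must": {"terms": {"term1": include_values}},
                "must_not": {"terms": {"term2": exclude_values}},
            }
        },
        "sort": sort,
        "from": 0,
        "size": 10,
    }


body = build_combined_query(
    ["val1", "val2"],
    ["val3", "val4"],
    ["sort_term_query1", "sort_term_query2"],
)
print(json.dumps(body, indent=2))
```

Building the sort array from a list keeps the JSON valid (no trailing comma) however many sort fields are added.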

Related

Elasticsearch collapse not working with search_after with single sort field and PIT

I have an Elastic query that initially returns results. When I attempt the query again using search_after for paging, I am getting the error: Cannot use [collapse] in conjunction with [search_after] unless the search is sorted on the same field. Multiple sort fields are not allowed. So far as I can tell, I am sorting and collapsing using just a single field per_id. Is my query structured incorrectly or is there something else I need to do to get this query to run?
GET /_search
{
"query": {
"bool": {
"must": [{
"term": {
"pform": "iphone"
}
}]
}
},
"collapse": {
"field": "per_id"
},
"pit": {
"id": "g-ABCDDEFG12345678ABCDDEFG12345678==",
"keep_alive": "5m"
},
"sort": [
{"per_id": "asc"}
],
"search_after" : [
"ABCDDEFG12345678",
123456
]
}
I needed to exclude the tiebreaker from my search_after. It shouldn't cause duplicates because I am using a PIT and sorting on the collapse field, meaning duplicates shouldn't exist in my result set.
"search_after" : [
"ABCDDEFG12345678"
]
So I needed to remove the tiebreaker returned with the previous result before passing the sort values into the next request.
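A minimal sketch of that step in client code, assuming the last hit of the previous page carries a `sort` array with the tiebreaker appended after the requested sort values. The helper names `next_search_after` and `build_next_page` are hypothetical, and the sample values are the ones from the question.

```python
def next_search_after(last_hit, sort_field_count=1):
    """Keep only the values that belong to the explicitly requested sort fields."""
    return list(last_hit["sort"][:sort_field_count])


def build_next_page(body, last_hit):
    """Return a copy of the search body with search_after set for the next page."""
    next_body = dict(body)  # shallow copy, so the original body keeps no search_after
    next_body["search_after"] = next_search_after(last_hit, len(body["sort"]))
    return next_body


first_page_body = {
    "query": {"bool": {"must": [{"term": {"pform": "iphone"}}]}},
    "collapse": {"field": "per_id"},
    "sort": [{"per_id": "asc"}],
}
# Sort values of the last hit on the previous page: per_id plus the tiebreaker.
last_hit = {"sort": ["ABCDDEFG12345678", 123456]}

second_page_body = build_next_page(first_page_body, last_hit)
print(second_page_body["search_after"])
```

Slicing by the number of sort fields, rather than always taking the first element, keeps the helper from silently passing the tiebreaker along if the request shape changes.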

Aggregate results from several indices in Elasticsearch

I have an Elasticsearch query as shown below.
{
"query": {
"bool": {
"must": [
{
"match": {
"content": "Netherlands"
}
}
]
}
},
"sort": [
{
"file.created": {
"order": "asc"
}
}
]
}
When I query several indices and sort my results as shown below (in ascending or descending order), the results are in order, but only within each individual index. So I get the trial2 results in order, then the trial3 results in order.
http://localhost:9200/trial2,trial3/_doc/_search?pretty
What I am looking for, since I am querying several indices and sorting by date, is to get the results of all the indices in one ascending or descending order. If a document in trial3 is more recent than the one in trial2, it should appear higher regardless of the order of the indices in the query.
Kindly advise.
If you are working with multiple indices that share the same structure, it makes sense to create an alias that contains all of them. You can then run your queries against this virtual big index. The sorted results are then in the correct overall order, while each result document still references its original index.
POST /_aliases
{
"actions": [
{
"add": {
"index": "trial2",
"alias": "my-alias"
}
},
{
"add": {
"index": "trial3",
"alias": "my-alias"
}
}
]
}
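If the list of indices changes over time, the request body above can be generated instead of written by hand. A small sketch, where `build_alias_actions` is a hypothetical helper and the index and alias names are the ones from the answer:

```python
import json


def build_alias_actions(indices, alias):
    """Build the body for POST /_aliases that adds every index to one alias."""
    return {
        "actions": [{"add": {"index": name, "alias": alias}} for name in indices]
    }


alias_body = build_alias_actions(["trial2", "trial3"], "my-alias")
print(json.dumps(alias_body, indent=2))
```

The sorted search is then sent to the alias (for example `my-alias/_search`) instead of the comma-separated index list.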

Elastic search multi index query

I am building an app where I need to match users based on several parameters. I have two Elasticsearch indexes: one with the user's likes and dislikes, and one with some metadata about the user.
/user_profile/abc12345
{
"userId": "abc12345",
"likes": ["chocolate", "vanilla", "strawberry"]
}
/user_metadata/abc12345
{
"userId": "abc12345",
"seenBy": ["aaa123","bbb123", "ccc123"] // Potentially hundreds of thousands of userIds
}
I was advised to make these separate indexes and cross reference them, but how do I do that? For example I want to search for a user who likes chocolate and has NOT been seen by user abc123. How do I write this query?
If this is a frequent query in your use case, I would recommend merging the indices (always design your indices based on your queries).
Anyhow, a possible workaround for your current scenario is to exploit the fact that both indices store the user identifier in a field with the same name (userId). Then, you can (1) issue a boolean query over both indices, to match documents from one index based on the likes field, and documents from the other index based on the seenBy field, (2) use the terms bucket aggregation to get the list of unique userIds that satisfy your conditions.
For example:
GET user_*/_search
{
"size": 0,
"query": {
"bool": {
"should": [
{
"match": {
"likes": "chocolate"
}
},
{
"match": {
"seenBy": "abc123"
}
}
]
}
},
"aggs": {
"by_userId": {
"terms": {
"field": "userId.keyword",
"size": 100
}
}
}
}
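Note that the should clauses match a document from either index, so the userId buckets contain users who like chocolate as well as users who were seen by abc123. To answer the original question (likes chocolate and NOT seen by abc123), one more step is needed. A minimal client-side sketch, assuming the same query is run with a non-zero size so the hits are returned; the function name `unseen_users_who_like` and the sample hits are hypothetical:

```python
def unseen_users_who_like(hits, liked_item, viewer_id):
    """Return userIds that like `liked_item` and were not seen by `viewer_id`.

    `hits` is assumed to be the hits array of a search over both indices,
    where each hit has a `_source` containing `userId` and either
    `likes` (user_profile) or `seenBy` (user_metadata).
    """
    liked, seen = set(), set()
    for hit in hits:
        source = hit["_source"]
        if liked_item in source.get("likes", []):
            liked.add(source["userId"])
        if viewer_id in source.get("seenBy", []):
            seen.add(source["userId"])
    # Set difference: users who like the item minus users already seen.
    return sorted(liked - seen)


sample_hits = [
    {"_index": "user_profile", "_source": {"userId": "abc12345", "likes": ["chocolate", "vanilla"]}},
    {"_index": "user_profile", "_source": {"userId": "def67890", "likes": ["chocolate"]}},
    {"_index": "user_metadata", "_source": {"userId": "def67890", "seenBy": ["abc123", "bbb123"]}},
]
print(unseen_users_who_like(sample_hits, "chocolate", "abc123"))
```

With seenBy lists that may hold hundreds of thousands of IDs, this join is better done inside one index, which is why merging the indices is the recommended design.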

How are the documents ordered in Elasticsearch if the sort value for two documents is same?

I was working with products data, here: link
The search query that sorts by the keyword field tags using max mode is as follows.
GET product/_doc/_search
{
"size":100,"from":20,"_source":["tags", "name"],
"query": {
"match_all": {}
},
"sort": [
{"tags":{
"order":"desc",
"mode":"max"
}}
]
}
Some documents have the same sort value. I had read somewhere that if the sort value is the same, documents are ordered by the internal doc ID (_id). However, that does not seem to be the case in my results:
First comes _id: 961, followed by _id: 972 (fine). However, then comes _id: 114. I do not understand why the order looks random.
Help will be appreciated.
As you have already seen, it's random. To overcome this, you can add another sort field that is used when the values of the first field are equal. Since you want to use _id, the query would be as follows:
{
"size": 100,
"from": 20,
"_source": [
"tags",
"name"
],
"query": {
"match_all": {}
},
"sort": [
{
"tags": {
"order": "desc",
"mode": "max"
}
},
{
"_id": "asc"
}
]
}
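The effect of the second sort key can be illustrated outside Elasticsearch. The sketch below is only an analogy in plain Python, with made-up documents: it orders by the largest tag (descending) and breaks ties by `_id` (ascending, compared as strings), so the result no longer depends on the order in which the documents arrive.

```python
import random


def sort_with_tiebreaker(docs):
    """Order by max tag descending, then by _id ascending for equal tags."""
    # Python's sort is stable, so sorting by the tiebreaker first and by the
    # primary key second yields a fully determined order.
    by_id = sorted(docs, key=lambda doc: doc["_id"])
    return sorted(by_id, key=lambda doc: max(doc["tags"]), reverse=True)


docs = [
    {"_id": "961", "tags": ["wine"]},
    {"_id": "972", "tags": ["wine"]},
    {"_id": "114", "tags": ["wine"]},
    {"_id": "5", "tags": ["alcohol", "zebra"]},
]

shuffled = docs[:]
random.shuffle(shuffled)  # simulate an arbitrary arrival order

print([doc["_id"] for doc in sort_with_tiebreaker(shuffled)])
```

Because `_id` values are strings, "114" sorts before "961" here; a numeric tiebreaker would need a numeric field.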

How do I select the top term buckets based on a rescore function in Elasticsearch

Consider the following query for Elasticsearch 5.6:
{
"size": 0,
"query": {
"match_all": {}
},
"rescore": [
{
"window_size": 10000,
"query": {
"rescore_query": {
"function_score": {
"boost_mode": "replace",
"script_score": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
},
"query_weight": 0,
"rescore_query_weight": 1
}
}
],
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1
}
},
"top_score": {
"max": {
"script": {
"inline": "_score"
}
}
}
}
}
}
}
This is a simplified version where the real query has a more complex main query and the rescore function is far more intensive.
Let me explain its purpose first, in case I'm about to spend 1000 hours developing a pen that writes in space when a pencil would actually solve my problem. I'm performing a fast initial query, then rescoring the top results with a much more intensive function. From those results I want to show the top distinct values, i.e. no two results should have the same identical_id. If there's a better way to do this, I'd also consider that an answer.
I expected a query like this would order results by the rescore query, group all the results that had the same identical_id and display the top hit for each such distinct group. I also assumed that since I'm ordering those term aggregation buckets by the max parent _score, they would be ordered to reflect the best result they contain as determined from the original rescore query.
The reality is that the term buckets are ordered by the maximum query score and not the rescore query score. Strangely the top hits within the buckets do seem to use the rescore.
Is there a better way to achieve the end result that I want, or some way I can fix this query to work the way I expect it to?
From the documentation:
The query rescorer executes a second query only on the Top-K results returned by the query and post_filter phases. The number of docs which will be examined on each shard can be controlled by the window_size parameter, which defaults to 10.
As the rescore query kicks in after the post_filter phase, I assume the term aggregation buckets are already fixed.
I have no idea on how you can combine rescore and aggregations. Sorry :(
I think I have a pretty great solution to this problem, but I'll let the bounty continue to expiration in case someone comes up with a better approach.
{
"size": 0,
"query": {
"match_all": {}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 10000
},
"aggs": {
"distinct": {
"terms": {
"field": "identical_id",
"order": {
"top_score": "desc"
}
},
"aggs": {
"best_unique_result": {
"top_hits": {
"size": 1,
"sort": [
{
"_script": {
"type": "number",
"script": {
"source": "doc['topic_score'].value"
},
"order": "desc"
}
}
]
}
},
"top_score": {
"max": {
"script": {
"source": "doc['topic_score'].value"
}
}
}
}
}
}
}
}
}
The sampler aggregation takes the top N hits per shard from the core query and runs the aggregations over those. Then, in the max aggregator that defines the bucket order, I use the exact same script as the one I use to pick a top hit from the bucket. Now the buckets and the top hits run over the same top N set of items, and the buckets are ordered by the max of the same score, generated from the same script. Unfortunately, I still need to run the script twice: once to order the buckets and once to pick a top hit within each bucket. You could use the rescore instead for the top hits ordering, but either way it has to run twice, and I found it was faster as a sort script than as a rescore.
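Since the approach depends on the bucket order and the top hit using exactly the same script, one way to keep the two from drifting apart is to generate the body from a single script string. A minimal sketch; the function name `build_distinct_top_query` and its parameters are hypothetical, while the structure mirrors the query above:

```python
import json


def build_distinct_top_query(score_script, shard_size=10000, group_field="identical_id"):
    """Build the sampler + terms query with one script used in both places."""
    script = {"source": score_script}
    return {
        "size": 0,
        "query": {"match_all": {}},
        "aggs": {
            "sample": {
                # Restrict the aggregations to the top hits per shard.
                "sampler": {"shard_size": shard_size},
                "aggs": {
                    "distinct": {
                        "terms": {
                            "field": group_field,
                            "order": {"top_score": "desc"},
                        },
                        "aggs": {
                            # Picks the best document inside each bucket.
                            "best_unique_result": {
                                "top_hits": {
                                    "size": 1,
                                    "sort": [
                                        {
                                            "_script": {
                                                "type": "number",
                                                "script": dict(script),
                                                "order": "desc",
                                            }
                                        }
                                    ],
                                }
                            },
                            # Orders the buckets by the same score.
                            "top_score": {"max": {"script": dict(script)}},
                        },
                    }
                },
            }
        },
    }


query = build_distinct_top_query("doc['topic_score'].value")
print(json.dumps(query, indent=2))
```

Swapping in the real, more expensive scoring script then only requires changing one argument.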

Resources