Elasticsearch: How to search, sort, limit the results then sort again?

This isn't about multi-level sorting.
I need my results first selected by distance, limited to 50, then those 50 sorted by price.
select *
from
(
select top 50 * from mytable order by distance asc
) as nearest
order by price asc
Essentially, the second sort throws away the ordering of the inner sort, but the inner sort is used to home in on the top 50 results.
The other answers I've seen for this sort of question look at second-level sorting, which is not what I'm after.
BTW: I've looked at aggregations (top N results), but I'm not sure I can apply a sort to the aggregation result. I also looked at rescore, but I don't know where to put my 'sorts'.

A top hits aggregation will allow you to sort on a separate field (in your case, price) from the main query sort (on distance). See the documentation here for how to specify sorting in the top hits agg.
It'll look a little like this (which assumes distance is a double type; if it's a geo-location type, use the documentation provided by Volodymyr Bilyachat):
{
"sort":[
{
"distance":"asc"
}
],
"query":{
"match_all":{}
},
"size":50,
"aggs":{
"top_price_hits":{
"top_hits":{
"sort":[
{
"price":{
"order":"asc"
}
}
],
"size":50
}
}
}
}
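However, if you only want 50 results from your primary query, why not just sort them client side in the application? That would be a better approach, as using a top hits aggregation for a secondary sort is a slight abuse of its purpose. Also bear in mind that aggregations run over every document matching the query, not just the 50 returned hits, so the aggregation above sorts all matches by price rather than only the 50 nearest.
The in-application approach would be more robust.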
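A minimal sketch of the in-application approach, assuming the search was already run with "size": 50 and a distance sort, and that each hit carries a numeric price in its _source. The function name and the sample hits are hypothetical and only illustrate the shape of the data.

```python
def nearest_then_cheapest(hits, limit=50):
    """Keep the first `limit` hits (already ordered by distance), then re-sort by price."""
    nearest = hits[:limit]
    return sorted(nearest, key=lambda hit: hit["_source"]["price"])


# Hypothetical hits, in the order a distance-sorted search would return them.
hits = [
    {"_id": "a", "_source": {"distance": 1.0, "price": 30}},
    {"_id": "b", "_source": {"distance": 2.0, "price": 10}},
    {"_id": "c", "_source": {"distance": 3.0, "price": 20}},
    {"_id": "d", "_source": {"distance": 4.0, "price": 5}},
]

# With limit=3 the cheapest document overall ("d") is excluded because it is not among the nearest 3.
result_ids = [hit["_id"] for hit in nearest_then_cheapest(hits, limit=3)]
print(result_ids)
```

The slice is what makes this different from a multi-level sort: price only reorders documents that already made the distance cut.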

+1'ed the accepted answer, but I wanted to make sure you were aware of how search scoring can often deliver a better user experience than traditional sorting.
Based on your current strategy, one could say:
Distance is important, relatively speaking (e.g. top 50 closest) but not in absolute terms (e.g. must be within 50mi).
You only want to show 50 results.
You want those results to be sorted by price (or perhaps alphabetically).
However, if you find yourself trying to generalize about which result a searcher is most likely to choose, you may discover a function of price and distance (or other features) which better models the real-world likelihood of a searcher choosing a particular result.
E.g. Say you discover that
Users will pay more for the convenience of a nearby result
Users will travel greater distances for greater discounts
Then you could model a sample scoring function that generates a result ordering based on this relationship.
E.g. 1/price + 1/distance, which would generate a higher score as either price or distance decreased.
This could be generalized to P * 1/price + 1/distance, where P is a tuning coefficient expressing the relative importance of price vs. distance.
Armed with this model, you could then write a function score query which would output ordered results with the optimal combinations of price and distance for your users.
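To get a feel for how the coefficient P changes the ordering before writing a function score query, the model can be tried out in plain Python. This is only a sketch: the sample items are made up, and it assumes price and distance are strictly positive so the divisions are safe.

```python
def combined_score(price, distance, p=1.0):
    """Score = P * 1/price + 1/distance; assumes price > 0 and distance > 0."""
    return p / price + 1.0 / distance


def rank(items, p):
    """Return item names ordered from highest to lowest combined score."""
    ordered = sorted(
        items,
        key=lambda item: combined_score(item["price"], item["distance"], p),
        reverse=True,
    )
    return [item["name"] for item in ordered]


# Hypothetical results: one close but expensive, one far but cheap, one in between.
items = [
    {"name": "near_pricey", "price": 100.0, "distance": 1.0},
    {"name": "far_cheap", "price": 10.0, "distance": 50.0},
    {"name": "mid", "price": 50.0, "distance": 5.0},
]

print(rank(items, p=1.0))    # distance dominates
print(rank(items, p=100.0))  # price dominates
```

Because price and distance are on different scales, the raw values of 1/price and 1/distance are not directly comparable, which is exactly what P is there to compensate for.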

As I see it, it would be better to select the top 50 using the size: 50 property in the query, ordering by distance, and then sort the result by price in your application.

Related

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other doc.
So on average, out of every 10 docs I will have 8 unique primary ids, and 1 primary id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and tried the solution in this link as well: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your results into). But I receive client timeouts even with high client-side timeout settings.
Ideally, I want to know the best way to handle data where the number of distinct values of the bucketing field is almost equal to the number of docs. The SQL equivalent would be select DISTINCT ( primary_id) from .....
But in elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate an aggregation:
Composite aggregation
Partition
Bucket sort
You have already tried partitioning.
Composite aggregation: combines multiple sources into a single set of buckets and allows pagination and sorting over them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" buckets, then pass the returned after_key to fetch the next "n" buckets.
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
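The after_key loop can be sketched as follows. This is an illustration under assumptions, not a definitive client: `search` is a hypothetical callable that takes a request body and returns the response as a dict (in practice you would wrap your client's search call), and `fake_search` is an in-memory stand-in used only so the sketch runs without a cluster.

```python
def iter_composite_buckets(search, field, page_size=1000):
    """Yield every bucket of a composite aggregation by following after_key."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"key": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume where the previous page ended
        response = search({"size": 0, "aggs": {"pagination": {"composite": composite}}})
        agg = response["aggregations"]["pagination"]
        if not agg["buckets"]:
            return
        yield from agg["buckets"]
        after = agg.get("after_key")
        if after is None:
            return


def fake_search(body):
    """In-memory stand-in for a cluster holding five distinct ids (hypothetical data)."""
    values = ["a", "b", "c", "d", "e"]
    composite = body["aggs"]["pagination"]["composite"]
    start = 0
    if "after" in composite:
        start = values.index(composite["after"]["key"]) + 1
    page = values[start:start + composite["size"]]
    agg = {"buckets": [{"key": {"key": v}, "doc_count": 1} for v in page]}
    if page:
        agg["after_key"] = {"key": page[-1]}
    return {"aggregations": {"pagination": agg}}


keys = [b["key"]["key"] for b in iter_composite_buckets(fake_search, "id.keyword", page_size=2)]
print(keys)
```

Each request only asks for one page of buckets, which is what keeps individual requests small compared with a single terms aggregation over every distinct value.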
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is
executed after all other non-pipeline aggregations. This means the
sorting only applies to whatever buckets are already returned from the
parent aggregation. For example, if the parent aggregation is terms
and its size is set to 10, the bucket_sort will only sort over those
10 returned term buckets
So this isn't suitable for your case.
You can increase the result size to a value greater than 10K by updating the index.max_result_window setting. Setting the size too high can cause out-of-memory issues, so you need to test how much your hardware can support.
A better option is to use the scroll API and perform the distinct on the client side.
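The client-side distinct could look roughly like this. It is a sketch that assumes you already have an iterable of document sources (for example, collected from successive scroll pages) and that keeping one set entry per distinct primary_id fits in memory; the sample documents are hypothetical.

```python
def distinct_by(docs, key):
    """Yield the first document seen for each distinct value of `key`."""
    seen = set()
    for doc in docs:
        value = doc[key]
        if value not in seen:
            seen.add(value)
            yield doc


# Hypothetical sources as they might arrive from scroll pages.
docs = [
    {"primary_id": "p1", "name": "first"},
    {"primary_id": "p2", "name": "second"},
    {"primary_id": "p1", "name": "duplicate of first"},
    {"primary_id": "p3", "name": "third"},
]

unique_docs = list(distinct_by(docs, "primary_id"))
print([d["primary_id"] for d in unique_docs])
```

Because the function is a generator, it can consume scroll pages as they arrive instead of waiting for the full result set.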

ElasticSearch: query for N items of each category

I have an index of goods in ElasticSearch (5.5), in which every product has a field "category", like "GLOVES", "COAT", "TOWEL".
With the terms query I can select items belonging to several categories, e.g.
{
"terms": {
"div_id": ["COAT", "DRESS", "JACKET"]
}
}
Now the problem is that I want the response to contain several items of each type, say, no fewer than 3 (given that the total size of the response is 15 records).
And I have no clear idea how to do this. The "straight" way may return any number of items from any category. The closest I've got is to add random_score, which makes the result "diverse", but the result then depends on what percentage of the index each category takes up.
I suspect there should be a different approach, but I can't seem to guess the correct keywords.
Thanks in advance!
You may want to try top hits agg documented here.
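One way this might look is a terms aggregation on the category field with a top_hits sub-aggregation, so that each bucket carries its own few documents. The sketch below only builds the request body as a Python dict; the aggregation names are made up, and it assumes div_id (the field used in the question's terms query) is indexed so that it can be aggregated on.

```python
def items_per_category_query(categories, per_category=3, field="div_id"):
    """Build a request returning up to `per_category` documents for each category."""
    return {
        "size": 0,  # the documents come back inside the buckets, not as top-level hits
        "query": {"terms": {field: categories}},
        "aggs": {
            "by_category": {  # hypothetical aggregation name
                "terms": {"field": field, "size": len(categories)},
                "aggs": {
                    "sample_items": {"top_hits": {"size": per_category}},
                },
            }
        },
    }


body = items_per_category_query(["COAT", "DRESS", "JACKET"], per_category=3)
print(body["aggs"]["by_category"]["aggs"]["sample_items"])
```

With three categories and three hits per bucket this yields at most 9 documents, so reaching a total of 15 would mean raising per_category and trimming in the application.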

distinct count on hive does not match cardinality count on elasticsearch

I have loaded data into my elasticsearch cluster from hive using the elasticsearch-hadoop plugin from elastic.
I need to fetch a count of unique account numbers. I have the following queries written in both hql and queryDSL, BUT they are returning different counts.
Hive Query:
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch the query looks like this:
{
"query": {
"bool": {
"must": [
{"match": { "capacity": "550"}}
]
}
},
"aggs": {
"unique_account": {
"cardinality": {
"field": "account"
}
}
}
}
// Returns --> 71607
Am I doing something wrong? What can I do to match the two queries?
Note: There are exactly the same number of records in hive and elasticsearch.
"the first approximate aggregation provided by Elasticsearch is
the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an
approximate algorithm. It is based on the HyperLogLog++ (HLL)
algorithm."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
For the OP
precision_threshold
"precision_threshold accepts a number from 0–40,000. Larger values are
treated as equivalent to 40,000.
...
Although not guaranteed by the
algorithm, if a cardinality is under the threshold, it is almost
always 100% accurate. Cardinalities above this will begin to trade
accuracy for memory savings, and a little error will creep into the
metric."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
You might also want to take a look at "Support for precise cardinality aggregation #15876"
For the OP, 2
"I have tried several numbers..."
You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.
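For reference, this is roughly where precision_threshold goes in the question's query, written as a Python dict, together with the size of the discrepancy reported in the question. It is a sketch: the "size": 0 is an addition to suppress the hits, and raising the threshold to its maximum is not expected to make the count exact here, since the true cardinality is above 40,000.

```python
def unique_account_query(capacity, precision_threshold=40000):
    """Build the question's query with an explicit precision_threshold."""
    return {
        "size": 0,  # assumption: only the aggregation result is needed
        "query": {"bool": {"must": [{"match": {"capacity": capacity}}]}},
        "aggs": {
            "unique_account": {
                "cardinality": {
                    "field": "account",
                    "precision_threshold": precision_threshold,
                }
            }
        },
    }


# Counts reported in the question.
hive_count = 71132
es_count = 71607
relative_error = abs(es_count - hive_count) / hive_count
print(f"relative error: {relative_error:.2%}")
```

The two counts differ by 475, which is about 0.67% of the Hive figure.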
Cardinality does not ensure an accurate count even with a precision_threshold of 40000. There is another way to get an accurate distinct count of a field.
This article on "Accurate Distinct Count and Values from Elasticsearch" explains the solution in detail, as well as its accuracy compared with cardinality.

Scoring documents by both textual match and distance to a point

I have an ElasticSearch index with a list of "shops".
I'd like to allow customers to search these shops by both geo_distance (so, search for a point and get shops near that location), and textual match, like matches on shop name / address.
I'd like to get results that match either of these two criteria, and I'd like the order of these results to be a combination of both. The stronger the textual match, and the closer to the point searched, the higher the result. (Obviously, there's going to be a formula to combine these two, that'll need tweaking, not too worried about that part yet).
My issue / what I've tried:
geo_distance is a filter, not a query, so I can't combine both on the query part of the request.
I can use a bool => should filter (rather than query) that matches on either name or location. This gives me the results I want, but not in order.
I can also have _geo_distance as part of a sort clause so that documents closer to the point rank higher.
What I haven't figured out is how I would take the "regular" _score that ElasticSearch gives to documents when doing textual matches, and combine that with the geo_distance score.
By having the textual match in the filter, it doesn't seem to affect the score of documents (which makes sense). And I don't see how I could combine the textual match in the query part and a geo_distance filter so it's an OR rather than an AND.
I guess my best bet would be the equivalent of this:
{
function_score: {
query: { ... },
functions: [
{ geo_distance function },
{ multi_match_result score },
],
score_mode: 'multiply'
}
}
but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible.
Any pointers will be greatly appreciated.
I'm working with ElasticSearch v1.4, but I can upgrade if necessary.
"but I'm not sure you can do geo_distance as a score function, and I don't know how to have multi_match_result score as a score function, or if it's even possible."
You can't really do it in the way that you're asking, but you can do what you want just as easily. For the simpler case, you get scoring just by using a normal query.
The problem with filters is that they're yes/no questions, so if you use them in a function_score, then it either boosts the score or it doesn't. What you probably want is degradation of the score as the distance from the origin grows. It's the yes/no nature that stops them from impacting the score at all. There's no improvement to relevancy implied by matching a filter -- it just means that it's part of the answer, but it doesn't make sense to say that it should be closer to the top/bottom as a result.
This is where the decay function score helps. It works with numbers, dates, and -- most helpfully here -- geo_points. In addition to the types of data it accepts, it can decay using Gaussian, exponential, or linear decay functions. The choice is honestly arbitrary, and you should pick the one that gives the best "experience". I would suggest starting with gauss.
"function_score": {
"functions": [
{
"gauss": {
"my_geo_point_field": {
"origin": "0, 1",
"scale": "5km",
"offset": "500m",
"decay": 0.5
}
}
}
]
}
Note that when origin is given as a string, as here, it is in "lat, lon" order; only the array form ([lon, lat]) follows the GeoJSON x, y order.
Each one of the values impacts how the score decays based on the graph (taken wholesale from the documentation). If you use an offset of 0, then the score begins to drop as soon as a document is not exactly at the origin. A non-zero offset allows some buffer within which documents are considered just as good.
The scale is directly associated with the decay in that the score will be multiplied by the decay value once a document is scale-distance beyond the offset. In my above example, anything 5.5km from the origin (the 5km scale plus the 500m offset) would get half the score of anything at the origin.
Again, just note that the different types of decay functions change the shape of scoring.
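To see how origin distance, scale, offset, and decay interact, the gauss curve can be reproduced in a few lines of Python. This sketch follows the formula given in the Elasticsearch documentation for decay functions as I understand it; distances are plain numbers in metres here, and the function name is made up.

```python
import math


def gauss_decay(distance, scale, offset=0.0, decay=0.5):
    """Gaussian decay: 1.0 within `offset`, equal to `decay` at offset + scale."""
    sigma_squared = -(scale ** 2) / (2.0 * math.log(decay))
    effective = max(0.0, distance - offset)  # distance inside the offset does not count
    return math.exp(-(effective ** 2) / (2.0 * sigma_squared))


# Same parameters as the example: scale 5km, offset 500m, decay 0.5.
for metres in (0, 500, 3000, 5500, 10500):
    print(metres, round(gauss_decay(metres, scale=5000, offset=500, decay=0.5), 4))
```

Anything within 500m of the origin scores 1.0, and the score reaches 0.5 once a point is 5km past that buffer.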
"I'd like the order of these results to be a combination of both."
This is the purpose of the bool / should compound query. You get OR behavior with score improvement based on each match. Combining this with the above, you'd want something like:
{
"query": {
"bool": {
"should": [
{
"multi_match": { ... }
},
{
"function_score": {
"functions": [
{
"gauss": {
"my_geo_point_field": {
"origin": "0, 1",
"scale": "5km",
"offset": "500m",
"decay": 0.5
}
}
}
]
}
}
]
}
}
}
NOTE: If you add a must, then the should behavior changes from literal OR-like behavior (at least 1 must match) to completely optional behavior (none must match).
"I'm working with ElasticSearch v1.4, but I can upgrade if necessary."
Starting with Elasticsearch 2.0, every filter is a query and every query is also a filter. The only difference is the context that it's used in. This doesn't change my answer here, but it's something that may help you in the future in addition to what I say next.
Geo-related performance increased dramatically in ES 2.2+. You should upgrade (and recreate your geo-related indices) to take advantage of those changes. ES 5.0 will have similar benefits!

How can I multiply the score of two queries together in Elasticsearch?

In Solr I can use the query function to return a numerical score for a query, and I can use that in the context of a bf parameter, something like bf=product(query('cat'),query('dog')), to multiply two relevance scores together.
Elasticsearch has a search API that is generally more flexible to work with, but I can't figure out how I would accomplish the same feat. I can use _score in a script_score function of a function_score query, but I can only use the _score of the main query. How can I incorporate the score of another query? How can I multiply the scores together?
You could script a TF*IDF scoring function using a function_score query. Something like this (ignoring Lucene's query and length normalization):
"script": "tf = _index[field][term].tf(); idf = (1 + log ( _index.numDocs() / (_index[field][term].df() + 1))); return sqrt(tf) * pow(idf,2)"
You'd take the product of those function results for 'cat' and 'dog' and add them to your original query score.
Here's the full query gist.
Alternately, if you've got something in that bf that's heavyweight enough you'd rather not run it across the entire set of matches, you could use rescore requests to modify the score of the top N ranked ORIGINAL QUERY results using subsequent scoring passes with your (cat, dog, etc...) scoring-queries.
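A rescore request along those lines might be shaped like the dict below, using the rescorer's "multiply" score_mode to combine the original score with the score of a second query. This is a sketch under assumptions: the field name and terms are placeholders, both weights are left at 1.0, and only the top window_size hits on each shard are rescored, so it approximates rather than reproduces Solr's product() over all matches.

```python
def multiply_scores_query(field, first_term, second_term, window_size=50):
    """Build a search whose top hits are rescored as original_score * rescore_query score."""
    return {
        "query": {"match": {field: first_term}},
        "rescore": {
            "window_size": window_size,  # number of top hits per shard to rescore
            "query": {
                "rescore_query": {"match": {field: second_term}},
                "query_weight": 1.0,
                "rescore_query_weight": 1.0,
                "score_mode": "multiply",
            },
        },
    }


# "body" is a hypothetical text field name.
request = multiply_scores_query("body", "cat", "dog", window_size=100)
print(request["rescore"]["query"]["score_mode"])
```

Keeping the window small is the point of this approach: the second query's scoring is only paid for on documents that already ranked well.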

Resources