Elasticsearch near value query

I have 100 documents with a price field between 1,000 and 10,000.
For example, I want to query 4 documents with prices near 5000; the returned values can be a bit above or below 5000.
If I use a range query it might sometimes give me empty results, because no documents fall between my min and max values.
I could requery with a wider min/max range, but I don't think that is the correct solution.
I've also tried span queries, but they don't support numeric values.
Is there any way to do this in Elasticsearch 6.0?

There are at least two ways I can think of doing it.
1: Do a range query on price from 5000 minus some constant to 5000 plus that constant, and add a sort to your query based on the distance from the given price (5000 in your case). Reference
2: Use function_score to compute the score of each match of the range query, with a function that computes the absolute distance from the price 5000. Reference
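A minimal sketch of option 2 using the Python client, assuming an index named products with a numeric price field (both names are placeholders). The script inverts the absolute distance so that prices closer to 5000 score higher:

```python
# Sketch only: index and field names are assumptions, adjust for your mapping.
from elasticsearch import Elasticsearch

es = Elasticsearch()

query = {
    "size": 4,
    "query": {
        "function_score": {
            "query": {"match_all": {}},
            "script_score": {
                "script": {
                    # Closer to the target price => larger score.
                    "source": "1 / (1 + Math.abs(doc['price'].value - params.target))",
                    "params": {"target": 5000}
                }
            },
            "boost_mode": "replace"
        }
    }
}

response = es.search(index="products", body=query)
for hit in response["hits"]["hits"]:
    print(hit["_source"]["price"], hit["_score"])
```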

Related

Algorithm to do efficient weighted ranking?

I need an algorithm to do fast weighted ranking of Twitter posts.
Each post has a number of ranking scores (like age, author follower count, keyword mentions, etc.). I'm looking for an algorithm that can quickly find the top N Tweets, given the weights of each ranking score.
Now, the use case is that these weights will change, and recalculating the ranking scores for every tweet every time the weights change is prohibitively expensive.
I will have access to sorted lists of Tweets, one for each ranking score. So I'm looking for an algorithm to efficiently search through these lists to find my top N.
NOTICE: This answer is provided due to the belief that knowledge is always good (even if it might be used for evil purposes). If you are able to obtain and store/track information like age, author follower count, keyword mentions, etc without ensuring participants fully understand how their data will be used and without obtaining every participant's explicit consent (and without "opt-in, with the ability to opt out at any time"); then you are violating people's privacy and deserve bankruptcy for your grossly unethical malware. It's bad enough that multiple large companies are evil without making it worse.
Assume there's a formula like score = a_rank * a_weight + b_rank * b_weight + c_rank * c_weight.
This can be split into pieces, like:
a_score = a_rank * a_weight
b_score = b_rank * b_weight
c_score = c_rank * c_weight
score = a_score + b_score + c_score
If you know the range of a_rank you can sort the entries into "a_rank buckets". For example, if you have 100 buckets and "a_rank" can be a value from "a_rank_min" to "a_rank_max", then "bucket_number = (a_rank - a_rank_min) * 100 / (a_rank_max - a_rank_min)".
From here you can say that all entries in a specific "a_rank bucket" must have an "a_score" in a specific range; and you can calculate the minimum and maximum possible "a_score" for all entries in a bucket from "bucket_number" alone, using formulas like "min_a_score_for_bucket = (bucket_number * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight" and "max_a_score_for_bucket = ((bucket_number + 1) * (a_rank_max - a_rank_min) / 100 + a_rank_min) * a_weight - 1".
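As a small sketch of those formulas (assuming integer ranks, as the "- 1" in the maximum implies, and 100 buckets; all names are illustrative), with a clamp so that a_rank_max itself still lands in the last bucket:

```python
NUM_BUCKETS = 100

def bucket_number(a_rank, a_rank_min, a_rank_max):
    # Map a_rank onto buckets 0..NUM_BUCKETS-1; clamp so a_rank == a_rank_max
    # falls in the last bucket instead of one past the end.
    n = (a_rank - a_rank_min) * NUM_BUCKETS // (a_rank_max - a_rank_min)
    return min(n, NUM_BUCKETS - 1)

def a_score_bounds(bucket, a_rank_min, a_rank_max, a_weight):
    # Minimum and maximum possible a_score for any entry in this bucket.
    lo = (bucket * (a_rank_max - a_rank_min) // NUM_BUCKETS + a_rank_min) * a_weight
    hi = ((bucket + 1) * (a_rank_max - a_rank_min) // NUM_BUCKETS + a_rank_min) * a_weight - 1
    return lo, hi
```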
The next step is to establish a "current 10 entries with the highest score so far". Do this by selecting the first 10 entries from the highest "a_rank bucket/s" and calculating their scores fully.
Once this is done (and you know "10th highest score so far") you can calculate a filter for each bucket. If you assume all entries in a bucket have the maximum possible a_rank (determined from the bucket number alone) and the maximum possible c_rank (determined from the possible range of all c_rank values) then you can calculate the minimum value for b_rank that would be needed for the entry's score to be higher than "10th highest score so far"; and in the same way, if you assume all entries in a bucket have the maximum possible a_rank and the maximum possible b_rank you can calculate the minimum value for c_rank that would be needed. The "minimum needed b_rank" and "minimum needed c_rank" can then be used to skip over entries that couldn't possibly beat the "10th highest score so far" without calculating the score for any of those entries.
Of course every time you find an entry with a higher score than the "10th highest score so far" you will get a new "10th highest score so far" and will have to recalculate the "minimum needed b_rank" and "minimum needed c_rank" for the buckets. Ideally you'd look at buckets in "highest a_rank bucket first" order, and therefore will only need to recalculate the "minimum needed b_rank" and "minimum needed c_rank" for the current bucket.
Near the start (while you're looking at the bucket with the highest a_rank values) it probably won't filter out many entries and might even make performance worse (due to the cost of recalculating "minimum needed b_rank" and "minimum needed c_rank" values). Near the end (while you're looking at the buckets with the lowest a_rank values) you may be able to skip entire buckets without looking at any entry in them.
Note that:
all the weights can change without changing any of the buckets; but it's nicer for performance if "a_rank" has the strongest influence on the score.
the range of values for "a_rank" shouldn't change (you'd have to rebuild the buckets if it does); but the range of values for "b_rank" and "c_rank" can be variable (updated every time a new entry is created)
sorting each bucket in "highest a_rank first" order (and then using "highest b_rank first" as a tie-breaker, etc) will help performance when finding the 10 entries with the highest score; but it will also add overhead when an entry is added. For this reason, for most cases, I probably wouldn't bother sorting the contents of buckets at all.
it would be nice if you can have a bucket for each possible value of "a_rank"; as this gives almost all of the benefits of sorting without any of the overhead of sorting. If you can't have a bucket for each possible value of "a_rank", then increasing the number of buckets can help performance.
in theory; it would be possible to have multiple layers of "bucketing" (e.g. "a_rank buckets" that contain "b_rank buckets"). This would significantly increase complexity and memory consumption; but (especially if no sorting is done) it might significantly improve performance, or it might make performance worse.
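Putting the pieces together, here is a simplified sketch of the search (all names and the data layout are illustrative). To keep it short it skips whole buckets whose best possible score cannot beat the current 10th best, rather than computing the per-entry "minimum needed b_rank" / "minimum needed c_rank" filters described above:

```python
import heapq

def top_n(buckets, a_rank_min, a_rank_max, b_rank_max, c_rank_max,
          a_weight, b_weight, c_weight, n=10):
    # buckets: list indexed by bucket number, each bucket a list of (a_rank, b_rank, c_rank).
    num_buckets = len(buckets)
    best = []                                         # min-heap of (score, entry)
    max_bc = b_rank_max * b_weight + c_rank_max * c_weight

    for bucket_no in range(num_buckets - 1, -1, -1):  # highest a_rank buckets first
        # Upper bound on a_score for anything in this bucket.
        max_a = ((bucket_no + 1) * (a_rank_max - a_rank_min) // num_buckets
                 + a_rank_min) * a_weight
        if len(best) == n and max_a + max_bc <= best[0][0]:
            continue                                  # nothing here can beat the nth best
        for a, b, c in buckets[bucket_no]:
            score = a * a_weight + b * b_weight + c * c_weight
            if len(best) < n:
                heapq.heappush(best, (score, (a, b, c)))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, (a, b, c)))

    return sorted(best, reverse=True)                 # highest score first
```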

Elasticsearch field collapsing with minimum inner hits count

When using field collapsing, is there a way to filter out results whose inner hits count is less than a given threshold?
In a hotel database I want to find, for each hotel, its three cheapest available rooms cheaper than X. Each document has a hotel_id, room_id and price. If a hotel does not have 3 available rooms cheaper than X, I cannot do anything with it.
So I do a search for rooms cheaper than X, sorted by price, collapsing on hotel_id, but I only want to see groups that contain 3 rooms in the inner hits, otherwise that hotel result is unusable. With the size parameter I can define a maximum, but I cannot find a way to define a minimum.
Aggregation is not an option due to performance constraints.
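For reference, a sketch of the collapse query being described (Python client; the rooms index and field names are assumptions). As far as I know, collapse has no built-in minimum for inner hits, so any threshold still has to be checked against the returned inner hit counts:

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()
max_price = 100  # "X"

body = {
    "query": {"range": {"price": {"lt": max_price}}},
    "sort": [{"price": "asc"}],
    "collapse": {
        "field": "hotel_id",
        "inner_hits": {
            "name": "cheapest_rooms",
            "size": 3,                        # at most 3 rooms per hotel
            "sort": [{"price": "asc"}]
        }
    }
}

resp = es.search(index="rooms", body=body)
# Keep only hotels whose group actually contains 3 matching rooms.
# (hits.total is a number on 6.x; on 7.x+ use ["total"]["value"].)
usable = [
    hit for hit in resp["hits"]["hits"]
    if hit["inner_hits"]["cheapest_rooms"]["hits"]["total"] >= 3
]
```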

Sorting/Ranking Tied Scores based on Points Difference

I want to use Google Sheets for a competition ranking that ranks or sorts automatically when I key in the points (PTS).
However, ties can happen. If a tie happens, I take the Score Difference (SD) into consideration: the team with the lower SD should rank higher in the tie.
See the table below for illustration:
For example: currently Team A and Team D have the highest PTS, so both of them are currently Rank 1. However, Team D has a lower SD than Team A, so I want the sheet to automatically rank Team D as Rank 1 and Team A as Rank 2.
Is this possible?
One solution might be to create a hidden column with a formula like:
=PTS * 10000 - SD
(Replacing PTS and SD with the actual cell references)
Multiplying PTS by 10000 ensures it has a higher priority than SD.
We want to reward low SDs, so we subtract instead of add.
Finally, in the rank column, we can use a formula like:
=RANK(HiddenScoreCell, HiddenScoreColumnRange, 0)
So, for example, if the HiddenScore column is column K, the actual formula for row 2 might look like
=RANK(K2, K:K, 0)
The third parameter is 0 because we want higher scores to get a better (lower-numbered) rank.
To sort, you can just apply a sort on the Rank column.
With sort() you can define multiple sorting criteria (see the documentation), e.g.
=sort(A2:I5,8,false,7,false)
This sorts your table (in A2:I5, change accordingly) first on PTS, descending, then on SD, descending. You can add more criteria with more pairs of parameters (the column index, then TRUE for ascending or FALSE for descending).
Then you need to compare your team name with the sorted table and find its rank in the sorted list:
=ArrayFormula(match(A2:I5,sort(A2:I5,8,false,7,false),0))
Paste that formula in I2 (assuming your table starts in A1 with its headers, otherwise adjust accordingly).
=ARRAYFORMULA(IF(LEN(A2:A), RANK(H2:H*9^9-G2:G, H2:H*9^9-G2:G), ))

Score documents by the distance from the average in ElasticSearch

I have documents with various price values. I want to filter these documents by something like color:red and rank the documents by the distance from each document's price to the average price of the current result page.
I can use an aggregation to compute the average, but the question is: how can I do it in one query?
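Not a single-query answer, but for reference here is a sketch of the two-request version implied above: first the avg aggregation over the filtered set, then a rescoring by distance from that average. Index and field names are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch()

# Request 1: filter by color and compute the average price.
agg_resp = es.search(index="products", body={
    "size": 0,
    "query": {"term": {"color": "red"}},
    "aggs": {"avg_price": {"avg": {"field": "price"}}}
})
avg_price = agg_resp["aggregations"]["avg_price"]["value"]

# Request 2: same filter, scored by closeness to that average.
resp = es.search(index="products", body={
    "query": {
        "function_score": {
            "query": {"term": {"color": "red"}},
            "script_score": {
                "script": {
                    "source": "1 / (1 + Math.abs(doc['price'].value - params.avg))",
                    "params": {"avg": avg_price}
                }
            },
            "boost_mode": "replace"
        }
    }
})
```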

Bucket index in Bucket Sort

I'm trying to improve my bucket sort for large numbers (over 10000). I'm not quite sure why my code isn't performing well on large numbers.
My Bucket Sort algorithm for array of size n:
Create array of linked list of size n
Calculate range for numbers
Calculate interval for each bucket
Calculate index for bucket, where to put particular number
(Problem: I calculate the index by repeatedly subtracting the interval from the number and incrementing a counter each time I subtract; the counter is the index.)
I believe this particular way of finding the index takes very long for large numbers.
How can I improve finding the bucket index?
P.S. I heard there's a way to preprocess the array to find its min and max values, and then calculate the index by subtracting the min from a particular number: index = number - min. I didn't quite get the idea of calculating the index this way.
Questions:
1. Is this an efficient way to find the index?
2. How do I handle cases where I have an array of size 4 and the numbers 31, 34, 51, 56? 31 goes to bucket 0 and 34 goes to bucket 3, but what about 51 and 56?
3. Is there any other way to calculate the index?
You can find your index faster through division: index = value / interval. If the first interval starts at 'min' instead of 0, then use (value - min) as the numerator.
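A minimal sketch of that division-based index (assuming integer values; all names are illustrative):

```python
def bucket_sort(values, num_buckets=None):
    if not values:
        return []
    if num_buckets is None:
        num_buckets = len(values)

    lo, hi = min(values), max(values)
    span = hi - lo + 1                              # +1 keeps the max value inside the last bucket
    buckets = [[] for _ in range(num_buckets)]

    for v in values:
        index = (v - lo) * num_buckets // span      # one division instead of repeated subtraction
        buckets[index].append(v)

    result = []
    for bucket in buckets:
        result.extend(sorted(bucket))               # sort each bucket, then concatenate in order
    return result

print(bucket_sort([31, 34, 51, 56]))                # [31, 34, 51, 56]
```

With 4 buckets and the numbers 31, 34, 51, 56, this maps 31 and 34 to bucket 0 and 51 and 56 to bucket 3, which is one way to handle question 2 above.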
