ElasticSearch: Use Query to get single document ranking - elasticsearch

I am trying to use ElasticSearch to compute a ranking. I'm not sure if this is possible and am trying to find out what my options might be. I need to run a query on all documents, sort them descending and then just return what number position in the list a specific record is located.
For example, I want to find out Julie's class ranking. I have records of each student in Julie's grade that contains their names and GPA's and I want to perform 1 query that will tell me what her rank in within her grade.
I am hoping there is an ES guru out there that can help because otherwise I am going to need to run a regular query, get back max 10,000 records and figure it out from there.

This cannot be found in a single query.
First you need to get GPA of "Julia" and then find count of docs which have score higher than Julia.
{
"query": {
"range": {
"gpa": {
"gt": 8 --> GPA of julia
}
}
},
"aggs": {
"count": {
"value_count": {
"field": "name.keyword" --> count where gpa is greater than 8
}
}
}
}
Better option is to store rank in document itself while indexing

Related

Is there a way to specify percentage value in ES DSL Sampler aggregation

I am trying to do a sum aggregation on a certain sample of data, I want to get the sum of costs (field) of only the top 25% records (with the highest cost).
I know I have an option to run a sampler aggregation which can help me achieve this, but there I need to pass the exact number of records on which I want to run the sampler aggregation.
{
"aggs": {
"sample": {
"sampler": {
"shard_size": 300
},
"aggs": {
"total_cost": {
"sum": {
"field": "cost"
}
}
}
}
}
}
But is there a way to specify a percentage instead of an absolute number here, because in my case the total number of document changes pretty regularly and I need to get the top 25% (costliest).
How I get it today is by doing 2 queries
first to get the total number of records
divide the number by 4 and do the sampler query with that number (also I have added a descending sort for the cost field, which is not shown in the query above)

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of different unique values for the field with "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
partitions_number = (unique_values // page_size) + 1
else:
partitions_number = (unique_values // page_size)
Than I am making this simple query:
POST my_index/_search?pretty
{
"size": 0,
"query": {
"bool": {
"filter": [
{
"match": {
"field_to_paginate": "foo"
}
}
]
}
},
"aggs": {
"by_pchostname": {
"terms": {
"size": 10,
"field": "field_to_paginate",
"include": {
"partition": 0,
"num_partitions": 3
}
}
}
}
}
I am expecting to retrieve 10 results. But if I run the query I have only 7 results.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitons in terms aggregation divide the values in equal chunks.
In your case no of partition num_partitions is 3 so 21/3 == 7.
Partitons are meant for getting large values in the order of 1000 s.
You may be able to leverage shard_size parameter. My suggestion is to read this part of manual and work with the shard_size param
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from reference docs:
If you want to retrieve all terms or all combinations of terms in a
nested terms aggregation you should use the Composite aggregation
which allows to paginate over all possible terms rather than setting a
size greater than the cardinality of the field in the terms
aggregation. The terms aggregation is meant to return the top terms
and does not allow pagination.

Complex ElasticSearch Query

I have documents with (id, value, modified_date). Need to get all the documents for ids which have a specific value as of the last modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, looks like, I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. Would've been trivial in SQL, but with ElasticSearch I am at a loss. And then I would need to write this in python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" and reverse sorting by date, 3 is an analog of SQL's HAVING clause - Bucket Selector Aggregation (?), 4 _source, 5 terms-lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
This shows an example on how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
This will return the average price bucketed in days intervals:
GET /logstash-*/_search?size=0
{
"query": {
"match_all": {}
},
"aggs": {
"2": {
"date_histogram": {
"field": "#timestamp",
"interval": "1d",
"time_zone": "Europe/Berlin",
"min_doc_count": 1
},
"aggs": {
"1": {
"avg": {
"field": "price"
}
}
}
}
}
}
I wrote it so it matches all record, that obviously returns more data than you need. Depending on the amount of data it might be easier to finish the task on client side.

What is the difference between must and filter in Query DSL in elasticsearch?

I am new to elastic search and I am confused between must and filter. I want to perform an and operation between my terms, so I did this
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
},
{
"term": {
"saleType": "sale_type1"
}
}
]
}
}
}
which gave me the required results matching both the terms, and on using filter like this
POST /xyz/_search
{
"query": {
"bool": {
"must": [
{
"term": {
"city": "city1"
}
}
],
"filter": {
"term": {
"saleType": "sale_type1"
}
}
}
}
}
I get the same result, so when should I use must and when should I use filter? What is the difference?
must contributes to the score. In filter, the score of the query is ignored.
In both must and filter, the clause(query) must appear in matching documents. This is the reason for getting same results.
You may check this link
Score
The relevance score of each document is represented by a positive floating-point number called the _score. The higher the _score, the more relevant the document.
A query clause generates a _score for each document.
To know how score is calculated, refer this link
must returns a score for every matching document. This score helps you rank the matching documents, and compare the relative relevance between documents (using the magnitude of the score of each document).
With this, one can say, Doc 1 is how many times more relevant than Doc 2. Or that Doc 1 to 7 are of much higher relevancy than Doc 8+.
For how the relative score is determined, you can refer to the references below.
Briefly, it is related to the number of term occurrences in the document, the document length, and the average number of term occurrences in your database index.
filter doesn't return a score. All one can say is, all matching documents are of relevance. But it won't help in evaluating if one is more relevant than the other. You can think of filter as a must with only 2 scores: zero or non-zero, and where all zero-scored documents are dropped.
filter is helpful if you just want to whitelist/blacklist for e.g., all documents belonging to the topic "pets".
In summary, there are 3 points that will help you in deciding when to use what:
must is your only choice when comparing/ranking documents by relevance
filter excludes all documents that don't match
filter is a lot faster because Elasticsearch doesn't need to compute the relative score
References:
Query vs Filter: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-filter-context.html
Computation of Relevance: https://www.infoq.com/articles/similarity-scoring-elasticsearch/

Elasticsearch, sorting by exact string match

I want to sort results, such that if one specific field (let's say 'first_name') is equal to an exact value (let's say 'Bob'), then those documents are returned first.
That would result in all documents where first_name is exactly 'Bob', would be returned first, and then all the other documents afterwards. Note that I don't intend to exclude documents where first_name is not 'Bob', merely sort them such that they're returned after all the Bobs.
I understand how numeric or alphabetical sorting works in Elasticsearch, but I can't find any part of the documentation covering this type of sorting.
Is this possible, and if so, how?
One solution is to manipulate the score of the results that contain the Bob in the first name field.
For example:
POST /test/users
{
"name": "Bob"
}
POST /test/users
{
"name": "Alice"
}
GET /test/users/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "Bob",
"boost" : 2
}
}
},
{
"match_all": {}
}
]
}
}
}
Would return both Bob and Alice in that order (with approximate scores of 1 and 0.2 respectively).
From the book:
Query-time boosting is the main tool that you can use to tune
relevance. Any type of query accepts a boost parameter. Setting a
boost of 2 doesn’t simply double the final _score; the actual boost
value that is applied goes through normalization and some internal
optimization. However, it does imply that a clause with a boost of 2
is twice as important as a clause with a boost of 1.
Meaning that if you also wanted "Fred" to come ahead of Bob you could just boost it with a 3 factor in the example above.

Resources