Elasticsearch: choose TOP N documents and apply query - elasticsearch

I'm sorry I'm not good at English, please understand it.
Let's assume I have such data:
title  category  price
book1  study     10
book2  cook      20
book3  study     30
book4  study     40
book5  art       50
I can do "search books in the 'study' category and sort them by price in descending order". The result would be:
book4 - book3 - book1
However, I couldn't find a way to do "search books in the 'study' category AMONG the books in the TOP 40% by price".
(I hope 'TOP 40% by price' is the correct expression.)
In this case, the result should be "book4" only, because the category search would be performed only on book5 and book4.
At first, I thought I could do it by:
1. sort all documents by price
2. select TOP 40%
3. post another query for category search among them
But I still have no idea how to run a query against only part of the documents rather than all of them. After step 2, I'd have a list of the documents in the TOP 40%. But how can I make a query that is applied to just them?
I also realized that I don't even know how to "search TOP n%" in Elasticsearch. Is there a better way than "sort all and select the first n%"?
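To make the expected result concrete, here is a client-side sketch in plain Python of the three steps above, run on the five sample books. It is only an illustration of the intended semantics, not an Elasticsearch query; the function name and the integer `percent` parameter are made up for this example.

```python
# Client-side illustration only; nothing here talks to Elasticsearch.
books = [
    {"title": "book1", "category": "study", "price": 10},
    {"title": "book2", "category": "cook", "price": 20},
    {"title": "book3", "category": "study", "price": 30},
    {"title": "book4", "category": "study", "price": 40},
    {"title": "book5", "category": "art", "price": 50},
]


def top_percent_then_filter(docs, percent, category):
    """Keep the top `percent` of docs by price, then filter by category."""
    # Step 1: sort all documents by price, highest first.
    ranked = sorted(docs, key=lambda d: d["price"], reverse=True)
    # Step 2: keep the top N percent (40% of 5 books is 2 books).
    keep = len(ranked) * percent // 100
    # Step 3: apply the category search only to the kept documents.
    return [d["title"] for d in ranked[:keep] if d["category"] == category]


result = top_percent_then_filter(books, 40, "study")
print(result)  # ['book4']
```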
Any advice would be appreciated.
This is my first question on Stack Overflow. If my question violates any rule here, please tell me so that I can learn from it and apologize.

If your data is normally distributed, or follows some other statistical distribution that you can reason about, you can probably do this in two queries.
You can take a look at the data in histogram form by doing:
{
"query": {
"match_all": {}
},
"facets": {
"stats": {
"histogram": {
"field": "price",
"interval": 100
}
}
}
}
I usually take this data into a spreadsheet to chart it and do other statistical analysis on it. The "interval" above needs to be a reasonable value for your data; 100 might not be the right fit.
This is just to decide how to code the intermediate step. Provided the data is normally distributed, you can then get the statistical information about the collection using this query:
{
"query": {
"match_all": {}
},
"facets": {
"stats": {
"statistical": {
"field": "price"
}
}
}
}
The above gives you an output that looks like this:
count: 819517
total: 24249527030
min: 32
max: 53352
mean: 29590.023184387876
sum_of_squares: 875494716806082
variance: 192736269.99554798
std_deviation: 13882.94889407679
(The above is not based on your data sample; it is just a sample of data I have available to demonstrate the statistical facet.)
Now that you know all of that, you can start applying your knowledge of statistics to the problem at hand. That is, find the Z score at the 60th percentile (the lower bound of the top 40%) and use the mean and standard deviation to find the value of the data point it corresponds to.
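As a rough sketch of that intermediate step, the cutoff can be computed with Python's standard library, assuming the field really is normally distributed. The function name is made up for this example, and the mean and standard deviation are the sample values from the stats output above, not values from the question's data.

```python
from statistics import NormalDist


def percentile_cutoff(mean, std_deviation, percentile):
    """Value below which `percentile` of a normal distribution falls."""
    # Z score of the requested percentile on the standard normal curve.
    z = NormalDist().inv_cdf(percentile)
    # Convert the Z score back into the units of the field.
    return mean + z * std_deviation


# Top 40% starts at the 60th percentile.
cutoff = percentile_cutoff(29590.023184387876, 13882.94889407679, 0.60)
print(round(cutoff))
```

If the histogram shows the data is clearly skewed, this estimate will be off, which is why charting the data first is worthwhile.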
Your final query then looks like this (talent_profile is a field from my own data; substitute your numeric field, price in your example):
{
  "query": {
    "range": {
      "talent_profile": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}
The lte value comes from "max" in the stats facet, and the gte value comes from your intermediate analysis.
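For the original question, the range could be combined with the category search in one request body. The following is only a sketch: the function name is hypothetical, the field names `price` and `category` come from the question, and it assumes `category` is indexed so that an exact `term` match works.

```python
def top_slice_query(cutoff, maximum, category):
    """Build a search body: price within [cutoff, maximum] AND a given category."""
    return {
        "query": {
            "bool": {
                # Filter clauses do not affect scoring; both must match.
                "filter": [
                    {"range": {"price": {"gte": cutoff, "lte": maximum}}},
                    {"term": {"category": category}},
                ]
            }
        }
    }


# With the question's sample data: top 40% is price 40..50.
body = top_slice_query(40, 50, "study")
```

The resulting dictionary can be serialized to JSON and sent as the search request body.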

Related

Is there a way to specify percentage value in ES DSL Sampler aggregation

I am trying to do a sum aggregation on a certain sample of data: I want to get the sum of costs (a field) for only the top 25% of records (those with the highest cost).
I know I can run a sampler aggregation to achieve this, but there I need to pass the exact number of records on which to run the sampler aggregation.
{
"aggs": {
"sample": {
"sampler": {
"shard_size": 300
},
"aggs": {
"total_cost": {
"sum": {
"field": "cost"
}
}
}
}
}
}
But is there a way to specify a percentage instead of an absolute number here? In my case the total number of documents changes pretty regularly, and I need to get the top 25% (the costliest).
Today I get it by doing 2 queries:
1. first, get the total number of records
2. divide that number by 4 and run the sampler query with the result (I have also added a descending sort on the cost field, which is not shown in the query above)
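The second step of that workaround can be sketched as a small helper that turns the total count into the sampler body shown above. The helper name and its parameters are made up for this example. Note that `shard_size` is applied per shard, so this sketch assumes a single-shard index.

```python
def top_fraction_sampler(total_docs, fraction=0.25, field="cost"):
    """Build the sampler aggregation body for a fraction of `total_docs`."""
    # Step 2 of the workaround: convert the percentage into an absolute number.
    shard_size = max(1, int(total_docs * fraction))
    return {
        "aggs": {
            "sample": {
                "sampler": {"shard_size": shard_size},
                "aggs": {"total_cost": {"sum": {"field": field}}},
            }
        }
    }


# Step 1 (a separate count request) is assumed to have returned 1200.
body = top_fraction_sampler(1200)
```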

ElasticSearch: Use Query to get single document ranking

I am trying to use ElasticSearch to compute a ranking. I'm not sure if this is possible and am trying to find out what my options might be. I need to run a query on all documents, sort them descending and then just return what number position in the list a specific record is located.
For example, I want to find out Julie's class ranking. I have a record for each student in Julie's grade that contains their name and GPA, and I want to perform 1 query that will tell me what her rank is within her grade.
I am hoping there is an ES guru out there that can help because otherwise I am going to need to run a regular query, get back max 10,000 records and figure it out from there.
This cannot be done in a single query.
First you need to get Julie's GPA, and then count the documents that have a higher GPA than hers.
{
  "query": {
    "range": {
      "gpa": {
        "gt": 8 // Julie's GPA
      }
    }
  },
  "aggs": {
    "count": {
      "value_count": {
        "field": "name.keyword" // counts documents where gpa is greater than 8
      }
    }
  }
}
A better option is to store the rank in the document itself at index time.
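The two-step approach can be sketched in Python as follows. The function names are hypothetical, and the field names `gpa` and `name.keyword` are taken from the query above. The rank is simply the number of students with a higher GPA, plus one.

```python
def higher_gpa_count_query(gpa):
    """Second query: count documents whose GPA is strictly higher than `gpa`."""
    return {
        "query": {"range": {"gpa": {"gt": gpa}}},
        "aggs": {"count": {"value_count": {"field": "name.keyword"}}},
    }


def rank_from_count(higher_count):
    """A student with N students ahead of them is ranked N + 1."""
    return higher_count + 1


# Assume the first query returned Julie's GPA as 8.
body = higher_gpa_count_query(8)
# Assume the count aggregation in the response reported 3 students ahead of her.
julie_rank = rank_from_count(3)
```

With this definition, students who share the same GPA share the same rank.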

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
  "bool": {
    "filter": {
      "terms": {"product_id": product_ids}
    }
  }
}
Or is it better to split the ids into chunks and use a plain terms query, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A filter query works much faster than chunks sent as plain terms queries.
But a really big filter can slow down getting the result a lot.
In my case, using the filter query with chunks of 10 000 ids is 10 times faster than using the filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
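A minimal sketch of the chunking described above, assuming the ids are already in a Python list. The helper names are made up for this example, and the chunk size of 10 000 is only what worked best in my tests; tune it for your own data.

```python
def chunked(ids, size):
    """Yield consecutive slices of `ids` with at most `size` elements each."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def terms_filter_query(product_ids):
    """Build the filter query from the question for one chunk of ids."""
    return {"query": {"bool": {"filter": {"terms": {"product_id": product_ids}}}}}


# Small numbers so the example is easy to follow: 25 ids in chunks of 10.
all_ids = list(range(25))
bodies = [terms_filter_query(chunk) for chunk in chunked(all_ids, 10)]
```

Each body is then sent as its own search request, and the results are merged on the client.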
You can use the "paging" or "scrolling" feature of Elasticsearch queries for very large result sets.
Use a "from / size" query: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or a "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "from / size" is the more efficient way to go, unless you want to return thousands of results each time (which could be many MB of data, so you probably don't want that).
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document id is sequential, or is some other number that you can easily order by, and it is available as a field, you can use a "range query":
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}
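Bumping the bounds on each query can be sketched as a generator of request bodies. This is only an illustration: the function name is made up, the field name is the placeholder from the query above, and it uses `lt` instead of `lte` for the upper bound so that consecutive windows do not return the boundary document twice.

```python
def id_windows(start, stop, step):
    """Yield range-query bodies covering [start, stop) in windows of `step`."""
    lower = start
    while lower < stop:
        upper = min(lower + step, stop)
        yield {
            "query": {
                "range": {
                    # Half-open window: includes `lower`, excludes `upper`.
                    "document_id_that_is_a_number": {"gte": lower, "lt": upper}
                }
            }
        }
        lower = upper


windows = list(id_windows(0, 25000, 10000))
```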

Complex ElasticSearch Query

I have documents with (id, value, modified_date). Need to get all the documents for ids which have a specific value as of the last modified_date.
My understanding is that I first need to find such ids and then put them inside a bigger query. To find such ids, looks like, I would use "top_hits" with some post-filtering of the results.
The goal is to do as much work as possible on the server side to speed things up. Would've been trivial in SQL, but with ElasticSearch I am at a loss. And then I would need to write this in python using elasticsearch_dsl. Can anyone help?
UPDATE: In case it's not clear, "all the documents for ids which have a specific value as of the last modified_date" means: 1. group by id, 2. in each group select the record with the largest modified_date, 3. keep only those records that have the specific value, 4. from those records keep only ids, 5. get all documents where ids are in the list coming from 4.
Specifically, 1 is an aggregation, 2 is another aggregation using "top_hits" and reverse sorting by date, 3 is an analog of SQL's HAVING clause - Bucket Selector Aggregation (?), 4 _source, 5 terms-lookup.
My biggest challenge so far has been figuring out that Bucket Selector Aggregation is what I need and putting things together.
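To pin down the expected result, steps 1 to 5 can be written as a client-side sketch in plain Python. This is not an Elasticsearch query; the function name and the sample documents are invented for illustration, and dates are assumed to be ISO strings so that they compare correctly as text.

```python
def docs_for_ids_with_latest_value(docs, wanted_value):
    """Return all docs whose id had `wanted_value` at its latest modified_date."""
    # Steps 1 and 2: group by id and keep the most recent record per group.
    latest = {}
    for doc in docs:
        current = latest.get(doc["id"])
        if current is None or doc["modified_date"] > current["modified_date"]:
            latest[doc["id"]] = doc
    # Steps 3 and 4: keep only the ids whose latest record has the wanted value.
    ids = {doc_id for doc_id, doc in latest.items() if doc["value"] == wanted_value}
    # Step 5: fetch every document belonging to those ids.
    return [doc for doc in docs if doc["id"] in ids]


sample = [
    {"id": 1, "value": "a", "modified_date": "2020-01-01"},
    {"id": 1, "value": "b", "modified_date": "2020-02-01"},
    {"id": 2, "value": "b", "modified_date": "2020-01-15"},
    {"id": 2, "value": "a", "modified_date": "2020-03-01"},
]
selected = docs_for_ids_with_latest_value(sample, "a")
```

Here only id 2 qualifies, because id 1 had the value "a" earlier but not as of its last modification.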
This shows an example of how to get the latest elements in each group:
How to get latest values for each group with an Elasticsearch query?
This will return the average price bucketed into one-day intervals:
GET /logstash-*/_search?size=0
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "2": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "1d",
        "time_zone": "Europe/Berlin",
        "min_doc_count": 1
      },
      "aggs": {
        "1": {
          "avg": {
            "field": "price"
          }
        }
      }
    }
  }
}
I wrote it so that it matches all records, which obviously returns more data than you need. Depending on the amount of data, it might be easier to finish the task on the client side.

Elasticsearch, sorting by exact string match

I want to sort results, such that if one specific field (let's say 'first_name') is equal to an exact value (let's say 'Bob'), then those documents are returned first.
As a result, all documents where first_name is exactly 'Bob' would be returned first, followed by all the other documents. Note that I don't intend to exclude documents where first_name is not 'Bob', merely to sort them so that they're returned after all the Bobs.
I understand how numeric or alphabetical sorting works in Elasticsearch, but I can't find any part of the documentation covering this type of sorting.
Is this possible, and if so, how?
One solution is to manipulate the score of the results that contain "Bob" in the name field.
For example:
POST /test/users
{
"name": "Bob"
}
POST /test/users
{
"name": "Alice"
}
GET /test/users/_search
{
"query": {
"bool": {
"should": [
{
"match": {
"name": {
"query": "Bob",
"boost" : 2
}
}
},
{
"match_all": {}
}
]
}
}
}
Would return both Bob and Alice in that order (with approximate scores of 1 and 0.2 respectively).
From the book:
Query-time boosting is the main tool that you can use to tune
relevance. Any type of query accepts a boost parameter. Setting a
boost of 2 doesn’t simply double the final _score; the actual boost
value that is applied goes through normalization and some internal
optimization. However, it does imply that a clause with a boost of 2
is twice as important as a clause with a boost of 1.
This means that if you also wanted "Fred" to come ahead of Bob, you could boost it by a factor of 3 in the example above.
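Extending the example to several names could look like the sketch below, which builds the same bool/should body with one boosted match clause per name. The function name is hypothetical, and the field name `name` is the one used in the example above.

```python
def boosted_name_query(boosts):
    """Build a bool/should query with one boosted match clause per name."""
    should = [
        {"match": {"name": {"query": name, "boost": boost}}}
        for name, boost in boosts.items()
    ]
    # match_all keeps every other document in the results, after the boosted ones.
    should.append({"match_all": {}})
    return {"query": {"bool": {"should": should}}}


# Fred should come first, then Bob, then everyone else.
body = boosted_name_query({"Fred": 3, "Bob": 2})
```

Because boosts are normalized internally, it is worth checking the returned scores to confirm the order you expect.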
