Elasticsearch query to find range overlap

Let's say I have the following indexed document:
{
  "field1": [400, 800]
}
I want to create a query using 2 search parameters (min_val = 300 and max_val = 500) to select documents where these two ranges overlap.
In my example, the above document should be selected, as we can see:
300        500
 [==========]
       [====================]
      400                  800
What is the most efficient way to find documents whose range overlaps a given numeric range?
I can make it work with multiple comparisons and many ands and ors, but I'm looking for a simpler, more efficient way to achieve this.

In ES, a range of numbers like you have for field1 is not actually a range but simply two distinct values, namely 400 and 800. So all you have to do is use a simple range query comparing field1 against the lower and upper bounds of the search range: the range [300, 500] should include either 400 or 800.
Expressed with the DSL, you end up with a single range query like this one:
{
  "query": {
    "range": {
      "field1": {
        "gte": 300,
        "lte": 500
      }
    }
  }
}
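One caveat worth noting: this only matches documents where at least one of the two stored values falls inside the search range. A document like [100, 900] overlaps [300, 500] even though neither endpoint lies inside it. If that case matters and you are on ES 5.2 or later, an alternative worth considering is the dedicated range field type, which models the interval natively. A minimal sketch, using ES 7.x mapping syntax (the index name and document id are illustrative):
PUT my_index
{
  "mappings": {
    "properties": {
      "field1": { "type": "integer_range" }
    }
  }
}

PUT my_index/_doc/1
{
  "field1": { "gte": 400, "lte": 800 }
}

POST my_index/_search
{
  "query": {
    "range": {
      "field1": {
        "gte": 300,
        "lte": 500,
        "relation": "intersects"
      }
    }
  }
}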

Related

Is there a way to specify percentage value in ES DSL Sampler aggregation

I am trying to do a sum aggregation on a certain sample of data: I want to get the sum of the cost field over only the top 25% of records (those with the highest cost).
I know I have the option to run a sampler aggregation, which can help me achieve this, but it requires passing the exact number of records on which to run.
{
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 300
      },
      "aggs": {
        "total_cost": {
          "sum": {
            "field": "cost"
          }
        }
      }
    }
  }
}
But is there a way to specify a percentage instead of an absolute number here? In my case the total number of documents changes pretty regularly, and I need the top 25% (costliest).
How I get it today is by doing 2 queries (sketched below):
1. first, get the total number of records;
2. divide that number by 4 and run the sampler query with the result (I have also added a descending sort on the cost field, which is not shown in the query above).
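For what it's worth, a minimal sketch of that two-query workaround (the index name is an assumption; the division happens client-side):
GET my_index/_count
Take the returned count, divide it by 4 (say 1200 / 4 = 300), and feed the result into the sampler:
POST my_index/_search
{
  "size": 0,
  "sort": [
    { "cost": "desc" }
  ],
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 300
      },
      "aggs": {
        "total_cost": {
          "sum": {
            "field": "cost"
          }
        }
      }
    }
  }
}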

restructure elasticsearch index to allow filtering on sum of values

I've an index of products.
Each product has several variants (a few to hundreds; each has a color & size, e.g. Red).
Each variant is available (in a certain quantity) at several warehouses (around 100 warehouses).
Warehouses have codes, e.g. AB, XY, CD, etc.
If I had my choice, I'd index it as:
stock: {
  Red: {
    S: { AB: 100, XY: 200, CD: 20 },
    M: { AB: 0, XY: 500, CD: 20 },
    2XL: { AB: 5, XY: 0, CD: 9 }
  },
  Blue: {
    ...
  }
}
Here's a kind of customer query I might receive:
Show me all products that have Red.S in stock (minimum 100) at warehouses AB & XY.
So this would probably be a filter like
Red.S.AB > 100 AND Red.S.XY > 100
I'm not writing the whole filter query here, but it's straightforward in elastic.
We might also get SUM queries, e.g. the sum of inventories at AB & XY should be > 500.
That'd be easy through a script filter, say Red.S.AB + Red.S.XY > 500
The problem is, given 100 warehouses, 100 sizes, and 25 colors, this easily needs 100 * 100 * 25 = 250k mappings. Elasticsearch simply can't handle that many keys.
The easy answer is to use nested documents, but nested documents pose a particular problem: we cannot sum across a given selection of nested documents, and nested docs are slow, especially when we're going to have 250k per product.
I'm open to external solutions than elastic as well. We're rails/postgres stack.
You have your product index with variants, that's fine, but I'd use another index for managing anything related to the multi-warehouse stock. One document per product/size/color/warehouse with the related count. For instance:
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "AB",
  "quantity": 100
}
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "XY",
  "quantity": 200
}
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "CD",
  "quantity": 20
}
etc...
That way, you'll be much more flexible with your stock queries: all you need is to filter on the fields (product, color, size, warehouse) and aggregate on the quantity field with sums, averages, or whatever else you might think of.
You will probably need to leverage the bucket_script pipeline aggregation in order to decide whether sums are above or below a desired threshold.
It's also much easier to maintain stock movements: simply index the new quantity for any given combination, rather than updating the master product document every time an item leaves the stock.
No script, no nested documents required.
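As a sketch of the kind of query this layout enables, here is the "sum of inventories at AB & XY should be > 500" example for Red/S, using a bucket_selector (the sibling of bucket_script that filters buckets against a threshold); the index name stock and keyword mappings are assumptions:
POST stock/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "color": "Red" } },
        { "term": { "size": "S" } },
        { "terms": { "warehouse": ["AB", "XY"] } }
      ]
    }
  },
  "aggs": {
    "products": {
      "terms": { "field": "product" },
      "aggs": {
        "total_quantity": {
          "sum": { "field": "quantity" }
        },
        "enough_stock": {
          "bucket_selector": {
            "buckets_path": { "totalQty": "total_quantity" },
            "script": "params.totalQty > 500"
          }
        }
      }
    }
  }
}
Each surviving products bucket is a product whose combined stock at AB and XY clears 500.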
The best possible solution would be to create separate indexes for the warehouses, each warehouse index holding one document per product/size/color with the related values, like this:
{
  "product": 123,
  "color": "Red",
  "size": "S",
  "warehouse": "AB",
  "quantity": 100
}
This will reduce your mappings to 100 * 25 = 2,500 per index.
As for the other operations, I feel @Val has covered them in his answer, which is quite impressive and beautiful.
Coming to external solutions, I would say your task is to store data, search it, and fetch it. Elasticsearch and Apache Solr are the best search engines for carrying out this kind of task. I have not tried Apache Solr, but I would highly recommend going with Elasticsearch because of its features, active community support, and fast searching. Searching can also be made faster using analyzers and tokenizers. It also has features like full-text search and term-level search to customize searching according to the situation or problem statement.

Elasticsearch "size" value not working in terms aggregation with partitions

I am trying to paginate over a specific field using the terms aggregation with partitions.
The problem is that the number of returned terms for each partition is not equal to the size parameter that I set.
These are the steps that I am doing:
Retrieve the number of distinct values for the field with a "cardinality" aggregation.
In my data, the result is 21.
From the web page, the user wants to display a table with 10 items per page.
if unique_values % page_size != 0:
    partitions_number = (unique_values // page_size) + 1
else:
    partitions_number = (unique_values // page_size)
Then I am making this simple query:
POST my_index/_search?pretty
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "field_to_paginate": "foo"
          }
        }
      ]
    }
  },
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}
I am expecting to retrieve 10 results, but when I run the query I get only 7.
What am I missing here? Do I need to use a different solution here?
As a side note, I can't use composite aggregation because I need to sort results by doc_count over the whole dataset.
Partitions in a terms aggregation divide the values into equal chunks.
In your case the number of partitions (num_partitions) is 3, so 21 / 3 == 7 terms land in each partition.
Partitions are meant for paging through large value sets, on the order of 1000s.
You may be able to leverage the shard_size parameter; my suggestion is to read that part of the manual and experiment with the shard_size param, as in the sketch below.
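What the aggregation part of the query above might look like with shard_size added (note this improves the accuracy of each shard's term counts, but does not change how many terms a partition holds):
POST my_index/_search?pretty
{
  "size": 0,
  "aggs": {
    "by_pchostname": {
      "terms": {
        "size": 10,
        "shard_size": 100,
        "field": "field_to_paginate",
        "include": {
          "partition": 0,
          "num_partitions": 3
        }
      }
    }
  }
}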
Terms aggregation does not allow pagination. Use composite aggregation instead (requires ES >= 6.1.0). Below is the quote from the reference docs:
If you want to retrieve all terms or all combinations of terms in a nested terms aggregation you should use the Composite aggregation which allows to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
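A minimal sketch of such a composite aggregation, reusing the field from the question (page through by passing each response's after_key as the after parameter of the next request):
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "paginate": {
      "composite": {
        "size": 10,
        "sources": [
          { "by_pchostname": { "terms": { "field": "field_to_paginate" } } }
        ]
      }
    }
  }
}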

Elasticsearch calculate Max with cutoff

It's a strange requirement.
We need to calculate a MAX value over our dataset; however, some of our data is bad, meaning the MAX value will produce an undesired outcome.
Say the values in the field "myField" are:
INPUT:
10 30 20 40 1000000
CURRENT OUTPUT:
1000000
DESIRED OUTPUT:
40
{"aggs": {
"aggs": {
"maximum": {
"max": {
"field": "myField"
}
}
}
}
}
I thought of sorting the data, but that would be really slow since the actual data runs to 100K+ documents.
So my question: is there a way to cut off data in the aggs so that it ignores the actual MAX and returns the SECOND MAX, or alternatively ignores, say, the top 10% and returns the max of the rest?
Have you thought of using percentiles to eliminate outliers? Maybe run a percentiles aggregation first and then use that as the base for a range filter?
The requirement seems a bit blurry to me, so this is just another try to help; not sure if this is what you are after.
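A sketch of that two-request idea (the index name and the plugged-in cutoff value are made up for illustration):
POST my_index/_search
{
  "size": 0,
  "aggs": {
    "cutoff": {
      "percentiles": {
        "field": "myField",
        "percents": [90]
      }
    }
  }
}
Then take the returned 90th-percentile value (suppose it comes back as 424) and plug it into a range filter, so the max is computed over the remaining documents only:
POST my_index/_search
{
  "size": 0,
  "query": {
    "range": {
      "myField": { "lt": 424 }
    }
  },
  "aggs": {
    "maximum": {
      "max": { "field": "myField" }
    }
  }
}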

String range query in Elasticsearch

I'm trying to query data in an Elasticsearch cluster (2.3) using the following range query. To clarify, I'm searching on a field that contains an array of values that were derived by concatenating two ids together with a count. For example:
Schema:
{
  id1: 111,
  id2: 222,
  count: 5
}
The query I'm using looks like the following:
Query:
{
  "query": {
    "bool": {
      "must": {
        "range": {
          "myfield": {
            "from": "111_222_1",
            "to": "111_222_2147483647",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}
The to field uses Integer.MAX_VALUE.
This works alright but doesn't exactly match the underlying data. Querying through other means produces more results than this method.
More strangely, trying 111_222_5 in the from field produces 0 results, while trying 111_222_10 does produce results.
How is ES (and/or Lucene) interpreting this range query and why is it producing such strange results? My initial guess is that it's not looking at the full value of the last portion of the String and possibly only looking at the first digit.
Is there a way to specify a format for the TermRange? I understand date ranging allows formatting.
A closer look at how the range is evaluated provides the answer: term ranges are compared lexicographically, so "5" comes before "50", which comes before "6", and so on.
To get around this, I reindexed using a fixed-length, zero-padded string for the count:
0000000001
0000000100
0001000101
...
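With the count padded to a fixed width of ten digits (matching Integer.MAX_VALUE), lexicographic and numeric order agree, so the original query becomes, as a sketch:
{
  "query": {
    "bool": {
      "must": {
        "range": {
          "myfield": {
            "from": "111_222_0000000001",
            "to": "111_222_2147483647",
            "include_lower": true,
            "include_upper": true
          }
        }
      }
    }
  }
}
Now "111_222_0000000005" sorts before "111_222_0000000010", as expected.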
