It's a strange requirement.
We need to calculate a MAX value in our dataset; however, some of our data is bad, meaning the MAX value produces an undesired outcome.
say the values in field "myField" are:
INPUT:
10 30 20 40 1000000
CURRENT OUTPUT:
1000000
DESIRED OUTPUT:
40
{"aggs": {
"aggs": {
"maximum": {
"max": {
"field": "myField"
}
}
}
}
}
I thought of sorting the data, but that would be really slow since the actual dataset is 100K+ documents.
So my question: is there a way to cut off data in the aggregation so it ignores the actual MAX and returns the second MAX? Alternatively, can it ignore, say, the top 10% and return the max of the rest?
Have you thought of using percentiles to eliminate outliers? Maybe run a percentiles aggregation first and then use that value as the basis for a range filter?
The requirement seems a bit blurry to me, so this is just another try to help, not sure if this is what you are after.
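To make that concrete, here is a rough sketch of the two-step idea, assuming a hypothetical index called my-index (adjust names to your setup). The first request asks for the 90th percentile of myField:

GET my-index/_search
{
  "size": 0,
  "aggs": {
    "myField_pct": {
      "percentiles": {
        "field": "myField",
        "percents": [90]
      }
    }
  }
}

Then take the returned 90.0 value (shown here as 100000 purely for illustration) and use it as the upper bound of a range filter in a second request, so the max is computed only over the remaining documents:

GET my-index/_search
{
  "size": 0,
  "query": {
    "range": {
      "myField": {
        "lt": 100000
      }
    }
  },
  "aggs": {
    "maximum": {
      "max": {
        "field": "myField"
      }
    }
  }
}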
Related
I am trying to do a sum aggregation on a certain sample of data: I want to get the sum of the cost field for only the top 25% of records (those with the highest cost).
I know I can run a sampler aggregation, which can help me achieve this, but there I need to pass the exact number of records on which to run it.
{
"aggs": {
"sample": {
"sampler": {
"shard_size": 300
},
"aggs": {
"total_cost": {
"sum": {
"field": "cost"
}
}
}
}
}
}
But is there a way to specify a percentage instead of an absolute number here? In my case the total number of documents changes pretty regularly, and I need the top 25% (costliest).
How I get it today is by doing two queries:
first, get the total number of records;
then divide that number by 4 and run the sampler query with that value (I have also added a descending sort on the cost field, which is not shown in the query above). A rough sketch of this flow is below.
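For illustration, a minimal sketch of that two-query flow, using a hypothetical index named costs (the descending sort on cost mentioned above is omitted here, just as in the query above). First count the matching documents:

GET costs/_count
{
  "query": {
    "match_all": {}
  }
}

Then, if the count comes back as, say, 1200, run the sampler with shard_size set to a quarter of that:

GET costs/_search
{
  "size": 0,
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 300
      },
      "aggs": {
        "total_cost": {
          "sum": {
            "field": "cost"
          }
        }
      }
    }
  }
}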
I saw that there are some concerns about raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
The key is a string and the value is an integer that I would like to sort my results on: if a certain document has a given key, it gets sorted by that key's value.
I ended up creating an object and just putting the key-value pairs inside so I can match them easily.
For example, I sort on "object.key".
I was wondering: if I just use a simple object with a bunch of strings inside that are only there for exact matching, should I worry about raising this limit to 10k or 20k?
Because I now have an issue where there can be more than 1k of these records. I've found I could use nested sorting, but it still has a default limit of 10k.
Is there a good design pattern approach for this or should I not be worried by raising the field limits?
Simplified version of the query:
GET products/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"sortingObject.someSortingKey1": {
"order": "desc",
"missing": 2,
"unmapped_type":"float"
}
}
]
}
The point is that I get the sorting key from the request and use it to sort my results. There are, for example, 100k different ways to sort the results.
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html
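For illustration only, a minimal sketch of what that could look like, reusing the products index and the sortingObject field name from the query above. A single flattened field keeps the mapping at one field no matter how many keys the objects contain, and the leaf values are indexed as keywords, so exact matching works, but keep in mind that sorting on them would be lexicographic rather than numeric:

PUT products
{
  "mappings": {
    "properties": {
      "sortingObject": {
        "type": "flattened"
      }
    }
  }
}

GET products/_search
{
  "query": {
    "term": {
      "sortingObject.someSortingKey1": "15"
    }
  }
}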
I have a problem to solve in graphing from ElasticSearch / Kibana. For the sake of argument, I have a turnstile and I need a 100% accurate count of the number of unique people who've passed through the turnstile. If Fred and Joe go through then the count is 2 - but if Fred and Joe and Joe go through (because Joe left and came in again) then the count is still two. Rather than people, I'm dealing with files - and rather than names I'm using UUIDs but the principle is the same.
We've tried using Cardinality Aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) but that doesn't work. Even with tuning it only approaches 100% accuracy, and the possibility of a 100% accurate result decreases as the number of data points goes up. The number of data points that I'm looking at is in the tens, and possibly hundreds, of millions.
I understand that there's a performance / accuracy tradeoff - I can live with slow, but I can't live with inaccurate.
What would be the correct function - or correct way - of getting a 100% accurate count of unique names?
There's a workaround of doing a complete terms aggregation and then running a scripted_metric on that, but this is really really expensive.
{
  "byFullListScripting": {
    "terms": {
      "field": "groupId",
      "shard_size": 2147483647,
      "size": 2147483647
    },
    "aggs": {
      "cntScripting": {
        "scripted_metric": {
          "map_script": "targetId='u'+doc['cntTargetId']; if (_agg[targetId] == null) { _agg[targetId] = 1}",
          "reduce_script": "map=[:]; for (a in _aggs){ map.putAll(a) }; return map.size()"
        }
      }
    }
  }
}
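An alternative not mentioned in the answer above: the composite aggregation can page through every distinct value of a field exactly, so you can count the buckets client-side and get a precise figure without pulling all terms in a single response. A rough sketch, assuming a hypothetical index named events and the groupId field from the snippet above:

GET events/_search
{
  "size": 0,
  "aggs": {
    "unique_ids": {
      "composite": {
        "size": 10000,
        "sources": [
          { "groupId": { "terms": { "field": "groupId" } } }
        ]
      }
    }
  }
}

Each response contains an after_key; pass it back as "after" inside the composite body of the next request and keep a running total of the buckets returned until a response comes back empty.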
Let's say I have the following indexed document:
{
"field1": [400, 800]
}
I want to create a query using two search parameters (min_val = 300 and max_val = 500) to select documents where these two ranges overlap.
In my example, the above document should be selected, as we can see:
300 500
[======================]
[=====================]
400 800
What is the most efficient way to find documents that overlap two numeric ranges?
I can do it with multiple comparisons and many ands and ors, but I'm looking for a simpler and more efficient way to achieve this.
In ES, a range of numbers like you have for field1 is not actually a range but simply two distinct values, namely 400 and 800. All you have to do is use a simple range query and compare field1 with the lower and upper bounds of the range, i.e.
The range [300, 500] should include either 400 or 800
Expressed with the DSL, you end up with a single range query like this one:
{
"query": {
"range": {
"field1": {
"gte": 300,
"lte": 500
}
}
}
}
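As a side note, if changing the mapping is an option, Elasticsearch also has dedicated range field types (e.g. integer_range) that store the document's interval as a real range and support overlap checks directly. A sketch of that alternative, assuming a new index called my-index:

PUT my-index
{
  "mappings": {
    "properties": {
      "field1": {
        "type": "integer_range"
      }
    }
  }
}

PUT my-index/_doc/1
{
  "field1": {
    "gte": 400,
    "lte": 800
  }
}

GET my-index/_search
{
  "query": {
    "range": {
      "field1": {
        "gte": 300,
        "lte": 500,
        "relation": "intersects"
      }
    }
  }
}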
The query:
{
"aggregations": {
"sigTerms": {
"significant_terms": {
"field": "translatedTitle"
},
"aggs": {
"assocs": {
"significant_terms": {
"field": "translatedTitle"
}
}
}
}
},
"size": 0,
"from": 0,
"query": {
"range": {
"timestamp": {
"lt": "now+1d/d",
"gte": "now/d"
}
}
},
"track_scores": false
}
Error:
{
"bytes_limit": 6844055552,
"bytes_wanted": 6844240272,
"reason": "[request] Data too large, data for [<reused_arrays>] would be larger than limit of [6844055552/6.3gb]",
"type": "circuit_breaking_exception"
}
Index size is 5G. How much memory does the cluster need to execute this query?
You can try to increase the request circuit breaker limit to 41% (default is 40%) in your elasticsearch.yml config file and restart your cluster:
indices.breaker.request.limit: 41%
Or if you prefer to not restart your cluster you can change the setting dynamically using:
curl -XPUT localhost:9200/_cluster/settings -H 'Content-Type: application/json' -d '{
  "persistent" : {
    "indices.breaker.request.limit" : "41%"
  }
}'
Judging by the numbers showing up (i.e. "bytes_limit": 6844055552, "bytes_wanted": 6844240272), you're only missing ~180 KB of heap. Since 6844055552 bytes is 40% of the heap, your total heap is about 17 GB, so raising the limit by 1% to 41% gives the request breaker roughly 170 MB of additional headroom, which should be sufficient.
Just make sure to not increase this value too high, as you run the risk of going OOM since the request circuit breaker also shares the heap with the fielddata circuit breaker and other components.
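If you want to see how much of each breaker is actually in use before and after changing the limit, the node stats API exposes per-breaker figures (limit, estimated size, number of times tripped) for the request, fielddata and parent breakers:

GET /_nodes/stats/breaker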
I am not sure what you are trying to do, but I'm curious to find out. Since you get that exception, I can assume the cardinality of that field is not small. You are basically trying to see, I guess, the relationships between all the terms in that field, based on significance.
The first significant_terms aggregation will consider all the terms from that field and establish how "significant" they are (calculating frequencies of that term in the whole index and then comparing those with the frequencies from the range query set of documents).
After it does that (for all the terms), you want a second significant_terms aggregation that repeats the first step, but now for each term individually. That's going to be painful. Basically, you are computing number_of_terms * number_of_terms significant_terms calculations.
The big question is what are you trying to do?
If you want to see a relationship between all the terms in that field, that's going to be expensive for the reasons explained above. My suggestion is to run a first significant_terms aggregation, take the first 10 terms or so, and then run a second query with another significant_terms aggregation, but limit the terms, probably by adding a parent terms aggregation that includes only those 10 from the first query (a rough sketch follows below).
You can, also, take a look at sampler aggregation and use that as a parent for your only one significant terms aggregation.
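For illustration, a loose sketch of that two-step approach. The index name my-index and the include list are placeholders; the field and time range come from the original query. The first request finds the top significant terms:

GET my-index/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now/d",
        "lt": "now+1d/d"
      }
    }
  },
  "aggs": {
    "sigTerms": {
      "significant_terms": {
        "field": "translatedTitle",
        "size": 10
      }
    }
  }
}

The second request then restricts the parent buckets to just those terms with a terms aggregation and an include filter, and runs the nested significant_terms only inside them:

GET my-index/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "now/d",
        "lt": "now+1d/d"
      }
    }
  },
  "aggs": {
    "topTerms": {
      "terms": {
        "field": "translatedTitle",
        "include": ["term1", "term2", "term3"]
      },
      "aggs": {
        "assocs": {
          "significant_terms": {
            "field": "translatedTitle"
          }
        }
      }
    }
  }
}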
Also, I don't think increasing the circuit breaker limit is the real solution. Those limits were chosen for a reason. You can increase it and maybe it will work, but it should make you ask yourself whether this is the right query for your use case (it doesn't sound like it is). The limit value shown in the exception might not be the final one either: reused_arrays refers to a resizable array class in Elasticsearch, so if more elements are needed the array grows and you may trip the circuit breaker again at a higher value.
Circuit breakers are designed to deal with situations where request processing needs more memory than is available. You can set the limit using the following request:
PUT /_cluster/settings
{
"persistent" : {
"indices.breaker.request.limit" : "45%"
}
}
You can get more information here:
https://www.elastic.co/guide/en/elasticsearch/reference/current/circuit-breaker.html
https://www.elastic.co/guide/en/elasticsearch/reference/1.4/index-modules-fielddata.html