I have a problem to solve when graphing from Elasticsearch / Kibana. For the sake of argument, I have a turnstile and I need a 100% accurate count of the number of unique people who've passed through it. If Fred and Joe go through, the count is 2 - but if Fred and Joe and Joe go through (because Joe left and came in again), the count is still 2. Rather than people I'm dealing with files, and rather than names I'm using UUIDs, but the principle is the same.
We've tried using Cardinality Aggregation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) but that doesn't work. Even with tuning it only approaches 100% accuracy, and the possibility of a 100% accurate result decreases as the number of data points goes up. The number of data points that I'm looking at is in the tens, and possibly hundreds, of millions.
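For reference, the cardinality aggregation we tried looked roughly like this - the index name is illustrative, the field is the one used in the scripted workaround further down, and precision_threshold is the tuning knob referred to above (capped at 40000):
GET my_files/_search
{
  "size": 0,
  "aggs": {
    "uniqueCount": {
      "cardinality": {
        "field": "cntTargetId",
        "precision_threshold": 40000
      }
    }
  }
}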
I understand that there's a performance / accuracy tradeoff - I can live with slow, but I can't live with inaccurate.
What would be the correct function - or correct way - of getting a 100% accurate count of unique names?
There's a workaround of doing a complete terms aggregation and then running a scripted_metric on that, but this is really really expensive.
{
  "byFullListScripting": {
    "terms": {
      "field": "groupId",
      "shard_size": 2147483647, // Integer.MAX_VALUE
      "size": 2147483647 // Integer.MAX_VALUE
    },
    "aggs": {
      "cntScripting": {
        "scripted_metric": {
          "map_script": "targetId='u'+doc['cntTargetId']; if (_agg[targetId] == null) { _agg[targetId] = 1}",
          "reduce_script": "map=[:]; for (a in _aggs){ map.putAll(a) }; return map.size()"
        }
      }
    }
  }
}
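The exhaustive idea behind that workaround can also be expressed with a composite aggregation: page through every distinct value and keep a running count of the buckets client-side, which is slow but exact. A minimal sketch, assuming the same field as the script above and an illustrative index name and page size:
GET my_files/_search
{
  "size": 0,
  "aggs": {
    "unique_ids": {
      "composite": {
        "size": 10000,
        "sources": [
          { "id": { "terms": { "field": "cntTargetId" } } }
        ]
      }
    }
  }
}
Each response carries an after_key; pass it back as "after" in the next request and add up the buckets returned until the pages run out.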
I have an index of documents, each connected with some product_id, and I would like to find all documents for a specific set of ids (around 100,000 product_ids to be found, out of 100 million documents in the index in total).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to split the ids into chunks and use plain terms queries, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
The filter query works much, much faster than chunked plain terms queries.
But making a really big filter can slow down getting the result a lot.
In my case, using the filter query with chunks of 10,000 ids was 10 times faster than using the filter query with all 100,000 ids at once (by the way, that number of terms is already restricted in Elasticsearch 6).
Also, from the official Elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to take into account is that the filter query is stored in the cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
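The terms lookup mechanism mentioned in that quote keeps the id list as a document in another index and points the terms query at it, instead of sending all the ids in the request body. A minimal sketch - every index, document, and field name here is an assumption, and older Elasticsearch versions also expect a type in the lookup:
GET products/_search
{
  "query": {
    "terms": {
      "product_id": {
        "index": "product_id_lists",
        "id": "batch-1",
        "path": "ids"
      }
    }
  }
}
Here product_id_lists/batch-1 is assumed to be a document whose ids field holds the product_ids to match.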
You can use the "paging" or "scrolling" features of an Elasticsearch query for very large result sets.
Use a "from / size" query: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or a "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "from / size" is the more efficient way to go, unless you want to return thousands of results each time (which could be many, many MB of data, so you probably don't want that).
Edit:
You can make queries like this in batches:
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
    }
  }
}
If your document id is sequential, or in some other numeric form that you can easily order by, and you have such a field available, you can do a "range query":
GET _search
{
  "query": {
    "range": {
      "document_id_that_is_a_number": {
        "gte": 0, // bump this on each query by the "lte" step factor
        "lte": 10000 // find a good number here
      }
    }
  }
}
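Since the scroll API is linked above but not shown, here is a minimal sketch of how it might look (the index name, keep-alive time, and page size are assumptions):
POST my_index/_search?scroll=1m
{
  "size": 1000,
  "query": {
    "match_all": {}
  }
}
Each response returns a _scroll_id; feed it back to keep walking through the result set:
POST _search/scroll
{
  "scroll": "1m",
  "scroll_id": "<the _scroll_id from the previous response>"
}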
The other day I saw a method for querying for a random document from a collection using AQL on this very same website:
Randomly select a document in ArangoDB
My implementation of this at the moment is:
//brands
let b1 = (
  for brand in brands
    filter brand.brand == @brand1
    return brand._id
)
//pick random car with brand 1
let c1 = (
  for edge in edges
    filter edge._from == b1[0]
    for car in cars
      filter car._id == edge._to
      sort rand() limit 1
      return car._id
)
However, when I use that method it can hardly be called 'random'. For instance, in a 3500+ document collection I manage to get the same document 5 times in a row, and over the course of 25+ attempts there are maybe 3 to 4 documents that keep being returned to me. It seems the method is geared towards particular documents being output. I was wondering if there's still some improvement to be made here, or another method that wasn't mentioned in that thread. The problem is that I can't comment on the thread yet due to low reputation, so I can't ask the question in the same place, but I think it merits a discussion nonetheless. I hope someone can help me out in getting better randomization.
Essentially the rand() function is being seeded the same on each query execution. Multiple calls within the same query will be different, but the next execution will start back from the same number.
I ran this query and saw the same 3 numbers each time:
return {
"1": rand(),
"2": rand(),
"3": rand()
}
Not always, but more often than not, I got the same numbers:
[
{
"1": 0.5635853144932401,
"2": 0.19330423902096622,
"3": 0.8087405011139256
}
]
Then, seeded with current milliseconds:
return {
"1": rand() + DATE_MILLISECOND(DATE_NOW()),
"2": rand() + DATE_MILLISECOND(DATE_NOW()),
"3": rand() + DATE_MILLISECOND(DATE_NOW())
}
Now I always get a different number.
[
{
"1": 617.8103840407173,
"2": 617.0999366056549,
"3": 617.6308832757169
}
]
You can use various techniques like this to produce pseudorandom numbers that won't repeat the way calling rand() with the same seed does.
Edit: this is actually a Windows bug. If you can use Linux, you should be fine.
It's a strange requirement: we need to calculate a MAX value in our dataset; however, some of our data is BAD, meaning the MAX value will produce an undesired outcome.
Say the values in the field "myField" are:
INPUT:
10 30 20 40 1000000
CURRENT OUTPUT:
1000000
DESIRED OUTPUT:
40
{"aggs": {
"aggs": {
"maximum": {
"max": {
"field": "myField"
}
}
}
}
}
I thought of sorting the data, but that will be really slow, as the actual data runs to 100K+ documents.
So my question: is there a way to cut off data in aggs so that it ignores the actual MAX and returns the second max? Alternatively, can it ignore, say, the top 10% and return the max of what remains?
Have you thought of using percentiles to eliminate outliers? Maybe run a percentiles aggregation first and then use that as the basis for a range filter, as sketched below?
The requirement seems a bit blurry to me, so this is just another attempt to help; not sure if this is what you are after.
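A minimal sketch of that two-step idea (the index name is an assumption, and 90 is just an example percentile to tune): first ask for the cutoff value,
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "price_cutoff": {
      "percentiles": {
        "field": "myField",
        "percents": [ 90 ]
      }
    }
  }
}
then feed the returned value into a range-filtered max:
GET my_index/_search
{
  "size": 0,
  "query": {
    "range": {
      "myField": {
        "lte": 40 // replace with the value returned by the percentiles aggregation
      }
    }
  },
  "aggs": {
    "maximum": {
      "max": {
        "field": "myField"
      }
    }
  }
}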
I'd like to sample 2000 random documents from approximately 60 ES indexes holding about 50 million documents each, for a total of about 3 billion documents overall. I've tried doing the following on the Kibana Dev Tools page:
GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "field_a": "some phrase"
        }
      },
      "random_score": {}
    }
  }
}
But this query never returns. Upon refreshing the Dev Tools page, I get a page that tells me that the ES cluster status is red (doesn't seem to be a coincidence - I've tried several times). Other queries (counts, simple match_all queries) without the random function work fine. I've read that function score queries tend to be slow, but using a random function score is the only method I've been able to find for getting random documents from ES. I'm wondering if there might be any other, faster way that I can sample random documents from multiple large ES indexes.
EDIT: I would like to do this random sampling entirely using built-in ES functionality, if possible - I do not want to write any code to e.g. implement reservoir sampling on my end. I also tried running my query with a much smaller size - 10 documents - and I got the same result as for 2000.
I'm sorry, I'm not good at English; please bear with me.
Let's assume I have such data:
title   category   price
book1   study      10
book2   cook       20
book3   study      30
book4   study      40
book5   art        50
I can do "search books in 'study' category and sort them by price-descending order". Result would be:
book4 - book3 - book1
However, I couldn't find a way to do
"search books in 'study' category AMONG the books of TOP 40% in price".
(I hope 'top 40% in price' is the correct expression.)
In this case, the result should be "book4" only, because the category search would be performed only on book5 and book4.
At first, I thought I could do it by:
1. sorting all documents by price
2. selecting the top 40%
3. posting another query for the category search among them
But now, I still have no idea how I can post a query against "part of the documents", rather than all documents. After step 2, I'd have a list of the documents in the top 40%. But how can I make a query which is applied to just them?
I realized that I don't even know how to "search the top n%" in Elasticsearch. Is there a way that is better than "sort all and select the first n%"?
Any advice would be appreciated.
And this is my first question on Stack Overflow. If my question violates any rule here, please tell me so that I know, and I apologize.
If your data is normally distributed, or some other statistical distribution from which you can make sense of the data, you can probably do this in two queries.
You can take a look at the data in histogram form by doing:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "histogram": {
        "field": "price",
        "interval": 100
      }
    }
  }
}
I usually take this data into a spreadsheet to chart it and do other statistical analysis on it. The "interval" above will need to be some reasonable value; 100 might not be the right fit.
This is just to decide how to code the intermediate step. Provided the data is normally distributed, you can then get statistical information about the collection using this query:
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "stats": {
      "statistical": {
        "field": "price"
      }
    }
  }
}
The above gives you an output that looks like this:
count: 819517
total: 24249527030
min: 32
max: 53352
mean: 29590.023184387876
sum_of_squares: 875494716806082
variance: 192736269.99554798
std_deviation: 13882.94889407679
(The above is not based on your data sample, but on a sample of data I have available, just to demonstrate statistical facet usage.)
So now that you know all of that, you can start applying your knowledge of statistics to the problem at hand. That is, find the Z score at the 60th percentile and find the location of the representative data point based on that.
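For example, the z-score for the 60th percentile of a normal distribution is roughly 0.25, so with the sample stats above the cutoff would be about mean + 0.25 * std_deviation = 29590 + 0.25 * 13883, roughly 33,060; documents priced at or above that value would make up the top 40%. (These numbers only illustrate the calculation against the sample stats, not your data.)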
Your final query would then look something like this:
{
  "query": {
    "range": {
      "talent_profile": {
        "gte": 40,
        "lte": 50
      }
    }
  }
}
The lte is going to come from the "max" in the stats facet, and the gte is going to come from your intermediate analysis.