How to get random documents from Elasticsearch indexes with 50 million documents each - elasticsearch

I'd like to sample 2000 random documents from approximately 60 ES indexes holding about 50 million documents each, for a total of about 3 billion documents overall. I've tried doing the following on the Kibana Dev Tools page:
GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "match_phrase": {
          "field_a": "some phrase"
        }
      },
      "random_score": {}
    }
  }
}
But this query never returns. Upon refreshing the Dev Tools page, I get a page that tells me that the ES cluster status is red (doesn't seem to be a coincidence - I've tried several times). Other queries (counts, simple match_all queries) without the random function work fine. I've read that function score queries tend to be slow, but using a random function score is the only method I've been able to find for getting random documents from ES. I'm wondering if there might be any other, faster way that I can sample random documents from multiple large ES indexes.
EDIT: I would like to do this random sampling entirely using built-in ES functionality, if possible - I do not want to write any code to e.g. implement reservoir sampling on my end. I also tried running my query with a much smaller size - 10 documents - and I got the same result as for 2000.
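For reference, here is a variant I am considering (just a sketch - the seed/field parameters of random_score assume a reasonably recent ES version): the match_phrase is moved into a filter context so its relevance score is never computed, and random_score is seeded so the sample is reproducible:
GET some_index_abc_*/_search
{
  "size": 2000,
  "query": {
    "function_score": {
      "query": {
        "bool": {
          "filter": {
            "match_phrase": { "field_a": "some phrase" }
          }
        }
      },
      "random_score": {
        "seed": 42,          // any number; illustrative
        "field": "_seq_no"   // required alongside seed on recent versions
      },
      "boost_mode": "replace"
    }
  }
}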

Related

ElasticSearch - slow aggregations and impact on performance of other operations

There is an aggregation to identify duplicate records:
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "size": 250,
        "min_doc_count": 2
      }
    }
  }
}
However, it is missing many duplicates due to the low size; the actual cardinality is over 2 million. If size is changed to match the actual cardinality, or to some other much larger number, all of the duplicate documents are found, but the operation takes 5X more time to complete.
If I change the size to a larger number, should I expect slow performance or other adverse effects on other operations while this is running?
Yes, the size param is very critical for Elasticsearch aggregation performance. If you change it to a very big number like 10k (the limit set by Elasticsearch, which you can raise via search.max_buckets), it will surely have an adverse impact, not only on the aggregation you are running but on all other operations running in the Elasticsearch cluster.
As you are using the terms aggregation, which is a bucket aggregation, you can read more in its documentation.
Note: The reason latency increases when you increase the size is that Elasticsearch has to do significant processing to create that many buckets and to compute the entries for each of them.
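If you need to enumerate all duplicates without one huge request, one option (a sketch - the index name, num_partitions, and per-partition size below are only illustrative) is to split the terms into partitions and run one request per partition:
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "myfield": {
      "terms": {
        "field": "myfield.keyword",
        "include": {
          "partition": 0,        // repeat the request for each partition 0..249
          "num_partitions": 250
        },
        "size": 10000,           // roughly cardinality / num_partitions
        "min_doc_count": 2
      }
    }
  }
}
Each term is hashed into exactly one partition, so duplicates are never split across requests.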

Navigating terms aggregation in Elastic with very large number of buckets

Hope everyone is staying safe!
I am trying to explore the proper way to tackle the following use case in Elasticsearch.
Let's say that I have about 700,000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be the same for more than one doc (usually up to 2-3 docs will have the same primary_id). In all other cases the primary_id is not repeated in any other docs.
So, on average, out of every 10 docs I will have 8 unique primary_ids, and 1 primary_id shared between 2 docs.
To ensure uniqueness I tried using the terms aggregation, and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternative solutions, and also tried the solution in this link: https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests using multiple search requests, each specifying the partition number to fetch (depending on how many partitions you divide your result into). But I receive client timeouts even with high timeout settings on the client side.
Ideally, I want to know the best way to handle data where the cardinality of the field which forms the buckets is almost equal to the number of docs. The SQL equivalent would be SELECT DISTINCT(primary_id) FROM .....
But in Elasticsearch, distinct values can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!
There are 3 ways to paginate aggregations:
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite Aggregation: can combine multiple data sources into a single set of buckets and allows pagination and sorting on them. It can only paginate linearly using after_key, i.e. you cannot jump from page 1 to page 3. You fetch "n" records, then pass the returned after_key to fetch the next "n" records.
GET index22/_search
{
  "size": 0,
  "aggs": {
    "ValueCount": {
      "value_count": {
        "field": "id.keyword"
      }
    },
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          {
            "TradeRef": {
              "terms": {
                "field": "id.keyword"
              }
            }
          }
        ]
      }
    }
  }
}
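To fetch the next page, pass the after_key from the previous response back as "after" (the value "TR-0002" below is only a placeholder for whatever after_key your response actually returned):
GET index22/_search
{
  "size": 0,
  "aggs": {
    "pagination": {
      "composite": {
        "size": 2,
        "sources": [
          { "TradeRef": { "terms": { "field": "id.keyword" } } }
        ],
        "after": { "TradeRef": "TR-0002" }   // placeholder: use the after_key from the previous page
      }
    }
  }
}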
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.
So this isn't suitable for your case.
You can increase the result size to a value greater than 10k by updating the index.max_result_window setting. Setting too big a size can cause out-of-memory issues, so you need to test how much your hardware can support.
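For example (a sketch - index22 is the index from the example above and 50000 is only an illustrative value):
PUT index22/_settings
{
  "index.max_result_window": 50000
}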
A better option is to use the scroll API and perform the distinct operation on the client side.
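A minimal scroll sketch (the index name, page size, and keep-alive are illustrative; the scroll_id placeholder must be replaced with the _scroll_id returned by the previous response):
GET index22/_search?scroll=1m
{
  "size": 1000,
  "_source": ["id"],
  "query": { "match_all": {} }
}

POST /_search/scroll
{
  "scroll": "1m",
  "scroll_id": "<_scroll_id from the previous response>"
}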

Application-side Joins Elasticsearch

I have two indexes in Elasticsearch, a system index, and a telemetry index. I'd like to perform queries and aggregations on the telemetry index using filters from the systems index. The systems index is relatively small and only receives new documents occasionally, but the telemetry index is much larger and is constantly receiving new documents. This seems like an ideal situation for using an application-side join.
I tried emulating the example query at the previous link, but it turns out the filtered query is deprecated as of ES 5.0. (Why is this example in the current documentation?!)
Here are my queries:
GET /system/_search
{
  "query": {
    "match": {
      "name": "George's system"
    }
  }
}
GET /telemetry/_search
{
  "query": {
    "bool": {
      "must": {
        "multi_match": {
          "operator": "and",
          "fields": ["systemId"],
          [1]
        }
      }
    }
  }
}
The second one fails with a json_parse_exception because for some reason it doesn't like the [ ] characters after "fields".
Can anyone provide a simple example of using application-side joins?
Once such a query is defined (perhaps in Kibana's Dev Tools console) is there a way to visualize it in Kibana?
With Elasticsearch there is no way to execute two nested queries as in a relational database, where one query uses the response of the other. The application-side join example means that you actually make two queries (two separate requests to Elasticsearch) on the application side.
With the first query you get the list of ids you need to filter on.
With the second query you pass the list of ids you got into a terms filter, for example:
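A sketch of the two requests (the systemId field name comes from your question; the ids in the second request are placeholders for whatever the first response returns):
GET /system/_search
{
  "_source": ["systemId"],
  "query": {
    "match": { "name": "George's system" }
  }
}

GET /telemetry/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "systemId": ["sys-001", "sys-002"]   // placeholders: the ids collected from the first response
        }
      }
    }
  }
}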
This works when you have no more than 1024 values for systemId, because the terms query has a limit on the number of terms.
Because such a join is not feasible as a single query, you can't visualize it in Kibana either.
In such a case you have to sacrifice a little space and add the systemId to your mapping.
Good Luck!

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id, and I would like to find all documents for specific ids (around 100,000 product_ids to look up, out of about 100 million documents in the index in total).
Would the filter query be the fastest and best option in that case?
"query": {
  "bool": {
    "filter": {
      "terms": { "product_id": product_ids }
    }
  }
}
Or is it better to chunk the ids and use just a terms query, or something else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
A chunked filter query works much, much faster than a plain terms query.
But making the filter really big can slow down getting the result a lot.
In my case, using a filter query with chunks of 10,000 ids is 10 times faster than using a filter query with all 100,000 ids at once (by the way, this number of terms is already restricted in Elasticsearch 6).
Also, from the official Elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to take into account is that the filter query results are stored in the cache. (The cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
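For reference, a sketch of the terms lookup mentioned in the documentation quote above (the products, product_id_lists, batch_1, and ids names are only illustrative): the id list is indexed as a document of its own, and the terms filter points at it instead of embedding 100,000 ids in the request body.
PUT product_id_lists/_doc/batch_1
{
  "ids": ["p-1", "p-2", "p-3"]   // placeholder ids; in practice the full list of product_ids
}

GET products/_search
{
  "query": {
    "bool": {
      "filter": {
        "terms": {
          "product_id": {
            "index": "product_id_lists",
            "id": "batch_1",
            "path": "ids"        // older versions also require a "type" parameter here
          }
        }
      }
    }
  }
}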
You can use the "paging" or "scrolling" features of Elasticsearch queries for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "from / size" is a more efficient way to go unless you want to return thousands of results each time (which could be many, many MB of data, so you probably don't want that).
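A minimal from/size sketch (my_index and the page size are illustrative); note that from + size cannot exceed index.max_result_window, which defaults to 10,000:
GET my_index/_search
{
  "from": 0,       // increase by "size" on each subsequent request
  "size": 1000,
  "query": { "match_all": {} }
}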
Edit:
You can make a query like this in batches:
GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
    }
  }
}
If your document id is sequential, or in some other numeric form that you can easily order by, and you have such a field available, you can do a "range query":
GET _search
{
  "query": {
    "range": {
      "document_id_that_is_a_number": {
        "gte": 0,     // bump this on each query by the "lte" step factor
        "lte": 10000  // find a good number here
      }
    }
  }
}

Querying large amounts of terms without expanding maxClauseCount

In a data flow of mine, I am trying to retrieve a subset of documents from a previous terms aggregation, but I am hitting the maxClauseCount limit within my ES cluster. The follow-up query is along these lines:
GET dataset/_search
{
  "size": 2000,
  "query": {
    "bool": {
      "must": [
        (a filter or two)...,
        {
          "terms": {
            "otherid": [
              "789e18f2-bacb-4e38-9800-bf8e4c65c206",
              "8e6967aa-5b98-483e-b50f-c681c7396a6a",
              ...
            ]
          }
        }
      ]
    }
  }
}
In my research I've come across the terms lookup mechanism - which sadly we can't use - as well as the ids query.
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html
https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html
From experimentation, it appears that the ids query doesn't share the limit the terms query has (potentially it's not converted into term clauses). Do any of you know if there's a good way to achieve similar functionality to the ids query without using the _id field?
My version of ES is 5.0.
Thanks!
Instead of using a terms query, use the terms filter; it will solve the issue.
OR
index.query.bool.max_clause_count: increase it to a higher value (not recommended)
http://george-stathis.com/2013/10/18/setting-the-booleanquery-maxclausecount-in-elasticsearch/
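For reference, a sketch of what raising the limit looks like (the exact setting name varies by version - indices.query.bool.max_clause_count in more recent releases - and it is a static node setting in elasticsearch.yml, so it requires a node restart; 4096 is only an illustrative value):
# elasticsearch.yml
indices.query.bool.max_clause_count: 4096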
