Navigating terms aggregation in Elastic with very large number of buckets - elasticsearch

Hope everyone is staying safe!
I am trying to explore the proper way to tacke the following use case in elasticsearch
Lets say that I have about 700000 docs which I would like to bucket on the basis of a field (let's call it primary_id). This primary id can be same for more than one docs (usually upto 2-3 docs will have same primary_id). In all other cases the primary_id is not repeted in any other docs.
So on average out of every 10 docs I will have 8 unique primary ids, and 1 primary id same among 2 docs
To ensure uniqueness I tried using the terms aggregation and I ended up getting buckets in response to my search request but not for the subsequent scroll requests. Upon googling, I found that scroll queries do not support aggregations.
As a result, I tried finding alternates solutions, and tried the solution in this link as well, https://lukasmestan.com/learn-how-to-use-scroll-elasticsearch-aggregation/
It suggests use of multiple search requests each specifying the partition number to fetch (dependent upon how many partitions do you divide your result in). But I receive client timeouts even with high timeout settings client side.
Ideally, I want to know what is the best way to go about such data where the variance of the field which forms the bucket is almost equal to the number of docs. The SQL equivalent would be select DISTINCT ( primary_id) from .....
But in elasticsearch, distinct things can only be processed via bucketing (terms aggregation).
I also use top hits as a sub aggregation query under terms aggregation to fetch the _source fields.
Any help would be extremely appreciated!
Thanks!

There are 3 ways to paginate aggregtation.
Composite aggregation
Partition
Bucket sort
Partition you have already tried.
Composite Aggregation: can combine multiple datasources in a single buckets and allow pagination and sorting on it. It can only paginate linearly using after_key i.e you cannot jump from page 1 to page 3. You can fetch "n" records , then pass returned after key and fetch next "n" records.
GET index22/_search
{
"size": 0,
"aggs": {
"ValueCount": {
"value_count": {
"field": "id.keyword"
}
},
"pagination": {
"composite": {
"size": 2,
"sources": [
{
"TradeRef": {
"terms": {
"field": "id.keyword"
}
}
}
]
}
}
}
}
Bucket sort
The bucket_sort aggregation, like all pipeline aggregations, is
executed after all other non-pipeline aggregations. This means the
sorting only applies to whatever buckets are already returned from the
parent aggregation. For example, if the parent aggregation is terms
and its size is set to 10, the bucket_sort will only sort over those
10 returned term buckets
So this isn't suitable for your case
You can increase the result size to value greater than 10K by updating setting index.max_result_window. Setting too big a size can cause out of memory issue so you need to test it out see how much your hardware can support.
Better option is to use scroll api and perform distinct at client side

Related

Elasticsearch - Limit of total fields [1000] in index exceeded

I saw that there are some concerns to raising the total limit on fields above 1000.
I have a situation where I am not sure how to approach it from the design point of view.
I have lots of simple key value pairs:
key1:15, key2:45, key99999:1313123.
Where key is a string and value is a integer on which I would like to sort my results upon on where as if a certain document receives a key it gets sorted by the value.
I ended up creating an object and just put the key value pairs inside so I can match it easy.
For example I have sorting: "object.key".
I was wondering if I just use a simple object with bunch of strings inside that are just there for exact matching should I worry about raising this limit to 10k, or 20k.
Because I now have an issue where there can be more then 1k of these records. I've found I could use nested sorting but it still has a default limit of 10k.
Is there a good design pattern approach for this or should I not be worried by raising the field limits?
Simplified version of the query:
GET products/_search
{
"query": {
"match_all": {}
},
"sort": [
{
"sortingObject.someSortingKey1": {
"order": "desc",
"missing": 2,
"unmapped_type":"float"
}
}
]
}
Point is that I get the sortingKey from request and I use it to sort my results. There are 100k different ways to sort the result for example
There were some recent improvements (in 7.16) that should help there, but 10K or 20K fields is still a lot of overhead.
I'm not sure what kind of queries you need to run on those keyX fields, but maybe the flattened data-type would work for you? https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html

Really huge query or optimizing an elasticsearch update

I'm working in documents-visualization for binary classification of a big amount of documents (around 150 000). The challenge is how to present general visual information to end-users, so they can have an idea on the main "concepts" on each category (positive/negative). As each document has an associated set of topics, I thought about asking Elasticsearch through aggregations for the top-20 topics on positive classified documents, and then the same for the negatives.
I created a python script that downloads the data from Elastic and classify the docs, BUT the problem is that the predictions on the dataset are not registered on Elasticsearch, so I can not ask for the top-20 topics on a certain category. First I thought about creating a query in elastic to ask for the aggregations and passing a match
As I have the ids of the positive/negative documents, I can write a query to retrieve the aggregation of topics BUT in the query I should provide a really big amount of documents IDS to indicate, for instance, just the positive documents. That is impossible, since there is a limit on the endpoint and I can not pass 50 000 ids like:
"query": {
"bool": {
"should": [
{"match": {"id_str": "939490553510748161"}},
{"match": {"id_str": "939496983510742348"}}
...
],
"minimum_should_match" : 1
}
},
"aggs" : { ... }
So I tried to register the predicted categories of the classification in the Elastic index, but as the amount of documents is really huge, it takes like half an hour (compared to less than a minute for running the classification)... which is a LOT of time just for storing the predictions.... Then I also need to query the index to et the right data for the visualization. To update the documents, I am using:
for id in docs_ids:
es.update(
index=kwargs["index"],
doc_type=kwargs["doc_type"],
id=id,
body={"doc": {
"prediction": kwargs["category"]
}}
)
Do you know an alternative to update the predictions faster?
You could use bulk query that permits you to serialize your requests and query only one time against elasticsearch executing a lot of searches.
Try:
from elasticsearch import helpers
query_list = []
list_ids = ["1","2","3"]
es = ElasticSearch("myurl")
for id in list_ids:
query_dict ={
'_op_type': 'update',
'_index': kwargs["index"],
'_type': kwargs["doc_type"],
'_id': id,
'doc': {"prediction": kwargs["category"]}
}
query_list.append(query_dict)
helpers.bulk(client=es,actions=query_list)
Please have a read here
Regarding to query the list ids, to get faster response you should't bring with you the match_string value, as you have done in the question, but the _id field. That permits you to use multiget query, a bulk query for the get operation. Here in the python library. Try:
my_ids_list = [<some_ids_here>]
es.mget(index = kwargs["index"],
doc_type = kwargs["index"],
body = {'ids': my_ids_list})

Elasticsearch filter vs term query for many ids

I have an index of documents connected with some product_id. And I would like to find all documents for specific ids (around 100 000 product_ids to be found and 100 million are in total in index).
Would the filter query be the fastest and best option in that case?
"query": {
"bool": {
"filter": {"terms": {"product_id": product_ids}
}
}
Or is it better to chunkify ids and use just terms query or smth else?
The question is probably kind of a duplicate, but I would be very grateful for the best practice advice (and a bit of reasoning).
After some testing and more reading I found an answer:
Filter query works much much faster as chunks with just terms query.
But making really big filter can slower getting the result a lot.
In my case, using filter query with chunks of 10 000 ids is 10 times faster, than using filter query with all 100 000 ids at once (btw, this number is already restricted in Elasticsearch 6).
Also from official elasticsearch documentation:
Potentially the amount of ids specified in the terms filter can be a lot. In this scenario it makes sense to use the terms filter’s terms lookup mechanism.
The only disadvantage to be taken into account is that filter query is stored in cache. (The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data.)
P.S. In all cases I always used scroll.
you can use "paging" or "scrolling" feature of elastic search query for very large result sets.
Use "from - to" query : https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-from-size.html
or "scroll" query:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html
I think that "From / To" is a more efficient way to go unless you want to return thousands of results each time (which could be many many MB of data so you probably don't want that)
Edit:
You can make a query like this in bulks:
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2", "3", .... "10000" ] // tune for the best array length
}
}
}
If your document Id is sequential or some other number form that you could easily order by, and have a field available you can do a "range query"
GET _search
{
"query": {
"range" : {
"document_id_that_is_a_number" : {
"gte" : 0, // bump this on each query by "lte" step factor
"lte" : 10000 // find a good number here
}
}
}
}

distinct count on hive does not match cardinality count on elasticsearch

I have loaded data into my elasticsearch cluster from hive using the elasticsearch-hadoop plugin from elastic.
I need to fetch a count of unique account numbers. I have the following queries written in both hql and queryDSL, BUT they are returning different counts.
Hive Query:
select count(distinct account) from <tableName> where capacity="550";
// Returns --> 71132
Similarly, in Elasticsearch the query looks like this:
{
"query": {
"bool": {
"must": [
{"match": { "capacity": "550"}}
]
}
},
"aggs": {
"unique_account": {
"cardinality": {
"field": "account"
}
}
}
}
// Returns --> 71607
Am I doing something wrong? What can I do to match the two queries?
Note: There are exactly the same number of records in hive and elasticsearch.
"the first approximate aggregation provided by Elasticsearch is
the cardinality metric
...
As mentioned at the top of this chapter, the cardinality metric is an
approximate algorithm. It is based on the HyperLogLog++ (HLL)
algorithm."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
For the OP
precision_threshold
"precision_threshold accepts a number from 0–40,000. Larger values are
treated as equivalent to 40,000.
...
Although not guaranteed by the
algorithm, if a cardinality is under the threshold, it is almost
always 100% accurate. Cardinalities above this will begin to trade
accuracy for memory savings, and a little error will creep into the
metric."
https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
You might also want to take a look at "Support for precise cardinality aggregation #15876"
For the OP, 2
"I have tried several numbers..."
You have 71,132 distinct values while the precision threshold limit is 40,000, therefore the cardinality is over the threshold, which means accuracy is traded for memory saving.
This is how the chosen implementation (based on HyperLogLog++ algorithm) works.
Cardinality does not ensure accurate count even with 40000 precision_threshold. There is another way to get accurate distinct count of a field.
This article on "Accurate Distinct Count and Values from Elasticsearch" explains in detail the solution as well as it's accuracy over Cardinality.

Elastic search - caching results for queries with size and from parameters

I am trying to optimise performance of my queries. My queries have sort and aggregations. Additionally, I specify from and size parameters as I need data to be fetched in batches.
I am using request cache which makes sure aggregations are not executed every time. However, I have not been able to find a way to cache the query data.
Looked at:
https://www.elastic.co/guide/en/elasticsearch/reference/current/shard-request-cache.html
"By default, the requests cache will only cache the results of search
requests where size=0, so it will not cache hits, but it will cache
hits.total, aggregations, and suggestions."
Is there a way I can cache search results in such cases?
Sample query:
GET myIndex/_search?request_cache=true
{
"size": 20,
"from": 0,
"query": { ... },
"sort": { ... },
"aggregations": { ... }
}
this assumption holds still true. A workaround might be to send two queries, one search query and one aggregation query, so that the aggregation part can be cached.
Have you found out what actually is the slow part? The search or the aggs?

Resources