By default, the requests cache will only cache the results of search requests where size=0, so it will not cache hits, but it will cache hits.total, aggregations, and suggestions.
I do not understand the part that states "size=0".
What is the meaning of size in this context?
Does it mean that the requests cache will
cache only empty results?
cache page 1 only (the default 10 results, I think)?
No. The size param is useful if you want to fetch a number of results other than the default 10. If your search query needs to return, let's say, 1000 results, you specify size=1000; without it you get only the top 10 results, sorted by score in descending order.
size=0, in the context of the shard request cache, means that it will not cache the exact hits (i.e. the matching documents with their scores) but only metadata such as the total number of results (hits.total), aggregations, and suggestions.
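For illustration, this is the kind of request that is eligible for the request cache; a minimal sketch with the official Python elasticsearch client, where the host, index name, query, and aggregation field are assumptions:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# size=0: no hits are returned, so nothing per-document has to be cached;
# the response carries only hits.total and the aggregation result, which is
# exactly what the shard request cache stores.
resp = es.search(
    index="my_index",  # hypothetical index
    body={
        "size": 0,
        "query": {"match": {"status": "published"}},  # hypothetical query
        "aggs": {"by_author": {"terms": {"field": "author.keyword"}}},
    },
)

print(resp["hits"]["total"])  # cacheable metadata
print(resp["aggregations"]["by_author"]["buckets"])

Because size is 0, the whole response (totals and aggregation buckets) can be stored in the request cache without ever caching individual hits.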
We are currently upgrading our system from Elasticsearch 6.8.8 to Elasticsearch 7.17. When we run pageable queries using the Java REST API, the results are incorrect.
For example, in version 6.8.8, if we query for data and request page 2 with a page size of 10, the query returns the 10 items on page 2 and gives us a totalElement of 10000 records, which is correct. When we run this same exact query on version 7.17, it returns 10 items on page 2 but only gives us a totalElement of 10 instead of the correct number. We need the correct number so that our gridview handles paging correctly. Is there a setting I am missing in Elasticsearch version 7.17?
Elasticsearch introduced the track_total_hits option for all searches in ES 7.x.
Generally the total hit count can’t be computed accurately without visiting all matches, which is costly for queries that match lots of documents. The track_total_hits parameter allows you to control how the total number of hits should be tracked. Given that it is often enough to have a lower bound of the number of hits, such as "there are at least 10000 hits", the default is set to 10,000. This means that requests will count the total hit accurately up to 10,000 hits. It is a good trade-off to speed up searches if you don’t need the accurate number of hits after a certain threshold.
So, to force ES to count all the matching documents, you should set track_total_hits to true. For more information, you can check the official ES documentation page here.
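As a sketch of what that looks like from a client (the original question uses the Java REST client; this uses the official Python elasticsearch client with a placeholder host, index, and query):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

resp = es.search(
    index="my_index",                # hypothetical index
    body={
        "from": 10,                  # page 2 with a page size of 10
        "size": 10,
        "track_total_hits": True,    # count every match instead of stopping at 10,000
        "query": {"match_all": {}},  # hypothetical query
    },
)

# In ES 7.x, hits.total is an object: a value plus a relation flag.
total = resp["hits"]["total"]["value"]        # exact count with track_total_hits=True
relation = resp["hits"]["total"]["relation"]  # "eq" (exact) vs "gte" (lower bound)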
I have an Elasticsearch cluster set up with the node query cache enabled and the cache size set to 2gb, but I am not completely sure how the LRU caching policy works in this case.
I run a query context against the Elasticsearch index and I expect the result to be cached, so that when there is a request for the same query context again there should be an increase in hit_count, but this is not the behavior I see in ES.
These are the stats of my query_cache
memory_size_in_bytes: 7176480,
total_count: 36605,
hit_count: 15657,
miss_count: 20948,
cache_size: 130,
cache_count: 130,
evictions: 0
Even though memory_size_in_bytes has not reached its maximum, the result of the query context is not completely cached, and when the same query context is fired against the Elasticsearch index I see the miss_count increasing rather than the hit_count.
Can anyone please explain how node query caching works in ES?
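As an aside, counters like the ones above can be pulled per node from the nodes stats API; a minimal sketch with the Python client (the host is an assumption):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Equivalent to GET /_nodes/stats/indices/query_cache
stats = es.nodes.stats(metric="indices", index_metric="query_cache")

for node_id, node in stats["nodes"].items():
    qc = node["indices"]["query_cache"]
    print(node_id, qc["hit_count"], qc["miss_count"], qc["evictions"])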
When we query ES for records, it returns 10 records by default. How can I get all the records in the same query without using any scroll API?
There is an option to specify the size, but size is not known in advance.
You can retrieve up to 10k results in one request (by setting "size": 10000). If you have fewer than 10k matching documents, you can paginate over them using a combination of the from/size parameters. If there are more, you will have to use other methods:
Note that from + size can not be more than the index.max_result_window index setting which defaults to 10,000. See the Scroll or Search After API for more efficient ways to do deep scrolling.
To be able to paginate over an unknown number of documents, you will have to get the total count from the first query's response.
Note that if there are concurrent changes in the data, results of paginated retrieval may not be consistent (for example, if one document gets inserted or deleted while you are paginating). Scroll is consistent since it is created from a "snapshot" of the index created at query start time.
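A minimal sketch of the from/size pattern described above, with the Python client (host, index name, query, and page size are placeholders):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

PAGE_SIZE = 1000
query = {"match_all": {}}  # hypothetical query

# The first page also returns the total number of matching documents.
first = es.search(index="my_index", body={"query": query, "from": 0, "size": PAGE_SIZE})
total = first["hits"]["total"]["value"]  # ES 7.x object form; a plain number in 6.x
docs = [hit["_source"] for hit in first["hits"]["hits"]]

# Page through the rest with from/size, staying under index.max_result_window (10,000).
offset = PAGE_SIZE
while offset < min(total, 10_000):
    page = es.search(index="my_index", body={"query": query, "from": offset, "size": PAGE_SIZE})
    docs.extend(hit["_source"] for hit in page["hits"]["hits"])
    offset += PAGE_SIZE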
As I understand it, the queryResultCache caches a list of matched document IDs for each query.
Based on the information provided in the book Solr in Action, we set the queryResultMaxDocsCached parameter to the maximum number of documents we want each query to cache. If that is true, does that value add to the amount we set for documentCache? What is the difference between the two?
Excerpted from Solr in Action for queryResultMaxDocsCached
As you can imagine, a result set holding millions of documents in the cache would greatly impact available memory in Solr. The <queryResultMaxDocsCached> element allows you to limit the number of documents cached for each entry in the query result cache.
Excerpted from Solr in Action for documentCache
The query result cache holds a list of internal document IDs that match a query, so even if the query results are cached, Solr still needs to load the documents from disk to produce search results. The document cache is used to store documents loaded from disk in memory keyed by their internal document IDs. It follows that the query result cache uses the document cache to find cached versions of documents in the cached result set.
As you can see from the descriptions you've posted, the query result cache maps each query to the document IDs it matched, i.e. a search for "foo" gave "these ids": foo -> [1, 2, 3, 4, 5, 6]
The Document Cache simplifies the lookup of those document ids, meaning that Solr won't have to attempt to load them from disk again: 1 -> {'bar': 'foo', 'spam': 'eggs'}, 2 -> {'bar': 'foo', 'python': 'idle'}, 3 -> ..., etc.
If you have a different query, but it still references the same set of (or a subset of) documents, those documents can be looked up in the cache instead of being read from disk: bar -> [2, 8, 16] would still be able to find document 2 in the cache, and avoid going to disk to load the details of the document.
These caches are separate, and handled by separate settings.
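A toy illustration of how the two caches relate, using plain Python dicts rather than Solr's actual cache implementation (all names and values are made up, echoing the examples above):

# Toy model only: Solr's real caches are size-limited LRU caches, but the
# shape of what they store is the same as these two dicts.
query_result_cache = {
    "foo": [1, 2, 3, 4, 5, 6],          # query -> ordered internal doc IDs
    "bar": [2, 8, 16],
}
document_cache = {
    1: {"bar": "foo", "spam": "eggs"},  # internal doc ID -> stored fields
    2: {"bar": "foo", "python": "idle"},
}

def fetch(doc_id, load_from_disk):
    """Return a document, hitting disk only on a document-cache miss."""
    if doc_id not in document_cache:
        document_cache[doc_id] = load_from_disk(doc_id)
    return document_cache[doc_id]

# Serving the cached query "bar": doc 2 comes straight from the document
# cache; docs 8 and 16 are loaded from disk and then cached for next time.
results = [fetch(i, lambda i: {"stored_fields_for": i}) for i in query_result_cache["bar"]]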
In order to load all the documents indexed by Elasticsearch, I am using the following query through Tire.
def all
  max = total
  Tire.search 'my_documents' do
    query { all }  # Tire's match_all query
    size max       # ask for every document in a single response
  end.results.map { |entry| entry.to_hash }
end
Here max, i.e. total, is the result of a count query returning the number of indexed documents. I have indexed about 10,000 documents, and currently the request takes too long.
I am aware that I should not query all documents like this. What is the best alternative here? Pagination? If so, based on which metric would I choose the number of documents per page?
I am also planning to grow the number of documents to 100,000 or even 1,000,000, and I don't yet see how this can scale.
I appreciate every comment.
Rationale: I do this because I am running calculations over these data. Hence, I need all the data, run the computations, and save the results back into the documents.
Have a look at the scroll API, which is highly optimized for fetching a large number of results. It uses the scan search type and doesn't support sorting, but it lets you provide a query to filter the documents you want to fetch. Have a look at the reference to know more about it. Remember that the size you define in the request is per shard; that means that if you have 5 primary shards, setting size to 10 would give you 50 results back per request.
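For reference, the scroll loop underneath looks roughly like this; a sketch with the official Python client rather than Tire, with host, index, batch size, and query as placeholders (in later ES versions the separate scan search type was folded into plain scroll sorted by _doc):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Open a scroll context; size is the batch size returned per request.
resp = es.search(
    index="my_documents",
    scroll="5m",  # keep the scroll context alive for 5 minutes
    body={"size": 500, "query": {"match_all": {}}, "sort": ["_doc"]},
)

all_docs = []
while resp["hits"]["hits"]:
    all_docs.extend(hit["_source"] for hit in resp["hits"]["hits"])
    # Fetch the next batch with the scroll id from the previous response.
    resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="5m")

es.clear_scroll(scroll_id=resp["_scroll_id"])  # free the scroll context when done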