queryResultMaxDocsCached and documentCache in Solr caching

As I understand it, the query result cache stores a list of matching document IDs for each query.
Based on the information in the book Solr in Action, we set the queryResultMaxDocsCached parameter to the maximum number of documents we want cached per query. If that is true, does that value add to the amount we set for the documentCache? What is the difference between the two?
Excerpted from Solr in Action for queryResultMaxDocsCached
As you can imagine, a result set holding millions of documents in the cache would greatly impact available memory in Solr. The <queryResultMaxDocsCached> element allows you to limit the number of documents cached for each entry in the query result cache.
Excerpted from Solr in Action for documentCache
The query result cache holds a list of internal document IDs that match a query, so even
if the query results are cached, Solr still needs to load the documents from disk to produce
search results. The document cache is used to store documents loaded from disk
in memory keyed by their internal document IDs. It follows that the query result cache
uses the document cache to find cached versions of documents in the cached result set.

As you can see from the descriptions you've posted, the query result cache maps each query to the document IDs it matched, i.e. a search for "foo" gave these IDs: foo -> [1, 2, 3, 4, 5, 6]
The document cache speeds up the lookup of those document IDs, so Solr won't have to load them from disk again: 1 -> {'bar': 'foo', 'spam': 'eggs'}, 2 -> {'bar': 'foo', 'python': 'idle'}, 3 -> ..., etc.
If a different query references the same set of documents (or a subset of them), those documents can be looked up in the cache instead of being read from disk: bar -> [2, 8, 16] would still find document 2 in the cache and avoid going to disk to load its details.
These caches are separate, and handled by separate settings.
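Both are configured in solrconfig.xml. As a rough sketch (the cache class and sizes are illustrative, not recommendations; recent Solr versions use solr.CaffeineCache, older ones solr.LRUCache):

<query>
  <!-- maps a query to the list of matching internal document IDs -->
  <queryResultCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>

  <!-- maps an internal document ID to its stored fields loaded from disk -->
  <documentCache class="solr.CaffeineCache" size="512" initialSize="512" autowarmCount="0"/>

  <!-- caps how many document IDs a single queryResultCache entry may hold -->
  <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
</query>

So queryResultMaxDocsCached does not add to the documentCache size; it only limits how large an individual query result cache entry can grow.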

Related

Elasticsearch: Modeling product data with frequent updates

We're struggling with modeling our data in Elasticsearch, and decided to change it.
What we have today: a single index to store product data, which holds data of 2 types -
[1] Some product data that changes rarely -
* `name, category, URL, product attributes (e.g. color, price), etc.`
[2] Product data that might change frequently for past documents, and is indexed daily [KPIs] -
* `product-family, daily sales, daily price, daily views...`
Our requirements are -
Store product-related data (for millions of products)
Index KPIs on a daily level, and store those KPIs for a period of 2 years.
Update "product-family" on a daily level, for thousands of products. (no need to index it daily)
Query and aggregate the data with low latency to display it in our UI. Aggregation examples -
Sum all product sales in the last 3 months, from category 'A', and sort by total sales.
Same as the above, but in addition aggregate based on the product-family field.
Keep efficient indexing rate.
Currently, we're storing everything on the same index, daily, meaning we store repetitive data such as name, category and URL over and over again. This approach is very problematic for multiple reasons-
We're holding duplicates of data of type [1], which hardly ever changes; this makes the index very large.
When data of type [2] changes, specifically the product-family field (this happens daily), it requires updating tens of millions of documents (some from more than a year ago), which makes the system very slow and causes queries to time out.
Splitting this data into 2 different indices won't work for us, since we have to filter data of type [2] by data of type [1] (e.g. all sales from category 'A'); moreover, we would have to join that data somehow, and our backend server won't handle that load.
We're not sure how to model this data properly. Our thoughts are -
Using parent-child relations - parent is product data of type [1] and children are KPIs of type [2]
Using nested fields to store KPIs (data of type [2]).
Both of these methods would let us reduce the current index size by eliminating the duplicated data of type [1], and efficiently update data of type [2] for very old documents.
Specifically, both methods allow us to store product-family once per product in the parent/non-nested fields, which means we only need to update a single document per product. (These updates are daily.)
We think the parent-child relation is more suitable because we're adding KPIs daily, and per our understanding, nested fields would force re-indexing of the whole document whenever new KPIs arrive.
On the other hand, we're afraid that parent-child relations will increase query latency dramatically and make our UI very slow.
We're not sure what the proper way to model this data is, or whether our solutions are on the right path. We would appreciate any help, since we've been struggling with this for a long time.
First off, I would recommend against indexing data that changes frequently in Elasticsearch. It is not designed for this and you will get poor performance as well as encounter difficulties when cleaning up old data.
Elasticsearch is best used for immutable data (once you insert it, you don't modify it). For time-based data, I would recommend inserting measurements once with their timestamp into e.g. daily indices (see: index templates) and leaving them alone. Each measurement document would look something like
{"product_family": "widget", # keyword
"timestamp": "2022-08-23", # date
"sales": 798137,
"price": "and so on"}
This document would be inserted into the index yourindex_20220823.
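If you adopt daily indices, an index template applies the same settings and mappings to each new day's index automatically. A minimal sketch, assuming the yourindex_* naming from above (the template name is made up):

PUT _index_template/yourindex-daily
{
  "index_patterns": ["yourindex_*"],
  "template": {
    "mappings": {
      "properties": {
        "product_family": { "type": "keyword" },
        "timestamp": { "type": "date" }
      }
    }
  }
}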
You can have Elasticsearch run roll-up jobs for aggregating historical data, and set up index lifecycle management so that indices older than your retention period get deleted. This is very fast, way faster than running delete-by-query requests to remove all documents with insertionDate > -2yrs.
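For example, a minimal ILM policy sketch that deletes indices once they pass a two-year retention (the policy name is made up, and you would still attach the policy to your indices via their settings or the index template):

PUT _ilm/policy/kpi-retention
{
  "policy": {
    "phases": {
      "delete": {
        "min_age": "730d",
        "actions": { "delete": {} }
      }
    }
  }
}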
Now, we have the issue of storing the product category metadata. As you might have found out, ES is better at denormalized data, but it does lead to repetition and you might find your index size blowing up.
For minimizing disk usage, the trick is to tweak individual field mappings (and no, you can't rely on dynamic mapping). You can avoid storing a lot of stuff in the inverted index. See https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-disk-usage.html. I'd need to see your current mapping to check if there are any obvious gains to be made here.
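To give a flavor of the kind of tweaks that page describes, here is a sketch using the fields from the example above; whether each choice is safe depends on how you query. Numeric KPI fields that are only ever aggregated, never filtered on, can set "index": false and still be aggregated through doc values:

PUT yourindex_20220823
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "product_family": { "type": "keyword" },
      "timestamp": { "type": "date" },
      "sales": { "type": "long", "index": false },
      "daily_price": { "type": "scaled_float", "scaling_factor": 100, "index": false }
    }
  }
}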
Lastly, a feature that I've never tried out is to move older data (again, having daily indices helps here) to slower storage modes. See cold/frozen storage tiers.

What does size=0 mean in Elasticsearch shard request cache?

By default, the requests cache will only cache the results of search requests where size=0, so it will not cache hits, but it will cache hits.total, aggregations, and suggestions.
I do not understand the part that states "size=0".
What is the meaning of size in this context?
Does it mean that results cache will
cache only for empty results?
cache page 1 only (default 10 results I think)?
No. The size parameter controls how many hits a search returns, and the default is 10. If you run a search for which you need, let's say, 1000 results, then you set size to 1000; without this you will get only the top 10 search results, sorted by score in descending order.
size=0, in the shard request cache context, means that it will not cache the actual hits (i.e. the matching documents with their scores), only the metadata of the response, such as the total number of results (hits.total), aggregations, and suggestions. So neither of your readings is quite right: the cache applies to requests that ask for no hits at all (size=0), not to empty results or to the first page.
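For example, an aggregation-only request like this sketch (index and field names are made up) returns no hits at all, so with size=0 its response is eligible for the shard request cache by default:

GET /products/_search
{
  "size": 0,
  "aggs": {
    "by_family": {
      "terms": { "field": "product_family" }
    }
  }
}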

elastic query returns same results after insert

I'm using elasticsearch.js to move a document from one index to another.
1a) Query index_new for all docs and display on the page.
1b) Use query of index_old to obtain a document by id.
2) Use an insert to index_new, inserting result from index_old.
3) Delete document from index_old (by id).
4) Requery index_new to see all docs (including the new one). However, at this point it returns the same list of results as in 1a, not including the new document.
Is this because of caching? When I refresh the whole page, and 1a is triggered, the new document is there.. But not without a refresh.
Thanks,
Daniel
This is due to the refreshing and segment handling that happens inside Elasticsearch indexes, per shard and replica: newly indexed documents only become visible to search after the next refresh.
When you write to an index you never modify the original index files; writes go into newer, smaller files called segments, which are merged into bigger files by background batch jobs.
The next question you might have is: how often does this happen, and how can one control it?
There is an index-level setting called refresh_interval. It can take different values depending on the strategy you want to use.
refresh_interval -
-1 : disables automatic refreshes; Elasticsearch will not make new documents searchable until you trigger a refresh yourself with the _refresh API (see the sketch below).
X : a duration such as 30s; Elasticsearch will refresh the index every X (the default is 1s).
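A quick sketch, reusing the index_new name from the question — force a refresh once, or change the interval:

POST /index_new/_refresh

PUT /index_new/_settings
{ "index": { "refresh_interval": "1s" } }

For a single write, the index request also accepts a refresh parameter (e.g. ?refresh=wait_for), which makes the call return only once the document is searchable.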
If you have replication enabled on your indexes, you might also see result values toggling. An index has multiple shards and each shard has multiple replicas, and different replicas refresh on different schedules; while they are out of sync, queries routed to different replicas can see different states.
So with a refresh interval of X, you can expect a consistent state within X to 2X seconds at most.
Segment Merge Background details
https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-refresh.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.4/indices-update-settings.html

How does Elasticsearch keep the index

I'm wondering how Elasticsearch searches so fast. Does it use an inverted index, and how is it represented in memory? How is it stored on disk? How does it load from disk into memory? And how does it merge indexes so fast (I mean, when searching, how does it combine two lists so quickly)?
Elasticsearch uses Lucene to store inverted document indexes. Lucene in turn stores read-only files called segments, each holding the inverted index data for some documents. Since segments are never changed in place, Elasticsearch maintains delete/update bookkeeping that is used to override results from the read-only segments: a delete marks the document as gone, and an update is a delete plus a reindex of the new version.
With this approach some segments may become obsolete altogether or contain only a little up-to-date data. Such segments get rewritten or deleted.
There is an interesting elasticsearch plugin which visualizes the segments and the rewriting process:
https://github.com/polyfractal/elasticsearch-segmentspy
To see it in action start indexing a lot of data and see the segment information.
With the Segment API you can retrieve information about the segments:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-segments.html
I'll share what I know of Elasticsearch (ES). Yes, ES uses an inverted index. Here is how it would be structured if we had a whitespace analyzer on these documents -
{
  "_id": 1,
  "text": "Hello, John"
}
and
{
  "_id": 2,
  "text": "Bonjour, John"
}
INVERTED INDEX
Word    | Docs
--------+------
Hello   | 1
Bonjour | 2
John    | 1, 2
This index is built at index time, and each document is allocated to a shard by hashing its document ID. Whenever a search request is made, a lookup is performed on all shards, and their results are merged and returned to the requester. The results come back blazingly fast thanks to the inverted index.
ES stores data within the data folder created once you have launched ES and created an index. The directory structure resembles /data/clustername/nodes/...; if you look into this directory you will see how it's organized. You can also configure how ES stores its index data, for instance whether indexed data is kept in memory or on disk.
There is plenty of information on the ES website, and there are also several published books on ES.

Solr performance with multiple fields

I have to index around 10 million documents in Solr for full-text search. Each of these documents has around 25 additional metadata fields attached to it. Each metadata field is individually small (up to 64 characters). Common queries involve a search term along with multiple metadata fields used to filter the data. So my question is: which would give better performance with respect to search response time? (Indexing time is not a concern.)
a. Index the text data and push all metadata fields into Solr as stored fields, then query Solr for all the fields using a single query. (Effectively Solr does the filtering on metadata as well as the search.)
b. Store the metadata fields in a DB like MySQL. Use Solr only for full text, then use the document IDs returned from Solr as input to the database to filter on the other metadata and retrieve the final set of documents.
Thanks
Arijit
Definitely a). Solr isn't simply a full-text search engine, it's much more. Its filter queries are at least as good/fast as a MySQL select.
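To illustrate option a), a single Solr request can combine the search term with metadata filters as fq clauses, which are cached independently of the main query (the collection and field names here are made up):

http://localhost:8983/solr/products/select?q=text:laptop&fq=category:electronics&fq=color:red&rows=10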
b) is just silly: fetch many IDs from MySQL by selecting those with the correct metadata, do a full-text search in Solr while filtering against that ID list, then fetch the documents from MySQL or Solr (if you chose to store the data in it, not just the indexes). I can't imagine a case where this would be faster.
Why complicate things? Especially since indexing time and disk space are not an issue, you should store all your data (meaning: the subset needed by users) in Solr.
The exception would be if you had a large amount of text to store (and retrieve) in each document; in that case it would be faster to fetch it from the RDB after you get your search results back. Anyway, no one can tell for sure which option would be faster in your case, so I suggest you test the performance of both approaches (using JMeter, for example).
Also, since you don't care about indexing time, you should do all the processing you can at index time instead of at query time (e.g. synonyms, payloads where they can replace boosting, ...).
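For instance, a sketch of index-time synonym expansion in a Solr schema field type (the type name and synonyms file are illustrative; at index time the synonym graph must be flattened):

<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt"/>
    <!-- FlattenGraphFilterFactory is required after a graph filter at index time -->
    <filter class="solr.FlattenGraphFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>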
See here for some additional info on Solr performance:
http://wiki.apache.org/solr/SolrPerformanceFactors
