What considerations should I take into account when increasing the size in the Scroll API in Elasticsearch?

I am currently toying around with the Scroll API of Elasticsearch, and want to use it to obtain a large set of data and do some manual processing on it. The processing is performed by an external library and is not of the type that can easily be included as a script.
While this seems to work nicely at the moment, I was wondering what considerations I should take into account when fine-tuning the scroll size for this form of processing. A quick observation seems to indicate that increasing the scroll size reduces the latency of the operation. I suspect that larger scroll sizes will generally reduce throughput, but I have no idea whether this hypothesis is correct. I also have no idea whether there are other consequences that I am not envisioning right now.
So to summarize, my question is: what impact does changing Elasticsearch's scroll size have, especially on performance, in a scenario where the results are processed for each batch that is obtained?
Thanks in advance!

The one consideration (and the only one I know of) is being able to process each batch quickly enough that the scroll context is not released (the context lifetime is controlled by the ?scroll=X parameter).
Assuming that you will consume all the data from the query, the scroll size should be tuned based on network and third-party app performance. That is (a minimal scroll loop is sketched after this list):
if your app can process data in a stream-like manner, bigger chunks are better
if your app processes data in batches (waiting for the full ES response first), the upper limit for the batch size should guarantee that processing time < scroll keep-alive time
if you work in a poor network environment, a smaller batch size is better for handling the overhead of dropped connections/retries
generally, a bigger batch is better, as it eliminates some network/ES CPU overhead
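For illustration, here is a minimal scroll loop, sketched under the assumption that you use the official elasticsearch-py client; the index name, query, and the process_batch stub are placeholders for whatever external processing you do. The size parameter is the batch size being tuned, and scroll is the keep-alive window that each batch must be processed within.

```python
# A minimal scroll loop: `size` is the batch size being tuned, `scroll` is
# the keep-alive window within which each batch must be processed before
# the context expires. Index name and query are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def process_batch(hits):
    """Stand-in for the external processing library."""
    pass

resp = es.search(
    index="my-index",                        # placeholder index
    scroll="2m",                             # keep-alive per batch
    size=1000,                               # batch size to tune
    body={"query": {"match_all": {}}},
)
scroll_id = resp["_scroll_id"]

while resp["hits"]["hits"]:
    process_batch(resp["hits"]["hits"])
    resp = es.scroll(scroll_id=scroll_id, scroll="2m")
    scroll_id = resp["_scroll_id"]

es.clear_scroll(scroll_id=scroll_id)         # free the context when done
```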

Related

Elasticsearch performance with heavy usage of search_after and PIT

We are planning to integrate the search_after query with UI pagination. If we use PIT and keep the search context alive between UI page fetches, which could be minutes depending on user think time, would that scale well? We have to support at least a few hundred concurrent users.
Also, how would that affect other searches and the background ingestion process? The documentation says Lucene segment merging is impacted by open PIT contexts. We have a fairly rapidly changing large index.
Answering my own question with a response from Elastic:
For scroll requests we have a limitation of 500 on the max number of open scroll contexts; because PIT contexts are much more lightweight, we don't have any limit on the number of PIT contexts, so you can open as many PIT contexts as possible. We probably need to introduce some limitation though. In the worst case scenario, when you constantly open PIT contexts with a very long keep_alive parameter and constantly update your indices, you may run out of file descriptors or heap memory, because, as you rightly noticed, segments used by PIT contexts are kept and not deleted by merge.
On the other hand, if you use a relatively small keep_alive, say 10-15 mins, use a high enough refresh_interval not to create many segments, and regularly monitor the number of PIT contexts with GET /_nodes/stats/indices/search, then it will probably work fine.
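For reference, here is a rough sketch of the pattern described above (short keep_alive, explicit close of the PIT, paging with search_after), assuming the official elasticsearch-py client; the index name, sort field, page size, and the render_page stub are placeholders, not from the original question.

```python
# Rough sketch: short keep_alive, search_after paging, explicit PIT close.
# Index name, sort field, page size, and render_page() are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def render_page(hits):
    """Stand-in for whatever the UI layer does with one page of results."""
    pass

pit = es.open_point_in_time(index="my-index", keep_alive="10m")
pit_id = pit["id"]
search_after = None

try:
    while True:
        body = {
            "size": 100,
            "sort": [{"timestamp": "asc"}, {"_shard_doc": "asc"}],  # tie-breaker sort
            "pit": {"id": pit_id, "keep_alive": "10m"},
        }
        if search_after is not None:
            body["search_after"] = search_after
        resp = es.search(body=body)          # no index: the PIT pins the target
        hits = resp["hits"]["hits"]
        if not hits:
            break
        render_page(hits)
        search_after = hits[-1]["sort"]
        pit_id = resp.get("pit_id", pit_id)  # the PIT id may be refreshed
finally:
    es.close_point_in_time(body={"id": pit_id})
```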

What is a viable strategy to reach a particular cache hit ratio?

Our team is working on building a cache layer for a key-value lookup service, which has a general guideline to use a two-level cache: an in-host layer and a distributed layer. There is a requirement of a 70% cache hit ratio, so only 30% of traffic is expected to fall through to the downstream NoSQL store. At the beginning, we can figure out some factors that influence the hit ratio:
TTL
Cache size
The query pattern: e.g. 15% of the keys are queried much more often than the others.
... other?
We also have some initial ideas on how to achieve it, like prefetching data into the cache, e.g. 70% of the data. But at the end of the day I realize that it's more complicated than we think and we need a stronger rationale.
Is there any resource, research, or paper related to this issue? Or what is the proper approach to do some tests or spike it?
There are 3 main factors that influence your hit ratio:
Access pattern
Caching strategy
Working set size to cache size relation
The access pattern is generally out of your control because it depends on how users access your service. You do have control over the caching strategy, but it is generally not straightforward how to change it to improve your hit ratio. The working set is generally not in your control because it depends on the access pattern, but you do have control over your cache size.
I would approach your situation as follows:
Make sure the working set fits into your cache (easy to do)
Improve the cache strategy (more complex and time consuming)
To find out your working set size and make sure it fits in the cache, you can start with a small cache and gradually (every couple of days, for example) increase the cache size and see how much the hit ratio increases. The increase will become smaller and smaller the bigger the cache gets, and once you hit the point of diminishing returns you know your working set size. The hit ratio you get at this point is the maximum you will get for your caching strategy.
If your working set fits into your cache and you hit your 70% requirement, you are done. If not, you will need to tweak your caching strategy. This basically requires clever engineering. Simulation, as Ben Manes suggests, is definitely a very useful tool for such clever engineering.
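As a rough illustration of such a simulation, here is a toy sketch that replays a synthetic Zipf-like access pattern against a plain LRU cache and reports the hit ratio for several cache sizes; the key count, skew exponent, and cache sizes are made-up numbers, not from the question.

```python
# Toy simulation: replay a synthetic Zipf-like access pattern against a
# plain LRU cache and report the hit ratio for several cache sizes.
# Key count, skew exponent, and cache sizes are made-up numbers.
import random
from collections import OrderedDict

def simulate_hit_ratio(cache_size, num_keys=100_000, num_requests=500_000, skew=1.0):
    weights = [1.0 / (rank ** skew) for rank in range(1, num_keys + 1)]
    accesses = random.choices(range(num_keys), weights=weights, k=num_requests)

    cache = OrderedDict()
    hits = 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)           # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)    # evict the least recently used key
    return hits / num_requests

# Sweep cache sizes to find the point of diminishing returns.
for size in (1_000, 5_000, 10_000, 50_000):
    print(size, round(simulate_hit_ratio(size), 3))
```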

Elasticsearch Batch Re-Indexing | To Scroll or Not To Scroll

OK - Here is what I'm trying to achieve.
I've got an ES cluster with tens of millions of documents (and it can grow linearly). These are raw data (something like an audit log). We will (incrementally) add features that retrospectively transform this audit log into a different document (index), depending upon the feature requirement. Therefore this would require reindexing (bulk read and bulk write).
These are my technical requirements:
The "reindexing component" should be horizontally scalable: it should scale linearly by spinning up multiple instances (to speed things up).
The "reindexing component" should be resilient: if one chunk of data fails during a read by one worker, some other worker should pick it up.
Resume from where it left off: it should be resumable from where it stopped (or crashed) rather than reading through the full index again.
A bit of research showed me that I'd have to build a bespoke solution for my needs.
Now my question is whether to use scroll or from & size.
Scroll is naturally more intended for doing bulk reads in an efficient way, but I also need it to be horizontally scalable. I understand there's a "sliced scroll" feature that allows parallel scrolls, but is this limited to the number of shards? I.e. if the number of shards is 5, can I only have 5 workers reading from Elasticsearch? The transformations can still be scaled, though.
Alternatively, I was wondering if paging (using from and size) would tick all my boxes. The approach is: I'd find the total count, then compute the offsets and throw them onto a queue. A pool of workers would then read the offsets from the queue and fetch the data using from & size. That way, I will know exactly which offsets have failed or are pending, and the reads can also scale.
However, the important question I have is: does firing more and more large paging requests concurrently (assuming a page size of 2000) harm Elasticsearch?
I'd like to hear different views/solutions/pointers/comments on this.
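For what it's worth, here is roughly what a single worker consuming one slice of the "sliced scroll" option mentioned above could look like, sketched with the official elasticsearch-py client; the index name, slice count, page size, and the transform/bulk_write stubs are placeholders and not part of the original question.

```python
# One worker consuming a single slice of a sliced scroll. Index name,
# slice count, page size, and the transform()/bulk_write() stubs are
# placeholders for the bespoke reindexing logic.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def transform(source):
    """Stand-in for the feature-specific transformation."""
    return source

def bulk_write(docs):
    """Stand-in for bulk-indexing into the target index."""
    pass

def consume_slice(slice_id, max_slices, index="audit-log", page_size=2000):
    resp = es.search(
        index=index,
        scroll="5m",
        size=page_size,
        body={
            "slice": {"id": slice_id, "max": max_slices},
            "query": {"match_all": {}},
        },
    )
    scroll_id = resp["_scroll_id"]
    while resp["hits"]["hits"]:
        bulk_write([transform(hit["_source"]) for hit in resp["hits"]["hits"]])
        resp = es.scroll(scroll_id=scroll_id, scroll="5m")
        scroll_id = resp["_scroll_id"]
    es.clear_scroll(scroll_id=scroll_id)

# e.g. worker i of N runs: consume_slice(i, N)
```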

Optimal Batch Size PostgreSQL Update

I am using Postgres and I have a Ruby task that updates the contents of an entire table at an hourly rate. Currently this is achieved by updating the table in batches. However, I am not exactly sure what the formula is for finding an optimal batch size. Is there a formula or standard for determining an appropriate batch size?
In my opinion there is no theoretically optimal batch size. The optimal batch size will surely depend on your application model, the internal structure of the accessed tables, the query structure, and so on. The only reliable way I see to determine it is benchmarking.
There are some optimization tips that can help you build a faster application, but these tips cannot be followed blindly, because many of them have corner cases where they cannot be applied successfully. Again, the way to determine whether a change (adding an index, changing the batch size, enabling the query cache...) improves performance is to benchmark before and after every single change.
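In that spirit, a crude measure-before-and-after harness could look like the sketch below, assuming psycopg2 and a hypothetical items(id, value) table; the batch sizes, row counts, and connection string are arbitrary placeholders, only meant to show the benchmarking loop.

```python
# Crude measure-before-and-after harness for trying different batch sizes.
# Assumes psycopg2 and a hypothetical items(id, value) table; the row counts,
# batch sizes, and connection string are arbitrary placeholders.
import time
import psycopg2
from psycopg2.extras import execute_values

rows = [(i, f"value-{i}") for i in range(100_000)]   # synthetic workload

for batch_size in (100, 1_000, 5_000, 20_000):
    conn = psycopg2.connect("dbname=test")
    start = time.perf_counter()
    with conn, conn.cursor() as cur:                 # commits on block exit
        for offset in range(0, len(rows), batch_size):
            execute_values(
                cur,
                "UPDATE items SET value = data.value "
                "FROM (VALUES %s) AS data (id, value) "
                "WHERE items.id = data.id",
                rows[offset:offset + batch_size],
            )
    conn.close()
    print(batch_size, round(time.perf_counter() - start, 2), "seconds")
```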

Cassandra client code with high read throughput with row_cache optimization

Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over? I believe row_cache_size_in_mb is supposed to cache frequently used records in memory, but setting it to say 10MB seems to make no difference.
I tried cassandra-stress of course, but the highest read throughput it achieves with 1KB records (-col size=UNIFORM\(1000..1000\)) is ~15K/s.
With low numbers like above, I can easily write an in-memory hashmap based cache that will give me at least a million reads per second for a small working set size. How do I make cassandra do this automatically for me? Or is it not supposed to achieve performance close to an in-memory map even for a tiny working set size?
Can someone point me to cassandra client code that can achieve a read throughput of at least hundreds of thousands of reads/s if I keep reading the same record (or even a small number of records) over and over?
There are some solutions for this scenario.
One idea is to use the row cache, but be careful: any update/delete to a single column will invalidate the whole partition in the cache, so you lose all the benefit. The row cache is best used for small datasets that are frequently read but almost never modified.
Are you sure that your cassandra-stress scenario never updates or writes to the same partition over and over again?
Here are my findings: when I set row_cache, counter_cache, and key_cache all to sizable values, I am able to verify using "top" that cassandra does no disk I/O at all; all three seem necessary to ensure no disk activity. Yet, despite zero disk I/O, the throughput is <20K/s even for reading a single record over and over. This likely confirms (as also alluded to in my comment) that cassandra incurs the cost of serialization and deserialization even if its operations are completely in-memory, i.e., it is not designed to compete with native hashmap performance. So, if you want to get native hashmap speeds for a small-working-set workload but expand to disk if the map grows big, you would need to write your own cache on top of cassandra (or any of the other key-value stores like mongo, redis, etc. for that matter).
For those interested, I also verified that redis is the fastest among cassandra, mongo, and redis for a simple get/put small-working-set workload, but even redis gets at best ~35K/s read throughput (largely independent, by design, of the request size), which hardly comes anywhere close to native hashmap performance that simply returns pointers and can do so comfortably at over 2 million/s.
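As a sketch of that "own cache on top of the store" idea: a tiny read-through LRU in front of an arbitrary key-value backend; backend_get is a placeholder for whatever cassandra/redis/mongo lookup you use, not a real driver call.

```python
# Tiny read-through LRU in front of an arbitrary key-value backend.
# `backend_get` is a placeholder for a cassandra/redis/mongo lookup.
from collections import OrderedDict

class ReadThroughCache:
    def __init__(self, backend_get, max_entries=100_000):
        self.backend_get = backend_get
        self.max_entries = max_entries
        self.entries = OrderedDict()

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)       # cache hit: keep LRU order
            return self.entries[key]
        value = self.backend_get(key)           # cache miss: fall through to the store
        self.entries[key] = value
        if len(self.entries) > self.max_entries:
            self.entries.popitem(last=False)    # evict the least recently used entry
        return value

# Usage sketch: cache = ReadThroughCache(lambda key: lookup_in_store(key))
```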
