LRU cache with objects that should not be removed

I use an LRU cache (the LruCache class from android.util) in my Android app, and it is generally working fine.
Now I have a special requirement for this LRU cache: some objects should never be removed. To explain the situation: I have an array of objects (call them mymetadata objects) that should never be removed, and a lot of other objects (call them dynamicdata objects) that should be removed following the LRU rule. I would like to store the mymetadata objects in the LRU cache because the array of objects can grow, and using an LRU cache helps avoid running out of memory.
Is there any trick to guarantee that mymetadata objects are never removed from the LRU cache? Or should I simply access an object from the array so that it is marked as last used?

Is there any trick to guarantee that mymetadata objects are never removed from the LRU cache? Or should I simply access an object from the array so that it is marked as last used?
Besides regularly touching the objects you want to keep in the LRU cache (to force their ranks to stay high), I don't see what else could be done. One problem with this approach is deciding when these objects should be touched, and what the performance impact of that operation is.
A different approach would be to split the storage of your objects according to their persistence. Keep a standard map for your persistent objects and an LRU cache for objects that can expire. This mix of two data structures can then be hidden behind a single interface similar to those of Map or LruCache, with each query directed to the right internal storage, as in the sketch below.
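For instance, here is a minimal sketch of that split-storage idea, reusing android.util.LruCache from the question for the expirable side. The class and method names (MixedCache, putPersistent, putExpirable) are illustrative, not from any library:

```java
import android.util.LruCache;
import java.util.HashMap;
import java.util.Map;

// Sketch: persistent entries live in a plain map, expirable entries in an
// LruCache, hidden behind one get/put-style interface.
class MixedCache<K, V> {
    private final Map<K, V> persistent = new HashMap<>();  // mymetadata: never evicted
    private final LruCache<K, V> expirable;                 // dynamicdata: LRU-evicted

    MixedCache(int lruMaxSize) {
        expirable = new LruCache<>(lruMaxSize);
    }

    void putPersistent(K key, V value) { persistent.put(key, value); }

    void putExpirable(K key, V value) { expirable.put(key, value); }

    V get(K key) {
        V value = persistent.get(key);  // check the pinned entries first
        return (value != null) ? value : expirable.get(key);
    }
}
```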
I would like to put the mymetadata objects into the LRU cache, because the array of objects can also grow
This seems to be in conflict with your "never removed" requirement for some objects. How do you decide when a persistent object is allowed to expire?
Anyway, yet another approach would be to reimplement the LRU cache data structure, keeping two separate ordered lists of objects instead of a single one: one for mymetadata objects and one for dynamicdata objects. Each query to this data structure is then directed to the right list, and both kinds of objects can expire independently (the size of the cache can also be chosen independently for each set of objects), while both kinds of objects are stored in the same hash table / map (see the sketch below).
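A simplified sketch of that two-list idea, using two access-ordered LinkedHashMaps with independent capacities. (The answer describes a single shared hash table with two ordered lists; two maps keep the sketch short. Names are illustrative.)

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: one bounded LRU map per object kind, each evicting independently.
class TwoTierLru<K, V> {
    private static <K, V> Map<K, V> boundedLru(int maxSize) {
        return new LinkedHashMap<K, V>(16, 0.75f, true) {  // true = access order
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxSize;  // evict the LRU entry past capacity
            }
        };
    }

    private final Map<K, V> metadata;     // mymetadata, with its own capacity
    private final Map<K, V> dynamicData;  // dynamicdata, with its own capacity

    TwoTierLru(int metadataCapacity, int dynamicCapacity) {
        metadata = boundedLru(metadataCapacity);
        dynamicData = boundedLru(dynamicCapacity);
    }

    void putMetadata(K key, V value) { metadata.put(key, value); }
    void putDynamic(K key, V value) { dynamicData.put(key, value); }

    V get(K key) {
        V v = metadata.get(key);  // a hit also refreshes recency
        return (v != null) ? v : dynamicData.get(key);
    }
}
```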

Related

Optimal RocksDB configuration for use as secondary "cache"

I am looking at using RocksDB (from Java, in my case) as a secondary "cache" behind a RAM-based first-level cache. I do not expect any items in RocksDB to be accessed dramatically more often than others (all the really frequently used items will be in the first-level cache), and there will be no "locality" (if there is such a concept in RocksDB?): the "next" key in sequence is no more likely to be accessed next than any other. So I would like to optimize RocksDB for "truly random access", for instance by reading as little data as possible each time, not having any "cache" in Rocks, etc.
All suggestions of configurations are appreciated!
The defaults should be more than enough for your use case, but you can increase the block size and pin the index and filter blocks.
You can also call optimizeForPointLookup to optimize even further if you are only going to do puts and gets, as in the sketch below.
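For illustration, a sketch of those settings with RocksJava; the block size, cache size, and path are placeholder values to tune for your workload, not recommendations:

```java
import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RandomAccessStore {
    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig()
                .setBlockSize(16 * 1024)                     // larger blocks
                .setCacheIndexAndFilterBlocks(true)          // keep index/filter in the block cache
                .setPinL0FilterAndIndexBlocksInCache(true);  // pin them so they are not evicted

        Options options = new Options()
                .setCreateIfMissing(true)
                .setTableFormatConfig(tableConfig);

        // Tune for get/put workloads; the argument is a block cache size in MB.
        options.optimizeForPointLookup(64);

        try (RocksDB db = RocksDB.open(options, "/tmp/secondary-cache")) {
            db.put("key".getBytes(), "value".getBytes());
            byte[] value = db.get("key".getBytes());
        }
    }
}
```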

Why Lucene's segments are immutable

It is known that when the content of a document is updated or deleted in Elasticsearch, the existing segment is not immediately deleted; instead, a new segment is created.
After that, we know that segments are merged on a schedule.
I know it works this way because the alternative is expensive, but I don't know the exact reason why segments are immutable and are not merged immediately.
Even searching the documentation, I cannot find the exact reason; if anyone knows about this, please comment.
Thank you.
Having segments be immutable provides a lot of benefits, such as:
They can easily be used in a multi-threaded environment: since the content cannot change, you don't have to worry about shared state, race conditions, and the considerable complexity that comes with mutable content.
They can be cached effectively, since caching a fast-changing dataset would defeat the purpose of caching.
Refer to the content below from the official Elasticsearch docs on why Lucene segments are cache-friendly:
Lucene is designed to leverage the underlying OS for caching in-memory data structures. Lucene segments are stored in individual files. Because segments are immutable, these files never change. This makes them very cache friendly, and the underlying OS will happily keep hot segments resident in memory for faster access. These segments include both the inverted index (for fulltext search) and doc values (for aggregations).
Also refer to the benefits of immutable data in general for more details.
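As a general illustration of the thread-safety point (plain Java, not Lucene's actual code): an immutable snapshot can be shared across threads without any locking, and an "update" simply produces a new snapshot, much as a changed document lands in a new segment:

```java
import java.util.List;

// Illustration only: many threads read the same immutable snapshot with no
// synchronization, because nothing can mutate it underneath them.
public class ImmutableSnapshotDemo {
    public static void main(String[] args) {
        final List<String> segment = List.of("doc1", "doc2", "doc3");

        for (int i = 0; i < 4; i++) {
            new Thread(() ->
                segment.forEach(doc ->
                    System.out.println(Thread.currentThread().getName() + " read " + doc))
            ).start();
        }
        // A "modified" segment would be a brand-new list; the shared one is untouched.
        List<String> updated = List.of("doc1", "doc2", "doc3", "doc4");
    }
}
```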

What is the name of this kind of cache / data structure?

I need a fixed-size cache of objects that keeps track how many times each object was requested. When it is full and a new object is added, the object with the lowest usage score gets removed.
So this is different from an LRU cache of size N in that if some object is heavily requested, then even adding N new objects won't push it out of the cache.
Some kind of mix of a cache and a priority queue. Is there a name for that?
Thanks!
What you are describing is essentially an LFU (least-frequently-used) cache. Without a time element, this kind of cache clogs up with things that were used a lot in the past but aren't used currently. Replacement becomes impossible, because everything in the cache has been used more than once, so you won't evict anything in favor of a new item.
You could write some code that degrades the value of the count over time (i.e., takes into account the time since last use), but doing so is just a really complicated way of simulating an LRU cache. I experimented with it at one point, but found that it didn't perform any better than the simple LRU cache, at least not in my application.
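For reference, a minimal sketch of such a frequency-counting (LFU) cache; the linear-scan eviction is for clarity only. It also makes the clogging problem visible: counts never decay, so old hot entries can never be evicted.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: fixed-size cache that evicts the entry with the lowest request count.
class LfuCache<K, V> {
    private final int capacity;
    private final Map<K, V> values = new HashMap<>();
    private final Map<K, Long> counts = new HashMap<>();

    LfuCache(int capacity) { this.capacity = capacity; }

    V get(K key) {
        if (!values.containsKey(key)) return null;
        counts.merge(key, 1L, Long::sum);  // bump the usage score
        return values.get(key);
    }

    void put(K key, V value) {
        if (!values.containsKey(key) && values.size() >= capacity) {
            // Evict the least-frequently-used entry (linear scan for clarity).
            K victim = null;
            long min = Long.MAX_VALUE;
            for (Map.Entry<K, Long> e : counts.entrySet()) {
                if (e.getValue() < min) { min = e.getValue(); victim = e.getKey(); }
            }
            values.remove(victim);
            counts.remove(victim);
        }
        values.put(key, value);
        counts.merge(key, 1L, Long::sum);
    }
}
```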

A better idea/data structure to collect analytics data

I'm collecting analytics data. I'm using a master map that holds many other nested maps.
Considering maps are immutable, many new maps are going to be allocated. (Yes, that is efficient in Clojure).
The basic operation I'm using is update-in, which is very convenient for updating a value at a given path or creating the binding for a non-existent value.
Once I reach a specific point, I save that data structure to the database.
What would be a better way to collect this data more efficiently in Clojure? A transient data structure?
As with all optimizations, measure first. If the map update is a bottleneck, then switching to a transient map is a rather unintrusive code change. If you find that GC overhead is the real culprit, as it often is with persistent data structures, and transients don't help enough, then collecting the data into a list and batch-adding it into a transient map (which is made persistent and saved into the DB at the end) may be a more effective, though larger, change. Adding to a list produces very little GC overhead because, unlike adding to a map, the old head does not need to be discarded and garbage-collected.

What is a multi-tier cache?

I've recently come across the phrase "multi-tier cache" relating to multi-tiered architectures, but without a meaningful explanation of what such a cache would be (or how it would be used).
Relevant online searches for that phrase don't really turn up anything either. My interpretation would be a cache servicing all tiers of some n-tier web app. Perhaps a distributed cache with one cache node on each tier.
Has SO ever come across this term before? Am I right? Way off?
I know this is old, but thought I'd toss in my two cents here since I've written several multi-tier caches, or at least several iterations of one.
Consider this: every application has different layers, and at each layer a different form of information can be cached. Each cache item generally expires for one of two reasons: either a period of time has elapsed, or a dependency has been updated.
For this explanation, let's imagine that we have three layers:
Templates (object definitions)
Objects (complete object cache)
Blocks (partial objects / block cache)
Each layer depends on its parent, and we would define those relationships using some form of dependency assignment. So Blocks depend on Objects, which depend on Templates. If an Object is changed, any dependent Blocks would be expunged and refreshed; if a Template is changed, any dependent Objects would be expunged, in turn expunging any Blocks, and all would be refreshed.
There are several benefits. Long expiry times are a big one, because dependencies ensure that downstream resources are updated whenever their parents are updated, so you won't get stale cached resources. Block caches alone are a big help because, short of whole-page caching (which requires AJAX or Edge Side Includes to avoid caching dynamic content), blocks are the elements closest to the end user's browser / interface and can save boatloads of pre-processing cycles.
The complication in a multi-tier cache like this, though, is that it generally can't rely on purely DB-based foreign-key expunging, unless each tier is 1:1 in relation to its parent (i.e., a Block relies on only a single Object, which relies on a single Template). You'll have to programmatically address the expunging of dependent resources. You can do this either via stored procedures in the DB or in your application layer if you want to work with expunging rules dynamically, as in the sketch below.
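A minimal sketch of that programmatic, application-layer approach (names are illustrative): each entry registers its parents, and expunging a key cascades to everything that depends on it.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch: cascading invalidation across cache tiers. Expunging a parent
// recursively expunges every entry registered as depending on it.
class DependencyCache {
    private final Map<String, Object> entries = new HashMap<>();
    private final Map<String, Set<String>> dependents = new HashMap<>();

    void put(String key, Object value, String... parents) {
        entries.put(key, value);
        for (String parent : parents) {
            dependents.computeIfAbsent(parent, k -> new HashSet<>()).add(key);
        }
    }

    Object get(String key) { return entries.get(key); }

    void expunge(String key) {
        entries.remove(key);
        Set<String> children = dependents.remove(key);
        if (children != null) {
            for (String child : children) expunge(child);  // cascade downstream
        }
    }
}
```

Usage along the three layers above: put("template:home", tpl), then put("object:42", obj, "template:home"), then put("block:42:header", html, "object:42"); calling expunge("template:home") then removes the Object and the Block as well.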
Hope that helps someone :)
Edit: I should add that any one of these tiers can be clustered, sharded, or otherwise scaled out, so this model works in both small and large environments.
After playing around with EhCache for a few weeks, I'm still not perfectly clear on what they mean by the term "multi-tier" cache. I will follow up with what I interpret to be the implied meaning; if at any time down the road someone comes along and knows otherwise, please feel free to answer and I'll remove this one.
A multi-tier cache appears to be a replicated and/or distributed cache that lives on one or more tiers in an n-tier architecture. It allows components on multiple tiers to gain access to the same cache(s). In EhCache, this is achieved by using a replicated or distributed cache architecture in conjunction with simply referring to the same cache servers from multiple tiers.
