Cache implementation using ConcurrentHashMap - caching

We need to find an optimal way to clean up a cache.
For this task, assume the cache is a ConcurrentHashMap.
We have to devise a mechanism that deletes entries once a certain limit is reached,
let's say 1000 records present in the cache.
We have to ensure that we delete only those records which are least frequently used.
We also have to give weight to the size of the records: bulkier objects take more effort to build, so we want to preserve them as well.
So we can have a formula that calculates a score for each record.
The score can be x * size + y * (number of times used), where x and y are weights.
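A minimal sketch of the scoring idea above, written in Python for brevity; the same logic maps onto a ConcurrentHashMap whose values carry a size and a use counter. The names ScoredCache, Entry, max_entries, x and y are illustrative assumptions, not part of the question. The sketch is single-threaded, assumes max_entries is at least 1, and finds the victim with a linear scan.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    value: object
    size: int      # caller-supplied size estimate, e.g. in bytes
    uses: int = 0  # number of times the entry has been read


class ScoredCache:
    """Evicts the lowest-scoring entry once max_entries is exceeded."""

    def __init__(self, max_entries, x=1.0, y=1.0):
        self.max_entries = max_entries  # assumed to be >= 1
        self.x = x  # weight of the record size
        self.y = y  # weight of the use count
        self._entries = {}

    def score(self, entry):
        # score = x * size + y * (number of times used)
        return self.x * entry.size + self.y * entry.uses

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        entry.uses += 1
        return entry.value

    def put(self, key, value, size):
        # Overwriting an existing key resets its use count.
        self._entries[key] = Entry(value, size)
        while len(self._entries) > self.max_entries:
            # Linear scan for the lowest score; the entry that was just
            # inserted is skipped so it is not evicted immediately.
            victim = min(
                (k for k in self._entries if k != key),
                key=lambda k: self.score(self._entries[k]),
            )
            del self._entries[victim]
```

With x=1 and y=10, a 1-byte record read twice scores 21 and an unread 50-byte record scores 50, so the small record is evicted first. Tuning x and y decides how many reads outweigh a given size.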

Related

CosmosDB - Gremlin - high memory usage with query containing limit() step

I want to retrieve a large number of items using a limit clause:
g.V().hasLabel('foo').as('f').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
This query takes a very long time and consumes about 800 MB of memory to download the collection.
When I use the query below:
g.V().hasLabel('foo').as('f').has('propA','ValueA').has('propB','ABC').limit(5000).order().by('f_Id',incr).by('f_bar',incr).select('f').unfold().dedup()
it is faster and consumes less memory, around 500 MB, to download this collection, but that is still high.
My first question is how to optimize the first query, with just limit, if I do not want to filter by properties A and B.
My second question is why there is such a difference in memory use between those two results. In both queries I download 5000 items to memory. What could be a possible way to reduce this consumption?
I use the GremlinDriver for .NET.
I'm not an expert at CosmosDB optimization, but from a Gremlin perspective, when I look at this traversal:
g.V().hasLabel('foo').as('f').
limit(5000).order().by('f_Id',incr).by('f_bar',incr).
select('f').unfold().dedup()
I wonder why you wouldn't just write it as:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr)
Meaning, you want 5000 "foo" vertices ordered a certain way. The "f" step label and unfold() seem unnecessary, and I don't see how you could end up with duplicates, so you can drop dedup(). I'm not sure whether those changes will make any difference to how CosmosDB processes things, but they certainly remove some unneeded processing.
I'd also wonder if you need to pare down the data returned in your vertices. Right now you're returning all the properties for each vertex. If you don't need all of those, it might be better to be more specific and transform the data to the form your application requires:
g.V().hasLabel('foo').limit(5000).order().by('f_Id',incr).by('f_bar',incr).
valueMap('name','age')
That should help reduce serialization costs.

Google datastore - index a date created field without having a hotspot

I am using Google Datastore and will need to query it to retrieve some entities. These entities will need to be sorted by newest to oldest. My first thought was to have a date_created property which contains a timestamp. I would then index this field and sort on this field. The problem with this approach is it will cause hotspots in the database (https://cloud.google.com/datastore/docs/best-practices).
Do not index properties with monotonically increasing values (such as a NOW() timestamp). Maintaining such an index could lead to hotspots that impact Cloud Datastore latency for applications with high read and write rates.
Obviously sorting data by date is probably the most common sort performed on a database. If I can't index timestamps, is there another way I can sort my queries from newest to oldest without hotspots?
As you note, indexing monotonically increasing values doesn't scale and can lead to hotspots. Whether you are potentially impacted by this depends on your particular usage.
As a general rule, the hotspotting point of this pattern is 500 writes per second. If you know you're definitely going to stay under that you probably don't need to worry.
If you do need more than 500 writes per second, but have an upper limit in mind, you could attempt a sharded approach. Basically, if your upper limit on writes per second is x, then n = ceiling(x/500), where n is the number of shards. When you write your timestamp, prepend random(1, n) to it. This creates n random key ranges that can each sustain up to 500 writes per second. When you query your data, you'll need to issue n queries and do some client-side merging of the result streams.
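The arithmetic and the client-side merge can be sketched as follows. This is an illustration in plain Python with no Datastore calls: shard_count, sharded_value and merge_newest_first are hypothetical helper names, and the (shard, timestamp) tuple stands in for whatever prefixed encoding you actually store in the indexed property.

```python
import heapq
import math
import random

WRITES_PER_SHARD = 500  # per-range write rate quoted in the answer


def shard_count(peak_writes_per_second):
    """n = ceiling(x / 500), with at least one shard."""
    return max(1, math.ceil(peak_writes_per_second / WRITES_PER_SHARD))


def sharded_value(timestamp, n):
    """Return (shard, timestamp): the timestamp with a random shard prefix."""
    return (random.randint(1, n), timestamp)


def merge_newest_first(per_shard_results):
    """Merge n result lists, each already sorted newest first.

    Each list is assumed to come from one per-shard query ordered by
    timestamp descending; entries are (shard, timestamp) tuples.
    """
    return list(
        heapq.merge(*per_shard_results, key=lambda e: e[1], reverse=True)
    )
```

For example, an expected peak of 1800 writes per second gives shard_count(1800) == 4, so a read issues four queries and merges their results.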

How to store a few million cache entries and then track the 20 oldest

I got an interview question saying I need to store a few million cache entries, keep track of the 20 oldest entries, and, as soon as the cache grows past its threshold, replace the 20 oldest with the next set of oldest entries.
I answered that I would keep a hash map for it. The interviewer then asked:
what if we want to access any element of the hash map quickly, how would
we do that? I said that since it is a map, access would not take much time, but the
interviewer was not satisfied. So what would be the ideal approach for such
scenarios?
A queue is well-suited to finding and removing the oldest members.
A queue implemented as a doubly linked list has O(1) insertion and deletion at both ends.
A priority queue lends itself to giving different weights to different items in the queue (e.g. some queue elements may be more expensive to re-create than others).
You can use a hash map to hold the actual elements and find them quickly based on the hash key, and a queue of the hash keys to track age of cache elements.
By using a doubly linked list for the queue and also maintaining a hash map of the elements, you should be able to build a cache that supports a maximum size (or even an LRU cache). This should result in references to each object being stored 3 times, not in the object itself being stored twice; be sure to check for this if you implement it (a simple way to avoid duplicating the object is to queue only the hash key).
When checking for overflow, you just pop the last item off the queue and then remove it from the hash map.
When accessing an item, you can use the hash map to find the cached item. Then, if you are implementing an LRU cache, you just remove it from the queue and add it back to the beginning.
By using this structure Insert, Update, Read, Delete are all going to be O(1).
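The structure described above can be sketched as a hash map whose values are nodes of a doubly linked list. This is one possible layout, not the only one; the class and method names are illustrative. Sentinel head and tail nodes are an implementation choice that removes the edge cases for an empty list.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = None


class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._map = {}        # key -> node, O(1) lookup
        self._head = _Node()  # sentinel on the most recently used side
        self._tail = _Node()  # sentinel on the least recently used side
        self._head.next, self._tail.prev = self._tail, self._head

    def _unlink(self, node):
        node.prev.next, node.next.prev = node.next, node.prev

    def _push_front(self, node):
        node.prev, node.next = self._head, self._head.next
        self._head.next.prev = node
        self._head.next = node

    def get(self, key):
        node = self._map.get(key)
        if node is None:
            return None
        self._unlink(node)      # an access moves the item to the front
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self._map.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            node = self._map[key] = _Node(key, value)
        self._push_front(node)
        if len(self._map) > self.capacity:
            oldest = self._tail.prev   # pop the last item off the queue
            self._unlink(oldest)
            del self._map[oldest.key]  # then remove it from the hash map

    def delete(self, key):
        node = self._map.pop(key, None)
        if node is not None:
            self._unlink(node)
```

Every operation touches a constant number of pointers plus one hash lookup, which is where the O(1) bound comes from. Dropping the two list calls in get() turns this into a plain oldest-first (FIFO) cache.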
The follow-on question to expect is for the interviewer to ask for items to have a time-to-live (TTL) that varies per cached item. For this you need another queue that keeps items ordered by time-to-live. The only problem is that inserts now become O(n), because you have to scan the TTL queue to find the position at which the new item expires, so you have to decide whether the memory cost of storing the TTL queue as an n-tree is worthwhile (it yields O(log n) insert time). Alternatively, you could implement the TTL queue as buckets, one per ~1 minute interval or similar: inserts stay ~O(1), and you only slightly degrade the performance of the expiration background process (and it is a background process).
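The bucket variant might look like the sketch below. It is an assumption-laden illustration: TTLBuckets, BUCKET_SECONDS and the explicit now argument are hypothetical, and times are plain numbers in seconds so the example needs no clock.

```python
from collections import defaultdict

BUCKET_SECONDS = 60  # assumed bucket width ("~1 minute" in the text)


class TTLBuckets:
    def __init__(self):
        self._buckets = defaultdict(set)  # bucket index -> keys expiring in it
        self._expires_at = {}             # key -> exact expiry time

    def add(self, key, ttl, now):
        """O(1): file the key under the bucket containing its expiry time."""
        self.discard(key)
        expires_at = now + ttl
        self._expires_at[key] = expires_at
        self._buckets[int(expires_at // BUCKET_SECONDS)].add(key)

    def discard(self, key):
        expires_at = self._expires_at.pop(key, None)
        if expires_at is not None:
            bucket = int(expires_at // BUCKET_SECONDS)
            self._buckets[bucket].discard(key)
            if not self._buckets[bucket]:
                del self._buckets[bucket]

    def expire(self, now):
        """Background sweep: remove and return keys whose TTL has elapsed."""
        expired = []
        current = int(now // BUCKET_SECONDS)
        # Only buckets up to the current interval can hold expired keys.
        for bucket in sorted(b for b in self._buckets if b <= current):
            for key in list(self._buckets[bucket]):
                if self._expires_at[key] <= now:
                    expired.append(key)
                    self.discard(key)
        return expired
```

The sweep pays for the cheap insert: it has to look at every key in the buckets that are due, including keys in the current bucket that have not quite expired yet. The keys returned by expire() would then be removed from the main cache.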

Configuring redis to consistently evict older data first

I'm storing a bunch of realtime data in redis. I'm setting a TTL of 14400 seconds (4 hours) on all of the keys. I've set maxmemory to 10G, which currently is not enough space to fit 4 hours of data in memory, and I'm not using virtual memory, so redis is evicting data before it expires.
I'm okay with redis evicting the data, but I would like it to evict the oldest data first. So even if I don't have a full 4 hours of data, at least I can have some range of data (3 hours, 2 hours, etc) with no gaps in it. I tried to accomplish this by setting maxmemory-policy=volatile-ttl, thinking that the oldest keys would be evicted first since they all have the same TTL, but it's not working that way. It appears that redis is evicting data somewhat arbitrarily, so I end up with gaps in my data. For example, today the data from 2012-01-25T13:00 was evicted before the data from 2012-01-25T12:00.
Is it possible to configure redis to consistently evict the older data first?
Here are the relevant lines from my redis.cnf file. Let me know if you want to see any more of the configuration:
maxmemory 10gb
maxmemory-policy volatile-ttl
vm-enabled no
AFAIK, it is not possible to configure Redis to consistently evict the older data first.
When the *-ttl or *-lru options are chosen in maxmemory-policy, Redis does not use an exact algorithm to pick the keys to be removed. An exact algorithm would require an extra list (for *-lru) or an extra heap (for *-ttl) in memory, cross-referenced with the normal Redis dictionary data structure. It would be expensive in terms of memory consumption.
With the current mechanism, evictions occur in the main event loop (i.e. potential evictions are checked at each loop iteration before each command is executed). Until memory is back under the maxmemory limit, Redis randomly picks a sample of n keys and selects for eviction the most idle one (for *-lru) or the one closest to its expiration time (for *-ttl). By default only 3 samples are considered. The result is non-deterministic.
One way to increase the accuracy of this algorithm and mitigate the problem is to increase the number of considered samples (maxmemory-samples parameter in the configuration file).
Do not set it too high, since it will consume some CPU. It is a tradeoff between eviction accuracy and CPU consumption.
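To see why sampling produces gaps, here is a toy model of the *-ttl selection described above. It is not Redis code: evict_one and out_of_order_evictions are hypothetical names, keys are integers that expire at a time equal to their own value, and the model ignores everything except the sampling step.

```python
import random


def evict_one(expiry_by_key, samples, rng):
    """Pick `samples` random keys and evict the one closest to expiring."""
    candidates = rng.sample(
        sorted(expiry_by_key), min(samples, len(expiry_by_key))
    )
    victim = min(candidates, key=expiry_by_key.__getitem__)
    del expiry_by_key[victim]
    return victim


def out_of_order_evictions(samples, keys=1000, evictions=500, seed=0):
    """Count evictions that removed a key while an older key remained."""
    rng = random.Random(seed)
    expiry_by_key = {k: k for k in range(keys)}  # key k expires at time k
    out_of_order = 0
    for _ in range(evictions):
        oldest = min(expiry_by_key.values())
        victim = evict_one(expiry_by_key, samples, rng)
        if victim != oldest:
            out_of_order += 1
    return out_of_order
```

Calling out_of_order_evictions with a small sample size and then with larger ones shows the trade-off: the count only reaches zero when every key is sampled, which is the exact (and expensive) algorithm.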
Now if you really require a consistent behavior, one solution is to implement your own eviction mechanism on top of Redis. For instance, you could add a list (for non updatable keys) or a sorted set (for updatable keys) in order to track the keys that should be evicted first. Then, you add a daemon whose purpose is to periodically check (using INFO) the memory consumption and query the items of the list/sorted set to remove the relevant keys.
Please note other caching systems have their own way to deal with this problem. For instance with memcached, there is one LRU structure per slab (which depends on the object size), so the eviction order is also not accurate (although more deterministic than with Redis in practice).

Should I keep the size of stored fields in Solr to a minimum?

I am looking to introduce Solr to power the search for a business listing website. The site has around 2 million records.
There is a search results page which will display some key data for each result. I believe the data needed for this summary information is around 1KB per result.
I could simply index the fields needed for the search within Solr - but this means a separate database call for each result to populate the summary information. If Solr could return all of this data I would expect it to yield greater performance than ~40 database round-trips.
The concern is that Solr's memory usage would be too large (how might I calculate this?) and that indexing might take too long with the extra data.
You would benefit greatly from storing those fields in Solr compared to the 40 database round-trips. Just make sure that you mark the fields as "not indexed" (indexed = false) in your schema config and maybe also as compressed (compressed = true); however, compression will of course use some CPU when indexing and retrieving.
When a field is marked as "not indexed", no analyzers process it at index time, so it is stored much faster than an indexed field.
It's a trade off, and you will have to analyze this yourself.
Solr's performance greatly depends on caching, not only of queries, but also of the documents themselves. Those caches depend on memory, and the bigger your documents are, the less you can fit in a fixed amount of memory.
Document size also affects index size and replication times. For large indices with master slave configurations, this can impact the rate at which you can update the index.
Ideally you should measure cache hit rates at different cache sizes, with and without the fields. If you can spend the memory to get a high enough cache hit rate with the fields, then by all means go for it. If you cannot, you may have to fetch the document content from another system.
There is a third alternative you didn't mention, which is to store the documents outside of the DB, but not in Solr. They should be stored in a format which is as close as possible to what you deliver with search results. The code which creates/updates the indices could create/update these documents as well. This is a lot of work, but like everything it comes down to how much performance you need and what you are willing to do to get it.
EDIT: For measuring cache hit rates and throughput, I've found the best test source is your current query logs. Take a day or two's worth of live queries and run them against different indexes and configurations to see how well they work.

Resources