Guava Cache: How to access without it counting for the eviction policy? - caching

I have a Guava cache which I would like to expire after X minutes have passed from the last access on a key. However, I also periodically do an action on all the current key-vals (much more frequently than the X minutes), and I wouldn't like this to count as an access to the key-value pair, because then the keys will never expire.
Is there some way to read the value of the keys without this influencing the internal state of the cache? ie cache._secretvalues.get(key) where I could conceivably subclass Cache to StealthCache and do getStealth(key)? I know relying on internal stuff is non-ideal, just wondering if it's possible at all. I think when I do cache.asMap.get() it still counts as an access internally.

From the official Guava tutorials:
Access time is reset by all cache read and write operations (including
Cache.asMap().get(Object) and Cache.asMap().put(K, V)), but not by
containsKey(Object), nor by operations on the collection-views of
Cache.asMap(). So, for example, iterating through cache.entrySet()
does not reset access time for the entries you retrieve.
So, what I would have to do is iterate through the entrySet instead to do my stealth operations.

Related

Does EX second impact performance in Redis?

I tried googling something similar , but wasn't habel to find something on the topic
I'm just curious, does it matter how big the number of seconds are set in a key impact performance in redis?
For example:
set mykey "foobarValue" EX 100 VS set mykey "foobarValue" EX 2592000
To answer this question, we need to see how Redis works.
Redis maintains tables of a key, value pair with an expiry time, so each entry can be translated to
<Key: <Value, Expiry> >
There can be other metadata associated with this as well. During GET, SET, DEL, EXPIRE etc operations Redis calculates the hash of the given key(s) and tries to perform the operation. Since it's a hash table, it needs to prob during any operation, while probing it may encounter some expired keys. If you have subscribed for "Keyspace notification" then notification would be sent and the given entry is removed/updated based on the operation being performed. It also does rehashing, during rehashing it might find expired keys as well. Redis also runs background tasks to cleanup expire keys, that means if TTL is too small then more keys would be expired, as this process is random, so more event would be generated.
https://github.com/antirez/redis/blob/a92921da135e38eedd89138e15fe9fd1ffdd9b48/src/expire.c#L98
It does have a small performance issue when TTL is small since it needs to free the memory and fix some pointers. But it can so happen that you're running out of memory since expired keys are also present in the database. Similarly, if you use higher expiry time then the given key would present in the system for a longer time, that can create memory issue.
Setting smaller TTL has also more cache miss for the client application, so client will have performance issues as well.

Redis cache updating

EDIT2: Clarification: The code ALREADY has refresh cache on miss logic. What I'm trying to do is reducing the number of missed cache hits.
I'm using Redis as a cache for an API. The idea is that when the API receives a call it first checks the cache and if the data isn't in cache the API will fetch it and cache it afterwards for next time.
At the moment the configuration is the following:
maxmemory 50mb
maxmemory-policy allkeys-lru
That is, use at most 50mb memory, keep trying keys in there and when memory is full start by deleting the least recently used keys (lru).
Now I want to introduce a second category of keys. For this second category I'm going to set a certain expiry time. Now I would like to set up a mechanism such that when these keys expiry this mechanism kicks in and refreshes them (and sets new expiry).
How do I do this?
EDIT:
Some progress. It turns out that Redis has a pub/sub messaging system which in particular can dispatch messages on event. One of them is expiring keys, which can be enabled as such:
notify-keyspace-events Ex
I found this code can describes a blocking python process subscribing to Redis' messaging system. It can easily be changed to detect keys expiring and make a call to the API when a key expires, and the API will then refresh the key.
def work(self, item):
requests.get('http://apiurl/?q={param}'.format(param=item['data']))
So this does precisely what I was asking about.
Often, this feels way too dangerous and out of control. I can imagine a bunch of different situations under which this will very quickly fail.
So, what's a better solution?
http://redis.io/topics/notifications
Keyspace notifications allows clients to subscribe to Pub/Sub channels
in order to receive events affecting the Redis data set in some way.
Examples of the events that is possible to receive are the following:
All the keys expiring in the database 0. (e.g)
...
EXPIRE generates an expire event when an expire is set to the key, or
a expired event every time setting an expire results into the key
being deleted (see EXPIRE documentation for more info).
To expire keys, just use Redis' built-in expiry mechanism. You don't need to refresh the cache contents on expiry, the simplest is to do it when the code experiences a cache miss.

APC User Cache Entries not being removed after timeout

I'm running APC mainly to cache objects and query data as user cache entries, each item it setup with a specific time relevant to the amount of time it's required in the cache, some items are 48 hours but more are 2-5 minutes.
It's my understanding that when the timeout is reached and the current time passes the created at time then the item should be automatically removed from the user cache entries?
This doesn't seem to be happening though and the items are instead staying in memory? I thought maybe the garbage collector would remove these items but it doesn't seem to have done even though it's running once an hour at the moment.
The only other thing I can think is that the default apc.user_ttl = 0 overrides the individual timeout values and sets them to never be removed even after individual timeouts?
In general, a cache manager SHOULD keep your entries for as long as possible, and MAY delete them if/when necessary.
The Time-To-Live (TTL) mechanism exists to flag entries as "expired", but expired entries are not automatically deleted, nor should they be, because APC is configured with a fixed memory size (using apc.shm_size configuration item) and there is no advantage in deleting an entry when you don't have to. There is a blurb below in the APC documentation:
If APC is working, the Cache full count number (on the left) will
display the number of times the cache has reached maximum capacity and
has had to forcefully clean any entries that haven't been accessed in
the last apc.ttl seconds.
I take this to mean that if the cache never "reached maximum capacity", no garbage collection will take place at all, and it is the right thing to do.
More specifically, I'm assuming you are using the apc_add/apc_store function to add your entries, this has a similar effect to the apc.user_ttl, for which the documentation explains as:
The number of seconds a cache entry is allowed to idle in a slot in
case this cache entry slot is needed by another entry
Note the "in case" statement. Again I take this to mean that the cache manager does not guarantee a precise time to delete your entry, but instead try to guarantee that your entries stays valid before it is expired. In other words, the cache manager puts more effort on KEEPING the entries instead of DELETING them.
apc.ttl doesn't do anything unless there is insufficient allocated memory to store new coming variables, if there is sufficient memory the cache will never expire!!. so you have to specify your ttl for every variable u store using apc_store() or apc_add() to force apc to regenerate it after end of specified ttl passed to the function. if u use opcode caching it will also never expire unless the page is modified(when stat=1) or there is no memory. so apc.user_ttl or apc.ttl are actually have nothing to do.

Plone 4.2 how to make PAS cache external usera data

I'm implementing a PAS plugin that handles authentications against mailservers. Actually only DBMail is implemented.
I realized, that the enumerateUsers function from the PAS plugin is called numerous times per request and requires my plugin to open/close an SQL connections for every (subsequent) request. Of course, this is very expensive.
The connections itself are handled in a plone tool, which is able to handle multiple different mailservers and delegeates the enumerateUsers call to wrapper objects that represent registered servers.
My question is now, what sort of cache (OOBTree, Session?) I should use to provide a temporary local storage for repeating enumerations and avoid subsequent SQL connections?
Another idea was, to hook into the user creation process that takes place on the first login, an external user issues and completely "localize" the users.
Third idea was, to store the needed data in the specific member, if possible.
What would be best practice here?
I'd cache the query results, indeed. You need to make a decision on how long to cache the results, and if stored long term, how to invalidate that cache or check for changes.
There are no best practices for these decisions, as they depend entirely on the type of data stored and the APIs of the backends. If they support some kind of freshness query, for example, then you store everything forever and poll the backend to see if the cache needs updating.
You can start with a simple request cache; query once per request, store it on the request object. Your cache will automatically be invalidated at the end of the request as the request object is cleaned up, the next request will be a clean slate.
If your backend users rarely change, you can cache information for longer, in a local cache. I'd use a volatile attribute on the plugin. Any attribute starting with _v_ is ignored by the persistence machinery. Thus, anything stored in a _v_ volatile attribute is both thread-local and only exists for the lifetime of the process, a restart of the server clears these automatically.
At the very least you should use an _v_ volatile attribute to store your backend SQL connections. That way they can stay open between requests, and can be re-used. Something like the following method would do nicely:
def _connection(self):
# Return a backend connection
if getattr(self, '_v_connection', None) is None:
# Create connection here
self._v_connection = yourdatabaseconnection
return self._v_connection
You could also use a persistent attribute on your plugin to store your cache. This cache would be committed to the ZODB and persist across restarts. You then really need to work out how to invalidate the contents; store timestamps and evict data when to old, etc.
Your cache datastructure depends entirely on your application needs. If you don't persist information, a dictionary (username -> information) could be more than enough. Persisted caches could benefit from using a OOBTree instead of a dictionary as they reduce chances of conflicts between different threads and are more efficient when it comes to large sets of data.
Whatever you do, you do not need to use a Session. Sessions are prone to conflicts, do not scale well, and are in any case not the place to store a cache of this kind.

Redis namespacing basics

I am really new to Redis and have been using it along with my Ruby on Rails (Rails 2.3 and Ruby 1.8.7) application using the redis gem for simple tagging functionality as a key value store. I recently realized that I could use it to maintain a user activity feed as well.
The thing is I need the tagging data (stored as key => Sets) in memory and its extremely important to determine results for tagging related operations, where as for the activity feed the data could be deleted on a first in first out basis. Assuming I store X number of activities for every user
Is it possible that I could namespace the redis data sets and have one remain permanently in memory and have the other stay temporarily in the memory. What is the general approach when one uses unrelated data sets that need to have different durations of survival in memory.
Would really appreciate any help on this.
You do not need to define a specific namespace for this. With Redis, you can use the EXPIRE command to set a timeout on a key by key basis.
The general policy regarding key expiration is defined in the configuration file:
# MAXMEMORY POLICY: how Redis will select what to remove when maxmemory
# is reached? You can select among five behavior:
#
# volatile-lru -> remove the key with an expire set using an LRU algorithm
# allkeys-lru -> remove any key accordingly to the LRU algorithm
# volatile-random -> remove a random key with an expire set
# allkeys->random -> remove a random key, any key
# volatile-ttl -> remove the key with the nearest expire time (minor TTL)
# noeviction -> don't expire at all, just return an error on write operations
#
For your purpose, the volatile-lru policy should be set.
You just have to call EXPIRE on the keys you want to be volatile, and let Redis evict them. However please note it is difficult to guarantee that the oldest keys will be evicted first once the timeout has been triggered. More explanations here.
For your specific use case however, I would not use key expiration but rather try to simulate capped collections. If the activity feed for a given user is represented as a list of objects, it is easy to LPUSH the activity objects, and use LTRIM to limit the size of the list. You get FIFO behavior and keep memory consumption under control for free.
UPDATE:
Now, if you really need to isolate data, you have two main possibilities with Redis:
using two distinct databases. Redis database are identified by an integer, and you can have several of them per instance. Use the select command to switch between databases. Databases can be used to isolate data, but not to assign them different properties (like an expiration policy for instance).
using two distinct instances. An empty Redis instance is a very light process. So several of them can be started without any problem. It is actually the best and the more scalable way to isolate data with Redis. Each instance can have its own policies (including eviction policy). The clients should open as many connections as instances.
But again, you do not need to isolate data to implement your eviction policy requirements.

Resources