Memcache tags simulation - caching

Memcached is a great scalable cache layer, but it has one big problem (for me): it cannot manage tags. And tags are really useful for group invalidation.
I have done some research and I'm aware of some solutions:
Memcache tag fork http://code.google.com/p/memcached-tag/
Code implementation to emulate tags (ref. Best way to invalidate a number of memcache keys using standard php libraries?)
One of my favorite solutions is namespacing, and this solution is explained on the memcached wiki.
However, I don't understand why we integrate the namespace into the cache key.
From what I understand, the namespace trick works like this: to generate a key, we have to get the value of the namespace (from the cache). And if the namespace->value cache entry is evicted, we can no longer compute the right key to fetch the cache... So the cache for this namespace is virtually invalidated (I say virtually because the entries still exist, but we can no longer compute the keys to access them).
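For reference, a minimal PHP sketch of that trick as I understand it (key names and the time()-based versioning are illustrative, not from the wiki):

// Build the real cache key from the namespace's current version value.
function nsKey($memcache, $namespace, $key) {
    $version = $memcache->get("ns:" . $namespace);
    if ($version === false) {
        // Namespace entry missing (first use or evicted): start a fresh version,
        // which implicitly invalidates all keys built with the old one.
        $version = time();
        $memcache->set("ns:" . $namespace, $version);
    }
    return $namespace . ":" . $version . ":" . $key;
}

// "Invalidate" the namespace: bump its version so old keys become unreachable.
function invalidateNamespace($memcache, $namespace) {
    $memcache->increment("ns:" . $namespace);
}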
So why can we not simply implement something like:
tag1->[key1, key2, key5]
tag2->[key1, key3, key6]
key1->["value" => value1, "tags" => [tag1, tag2]]
key2->["value" => value2, "tags" => [tag1]]
key3->["value" => value3, "tags" => [tag3]]
etc...
With this implementation I come back to the problem that if tag1->[key1, key2, key5] is evicted, we can no longer invalidate tag1's keys. But with:
function load($memcache, $cacheId) {
    $cache = $memcache->get($cacheId);
    if (is_array($cache)) {
        $evicted = false;
        // Check whether any of the tags has been evicted
        foreach ($cache["tags"] as $tagId) {
            if (!$memcache->get($tagId)) {
                $evicted = true;
                break;
            }
        }
        // If no tag has been evicted we can return the cache
        if (!$evicted) {
            return $cache;
        }
        // Otherwise drop the stale entry (not mandatory)
        $memcache->delete($cacheId);
    }
    // Cache miss or evicted tag
    return false;
}
This is pseudo code.
This way we are sure to return the cache only if all of its tags are still available.
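For completeness, a matching save() could register the key under each of its tags, something like this sketch (the get/set pairs are not atomic, so concurrent writers could lose tag updates):

function save($memcache, $cacheId, $value, array $tags) {
    // key -> ["value" => ..., "tags" => [...]]
    $memcache->set($cacheId, ["value" => $value, "tags" => $tags]);
    // tag -> [key1, key2, ...]
    foreach ($tags as $tagId) {
        $keys = $memcache->get($tagId);
        if (!is_array($keys)) {
            $keys = [];
        }
        if (!in_array($cacheId, $keys)) {
            $keys[] = $cacheId;
            $memcache->set($tagId, $keys);
        }
    }
}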
The first objection is: "each time you need to get a cache entry, you have to check(/get) X tags and then check the array". But with the namespace approach we also have to check(/get) the namespace to retrieve its value; the main difference is having to iterate over an array...
But I do not think keys will have many tags (I cannot imagine more than 10 tags per key for my application), so iterating over an array of size 10 is quite fast...
So my question is: has someone already thought about this implementation? What are its limits? Did I forget something? etc.
Or maybe I have misunderstood the concept of namespaces...
PS: I'm not looking for another cache layer like memcached-tag or redis

I think you are forgetting something with this implementation, but it's trivial to fix.
Consider the problem of multiple keys sharing some tags:
key1 -> tag1 tag2
key2 -> tag1 tag2
tag1 -> key1 key2
tag2 -> key1 key2
Say you load key1. You double check both tag1 and tag2 exist. This is fine and the key loads.
Then tag1 is somehow evicted from the cache.
Your code then invalidates tag1. This should delete key1 and key2, but because tag1 has been evicted, its key list is gone and this does not happen.
Then you add a new item key3. It also refers to tag1:
key3 -> tag1
When saving this key, tag1 is (re)created:
tag1 -> key3
Later, when loading key1 from cache again, your pseudo-code check to ensure tag1 exists succeeds, and the (stale) data from key1 is allowed to be loaded.
Obviously a way around this is to check the value of the tag1 entry to ensure the key you are loading is listed in that array, and to only consider your key valid if it is.
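Applied to the pseudo code from the question, that means replacing the simple existence check with a membership check, roughly:

foreach ($cache["tags"] as $tagId) {
    $tagKeys = $memcache->get($tagId);
    // The tag entry must exist AND still list this key, otherwise the entry is stale
    if (!is_array($tagKeys) || !in_array($cacheId, $tagKeys)) {
        $evicted = true;
        break;
    }
}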
Of course this could have performance issues depending on your use case. If a given key has 10 tags, but each of those tags is used by 10k keys, then you have to search through an array of 10k items to find your key, and repeat that 10 times each time you load something.
At some point, this may become inefficient.
An alternative implementation (and one which I use), is more appropriate when you have a very high read to write ratio.
If reads are very much the common case, then you could implement your tag capability in a more permanent database backend (I'll assume you have a DB of sorts anyway, so it only needs a couple of extra tables).
When you write an item to the cache, you store the key and the tag in a simple table (key and tag columns, one row for each tag on a key). Writing a key is simple: "delete from cache_tags where key=:key; foreach (tags as tag) insert into cache_tags values(:key, :tag);"
(NB: use extended insert syntax in a real implementation.)
When invalidating a tag, simply iterate over all keys that have that tag: (select key from cache_tags where tag=:tag;) and invalidate each of them (and optionally delete the key from the cache_tags table too to tidy up).
If a key is evicted from memcache, the cache_tags metadata will be out of date, but this is typically harmless. At most it results in an inefficiency when invalidating a tag: you attempt to invalidate a key which had that tag but has already been evicted.
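A rough PHP/PDO sketch of that approach, assuming the cache_tags(key, tag) table described above and MySQL-style identifier quoting (a real implementation would batch the inserts):

// Writing a key: refresh its tag rows.
function saveTags(PDO $db, $key, array $tags) {
    $db->prepare("DELETE FROM cache_tags WHERE `key` = ?")->execute([$key]);
    $insert = $db->prepare("INSERT INTO cache_tags (`key`, tag) VALUES (?, ?)");
    foreach ($tags as $tag) {
        $insert->execute([$key, $tag]);
    }
}

// Invalidating a tag: delete every key carrying it, then tidy up the metadata.
function invalidateTag(PDO $db, $memcache, $tag) {
    $select = $db->prepare("SELECT `key` FROM cache_tags WHERE tag = ?");
    $select->execute([$tag]);
    foreach ($select->fetchAll(PDO::FETCH_COLUMN) as $key) {
        $memcache->delete($key);
    }
    $db->prepare("DELETE FROM cache_tags WHERE tag = ?")->execute([$tag]);
}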
This approach gives "free" loading (no need to check tags) but expensive saving (which is already expensive anyway otherwise it wouldn't need to be cached in the first place!).
So depending on your use case and the expected load patterns and usage, I'd hope that either your original strategy (with more stringent checks on load) or a "database backed tag" strategy would fit your needs.
HTHs

Related

StackExchange.Redis server.Keys(pattern:) alternative

I need to store some keys in the cache so that each one expires individually, but have them grouped.
So I do something like this:
Connection.GetDatabase().StringSet($"Logged:{userID}", "", TimeSpan.FromSeconds(30));
At some point I want to get all the grouped keys, so, following the example in the library's GitHub documentation
https://github.com/StackExchange/StackExchange.Redis/blob/main/docs/KeysScan.md
I do this:
var servers = Connection.GetEndPoints();
return Connection.GetServer(servers[0]).Keys(pattern: "Logged:*");
But on the same page on GitHub there is this warning:
Either way, both SCAN and KEYS will need to sweep the entire keyspace,
so should be avoided on production servers - or at least, targeted at
replicas.
What else can I use to achieve what I want without using Keys, if we don't have replicas?

Remove cache keys by pattern/wildcard

I'm building a REST API with Lumen and want to cache some of the routes with Redis. E.g. for the route /users/123/items I use:
$items = Cache::remember('users:123:items', 60, function () {
// Get data from database and return
});
When a change is made to the user's items, I clear the cache with:
Cache::forget('users:123:items');
So far so good. However, I also need to clear the cache I've implemented for the routes /users/123 and /users/123/categories since those include an item list as well. This means I also have to run:
Cache::forget('users:123');
Cache::forget('users:123:categories');
In the future, there might be even more caches to clear, which is why I'm looking for a pattern/wildcard feature such as:
Cache::forget('users:123*');
Is there any way to accommodate this behavior in Lumen/Laravel?
You can use cache tags.
Cache tags allow you to tag related items in the cache and then flush all cached values that have been assigned a given tag. You may access a tagged cache by passing in an ordered array of tag names. For example, let's access a tagged cache and put values in the cache:
Cache::tags(['people', 'artists'])->put('John', $john, $minutes);
Cache::tags(['people', 'authors'])->put('Anne', $anne, $minutes);
You may flush all items that are assigned a tag or list of tags. For example, this statement would remove all caches tagged with either people, authors, or both. So, both Anne and John would be removed from the cache:
Cache::tags(['people', 'authors'])->flush();
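Reading a tagged value back uses the same tags, for example:
$john = Cache::tags(['people', 'artists'])->get('John');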
First, get the cached keys matching a pattern:
$output = Redis::connection('cache')->keys("*mn");
Output:
[
    "projectName_database_ProjectName_cache_:mn"
]
The output consists of four parts:
redis prefix ==> config('database.redis.options.prefix')
cache prefix ==> config('cache.prefix')
separator ":"
your cached key "mn"
Then extract the key and forget it:
$parts = explode(":", $output[0]);
$key = end($parts); // "mn"
Cache::forget($key); // delete key

$ORDER vs counting to scan global range

I have a choice between two ways of scanning through a key level in a large global array and am trying to figure out if one method is more efficient than the other.
This is a vendor-supplied application and database on the InterSystems Caché database platform. It is written in the old MUMPS style and does not use any of Caché's object persistence features: all data is stored directly in globals and any indexes are application-maintained.
There is a common convention for repeating data elements attached to entities, where the first record contains a count of child records and each child record is numbered sequentially at the next key level. For example:
^GBDATA(12345,100)="3"
^GBDATA(12345,100,1)="A^Record"
^GBDATA(12345,100,2)="B^Record"
^GBDATA(12345,100,3)="C^Record"
Where "12345" is the entity key, and "100" is one of the attached detail types. Note that the first "100" record with no other keys has the count of subrecords. There could be anywhere between 0 and hundreds of subrecords attached. The entities are often very wide and there is a lot of other data besides this subrecord type (not shown in example).
Given an entity key, I want to scan through all the subrecords of one type. Would it be faster to use $ORDER to go through the subkeys or to use a FOR loop to anticipate the key values? Does it matter?
$ORDER method:
SET EKEY=12345
SET SEQ=""
FOR
{
SET SEQ=$ORDER(^GBDATA(EKEY,100,SEQ), 1, ROWDATA)
QUIT:SEQ=""
WRITE ROWDATA,!
}
FOR count method:
SET EKEY=12345
SET LIM=^GBDATA(EKEY,100)
FOR SEQ=1:1:LIM
{
WRITE ^GBDATA(EKEY,100,SEQ),!
}
Does anyone know how $ORDER vs $GET is implemented internally in Caché?
I'm having trouble testing this empirically since we only have one production instance with appropriate data and I can't take it offline to clear the cache. I'm most interested in from-disk performance as opposed to from-cache performance.
You could use %SYS.MONLBL to figure out definitively. My guess is that $ORDER is slightly better.
http://docs.intersystems.com/cache20122/csp/docbook/DocBook.UI.Page.cls?KEY=GCM_monlbl
In regards to your question, "Does anyone know how $ORDER vs $GET is implemented internally in Caché?" The two are completely different functions.
$Order is used to step through the subscripts of your ^Global in a given direction.
$Get is used to pull the data stored within the ^Global. Below is an example of its use. I use Caché ObjectScript; however, this should give you a general idea.
Global Structure
^People(LastName,FirstName)="Phone"
Global Data
^People("Doe","John")="1035001234"
^People("Smith","Jane")="7405241305"
^People("Wood","Edgar")="7555127598"
Code Sample
SET LASTNAME=""
FOR  SET LASTNAME=$ORDER(^People(LASTNAME)) QUIT:LASTNAME=""  DO
.SET FIRSTNAME=""
.FOR  SET FIRSTNAME=$ORDER(^People(LASTNAME,FIRSTNAME)) QUIT:FIRSTNAME=""  DO
..SET PHONE=$GET(^People(LASTNAME,FIRSTNAME))
..WRITE LASTNAME,", ",FIRSTNAME,": ",PHONE,!
In the sample provided above, $ORDER walks through each last name in the ^People global and, for each last name, through each first name. It then uses $GET to fetch the data stored at the ^People(LASTNAME,FIRSTNAME) node, which is the phone number.
For some samples and reference areas, check out the following links:
$Get Information
$Order Information

g-wan kv store KV_INCR_KEY

How to use the KV_INCR_KEY?
I found a useful feature in the G-WAN API, but without any sample.
I want to add items to the KV store with this as primary key.
Also, how to get the value of this key?
The KV_INCR_KEY value is a flag intended to be passed to kv_add().
You get the newly inserted key's value by checking the return value of kv_add(). The documentation states:
kv_add(): add/update a value associated to a key
return: 0:out of memory, else:pointer on existing/inserted kv_item struct
This was derived from an idea discussed on the G-WAN forum. And, like some other flags (timestamp or persistence, for example), it has not been implemented yet (KV_NO_UPDATE is functional).
Since what follows the next version (focused on new scripted languages) is a kind of zero-configuration mapReduce, the KV store will get more attention soon.

Memcached dependent items

I'm using memcached (specifically the Enyim memcached client) and I would like to be able to make keys in the cache dependent on other keys, i.e. if Key A is dependent on Key B, then whenever Key B is deleted or changed, Key A is also invalidated.
If possible I would also like to make sure that data integrity is maintained if a node in the cluster fails, i.e. if Key B is at some point unavailable, Key A should still be treated as invalid, just as if Key B had been invalidated.
Based on this post I believe that this is possible, but I'm struggling to understand the algorithm enough to convince myself how / why this works.
Can anyone help me out?
I've been using memcached quite a bit lately and I'm sure what you're trying to do with dependencies isn't possible with memcached "as is"; it would need to be handled on the client side. Data replication should also happen server side, not from the client; these are two different domains. (With memcached at least, given its lack of data-storage logic. The point of memcached is just that: extreme minimalism for better performance.)
For data replication (protection against a physically failing cluster node) you should check out Membase http://www.couchbase.org/get/couchbase/current instead.
For the dependencies algorithm, I could see something like this in a client: for any given key there is an additional key holding the list/array of dependent keys.
# delete a key, recursively deleting its dependents first:
function deleteKey( keyname ):
    foreach ( client.getDeps( keyname ) as dep ):
        deleteKey( dep )
    endeach
    memcached.delete( keyname + "_deps" )  # tidy up the dependency list
    memcached.delete( keyname )
endfunction

# return the list of dependent keynames, or an empty list if there are none
function client.getDeps( keyname ):
    return memcached.get( keyname + "_deps" ) or array()
endfunction

# Key "demokey1" and its counterpart "demokey1_deps". The list of keys stored in
# "demokey1_deps" contains "demokey2" and "demokey3".
deleteKey( "demokey1" )
# This first performs a memcached get on "demokey1_deps", then, with the
# returned list of keys ("demokey2" and "demokey3"), runs deleteKey()
# on each of them before finally deleting "demokey1" itself.
Cheers
I don't think it's a direct solution, but try creating a system of namespaces in your memcache keys, e.g. http://www.cakemail.com/namespacing-in-memcached/. In short, keys are generated so that they contain the current values of other memcached keys. In the namespacing problem, the idea is to invalidate a whole range of keys within a certain namespace; this is achieved by incrementing the value of the namespace key, so any key referencing the previous namespace value will no longer match when the key is regenerated.
Your problem looks a little different, but I think that by setting up Key A to be in the Key B "namespace", if node B were unavailable then looking up Key A's full namespaced key, e.g.
"Key A|Key B:<whatever Key B value is>"
would fail, thus allowing you to determine that B is unavailable and to invalidate the cache lookup for Key A.
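A minimal PHP sketch of that lookup (the function name is hypothetical; the same idea applies to any client, including Enyim):

// Key A is only valid while Key B exists with the same value it had
// when Key A was written, because B's value is embedded in A's real key.
function dependentGet($memcache, $keyA, $keyB) {
    $b = $memcache->get($keyB);
    if ($b === false) {
        return false; // Key B unavailable => treat Key A as invalid
    }
    return $memcache->get($keyA . "|" . $keyB . ":" . $b);
}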
