I have taken a database class this semester and we are studying about maintaining cache consistency between the RDBMS and a cache server such as memcached. The consistency issues arise when there are race conditions. For example:
Suppose I do a get(key) from the cache and there is a cache miss. Because I get a cache miss, I fetch the data from the database, and then do a put(key,value) into the cache.
But, a race condition might happen, where some other user might delete the data I fetched from the database. This delete might happen before I do a put into the cache.
Thus, ideally the put into the cache should not happen, since the data is longer present in the database.
If the cache entry has a TTL, the entry in the cache might expire. But still, there is a window where the data in the cache is inconsistent with the database.
I have been searching for articles/research papers which speak about this kind of issues. But, I could not find any useful resources.
This article gives you an interesting note on how Facebook (tries to) maintain cache consistency : http://www.25hoursaday.com/weblog/2008/08/21/HowFacebookKeepsMemcachedConsistentAcrossGeoDistributedDataCenters.aspx
Here's a gist from the article.
I update my first name from "Jason" to "Monkey"
We write "Monkey" in to the master database in California and delete my first name from memcache in California but not Virginia
Someone goes to my profile in Virginia
We find my first name in memcache and return "Jason"
Replication catches up and we update the slave database with my first name as "Monkey." We also delete my first name from Virginia memcache because that cache object showed up in the replication stream
Someone else goes to my profile in Virginia
We don't find my first name in memcache so we read from the slave and get "Monkey"
How about using a variable save in memcache as a lock signal?
every single memcache command is atomic
after you retrieved data from db, toggle lock on
after you put data to memcache, toggle lock off
before delete from db, check lock state
The code below gives some idea of how to use Memcached's operations add, gets and cas to implement optimistic locking to ensure consistency of cache with the database.
Disclaimer: i do not guarantee that it's perfectly correct and handles all race conditions. Also consistency requirements may vary between applications.
def read(k):
loop:
get(k)
if cache_value == 'updating':
handle_too_many_retries()
sleep()
continue
if cache_value == None:
add(k, 'updating')
gets(k)
get_from_db(k)
if cache_value == 'updating':
cas(k, 'value:' + version_index(db_value) + ':' + extract_value(db_value))
return db_value
return extract_value(cache_value)
def write(k, v):
set_to_db(k, v)
loop:
gets(k)
if cache_value != 'updated' and cache_value != None and version_index(cache_value) >= version_index(db_value):
break
if cas(k, v):
break
handle_too_many_retries()
# for deleting we can use some 'tumbstone' as a cache value
When you read, the following happens:
if(Key is not in cache){
fetch data from db
put(key,value);
}else{
return get(key)
}
When you write, the following happens:
1 delete/update data from db
2 clear cache
Related
I Am going to build a system for flash sale which will share the same Redis instance and will run on 15 servers at a time.
So the algorithm of Flash sale will be.
Set Max inventory for any product id in Redis
using redisTemplate.opsForValue().set(key, 400L);
for every request :
get current inventory using Long val = redisTemplate.opsForValue().get(key);
check if it is non zero
if (val == null || val == 0) {
System.out.println("not taking order....");
}
else{
put order in kafka
and decrement using redisTemplate.opsForValue().decrement(key)
}
But the problem here is concurrency :
If I set inventory 400 and test it with 500 request thread,
Inventory becomes negative,
If I make function synchronized I cannot manage it in distributed servers.
So what will be the best approach to it?
Note: I can not go for RDBMS and set isolation level because of high request count.
Redis is monothreaded, so running a Lua Script on it is always atomic.
You can define then a Lua script on your Redis instance and running it from your Spring instances.
Your Lua script would just be a sequence of operations to execute against your redis instance (the only one to have the correct value of your stock) and returns the new value for instance or an error if the value is negative.
Your Lua script is basically a Redis transaction, there are other methods to achieve Redis transaction but IMHO Lua is the simplest above all (maybe the least performant, but I have found that in most cases it is fast enough).
How do I force the ehcache to load all the data from DB once, after that i need to read the values from ehcache.
I am seeing examples in which every new search goes to db first and then next hit from cache.
getProduct("1") - goes to db - ok
getProduct("1") - goes to cache - ok
getProduct("2") - goes to db - **instead i want this from cache**
getProduct("2") - goes to cache - ok
Please advice.
If you want up-front loading of a set of information in the cache, this is something you need your application to trigger.
The cache itself does not know the valid values to getProduct and so cannot prefetch them on its own.
I need to remove duplicates from a flow I've developed, it can receive the same ${filename} multiple times. I tried using HBase_1_1_2_ClientMapCacheService with DetectDuplicate (I am using NiFi v1.4), but found that it lets a few duplicates through. If I use DistributedMapCache (ClientService and Server), I do not get any duplicates. Why would I receive some duplicates with the HBase Cache?
As a test, I listed a directory (ListSFTP) with 20,000 files on all cluster nodes (4 nodes) and passed to DetectDuplicate (using the HBase Cache service). It routed 20,020 to "non-duplicate", and interestingly the table actually has 20,000 rows.
Unfortunately I think this is due to a limitation in the operations that are offered by HBase.
The DetectDuplicate processor relies on an operation "getAndPutIfAbsent" which is expected to return the original value, and then set the new value if it wasn't there. For example, first time through it would return null and set the new value, indicating it wasn't a duplicate.
HBase doesn't natively support this operation, so the implementation of this method in the HBase map cache client does this:
V got = get(key, keySerializer, valueDeserializer);
boolean wasAbsent = putIfAbsent(key, value, keySerializer, valueSerializer);
if (! wasAbsent) return got;
else return null;
So because it is two separate calls there is a possible race condition...
Imagine node 1 calls the first line and gets null, but then node 2 performs the get and the putIfAbsent, now when node 1 calls putIfAbsent it gets false because node 2 just populated the cache, so now node 1 returns the null value from the original get... both of these look like non-duplicates to DetectDuplicate.
In the DistributedMapCacheServer, it locks the entire cache per operation so it can provide an atomic getAndPutIfAbsent.
I have madde a dataaframe which I repartitined based on its primary key on the nodes
val config=new SparkConf().setAppName("MyHbaseLoader").setMaster("local[10]")
val context=new SparkContext(config)
val sqlContext=new SQLContext(context)
val rows="sender,time,time(utc),reason,context-uuid,rat,cell-id,first-pkt,last-pkt,protocol,sub-proto,application-id,server-ip,server-domain-name, http-proxy-ip,http-proxy-domain-name, video,packets-dw, packets-ul, bytes-dw, bytes-ul"
val scheme= new StructType(rows.split(",").map(e=>new StructField(e.trim,StringType,true)))
val dFrame=sqlContext.read
.schema(scheme)
.format("csv")
.load("E:\\Users\\Mehdi\\Downloads\\ProbDocument\\ProbDocument\\ggsn_cdr.csv")
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
dFrame.repartition(distincCount.toInt/3,dFrame("sender"))
Do I need to call my presist method again after repartitioning for next reducing jobs on dataframe?
Yes, repartition returns a new DataFrame so you would need to cache it again.
While the answer provided by Dikei seems to address your direct question it is important to note that in a case like this there is typically no reason to explicitly cache at all.
Every shuffle in Spark (here it is repartition) serves as an implicit caching point. If some part of lineage has to be re-executed and none of the executors has been lost you it won't have to go further back than to the last shuffle and read shuffle files.
It means that caching just before or just after a shuffle is typically a waste of time and resources especially if you're not interested in in-memory only or some non standard caching mechanism.
You would need to persist the reparation DataFrame, since DataFrames are immutable and reparation returns a new DataFrame.
A approach which you could follow is to persist dFrame and after its reparation the new DataFrame which returned is dFrameRepart. At this stage you could persist the dFrameRepart and unpersist the dFrame in order to free up the memory, provided that you won't be using dFrame again. In case your using dFrame after the reparation operation , both the DataFrames can be persisted.
dFrame.registerTempTable("GSSN")
dFrame.persist(StorageLevel.MEMORY_AND_DISK)
val distincCount=sqlContext.sql("select count(distinct sender) as SENDERS from GSSN").collectAsList().get(0).get(0).asInstanceOf[Long]
valdFrameRepart=dFrame.repartition(distincCount.toInt/3, dFrame("sender")).persist(StorageLevel.MEMORY_AND_DISK)
dFrame.unpersist
I have two different sources of data which I need to marry together. Data set A will have a foo_key attribute which can map to Data set B's bar_key attribute with a one to many relationship.
Data set A:
[{ foo_key: 12345, other: 'blahblah' }, ...]
Data set B:
[{ bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, { bar_key: 12345, other: '' }, ...]
Data set A is coming from a SQS queue and any relationships with data set B will be available as I poll A.
Data set B is coming from a separate SQS queue that I am trying to dump into a memcached cache to do quick look ups on when an object drops into data set A.
Originally I was planning on setting the memcached key to be bar_key from the objects in data set B but then realized that if I did that it would be possible to overwrite the value since there can be many of the same bar_key value. Then I was thinking well I can create a key bar_key and the value just be an array of the SQS messages. But since I have multiple hosts polling the SQS queue I think it might be possible that when I check to see if the key is in memcached, check it out, append the new message to it, and then set it, that another host could be trying to preform the same operation and thus the first host's attempt at appending the value would just be overwritten.
I've looked around at memcached key locking but I'm not sure I understand it entirely. Would the solution be that when I get the key/value pair from memcached I create a temporary dummy lock on a new key called bar_key_dummy that expires in x seconds, and if I try to fetch a key that has a bar_key_dummy lock active I just send the SQS message back to the queue without deleting to try again in x seconds?
Here's some pseudocode for what I have going on in my head. Does this make any sense?
store = MemCache.new(host)
sqs_messages.poll do |message|
dummy_key = "#{message.bar_key}_dummy"
sqs.dont_delete_message && next unless store.get(dummy_key).nil?
# set dummy_key in memcache with a value of 1 for 3 seconds
store.set(dummy_key, 1, 3)
temp_data = store.get(message.bar_key) || []
temp_data << message
store.set(message.bar_key, temp_data, 300)
# delete dummy key when done in case shorter than x seconds
store.delete(dummy_key)
end
Thanks for any help!
Memcached has a special operation - cas Compare and Swap.
Command gets returns Item along with its unique CAS value.
Then dataset can be searched and update must be issued with the cas command which takes original unique CAS value.
If CAS was changed in between two command, update operation will fail with the EXISTS error