In these caching scenarios, where is the code executed? - caching

I'm reading about caching strategies such as cache-aside, write-through, write-back, ... In the specific cases of write-through and write-back, it is implied that the cache itself is responsible for writing to the database and the event queue, respectively (For full context, here is the article - https://github.com/donnemartin/system-design-primer#when-to-update-the-cache)
For example, write-through is illustrated as
Application code:
set_user(12345, {"foo":"bar"})
Cache code:
def set_user(user_id, values):
user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
cache.set(user_id, user)
For now, let's assume we're using Redis.
In the concrete example above, is the hypothetical set_user function invoked on the Redis client's machine, or on the Redis server?
Now, there seems to be ways to invoke custom logic on the Redis server, e.g., by writing Lua scripts, but I'm skeptical that that's done in practice in order to implement this caching strategy, partly because I've never heard of anyone doing it.
I've seen other articles showing this strategy is implemented solely on the Redis client's machine, but I'm not sure what resources to believe at this point.
Thanks for any help!

It's part of the application. In fact, it would be more appropriate to call the example "data store code", instead of "cache code". The set_user method belongs to a base UserStore class, with different implementations based on data store type, write policy etc. For "write-through", it would be:
class WriteThroughUserStore(UserStore):
def __init__(self, cache_user_store, db_user_store):
self.cache_user_store = cache_user_store
self.db_user_store = db_user_store
def get_user(self, user_id):
return self.cache_user_store.get_user(user_id)
def set_user(self, user):
self.db_user_store.set_user(user)
self.cache_user_store.set_user(user)
The key point of "write-through" is that the write operation is confirmed complete only after writing data to both cache and database synchronously. The order does not matter: you could update cache first, or update database first, or even do them in parallel.

Related

How to memoize MySQL connection client cleverly in external module used from e.g. sinatra?

I think the question does not pin-point to the real problem, I have difficulties to nail it down precisely and concisely.
I have a gem that implements i.e. MySQL-database "queries" (also inserts, updates...)
module DBGEM::Query
def self.client settings=DBGEM.settings
##client ||= Mysql2::Client.new settings
end
def query_this
client.query(...)
end
def process_insert_that list_of_things
list_of_things.each do |thing|
# process
client.query(...)
end
end
Furthermore, this gem is used by a sinatra app sitting on a forking webserver like puma.
Within the sinatra-app i can now
get '/path' do
happy = DBGEM::Query.query_this
# process happy
great = DBGEM::Query.process_insert_that 1..20
# go on
end
I like that API and this code should open only one database connection.
But as far as I understood, because the code within the 'get' definition is not guaranteed to be the only one accessing the DBGEM::Query stuff at that time, weird things could happen (through race-conditions, shared internal state?).
Is there a clever way to keep the nice syntax and the connection sharing without boilerplate object creation (query = DBGEM::Query.new() #...) wrapping the stuff in a block (DBGEM::Query.process do |query| #...)?
The example above is obviously simplified. The sinatra handling might be more involved, the Queries actually done in a Service object etc.pp. Also, afaiu in a forking webserver environment, the GC would destroy the client (closing the connection - thats how mysql2 is implemented).
I think that the connection will not be closed every time.
##client is shared between DBGEM::Query object itself (in Ruby modules and classes are also objects) and all the instances of that object (to be precise: all the instances of classes to which that object is mixed in).
So, this variable will live as long as the DBGEM::Query object will live.
You can check out when DBGEM::Query object will be garbage collected, by defining finalizer logging a text and observe the server console.
module DBGEM::Query
ObjectSpace.define_finalizer(self, proc { print 'garbage collected' })
..
end
Im not sure, however I guess that DBGEM::Query object will be garbage collected only when you stop the server.
As it goes for weird "things could happen", I believe you mean potential conflicts, race conditions, situations where you create double records, or update the same record nearly at the same time overwriting something, etc. And when that happen you lose data integrity.
IMHO you can't prevent it by allowing only one client instance. I'd suggest aiming for solid database design (unique constrains, indexes, foreign keys, validations) which can raise errors when race condition occure and then handling that errors in your application.

Does Await have overhead in test case

I have a very simple test cases(scalatest, but doesn't matter) and I provide two implementation of accessing some resources, this method returns either Try or some case class instance.
Test cases:
"ResourceLoader" must
"successfully initialize resource" in {
/async code test
noException should be thrownBy Await.result(ResourceLoader.initializeRemoteResourceAsync(credentials, networkConfig), Duration.Inf)
}
"ResourceLoader" must
"successfully sync initialize remote resources" in {
noException should be thrownBy ResourceLoader.initializeRemoteResource(credentials, networkConfig)
}
This tests testing different code which access some remote resources
Sync version
def initializeRemoteResource(credentials: Credentials, absolutePathToNetworkConfig: String): Resource = {
//some code accessing remote server
}
Async version
def initializeRemoteResourceAsync(credentials: Credentials, absolutePathToNetworkConfig: String): Future[Try[Resource]] = {
Future {
//the same code as in sync version
}
}
In IDEA test tab I see that future based version is twice slower then sync version, my question is there overhead for calling Await.result explicitly? If not, why it slows down the execution? Appreciate any help, Thanks.
Note: I know it is not the best way to measure performance of production system. But it at list says how much time was spend on each test case.
Yes, there will be a small overhead for Await.result, but in practice it probably doesn't amount to much. Future {} requires an ExecutionContext (thread pool or thread creator) in implicit scope so you won't be able to successfully use it without importing the default execution context (which will simply spawn a thread) or some other context. If you're using the default execution context, for example, you will have two threads instead of one which will involve some overhead for context switching. It shouldn't be much though. If 'twice as slow' means 40ms instead of 20 then perhaps it's not worth worrying about.

How to access a python object from a previous HTTP request?

I have some confusion about how to design an asynchronous part of a web app. My setup is simple; a visitor uploads a file, a bunch of computation is done on the file, and the results are returned. Right now I'm doing this all in one request. There is no user model and the file is not stored on disk.
I'd like to change it so that the results are delivered in two parts. The first part comes back with the request response because it's fast. The second part might be heavy computation and a lot of data, so I want it to load asynchronously, whenever it's done. What's a good way to do this?
Here are some things I do know about this. Usually, asynchronicity is done with ajax requests. The request will be to some route, let's say /results. In my controller, there'll be a method written to respond to /results. But this method will no longer have any information from the previous request, because HTTP is stateless. To get around this, people pass info through the request. I could either pass all the data through the request, or I could pass an id which the controller would use to look up the data somewhere else.
My data is a big python object (a pandas DataFrame). I don't want to pass it through the network. If I use an id, the controller will have to look it up somewhere. I'd rather not spin up a database just for these short durations, and I'd also rather not convert it out of python and write to disk. How else can I give the ajax request access to the python object across requests?
My only idea so far is to have the initial request trigger my framework to render a second route, /uuid/slow_results. This would be served until the ajax request hits it. I think this would work, but it feels pretty ad hoc and unnatural.
Is this a reasonable solution? Is there another method I don't know? Or should I bite the bullet and use one of the aforementioned solutions?
(I'm using the web framework Flask, though this question is probably framework agnostic.
PS: I'm trying to get better at writing SO questions, so let me know how to improve it.)
So if your app is only being served by one Python process, you could just have a global object that's a map from ids to DataFrames, but you'd also need some way of expiring them out of the map so you don't leak memory.
So if your app is running on multiple machines, you're screwed. If your app is just running on one machine, it might be sitting behind apache or something and then apache might spawn multiple Python processes and you'd still be screwed? I think you'd find out by doing ps aux and counting instances of python.
Serializing to a temporary file or database are fine choices in general, but if you don't like either in this case and don't want to set up e.g. Celery just for this one thing, then multiprocessing.connection is probably the tool for the job. Copying and lightly modifying from here, the box running your webserver (or another, if you want) would have another process that runs this:
from multiprocessing.connection import Listener
import traceback
RESULTS = dict()
def do_thing(data):
return "your stuff"
def worker_client(conn):
try:
while True:
msg = conn.recv()
if msg['type'] == 'answer': # request for calculated result
answer = RESULTS.get(msg['id'])
conn.send(answer)
if answer:
del RESULTS[msg['id']]
else:
conn.send("doing thing on {}".format(msg['id']))
RESULTS[msg['id']] = do_thing(msg)
except EOFError:
print('Connection closed')
def job_server(address, authkey):
serv = Listener(address, authkey=authkey)
while True:
try:
client = serv.accept()
worker_client(client)
except Exception:
traceback.print_exc()
if __name__ == '__main__':
job_server(('', 25000), authkey=b'Alex Altair')
and then your web app would include:
from multiprocessing.connection import Client
client = Client(('localhost', 25000), authkey=b'Alex Altair')
def respond(request):
client.send(request)
return client.recv()
Design could probably be improved but that's the basic idea.

Closing over java.util.concurrent.ConcurrentHashMap inside a Future of Actor's receive method?

I've an actor where I want to store my mutable state inside a map.
Clients can send Get(key:String) and Put(key:String,value:String) messages to this actor.
I'm considering the following options.
Don't use futures inside the Actor's receive method. In this may have a negative impact on both latency as well as throughput in case I've a large number of gets/puts because all operations will be performed in order.
Use java.util.concurrent.ConcurrentHashMap and then invoke the gets and puts inside a Future.
Given that java.util.concurrent.ConcurrentHashMap is thread-safe and providers finer level of granularity, I was wondering if it is still a problem to close over the concurrentHashMap inside a Future created for each put and get.
I'm aware of the fact that it's a really bad idea to close over mutable state inside a Future inside an Actor but I'm still interested to know if in this particular case it is correct or not?
In general, java.util.concurrent.ConcurrentHashMap is made for concurrent use. As long as you don't try to transport the closure to another machine, and you think through the implications of it being used concurrently (e.g. if you read a value, use a function to modify it, and then put it back, do you want to use the replace(key, oldValue, newValue) method to make sure it hasn't changed while you were doing the processing?), it should be fine in Futures.
May be a little late, but still, in the book Reactive Web Applications, the author has indicated an indirection to this specific problem, using pipeTo as below.
def receive = {
case ComputeReach(tweetId) =>
fetchRetweets(tweetId, sender()) pipeTo self
case fetchedRetweets: FetchedRetweets =>
followerCountsByRetweet += fetchedRetweets -> List.empty
fetchedRetweets.retweets.foreach { rt =>
userFollowersCounter ! FetchFollowerCount(
fetchedRetweets.tweetId, rt.user
)
}
...
}
where followerCountsByRetweet is a mutable state of the actor. The result of fetchRetweets() which is a Future is piped to the same actor as a FetchedRetweets message, which then acts on the message on to modify the state of the acto., this will mitigate any concurrent operation on the state

Blocking findAndModify in Ruby MongoDB Driver

I'm trying to achieve something like this in MonogDB:
require 'base64'
require 'mongo'
class MongoDBQueue
def enq(thing)
collection.insert({ payload: Base64.encode64(Marshal.dump(thing))})
end
alias :<< :enq
def deq
until _r = collection.find_and_modify({ sort: {_id: Mongo::ASCENDING}, remove: true})
Thread.pass
end
return Marshal.load(Base64.decode64(_r["payload"]))
end
alias :pop :deq
private
def collection
# database, collection & mongodb index semantics here
end
end
Naturally enough I want a Disk-backed queue in Ruby that doesn't destroy my available memory, I'm using this with the Anemone web spider framework which by default uses the Queue class, there's a fork which can use the SizedQueue class, however when using a SizedQueue for both the "page queue" and "links queue", it often deadlocks, presumably because it's trying to dequeue a page and process it, and it's found new links, and that situation cannot be reconciled.
There's also an existing implementation of a Redis queue, however that also exhausts all my available memory on this machine (Available memory is 16Gb, so it's not trivial)
Because of that I want to use this MongoDB backend, but I think the implementation is insane. The Thread.pass feels like a horrible solution, but Anemone is multi-threaded, and MongoDB doesn't support blocking reads, so it's a tricky situation.
Here's my references:
Redis queue implementation for anemone: https://github.com/chriskite/anemone/blob/queueadapter/lib/anemone/queue/redis.rb
MongoDB findAndModify: http://www.mongodb.org/display/DOCS/findAndModify+Command
Questions:
Can anyone comment about how sane this is, compared to sleep (which should trigger the VM to pass control to the next thread, anyway, but sleep feels dirtier)
Should I perhaps Thread.pass and sleep? ( I guess not, see above)
Can I make that read from MongoDB block? There was talk of that here, but never came to anything: https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/rqnHNFXaZ0w
1) Reads in MongoDB are blocking. If you do a findOne() or a findAndModify(), the call will not return until the data is present in the client side. If you do a find(), the call will not return until you get a cursor: you can then iterate on the cursor as much as you need.
2) By default, writes to MongoDB are "fire and forget". If you care about data integrity, you need to do either safe writes by setting :safe => true in your connection, database, or collection object
Kernel.sleep is actually a better solution, as otherwise you'll spin there (albeit passing control to other threads after each query).
As the findAndModify is atomic, only one thread (even on JRuby) will take the job, so I don't quite understand what's the "blocking" issue here.

Resources