I have some confusion about how to design an asynchronous part of a web app. My setup is simple; a visitor uploads a file, a bunch of computation is done on the file, and the results are returned. Right now I'm doing this all in one request. There is no user model and the file is not stored on disk.
I'd like to change it so that the results are delivered in two parts. The first part comes back with the request response because it's fast. The second part might be heavy computation and a lot of data, so I want it to load asynchronously, whenever it's done. What's a good way to do this?
Here are some things I do know about this. Usually, asynchronicity is done with ajax requests. The request will be to some route, let's say /results. In my controller, there'll be a method written to respond to /results. But this method will no longer have any information from the previous request, because HTTP is stateless. To get around this, people pass info through the request. I could either pass all the data through the request, or I could pass an id which the controller would use to look up the data somewhere else.
My data is a big python object (a pandas DataFrame). I don't want to pass it through the network. If I use an id, the controller will have to look it up somewhere. I'd rather not spin up a database just for these short durations, and I'd also rather not convert it out of python and write to disk. How else can I give the ajax request access to the python object across requests?
My only idea so far is to have the initial request trigger my framework to render a second route, /uuid/slow_results. This would be served until the ajax request hits it. I think this would work, but it feels pretty ad hoc and unnatural.
Is this a reasonable solution? Is there another method I don't know? Or should I bite the bullet and use one of the aforementioned solutions?
(I'm using the web framework Flask, though this question is probably framework agnostic.
PS: I'm trying to get better at writing SO questions, so let me know how to improve it.)
So if your app is only being served by one Python process, you could just have a global object that's a map from ids to DataFrames, but you'd also need some way of expiring them out of the map so you don't leak memory.
So if your app is running on multiple machines, you're screwed. If your app is just running on one machine, it might be sitting behind apache or something and then apache might spawn multiple Python processes and you'd still be screwed? I think you'd find out by doing ps aux and counting instances of python.
Serializing to a temporary file or database are fine choices in general, but if you don't like either in this case and don't want to set up e.g. Celery just for this one thing, then multiprocessing.connection is probably the tool for the job. Copying and lightly modifying from here, the box running your webserver (or another, if you want) would have another process that runs this:
from multiprocessing.connection import Listener
import traceback
RESULTS = dict()
def do_thing(data):
return "your stuff"
def worker_client(conn):
try:
while True:
msg = conn.recv()
if msg['type'] == 'answer': # request for calculated result
answer = RESULTS.get(msg['id'])
conn.send(answer)
if answer:
del RESULTS[msg['id']]
else:
conn.send("doing thing on {}".format(msg['id']))
RESULTS[msg['id']] = do_thing(msg)
except EOFError:
print('Connection closed')
def job_server(address, authkey):
serv = Listener(address, authkey=authkey)
while True:
try:
client = serv.accept()
worker_client(client)
except Exception:
traceback.print_exc()
if __name__ == '__main__':
job_server(('', 25000), authkey=b'Alex Altair')
and then your web app would include:
from multiprocessing.connection import Client
client = Client(('localhost', 25000), authkey=b'Alex Altair')
def respond(request):
client.send(request)
return client.recv()
Design could probably be improved but that's the basic idea.
Related
I'm looking at writing one of our ETL (or ETL like) processes in kiba and I wonder how to structure it. The main question I have is the overall architecture. The process works roughly like this:
Fetch data from an HTTP endpoint.
For each item returned from that API and make one more HTTP call
Do some transformations for each of the items returned from step 2
Send each item somewhere else
Now my question is: Is it OK if only step one is a source and anything until the end is a transform? Or would it be better to somehow have each HTTP call be a source and then combine these somehow, maybe using multiple jobs?
It is indeed best to use a single source, that you will use to fetch the main stream of the data.
General advice: try to work in batches as much as you can (e.g. pagination in the source, but also bulk HTTP lookup if the API supports it in step 2).
Source section
The source in your case could be a paginating HTTP resource, for instance.
A first option to implement it would be to write write a dedicated class like explained in the documentation.
A second option is to use Kiba::Common::Sources::Enumerable (https://github.com/thbar/kiba-common#kibacommonsourcesenumerable) like this:
source Kiba::Common::Sources::Enumerable, -> {
Enumerator.new do |y|
# do your pagination & splitting here
y << your_item
end
}
# then
transform Kiba::Common::Transforms::EnumerableExploder
Join with secondary HTTP source
It can be done this way:
transform do |r|
# here make secondary HTTP query
result = my_query(...)
# then merge the result
r.merge(secondary_data: ...)
end
There is support for parallelisation of the queries in that step via Kiba Pro's ParallelTransform (https://github.com/thbar/kiba/wiki/Parallel-Transform):
parallel_transform(max_threads: 10) do |r|
# this code will run in its own thread
extra_data = get_extra_json_hash_from_http!(r.fetch(:extra_data_url))
r.merge(extra_data: extra_data)
end
It must be noted too that if you can structure your HTTP calls to handle N rows at one time (if the HTTP backend is flexible enough) things will be even faster.
Step 3 does not need specific advice.
Send each item somewhere else
I would most likely implement a destination for that (but it could also be implemented as a transform actually, and parallelized still with parallel_transform if needed).
I'm reading about caching strategies such as cache-aside, write-through, write-back, ... In the specific cases of write-through and write-back, it is implied that the cache itself is responsible for writing to the database and the event queue, respectively (For full context, here is the article - https://github.com/donnemartin/system-design-primer#when-to-update-the-cache)
For example, write-through is illustrated as
Application code:
set_user(12345, {"foo":"bar"})
Cache code:
def set_user(user_id, values):
user = db.query("UPDATE Users WHERE id = {0}", user_id, values)
cache.set(user_id, user)
For now, let's assume we're using Redis.
In the concrete example above, is the hypothetical set_user function invoked on the Redis client's machine, or on the Redis server?
Now, there seems to be ways to invoke custom logic on the Redis server, e.g., by writing Lua scripts, but I'm skeptical that that's done in practice in order to implement this caching strategy, partly because I've never heard of anyone doing it.
I've seen other articles showing this strategy is implemented solely on the Redis client's machine, but I'm not sure what resources to believe at this point.
Thanks for any help!
It's part of the application. In fact, it would be more appropriate to call the example "data store code", instead of "cache code". The set_user method belongs to a base UserStore class, with different implementations based on data store type, write policy etc. For "write-through", it would be:
class WriteThroughUserStore(UserStore):
def __init__(self, cache_user_store, db_user_store):
self.cache_user_store = cache_user_store
self.db_user_store = db_user_store
def get_user(self, user_id):
return self.cache_user_store.get_user(user_id)
def set_user(self, user):
self.db_user_store.set_user(user)
self.cache_user_store.set_user(user)
The key point of "write-through" is that the write operation is confirmed complete only after writing data to both cache and database synchronously. The order does not matter: you could update cache first, or update database first, or even do them in parallel.
In my sinatra web application, I have a route:
get "/" do
temp = MyClass.new("hello",1)
redirect "/home"
end
Where MyClass is:
class MyClass
#instancesArray = []
def initialize(string,id)
#string = string
#id = id
#instancesArray[id] = this
end
def run(id)
puts #instancesArray[id].string
end
end
At some point I would want to run MyClass.run(1), but I wouldn't want it to execute immediately because that would slow down the servers response to some clients. I would want the server to wait to run MyClass.run(temp) until there was some time with a lighter load. How could I tell it to wait until there is an empty/light load, then run MyClass.run(temp)? Can I do that?
Addendum
Here is some sample code for what I would want to do:
$var = 0
get "/" do
$var = $var+1 # each time a request is recieved, it incriments
end
After that I would have a loop that would count requests/minute (so after a minute it would reset $var to 0, and if $var was less than some number, then it would run tasks util the load increased.
As Andrew mentioned (correctly—not sure why he was voted down), Sinatra stops processing a route when it sees a redirect, so any subsequent statements will never execute. As you stated, you don't want to put those statements before the redirect because that will block the request until they complete. You could potentially send the redirect status and header to the client without using the redirect method and then call MyClass#run. This will have the desired effect (from the client's perspective), but the server process (or thread) will block until it completes. This is undesirable because that process (or thread) will not be able to serve any new requests until it unblocks.
You could fork a new process (or spawn a new thread) to handle this background task asynchronously from the main process associated with the request. Unfortunately, this approach has the potential to get messy. You would have to code around different situations like the background task failing, or the fork/spawn failing, or the main request process not ending if it owns a running thread or other process. (Disclaimer: I don't really know enough about IPC in Ruby and Rack under different application servers to understand all of the different scenarios, but I'm confident that here there be dragons.)
The most common solution pattern for this type of problem is to push the task into some kind of work queue to be serviced later by another process. Pushing a task onto the queue is ideally a very quick operation, and won't block the main process for more than a few milliseconds. This introduces a few new challenges (where is the queue? how is the task described so that it can be facilitated at a later time without any context? how do we maintain the worker processes?) but fortunately a lot of the leg work has already been done by other people. :-)
There is the delayed_job gem, which seems to provide a nice all-in-one solution. Unfortunately, it's mostly geared towards Rails and ActiveRecord, and the efforts people have made in the past to make it work with Sinatra look to be unmaintained. The contemporary, framework-agnostic solutions are Resque and Sidekiq. It might take some effort to get up and running with either option, but it would be well worth it if you have several "run when you can" type functions in your application.
MyClass.run(temp) is never actually executing. In your current request to / path you instantiate a new instance of MyClass then it will immediately do a get request to /home. I'm not entirely sure what the question is though. If you want something to execute after the redirect, that functionality needs to exist within the /home route.
get '/home' do
# some code like MyClass.run(some_arg)
end
Both QWebFrame and QWebPage have void loadFinished(bool ok) signal which can be used to detect when a web page is completely loaded. The problem is when a web page has some content loaded asynchronously (ajax). How to know when the page is completely loaded in this case?
I haven't actually done this, but I think you may be able to achieve your solution using QNetworkAccessManager.
You can get the QNetworkAccessManager from your QWebPage using the networkAccessManager() function. QNetworkAccessManager has a signal finished ( QNetworkReply * reply ) which is fired whenever a file is requested by the QWebPage instance.
The finished signal gives you a QNetworkReply instance, from which you can get a copy of the original request made, in order to identify the request.
So, create a slot to attach to the finished signal, use the passed-in QNetworkReply's methods to figure out which file has just finished downloading and if it's your Ajax request, do whatever processing you need to do.
My only caveat is that I've never done this before, so I'm not 100% sure that it would work.
Another alternative might be to use QWebFrame's methods to insert objects into the page's object model and also insert some JavaScript which then notifies your object when the Ajax request is complete. This is a slightly hackier way of doing it, but should definitely work.
EDIT:
The second option seems better to me. The workflow is as follows:
Attach a slot to the QWebFrame::javascriptWindowObjectCleared() signal. At this point, call QWebFrame::evaluateJavascript() to add code similar to the following:
window.onload = function() { // page has fully loaded }
Put whatever code you need in that function. You might want to add a QObject to the page via QWebFrame::addToJavaScriptWindowObject() and then call a function on that object. This code will only execute when the page is fully loaded.
Hopefully this answers the question!
To check the load of specific element you can use a QTimer. Something like this in python:
#pyqtSlot()
def on_webView_loadFinished(self):
self.tObject = QTimer()
self.tObject.setInterval(1000)
self.tObject.setSingleShot(True)
self.tObject.timeout.connect(self.on_tObject_timeout)
self.tObject.start()
#pyqtSlot()
def on_tObject_timeout(self):
dElement = self.webView.page().currentFrame().documentElement()
element = dElement.findFirst("css selector")
if element.isNull():
self.tObject.start()
else:
print "Page loaded"
When your initial html/images/etc finishes loading, that's it. It is completely loaded. This fact doesn't change if you then decide to use some javascript to get some extra data, page views or whatever after the fact.
That said, what I suspect you want to do here is expose a QtScript object/interface to your view that you can invoke from your page's script, effectively providing a "callback" into your C++ once you've decided (from the page script) that you've have "completely loaded".
Hope this helps give you a direction to try...
The OP thought it was due to delayed AJAX requests but there also could be another reason that also explains why a very short time delay fixes the problem. There is a bug that causes the described behaviour:
https://bugreports.qt-project.org/browse/QTBUG-37377
To work around this problem the loadingFinished() signal must be connected using queued connection.
Okay it's a simple task. After I render html to the client I want to execute a db call with information from the request.
I am using sinatra because it's a lightweight microframework, but really i up for anything in ruby, if it's faster/easier(Rack?). I just want to get the url and redirect the client somewhere else based on the url.
So how does one go about with rack/sinatra a real after_filter. And by after_filter I mean after response is sent to the client. Or is that just not dooable without threads?
I forked sinatra and added after filters, but there is no way to flush the response, even send_data which is suppose to stream files(which is obviously for binary) waits for the after_filters.
I've seen this question: Multipart-response-in-ruby but the answer is for rails. And i am not sure if it really flushes the response to the client and then allows for processing afterwards.
Rack::Callbacks has some before and after callbacks but even those look like they would run before the response is ever sent to the client here's Rack::Callbacks implementation(added comment):
def call(env)
#before.each {|c| c.call(env) }
response = #app.call(env)
#after.each {|c| c.call(env) }
response
#i am guessing when this method returns then the response is sent to the client.
end
So i know I could call a background task through the shell with rake. But it would be nice not to have too... Also there is NeverBlock but is that good for executing a separate process without delaying the response or would it still make the app wait as whole(i think it would)?
I know this is a lot, but in short it's simple after_filter that really runs after the response is sent in ruby/sinatra/rack.
Thanks for reading or answering my question! :-)
Modified run_later port to rails to do the trick the file is available here:
http://github.com/pmamediagroup/sinatra_run_later/tree/master