How to allow Sinatra to poll for data smartly - Ruby

I want to design an application where the back end constantly polls different sensors while the front end (Sinatra) allows this data to be viewed, either via a JSON API or by simply displaying the results in HTML.
What considerations should I take into account when developing such an application, and how should I structure it for the best scaling and ease of maintenance?
My first thought is to simply let Sinatra poll the sensors every time it receives a request to the proper endpoints, but this seems like it could bog down quite fast, especially given that some sensors only update every couple of seconds.
My second thought is to have a background process (or thread) poll the sensors and store the values for Sinatra. When a request is received, Sinatra can then simply ask the background process for a cached value (or pull it from the threaded code) and present it to the client.
I like the second thought more, but I am not sure how I would develop the "background application" so that Sinatra could poll it for data to present to the client. The other option would be for Sinatra to run the sensor-polling code in a thread so that it can simply grab values inside the same process rather than requesting them from another process (a sketch of this follows).
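For illustration, a minimal sketch of the in-process option, assuming a classic threaded Sinatra app: a background thread polls the sensors into a mutex-protected hash that the routes read from. read_sensor is a hypothetical stand-in for real sensor code.

```ruby
require 'sinatra'
require 'json'

# Hypothetical stand-in for real sensor-reading code.
def read_sensor(_name)
  rand(0.0..100.0)
end

READINGS = {}
LOCK = Mutex.new

# Background poller sharing the Sinatra process.
Thread.new do
  loop do
    value = read_sensor(:temperature)
    LOCK.synchronize { READINGS[:temperature] = value }
    sleep 2 # the sensors only update every couple of seconds anyway
  end
end

get '/sensors' do
  content_type :json
  LOCK.synchronize { READINGS.dup }.to_json
end
```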
Do note that this application will also be responsible for automating different relays and such based on the sensors; Sinatra is only responsible for relaying the status of the sensors to the user. I think separating the back end (automation + sensor information) into a background process/daemon, apart from the front end (Sinatra), would be ideal, but I am not sure how I would fetch the data for Sinatra.
Does anyone have any input on how I could structure this? If possible, I would also appreciate a sample application that simply demonstrates the idea, which I could adopt and modify.
Thanks
Edit:
After a bit more research I have discovered DRb (Distributed Ruby, http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html), which allows you to make remote calls on objects over the network. This may be a suitable solution to this problem: the daemon can automate the relays, read the sensors, and store the values in class objects, then expose those objects over DRb so that Sinatra can call the getters on the remote object to obtain up-to-date data from the daemon. This is what I initially wanted to attempt.
What do you guys think? Is this advisable for such an application?
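For what it's worth, a minimal sketch of that DRb arrangement might look like this. SensorStore, the port, and read_sensor are illustrative names rather than a fixed API.

```ruby
# --- daemon.rb: polls sensors, drives automation, exposes data over DRb ---
require 'drb/drb'

# Hypothetical stand-in for real sensor code.
def read_sensor(_name)
  rand(0.0..100.0)
end

class SensorStore
  def initialize
    @readings = {}
    @lock = Mutex.new
  end

  def update(name, value)
    @lock.synchronize { @readings[name] = value }
  end

  # Sinatra calls this getter remotely over DRb.
  def reading(name)
    @lock.synchronize { @readings[name] }
  end
end

store = SensorStore.new

# Polling (and, in the real app, automation) loop.
Thread.new do
  loop do
    store.update(:temperature, read_sensor(:temperature))
    sleep 2
  end
end

DRb.start_service('druby://localhost:8787', store)
DRb.thread.join

# --- in the Sinatra process: connect and call getters on the remote object ---
# require 'drb/drb'
# DRb.start_service
# STORE = DRbObject.new_with_uri('druby://localhost:8787')
# get('/temperature') { STORE.reading(:temperature).to_s }
```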

I have decided to go with Sinatra, DRb, and Daemons to meet the requirements stated above.
The web front end will run in its own process and only serve up statistical information via DRb interactions with the back end. This will allow quick response times for the clients and let me separate front-end code from back-end code.
The back end will run in its own process, constantly poll the sensors for updates, and store the values in class objects with getters so that Sinatra can fetch the information over DRb when required. It will also use the gathered information for project-specific automation.
Finally, the back end and front end will each be wrapped with the Daemons gem so that the project can be started, restarted, and stopped, can report its run status, and will automatically restart if it crashes or quits for whatever reason.
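A minimal sketch of that Daemons wrapper, assuming the backend's entry point lives in backend.rb; monitor: true asks the gem to restart the process if it dies.

```ruby
# backend_control.rb -- running this control script provides
# start/stop/restart/status commands, e.g. `ruby backend_control.rb start`.
require 'daemons'

Daemons.run('backend.rb', monitor: true)
```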
Source information:
http://phrogz.net/drb-server-for-long-running-web-processes
http://ruby-doc.org/stdlib-1.9.3/libdoc/drb/rdoc/DRb.html
http://www.sinatrarb.com/
https://github.com/thuehlinger/daemons

Related

What are some best practices when calling an external executable from ASP.NET Web API 2

I need to call an external *.exe, compiled in C++, from ASP.NET Web API 2 using Process (System.Diagnostics).
This executable does some image-processing work and uses a lot of memory.
So my question is: if I change my API calls to async or implement threads, will it help, or does it not matter?
Note: all I have is the executable, so I cannot go for a CLI wrapper.
You can separate the two. Your API is one thing; it needs to be fast and responsive to serve the clients. Your image-processing work is different.
You could implement a queuing system. The API is responsible for adding a new item to this queue and nothing more. You could keep track of which tasks are being run in a separate SQL table, let's say. Imagine you have a SQL table called Tasks. Your API chucks data in there and the status is "Not Running".
Some other app, which could live on another machine entirely, keeps an eye on this table and takes care of running that executable for each item. When it starts, it changes the status to "Running"; when it completes, it's "Done". You do whatever else you need. You could have an API endpoint which takes the ID of the task so your client can keep calling it to see what the status is, or you could raise an event when it's done, depending on your application's needs.
Bottom line: keep things separate. You gain nothing by blocking the API while a resource-heavy task is running. Think about what happens if you start that process five times at the same time. You've basically just killed your API.
The app that does the heavy work could even sit on a separate machine, so it doesn't affect the API at all.

'Fire & Forget' call from Sinatra

I am writing an endpoint using Sinatra where I will be receiving raw PDFs from the client and need to process the PDF for internal use. The PDF processing takes a while, and I do not necessarily want the client to wait until the processing is finished and risk a timeout (504). Instead I would like to invoke another method that handles the PDF processing while I respond to the client with an appropriate code. What is the best way to implement that using Sinatra?
So there are a few parts to this; let me break down the various steps that are going to happen:
Client uploads a PDF file: depending on the size of the PDF and the speed of their connection, this could take a while. While you're waiting for the upload, your web process is busy receiving the data and is unable to serve any other requests from other clients.
You then need to process the uploaded file, store it somewhere, and possibly manipulate it somehow. If you do all that as part of the request, that is yet more time you're tied up dealing with this one request and unable to serve other clients.
The typical way to solve the latter of those problems, manipulating or processing an uploaded asset, is to use a background job queue such as Sidekiq (http://sidekiq.org). You store the required data somewhere, keep enough information to know what to work on (e.g., the database ID of a model that has stored the relevant information, a filename, etc.), and then pass all of that required information into a background job. You then have separate worker processes that pick up that work and complete it; because they aren't part of your web process, they don't block other clients.
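As a rough sketch of that pattern, assuming the upload has already been saved to disk (ProcessPdfJob and save_upload are illustrative names):

```ruby
require 'sidekiq'

class ProcessPdfJob
  include Sidekiq::Worker

  # Runs in a separate Sidekiq worker process, outside the request cycle.
  def perform(path)
    # Slow PDF work goes here: extract text, generate previews, etc.
  end
end

# In the Sinatra route, after saving the upload:
# post '/pdfs' do
#   path = save_upload(params[:file]) # hypothetical helper
#   ProcessPdfJob.perform_async(path)
#   status 202 # tell the client the work was accepted
# end
```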
This still leaves us with the problem of handling large uploads; fortunately, that has a solution too. Take advantage of all the web capacity Amazon has and have the clients upload the file directly to S3. When it's complete, they can post just the filename to you, and you can then queue that up for your worker from the previous step and have it all happen in the background. This blog post has a good explanation of how to wire it together using Paperclip: http://blog.littleblimp.com/post/53942611764/direct-uploads-to-s3-with-rails-paperclip-and

CPU-bound/stateful distributed system design

I'm working on a web application front end to a legacy system which involves a lot of CPU-bound background processing. The application is also stateful on the server side, and the domain objects need to be held in memory across the entire session as the user operates on them via the web-based interface. Think of it as something like a web UI front end to Photoshop, where each filter can take 20-30 seconds to execute on the server side, so the app still has to interact with the user in real time while they wait.
The main problem is that each instance of the server can only support around 4-8 instances of each "workspace" at once, and I need to support a few hundred concurrent users. I'm going to build this on Amazon EC2 to make use of the auto-scaling functionality. So to summarize, the system is:
A web application front end to a legacy backend system
Tasks performed are CPU-bound
Stateful: most calls will be some sort of RPC, and the user will make multiple actions that interact with the stateful objects held in server-side memory
Most tasks are semi-realtime: they execute for 20-30 seconds and must return the results to the user in the same session
Uses Amazon AWS auto scaling
I'm wondering what the best way is to distribute a system like this.
Obviously I will need a web server to interact with the browser, which then sends the CPU-bound tasks to a bunch of dedicated servers that do the background processing. The question is how best to hook the two tiers together for my specific needs.
I've been looking at message queue systems such as RabbitMQ, but these seem to be geared towards one-time tasks where any worker node can simply grab a job from a queue, execute it, and forget the state. My needs are a little different, since there could be multiple 'tasks' that need to be 'sticky': for example, if step 1 is started on node 1, then step 2 for the same workspace has to go to the same worker process.
Another problem I see is that most worker queue systems seem to be geared towards background tasks that can be processed at any time, rather than a system that has to provide user feedback, which is what I'm dealing with.
My question is: is there an off-the-shelf solution for something like this that will let me easily build a system that can scale? Would love to hear your thoughts.
RabbitMQ has an RPC tutorial. I haven't used this pattern in particular, but I am running RabbitMQ on a couple of nodes and it can handle hundreds of connections and millions of messages. With a little monitoring work you can detect when there is more work to do than you have consumers for. Messages can also time out, so queues won't back up too badly. To scale out capacity you can create multiple RabbitMQ nodes/clusters. You could have multiple rounds of RPC so that after the first response you include the information required to get the second message to the correct destination.
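A minimal worker-side sketch of that RPC pattern in Ruby with the bunny gem, following the official tutorial. The queue name and run_filter are illustrative, and a real setup needs a matching client that sets reply_to and correlation_id on each request.

```ruby
require 'bunny'

# Hypothetical stand-in for the CPU-bound legacy call.
def run_filter(payload)
  sleep 20
  "processed: #{payload}"
end

conn = Bunny.new # assumes a broker on localhost
conn.start

channel = conn.create_channel
queue   = channel.queue('workspace_rpc')

queue.subscribe(block: true) do |_delivery_info, properties, payload|
  result = run_filter(payload)

  # Reply on the queue the client named, echoing its correlation ID
  # so the client can match responses to requests.
  channel.default_exchange.publish(
    result,
    routing_key:    properties.reply_to,
    correlation_id: properties.correlation_id
  )
end
```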
0MQ has this as a basic pattern, which will fan out work as needed. I've only played with it, but it is simpler to code and possibly simpler to maintain (as it doesn't need a broker, though devices can provide one). It may not handle stickiness by default, but it should be possible to write your own routing layer to handle that.
Don't discount HTTP for this either. When you want request/reply, a strict throughput per backend node, and something that scales well, HTTP is well supported. With AWS you can easily put their ELB in front of an auto-scaling group to provide the routing from front end to back end. ELB supports sticky sessions as well.
I'm a big fan of RabbitMQ, but if this is the whole scope then HTTP would work nicely and have fewer moving parts in AWS than the other solutions.

Thin server with application state

I need to build a web service with application state. By this I mean the web service needs to load and process a lot of data before being ready to answer requests, so a Rails-like approach, where you normally don't keep state at the application level between two requests, doesn't look appropriate.
I was wondering if a good approach would be a daemon (using Daemon-Kit, for instance) embedding a simple web server like Thin. The daemon would load and process the initial data.
But I feel it would be better to use Thin directly (launched with Rack). In this case, how can I initialize and maintain my application state?
EDIT: There will be thousands of requests per second, so reading the app state from files or a DB on every request is not efficient. I need to use global variables, and I am wondering what the cleanest way is to initialize and store them in a Ruby/Thin environment.
You could maintain state in a number of ways.
A database, including NoSQL stores like Memcached or Redis
A file, or multiple files
Global variables or class variables, assuming the server never gets restarted/reloaded (see the sketch below)
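A minimal sketch of that last option under Thin, assuming the expensive initialization can be wrapped in a single method (load_dataset is a hypothetical stand-in):

```ruby
# config.ru -- state is built once when Rack boots the app, before Thin
# starts serving; every request then reads the same in-memory constant.
require 'sinatra/base'
require 'json'

class StatefulApp < Sinatra::Base
  # Hypothetical stand-in for the slow load-and-process step.
  def self.load_dataset
    { loaded_at: Time.now, entries: [] }
  end

  # Evaluated once per process at boot, not per request. Note you get one
  # copy per worker process if you later run several Thin instances.
  DATASET = load_dataset

  get '/status' do
    content_type :json
    { entries: DATASET[:entries].size, loaded_at: DATASET[:loaded_at] }.to_json
  end
end

run StatefulApp
# Start with: thin start -R config.ru
```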

tweepy Streaming API integration with Django

I am trying to create a Django web app that utilizes the Twitter Streaming API via the tweepy.Stream() function. I am having a difficult time conceptualizing the proper implementation.
The simplest functionality I would like is to count the number of tweets containing a given hashtag in real time. So I would open a stream, filtering by keywords; every time a new tweet comes over the connection, I increment a counter. That counter is then displayed on a webpage and updated with AJAX or otherwise.
The problem is that the tweepy.Stream() function must be continuously running and connected to Twitter (that's the point). How can I have this stream running in the background of a Django app while incrementing counters that can be displayed in (near) real time?
Thanks in advance!
There are various ways to do this, but using a task queue (Celery) will probably be the easiest.
1) Keep a Python process running tweepy. Once an interesting message is found, create a new Celery task.
2) Inside this Celery task, persist the data to the database (the counter, the tweets, whatever). This task can happily run Django code (e.g. the ORM).
3) Have a regular Django app display the results your task has persisted.
As a precaution, it's probably a good idea to run the tweepy process under supervision (supervisord might suit your needs). If anything goes wrong with it, it can be restarted automatically.
