What's a Reasonable Length of Time to Timeout a Ruby Thread? - ruby

I need to retain data and keep a Ruby program waiting for a response for anything up to a couple of days. I'm thinking about implementing this using threads (there may be a number of concurrent requests across a network). My question: is it reasonable to leave a thread running for anything up to a couple of days awaiting a response?

In general there is no problem with that. Check out the Queue class, it might facilitate the "job polling":
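A minimal sketch of that idea, assuming the response arrives on some other thread (the string "reply-42" and the 5-second limit are placeholders, not part of the original question). The waiting thread blocks inside Queue#pop, so it uses no CPU while it waits:

```ruby
# Queue is built into Ruby; no gems are needed.
responses = Queue.new

# The waiting thread sleeps inside pop until a response is pushed,
# so it costs no CPU however long it waits.
waiter = Thread.new do
  reply = responses.pop
  "handled #{reply}"
end

# Hypothetical: whatever receives the network response pushes it here.
responses.push("reply-42")

# join(limit) returns nil if the thread is still waiting after `limit` seconds.
finished = waiter.join(5)
result = finished ? finished.value : :timed_out
```

With one Queue per outstanding request (or one shared Queue plus a request id in each message), a thread can wait for days in the same way.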


Ruby Bunny exchange wait_for_confirm or die

What would be the best way to incorporate something similar to the RabbitMQ channel.waitForConfirmsOrDie() method, while utilizing the Bunny gem for a publish confirmation?
Right now I am using:
if !@channel.using_publisher_confirmations?
  @channel.confirm_select
end
was_successful = @channel.wait_for_confirms()
But ideally, for the scenario I need, I would like to have a much shorter timeout on waiting for the confirmations. Right now, it seems as though there is a default timeout of roughly 15 seconds, but that is far too long to block the thread. If I don't receive confirmation within, say, three seconds, what I'd like to have happen is raise an exception/return false.
I saw there was a waitForConfirmsOrDie() in the RabbitMQ documentation, but Bunny does not have this as a method available.
I am considering rewriting some methods for similar functionality. Has anyone come across something similar and found a good way to implement this?
Don't wait for confirms synchronously. You should use a technique similar to this to keep track of outstanding confirms and handle them.
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.
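The bookkeeping behind "keep track of outstanding confirms" could be sketched roughly as follows. This is plain Ruby and does not call Bunny at all: the ConfirmTracker class and its method names are hypothetical, and how acknowledgements reach `confirmed` from the channel is left out. It relies only on the fact that a RabbitMQ confirm carries a delivery tag and a `multiple` flag.

```ruby
# Hypothetical helper: remembers when each message was published and
# reports the ones that have waited longer than the timeout.
class ConfirmTracker
  def initialize(timeout_seconds)
    @timeout = timeout_seconds
    @pending = {}          # delivery tag => time of publish
    @lock = Mutex.new
  end

  def published(tag, now = Time.now)
    @lock.synchronize { @pending[tag] = now }
  end

  # With multiple = true the broker confirms every tag up to and including `tag`.
  def confirmed(tag, multiple = false)
    @lock.synchronize do
      if multiple
        @pending.delete_if { |pending_tag, _| pending_tag <= tag }
      else
        @pending.delete(tag)
      end
    end
  end

  # Tags still unconfirmed after more than the timeout.
  def overdue(now = Time.now)
    @lock.synchronize { @pending.select { |_, at| now - at > @timeout }.keys }
  end
end

tracker = ConfirmTracker.new(3)          # three-second budget, as in the question
t0 = Time.at(0)
tracker.published(1, t0)
tracker.published(2, t0)
tracker.published(3, t0 + 2)
tracker.confirmed(1)
late = tracker.overdue(t0 + 4)           # tag 2 has waited 4s, tag 3 only 2s
tracker.confirmed(3, true)               # confirms everything up to tag 3
remaining = tracker.overdue(t0 + 60)
```

The publishing thread never blocks; a periodic check of `overdue` decides whether to raise, retry, or log.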

MPI: master-slave with the master also doing work

I'm implementing a standard MPI master/slave system: there is a master that distributes work, and there are slaves who ask for chunks and process data.
However... if implemented in a naive way (rank==0 is master, the rest are slaves), the master ends up doing no real work, but still takes one core for what needs practically no real computing power. So I tried to implement a separate "scheduler" thread in the master, but that involved sending MPI messages to itself, and didn't really work...
Do you have any ideas how to solve this?
As I realized after some googling: you can send messages to yourself using tags. Tags are a kind of filter: if you do a recv for only tag==1, then you'll receive only those, with later messages being able to overtake earlier ones.
So, as for the solution:
tag the "scheduler to worker" and "worker to scheduler" messages with a different id
if rank==0: start a scheduler thread
afterwards, regardless of the rank, request work.
This way, the rank 0 worker won't receive its own "let's give me work" messages, because they will have a "to be received by the scheduler only" tag.
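To make the tag separation concrete, here is an in-process analogy in Ruby. It is not MPI: threads stand in for ranks and one Queue per (rank, tag) pair stands in for a tag-filtered recv. All names (TAG_REQUEST, TAG_WORK, send_msg, recv_msg) are made up for the illustration.

```ruby
TAG_REQUEST = 1   # worker -> scheduler
TAG_WORK    = 2   # scheduler -> worker
RANKS       = 3

# One mailbox per (rank, tag), created up front so threads never race on creation.
MAILBOX = {}
RANKS.times { |r| [TAG_REQUEST, TAG_WORK].each { |t| MAILBOX[[r, t]] = Queue.new } }

def send_msg(dest, tag, payload)
  MAILBOX.fetch([dest, tag]).push(payload)
end

def recv_msg(rank, tag)
  MAILBOX.fetch([rank, tag]).pop   # blocks until a message with this tag arrives
end

chunks = (1..6).to_a

# Scheduler thread, living on "rank 0" next to that rank's own worker.
scheduler = Thread.new do
  finished = 0
  while finished < RANKS
    requester = recv_msg(0, TAG_REQUEST)
    chunk = chunks.shift             # nil means "no more work"
    finished += 1 if chunk.nil?
    send_msg(requester, TAG_WORK, chunk)
  end
end

results = Queue.new
workers = Array.new(RANKS) do |rank|
  Thread.new do
    loop do
      send_msg(0, TAG_REQUEST, rank)      # only the scheduler reads this tag
      chunk = recv_msg(rank, TAG_WORK)    # rank 0's worker never sees its own request
      break if chunk.nil?
      results << [rank, chunk * chunk]
    end
  end
end

(workers + [scheduler]).each(&:join)
processed = Array.new(results.size) { results.pop }
```

Rank 0 ends up doing real work as well, because its worker and the scheduler read from different tags.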
Edit: this thing doesn't really seem to be thread-safe though... (= it sometimes crashes in "free()" even though it's written in Python...) so I'd be still interested in the real & proven solution :)

Web crawler in Ruby: How to achieve the best performance?

I'm writing a web crawler that should be able to parse multiple pages at the same time. I use Nokogiri for parsing, which is quite good and handles all my tasks, but I don't know how to achieve better performance.
I use threads to make many open-uri requests at the same time and it makes the process quicker, but it seems that it's still far from the potential that I can achieve from a single server. Should I use multiple processes? What are the limits of the threads and processes that can be launched for a single ruby application?
In other words: how do I achieve the best performance in this case?
I really like Typhoeus and Hydra for handling multiple requests at once.
Typhoeus is the http client side, and Hydra is the part that handles multiple requests. The examples are good so go through them and see.
While it sounds like you're not looking for something quite so complex I found this thesis an interesting read awhile ago: Building blocks of a scalable webcrawler - Marc Seeger.
In terms of threading/process limits, Ruby has very low threading potential. Standard Ruby (MRI/YARV) and Rubinius don't support simultaneous thread execution, unless using an extension specifically built to support it. In MRI, threads can still overlap while they are blocked on I/O, which is why they help with fetching pages but not with parsing them. Depending on how much of your performance trouble is in the IO and how much is in the processing, I could suggest using EventMachine.
With multiple processes, however, Ruby works very well: as long as you've got a good manager/database for all the processes to communicate with, running multiple processes should scale as well as your processing power allows.
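As a rough illustration of the multi-process route, the sketch below forks one child per slice of the input and collects results over pipes. It assumes a Unix-like system (fork is not available on Windows), and `parallel_map` is a hypothetical helper name; a real crawler would more likely coordinate through a queue or database, as described above.

```ruby
# Runs the block over `items` in `workers` child processes and
# returns the results in the original order.
def parallel_map(items, workers: 4)
  return [] if items.empty?
  slice_size = (items.size / workers.to_f).ceil
  children = items.each_slice(slice_size).map do |slice|
    reader, writer = IO.pipe
    pid = fork do
      reader.close
      # The child does the CPU-heavy work and ships the results back.
      writer.write(Marshal.dump(slice.map { |item| yield item }))
      writer.close
    end
    writer.close                      # the parent only reads
    [pid, reader]
  end
  children.flat_map do |pid, reader|
    result = Marshal.load(reader.read)
    reader.close
    Process.wait(pid)
    result
  end
end

squares = parallel_map((1..10).to_a, workers: 3) { |n| n * n }
```

Each child has its own interpreter, so the work runs on separate cores.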
Hey another way is to use a combination of Nokogiri and IronWorker (IronMQ and IronCache).
See a full blog entry on the Topic here
We use a combination of ActiveMQ/Active Messaging, Event Machine, and multi-threading for this problem. We start off with a big list of URL's to fetch. We then break them down into batches of 100 URL's per batch. Each batch is then pushed into ActiveMQ. Then, we have an array of poller/consumer processes listening to the queue. These consumers can all be on one computer, or they can be spread across multiple computers. The array of consumers can grow arbitrarily large to support as much parallelism as we want. The consumers use Active Messaging, which is a nice Ruby integration with ActiveMQ.
When a consumer receives a message to process a batch of 100 URL's, it kicks off Event Machine to create a thread pool that can process multiple messages in multiple threads. Like you, we use Nokogiri to process each URL.
So, there are three levels of parallelism:
1) Multiple concurrent requests per consumer process, supported by Event Machine and threads.
2) Multiple consumer processes per computer.
3) Multiple computers.
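The first level (batches of 100 URLs, each processed by several threads) might look like the sketch below. It uses only the standard library rather than ActiveMQ or EventMachine; `process_batch`, the pool size of 8, and the stand-in fetch lambda are assumptions for illustration.

```ruby
BATCH_SIZE = 100
POOL_SIZE  = 8

# Processes one batch with a small pool of threads and returns [url, result] pairs.
def process_batch(urls, pool_size: POOL_SIZE, &fetch)
  jobs = Queue.new
  urls.each { |url| jobs << url }
  results = Queue.new
  pool = Array.new(pool_size) do
    Thread.new do
      loop do
        url = begin
          jobs.pop(true)             # non-blocking pop raises ThreadError when empty
        rescue ThreadError
          break
        end
        results << [url, fetch.call(url)]
      end
    end
  end
  pool.each(&:join)
  Array.new(results.size) { results.pop }
end

urls = (1..250).map { |i| "http://example.com/page/#{i}" }
batches = urls.each_slice(BATCH_SIZE).to_a

# Stand-in for "download and parse with Nokogiri"; no network access here.
fake_fetch = ->(url) { url.length }
pages = batches.flat_map { |batch| process_batch(batch, &fake_fetch) }
```

In the setup described above, each batch would be a message on the queue and `process_batch` would run inside a consumer process.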
If you want something easy go for http://anemone.rubyforge.org/
If you want something fast, code something with eventmachine/em-http-request
I found redis to be a great multi purpose tool for queue management, caching and so on. You could also use specialized things like beanstalkd/active mq/... but at least in my use case, I didn't really find them to be a big advantage compared to redis.
The load on the backend system in particular could be a bottleneck, so choose your database carefully and pay attention to what you save.

How to Monitor Uptime of 20 Websites (Ping or HTTP) in Node.js/RoR

What's the best way to ping a list of 20 websites every 5 minutes (for example) in order to know if the site responds with HTTP 202 or not?
The no-brainer idea is to save the 20 URLs in a database, then just walk through the database and ping each one. However, what happens when one doesn't answer? What happens to the ones after that?
Also, is there a better but still no-brainer solution for this? I'm afraid the list can grow to 20000 websites and then there's not enough time to ping them all in the 5 minutes I need to be pinging.
Basically, I'm describing how PingDom, UptimeRobot, and the likes work.
I'm building this system using node.js and Ruby on Rails.
I'm also inclined to use MongoDB to save the history of all the pings and monitoring results.
Thanks a bunch!
I really like node.js and I would like to tackle this problem and hopefully soon share some code on github to achieve this. Keep in mind that I only have a very basic setup right now, hosted at https://github.com/alfredwesterveld/freakinping
"What's the best way to ping a list of 20 websites every 5 minutes (for example) in order to know if the site responds with HTTP 202 or not?"
First, I would like to know whether you really want to do a ping (ICMP), or whether you just want to know if the website returns status code 200 (OK) and measure the time it takes. I believe from the context that you don't really want a ping, just an HTTP request with timing. I ask because (I believe) pinging from node.js/Ruby/Python can't be done as a normal user: ICMP needs raw sockets, which require root. For example, I found this ping script in Python (I believe I also saw a simple Ruby script somewhere, although I'm not much of a Ruby programmer), but it requires root access. I don't believe there is a ping module for node.js yet.
Message Queue
"Also, is there a better but no-brainer solution for this? I'm afraid the list can grow to 20000 websites and then there's not enough time to ping them all in the 5 minutes I need to be pinging. Basically, I'm describing how PingDom, UptimeRobot, and the likes work."
What you need to achieve this kind of scale is a message queue such as redis, beanstalkd or gearmand. At the scale of PingDom one worker process is not going to cut it, but in your case (I assume) one worker will do. I think (assume) redis will be the fastest message queue because of its C extension for node.js, but then again I should benchmark it against beanstalkd, another popular message queue (which does not yet have a C extension).
"I'm afraid the list can grow to 20000 websites"
If you get to that scale you might have to host multiple boxes (a lot of worker threads/processes) to handle the load, but you aren't at that scale yet and node.js is insanely fast. It might even handle that load on a single box, although I don't know for sure (you would need to run some benchmarks).
I think this could be achieved pretty easily in node.js(I really like node.js). The way I would do this is use redis as my datastore because it is INSANE FAST!
PING: 20000 ops 46189.38 ops/sec 1/4/1.082
SET: 20000 ops 41237.11 ops/sec 0/6/1.210
GET: 20000 ops 39682.54 ops/sec 1/7/1.257
INCR: 20000 ops 40080.16 ops/sec 0/8/1.242
LPUSH: 20000 ops 41152.26 ops/sec 0/3/1.212
LRANGE (10 elements): 20000 ops 36563.07 ops/sec 1/8/1.363
LRANGE (100 elements): 20000 ops 21834.06 ops/sec 0/9/2.287
That is using node_redis (with the hiredis C library for node.js). I would add the URLs to redis using sadd.
Run tasks every 5 minutes
This could be achieved with barely any effort. I would use setInterval(callback, delay, [arg], [...]) to repeatedly test the response time of servers. In the callback, get all URLs from redis using smembers, then put all the URLs (messages) on the message queue using rpush.
Checking Response (Time)
"However, what happens when one doesn't answer? What happens to the ones after that?"
I might not completely understand this sentence, but here goes. If one fails, it just fails. You could check the response (time) again after 5 seconds or so to see if it is back online. A precise algorithm for this should be devised. The ones after that should not have anything to do with previous URLs unless they are on the same server. That is also something you should think about, because you should not hit all the URLs on the same server at the same time, but queue them up instead.
Processing URL
From the worker process (for now just one would suffice), fetch a message (URL) from redis using the brpop command, check the response time for that URL, then fetch the next URL from the list. I would probably run a couple of requests simultaneously to speed up the process.
There is no "basic way", since you must handle a lot of use cases:
http redirects,
https pages,
request timeouts,
the cpu load of the server you use for pinging,
the type of report you need (availability? Uptime? Responsiveness? Downtime?)
how to aggregate qos measurements by time
lifetime of the data you collect (pinging dozens of targets every five minutes quickly produces a lot of data)
realtime alerts
Pingdom and the like are not "basic" tools, and if you want something similar you may want to pay for it or rely on an existing open-source alternative. I know it for sure because I built a remote monitoring application myself. It's called Uptime, it's written in Node.js and MongoDB, and it's hosted on GitHub (https://github.com/fzaninotto/uptime). It took several weeks of hard work to develop it, so believe me: it is NOT a no-brainer.
Use monitoring tools like Zabbix or Nagios, which can track various metrics across large numbers of servers.
If you would like to implement it in JS, you can make an HTTP request at a fixed time interval, check the HTTP status code returned, and use XPath or a regex to validate that a certain element is correct.
For Ruby, run a daemon process with a thread pool (multithreading) and use URI open to read the HTTP code and the content, then use XPath to validate that the content behaves correctly.
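A minimal sketch of the Ruby variant, assuming Net::HTTP from the standard library instead of URI open so that connect and read timeouts can be set explicitly. The names `http_check` and `run_checks`, the 5-second timeout, and the pool size are illustrative choices. Each check has its own timeout and runs in a pool, so one unresponsive site does not hold up the ones after it.

```ruby
require "net/http"
require "uri"

# One HTTP check. Returns [status code or error class name, elapsed seconds].
def http_check(url, timeout: 5)
  uri = URI(url)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  status = begin
    Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https",
                    open_timeout: timeout, read_timeout: timeout) do |http|
      http.head(uri.request_uri).code.to_i
    end
  rescue StandardError => e
    e.class.name                     # e.g. a timeout or a refused connection
  end
  [status, Process.clock_gettime(Process::CLOCK_MONOTONIC) - started]
end

# Checks all URLs with a fixed number of threads; `checker` can be swapped out.
def run_checks(urls, concurrency: 10, checker: method(:http_check))
  jobs = Queue.new
  urls.each { |url| jobs << url }
  jobs.close                         # pop returns nil once the closed queue is empty
  results = Queue.new
  pool = Array.new(concurrency) do
    Thread.new do
      while (url = jobs.pop)
        results << [url, *checker.call(url)]
      end
    end
  end
  pool.each(&:join)
  report = {}
  until results.empty?
    url, status, seconds = results.pop
    report[url] = { status: status, seconds: seconds }
  end
  report
end
```

A daemon would call `run_checks` every five minutes and store the report; the content validation mentioned above (XPath on the body) would need a GET instead of the HEAD request used here.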
If you're curious, I've created an app called Pinger that does this. It's built on Ruby on Rails and Resque:
There are some free, quality services which provide very stable website uptime checks and notifications. You can check these instructions and review: http://fastjoomlahost.com/how-to-monitor-website-up-time
You can also do this in Node.js using the node-ping-monitor package.

Distributed time synchronization and web applications

I'm currently trying to build an application that inherently needs good time synchronization across the server and every client. There are alternative designs for my application that can do away with this need for synchronization, but my application quickly begins to suck when it's not present.
In case I am missing something, my basic problem is this: firing an event in multiple locations at exactly the same moment. As best I can tell, the only way of doing this requires some kind of time synchronization, but I may be wrong. I've tried modeling the problem differently, but it all comes back to either a) a sucky app, or b) requiring time synchronization.
Let's assume I Really Really Do Need synchronized time.
My application is built on Google AppEngine. While AppEngine makes no guarantees about the state of time synchronization across its servers, usually it is quite good, on the order of a few seconds (i.e. better than NTP), however sometimes it sucks badly, say, on the order of 10 seconds out of sync. My application can handle 2-3 seconds out of sync, but 10 seconds is out of the question with regards to user experience. So basically, my chosen server platform does not provide a very reliable concept of time.
The client part of my application is written in JavaScript. Again we have a situation where the client has no reliable concept of time either. I have done no measurements, but I fully expect some of my eventual users to have computer clocks that are set to 1901, 1970, 2024, and so on. So basically, my client platform does not provide a reliable concept of time.
This issue is starting to drive me a little mad. So far the best thing I can think to do is implement something like NTP on top of HTTP (this is not as crazy as it may sound). This would work by commissioning 2 or 3 servers in different parts of the Internet, and using traditional means (PTP, NTP) to try to ensure their sync is at least on the order of hundreds of milliseconds.
I'd then create a JavaScript class that implemented the NTP intersection algorithm using these HTTP time sources (and the associated roundtrip information that is available from XMLHTTPRequest).
As you can tell, this solution also sucks big time. Not only is it horribly complex, it only solves half the problem, namely giving the clients a good notion of the current time. I then have to compromise on the server, either by allowing the clients to tell the server the current time according to them when they make a request (big security no-no, but I can mitigate some of the more obvious abuses of this), or by having the server make a single request to one of my magic NTP-over-HTTP servers, and hoping that request completes speedily enough.
These solutions all suck, and I'm lost.
Reminder: I want a bunch of web browsers, hopefully as many as 100 or more, to be able to fire an event at exactly the same time.
Let me summarize, to make sure I understand the question.
You have an app that has a client and server component. There are multiple servers that can each be servicing many (hundreds) of clients. The servers are more or less synced with each other; the clients are not. You want a large number of clients to execute the same event at approximately the same time, regardless of which server happens to be the one they connected to initially.
Assuming that I described the situation more or less accurately:
Could you have the servers keep certain state for each client (such as initial time of connection -- server time), and when the time of the event that will need to happen is known, notify the client with a message containing the number of milliseconds after the beginning value that need to elapse before firing the event?
To illustrate:
client A connects to server S at time t0 = 0
client B connects to server S at time t1 = 120
server S decides an event needs to happen at time t2 = 500
server S sends a message to A:
S->A : {eventName, 500}
server S sends a message to B:
S->B : {eventName, 380}
This does not rely on the client time at all; just on the client's ability to keep track of time for some reasonably short period (a single session).
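The arithmetic in the illustration, as a small sketch (times in milliseconds on the server's clock; `delay_until_event` is a made-up name, and the client is assumed to have noted its own local time when it connected):

```ruby
# Delay the server sends to a client, counted from that client's connection time.
def delay_until_event(event_at_ms, connected_at_ms)
  event_at_ms - connected_at_ms
end

connections = { "A" => 0, "B" => 120 }   # server time at which each client connected
event_at = 500

messages = connections.transform_values do |connected_at|
  delay_until_event(event_at, connected_at)
end
# messages is {"A" => 500, "B" => 380}, matching the illustration above
```

Each client then fires the event once that many milliseconds have elapsed since it connected, measured with its own timer.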
It seems to me like you need to listen to a broadcast event from a server in many different places. Since you can accept 2-3 seconds of variation, could you just put all your clients into long-lived comet-style requests and wait for the response from the server? It sounds to me like the clients wouldn't need to deal with time at all this way.
You could use ajax to do this, so you'd avoid any client-side lockups while waiting for new data.
I may be missing something totally here.
This works if you can assume that the clocks are reasonably stable, that is, they may be set wrong but tick at more or less the right rate.
Have the servers get their offset from a single defined source (e.g. one of your servers, or a database server or something).
Then have each client calculate its offset from its server (possible round-trip complications if you want lots of accuracy).
Store that, then use the combined offset on each client to trigger the event at the right time.
(client-time-to-trigger-event) = (scheduled-time) + (client-to-server-difference) + (server-to-reference-difference)
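A sketch of that formula with made-up numbers. The sign convention is an assumption: each "difference" is read as "first clock minus second clock", so client-to-server-difference is client clock minus server clock. The round-trip estimate assumes latency is the same in both directions, which is the usual simplification.

```ruby
# Estimate (client clock - server clock) from one request/response round trip:
# the server's timestamp is taken to correspond to the midpoint of the trip.
def client_minus_server(sent_at_client, server_time, received_at_client)
  (sent_at_client + received_at_client) / 2.0 - server_time
end

# scheduled_ref is the event time on the reference clock.
def client_trigger_time(scheduled_ref, client_to_server, server_to_reference)
  scheduled_ref + client_to_server + server_to_reference
end

# Example values in milliseconds (all hypothetical).
client_to_server    = client_minus_server(1_000, 5_000, 1_200)  # (1000 + 1200) / 2 - 5000
server_to_reference = 50                                         # server runs 50 ms ahead
trigger_at = client_trigger_time(10_000, client_to_server, server_to_reference)
# Reference 10000 is server 10050, which is client 10050 - 3900 = 6150
```

Repeating the round trip a few times and keeping the sample with the shortest trip generally gives a tighter estimate.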
Time synchronization is very hard to get right and in my opinion the wrong way to go about it. You need an event system which can notify registered observers every time an event is dispatched (observer pattern). All observers will be notified simultaneously (or as close as possible to that), removing the need for time synchronization.
To accommodate latency, the browser should be sent the timestamp of the event dispatch, and it should wait a little longer than what you expect the maximum latency to be. This way all events will be fired up at the same time on all browsers.
Google found a way to treat time as absolute. That sounds heretical to a physicist, given General Relativity: time flows at a different pace depending on your position in space and time, on Earth, in the Universe ...
You may want to have a look at Google Spanner database: http://en.wikipedia.org/wiki/Spanner_(database)
I guess it is used now by Google and will be available through Google Cloud Platform.
