Queue in Flask with websocket

I am using Flask, Gevent and Scrapy for a project. The basic idea is that you enter a URL and it starts a crawler process with that input as the argument. It currently seems to be working well, with the output piped through the websocket.
I am curious about the best way to handle multiple crawlers running at the same time, i.e. if two people enter a URL at the same time. I thought the best approach would be a queue system; ideally I only want a controllable number of crawlers running at once.
Does anyone have suggestions on how to go about this with the libraries I am already using? Or maybe suggest a different approach?

Try Node.js, WebTCP (for websockets) and asynchronous calls for each crawler. Also, once you are done with a crawl you can save the result in temporary storage such as memcached or Redis with an expiration key, so that when a similar crawl request comes in you can serve it from that temporary storage.
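A minimal sketch of that cache-and-serve idea, written in Ruby with the redis gem purely for illustration (the same get/set-with-expiry calls exist in redis-py for the asker's Flask stack); cached_crawl, run_crawler, the key prefix and the 600-second TTL are all made-up names and values:

require 'redis'
require 'digest/md5'

# Hypothetical wrapper: serve a repeated crawl request from Redis instead of
# re-running the crawler, and let the key expire after `ttl` seconds.
def cached_crawl(redis, url, ttl: 600)
  key = "crawl:#{Digest::MD5.hexdigest(url)}"
  cached = redis.get(key)
  return cached if cached              # similar request: serve from storage

  result = run_crawler(url)            # stand-in for the real crawler call
  redis.set(key, result, ex: ttl)      # EX is the expiration, in seconds
  result
end

# cached_crawl(Redis.new, "http://example.com")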

If the crawler is a gevent job you can use a pool.
http://www.gevent.org/gevent.pool.html
The Pool class, which is a subclass of Group, provides a way to limit concurrency: its spawn method blocks if the number of greenlets in the pool has already reached the limit, until there is a free slot.
Pseudocode (this assumes a crawl(url) function defined elsewhere):

import gevent
from gevent.pool import Pool

crawler_pool = Pool(10)  # at most 10 crawlers at a time

def spawncrawler(url):
    def start():
        crawler_pool.spawn(crawl, url)  # blocks once the pool is full

    gevent.spawn(start)
    # Give a response to the browser right away. This always succeeds
    # because the (possibly blocking) pool.spawn happens in its own
    # greenlet: if 10 crawlers are already running, that greenlet just
    # waits for a free slot while the client still gets the default response.

Related

Distributed crawling and rate limiting / flow control

I am running a niche search product that works with a web crawler. The current crawler is a single (PHP Laravel) worker crawling the URLs and putting the results into an Elasticsearch engine. The system continuously re-crawls the found URLs at an interval of X milliseconds.
This has served me well, but with some new large clients coming up the crawler is going to hit its limits. I need to redesign the system as a distributed crawler to speed up the crawling. The problem is the combination of specs below.
The system must adhere to the following 2 rules:
multiple workers (concurrency issues)
variable rate-limit per client. I need to be very sure the system doesn't crawl client X more than once every X milliseconds.
What I have tried:
I tried putting the URLs in a MySQL table and letting the workers query for a URL to crawl based on last_crawled_at timestamps in the clients and urls tables. But MySQL doesn't like multiple concurrent workers and I get all sorts of deadlocks.
I tried putting the URLs into a Redis engine. I got this kind of working, but only with a Lua script that checks and sets an expiring key for every client that is being served. This all feels way too hackish (sketched below).
I thought about filling a regular queue, but this would violate rule number 2, as I can't be 100% sure the workers can process the queue in 'real time'.
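For what it's worth, the check-and-set that the Lua script performs can also be expressed as a single SET with the NX and PX options, which is atomic on its own. Here is a hedged sketch in Ruby with the redis gem; acquire_crawl_slot?, the key layout and the 5000 ms interval are illustrative names, not part of the asker's schema:

require 'redis'

# Returns true only for the first worker to claim the slot; every other
# worker gets false until the key expires after interval_ms.
def acquire_crawl_slot?(redis, client_id, interval_ms)
  redis.set("ratelimit:client:#{client_id}", "1", nx: true, px: interval_ms)
end

# redis = Redis.new
# crawl(url) if acquire_crawl_slot?(redis, client_id, 5_000)

Because the command is atomic, two concurrent workers cannot both win the same slot, which is the concurrency guarantee rule 2 asks for.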
Can anybody explain to me how the big boys do this? How can we have multiple processes query a big/massive list of URLs based on a few criteria (like rate limiting per client) and make sure we hand out each URL to only one worker?
Ideally we won't need another database besides Elasticsearch holding all the available/found URLs, but I don't think that's possible?
Have a look at StormCrawler; it is a distributed web crawler which has an Elasticsearch module. It is highly customisable and enforces politeness by respecting robots.txt and, by default, using a single thread per host or domain.

Processing web pages concurrently with Ruby

I am trying to process the content of different pages given an array of URLs, using Ruby threads. However, when trying to open the URLs I always get this error: #<SocketError: getaddrinfo: Name or service not known>
This is how I am trying to do it:
sites.each do |site|
  threads << Thread.new(site) do |url|
    puts url
    #web = open(url) { |i| i.read } # same issue opening the web this way
    web = Net::HTTP.new(url, 443).get('/', nil)
    lock.synchronize do
      new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
    end
  end
end
sites is the array of URLs.
The same program but sequential works alright:
sites.each { |site|
  web = open(site) { |i| i.read }
  new_md5 << Digest::MD5.hexdigest(web)
}
What's the problem?
Ugh. You're going to open a thread for every site you have to process? What if you have 10,000 sites?
Instead, set a limit on the number of threads, and turn sites into a Queue, and have each thread remove a site, process it and get another site. If there are no more sites in the Queue, then the thread can exit.
The example in the Queue documentation will get you started.
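A minimal sketch of that pattern; process is a placeholder for whatever you do per site, and the thread count of 8 is arbitrary:

require 'thread'

queue = Queue.new
sites.each { |site| queue << site }

workers = 8.times.map do
  Thread.new do
    loop do
      site = queue.pop(true) rescue break  # non-blocking pop; exit once the queue is drained
      process(site)                        # placeholder for the per-site work
    end
  end
end

workers.each(&:join)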
Instead of using get and always retrieving the entire body, use a backing database that keeps track of the last time the page was processed. Use head to check whether the page has been updated since then; if it has, then do a get. That will reduce your, and their, bandwidth and CPU usage. It's all about being a good network citizen and playing nice with other people's toys. If you don't play nice, they might not let you play with them any more.
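A sketch of that head-then-get check with Net::HTTP; fetch_if_modified and last_seen are hypothetical names, and it assumes the server sends a Last-Modified header (not every server does, in which case an If-Modified-Since conditional GET is the alternative):

require 'net/http'
require 'uri'
require 'time'

# Returns the body only if the page changed since last_seen (a Time you keep
# in your backing database); otherwise returns nil without downloading it.
def fetch_if_modified(url, last_seen)
  uri = URI.parse(url)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    head = http.head(uri.request_uri)
    modified = head['Last-Modified'] && Time.httpdate(head['Last-Modified'])
    return nil if modified && last_seen && modified <= last_seen
    http.get(uri.request_uri).body
  end
end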
I've written hundreds of spiders and site analyzers. I'd recommend always having a backing database and using it to keep track of the sites you're going to read, when you last read them, whether they were up or down the last time you tried to get a page, and how many times you've tried to reach them while they were down. (The last is so you don't bang your code's head against the wall trying to reach dead/down sites.)
I had a 75-thread app that read pages. Each thread wrote its findings to the database, and, if a page needed to be processed, that HTML was written to a record in another table. A single app then read that table and did the processing. It was easy for a single app to stay ahead of 75 threads because they're dealing with the slow internet.
The big advantage of using a backend database is that your code can be shut down and, if you write it correctly, it'll pick up at the same spot: the next site to be processed. You can scale it up to run on multiple hosts easily too.
Regarding not being able to find the host:
Some things I see in your code:
You're not handling redirects. "Following Redirection" shows how to do that.
The request is to port 443, not 80, so Net::HTTP isn't happy trying to use non-SSL to an SSL port. See "Using Net::HTTP.get for an https url", which discusses how to turn on SSL.
Either of those could explain why using open works but your code doesn't. (I'm assuming you're using OpenURI in conjunction with your single-threaded code though you don't show it, since open by itself doesn't know what to do with a URL.)
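A sketch of how the request inside each thread could look once the URL is parsed and SSL is turned on; note that Net::HTTP.new wants a hostname rather than the full URL string, which is also consistent with the getaddrinfo failure. The names url, lock, new_md5 and sites_hash come from the question:

require 'net/http'
require 'uri'
require 'digest/md5'

uri  = URI.parse(url)                     # url is the full "https://..." string
http = Net::HTTP.new(uri.host, uri.port)  # pass the host, not the whole URL
http.use_ssl = (uri.scheme == 'https')    # speak TLS when hitting port 443
web  = http.get(uri.request_uri).body

lock.synchronize do
  new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
end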
In general, I'd recommend using Typhoeus and Hydra to process large numbers of sites in parallel. Typhoeus will handle redirects for you too, and it lets you use head requests. You can also set how many requests are handled at the same time (concurrency), and it automatically handles duplicate requests (memoization) so redundant URLs don't get pounded.
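A hedged sketch of the Typhoeus/Hydra version of the original loop; the concurrency limit of 20 is arbitrary, and sites, sites_hash and new_md5 come from the question:

require 'typhoeus'
require 'digest/md5'

hydra = Typhoeus::Hydra.new(max_concurrency: 20)   # cap on parallel requests

sites.each do |site|
  request = Typhoeus::Request.new(site, followlocation: true)  # follow redirects
  request.on_complete do |response|
    new_md5[sites_hash[site]] = Digest::MD5.hexdigest(response.body) if response.success?
  end
  hydra.queue(request)
end

hydra.run   # blocks until every queued request has finished

Since Hydra drives the callbacks from the thread that calls run, the explicit lock from the threaded version shouldn't be needed here.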

realtime communication with ruby

I'm about to write a game server in Ruby. One feature of the game is that players walk around and others should be able to see it.
I've already written a pure socket demo using EventMachine. But since most of the communication is going to be HTTP-based, I'm looking for some HTTP polling solution. Of course I could write it with EventMachine, but is there any gem out there for this kind of job already?
I've tried things like Faye, but most of these are messaging systems built around subscribing and publishing to a channel, and I don't seem to be able to control which clients I push to. In my case I need to be able to push to specific clients: if one player moves from 10,10 to 20,20, only those around him (maybe from 0,0 to 30,30, but not someone at 40,50) need to receive the message.
------------Progress with Cramp
Here's a quick update. I'm experimenting with Cramp: with 5000 connections and 100 client moves per second, the CPU usage is almost 100%. When I double both figures, the CPU usage is still around 100% and the response is very slow.
Clearly I'm not using every resource I have; only one CPU core is occupied. This needs more work.
------------Node.js's turn
@aam1r
Actually Node.js is doing better than Cramp. With 5000 connections and 100 clients moving per second, the CPU usage is over 60%. When I doubled that to 10000 connections and 200 clients moving per second, the CPU usage was 100% and the response became slow. Same problem here: both Cramp and Node.js can only use one CPU core per process. That's a problem.
------------What about JRuby?
Because of the GIL, there's no true simultaneous multi-threaded execution with Ruby MRI, and none with Node.js either. So I'm going to give JRuby a try.
When a client moves, use another thread to find all the other clients that need to be notified (which is CPU-heavy work). Then push the result to a channel.
The main thread simply subscribes to the channel. When it gets a result, it pushes it out to the clients.
I need some time to write a demo, though.
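A rough sketch of that thread-plus-channel hand-off using a plain Ruby Queue as the channel; moves, clients, near? and push are all placeholder names standing in for the asker's own socket layer, not a real API:

require 'thread'

updates = Queue.new   # the "channel" between the workers and the main thread

# Worker threads (true parallelism on JRuby): compute who can see each move.
4.times do
  Thread.new do
    loop do
      mover, x, y = moves.pop                         # fed by the socket layer
      visible = clients.select { |c| c.near?(x, y) }  # the CPU-heavy part
      updates << [visible, { id: mover, x: x, y: y }]
    end
  end
end

# The main thread subscribes to the channel and is the only writer to sockets.
loop do
  recipients, payload = updates.pop
  recipients.each { |c| c.push(payload) }
end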
I would recommend using Espresso with Server-Sent Events.
On the server-side you define a streaming action:
class App < E
  map :/

  attr_reader :connections

  def subscribe
    @connections ||= []
    stream :keep_open do |conn|
      connections << conn
      conn.callback { connections.delete conn }
    end
  end

  private

  def communicate_to_clients
    connections.each do |conn|
      conn << 'some message'
    end
  end
end
The :keep_open option instructs the server not to close the connection.
Then open a connection with JavaScript:
pool = new EventSource('/subscribe');
pool.onmessage = function(msg) {
  // here you receive messages sent by the server
  // via the communicate_to_clients method
};
I would suggest not using polling. Polling would result in too much overhead, since you'd be making a new connection every time you make a request. Also, it won't be real-time enough for you (i.e. you will poll every X seconds -- not instantly).
Instead, I would suggest using something like Cramp. From their website:
Cramp is a fully asynchronous real-time web application framework in Ruby. It is built on top of EventMachine and primarily designed for working with larger number of open connections and providing full-duplex bi-directional communication.
All your clients would maintain a persistent connection through which they can send/receive messages. There won't be overhead of making a new connection every time and messages will be sent in real-time since clients won't be checking "every X seconds".
You can also use Node.js instead of Cramp. It's a Javascript framework that can be used to develop real-time applications.
Here are some more resources that should help you out:
Slideshow on using Node.js with Ruby
Discussion on "Real time ruby apps: CRAMP vs NODE.JS"

How to Monitor Uptime of 20 Websites (Ping or HTTP) in Node.js/RoR

What's the best way to ping a list of 20 websites every 5 minutes (for example) in order to know if the site responds with HTTP 202 or not?
The no-brainer idea is to save the 20 URLs in a database, then just loop over them and ping each one. However, what happens when one doesn't answer? What happens to the ones after it?
Also, is there a better but still no-brainer solution for this? I'm afraid the list can grow to 20000 websites, and then there isn't enough time to ping them all within the 5 minutes I need to be pinging.
Basically, I'm describing how PingDom, UptimeRobot, and the likes work.
I'm building this system using node.js and Ruby on Rails.
I'm also inclined to use MongoDB to save the history of all the pings and monitoring results.
Suggestions?
Thanks a bunch!
GitHub
I really like Node.js and I would like to tackle this problem, and hopefully soon share some code on GitHub to achieve it. Keep in mind that I only have a very basic setup right now, hosted at https://github.com/alfredwesterveld/freakinping
What's the best way to ping a list of 20 websites every 5 minutes (for example) in order to know if the site responds with HTTP 202 or not?
PING (ICMP)
First I would like to know if you really want to do a ping (ICMP), or if you just want to know whether the website returns code 200 (OK) and measure the time it takes. I believe from the context that you don't really want to do a ping, just an HTTP request while measuring the time. I ask this because (I believe) pinging from node.js/Ruby/Python can't be done as a normal user, since we need raw sockets (root) to do ICMP pings from a programming language. For example, I found a ping script in Python (I also believe I saw a simple Ruby script somewhere, although I am not a really big Ruby programmer), but it requires root access. I don't believe there is even a ping module out there for node.js yet.
Message Queue
Also, is there a better but still no-brainer solution for this? I'm afraid the list can grow to 20000 websites, and then there isn't enough time to ping them all within the 5 minutes I need to be pinging.
Basically, I'm describing how PingDom, UptimeRobot, and the likes work.
What you need to achieve this kind of scale is a message queue such as Redis, beanstalkd or gearmand. At the scale of Pingdom a single worker process is not going to cut it, but in your case (I assume) one worker will do. I think (assume) Redis will be the fastest message queue because of its C extension for node.js, but then again I should benchmark it against beanstalkd, which is another popular message queue (but does not yet have a C extension).
I'm afraid the list can grow to 20000 websites
If you get to that scale you might have to host multiple boxes (a lot of worker threads/processes) to handle the load, but you aren't at that scale yet and node.js is insanely fast. It might even be able to handle that load with a single box, although I don't know for sure (you need to run some benchmarks).
Datastore/Redis
I think this could be achieved pretty easily in node.js (I really like node.js). The way I would do this is to use Redis as my datastore, because it is insanely fast:
PING: 20000 ops 46189.38 ops/sec 1/4/1.082
SET: 20000 ops 41237.11 ops/sec 0/6/1.210
GET: 20000 ops 39682.54 ops/sec 1/7/1.257
INCR: 20000 ops 40080.16 ops/sec 0/8/1.242
LPUSH: 20000 ops 41152.26 ops/sec 0/3/1.212
LRANGE (10 elements): 20000 ops 36563.07 ops/sec 1/8/1.363
LRANGE (100 elements): 20000 ops 21834.06 ops/sec 0/9/2.287
These numbers are from node_redis (with the hiredis C library). I would add the URLs to Redis using sadd.
Run tasks every 5 minutes
This can be achieved with barely any effort. I would use setInterval(callback, delay, [arg], [...]) to repeatedly test the response time of the servers. In the callback, get all URLs from Redis using smembers, and put all the URLs (messages) on the message queue using rpush.
Checking Response (Time)
However, what happens when one doesn't answer? What happens to the ones after it?
I might not completely understand this sentence, but here it goes: if one fails, it just fails. You could check its response (time) again in 5 seconds or so to see if it is online; a precise algorithm for this should be devised. The ones after it should not have anything to do with previous URLs unless they are on the same server. That is also something to think about, because then you should not hit all the URLs on the same server at the same time, but rather queue them up or space them out.
Processing URL
From the worker process (for now just one should suffice), fetch a message (URL) from Redis using the brpop command, check the response time for that URL (message), and then fetch the next URL (message) from the list. I would probably do a couple of requests simultaneously to speed up the process.
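Roughly the same pipeline, sketched in Ruby with the redis gem only because the other examples on this page are Ruby (the asker mentions Rails as well); the key names sites, check_queue and history:* are made up, and node_redis exposes the same sadd/smembers/rpush/brpop commands:

require 'redis'
require 'net/http'
require 'uri'
require 'json'

redis = Redis.new

# Scheduler: every 5 minutes, copy the monitored URLs onto the work queue.
Thread.new do
  loop do
    redis.smembers('sites').each { |url| redis.rpush('check_queue', url) }
    sleep 300
  end
end

# Worker: block until a URL arrives, then record its status and latency.
loop do
  _list, url = redis.brpop('check_queue')
  started = Time.now
  status  = begin
    Net::HTTP.get_response(URI.parse(url)).code.to_i
  rescue StandardError
    0                       # treat network errors as "down"
  end
  redis.rpush("history:#{url}",
              { status: status, ms: ((Time.now - started) * 1000).round }.to_json)
end

URLs would get into the sites set once, with redis.sadd('sites', url), as the answer describes.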
There is no "basic way", since you must handle a lot of use cases:
http redirects,
https pages,
request timeouts,
the CPU load of the server you use for pinging,
the type of report you need (availability? uptime? responsiveness? downtime?),
how to aggregate QoS measurements by time,
lifetime of the data you collect (pinging dozens of targets every five minutes quickly produces a lot of data)
realtime alerts
etc.
Pingdom and the like are not "basic" tools, and if you want something similar you may want to pay for it or rely on an existing open-source alternative. I know it for sure because I built a remote monitoring application myself. It's called Uptime, it's written in Node.js and MongoDB, and it's hosted on GitHub (https://github.com/fzaninotto/uptime). It took several weeks of hard work to develop it, so believe me: it is NOT a no-brainer.
Use monitoring tools like Zabbix, Nagios, and so on, which can measure various parameters of your servers in large numbers.
If you would like to implement it in JS, you can make HTTP requests on a timed interval, check the HTTP status code returned, and use XPath or a regex to validate that certain elements are correct.
For Ruby, run a daemon process with a thread pool (the multithreading idea) and use open-uri to check the HTTP code and the content, then use XPath to validate that the content behaves correctly (see the sketch below).
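A small sketch of that Ruby check using open-uri and Nokogiri; the check function name and the XPath are made up, and non-2xx responses raise OpenURI::HTTPError, which is what signals "down" here:

require 'open-uri'
require 'nokogiri'

def check(url)
  html = URI.open(url).read                   # raises OpenURI::HTTPError on non-2xx
  doc  = Nokogiri::HTML(html)
  doc.at_xpath('//title') ? :up : :degraded   # responded, but validate the content too
rescue StandardError
  :down                                       # DNS errors, timeouts, bad status codes
end

Each worker thread from the pool would call check on its share of the URLs and write the result to the history store.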
If you're curious, I've created an app called Pinger that does this. It's built on Ruby on Rails and Resque:
https://github.com/austinthecoder/pinger
There are some free, quality services that provide very stable website uptime checks and notifications. You can check this walkthrough and review: http://fastjoomlahost.com/how-to-monitor-website-up-time
You can also do this in Node.js using the node-ping-monitor package.

Concurrent web requests with Ruby (Sinatra?)?

I have a Sinatra app that basically takes some input values and then finds data matching those values from external services like Flickr, Twitter, etc.
For example:
input:"Chattanooga Choo Choo"
Would go out and find images at Flickr on the Chattanooga Choo Choo and tweets from Twitter, etc.
Right now I have something like:
@images = Flickr::...find...images..
@tweets = Twitter::...find...tweets...
@results << @images
@results << @tweets
So my question is: is there an efficient way in Ruby to run those requests concurrently, instead of waiting for the image lookup to finish before starting on the tweets?
Threads would work, but they're a crude tool. You could try something like this:
flickr_thread = Thread.start do
  @flickr_result = ... # make the Flickr request
end

twitter_thread = Thread.start do
  @twitter_result = ... # make the Twitter request
end

# this makes the main thread wait for the other two threads
# before continuing with its execution
flickr_thread.join
twitter_thread.join

# now both @flickr_result and @twitter_result have
# their values (unless an error occurred)
You'd have to tinker a bit with the code, though, and add proper error handling. The instance variables do work when assigned inside the thread blocks here, since the blocks close over the same object; local variables wouldn't unless they were explicitly declared outside.
I wouldn't call this an elegant solution, but I think it works, and it's not too complex. In this case there is luckily no need for locking or synchronization apart from the joins, so the code reads quite well.
Perhaps a tool like EventMachine (in particular the em-http-request subproject) might help you, if you do a lot of things like this. It could probably make it easier to code at a higher level. Threads are hard to get right.
You might consider making a client side change to use asynchronous Ajax requests to get each type (image, twitter) independently. The problem with server threads (one of them anyway) is that if one service hangs, the entire request hangs waiting for that thread to finish. With Ajax, you can load an images section, a twitter section, etc, and if one hangs the others will still show their results; eventually you can timeout the requests and show a fail whale or something in that section only.
Yes, why not threads?
As I understand it, as soon as the user submits the form you want to process all the requests in parallel, right? You can have one multithreaded controller (Ruby's thread support works really well) where you receive the request, execute the external service queries in parallel, and then answer back in one response. Or, on the client side, you send one Ajax post per service and process each one separately (maybe each external service gets its own controller/actions?).
http://github.com/pauldix/typhoeus
parallel/concurrent http requests
Consider using YQL for this. It supports subqueries, so that you can pull everything you need with a single (client-side, even) call that just spits out JSON of what you need to render. There are tons of tutorials out there already.
