Throttle Mechanize gem - ruby

Is there any built-in way to throttle the Mechanize gem?
I'm looking for something like a callback that fires when an HTTP request is made.
Later edit:
I would like to implement bandwidth throttling, to avoid flooding the sites being parsed.
EG: Only allow one request per second.

It may be that pre_connect_hooks is what you are looking for. Sadly, I can't find any way to add a hook other than appending a lambda/Proc directly to the array.
They are called here and this method is called here
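A minimal sketch of the "one request per second" idea, assuming the hook only needs to be a callable that delays the request. The helper name build_throttle is made up for illustration, and the hook ignores whatever arguments Mechanize passes to it:

```ruby
# Builds a callable that sleeps so that successive calls are at least
# `min_interval` seconds apart. It ignores its arguments, so it can be
# appended to an array of hooks regardless of what they are called with.
# (build_throttle is a hypothetical helper, not part of Mechanize.)
def build_throttle(min_interval)
  last_call = nil
  mutex = Mutex.new
  lambda do |*_args|
    mutex.synchronize do
      if last_call
        elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - last_call
        wait = min_interval - elapsed
        sleep(wait) if wait > 0
      end
      last_call = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    end
  end
end

throttle = build_throttle(1.0) # at most one request per second

# With Mechanize loaded, the hook would be registered like this:
#   agent = Mechanize.new
#   agent.pre_connect_hooks << throttle
```

The mutex only matters if the same agent is shared between threads; for a single-threaded crawler the sleep alone does the work.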

Related

Fiddler filter to hide recurring requests

Is there any way to tell Fiddler not to log requests that have already been sent/logged previously?
Or even to filter them after you stop the capture, so as to get a smaller list to process?
Having a huge list of multiple identical requests is really difficult to debug...
It seemed simple, but after many tries I couldn't find anything.
Thanks in advance!
EDIT
To clarify things:
I am trying to debug a sort of monitoring system in which the requests and responses change over time, but it can take hours and thousands of queries before an event changes the system state, and with it the request/response data. So I would like to skip logging identical request/response sets.
The easiest way to do this would be to write a bit of FiddlerScript (Rules > Customize Rules).
However, how exactly do you define "identical"? The same URL? The same request headers? The same response body? etc.
The definition you choose obviously has a significant impact on what the necessary FiddlerScript will look like.

Limit size of response read by rest-client

I'm using the Ruby gem rest-client (1.6.7) to retrieve data using HTTP GET requests. However, sometimes the responses are bigger than I want to handle, so I would like some way to have the RestClient stop reading once it exceeds a size limit I set. The documentation says
For cases not covered by the general API, you can use the RestClient::Request class which provide a lower-level API.
but I do not see how that helps me. I do not see anything that looks like a hook into processing the incoming data stream, only operations I could perform after the whole thing is read. I don't want to waste time and memory reading a huge response into a buffer only to discard it.
How can I set a limit on the amount of data read by RestClient in a GET request? Or is there a different client I can use that makes it easy to set such a limit?
rest-client uses ruby's Net::HTTP underneath: https://github.com/rest-client/rest-client/blob/master/lib/restclient/request.rb#L303
Unfortunately, it doesn't seem like Net::HTTP will let you abandon a response based on its length, since it uses this method to issue all requests:
http://docs.ruby-lang.org/en/2.0.0/Net/HTTP.html#method-i-transport_request
As you can see, it uses HTTPResponse to read an HTTP response from server:
http://ruby-doc.org/stdlib-2.0.0/libdoc/net/http/rdoc/Net/HTTPResponse.html#method-i-read_body
HTTPResponse seems like the place where you could control whether to read the whole response and store it in memory, or read it and throw it away.
If you don't even want to read the response, I guess you'll need to close the socket.
I don't know whether there are REST clients with the functionality you need. I guess you'll need to write your own little REST client if you want such fine-grained control.
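A rough sketch of the read_body idea, bypassing rest-client and talking to Net::HTTP directly. It assumes that raising from inside the read_body block is an acceptable way to give up on the response; ResponseTooLarge, read_limited and fetch_with_limit are names invented for this example:

```ruby
require 'net/http'
require 'uri'

# Hypothetical error class for this sketch.
class ResponseTooLarge < StandardError; end

# Reads the body chunk by chunk and raises as soon as more than
# `max_bytes` have arrived, so the rest is never buffered.
# `response` is anything whose read_body yields chunks to a block,
# such as a Net::HTTPResponse inside the block form of http.request.
def read_limited(response, max_bytes)
  body = String.new
  response.read_body do |chunk|
    body << chunk
    if body.bytesize > max_bytes
      raise ResponseTooLarge, "response exceeded #{max_bytes} bytes"
    end
  end
  body
end

# Performs a GET and returns the body, or raises ResponseTooLarge.
# Net::HTTP.start closes the connection when the block exits,
# including when it exits through the exception.
def fetch_with_limit(url, max_bytes)
  uri = URI(url)
  body = nil
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.request(Net::HTTP::Get.new(uri)) do |response|
      body = read_limited(response, max_bytes)
    end
  end
  body
end
```

The limit is checked after each chunk, so the amount actually read can overshoot max_bytes by up to one chunk.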

Is typhoeus safe to use with activerecord? resque?

Is typhoeus safe to use with activerecord? resque? I've poked around in the source and googling here and there and I can't make heads or tails of it. I guess what I really want to know is, are the response callbacks run one at a time or in parallel? Because if the callbacks are atomic or sequential or whatever then it's safe to do anything in them. I think.
From my answer to the github question:
I've used it extensively within Resque workers. Typhoeus::Hydra
returns (fires #on_complete) in the order that the requests are
returned. Rather than doing any database work within the #on_complete
block, I'd suggest storing a collection of response objects and
performing database work after you've done any validation you might
need to do.

Concurrent web requests with Ruby (Sinatra?)?

I have a Sinatra app that basically takes some input values and then finds data matching those values from external services like Flickr, Twitter, etc.
For example:
input:"Chattanooga Choo Choo"
Would go out and find images at Flickr on the Chattanooga Choo Choo and tweets from Twitter, etc.
Right now I have something like:
@images = Flickr::...find...images..
@tweets = Twitter::...find...tweets...
@results << @images
@results << @tweets
So my question is, is there an efficient way in Ruby to run those requests concurrently? Instead of waiting for the images to finish before the tweets finish.
Threads would work, but it's a crude tool. You could try something like this:
flickr_thread = Thread.start do
  @flickr_result = ... # make the Flickr request
end
twitter_thread = Thread.start do
  @twitter_result = ... # make the Twitter request
end
# this makes the main thread wait for the other two threads
# before continuing with its execution
flickr_thread.join
twitter_thread.join
# now both @flickr_result and @twitter_result have
# their values (unless an error occurred)
You'd have to tinker a bit with the code though, and add proper error detection. I can't remember right now if instance variables work when assigned inside the thread block; local variables wouldn't, unless they were explicitly declared outside.
I wouldn't call this an elegant solution, but I think it works, and it's not too complex. In this case there is luckily no need for locking or synchronizations apart from the joins, so the code reads quite well.
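One way to sidestep the question of which variables are visible inside the thread block is to let each thread return its result and collect it with Thread#value. A small sketch, where find_images and find_tweets are hypothetical stand-ins for the real Flickr and Twitter calls:

```ruby
# Hypothetical stand-ins for the real Flickr and Twitter lookups;
# the sleep simulates network latency.
def find_images(query)
  sleep 0.2
  ["image for #{query}"]
end

def find_tweets(query)
  sleep 0.2
  ["tweet about #{query}"]
end

# Starts both lookups at once and waits for both results.
def search_all(query)
  threads = {
    images: Thread.new { find_images(query) },
    tweets: Thread.new { find_tweets(query) }
  }
  # Thread#value joins the thread and returns the block's result.
  # An exception raised inside the thread is re-raised here.
  threads.transform_values(&:value)
end

results = search_all('Chattanooga Choo Choo')
images = results[:images]
tweets = results[:tweets]
```

Because an exception from either thread surfaces at the call to value, error handling can live in one place around search_all.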
Perhaps a tool like EventMachine (in particular the em-http-request subproject) might help you, if you do a lot of things like this. It could probably make it easier to code at a higher level. Threads are hard to get right.
You might consider making a client side change to use asynchronous Ajax requests to get each type (image, twitter) independently. The problem with server threads (one of them anyway) is that if one service hangs, the entire request hangs waiting for that thread to finish. With Ajax, you can load an images section, a twitter section, etc, and if one hangs the others will still show their results; eventually you can timeout the requests and show a fail whale or something in that section only.
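If the requests stay on the server, the hanging-service problem can be softened by bounding how long each thread is waited for. A sketch under that assumption; value_within is a made-up helper name, and killing the slow thread is just one possible policy:

```ruby
# Waits up to `seconds` for the thread. Returns its value if it
# finished in time, otherwise kills it and returns `fallback`.
# (value_within is a hypothetical helper for this sketch.)
def value_within(thread, seconds, fallback)
  if thread.join(seconds) # join returns nil when the time limit is hit
    thread.value
  else
    thread.kill
    fallback
  end
end

slow_service = Thread.new { sleep 5; ['too late'] }
fast_service = Thread.new { ['on time'] }

slow_result = value_within(slow_service, 0.1, [])
fast_result = value_within(fast_service, 0.1, [])
```

Each section of the page can then be rendered from its own result, with the fallback standing in for a service that did not answer in time.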
Yes, why not threads?
As I understand it, as soon as the user submits a form, you want to process all the requests in parallel, right? You can have one multithreaded controller (Ruby's thread support works really well) that receives one request, runs the queries to the external services in parallel, and then answers back in one response. Alternatively, on the client side, you can send one Ajax POST for each service and process each one separately (maybe each external service has its own controller/actions?).
http://github.com/pauldix/typhoeus
parallel/concurrent http requests
Consider using YQL for this. It supports subqueries, so that you can pull everything you need with a single (client-side, even) call that just spits out JSON of what you need to render. There are tons of tutorials out there already.

How do I get around the Twitter API caching problem?

I'm building a Twitter app that needs to check user data somewhat frequently, but I'm having trouble with a cache that is, oddly, on Twitter's side, not mine.
Try the following user:
users/show in XML: http://twitter.com/users/show.xml?screen_name=technolocus
users/show in JSON: http://twitter.com/users/show.json?screen_name=technolocus
normal page: http://twitter.com/technolocus
All these methods of accessing data should return the same values, right? Check the statuses_count for each of them.
XML: 12548
JSON: 12513
normal: 12498
The normal method (i.e. just visiting the profile non-programmatically) serves up the most correct value of 12498. If I post or delete tweets to this account, it gets updated on the profile page instantly, but the XML and JSON methods still return cached data.
At this point, the values of the XML and JSON methods are 12 to 18 hours old respectively.
I first tried to access these methods from my website (hosted on Dreamhost). I thought it was Dreamhost caching the responses. Then I tried to access the API directly from my browser. I did a cURL from the command line from my machine after that. It wasn't dreamhost. I thought it was probably my ISP (I think they use NetApp or something like that). Then I asked a friend in another corner of India to try it. He's getting the exact same cached responses as I am.
So it isn't Dreamhost's cache; it isn't my ISP or my country's cache. There's only one conclusion - Twitter is caching responses.
How in the heavens do I get around this?!?
Forgot to mention this: The script on the server is in PHP and is using cURL to retrieve the XML and JSON data from Twitter, while the local tests have been just using the browser. Both have the exact same result!
First, I think you should report this as a bug to Twitter. I see the same discrepancy as you, and no matter what, that seems like a bug. Even if they're caching, I'd expect that a cache on their side would store an abstract form that would then be rendered into HTML, JSON, and XML. I wonder if what's actually going on is that these requests are performing similar but different queries.
Are you sure that the values are "old"? For example, did you actually delete about 50 updates recently (since you say the HTML one is newest but shows a lower count than the other two)? If you create another update do you see the HTML number increment while the other numbers stay the same, or do they all increment simultaneously?
If what you are saying is accurate, and it probably is, generally, you can't get around it. Twitter would want to be caching its responses since they are costly to reproduce every single time.
When you use Twitter's APIs, you end up being bound by its conventions, even if that includes caching.
Your best bet is to tweet to @twitterapi and get them to give you a response as to why the two representations are divergent.
Add ?blah=xxxx to all urls.
I don't develop anything against Twitter, and occasionally I manually "follow" three tweets by going to them in my browser. They always lag behind by half a day. I add ?asdsadsadsad to the URL (something different every time) and it always updates. I don't know what Twitter is doing here, and I came here while searching for the problem. But I guess this trick of appending a random value to the URL via GET will probably work for your API requests, too.
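A small sketch of that trick in Ruby (the question uses PHP and cURL, but the idea carries over). The helper name cache_busted and the parameter name nocache are arbitrary choices for this example, and whether the API actually ignores the extra parameter is an assumption:

```ruby
require 'uri'
require 'securerandom'

# Returns the URL with an extra random query parameter appended,
# keeping any parameters that were already there.
# (cache_busted and "nocache" are arbitrary names for this sketch.)
def cache_busted(url, param = 'nocache')
  uri = URI(url)
  params = URI.decode_www_form(uri.query.to_s)
  params << [param, SecureRandom.hex(8)]
  uri.query = URI.encode_www_form(params)
  uri.to_s
end

url = cache_busted('http://twitter.com/users/show.json?screen_name=technolocus')
```

Every call produces a different URL, so any cache keyed on the full URL should treat each request as new.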

Resources