Concurrent web requests with Ruby (Sinatra?)?

I have a Sinatra app that basically takes some input values and then finds data matching those values from external services like Flickr, Twitter, etc.
For example:
input:"Chattanooga Choo Choo"
Would go out and find images at Flickr on the Chattanooga Choo Choo and tweets from Twitter, etc.
Right now I have something like:
@images = Flickr::...find...images..
@tweets = Twitter::...find...tweets...
@results << @images
@results << @tweets
So my question is, is there an efficient way in Ruby to run those requests concurrently? Instead of waiting for the images to finish before the tweets finish.

Threads would work, but it's a crude tool. You could try something like this:
flickr_thread = Thread.start do
  @flickr_result = ... # make the Flickr request
end
twitter_thread = Thread.start do
  @twitter_result = ... # make the Twitter request
end

# this makes the main thread wait for the other two threads
# before continuing with its execution
flickr_thread.join
twitter_thread.join

# now both @flickr_result and @twitter_result have
# their values (unless an error occurred)
You'd have to tinker a bit with the code, though, and add proper error detection. I can't remember right now whether instance variables work when assigned inside the thread block; local variables wouldn't unless they were explicitly declared outside.
I wouldn't call this an elegant solution, but I think it works, and it's not too complex. In this case there is luckily no need for locking or synchronization apart from the joins, so the code reads quite well.
Perhaps a tool like EventMachine (in particular the em-http-request subproject) might help you, if you do a lot of things like this. It could probably make it easier to code at a higher level. Threads are hard to get right.
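For reference, here's a minimal sketch of the same idea with em-http-request's MultiRequest, which fires both requests at once and calls back when all of them have finished. The Flickr/Twitter URLs below are placeholders, not the real API endpoints:
require 'eventmachine'
require 'em-http-request'

EventMachine.run do
  multi = EventMachine::MultiRequest.new

  # both requests are issued immediately and run concurrently
  multi.add :flickr,  EventMachine::HttpRequest.new('https://flickr.example/search?text=Chattanooga+Choo+Choo').get
  multi.add :twitter, EventMachine::HttpRequest.new('https://twitter.example/search?q=Chattanooga+Choo+Choo').get

  multi.callback do
    # successful requests end up in responses[:callback], failed ones in responses[:errback]
    @images = multi.responses[:callback][:flickr] && multi.responses[:callback][:flickr].response
    @tweets = multi.responses[:callback][:twitter] && multi.responses[:callback][:twitter].response
    EventMachine.stop
  end
end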

You might consider making a client side change to use asynchronous Ajax requests to get each type (image, twitter) independently. The problem with server threads (one of them anyway) is that if one service hangs, the entire request hangs waiting for that thread to finish. With Ajax, you can load an images section, a twitter section, etc, and if one hangs the others will still show their results; eventually you can timeout the requests and show a fail whale or something in that section only.

Yes, why not threads?
If I understood correctly: as soon as the user submits the form, you want to process all the requests in parallel, right? You could have one multithreaded controller (Ruby's thread support works really well) that receives the request, runs the external service queries in parallel, and then answers back in a single response. Alternatively, on the client side you could send one Ajax post per service and process each response separately (perhaps with its own controller/action per external service).

http://github.com/pauldix/typhoeus
parallel/concurrent http requests
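A minimal sketch of how that can look (the URLs and params are placeholders, not the actual Flickr/Twitter endpoints): queue the requests on a Typhoeus::Hydra and run them in parallel.
require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)

flickr_request  = Typhoeus::Request.new('https://flickr.example/search', params: { text: 'Chattanooga Choo Choo' })
twitter_request = Typhoeus::Request.new('https://twitter.example/search', params: { q: 'Chattanooga Choo Choo' })

hydra.queue(flickr_request)
hydra.queue(twitter_request)
hydra.run                                 # blocks until every queued request has finished

images = flickr_request.response.body     # parse each body however the API requires
tweets = twitter_request.response.body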

Consider using YQL for this. It supports subqueries, so that you can pull everything you need with a single (client-side, even) call that just spits out JSON of what you need to render. There are tons of tutorials out there already.

Related

Fetching and showing if the API JSON response has changed

I am consuming a game API that updates all active player statistics in real time during the game.
I'm trying to find a way for my code to listen to this API (outside of a loop) and print to the console when there are changes in the JSON response. I've been experimenting with Ruby events, but I haven't come up with anything other than a loop (while true).
old_data = ""
while true
  sleep 1
  data = Utils.req("GET", "https://127.0.0.1:2999/liveclientdata/playerscores?summonerName=Yoruku", {}, false)
  # {"assists":0,"creepScore":50,"deaths":0,"kills":5,"wardScore":0}
  next if data.eql? old_data
  old_data = data
  p "NEW DATA: #{data}"
end
Your code seems to be doing exactly what you want it to do.
The technique you used is called polling. It has its issues, like performance and rate limits, which you need to consider. But you can't really avoid a loop in this case, because that is essentially what polling is.
You could use an async job scheduler (like Sidekiq), and after each HTTP request schedule another one in the future; or you could use something like the sidekiq-cron gem. That way you avoid writing the loop yourself.
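A rough sketch of the self-rescheduling approach, assuming Sidekiq and reusing the Utils.req helper from the question (the job name and the one-second interval are just for illustration):
require 'sidekiq'

class PollPlayerScoresJob
  include Sidekiq::Job   # Sidekiq::Worker on older Sidekiq versions

  def perform(old_data = nil)
    data = Utils.req("GET", "https://127.0.0.1:2999/liveclientdata/playerscores?summonerName=Yoruku", {}, false)
    p "NEW DATA: #{data}" unless data.eql?(old_data)
    self.class.perform_in(1, data)   # schedule the next poll instead of looping
  end
end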
If you want to avoid making requests even when nothing has changed on the server, you'll need websockets or so-called long polling, but I don't know whether the API you're talking to supports them.
Alternatively, the API could offer webhooks, so that it calls you whenever there is a change.

Process request and return response in chunks

I'm making a search aggregator and I've been wondering how I could improve the performance of the search.
Given that I'm getting results from different websites, I currently need to wait for each provider's results, but this happens one after another, so the whole request takes a while to respond.
The easiest solution would be to just make a request from the client for each provider, but this would end up with a ton of requests per search (but if this is the proper way, I'll just do it).
What I've been wondering is whether there's a way to return results every time a provider responds; so if we have providers A, B and C, and B has already returned results, send those back to the client. In order for this to work, all the searches would need to run in parallel, of course.
Do you know a way of doing this?
I'm trying to build a search experience similar to SkyScanner, which loads results but then keeps fetching more records and sorts them on the fly (on the client side, as far as I can tell).
Caching is the key here. Best practice for external APIs (or scraping) is to be as little of a 'taker' as possible. So in your Laravel setup, get your results, but cache them for as long as makes sense for your app. Although in a SkyScanner-like situation the odds are low that two users will make the exact same request, the odds are much higher that one user will make the same request multiple times, share the link, etc.
https://laravel.com/docs/8.x/cache
cache(['key' => 'value'], now()->addMinutes(10));
$value = cache('key');
To actually scrape the content, you could use this:
https://github.com/softonic/laravel-intelligent-scraper
Or to use an API which is the nicer route:
https://docs.guzzlephp.org/en/stable/
On the client side, you could just make a few calls to your own service in separate requests, and that would give you the asynchronous feel you're looking for.

How do I wait and send a response to an HTTP request at a later time?

I am developing an API using Sinatra on the server side. I would like to have an HTTP request that is made but continues to hang/wait, keeping the connection alive until a later event (another request) causes it to complete with a certain response value.
In other words:
get '/api/foo/:request_identifier' do
  # some code here
  wait_until_finished params[:request_identifier]
end

# When this URL is visited, the hanging request with the matching
# request identifier will complete, sending "foo response text" to the
# client.
get '/api/bar/:request_identifier' do
  make_it_finish params[:request_identifier], "foo response text"
  "bar response text"
end
How could I implement this, or something to this effect?
I have also considered having the client constantly make requests to the server, polling for completed requests, but the high number of requests could result in an expensive internet bill.
I'd be careful with hanging requests, as it's not a great user experience. That being said, if you need to have one thing finish before another, here are some options:
Use an event emitter
Use an async library
Without the full context of your problem it's hard to recommend one over the other; however, based on what you've described, it sounds like a promise would solve your issue here, which is option #2. It basically lets you wait for one thing to finish before doing the second thing.
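In plain Ruby/Sinatra terms, a minimal sketch of that "wait until another request fulfils it" idea could use a per-identifier Queue as a crude promise. This assumes a threaded server such as Puma, and the 30-second timeout is an arbitrary choice:
require 'sinatra'
require 'timeout'

set :server, :puma          # a threaded server, so a parked request doesn't block the others

WAITERS = {}                # request_identifier => Queue acting as a one-shot promise
WAITERS_LOCK = Mutex.new

get '/api/foo/:request_identifier' do
  queue = WAITERS_LOCK.synchronize { WAITERS[params[:request_identifier]] = Queue.new }
  begin
    Timeout.timeout(30) { queue.pop }      # block here until /api/bar pushes a value
  rescue Timeout::Error
    status 504
    'timed out waiting for the other request'
  end
end

get '/api/bar/:request_identifier' do
  queue = WAITERS_LOCK.synchronize { WAITERS.delete(params[:request_identifier]) }
  queue << 'foo response text' if queue    # wakes up the parked /api/foo request
  'bar response text'
end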

Processing web pages concurrently with Ruby

I am trying to process the content of different pages given an array of URLs, using ruby Thread. However, when trying to open the URL I always get this error: #<SocketError: getaddrinfo: Name or service not known>
This is how I am trying to do it:
sites.each do |site|
  threads << Thread.new(site) do |url|
    puts url
    # web = open(url) { |i| i.read } # same issue opening the web this way
    web = Net::HTTP.new(url, 443).get('/', nil)
    lock.synchronize do
      new_md5[sites_hash[url]] = Digest::MD5.hexdigest(web)
    end
  end
end
sites is the array of URLs.
The same program but sequential works alright:
sites.each { |site|
  web = open(site) { |i| i.read }
  new_md5 << Digest::MD5.hexdigest(web)
}
What's the problem?
Ugh. You're going to open a thread for every site you have to process? What if you have 10,000 sites?
Instead, set a limit on the number of threads, turn sites into a Queue, and have each thread pop a site, process it, and grab another. When there are no more sites in the Queue, the thread can exit.
The example in the Queue documentation will get you started; a rough sketch of the pattern follows.
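Something along these lines, assuming sites is an array of full URLs, scheme included (the ten-thread cap and the results hash keyed by URL are my own choices for illustration):
require 'net/http'
require 'digest/md5'

THREAD_COUNT = 10                    # cap the pool instead of one thread per site
queue = Queue.new
sites.each { |site| queue << site }

results = {}                         # url => MD5 of the fetched body
lock = Mutex.new

workers = Array.new(THREAD_COUNT) do
  Thread.new do
    loop do
      url = begin
        queue.pop(true)              # non-blocking pop; raises ThreadError when empty
      rescue ThreadError
        break                        # queue drained, this worker is done
      end
      body = Net::HTTP.get(URI(url))     # add redirect/error handling as needed
      lock.synchronize { results[url] = Digest::MD5.hexdigest(body) }
    end
  end
end
workers.each(&:join)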
Instead of using get and always retrieving the entire body, use a backing database that keeps track of the last time the page was processed. Use head to check whether the page has been updated since then; if it has, then do a get. That will reduce your, and their, bandwidth and CPU usage. It's all about being a good network citizen and playing nicely with other people's toys. If you don't play nice, they might not let you play with them any more.
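For example, a head-then-get check might look roughly like this, where last_seen stands in for whatever timestamp your backing database stores per URL:
require 'net/http'
require 'time'

uri  = URI('https://example.com/some/page')
head = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
  http.head(uri.request_uri)
end

last_modified = head['Last-Modified'] && Time.httpdate(head['Last-Modified'])
if last_modified.nil? || last_modified > last_seen
  body = Net::HTTP.get(uri)          # the page changed (or the server sent no date), so fetch it
  # ... process the body and update last_seen in the database ...
end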
I've written hundreds of spiders and site analyzers. I'd recommend always having a backing database and using it to keep track of the sites you're going to read, when you last read them, whether they were up or down the last time you tried to get a page, and how many times you've tried to reach them while they were down. (The last is so you don't bang your code's head against the wall trying to reach dead/down sites.)
I had a 75-thread app that read pages. Each thread wrote its findings to the database, and, if a page needed to be processed, that HTML was written to a record in another table. A single app then read that table and did the processing. It was easy for a single app to stay ahead of 75 threads because they're dealing with the slow internet.
The big advantage of using a backend database is that your code can be shut down, and, if you write it correctly, it'll pick up at the same spot: the next site to be processed. You can scale it up to run on multiple hosts easily too.
Regarding not being able to find the host:
Some things I see in your code:
You're not handling redirects. "Following Redirection" shows how to do that.
The request is to port 443, not 80, so Net::HTTP isn't happy trying to use non-SSL on an SSL port. See "Using Net::HTTP.get for an https url", which discusses how to turn on SSL; a small snippet follows below.
Either of those could explain why using open works but your code doesn't. (I'm assuming you're using OpenURI in conjunction with your single-threaded code, though you don't show it, since open by itself doesn't know what to do with a URL.)
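For the SSL point, a minimal illustration of turning it on with Net::HTTP (example.com is a placeholder):
require 'net/http'

uri = URI('https://example.com/')
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) do |http|
  http.get(uri.request_uri)
end
puts response.body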
In general, I'd recommend using Typhoeus and Hydra to process large numbers of sites in parallel. Typhoeus will handle redirects for you, along with allowing you to use head requests. You can also set how many requests are handled at the same time (concurrency), and it automatically handles duplicate requests (memoization) so redundant URLs don't get pounded.

How can I (simulate?) do asynchronous HTTP requests in plain Ruby, no Rails?

As a learning exercise, I am making a simple Ruby command-line application that logs into a handful of websites that have a public API (Reddit, Twitter, etc), checks if I have new messages, and reads them out to me.
It works well... but incredibly slowly, because it waits for each login-getcookies-requestmessages-getmessages cycle to complete before moving on to the next one. I would really love to have it work asynchronously, so that I can fire off multiple requests simultaneously, deal with whichever data comes back first, then whichever comes back second, and so on.
I've Googled this problem and looked at other StackOverflow threads, but I'm confused by the different options available, and most solutions seem to be assuming that my program is part of a larger Rails app, which it isn't. So I thought I'd ask: what is the simplest, smartest, most efficient way to do what I'm talking about? I don't need to be guided through it, I'd just like some input on my situation from people who know better than I do, and suggestions as to what I should research to solve this problem.
I'd also be willing to write this in JavaScript to run on Node if that'd be more appropriate.
You should try looking into EventMachine. A decent implementation of asynchronous HTTP requests with JavaScript/Ajax-style callbacks is em-synchrony:
require "em-synchrony"
require "em-synchrony/em-http"
EM.synchrony do
concurrency = 2
urls = ['http://url.1.com', 'http://url2.com']
EM::Synchrony::Iterator.new(urls, concurrency).each do |url|
resp = EventMachine::HttpRequest.new(url).get
resp.callback { puts "success callback" }
resp.errback { puts "error callback" }
puts resp.response
end
EventMachine.stop
end
