Grabbing JSON data from API with multi-threaded requests - ruby

I'm using httparty for making requests and currently have the following code:
def scr(users)
  users.times do |id|
    test_url = "siteurl/#{id}"
    Thread.new do
      response = HTTParty.get(test_url)
      open('users.json', 'a') do |f|
        f.puts "#{response.to_json}, "
      end
      p "added"
    end
  end
  sleep
end
It works OK for 100-300 records.
I tried adding Thread.exit after sleep, but if I set users to something like 200000, after a while my terminal throws an error. I don't remember exactly what it was, but it was something about threads and a resource being busy. Some records did get written, though (about 10,000 were added successfully).
It looks like I'm doing it wrong and need to somehow break the requests into batches.
Update: here's what I got:
def scr(users)
  threads = []
  urls = []
  users.times do |id|
    urls << "site_url/#{id}"
  end
  urls.each_slice(8) do |batch|
    batch_threads = batch.map do |url|
      Thread.new do
        response = HTTParty.get(url)
        response.to_json
      end
    end
    batch_threads.each(&:join) # wait for each batch of 8 before starting the next
    threads.concat(batch_threads)
  end
  all_values = threads.map { |t| t.value }.join(', ')
  open('users.json', 'a') do |f|
    f.puts all_values
  end
end

On quick inspection, the problem would seem to be a race condition on your JSON file: many threads are appending to it at once. Even if you don't get an error, you'll definitely get corrupted data.
The simplest solution is probably just to do all the writing at the end:
def scr(users)
  threads = []
  users.times do |id|
    test_url = "siteurl/#{id}"
    threads << Thread.new do
      response = HTTParty.get(test_url)
      response.to_json
    end
  end
  all_values = threads.map { |t| t.value }.join(', ')
  open('users.json', 'a') do |f|
    f.puts all_values
  end
end
Wasn't able to test that, but it should do the trick. It's also better in general to use Thread#join or Thread#value instead of sleep.
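For genuinely large runs, serializing the writes is not enough: setting users to 200,000 still spawns 200,000 threads, which is what exhausts the OS. A minimal sketch of one way to cap that, using a fixed pool of worker threads pulling from Ruby's thread-safe Queue (the siteurl/users.json names follow the question; worker_count is an illustrative parameter, not from the original post):

require 'httparty'

def scr(users, worker_count = 8)
  queue = Queue.new
  users.times { |id| queue << "siteurl/#{id}" }

  results = Queue.new # Queue is thread-safe, unlike Array
  workers = Array.new(worker_count) do
    Thread.new do
      loop do
        url = begin
          queue.pop(true) # non-blocking pop raises ThreadError once the queue is drained
        rescue ThreadError
          break
        end
        results << HTTParty.get(url).to_json
      end
    end
  end
  workers.each(&:join)

  # single writer, after all workers have finished, so there is no race on the file
  open('users.json', 'a') do |f|
    f.puts Array.new(results.size) { results.pop }.join(', ')
  end
end

This keeps at most worker_count requests in flight at a time, no matter how large users gets.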

Related

Why does http request fail when running from rake task in a thread?

I am trying to figure out why the following code does not work. The response is not printed out, and from other research it seems the request fails.
task :test_me do
  t1 = Thread.new do
    puts 'start'
    uri = URI.parse("http://google.com/")
    response = Net::HTTP.get_response(uri)
    puts response.inspect # this line not getting printed
  end
  # puts t1.value
end
However, if I run the following:
task :test_me do
  t1 = Thread.new do
    puts 'start'
    uri = URI.parse("http://google.com/")
    response = Net::HTTP.get_response(uri)
    puts response.inspect # this line is printed because of the puts below
  end
  puts t1.value
end
All is well
Note there are probably many ways to restructure this code, but I have dumbed down the example as far as possible and it's extracted from a gem so I don't have too much control over it.
If I can get a solid reason why this is not working from a rake task, I could potentially go back to them with a PR.
Thanks.
The reason this happens is that you never call join on the thread: the rake task's main thread finishes and the process exits before t1 gets a chance to run, killing it. When you use .value, it implicitly joins the thread for you (as documented here).
Try this:
task :test_me do
  t1 = Thread.new do
    puts 'start'
    uri = URI.parse("http://google.com/")
    response = Net::HTTP.get_response(uri)
    puts response.inspect
  end
  t1.join
end

RuntimeError (Circular dependency detected while autoloading constant Apps) - multithreading

I'm receiving this error:
RuntimeError (Circular dependency detected while autoloading constant Apps
when I'm multithreading. Here is my code below. Why is this happening?
The reason I am trying to multithread is that I am writing an HTML scraping app.
The call to Nokogiri::HTML(open()) is a synchronous blocking call that takes 1 second to return, and I have 100,000+ pages to visit, so I am trying to run several threads to overcome this issue. Is there a better way of doing this?
class ToolsController < ApplicationController
  def getWebsites
    t1 = Thread.new { func1() }
    t2 = Thread.new { func1() }
    t3 = Thread.new { func1() }
    t4 = Thread.new { func1() }
    t5 = Thread.new { func1() }
    t6 = Thread.new { func1() }
    t1.join
    t2.join
    t3.join
    t4.join
    t5.join
    t6.join
  end

  def func1
    puts Thread.current
    apps = Apps.order("RANDOM()").where("apps.website IS NULL").take(1)
    while apps.size == 1 do
      app = apps[0]
      puts app.name
      puts app.iTunes
      doc = Nokogiri::HTML(open(app.iTunes)) # open() here needs require 'open-uri'
      array = doc.css('div.app-links a').map { |link|
        url = link['href']
        url = Domainatrix.parse(url)
        url.domain + "." + url.public_suffix
      }
      array.uniq!
      if (array.size > 0)
        app.website = array.join(', ')
        puts app.website
      else
        app.website = "NONE"
      end
      app.save
      apps = Apps.order("RANDOM()").where("apps.website IS NULL").take(1)
    end
  end
end
"require" isn't thread-safe
Change your methods so that everything that needs to be required is loaded before the threads start.
For example:
def get_websites
  # values = Apps.all # try uncommenting this line if a second try is required
  ar = Apps.where("apps.website IS NULL")
  t1 = Thread.new { func1(ar) }
  t2 = Thread.new { func1(ar) }
  t1.join
  t2.join
end

def func1(ar)
  apps = ar.order("RANDOM()").limit(1)
  while (apps.size == 1)
    puts Thread.current
    # ... scraping work elided; refetch apps here, or this loop never advances
  end
end
But as somebody pointed out, the way you're multithreading within the controller isn't advised.
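If the app is on Rails (as the controller suggests), one way to apply this advice is to force eager loading of all application constants before any thread is spawned. A minimal sketch (Rails.application.eager_load! is the standard Rails call for this; the rest just mirrors the question's method):

def getWebsites
  # Load every autoloadable constant up front so no worker thread
  # triggers Rails' non-thread-safe constant autoloading mid-run.
  Rails.application.eager_load!

  threads = 6.times.map { Thread.new { func1 } }
  threads.each(&:join)
end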

I must be misunderstanding Celluloid

I currently have a script written in Ruby that scans a range of IP addresses and tries to connect to them. It's extremely slow at the moment: it takes up to 300 seconds to scan 254 hosts on the network, which is obviously not very practical. What I'm trying to do is give the script some concurrency in hopes of speeding it up. So far this is what I have:
require 'socket'
require 'celluloid'

$res_arr = []

class Ranger
  include Celluloid

  def initialize(host)
    @host = host
    @timeout = 1
  end

  def ip_range(host)
    host =~ /(?:\d{1,3}\.){3}[xX*]{1,3}/
  end

  def ctrl(host)
    begin
      if ip_range(host)
        strIP = host.gsub(/[xX*]/, '')
        (1..254).each do |oct|
          $res_arr << strIP + oct.to_s
        end
      else
        puts "Invalid host!"
      end
    rescue
      puts "Connection terminated."
    end
  end

  def connect
    addr = Socket.getaddrinfo(@host, nil)
    sock = Socket.new(Socket.const_get(addr[0][0]), Socket::SOCK_STREAM, 0)
    begin
      sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
    rescue Errno::EINPROGRESS
      resp = IO.select(nil, [sock], nil, @timeout.to_i)
      if resp.nil?
        $res_arr << "#{@host} Firewalled!"
      end
      begin
        if sock.connect_nonblock(Socket.pack_sockaddr_in(22, addr[0][3]))
          $res_arr << "#{@host} Connected!"
        end
      rescue Errno::ECONNREFUSED
        $res_arr << "#{@host} Refused!"
      rescue
        false
      end
    end
    sock
  end

  def output(contents)
    puts contents.value
  end
end # Ranger

main = Ranger.new(ARGV[0])
main.ctrl(ARGV[0])

$res_arr.each do |ip|
  scan = Ranger.new(ip)
  scnftr = scan.future :connect
  scan.output(scnftr)
end
The script works, but it takes just as long as before I included Celluloid at all. Am I misunderstanding how Celluloid works and what it's supposed to do?
Your problem is that each iteration of your loop starts a future and then immediately waits for it to return a value. What you want instead is to start all the futures first, and only then wait for them to finish, in two separate steps:
futures = $res_arr.map do |ip|
  scan = Ranger.new(ip)
  scan.future :connect
end

# now that all futures are running, we can start
# waiting for the first one to finish
futures.each do |future|
  puts future.value
end
Here's another example from the celluloid source: https://github.com/celluloid/celluloid/blob/master/examples/simple_pmap.rb
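A related refinement (a sketch, not part of the original answer): with one actor per host you end up with 254 actors alive at once. Celluloid's built-in pooling runs a fixed number of workers behind a single proxy instead. This assumes Ranger#connect is refactored to take the host as an argument, since pooled workers are shared across all hosts:

# pool(size:, args:) is Celluloid's pooling API; args are passed to
# each worker's initialize (the per-actor host is unused once connect
# receives the host directly).
pool = Ranger.pool(size: 16, args: [nil])

futures = $res_arr.map { |ip| pool.future.connect(ip) }
futures.each { |future| puts future.value }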

How can I terminate a SupervisionGroup?

I am implementing a simple program in Celluloid that ideally will run a few actors in parallel, each of which will compute something, and then send its result back to a main actor, whose job is simply to aggregate results.
Following this FAQ, I introduced a SupervisionGroup, like this:
module Shuffling
  class AggregatorActor
    include Celluloid

    def initialize(shufflers)
      @shufflerset = shufflers
      @results = {}
    end

    def add_result(result)
      @results.merge! result
      @shufflerset = @shufflerset - result.keys
      if @shufflerset.empty?
        self.output
        self.terminate
      end
    end

    def output
      puts @results
    end
  end

  class EvalActor
    include Celluloid

    def initialize(shufflerClass)
      @shuffler = shufflerClass.new
      self.async.runEvaluation
    end

    def runEvaluation
      # computation here, which yields result
      Celluloid::Actor[:aggregator].async.add_result(result)
      self.terminate
    end
  end

  class ShufflerSupervisionGroup < Celluloid::SupervisionGroup
    shufflers = [RubyShuffler, PileShuffle, VariablePileShuffle, VariablePileShuffleHuman].to_set

    supervise AggregatorActor, as: :aggregator, args: [shufflers.map { |sh| sh.new.name }]
    shufflers.each do |shuffler|
      supervise EvalActor, as: shuffler.name.to_sym, args: [shuffler]
    end
  end

  ShufflerSupervisionGroup.run
end
I terminate the EvalActors after they're done, and I also terminate the AggregatorActor when all of the workers are done.
However, the supervision thread stays alive and keeps the main thread alive. The program never terminates.
If I send .run! to the group, then the main thread terminates right after it, and nothing works.
What can I do to terminate the group (or, in group terminology, finalize, I suppose) after the AggregatorActor terminates?
What I did in the end was change the AggregatorActor to have a wait_for_results:
class AggregatorActor
  include Celluloid

  def initialize(shufflers)
    @shufflerset = shufflers
    @results = {}
  end

  def wait_for_results
    sleep 5 while not @shufflerset.empty?
    self.output
    self.terminate
  end

  def add_result(result)
    @results.merge! result
    @shufflerset = @shufflerset - result.keys
    puts "Results for #{result.keys.inspect} recorded, remaining: #{@shufflerset.inspect}"
  end

  def output
    puts @results
  end
end
And then I got rid of the SupervisionGroup (since I didn't need supervision, i.e. rerunning of actors that failed), and I used it like this:
shufflers = [RubyShuffler, PileShuffle, VariablePileShuffle, VariablePileShuffleHuman, RiffleShuffle].to_set

Celluloid::Actor[:aggregator] = AggregatorActor.new(shufflers.map { |sh| sh.new.name })
shufflers.each do |shuffler|
  Celluloid::Actor[shuffler.name.to_sym] = EvalActor.new shuffler
end

Celluloid::Actor[:aggregator].wait_for_results
That doesn't feel very clean; it would be nice if there were a cleaner way, but at least it works.
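One cleaner option (a sketch, not from the original post) is Celluloid::Condition, which ships with Celluloid: the aggregator blocks on a condition variable instead of polling, and add_result signals it once the set empties. Waiting on a Condition suspends only the current task, so the async add_result calls still get through:

class AggregatorActor
  include Celluloid

  def initialize(shufflers)
    @shufflerset = shufflers
    @results = {}
    @done = Celluloid::Condition.new # signaled when all results are in
  end

  def wait_for_results
    @done.wait # suspends this task without busy-polling
    self.output
    self.terminate
  end

  def add_result(result)
    @results.merge! result
    @shufflerset = @shufflerset - result.keys
    @done.signal if @shufflerset.empty?
  end

  def output
    puts @results
  end
end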

Datamapper transaction doesn't rollback

I have this code:
Vhost.transaction do
  domains.each do |domain|
    unless domain.save
      errors << domain.errors
    end
  end
  unless vhost.save
    errors << vhost.errors
  end
end
I expect a rollback if any domain.save or vhost.save fails. But there is no rollback. What am I doing wrong?
A DataMapper transaction only rolls back when an exception is raised inside the block (or when you roll it back explicitly); a failed save simply returns false, so nothing triggers the rollback. I've had success with this pattern:
DataMapper::Model.raise_on_save_failure = true

MyModel.transaction do |t| # note the block argument: t is the transaction
  begin
    # do stuff
  rescue DataMapper::SaveFailureError
    t.rollback
  end
end
Edit
OK, so you want to keep a record of all the errors before rolling back. Then try something like this:
Vhost.transaction do |t|
  new_errors = []
  domains.each do |domain|
    unless domain.save
      new_errors << domain.errors
    end
  end
  unless vhost.save
    new_errors << vhost.errors
  end
  errors += new_errors
  t.rollback if new_errors.any?
end
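If you don't need to collect every individual error, a simpler variant (a sketch; the message strings are illustrative) is to let an exception escape the block, since DataMapper rolls the transaction back when the block raises:

begin
  Vhost.transaction do
    domains.each do |domain|
      # raising aborts the block, which rolls the whole transaction back
      raise "domain failed: #{domain.errors.full_messages}" unless domain.save
    end
    raise "vhost failed: #{vhost.errors.full_messages}" unless vhost.save
  end
rescue RuntimeError => e
  errors << e.message
end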
