Why is curl in Ruby slower than command-line curl?

I am trying to download more than 1M pages (URLs ending with a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:
curl = Curl::Easy.new
batch_urls.each { |url_info|
  curl.url = url_info[:url]
  curl.perform
  file = File.new(url_info[:file], "wb")
  file << curl.body_str
  file.close
  # ... some other stuff
}
I have tried downloading an 8000-page sample. When using the code above, I get 1000 pages in 2 minutes. When I write all the URLs into a file and do this in the shell:
cat list | xargs curl
I get all 8000 pages in two minutes.
The thing is, I need to have it in Ruby code, because there is other monitoring and processing code.
I have tried:
Curl::Multi - it is somewhat faster, but misses 50-90% of the files (it does not download them and gives no reason/code)
multiple threads with Curl::Easy - around the same speed as single-threaded
Why is the reused Curl::Easy slower than subsequent command-line curl calls, and how can I make it faster? Or what am I doing wrong?
I would prefer to fix my download manager code rather than handle downloading differently for this one case.
Before this, I was calling command-line wget, which I supplied with a file containing the list of URLs. However, not all errors were handled, and it was not possible to specify an output file for each URL separately when using a URL list.
Now it seems to me that the best way would be to use multiple threads with system calls to the 'curl' command. But why do that when I can use Curl directly from Ruby?
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them at all to various values; it did not seem to help)
Any hints appreciated.

This could be a fitting task for Typhoeus
Something like this (untested):
require 'typhoeus'

def write_file(filename, data)
  file = File.new(filename, "wb")
  file.write(data)
  file.close
  # ... some other stuff
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
  req = Typhoeus::Request.new(url_info[:url])
  req.on_complete do |response|
    write_file(url_info[:file], response.body)
  end
  hydra.queue req
end

hydra.run
Come to think of it, you might run into memory problems because of the enormous number of files. One way to prevent that would be to never store the data in a variable but instead stream it to the file directly. You could use em-http-request for that.
EventMachine.run {
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| print chunk }
  # ...
}

So, if you don't set an on_body handler, curb will buffer the download. If you're downloading files you should use an on_body handler (there is a streaming sketch after the examples below). If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface.
require 'rubygems'
require 'curb'

urls_to_download = [
  'http://www.google.com/',
  'http://www.yahoo.com/',
  'http://www.cnn.com/',
  'http://www.espn.com/'
]
path_to_files = [
  'google.com.html',
  'yahoo.com.html',
  'cnn.com.html',
  'espn.com.html'
]

Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}
If you just want to download a single file:
Curl::Easy.download('http://www.yahoo.com/')
Here is a good resource: http://gist.github.com/405779
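For the streaming case mentioned at the start of this answer, here is a minimal, untested sketch of the on_body approach, assuming the same batch_urls array of :url/:file hashes from the question. Note that curb expects the on_body block to return the number of bytes it handled.
require 'curb'

# Stream each response straight to disk so curb never buffers a whole page.
curl = Curl::Easy.new
batch_urls.each do |url_info|
  File.open(url_info[:file], "wb") do |f|
    curl.url = url_info[:url]
    curl.on_body { |chunk| f.write(chunk); chunk.bytesize }  # return bytes handled
    curl.perform
  end
end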

There have been benchmarks comparing curb with other libraries such as HTTPClient. The winner, in almost all categories, was HTTPClient. In addition, there are documented scenarios where curb does NOT work in multi-threaded situations.
Like you, I've had this experience. I ran system calls to curl in 20+ concurrent threads and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.
I've since switched to HTTPClient, and the difference is huge. It now runs as fast as 20 concurrent curl system commands, and uses less CPU as well.
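For reference, a rough sketch of that HTTPClient setup (untested; it assumes a shared client is safe to use across threads and reuses the question's batch_urls array of :url/:file hashes):
require 'httpclient'
require 'thread'

client = HTTPClient.new
queue  = Queue.new

# 20 workers pull URLs from a shared queue; a nil sentinel ends each worker.
threads = Array.new(20) do
  Thread.new do
    while (url_info = queue.pop)
      File.open(url_info[:file], "wb") do |f|
        client.get_content(url_info[:url]) { |chunk| f.write(chunk) }
      end
    end
  end
end

batch_urls.each { |u| queue << u }
20.times { queue << nil }   # one sentinel per worker
threads.each(&:join)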

First let me say that I know almost nothing about Ruby.
What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more.
Have you tried profiling your code to see where most of the time is being spent?

Stiivi,
any chance that Net::HTTP would suffice for simple downloading of HTML pages?
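For example, something along these lines might already be enough (untested sketch; it assumes a reasonably recent Ruby, that all URLs live on the same host, and the question's batch_urls array of :url/:file hashes):
require 'net/http'
require 'uri'

# One keep-alive connection for the whole batch.
first = URI(batch_urls.first[:url])
Net::HTTP.start(first.host, first.port, :use_ssl => first.scheme == 'https') do |http|
  batch_urls.each do |url_info|
    response = http.get(URI(url_info[:url]).request_uri)
    File.open(url_info[:file], 'wb') { |f| f.write(response.body) }
  end
end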

You didn't specify a Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing CPUs.
Spawn as many processes as the machine has memory for, and limit the reliance on threads.
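A rough sketch of that idea, where download_batch is a hypothetical method wrapping the existing per-URL download and processing code:
# Split the batch into slices and fork one worker process per slice.
workers = 4
slice_size = (batch_urls.size / workers.to_f).ceil
batch_urls.each_slice(slice_size) do |slice|
  fork { download_batch(slice) }   # download_batch is a placeholder for your own code
end
Process.waitall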

Related

Compress file in Ruby - System vs Zlib?

I have a file on the order of a few hundred MB to compress. I don't need to read through the file myself, so I am free to either shell out or use Zlib as explained in this SO question.
I am inclined towards system, because then my Ruby process doesn't have to bother reading the file and bloating up; I can just run the well-known gzip command through system. Also, I get the exit status, so I know how it went.
Anything I am missing? Is there a best practice around this? Any loopholes?
If you use a system command, you can't intervene in the compression. So you won't be able to redirect the compressed output to a socket, provide an external progress bar, build a custom tar archive, etc. These things may be important when compressing large files.
Please look at the following example using ruby-zstds (zstd is better than gzip today).
require "socket"
require "zstds"
require "minitar"
TCPSocket.open "google.com", 80 do |socket|
writer = ZSTDS::Stream::Writer.new socket
begin
Minitar::Writer.open writer do |tar|
tar.add_file_simple "file.txt" do |tar_writer|
File.open "file.txt", "r" do |file|
tar_writer.write(file.read(512)) until file.eof?
end
end
tar.add_file_simple "file2.txt" ...
end
ensure
writer.close
end
end
We read file.txt in a streaming fashion, add it to a tar archive, and send each piece to Google immediately. We never need to store a complete compressed file anywhere.
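If you end up going with the question's pure-Zlib option after all, the same streaming idea works there too; a minimal sketch (file names are placeholders):
require 'zlib'

# Compress in fixed-size chunks so the Ruby process never holds the whole file.
Zlib::GzipWriter.open('big_file.gz') do |gz|
  File.open('big_file', 'rb') do |file|
    gz.write(file.read(64 * 1024)) until file.eof?
  end
end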

Ruby: getting disk usage information on Linux from /proc (or some other way that is non-blocking and doesn't spawn a new process)

Context:
I'm on Linux. I am writing a disk usage extension for Sensu. Extensions must be non-blocking, because they are included in the agent's main code loop. They also must be very lightweight, because they may be triggered as often as once every 10 seconds, or even down to once per second.
So I cannot spawn a new executable to gather disk usage information. From within Ruby, I can only do stuff like File.open() on /proc and /sys and so on, read the content, parse it, file.close(), then print the result. Repeat.
I've found the sys-filesystem gem, which appears to have everything I need. But I'd rather not force extensions to depend on gems, if it can be avoided. I'll use the gem if it turns out to be the best way, but is there a good alternative? Something that doesn't require a ton of coding?
The information can be accessed via the system call statfs
http://man7.org/linux/man-pages/man2/statfs.2.html
I can see there is a ruby interface to this here:
http://ruby-doc.org/core-trunk/File/Statfs.html
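If that interface isn't available in your Ruby, here is a hedged sketch that calls statvfs(3) through Fiddle instead: no gem, no child process. The library name and the struct layout (first five fields read as 8-byte unsigned integers) are assumptions that hold on 64-bit glibc Linux.
require 'fiddle'

libc    = Fiddle.dlopen('libc.so.6')                  # assumes glibc
statvfs = Fiddle::Function.new(libc['statvfs'],
                               [Fiddle::TYPE_VOIDP, Fiddle::TYPE_VOIDP],
                               Fiddle::TYPE_INT)

buf = Fiddle::Pointer.malloc(128, Fiddle::RUBY_FREE)  # big enough for struct statvfs
raise 'statvfs failed' unless statvfs.call('/', buf).zero?

# f_bsize, f_frsize, f_blocks, f_bfree, f_bavail (layout assumption, see above)
_bsize, frsize, blocks, _bfree, bavail = buf[0, 40].unpack('Q5')
total = blocks * frsize
free  = bavail * frsize
puts format('used: %.1f%%', 100.0 * (total - free) / total)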

Anything external as fast as an array? So I don't need to re-load arrays each time I run scripts

While I am developing my application I need to do tons of math over and over again, tweaking it, running it again, and observing the results.
The math is done on arrays that are loaded from large files - many megabytes. Not very large, but the problem is that each time I run my script it first has to load the files into arrays, which takes a long time.
I was wondering if there is anything external that works similarly to arrays, in that I know the location of the data and can just get it, without having to reload everything.
I don't know much about databases, except that they don't seem to work the way I need: they aren't ordered and always have to search through everything. Still, an in-memory database might be a possibility?
If anyone has a solution it would be great to hear it.
Side question - isn't it possible to have user-entered scripts that my Ruby program runs, so the main Ruby program can run indefinitely? I still don't know anything about user-entered options and how that would work, though.
Use Marshal:
# save an array to a file (binary mode, since Marshal data is binary)
File.open('array', 'wb') { |f| f.write Marshal.dump(my_array) }
# load an array from the file
my_array = File.open('array', 'rb') { |f| Marshal.load(f.read) }
Your OS will keep the file cached between saves and loads, even between runs of separate processes using the data.

Can a watir browser object be re-used in a later Ruby process?

So let's say pretty often a script runs that opens a browser and does web things:
require 'watir-webdriver'
$browser = Watir::Browser.new(:firefox, :profile => "botmode")
=> #<Watir::Browser:0x7fc97b06f558 url="about:blank" title="about:blank">
It could end gracefully with a browser.close, or it could crash sooner and leave behind a memory-hungry Firefox process, unnoticed until these accumulate and slow the server to a crawl.
My question is twofold:
What is a good practice to ensure that, even if the script fails anywhere and exits immediately with an error, the subprocess always gets cleaned up? (I already have lots of short begin-rescue-end blocks peppered around for other unrelated small tests.)
More importantly, can I simply remember this Watir::Browser:0x7fc97b06f558 object address or PID somehow and re-assign it to another $browser variable in a whole new Ruby process, for example irb? That is, can an orphaned browser on webdriver be re-attached later in another program using watir-webdriver on the same machine? From irb I could then re-attach to the browser left behind by the crashed Ruby script, examine the website it was on, and check what went wrong and which elements are different than expected.
Another hugely advantageous use of the latter would be avoiding the overhead of potentially hundreds of browser startups and shutdowns per day; it would be best to keep one alive as a sort of daemon. The first run would attempt to reuse a previous browser object using my specially prepared botmode profile, otherwise create one. Then I would deliberately not call $browser.close at the end of my script. If nothing else, I run an at job to kill the Xvfb :99 display FF runs inside of at the end of the day anyway (giving FF no choice but to die with it, if still running). Yes, I am aware of the Selenium standalone jar, but I'm trying to avoid that Java service footprint too.
Apologies if this is more a basic Ruby question. I just wasn't sure how to phrase it and keep getting irrelevant search results.
I guess you can't just remember the variable from another process. But a solution might be to create a master process that runs your script in a loop in a thread, periodically checking the browser's running state. I'm using something similar in my acceptance tests on Cucumber + Watir. So it would be something like this:
require 'rubygems'
require 'firewatir' # or watir

@browser = FireWatir::Firefox.new

t = Thread.new do
  @browser.goto "http://google.com"
  # call more browser actions here
end

while not_exit?
  if t.stop?
    # error occurred in thread, restart or exit
  end
  unless browser_live?
    # browser was killed for some reason
    # restart or exit
  end
end

@browser.close
not_exit? - can be implemented via a trap for Ctrl+C
browser_live? - you can check whether the Firefox process is still running from the process listings
It is quite tricky, but it might work for you.
You can use DRb like this:
browser pool:
require 'drb'
require 'watir'
browser = Watir::Browser.new :chrome
DRb.start_service 'druby://127.0.0.1:9395', browser
gets
and then from the test script, use this browser:
require 'drb'
browser = DRbObject.new_with_uri 'druby://127.0.0.1:9395'
browser.goto 'stackoverflow.com'
I'm pretty sure that at the point Ruby exits, any handles or pointers to something like a browser object become invalid. So re-using something in a later Ruby process is likely not a good approach. In addition, I might be wrong on this, but it does seem that webdriver is not very good at connecting to an already running browser process. So for your approach to work, it would all need to be wrapped by some master process that calls all the tests etc... and hey, wait a sec, that's starting to sound like a framework, which you might already be using (or perhaps should be) in the first place.
So a better solution is probably to look at whatever framework you are using to run your tests and investigate its capabilities for 'setup/teardown' actions (which can go by different names) that run before and after each test, a group of tests, or all tests. Going this way is good, since most frameworks are designed to let you run any single test or any set of tests you want. And if your tests are well designed, they can be run singly without expecting that the system was left in some perfect state by a prior test. These setup/teardown actions are designed to work that way as well.
As an example, Cucumber has this at the feature level with the idea of a 'Background', which is basically intended as a way to DRY out scenarios by defining common steps to run before each scenario in a feature file (such as navigating to and logging into your site). This could include a call to a series of steps that check whether a browser object exists and, if not, create one. However, you'd need to put that in every feature file, which starts to become rather non-DRY itself.
Fortunately, Cucumber also allows a way to do this in one place via hooks. You can define hooks to run before steps, in the event of specific conditions, 'before' and 'after' each scenario, as well as code that runs once before any scenarios, and code defined to run 'at_exit', where you could close the browser after all scenarios have run.
If I were using Cucumber, I'd look at putting some code in env.rb that runs at the start to create a browser, complemented by at_exit code to close it. Then perhaps also code in a Before hook that checks that the browser is still there and re-creates it if needed, and maybe logout actions in an After hook. Leave things like logging in to the individual scenarios, or to a Background block if all scenarios in a feature log in as the same sort of user.
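A rough env.rb sketch of that layout (the botmode profile is from the question, Browser#exist? comes from watir-webdriver, and the rest is an assumption about your setup):
# features/support/env.rb
require 'watir-webdriver'

def ensure_browser
  if $browser.nil? || !$browser.exist?
    $browser = Watir::Browser.new(:firefox, :profile => "botmode")
  end
  $browser
end

ensure_browser        # create the browser once, before any scenario runs

Before do             # re-create it if a previous scenario killed it
  ensure_browser
end

at_exit do            # close it after all scenarios have run
  $browser.close if $browser && $browser.exist?
end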
Not so much a solution as a workaround for part 1 of my question, using pkill. Posting it here since it turned out to be a lot less trivial than I had hoped.
After the Ruby script exits, its spawned processes (which may no longer belong to the same PID tree, like firefox-bin) have a predictable "session leader", which in my case turned out to be the parent of the Bash shell calling rubyprogram.rb. It is available as $PPID in Bash, for when you have to go higher than $$.
Thus, to really clean up unwanted heavyweight processes, e.g. after a Ruby crash:
#!/bin/bash
# This is the script that wraps on top of Ruby scripts
./ruby_program_using_watirwebdriver_browser.rb myparams & # spawn ruby in background but keep going below:
sleep 11 # give Ruby a chance to launch its web browser
pstree -panu $$ # prints out a process tree starting under Bash, the parent of Ruby. Firefox may not show!
wait # now wait for Ruby to exit or crash
pkill -s $PPID firefox-bin # should only kill firefox-bin's caused above, not elsewhere on the system
# Another way without pkill, will also print out what's getting killed if anything:
awk '$7=="firefox-bin" && $3=="'$PPID'" {print $1}' <(ps x -o pid,pgid,sess,ppid,tty,time,comm) | xargs -rt kill
OPTIONAL
And since I use a dedicated Xvfb Xwindows server just for webdriving on DISPLAY :99, I can also count on xkill:
timeout 1s xwininfo -display :99 -root -all |awk '/("Navigator" "Firefox")/ {print $1}' |xargs -rt xkill -display :99 -id
# the timeout is in case xkill decides to wait for user action, when window id was missing
Just an update on part 2 of my question.
It seems one CAN serialize a Watir::Browser object with YAML, and because it's text-based the contents were quite interesting to me (e.g. some things I've only dreamed of tweaking, hidden inside private members of private classes... but that's a separate topic).
Deserializing from YAML is still trouble. While I haven't tested beyond the first try, it gives me some kind of regexp parse error... not sure what that's about.
(more on that at how to serialize an object using TCPServer inside? )
Meanwhile, even attempting to serialize with Marshal, which is also built into Ruby but stores a binary format, results in a very reasonable-sounding error about not being able to dump a TCPServer object (apparently contained within my Watir::Browser pointed to by $browser).
All in all, I'm not surprised at these results, but I'm still pretty confident there is a way, until Watir arrives at something more native (like PersistentWebdriver, or how it used to be in the days of jssh when you could simply attach to an already running browser with the right extension).
Until then, if serialization + deserialization to a working object gets too thorny, I'll resort to daemonizing a portion of my Ruby to keep objects persistent and spare the frequent and costly setup/teardowns. I did take a gander at some established (unit testing) frameworks, but none seem to fit well within my overall software structure yet - I'm not web testing, after all.

Threaded wget - minimalizing resources

I have a script that gets the GeoIP locations of various IPs. It runs daily, and I expect to have around ~50,000 IPs to look up.
I have a GeoIP system set up - I just want to eliminate having to run wget 50,000 times per report.
What I was thinking is that there must be some way to have wget open a connection to the URL and then pass it the IPs, so it doesn't have to re-establish the connection each time.
Any help will be much appreciated.
If you give wget several addresses at once, with consecutive addresses belonging to the same HTTP/1.1 (Connection: keep-alive) supporting server, wget will re-use the already-established connection.
If there are too many addresses to list on the command line, you can write them to a file and use the -i/--input-file= option (and, per UNIX tradition, -i-/--input-file=- reads standard input).
There is, however, no way to preserve a connection across different wget invocations.
You could also write a threaded Ruby script to run wget on multiple input files simultaneously to speed the process up. So if you have 5 files containing 10,000 addresses each, you could use this script:
#!/usr/bin/ruby

threads = []

for file in ARGV
  threads << Thread.new(file) do |filename|
    system("wget -i #{filename}")
  end
end

threads.each { |thrd| thrd.join }
Each of these threads would use one connection to download all addresses in a file. The following command then means only 5 connections to the server to download all 50,000 files.
./fetch.rb "list1.txt" "list2.txt" "list3.txt" "list4.txt" "list5.txt"
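If you'd rather drop wget entirely, here is a hedged Ruby sketch of the same keep-alive idea; the host and query path are assumptions about your GeoIP service, and 'ips.txt' is a placeholder input file:
require 'net/http'

ips = File.readlines('ips.txt').map(&:strip)

# One persistent HTTP/1.1 connection for all lookups.
Net::HTTP.start('geoip.example.local', 80) do |http|
  ips.each do |ip|
    response = http.get("/lookup?ip=#{ip}")
    puts "#{ip} #{response.body}"
  end
end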
You could also write a small program (in Java or C or whatever) that sends the whole list as a POST request, and the server returns an object with data about them. That shouldn't be too slow either.
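A minimal Ruby sketch of that batch idea (the endpoint and parameter name are assumptions about your service):
require 'net/http'
require 'uri'

# Send all IPs in a single POST and let the server respond with the results.
uri = URI('http://geoip.example.local/batch_lookup')
ips = File.readlines('ips.txt').map(&:strip)
response = Net::HTTP.post_form(uri, 'ips' => ips.join(','))
puts response.body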
