Converting Python script to Ruby (downloading part of a file)

I've been at this for a couple of days and am having no luck at all. Despite reading over these two posts, I can't seem to rewrite this little Python script of mine in Ruby.
clean_link = link['href'].replace(' ', '%20')
mp3file = urllib2.urlopen(clean_link)
output = open('temp.mp3','wb')
output.write(mp3file.read(2000))  # read only the first 2000 bytes
output.close()
I've been looking at using open-uri and net/http to do the same in Ruby, but keep hitting a URL redirect issue. So far I have:
clean_link = link.attributes['href'].gsub(' ', '%20')
link_pieces = clean_link.scan(/http:\/\/(?:www\.)?([^\/]+?)(\/.*?\.mp3)/)
host = link_pieces[0][0]
path = link_pieces[0][1]

Net::HTTP.start(host) do |http|
  resp = http.get(path)
  open("temp.mp3", "wb") do |file|
    file.write(resp.body)
  end
end
Is there a simpler way to do this in Ruby? Also, as with the Python script, is there a way to download only part of the file?
EDIT: progress updated

see here & here
http.request_get('/index.html') {|res|
  size = 0
  res.read_body do |chunk|
    size += chunk.size
    # do some processing
    break if size >= 2000
  end
}
but note that you can't control the chunk sizes here; they depend on how the data arrives from the socket.
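If the server supports range requests, an HTTP Range header is the closest equivalent of the Python mp3file.read(2000) above, since only the requested bytes ever cross the wire. A minimal sketch (the URL is a placeholder, not from the original question):
require 'net/http'
require 'uri'

url = URI.parse('http://example.com/song.mp3')
Net::HTTP.start(url.host, url.port) do |http|
  # Ask for just the first 2000 bytes; a server that supports ranges
  # answers with 206 Partial Content and only that slice of the file.
  resp = http.get(url.path, 'Range' => 'bytes=0-1999')
  open('temp.mp3', 'wb') { |file| file.write(resp.body) }
end
A server that ignores the header replies 200 with the whole file, so it's worth checking resp.code before assuming you got a partial body.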

Related

How to post (http-post) content of a PDF using Ruby?

I am trying to post the (raw) content of a PDF in Ruby using the following block:
require 'pdf/reader'
require 'curb'
reader = PDF::Reader.new('folder/file.pdf')
raw_string = ''
reader.pages.each do |page|
  raw_string = raw_string + page.raw_content.to_s
end
c = Curl::Easy.new('http://0.0.0.0:4567/pdf_upload')
c.http_post(Curl::PostField.content('param1', 'value1'),Curl::PostField.content('param2', 'value2'), c.http_post(Curl::PostField.content('body', raw_string)))
Inside the API implementation, params[:body] seems to be empty all the time (though puts raw_string confirms that the variable has all the values).
Also, is there a better way to post PDF content?
Regarding how you're building raw_string...
Instead of:
reader.pages.each do |page|
  raw_string = raw_string + page.raw_content.to_s
end
You should be able to do something like one of these:
raw_string = reader.pages.map(&:raw_content).join
raw_string = reader.pages.map{ |p| p.raw_content.to_s }.join
I'd also recommend you write your last line spread across several lines, for clarity and readability:
c.http_post(
  Curl::PostField.content('param1', 'value1'),
  Curl::PostField.content('param2', 'value2'),
  c.http_post(Curl::PostField.content('body', raw_string))
)
Spread out like this, it's much easier to spot that the third argument is itself a call to c.http_post: that nested call fires a second POST first and passes its return value along as an argument, which would explain why params[:body] arrives empty.
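A corrected version would hand all three fields to a single http_post call; a sketch, reusing the question's URL and field names:
c = Curl::Easy.new('http://0.0.0.0:4567/pdf_upload')
c.http_post(
  Curl::PostField.content('param1', 'value1'),
  Curl::PostField.content('param2', 'value2'),
  Curl::PostField.content('body', raw_string)  # one POST, PDF content as an ordinary field
)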

How do I avoid EOFError with a Ruby script?

I have a Ruby script (1.9.2p290) where I am trying to call a number of URLs, and then append information from those URLs into a file. The issue is that I keep getting an end of file error - EOFError. An example of what I'm trying to do is:
require "open-uri"
proxy_uri = URI.parse("http://IP:PORT")
somefile = File.open("outputlist.txt", 'a')
output = []
(1..100).each do |num|
  page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end
somefile.puts output
somefile.close
I don't know why I keep getting this end-of-file error or how to avoid it. I think it might have something to do with the URL I'm calling (based on some dialogue here: What is an EOFError in Ruby file I/O?), but I'm not sure why that would affect the I/O or cause an end-of-file error.
Any thoughts on what I might be doing wrong here or how I can get this to work?
Thanks in advance!
The way you are writing your file isn't idiomatic Ruby. This should work better:
output = []
(1..100).each do |num|
  page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
  pattern = "<img"
  tags = page.scan(pattern)
  output << tags.length
end

File.open("outputlist.txt", 'a') do |fo|
  fo.puts output
end
I suspect the file is being closed because it's been opened and then not written to while 100 pages are processed. If that takes a while, I can see why it might get closed to avoid apps tying up file handles. Writing it the Ruby way automatically closes the file immediately after the write, avoiding holding a handle open artificially.
As a secondary thing, rather than using a simple pattern match to locate image tags, use a real HTML parser. There will be little difference in processing speed, but potentially much more accuracy.
Replace:
page = open('SOMEURL' + num.to_s, :proxy => proxy_uri).read
pattern = "<img"
tags = page.scan(pattern)
output << tags.length
with:
require 'nokogiri'
doc = Nokogiri::HTML(open('SOMEURL' + num.to_s, :proxy => proxy_uri))
output << doc.search('img').size
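As for the EOFError itself, it generally means the server or proxy dropped the connection mid-response, so a rescue-and-retry around the fetch is a common workaround. A minimal sketch (the helper name and retry count are mine, not from the original):
def fetch_with_retries(url, proxy_uri, attempts = 3)
  open(url, :proxy => proxy_uri).read
rescue EOFError, Errno::ECONNRESET
  attempts -= 1
  retry if attempts > 0  # re-run the fetch on a dropped connection
  raise                  # give up once the attempts are exhausted
end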

Resuming file downloads in Ruby, range header issue

When setting a range header in Ruby 1.8.7, an additional "X-REMOVED: Range" header is being added, which (seemingly) prevents download resumes from working.
size = File.size(local_file)
Net::HTTP.start(domain) do |http|
  headers = {
    'Range' => "bytes=#{size}-"
  }
  resp = http.get(remote_file, headers)
  open(local_file, "wb") do |file|
    file.write(resp.body)
  end
end
Header sent:
GET /test.zip HTTP/1.1..Host: 192.168.50.1..Accept: */*..X-REMOVED: Range..Range: bytes=481-....
I've also tried using set_range with the same result.
Well, this is embarrassing. The resume not working had nothing to do with the range header: I was simply opening the file with "wb" instead of "ab".
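For reference, the fix is a one-character change to the file mode, so the ranged response gets appended to the bytes already on disk:
size = File.size(local_file)
Net::HTTP.start(domain) do |http|
  resp = http.get(remote_file, 'Range' => "bytes=#{size}-")
  # "ab" appends the partial response; "wb" would truncate the file first
  open(local_file, "ab") do |file|
    file.write(resp.body)
  end
end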

How to decompress a Gzip string in Ruby?

Zlib::GzipReader can take "an IO, or IO-like, object." as its input, as stated in the docs.
Zlib::GzipReader.open('hoge.gz') {|gz|
  print gz.read
}

File.open('hoge.gz') do |f|
  gz = Zlib::GzipReader.new(f)
  print gz.read
  gz.close
end
How should I ungzip a string?
The above method didn't work for me.
I kept getting an incorrect header check (Zlib::DataError) error. Apparently Zlib assumes your data has a header by default, which may not always be the case.
The workaround that I implemented was:
require 'zlib'
require 'stringio'
gz = Zlib::GzipReader.new(StringIO.new(resp.body.to_s))
uncompressed_string = gz.read
Zlib by default assumes that your compressed data contains a header.
If your data does NOT contain a header, it will fail by raising a Zlib::DataError.
You can tell Zlib to assume the data has no header via the following workaround:
def inflate(string)
  zstream = Zlib::Inflate.new(-Zlib::MAX_WBITS)
  buf = zstream.inflate(string)
  zstream.finish
  zstream.close
  buf
end
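For example, data compressed as a raw deflate stream (no header) round-trips through this helper; a quick sketch to show what the negative window-bits value buys you:
require 'zlib'

# Negative window bits on the deflater likewise mean "emit no header".
raw = Zlib::Deflate.new(Zlib::DEFAULT_COMPRESSION, -Zlib::MAX_WBITS).deflate("Hello, Zlib!", Zlib::FINISH)

inflate(raw)  # => "Hello, Zlib!"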
You need Zlib::Inflate for decompression of a string and Zlib::Deflate for compression
def inflate(string)
  zstream = Zlib::Inflate.new
  buf = zstream.inflate(string)
  zstream.finish
  zstream.close
  buf
end
In Rails you can use:
ActiveSupport::Gzip.compress("my string")
ActiveSupport::Gzip.decompress(compressed_string)
For gzipped data (which carries a gzip header), use:
zstream = Zlib::Inflate.new(16 + Zlib::MAX_WBITS)
Using (-Zlib::MAX_WBITS) I got "invalid code lengths set" and "invalid block type" errors. The following works for me, too:
Zlib::GzipReader.new(StringIO.new(response_body)).read
I used the answer above with Zlib::Deflate.
I kept getting broken files (for small files), and it took many hours to figure out that the problem can be fixed using:
buf = zstream.deflate(string, Zlib::FINISH)
without the zstream.finish line!
def self.deflate(string)
  zstream = Zlib::Deflate.new
  buf = zstream.deflate(string, Zlib::FINISH)
  zstream.close
  buf
end
To gunzip content, use the following code (tested on 1.9.2):
Zlib::GzipReader.new(StringIO.new(content), :external_encoding => content.encoding).read
Beware of encoding problems
We don't need any extra parameters these days. There are deflate and inflate class methods which allow for quick one-liners like these:
>> data = "Hello, Zlib!"
>> compressed = Zlib::Deflate.deflate(data)
=> "x\234\363H\315\311\311\327Q\210\312\311LR\004\000\032\305\003\363"
>> uncompressed = Zlib::Inflate.inflate(compressed)
=> "Hello, Zlib!"
I think it answers the question "How should I ungzip a string?" the best. :)
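One caveat: those class methods produce and consume zlib-wrapped data, not actual gzip. On Ruby 2.5 or newer (well after this thread) there are also Zlib.gzip and Zlib.gunzip module methods for real gzip strings; a minimal sketch:
require 'zlib'

gz = Zlib.gzip("Hello, Zlib!")  # real gzip data, header and all
Zlib.gunzip(gz)                 # => "Hello, Zlib!"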

Download image with Ruby RIO gem

My code:
require 'rio'
rio('nice.jpg') < rio('http://farm4.static.flickr.com/3134/3160515898_59354c9733.jpg?v=0')
But the downloaded image is corrupted. What is wrong with this solution?
pjb3 is correct. You must call binmode on the left-hand term:
rio('nice.jpg').binmode < rio('http://...')
If this still does not work (notably, it may happen for large JPEG files, since rio uses an intermediate temp file when retrieving from the URL you have provided), then apply the binmode modifier to both terms:
rio('nice.jpg').binmode < rio('http://...').binmode
2011 UPDATE
According to Luke C., the above answer no longer applies to more recent versions of the gem:
Neither of these works. On Linux, having .binmode set on the destination causes an Errno::ENOENT exception. Doing rio('nice.jpg') < rio('http://...').binmode works.
It works for me. Are you on Windows? It might be because the file isn't being opened with the binary flag.
I had similar problems downloading images on Linux; I found that this worked for me:
rio(source_url).binmode > rio(filename)
Here is some simple Ruby code to download an image:
require 'net/http'

url = URI.parse("http://www.somedomain.com/image.jpg")
Net::HTTP.start(url.host, url.port) do |http|
  resp, data = http.get(url.path, nil)
  open(File.join(File.dirname(__FILE__), "image.jpg"), "wb") { |file| file.write(resp.body) }
end
This can even be extended to follow redirects:
require 'net/http'

url = URI.parse("http://www.somedomain.com/image.jpg")
Net::HTTP.start(url.host, url.port) do |http|
  resp, data = http.get(url.path, nil)
  prev_redirect = ''
  while resp.header['location']
    raise "Recursive redirect: #{resp.header['location']}" if prev_redirect == resp.header['location']
    prev_redirect = resp.header['location']
    url = URI.parse(resp.header['location'])
    host = url.host if url.host
    port = url.port if url.port
    http = Net::HTTP.new(host, port)
    resp, data = http.get(url.path, nil)
  end
  open(File.join(File.dirname(__FILE__), "image.jpg"), "wb") { |file| file.write(resp.body) }
end
It can probably be prettied up some, but it gets the job done, and is not dependent on any third-party gems! :)
I guess this is a bug. On Windows every 0x0A is replaced with 0x0D 0x0A, so it makes sense that, used properly (with .binmode), it works on Linux.
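If RIO keeps fighting you, plain open-uri with a binary-mode file write sidesteps the newline-translation problem entirely; a minimal sketch (not from the original answers):
require 'open-uri'

# "wb" opens the destination in binary mode, so 0x0A bytes are written untouched.
File.open('nice.jpg', 'wb') do |file|
  file.write(open('http://farm4.static.flickr.com/3134/3160515898_59354c9733.jpg?v=0').read)
end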
For downloading pictures from a web page, you can use the Ruby gem image_downloader.
