How to calculate the amount of data downloaded and the total data to be downloaded in Ruby

I'm trying to build a desktop client in Ruby that manages some downloads. I would like to know how to determine how much of the data has been downloaded and the total size of the data to be downloaded.
I'm trying to do this with Ruby, so any help would be useful.
Thanks in advance.

Like Wayne said in his comment, it depends on the protocol used to transfer the files. With HTTP, for example, the HTTP response will include a Content-Length header that tells you the length of the file you are downloading. Once you know that, you just have to keep track of the number of bytes you've read from the HTTP connection.
Something like this seems to work (for HTTP), but I wouldn't be surprised if it could be done more elegantly:
require 'net/http'

url = URI.parse('http://www.google.com/index.html')
req = Net::HTTP::Get.new(url.path)

Net::HTTP.start(url.host, url.port) do |http|
  http.request(req) do |res|
    # Content-Length gives us the total size up front
    remaining = res.content_length
    puts "total length: #{remaining}"
    # read_body yields the body in segments as they arrive
    res.read_body do |segment|
      puts "read #{segment.length} bytes"
      remaining -= segment.length
      puts "#{remaining} bytes remaining"
    end
  end
end
www.google.com/index.html is a bad example since the content gets returned in one segment, but try it on a larger object and you should see multiple "read..." lines.

If you're using Net::HTTP then the length of whatever you're requesting should be in the response header. Net::HTTP responses mix in Net::HTTPHeader, where you'll find content_length(). Note that it only works if the size is determined before the transfer happens.
Net::HTTPResponse has a method that reads the body in chunks, so you can use that to track progress: start at 0, add the length of each chunk as it arrives, and compare the running total to the total size.
# Example taken from the Ruby documentation
http.request_get('/index.html') { |res|
  res.read_body do |segment|
    print segment
  end
}
If you're using FTP then it should be easier with Net::FTP. Connect to the server, get the size of a given file with size(filename), and then download the file with get, getbinaryfile, or gettextfile.
This is the signature of the get method: get(remotefile, localfile = File.basename(remotefile), blocksize = DEFAULT_BLOCKSIZE) {|data| ...}
ftp.get('file.something', 'file.something.local', 1024) { |data|
  puts "Downloaded 1024 more bytes"
}
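Putting size and getbinaryfile together gives you running progress. A minimal sketch; the host, credentials, and filenames are placeholders:

require 'net/ftp'

Net::FTP.open('ftp.example.com') do |ftp|
  ftp.login('user', 'password')
  total = ftp.size('file.something')  # total bytes, known before the transfer
  downloaded = 0
  ftp.getbinaryfile('file.something', 'file.something.local', 1024) do |data|
    downloaded += data.size
    puts "#{downloaded} of #{total} bytes (#{(100.0 * downloaded / total).round(1)}%)"
  end
end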

Related

Trying to use open-uri in ruby, some HTML contents are coming in as "Loading..."

I am trying to create a program that compares a specific thing on a webpage at two different times, and I'm currently working on extracting the piece of information that will change. The text that changes appears if I inspect the element in the page, but if I use open-uri it comes in as "Loading..." (see picture). Is there a way to get all of the HTML text?
Picture here.
This is the current code I have
require 'open-uri'

contents = open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)

File.open("testing.txt", "w") do |line|
  line.puts "\r" + "#{contents}"
end
Any help getting the "Loading..." to change into the actual HTML content would be amazing.
Thanks
The problem
So, open-uri just makes HTTP requests and gives you access to the body. In this case, the body is HTML. That HTML has a placeholder for this data, which is what you're seeing. The HTML also loads some JavaScript that makes another request to the server to get the data, and when the data comes in, it replaces the placeholder with the real data. So, to handle this, you ultimately need whatever is coming back from the request the JavaScript is making.
Three solutions
Ordered from my least favourite to my most favourite.
You can try to evaluate the JavaScript to have it operate on the HTML. This is going to be painful, so I don't recommend it, but if you want to go down that path, there's a gem called therubyracer (IIRC, it wraps V8).
You can launch a web browser, let the browser handle all the JavaScript, and then ask the browser for the HTML after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavy and you're relegated to information displayed in the HTML. This is called "scraping", and it's pretty fragile (a designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of little parsing things.
You can open your browser's devtools, go to the Network tab, filter to the XHR requests, and reload the page. One of these made the request that fetched the data used to fill in the placeholder. Figure out which one it is and then you can make that request yourself. There are ways this can be fragile, too: sometimes you have to have the right cookies, and you often have to experiment to figure out how much of what the browser sent you actually need (usually it's way less than was sent, which is true for your case). Protip: when you do this, separate requesting the data from parsing and exploring it, i.e. save it to a file and then, while looking through the data, read it from the file rather than making a new request every time. That way it won't change on you and you won't get rate limited.
Solution #3
So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably, check it out:
require 'uri'
require 'net/http'

# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)

# set some headers
req['origin']        = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache'                 # no caching, just in case,
req['pragma']        = 'no-cache'                 # we prob don't want stale data

# looks like you can pass it an awful lot of filters to use
req.set_form_data(
  "page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
  "distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
  "entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
  "minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
  "filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
  "displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
  "inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
  "daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
  "minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
  "minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
  "maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)

# make the request (a 200 response code means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"

# parse the response
require 'json'
json = JSON.parse res.body

# we're on page 1 of 1, and there are 48 results on this page
json['page']             # => 1
json['listings'].size    # => 48
json['remainingResults'] # => false

# apparently we're looking at some sort of car or smth
json['modelId']   # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"

# a bunch of places sell this car
json['sellers'].size           # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"

# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl']        # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString']   # => "$8,972"
listing['priceString']           # => "$6,890"
listing['daysOnMarket']          # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear']               # => 2006
listing['mileageString']         # => "81,803"

# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] }   # => 0
json['listings'].count { |l| l['salvage'] } # => 0

# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
#     ["Fair Deal", 11],
#     ["No Price Analysis", 23],
#     ["High Price", 8],
#     ["Overpriced", 2]]
Your web page makes an AJAX request, and open-uri only returns the server-side page; it does not wait for the AJAX request to finish.
You can use the code below, which waits for the page to load:
# load the libraries
require 'watir'

browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"

# give the website some time to load
sleep 2

puts browser.html
NOTE: you need chromedriver to use the script: http://chromedriver.chromium.org/downloads
If you don't want to open the URL in a visible browser, you can run a headless browser instead, as sketched below.
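A minimal sketch of the headless variant, assuming a recent Watir with Chrome and chromedriver available (the headless: true option passes --headless to Chrome):

require 'watir'

# same flow as above, but Chrome never opens a window
browser = Watir::Browser.new :chrome, headless: true
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
sleep 2 # give the AJAX request time to finish

puts browser.html
browser.close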

why is content_length in Net::HTTP.get_response sometimes nil even on good results?

I have the following Ruby code (I was trying to write a simple HTTP ping):
require 'net/http'

res1 = Net::HTTP.get_response 'www.google.com', '/'
res2 = Net::HTTP.get_response 'www.google.com', '/search?q=abc'

res1.code           # "200"
res2.code           # "200"
res1.content_length # 5213
res2.content_length # nil  <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< WHY?

res2.body[0..60]
# => "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
Why does res2.content_length not show through? Is it in some other attribute of res2 (and how does one see those)?
I am a newcomer to Ruby, using irb 0.9.6 on AWS Linux.
Thanks a lot.
It appears that the value returned is not necessarily the length of the body, but rather the fixed length of the content when that length is known in advance and stored in the Content-Length header.
See the source for the implementation of HTTPHeader#content_length (taken from http://ruby-doc.org/stdlib-2.3.1/libdoc/net/http/rdoc/Net/HTTPHeader.html):
# File net/http/header.rb, line 262
def content_length
  return nil unless key?('Content-Length')
  len = self['Content-Length'].slice(/\d+/) or
    raise Net::HTTPHeaderSyntaxError, 'wrong Content-Length format'
  len.to_i
end
What this most likely means in this case is that the response was sent with chunked transfer encoding (Transfer-Encoding: chunked), in which case no Content-Length header is sent.
What you most likely want in this case is body.length, since that's the only reliable way to tell the actual length of the body when no Content-Length header is present.
Note that there may be performance implications in always using body.length to find the content length; you may choose to try the content_length approach first and, if it's nil, fall back to body.length.
Here's an example modification to your code:
require 'net/http'

res1 = Net::HTTP.get_response 'www.google.com', '/'
res2 = Net::HTTP.get_response 'www.google.com', '/search?q=abc'

res1.code           # "200"
res2.code           # "200"
res1.content_length # 5213

# fall back to body.length when the header is missing
res2.content_length.nil? ? res2.body.length : res2.content_length # 57315  <<<<<<<<<<<<<<< works now

res2.body[0..60]
# => "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
or, better yet, capture the content_length and use the captured value for comparison:
res2_content_length = res2.content_length
if res2_content_length.nil?
  res2_content_length = res2.body.length
end
Personally, I'd just stick with always checking body.length and deal with any potential performance issue if and when it arises.
This should reliably retrieve the actual length of the content for you, regardless of whether the response included a Content-Length header.
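Condensed into a helper (a minimal sketch; the method name response_length is mine):

# returns the header value when present, otherwise falls back to the body
def response_length(res)
  res.content_length || res.body.length
end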

Ruby GPGME - How to encrypt large files

I'm having difficulty encrypting large files (bigger than available memory) using GPGME in Ruby.
#!/usr/bin/ruby
require 'gpgme'

def gpgfile(localfile)
  crypto = GPGME::Crypto.new
  filebasename = File.basename(localfile)
  filecripted = crypto.encrypt File.read(localfile), :recipients => "info#address.com", :always_trust => true
  File.open("#{localfile}.gpg", 'w') { |file| file.write(filecripted) }
end

gpgfile("/home/largefile.data")
In this case I get a memory-allocation error:
"read: failed to allocate memory (NoMemoryError)"
Can someone explain how to read the source file chunk by chunk (of 100 MB, for example) and write each chunk out through the encryption?
The most obvious problem is that you're reading the entire file into memory with File.read(localfile). The Crypto#encrypt method will take an IO object as its input, so instead of File.read(localfile) (which returns the contents of the file as a string) you can pass it a File object. Likewise, you can give an IO object as the :output option, letting you write the output directly to a file instead of in memory:
def gpgfile(localfile)
  infile  = File.open(localfile, 'r')
  outfile = File.open("#{localfile}.gpg", 'w')
  crypto = GPGME::Crypto.new
  crypto.encrypt(infile, recipients: "info#address.com",
                         output: outfile,
                         always_trust: true)
ensure
  # guard against File.open having raised before assignment
  infile.close if infile
  outfile.close if outfile
end
I've never used ruby-gpgme, so I'm not 100% sure this will solve your problem since it depends a bit on what ruby-gpgme does behind the scenes, but from the docs and the source I've peeked at it seems like a sanely-built gem so I'm guessing this will do the trick.
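For completeness, the mirror-image streaming decrypt would look much the same; this is a sketch under the same assumptions, and the method name gpg_decrypt_file is mine:

require 'gpgme'

# both input and output are IO objects, so the whole file never
# has to fit in memory
def gpg_decrypt_file(localfile)
  infile  = File.open(localfile, 'r')
  outfile = File.open(localfile.sub(/\.gpg\z/, ''), 'w')
  GPGME::Crypto.new.decrypt(infile, output: outfile)
ensure
  infile.close if infile
  outfile.close if outfile
end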

Is it possible to get the headers of a request using Ruby's HTTPClient gem before the request completes?

I have a requirement to proxy a request in a Rails app. I was hoping I could proxy it with chunking (so, one chunk received, one chunk sent). The app works fine without chunking (load the whole request into memory, then transmit).
Here is my code to proxy the chunks through to the end-client:
self.response.headers['Last-Modified'] = Time.now.ctime.to_s
self.response_body = Enumerator.new do |y|
  client = HTTPClient.new
  http_response = client.get(proxy_url, nil, headers) do |chunk|
    y << chunk
  end
end
The problem is, I can't inspect http_response until all the chunks have been received, so I can't set the headers based on the headers returned by the client.
What I'm trying to do is transmit the headers returned from the client before the first chunk is sent. Is this possible?
If not, is this pattern possible in any other Ruby HTTP client gem?
Update
I have a solution for you.
If you call get_async instead, it will return immediately with an HTTPClient::Connection object that is updated with the header information as soon as it is received. The code sample below demonstrates this.
The patch to HTTPClient::Connection is almost certainly not necessary for you, but it lets you write things like conn.queue.size and conn.queue.empty?.
conn.pop blocks until the response (or exception) has been pushed onto the queue by the async thread and then returns the normal HTTP::Message object. (Note that, if you are using the monkey patch, you can use conn.queue.empty? to see if pop is going to block.)
resp.content returns an IO object which is the read end of a pipe, and it can be called as soon as pop has returned. The other end of the pipe is written by the async thread as the data arrives, and you can read the entire content in one go or in whatever size chunks you like using read.
require 'httpclient'

class HTTPClient::Connection
  attr_reader :queue
end

client = HTTPClient.new
conn = client.get_async 'http://en.wikipedia.org/wiki/Ruby_(programming_language)'

# blocks until the headers have arrived
resp = conn.pop
resp.header.all.each { |name, val| puts "#{name}=#{val}" }
puts

# read the body in chunks as the async thread writes it
pipe = resp.content
while chunk = pipe.read(8192)
  print chunk
end
You could parse the first chunk you receive to extract the headers, but I suggest you call head first to get the header information, then do the get as well.
(Update: the first chunk holds the beginning of the content, so parsing headers out of it won't work. A head-then-get sketch follows.)
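A minimal sketch of the head-then-get approach (URL reused from the sample above). Note that the two requests are independent, so a server could in principle answer them with different headers:

require 'httpclient'

client = HTTPClient.new
url = 'http://en.wikipedia.org/wiki/Ruby_(programming_language)'

# fetch only the headers first...
head = client.head(url)
head.header.all.each { |name, val| puts "#{name}=#{val}" }

# ...then stream the body in chunks
client.get(url) do |chunk|
  print chunk
end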

Zlib inflate error

I am trying to save compressed strings to a file and load them later for use in the game. I kept getting "in 'finish': buffer error" errors when loading the data back, so I came up with this:
require "zlib"
def deflate(string)
zipper = Zlib::Deflate.new
data = zipper.deflate(string, Zlib::FINISH)
end
def inflate(string)
zstream = Zlib::Inflate.new
buf = zstream.inflate(string)
zstream.finish
zstream.close
buf
end
setting = ["nothing", "nada", "nope"]
taggedskills = ["nothing", "nada", "nope", "nuhuh"]

File.open('testzip.txt', 'wb') do |w|
  w.write(deflate("hello world") + "\n")
  w.write(deflate("goodbye world") + "\n")
  w.write(deflate("etc") + "\n")
  w.write(deflate("etc") + "\n")
  w.write(deflate("Setting: name " + setting[0] + " set" + (setting[1].class == String ? "str" : "num") + " " + setting[1].to_s) + "\n")
  w.write(deflate("Taggedskill: " + taggedskills[0] + " " + taggedskills[1] + " " + taggedskills[2] + " " + taggedskills[3]) + "\n")
  w.write(deflate("etc") + "\n")
end

File.open('testzip.txt', 'rb') do |file|
  file.each do |line|
    p inflate(line)
  end
end
It was throwing errors at the "Taggedskill:" line. I don't know what it is, but changing it to "Skilltag:", "Skillt:", etc. still throws a buffer error, while things like "Setting:" or "Thing:" work fine, and changing the setting line to "Taggedskill:" continues to work fine. What is going on here?
In testzip.txt, you are storing newline-separated binary blobs. However, binary blobs may themselves contain newlines, so when you open testzip.txt and split it by line, you may end up splitting one binary blob that inflate would understand into two binary blobs that it does not understand.
Try running wc -l testzip.txt after you get the error. You'll see the file contains one more line than the number of lines you put in.
What you need to do is compress the whole file at once, not line by line, as sketched below.
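A minimal sketch of that approach, swapping in Zlib's gzip wrappers for the raw Deflate/Inflate streams (the file name testzip.gz is arbitrary):

require 'zlib'

lines = ["hello world", "goodbye world", "etc"]

# compress the entire payload in one shot
Zlib::GzipWriter.open('testzip.gz') do |gz|
  gz.write(lines.join("\n"))
end

# inflate it all at once when loading, then split back into lines
contents = Zlib::GzipReader.open('testzip.gz') { |gz| gz.read }
p contents.split("\n") # => ["hello world", "goodbye world", "etc"]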
