why is content_length in Net::HTTP.get_response sometimes nil even on good results? - ruby

I have the following ruby code (was trying to write a simple http-ping)
require 'net/http'
res1 = Net::HTTP.get_response 'www.google.com' , '/'
res2 = Net::HTTP.get_response 'www.google.com' , '/search?q=abc'
res1.code #200
res2.code #200
res1.content_length #5213
res2.content_length #nil **<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< WHY**
res2.body[0..60]
=> "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
Why does res2's content_length not show up? Is it in some other attribute of res2 (and how does one see those)?
I am a newcomer at ruby. Using irb 0.9.6 on AWS Linux
Thanks a lot.

It appears that the value returned is not necessarily the length of the body, but the fixed length of the content, when that fixed length is known in advance and stored in the content-length header.
See the source for the implementation of HTTPHeader#content_length (taken from http://ruby-doc.org/stdlib-2.3.1/libdoc/net/http/rdoc/Net/HTTPHeader.html):
# File net/http/header.rb, line 262
def content_length
  return nil unless key?('Content-Length')
  len = self['Content-Length'].slice(/\d+/) or
    raise Net::HTTPHeaderSyntaxError, 'wrong Content-Length format'
  len.to_i
end
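What this most likely means here is that the server sent the response without a Content-Length header. That is what happens when the body is streamed with chunked transfer encoding (Transfer-Encoding: chunked): the total size is not known when the headers go out, so there is nothing for content_length to report.
What you most likely want in this case is body.length, since for a response without a Content-Length header the only way to learn the actual length is to look at the body itself. (If you need a size in bytes rather than characters, body.bytesize is the safer choice.)
Note that there may be performance implications to always using body.length to find the content length; you may choose to try content_length first and, if it's nil, fall back to body.length.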
Here's an example modification to your code:
require 'net/http'
res1 = Net::HTTP.get_response 'www.google.com' , '/'
res2 = Net::HTTP.get_response 'www.google.com' , '/search?q=abc'
res1.code #200
res2.code #200
res1.content_length #5213
res2.content_length.nil? ? res2.body.length : res2.content_length #57315 **<<<<<<<<<<<<<<< Works now **
res2.body[0..60]
=> "<!doctype html><html itemscope=\"\" itemtype=\"http://schema.org"
or, better yet, capture the content_length and use the captured value for comparison:
res2_content_length = res2.content_length
if res2_content_length.nil?
res2_content_length = res2.body.length
end
Personally, I'd just stick with always checking body.length and deal with any potential performance issue if and when it arises.
This should reliably retrieve the actual length of the content for you, regardless of whether or not the server declared a Content-Length.
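To make the fallback reusable, and to answer the "how does one see those?" part of the question, here is a minimal sketch. The helper names payload_size, header_lines and ping are made up for illustration; content_length, chunked? and to_hash are existing Net::HTTPHeader methods that every response object responds to.

```ruby
require 'net/http'

# Returns the declared Content-Length when the header is present,
# otherwise the byte size of the body that was actually received.
def payload_size(res)
  res.content_length || res.body.to_s.bytesize
end

# Lists every response header as a "name: value" string
# (header names come back lower-cased).
def header_lines(res)
  res.to_hash.map { |name, values| "#{name}: #{values.join(', ')}" }
end

# Hypothetical helper for the http-ping use case.
# It performs a real request when called.
def ping(host, path = '/')
  res = Net::HTTP.get_response(host, path)
  { code: res.code, size: payload_size(res), chunked: res.chunked? }
end
```

Calling ping('www.google.com', '/search?q=abc') should then report a size even when content_length is nil, and the chunked flag shows whether the server used chunked transfer encoding.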

Related

Trying to use open-uri in ruby, some HTML contents are coming in as "Loading..."

I am trying to create a program to compare a specific thing on a webpage, and then compare it another time, I'm currently working on getting the piece of information that will change. But, the text that would change appears if I inspect element in the page, but not if I use open-uri, it comes in as "Loading..." (see picture), is there a way to get all the HTML text?
Picture here.
This is the current code I have
require 'open-uri'
# URI.open is the open-uri entry point on current Ruby versions
contents = URI.open('https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841', &:read)
File.open("testing.txt", "w") do |file|
  file.puts "\r" + contents
end
Any help to get the Loading... to change to the actual HTML code would be amazing.
Thanks
The problem
So, open-uri just makes HTTP requests and gives you access to the body. In this case, the body is html. That html has a placeholder for this data, which is what you're seeing. Then that html says to load up some javascript that will make another request to the server to get the data, and when the data comes in, it will replace the placeholder with the real data. So, to handle this, you ultimately need whatever is coming back from that request the javascript is making.
Three solutions
Ordered from my least favourite to my most favourite:
1. You can try to evaluate the JavaScript to have it operate on the html. This is going to be painful, so I don't recommend it, but if you wanted to go down that path, I think there's a gem called "therubyracer" or something (IIRC, it wraps v8).
2. You can launch a web browser, let the browser handle all the cray cray, and then ask the browser for the html after it's been updated. This is what Rahul's solution does, and it's a really nice solution. It's not my favourite because it's pretty heavy and you're limited to the information displayed in the html. This is called "scraping", and it's pretty fragile (some designer moves something around the page and your script breaks), and the information is in human presentation format, which means you usually have to do a lot of little parsing things.
3. You can open your browser's devtools, go to the network tab, filter to the XHR requests, and reload the page. One of these made the request to get the data that was used to fill in the placeholder. Figure out which one it is and then you can make that request yourself. There are ways this can be fragile, too, e.g. sometimes you have to have the right cookies, and you often have to experiment with what the browser sent to figure out how much of it you need (usually it's way less than was sent, which is true for your case). Protip: when you do this, separate requesting the data from parsing and exploring it (i.e. save it to a file and then, while looking through the data, get it from the file rather than making a new request every time; this way it won't change on you and you won't get rate limited).
Solution #3
So, I was curious and went ahead and tried solution number 3 myself, and it worked pretty admirably, check it out:
require 'uri'
require 'net/http'
# build a post request to the URL that the page got the data from
uri = URI 'https://www.cargurus.com/Cars/inventorylisting/ajaxFetchSubsetInventoryListing.action?sourceContext=untrackedExternal_true_0'
req = Net::HTTP::Post.new(uri)
# set some headers
req['origin'] = 'https://www.cargurus.com' # for cross origin requests
req['cache-control'] = 'no-cache' # no caching, just in case,
req['pragma'] = 'no-cache' # we prob don't want stale data
# looks like you can pass it an awful lot of filters to use
req.set_form_data(
"page"=>"1", "zip"=>"", "address"=>"", "latitude"=>"", "longitude"=>"",
"distance"=>"100", "selectedEntity"=>"d841", "transmission"=>"ANY",
"entitySelectingHelper.selectedEntity2"=>"", "minPrice"=>"", "maxPrice"=>"",
"minMileage"=>"", "maxMileage"=>"", "bodyTypeGroup"=>"", "serviceProvider"=>"",
"filterBySourcesString"=>"", "filterFeaturedBySourcesString"=>"",
"displayFeaturedListings"=>"true", "searchSeoPageType"=>"",
"inventorySearchWidgetType"=>"AUTO", "allYearsForTrimName"=>"false",
"daysOnMarketMin"=>"", "daysOnMarketMax"=>"", "vehicleDamageCategoriesRaw"=>"",
"minCo2Emission"=>"", "maxCo2Emission"=>"", "vatOnly"=>"false",
"minEngineDisplacement"=>"", "maxEngineDisplacement"=>"", "minMpg"=>"",
"maxMpg"=>"", "minEnginePower"=>"", "maxEnginePower"=>"", "isRecentSearchView"=>"false"
)
# make the request (200 means it worked)
res = Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request req }
res.code # => "200"
# parse the response
require 'json'
json = JSON.parse res.body
# we're on page 1 of 1, and there are 48 results on this page
json['page'] # => 1
json['listings'].size # => 48
json['remainingResults'] # => false
# apparently we're looking at some sort of car or smth
json['modelId'] # => "d841"
json['modelName'] # => "Mazda MAZDASPEED6"
# a bunch of places sell this car
json['sellers'].size # => 47
json['sellers'][0]['location'] # => "Portland OR, 97217"
# the first of our 48 cars seems to be a deal
listing = json['listings'][0]
listing['mainPictureUrl'] # => "https://static.cargurus.com/images/forsale/2018/05/24/02/58/2006_mazda_mazdaspeed6-pic-61663369386257285-152x114.jpeg"
listing['expectedPriceString'] # => "$8,972"
listing['priceString'] # => "$6,890"
listing['daysOnMarket'] # => 61
listing['savingsRecommendation'] # => "Good Deal"
listing['carYear'] # => 2006
listing['mileageString'] # => "81,803"
# none of the 48 are salvaged or lemons
json['listings'].count { |l| l['lemon'] } # => 0
json['listings'].count { |l| l['salvage'] } # => 0
# the savings recommendations seem reasonably distributed
json['listings'].group_by { |l| l["savingsRecommendation"] }.map { |rec, ls| [rec, ls.size] }
# => [["Good Deal", 4],
# ["Fair Deal", 11],
# ["No Price Analysis", 23],
# ["High Price", 8],
# ["Overpriced", 2]]
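The protip about separating the request from the exploring can be sketched roughly like this. fetch_cached and the file name listings.json are made-up names for illustration; the block you pass in is where the actual request would go.

```ruby
# Returns the saved copy at cache_path if it exists; otherwise runs the block
# once, writes its result to cache_path, and returns it.
def fetch_cached(cache_path)
  return File.read(cache_path) if File.exist?(cache_path)

  data = yield
  File.write(cache_path, data)
  data
end

# Example use with the request built above (hits the network only the first time):
#   body = fetch_cached('listings.json') do
#     Net::HTTP.start(uri.hostname, uri.port, use_ssl: true) { |http| http.request(req) }.body
#   end
#   json = JSON.parse(body)
```

Delete the file whenever you want fresh data; until then every run works on the same snapshot.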
Your web page makes an AJAX request, and open-uri only returns the page as the server sent it; it does not wait for the AJAX request to complete.
You can use the code below, which waits for the page to load:
#load the libraries
require 'watir'
browser = Watir::Browser.new
browser.goto "https://www.cargurus.com/Cars/l-Used-Mazda-MAZDASPEED6-d841"
# giving some time for website to load
sleep 2
puts browser.html
NOTE: you need chromedriver to use the script: http://chromedriver.chromium.org/downloads
If you don't want to open the URL in a visible browser window, you can use a headless WebKit browser instead.

ruby and net/http request without content-type

I'm trying to make a call to a Tika server using Net::HTTP::Put. The issue is that the call always passes the Content-Type, which keeps Tika from running the detectors (which I want) and then chokes due to the default Content-Type of application/x-www-form-urlencoded. Tika docs suggest to not use that.
So, I have the following:
require 'net/http'
port = 9998
host = "localhost"
path = "/meta"
req = Net::HTTP::Put.new(path)
req.body_stream = File.open(file_name)
req['Transfer-Encoding'] = 'chunked'
req['Accept'] = 'application/json'
response = Net::HTTP.new(host, port).start { |http|
  http.request(req)
}
I tried adding req.delete('content-type') and setting initheaders = {} in various ways, but the default content-type keeps getting sent.
Any insights would be greatly appreciated, since I would rather avoid having to make multiple curl calls ... is there any way to suppress the sending of that default header?
If you set req['Content-Type'] = nil then Net::HTTP will set it to the default of 'application/x-www-form-urlencoded', but if you set it to a blank string Net::HTTP leaves it alone:
req['Content-Type'] = ''
Tika should see that as an invalid type and enable the detectors.
It seems that Tika will run the detectors if the Content-Type is application/octet-stream. Adding
req.content_type = "application/octet-stream"
is now allowing me to get results.
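Putting the pieces from the question and the fix together, a sketch of the request could look like this. build_tika_put is a made-up helper name; the path, headers and content type are the ones mentioned above, and nothing is sent until you hand the request to a Net::HTTP connection.

```ruby
require 'net/http'
require 'stringio'

# Builds (but does not send) a streaming PUT whose explicit Content-Type
# keeps Net::HTTP from falling back to application/x-www-form-urlencoded.
def build_tika_put(io, path = '/meta')
  req = Net::HTTP::Put.new(path)
  req.body_stream = io
  req['Transfer-Encoding'] = 'chunked'
  req['Accept'] = 'application/json'
  req.content_type = 'application/octet-stream'
  req
end

# Sending it, assuming a Tika server on localhost:9998 as in the question:
#   File.open(file_name, 'rb') do |file|
#     response = Net::HTTP.new('localhost', 9998).start { |http| http.request(build_tika_put(file)) }
#   end
```

You can inspect the headers before sending, e.g. with req.each_header { |name, value| puts "#{name}: #{value}" }, to confirm what will go over the wire.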

Specify Character Encoding for Net::HTTP

When I make this HTTP request:
Net::HTTP.get_response('www.telize.com',"/geoip/190.88.39.27").body
=> "{\"timezone\":\"America\\/Curacao\",\"isp\":\"United Telecommunication Services (UTS)\",\"country\":\"Cura\xE7ao\",\"dma_code\":\"0\",\"region_code\":\"00\",\"area_code\":\"0\",\"ip\":\"190.88.39.27\",\"asn\":\"AS11081\",\"continent_code\":\"NA\",\"city\":\"Willemstad\",\"longitude\":-68.9167,\"latitude\":12.1,\"country_code\":\"CW\",\"country_code3\":\"CUW\"}\n"
It returns a JSON body, but notice the country: \"country\":\"Cura\xE7ao\". The response body should actually look like this: "country":"Curaçao". It looks like Net::HTTP is assuming this is ASCII-8BIT:
Net::HTTP.get_response('www.telize.com',"/geoip/190.88.39.27").body.encoding
=> Encoding:ASCII-8BIT
but this can't be the case. How can I tell Net::HTTP which character encoding to use when making the request?
As the Tin Man determined, "\xE7" is the latin-1 encoding for LATIN SMALL LETTER C WITH CEDILLA, which as far as I can determine isn't a valid json encoding.
But...once you know the encoding, you can change it from ruby's ASCII-8BIT(which just means ruby considers the data to be binary, i.e. unencoded) to UTF-8, like this:
require 'net/http'
server_encoding = "ISO-8859-1"
resp = Net::HTTP.get_response('www.telize.com',"/geoip/190.88.39.27")
json = resp.body.force_encoding(server_encoding).encode("UTF-8")
puts json
--output:--
{"timezone":"America\/Curacao","isp":"United Telecommunication Services
(UTS)","country":"Curaçao","dma_code":"0","region_code":"00","area_code":"0",
"ip":"190.88.39.27","asn":"AS11081","continent_code":"NA","city":"Willemstad",
"longitude":-68.9167,"latitude":12.1,"country_code":"CW","country_code3":"CUW"}
"It looks like Net::HTTP is assuming this is ASCII-8BIT"
Net::HTTP tags the data as binary/ASCII-8BIT, i.e. the data has no encoding, and leaves it to you to figure out how to interpret the data.
You can't tell a server what encoding to use, but you can ask it what it thinks the file's encoding is and then apply that encoding to the body that Net::HTTP returns.
Look at the head method:
response = nil
Net::HTTP.start('www.telize.com',80) { |http|
response = http.head('/geoip/190.88.39.27')
}
response.each_header { |h| p "#{ h } => #{ response[h] }" }
Running that tells you the contents of the various headers:
"server => nginx"
"date => Thu, 12 Jun 2014 23:42:16 GMT"
"content-type => application/json; charset=iso-8859-1"
"connection => close"
The content-type value is what you want:
response['content-type'].split('=').last
# => "iso-8859-1"
Note that the server rarely does a consistency check to see whether the encoding it's told to use actually matches the file it's serving. This means the content you receive could vary wildly from what the server said it is, and, at that point, you're totally on your own to figure out what it really is, especially when the file has mixed encodings. Welcome to the wild and wooly internet.
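Combining the two answers, here is a rough sketch that reads the charset out of a Content-Type value and re-tags the body before converting it to UTF-8. charset_from and body_as_utf8 are made-up names, and falling back to UTF-8 when no charset is given is an assumption of this sketch, not something Net::HTTP does for you.

```ruby
# Extracts the charset parameter from a Content-Type header value,
# or returns the default when the header does not name one.
def charset_from(content_type, default = 'UTF-8')
  name = content_type.to_s[/charset\s*=\s*"?([^";\s]+)/i, 1]
  name || default
end

# Re-tags the raw (binary) body with the server's declared charset and
# transcodes it to UTF-8. Works on a copy, so the original string is untouched.
def body_as_utf8(raw_body, content_type)
  raw_body.dup.force_encoding(charset_from(content_type)).encode('UTF-8')
end

# Example use with a real response (performs a request):
#   resp = Net::HTTP.get_response('www.telize.com', '/geoip/190.88.39.27')
#   json = body_as_utf8(resp.body, resp['content-type'])
```

As noted above, the declared charset can be wrong, so treat the result with some suspicion if the text still looks garbled.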

How to parse HTTP response using Ruby

I've written a short snippet which sends a GET request, performs auth, and checks whether there is a 200 OK response (when auth succeeds). Now, one thing I saw with this specific GET request is that the response is always 200, irrespective of whether auth succeeds or not.
The difference is in the HTTP response. When auth fails, the first response is 200 OK, just the same as when auth succeeds, and then there is a second step: the page gets redirected again to the login page.
I am just trying to make a quick script which can check my login user and password against my web application and tell me which auth passed and which didn't.
How should I check this? The sample code is like this:
def funcA(u, p)
  print_A("#{ip} - '#{u}' : '#{p}' - Pass")
end
def try_login(u, p)
  # double quotes are needed for #{...} interpolation
  path = "/index.php?uuser=#{u}&ppass=#{p}"
  r = send_request_raw({
    'URI' => path, # pass the variable, not the literal string 'path'
    'method' => 'GET'
  })
  # a 200 alone does not prove that the login worked (see the update below)
  check = !r.nil? && r.code.to_i == 200
  if check
    funcA(u, p)
  else
    print_B("#{ip} - '#{u}' - Fail")
  end
  return check, r
end
Update:
I also tried adding a new check for matching a 'Success/Fail' keyword coming in HTTP response. It didn't work either. But I now noticed that the response coming back seems to be in a different form. The Content-Type in response is text/html;charset=utf-8 though. And I am not doing any parsing so it is failing.
Success Response is in form of:
{"param1":1,"param2"="Auth Success","menu":0,"userdesc":"My User","user":"uuser","pass":"ppass","check":"success"}
Fail response is in form of:
{"param1":-1,"param2"="Auth Fail","check":"fail"}
So now I need some pointers on how to parse this response.
Many Thanks.
I do this with "net/http":
require 'net/http'
uri = URI(url)
connection = Net::HTTP.start(uri.host, uri.port)
@response = Net::HTTP.get_response(URI(url))
@httpStatusCode = @response.code
connection.finish
If there's a redirect from a 200 then it must be a javascript or meta redirect. So just look for that in the response body.
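For the JSON bodies shown in the update, a minimal sketch of the check could look like this. auth_succeeded? is a made-up helper name, and it assumes the "check" field is what distinguishes the two cases. Note that the samples as posted contain "param2"="...", which is not valid JSON; the sketch assumes that is a typo for a colon and treats anything unparseable (including an HTML login page) as a failed login.

```ruby
require 'json'

# Returns true only when the body parses as a JSON object
# whose "check" field equals "success".
def auth_succeeded?(body)
  data = JSON.parse(body.to_s)
  data.is_a?(Hash) && data['check'] == 'success'
rescue JSON::ParserError
  false
end
```

In try_login you would then combine both conditions, e.g. check = r.code.to_i == 200 && auth_succeeded?(r.body), assuming r.body holds the response text.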

How to calculate the amount of data downloaded and the total data to be downloaded in Ruby

I'm trying to build a desktop client that manages some downloads with Ruby. I would like to know how to go about trying to identify how much of the data is downloaded and the size of the entire data that is to be downloaded.
I'm trying to do this with Ruby, so any help would be useful.
Thanks in advance.
Like Wayne said in his comment, it depends on the protocol that is used to transfer the files. With HTTP for example, the HTTP response will include a Content-Length header which will tell you the length of the file that you are downloading. After you know that you will have to keep track of the number of bytes that you've read from the HTTP connection.
Something like this seems to work (for HTTP), but I wouldn't be surprised if it could be done more elegantly:
require 'net/http'
url = URI.parse('http://www.google.com/index.html')
req = Net::HTTP::Get.new(url.path)
Net::HTTP.start(url.host, url.port) do |http|
  http.request(req) do |res|
    # content_length is nil when the server sends no Content-Length header
    remaining = res.content_length
    puts "total length: #{remaining || 'unknown'}"
    res.read_body do |segment|
      puts "read #{segment.bytesize} bytes"
      next if remaining.nil?
      remaining -= segment.bytesize
      puts "#{remaining} bytes remaining"
    end
  end
end
www.google.com/index.html is a bad example since the content gets returned in one segment, but try it on a larger object and you should see multiple "read..." lines.
If you're using Net::HTTP then the length of whatever you're requesting should be in the response headers. Net::HTTPResponse mixes in Net::HTTPHeader, where you'll find content_length(). It only works, though, if the size is determined before the transfer happens.
Net::HTTPResponse has a method that reads the body in chunks, so you can use that to determine the progress. Start at 0 and add the length of each chunk, compare it to the total size and you're done.
http.request_get('/index.html') {|res|
  res.read_body do |segment|
    print segment
  end
} #Example taken from Ruby-Documentation
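The "start at 0 and add the length of each chunk" bookkeeping can be sketched as a small class. DownloadProgress is a made-up name; total is whatever content_length reported, which may be nil, in which case no percentage can be computed.

```ruby
# Tracks how many bytes have arrived against an (optional) expected total.
class DownloadProgress
  attr_reader :received, :total

  def initialize(total)
    @total = total
    @received = 0
  end

  # Adds one chunk's byte size to the running count; returns self for chaining.
  def add(chunk)
    @received += chunk.bytesize
    self
  end

  # Percentage complete, or nil when the total size is unknown or zero.
  def percent
    return nil if total.nil? || total.zero?

    (received * 100.0 / total).round(1)
  end
end

# Example use inside a streaming request:
#   progress = DownloadProgress.new(res.content_length)
#   res.read_body { |segment| puts progress.add(segment).percent }
```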
If you're using FTP then it should be easier through Net::FTP. Connect to the server, get the size of a given file with size(filename), and then download the file with get, getbinaryfile or gettextfile.
This is the signature of the get method: get(remotefile, localfile = File.basename(remotefile), blocksize = DEFAULT_BLOCKSIZE) {|data| ...}
ftp.get('file.something', 'file.something.local', 1024){ |data|
  # the final block can be shorter than the requested block size
  puts "Downloaded #{data.bytesize} more bytes"
}
