How can I get both header and web page content in a single call? - ruby

I’m using Rails 4.2.7. I know how to use OpenURI to get the headers from a URL …
open(url){|f| pp f.meta }
and I know how to get the contents of the URL
open(url).read
So how can I get both headers and contents in one call, preferably storing headers into one variable and contents into another?

You just have to reuse the result of the open call:
f = open(url)
pp f.meta
pp f.read
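If you want the headers and the body in separate variables, as the question asks, the same idea works with a single request (url here is whatever you were already passing to open):
require 'open-uri'

f = open(url)
headers  = f.meta   # response headers as a Hash, e.g. {"content-type" => "text/html; charset=utf-8", ...}
contents = f.read   # the body as a String; read it once and keep it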

Related

Find a url in a document using regex in ruby

I have been trying to find a URL in an HTML document, and this has to be done with a regex since the URL is not in any HTML tag, so I can't use Nokogiri for that. To get the HTML I used HTTParty, and I did it this way:
require 'httparty'
doc = HTTParty.get("http://127.0.0.1:4040")
puts doc
That outputs the HTML code. To get the URL I used the .split() method to reach it. The full code is:
require 'httparty'
doc = HTTParty.get('http://127.0.0.1:4040').split(".ngrok.io")[0].split('https:')[2]
puts "https:#{doc}.ngrok.io"
I want to do this using a regex since ngrok might update their localhost HTML file, and then this code won't work anymore. How do I do it?
If I understood correctly, you want to find all hostnames matching "https://(any subdomain).ngrok.io", right?
If so, you want to use String#scan with a regexp. Here is an example:
# get your body (replace with your HTTP request)
body = "my doc contains https://subdomain.ngrok.io and https://subdomain-1.subdomain.ngrok.io"
puts body
# Use scan and you're done
urls = body.scan(%r{https://[0-9A-Za-z-\.]+\.ngrok\.io})
puts urls
It will result in an array containing ["https://subdomain.ngrok.io", "https://subdomain-1.subdomain.ngrok.io"].
Call .uniq on it if you want to get rid of duplicates.
This doesn't handle all edge cases, but it's probably enough for what you need.
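Applied to the HTTParty call from the question, that could look something like this (a sketch; .body returns the raw HTML string, and .uniq drops duplicate matches):
require 'httparty'

# .body is the raw HTML, so scan works on it directly
body = HTTParty.get('http://127.0.0.1:4040').body
urls = body.scan(%r{https://[0-9A-Za-z-\.]+\.ngrok\.io}).uniq
puts urls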

Streaming data in Ruby net/http PUT request

In the Ruby-doc Net::HTTP documentation there is a detailed example of streaming response bodies - it applies when you download a large file.
I am looking for an equivalent code snippet to upload a file via PUT. I've spent quite a bit of time trying to make the code work, with no luck. I think I need to implement a particular interface and pass it to request.body_stream.
I need streaming because I want to alter the content of the file while it is being uploaded so I want to have access to the buffered chunks while the upload takes place. I would gladly use a library like http.rb or rest-client as long as I can use streaming.
Thanks in advance!
For reference, the following is the working non-streamed version:
uri = URI("http://localhost:1234/upload")
Net::HTTP.start(uri.host, uri.port) do |http|
request = Net::HTTP::Put.new uri
File.open("/home/files/somefile") do |f|
request.body = f.read()
end
# Ideally I would need to use **request.body_stream** instead of **body** to get streaming
http.request request do |response|
response.read_body do |result|
# display the operation result
puts result
end
end
end
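For what it's worth, Net::HTTP's streaming upload hook is request.body_stream: you assign it any object that responds to read, and you must set either Content-Length or Transfer-Encoding: chunked yourself. A rough sketch along those lines, reusing the URL and file path from the snippet above (not tested against the actual server):
require 'net/http'

uri = URI("http://localhost:1234/upload")
Net::HTTP.start(uri.host, uri.port) do |http|
  request = Net::HTTP::Put.new uri
  File.open("/home/files/somefile") do |f|
    # body_stream accepts any object responding to #read; Net::HTTP reads it in chunks
    request.body_stream = f
    # the stream has no known length here, so use chunked transfer encoding
    request['Transfer-Encoding'] = 'chunked'
    http.request request do |response|
      response.read_body do |result|
        puts result
      end
    end
  end
end
To alter the data mid-upload, body_stream does not have to be a File: any object whose read(length) returns the next (possibly transformed) chunk, or nil at the end of input, should work.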

How to get the current URL for an HTML page

I am scraping a website using Nokogiri. This particular website deals with absolute URLs differently.
If I give it a URL like:
page = Nokogiri::HTML(open(link, :allow_redirections => :all))
it will redirect to the HTTPS version, and also redirect to the long version of the URL. For example, a link like
http://www.website.com/name
turns into
http://www.website.com/other-area/name
This is fine and doesn't really affect my scraper; however, there are certain edge cases I could avoid if I could tell my scraper what the current URL is.
After I pass in the above link to my page variable, how can I get the current URL of that page after the redirect happens?
I'm assuming you're using the open_uri_redirections gem because :allow_redirections is not necessary in Ruby 2.4+.
Save the result of OpenURI's open:
require 'open-uri'
r = open('http://www.google.com/gmail')
r.base_uri
# #<URI::HTTPS https://accounts.google.com/ServiceLogin?service=mail&passive=true&rm=false&continue=https://mail.google.com/mail/&ss=1&scc=1&ltmpl=default&ltmplcache=2&emr=1&osid=1#>
page = Nokogiri::HTML(r)
Alternatively, use Mechanize; then you can do:
agent = Mechanize.new
page = agent.get url
puts page.uri # this will be the redirected url

Skip large pages in Mechanize (Ruby)

I'm trying to skip processing a few large pages (some over 10MB) scattered in a result set, as Mechanize (version 2.7.3) crawls an array of links.
Unfortunately I can't find a 'content-length' property or a similar indicator. The Mechanize::FileResponse class has a content_length method but Mechanize::Page does not.
Current approach
At the moment I'm calling content.length on the page. This is very slow when one of the large pages is crawled:
detail_links.each do |detail_link|
  detail_page = detail_link.click
  # skip long pages
  break if detail_page.content.length > 100_000
  # rest of the processing
end
Content_length during response_read:
In the Mechanize source code I found a reference to content_length when the response is read. Is querying the response properties a possible solution?
# agent.rb extract from the Mechanize project
def response_read response, request, uri
  content_length = response.content_length
  if use_tempfile? content_length then
    body_io = make_tempfile 'mechanize-raw'
  else
    body_io = StringIO.new.set_encoding(Encoding::BINARY)
  end
Mechanize will normally "get" the entire page. Instead, you should use a HEAD request first to get the page size, then conditionally get the page. See "How can I perform a Head request using mechanize in Ruby" for an example.
The thing to be careful of is that a dynamically generated resource might not have a known size when you do the HEAD request, so you could get a response without the size entry. Notice in the selected answer for the question linked above that Google didn't return the Content-Length header because it's a dynamically generated page. Static pages and resources should have the header... unless the server doesn't return it for some reason.
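A rough sketch of that flow, adapted to the loop from the question (hedged: it assumes the HEAD result exposes the response headers via its header hash, and Content-Length may simply be absent for dynamic pages):
# agent is the Mechanize instance that produced detail_links
detail_links.each do |detail_link|
  # HEAD transfers only the headers, not the body
  # (assumes detail_link.href is absolute; resolve it against the page URL otherwise)
  head = agent.head(detail_link.href)
  length = head.header['content-length']    # may be nil for dynamically generated pages
  next if length && length.to_i > 100_000   # only fetch pages known to be small enough
  detail_page = detail_link.click
  # rest of the processing
end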
The Mechanize documentation mentions this:
Problems with content-length
Some sites return an incorrect content-length value. Unlike a browser, mechanize raises an error when the content-length header does not match the response length since it does not know if there was a connection problem or if the mismatch is a server bug.
The error raised, Mechanize::ResponseReadError, can be converted to a parsed Page, File, etc. depending upon the content-type:
agent = Mechanize.new
uri = URI 'http://example/invalid_content_length'
begin
  page = agent.get uri
rescue Mechanize::ResponseReadError => e
  page = e.force_parse
end
In other words, while head can help, it's not necessarily going to give you enough information to allow you to skip huge pages. You have to investigate the site you're crawling and learn how their server responds.

Fetching only X/HTML links (not images) based on mime type

I'm crawling a site using Ruby + OpenURI + Nokogiri. I fetch a page, find all the a[href] links, and (if they're in the same domain and use the right protocol) follow them to crawl again.
Sometimes there are links to large binaries (e.g. jpeg, exe), and I don't want to crawl those.
I tried using the HTTP "Accept" header to get an error or empty response for the wrong mime types like so:
require 'open-uri'
page = open(url, 'Accept'=>'text/html,application/xhtml+xml,application/xml')
...but OpenURI still downloads binaries sent with another mime type.
Other than looking at file extensions in the URL for a probable file type, how can I prevent the download (or detect a conflicting response type) for an arbitrary URL?
You could send a HEAD request first, then check the Content-type header of the response and only make the real request if it’s acceptable:
ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}
uri = URI(url)
type = Net::HTTP.start(uri.host, uri.port) do |http|
  http.head(uri.path).content_type
end
if ACCEPTABLE_TYPES.include? type
  # fetch the url
else
  # do whatever
end
This will need an extra request for each page, but I can’t see a way of avoiding it. It also relies on the server sending the same headers for a HEAD request as it does for a GET, which I think is a reasonable assumption but something to be aware of.
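For completeness, here is a sketch of how that check might be wrapped into the crawl loop from the question. The crawlable? helper and the links variable are made up for illustration, and the HTTPS handling is an assumption about the sites being crawled:
require 'net/http'
require 'open-uri'
require 'nokogiri'

ACCEPTABLE_TYPES = %w{text/html application/xhtml+xml application/xml}

# hypothetical helper: HEAD the URL and report whether the Content-Type looks like (X)HTML/XML
def crawlable?(url)
  uri = URI(url)
  type = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
    http.head(uri.request_uri).content_type
  end
  ACCEPTABLE_TYPES.include?(type)
rescue StandardError
  false   # treat unreachable or malformed URLs as not crawlable
end

links.each do |link|
  next unless crawlable?(link)
  page = Nokogiri::HTML(open(link))
  # extract a[href] links from page and keep crawling
end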
