Retrieve contents of URL as string - ruby

For tedious reasons to do with Hpricot, I need to write a function that is passed a URL, and returns the whole contents of the page as a single string.
I'm close. I know I need to use OpenURI, and it should look something like this:
require 'open-uri'
open(url) {
# do something mysterious here to get page_string
}
puts page_string
Can anyone suggest what I need to add?

You can do the same without OpenURI:
require 'net/http'
require 'uri'
# Named read_url rather than open so that it does not shadow Kernel#open
def read_url(url)
  Net::HTTP.get(URI.parse(url))
end
page_content = read_url('http://www.google.com')
puts page_content
Or, more succinctly:
Net::HTTP.get(URI.parse('http://www.google.com'))
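Note that Net::HTTP.get returns the body whatever the status code is, so an error page comes back as ordinary text. Here is a minimal sketch that checks the status first. The helper name fetch_body is hypothetical (it is not part of Net::HTTP), and the sketch does not follow redirects:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: return the response body only for 2xx responses.
def fetch_body(url)
  response = Net::HTTP.get_response(URI.parse(url))
  unless response.is_a?(Net::HTTPSuccess)
    # Anything other than 2xx (404, 500, an unfollowed 301, ...) ends up here
    raise "GET #{url} failed: #{response.code} #{response.message}"
  end
  response.body
end
```

Usage would look like page_string = fetch_body('http://www.google.com'); a plain RuntimeError is used only to keep the sketch short.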

The open method passes an IO representation of the resource to your block when it yields. You can read from it using the IO#read method:
open([mode [, perm]] [, options]) [{|io| ... }]
open(path) { |io| data = io.read }

require 'open-uri'
open(url) do |f|
page_string = f.read
end
See also the documentation of the IO class.

I was also unsure which approach gives better performance, so I ran a benchmark of both to make it clearer:
require 'benchmark'
require 'net/http'
require "uri"
require 'open-uri'
url = "http://www.google.com"
Benchmark.bm do |x|
x.report("net-http:") { content = Net::HTTP.get_response(URI.parse(url)).body if url }
x.report("open-uri:") { open(url){|f| content = f.read } if url }
end
The result:
                user     system      total        real
net-http:   0.000000   0.000000   0.000000 (  0.097779)
open-uri:   0.030000   0.010000   0.040000 (  0.864526)
Net::HTTP was faster in this single run, but which one to use depends on your requirements and on how you want to process the response.

To make code a little clearer, the OpenURI open method will return the value returned by the block, so you can assign open's return value to your variable. For example:
xml_text = open(url) { |io| io.read }

Starting with Ruby 3.0, calling URI.open via Kernel#open has been removed, so instead call URI.open directly:
require 'open-uri'
page_string = URI.open(url, &:read)
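Since the question asks for a function, the one-liner above can be wrapped as follows. This is a minimal sketch: the name fetch_page and the early scheme check are assumptions added for illustration, not something open-uri requires.

```ruby
require 'open-uri'

# Hypothetical helper: return the whole page at `url` as a single String.
def fetch_page(url)
  uri = URI.parse(url)
  # URI::HTTPS is a subclass of URI::HTTP, so this accepts both schemes
  unless uri.is_a?(URI::HTTP)
    raise ArgumentError, "expected an http(s) URL, got #{url.inspect}"
  end
  # URI#open (added by open-uri) yields an IO-like object and returns the block's value
  uri.open(&:read)
end
```

It would be called as page_string = fetch_page(url). Rejecting non-HTTP input up front is a design choice of this sketch; drop the check if you also need other schemes.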

Try the following instead:
require 'open-uri'
content = URI(your_url).read

require 'open-uri'
open(url) {|f| #url must specify the protocol
str = f.read()
}

Related

Warning message with Nokogiri and Open-Uri

Hello, I get the following warning:
warning: calling URI.open via Kernel#open is deprecated, call URI.open directly or use URI#open
I don't understand why this message appears. I also tried URI.open("link").read,
but I have to use Nokogiri.
Here is my code:
require 'nokogiri'
require 'open-uri'
puts "Wait a second data is coming..."
PAGE_URL = "https://coinmarketcap.com/all/views/all/"
page = Nokogiri::HTML(open(PAGE_URL))
currency_name_array = page.xpath("//tr/td/a[contains(@class, 'currency-name-container')]/text()").map {|x| x.to_s }
currency_value_array = page.xpath("//tr/td/a[contains(@class, 'price')]/text()").map {|x| x.to_s }
currency_result = Hash[currency_name_array.zip(currency_value_array)]
puts currency_result
Thanks in advance.

Getting all unique URL's using nokogiri

I've been working for a while to try to use the .uniq method to generate a unique list of URL's from a website (within the /informatics path). No matter what I try I get a method error when trying to generate the list. I'm sure it's a syntax issue, and I was hoping someone could point me in the right direction.
Once I get the list I'm going to need to store these to a database via ActiveRecord, but I need the unique list before I start to wrap my head around that.
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0]="https://www.nku.edu/academics/informatics.html"
ARGV.each do |arg|
open(arg) do |f|
# Display connection data
puts "#"*25 + "\nConnection: '#{arg}'\n" + "#"*25
[:base_uri, :meta, :status, :charset, :content_encoding,
:content_type, :last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
# Display the href links
base_url = /^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]
puts "base_url: #{base_url}"
Nokogiri::HTML(f).css('a').each do |anchor|
href = anchor['href']
# Make Unique
if href =~ /.*informatics/
puts href
#store stuff to active record
end
end
end
end
Replace the Nokogiri::HTML part so that it selects only the anchors whose href attribute matches /.*informatics/. The result is already an array, so you can then call uniq on it:
require 'nokogiri'
require 'open-uri'
require 'active_record'
ARGV[0] = 'https://www.nku.edu/academics/informatics.html'
ARGV.each do |arg|
open(arg) do |f|
puts "#{'#' * 25} \nConnection: '#{arg}'\n #{'#' * 25}"
%i[base_uri meta status charset content_encoding content_type last_modified].each do |method|
puts "#{method.to_s}: #{f.send(method)}" if f.respond_to? method
end
puts "base_url: #{/^(.*\.nku\.edu)\//.match(f.base_uri.to_s)[1]}"
anchors = Nokogiri::HTML(f).css('a').select { |anchor| anchor['href'] =~ /.*informatics/ }
puts anchors.map { |anchor| anchor['href'] }.uniq
end
end
See output.
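One thing uniq alone will not catch is the same page linked once as a relative path and once as an absolute URL. Here is a minimal sketch of normalizing before deduplicating, assuming the href values have already been extracted into a plain array. The helper name unique_links and its needle parameter are hypothetical:

```ruby
require 'uri'

# Hypothetical helper: turn raw href values into unique absolute URLs
# that contain `needle`. `base_url` is the URL of the page they came from.
def unique_links(hrefs, base_url, needle = 'informatics')
  absolute = hrefs.compact.filter_map do |href|
    begin
      URI.join(base_url, href).to_s
    rescue URI::Error
      nil # skip hrefs that are not valid URIs
    end
  end
  absolute.select { |link| link.include?(needle) }.uniq
end
```

With the code above, the input could come from Nokogiri::HTML(f).css('a').map { |anchor| anchor['href'] } and the base from f.base_uri.to_s. Anchors without an href yield nil, which is why the sketch calls compact first.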

How to get first modified in Ruby

How can I get the first-modified time in Ruby? The code below gets the last-modified time,
but I can't get the first-modified one.
require 'open-uri'
open("link") do |f|
f.each_line {|line| p line}
puts p f.last_modified
end
How can I get the first-modified time in Ruby?
What code do I have to write?
I tried:
require 'open-uri'
open("link") do |f|
f.each_line {|line| p line}
puts p f.first_modified
end
and it didn't work.
There's no "first_modified" method in OpenURI because the underlying HTTP protocol has no support for it, so what you want to do is not possible.
OpenURI docs: http://ruby-doc.org/stdlib-2.1.0/libdoc/open-uri/rdoc/OpenURI/Meta.html
First-Modified is not a standard or common non-standard HTTP header.
For a header that does exist, you can read it from the meta hash when it is present, even if it is not exposed as a Ruby method. OpenURI stores the header names in lower case:
require 'open-uri'
p URI.open('http://www.google.com').meta['your-header-name']
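The lookup can be wrapped so that callers do not have to remember the lower-case keys. This is a minimal sketch; the helper name response_header is hypothetical:

```ruby
require 'open-uri'

# Hypothetical helper: return the value of one response header, or nil if absent.
def response_header(url, name)
  # OpenURI::Meta#meta keys are downcased, so normalize the requested name
  URI.open(url) { |io| io.meta[name.downcase] }
end
```

For example, response_header(url, 'Last-Modified') would return the raw header string, whereas last_modified returns a parsed Time.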

Save Webscraped data

I am scraping a website and can extract the data, but I am having trouble saving the scraped data to the YAML file named in my code.
My Code:
require 'rubygems'
require 'open-uri'
require 'hpricot'
article = []
doc = open("http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"{|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
puts "#{article.inner_html}"
end
File.open('test.yaml', 'w') { |f|
f <<article.to_yaml
}
First, the open call is missing its closing parenthesis, which belongs right before the block starts.
Once you add it, you'll get a NoMethodError (undefined method 'to_yaml' for []:Array). To fix that, you have to require 'yaml', which adds to_yaml to the Array class. After that you'll notice that your YAML file is empty, because you never put anything into article. Here's a fixed version:
require 'rubygems'
require 'open-uri'
require 'hpricot'
require 'yaml'
articles = []
url = "http://www.cmegroup.com/trading/interest-rates/cleared-otc/irs.html"
doc = open(url) {|f| Hpricot(f) }
(doc/"/html/body/div/div/div/div/table/").each do |article|
articles << article.inner_html
end
File.open('test.yaml', 'w') { |f| f << articles.to_yaml }
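To check that the file really contains the data, it helps to read it back. This is a minimal round-trip sketch using only the standard library; the helper names save_articles and load_articles are hypothetical:

```ruby
require 'yaml'

# Hypothetical helper: write an array of strings to `path` as YAML.
def save_articles(articles, path)
  File.write(path, articles.to_yaml)
end

# Hypothetical helper: read the array back from the YAML file at `path`.
def load_articles(path)
  YAML.load_file(path)
end
```

After save_articles(articles, 'test.yaml'), load_articles('test.yaml') should return an array equal to articles, which makes an empty file easy to spot.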

How do I get the destination URL of a shortened URL using Ruby?

How do I take this URL http://t.co/yjgxz5Y and get the destination URL, which is http://nickstraffictricks.com/4856_how-to-rank-1-in-google/?
require 'net/http'
require 'uri'
Net::HTTP.get_response(URI.parse('http://t.co/yjgxz5Y'))['location']
# => "http://nickstraffictricks.com/4856_how-to-rank-1-in-google/"
I've used open-uri for this, because it's nice and simple. It will retrieve the page, but will also follow multiple redirects:
require 'open-uri'
final_uri = ''
open('http://t.co/yjgxz5Y') do |h|
final_uri = h.base_uri
end
final_uri # => #<URI::HTTP:0x00000100851050 URL:http://nickstraffictricks.com/4856_how-to-rank-1-in-google/>
The docs show a nice example for using the lower-level Net::HTTP to handle redirects.
require 'net/http'
require 'uri'
def fetch(uri_str, limit = 10)
# You should choose a better exception class.
raise ArgumentError, 'HTTP redirect too deep' if limit == 0
response = Net::HTTP.get_response(URI.parse(uri_str))
case response
when Net::HTTPSuccess then response
when Net::HTTPRedirection then fetch(response['location'], limit - 1)
else
response.error!
end
end
puts fetch('http://www.ruby-lang.org')
Of course this all breaks down if the page isn't using an HTTP redirect. A lot of sites use meta-redirects, which you have to handle by retrieving the URL from the meta tag, but that's a different question.
For resolving redirects you should use a HEAD request to avoid downloading the whole response body (imagine resolving a URL to an audio or video file).
Working example using the Faraday gem:
require 'faraday'
require 'faraday_middleware'
def resolve_redirects(url)
  response = fetch_response(url, method: :head)
  # nil when the request failed, otherwise the final URL as a String
  response && response.to_hash[:url].to_s
end
def fetch_response(url, method: :get)
  conn = Faraday.new do |b|
    b.use FaradayMiddleware::FollowRedirects
    b.adapter :net_http
  end
  conn.send(method, url)
rescue Faraday::Error
  # Faraday::ConnectionFailed is a subclass of Faraday::Error, so one rescue covers both
  nil
end
puts resolve_redirects("http://cre.fm/feed/m4a") # http://feeds.feedburner.com/cre-podcast
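The same HEAD-based idea can be sketched with only the standard library. The helper name resolve_final_url and its limit parameter are hypothetical. Location headers may be relative, so the sketch resolves each one against the current URL. Some servers may not answer HEAD the same way as GET, so treat this as a starting point rather than a complete solution:

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: follow redirects using HEAD requests and return
# the final URL as a String, without downloading any response body.
def resolve_final_url(url, limit = 10)
  uri = URI.parse(url)
  limit.times do
    response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https') do |http|
      http.head(uri.request_uri)
    end
    location = response['location']
    # Stop at the first response that is not a redirect with a Location header
    return uri.to_s unless response.is_a?(Net::HTTPRedirection) && location
    # Resolve relative Location values against the current URL
    uri = URI.join(uri.to_s, location)
  end
  raise ArgumentError, 'HTTP redirect too deep'
end
```

It would be called as resolve_final_url('http://t.co/yjgxz5Y'). The redirect limit mirrors the one in the Net::HTTP example above.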
You would have to follow the redirect. I think this would help:
http://shadow-file.blogspot.com/2009/03/handling-http-redirection-in-ruby.html
