Getting webpage content with Ruby -- I'm having troubles - ruby

I want to get the content of this* page. Everything I've looked up suggests parsing CSS elements, but that page has none.
Here's the only code that I found that looked like it should work:
file = File.open('http://hiscore.runescape.com/index_lite.ws?player=zezima', "r")
contents = file.read
puts contents
Error:
tracker.rb:1:in 'initialize': Invalid argument - http://hiscore.runescape.com/index_lite.ws?player=zezima (Errno::EINVAL)
from tracker.rb:1:in 'open'
from tracker.rb:1
*http://hiscore.runescape.com/index_lite.ws?player=zezima
If you try to format this as a link in the post it doesn't recognize the underscore (_) in the URL for some reason.

File.open only works with local paths. To read from a URL, require the open-uri library from the standard library first. On older Rubies it patches Kernel#open so that it accepts URLs. Since Ruby 2.5 the preferred call is URI.open, and from Ruby 3.0 on a plain open no longer accepts URLs:
require 'open-uri'
file = URI.open('http://hiscore.runescape.com/index_lite.ws?player=zezima')
contents = file.read
puts contents
This related SO thread covers the same question:
Open an IO stream from a local file or url

An appropriate way to fetch the content of a website is the Net::HTTP class in Ruby's standard library. Pass the whole URI rather than only its path, so that the query string (?player=zezima) is sent as well:
require 'uri'
require 'net/http'
url = "http://hiscore.runescape.com/index_lite.ws?player=zezima"
response = Net::HTTP.get_response(URI.parse(url))
puts response.body
File.open() does not support URIs.
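One detail worth checking with Net::HTTP is which part of the URL actually gets sent. The following is a minimal sketch, using only the standard uri library, of how the question's URL splits into components. The variable names are illustrative.

```ruby
require 'uri'

url = 'http://hiscore.runescape.com/index_lite.ws?player=zezima'
uri = URI.parse(url)

host         = uri.host         # server name only
path         = uri.path         # path without the query string
query        = uri.query        # everything after the "?"
request_path = uri.request_uri  # path plus query, as sent in the request

puts host, path, query, request_path
```

Passing only uri.path would drop player=zezima, so either use request_uri or hand the whole URI object to Net::HTTP.get_response.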

Use open-uri; it supports both URLs and local files (on Ruby 2.5 and later, call URI.open rather than open):
require 'open-uri'
contents = URI.open('http://www.google.com') { |f| f.read }

Related

Save and parse CSV file from URL

I am looking for an implementation that would allow me to download a CSV file from a browser (via a URL), to a point where I can open that file manually and view its contents in CSV form.
I have been doing some research and can see that I should use the IO, CSV or File classes.
I have a URL that looks something like:
"https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
From what I have read I have:
href = page.find('#csv-download > a')['href']
csv_path = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
require 'open-uri'
download = open(csv_path, ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE)
IO.copy_stream(download, 'test.csv')
This actually outputs:
2684
Which tells me that I have successfully got the data?
When downloading the file, the contents are just
#<StringIO:0x00000003e07d30>
Is there a reason for this?
I'm not sure where to go from here; could anyone point me in the right direction, please?
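This should read from the remote URL, write the file, and then parse it. Note that copy_stream leaves the stream positioned at its end, so it has to be rewound before parsing:
require 'open-uri'
require 'csv'
url = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
download = URI.open(url)
IO.copy_stream(download, 'test.csv')
download.rewind # copy_stream consumed the stream; go back to the start
CSV.new(download).each do |row|
  puts row
end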
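The position of the stream matters whenever the same download is both saved and parsed. Here is a small sketch of that behaviour, with a StringIO standing in for the object open-uri returns for small responses. The sample rows and the temporary file location are made up for illustration.

```ruby
require 'csv'
require 'stringio'
require 'tmpdir'

# Stand-in for the downloaded stream; the rows are invented sample data.
download = StringIO.new("name,score\nalice,10\nbob,7\n")

rows = []
Dir.mktmpdir do |dir|
  target = File.join(dir, 'test.csv')

  # copy_stream reads the source to its end and returns the byte count.
  bytes = IO.copy_stream(download, target)
  puts "copied #{bytes} bytes, at end of stream: #{download.eof?}"

  # Without rewinding, the CSV parser would see an empty stream.
  download.rewind
  rows = CSV.new(download).to_a
end

p rows
```

The number printed in the question (2684) is this return value of copy_stream: the count of bytes copied.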
If all you want to do is read a file and save it, it's simple. This untested code should be all that's required:
require 'open-uri'
CSV_PATH = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
IO.copy_stream(
open(
CSV_PATH,
ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE
),
'test.csv'
)
OpenURI's open returns an IO-like object (a StringIO for small responses, a Tempfile for larger ones), which is all you need to make copy_stream happy.
More typically you'll see the open, read, write pattern: open fetches the remote document and returns a stream, read returns its contents as a string, and write saves that string to a file on your local disk. See their documentation for more information.
require 'open-uri'
CSV_PATH = "https://mydomain/manage/reporting/index?who=user&users=0&teams=0&datasetName=0&startDate=2015-10-18&endDate=2015-11-17&format=csv"
File.write(
'test.csv',
open(
CSV_PATH,
ssl_verify_mode: OpenSSL::SSL::VERIFY_NONE
).read
)
There might be a scalability advantage to using copy_stream for huge files that potentially wouldn't fit into memory. That'd be a test for the user.
Here is a one-liner I use. Of course if the file is huge - I might want to stream or download it first, but this works in 99% of cases, just fine.
require 'open-uri'
require 'csv'
csv_data = CSV.parse(open(download_url).read, headers: true)
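With headers: true, the parser returns a CSV::Table whose rows can be addressed by column name. This is a brief sketch on an inline string, so it runs without a network connection. The column names and values are placeholders, not the real report format.

```ruby
require 'csv'

# Placeholder data standing in for the downloaded report.
csv_text = "user,team,total\nann,red,3\nraj,blue,5\n"

table = CSV.parse(csv_text, headers: true)

headers = table.headers                          # column names from the first line
users   = table['user']                          # one whole column as an array
totals  = table.map { |row| row['total'].to_i }  # per-row access by header

p headers, users, totals
```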

Open a local file with open-uri

I am doing data scraping with Ruby and Nokogiri. Is it possible to open and parse a local file on my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives this error:
No such file or directory # rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html
If you want to parse a local file with Nokogiri, you can do it like this:
require 'nokogiri'
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)
When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.
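For the mixed case mentioned above, one possible approach is to look at the URI scheme and pick the opener accordingly. This is only a sketch: read_source is a hypothetical helper name, and treating file:// URLs as local paths is an assumption about what the caller wants. It also assumes Ruby 2.5 or later for URI.open.

```ruby
require 'open-uri'
require 'uri'
require 'tmpdir'

# Hypothetical helper: returns the contents of a local path, a file:// URL,
# or an http(s) URL as a string.
def read_source(source)
  uri = URI.parse(source)
  case uri.scheme
  when 'http', 'https'
    URI.open(source, &:read)  # fetched over the network by open-uri
  when 'file'
    File.read(uri.path)       # drop the scheme and read from disk
  else
    File.read(source)         # plain file system path
  end
end

# Local demonstration; these two calls need no network access.
from_path = nil
from_file_url = nil
Dir.mktmpdir do |dir|
  path = File.join(dir, 'data.txt')
  File.write(path, "hello\n")
  from_path = read_source(path)
  from_file_url = read_source("file://#{path}")
end

puts from_path
```

Paths containing spaces or other characters that are not valid in a URI would make URI.parse raise, so real code would need extra handling for them.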
This is how I do it, following the documentation. The block form closes the file automatically:
require 'nokogiri'
doc = File.open('/home/nav/Desktop/Scraping/scrap1.html') { |f| Nokogiri::HTML(f) }
I would use Mechanize to save the file locally, then parse it with Nokogiri like so:
require 'mechanize'
require 'nokogiri'
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
path = "#{Rails.root}/tmp/page.html" # file name chosen for illustration
file.save!(path)
# Read the file back and parse it
page = Nokogiri::HTML(File.read(path))
Hope that helps!

Get filename on server while using Ruby gem Curb

Is there a way to get the filename of the file being downloaded (without having to parse the url provided)? I am hoping to find something like:
c = Curl::Easy.new("http://google.com/robots.txt")
c.perform
File.open( c.file_name, "w") { |file| file.write c.body_str }
Unfortunately, there's nothing in the Curb documentation about retrieving the filename. I don't know whether you have a particular aversion to parsing, but it's a simple process with the URI module:
require 'uri'
url = 'http://google.com/robots.txt'
uri = URI.parse(url)
puts File.basename(uri.path)
#=> "robots.txt"
UPDATE:
In the comments to this question, the OP suggests using split() to split the URL by slashes (/). While this may work in the majority of situations, it isn't a catch-all solution. For instance, versioned files won't be parsed correctly:
url = 'http://google.com/robots.txt?1234567890'
puts url.split('/').last
#=> "robots.txt?1234567890"
In comparison, using URI.parse() guarantees the filename – and only the filename – is returned:
require 'uri'
url = 'http://google.com/robots.txt?1234567890'
uri = URI.parse(url)
puts File.basename(uri.path)
#=> "robots.txt"
In sum, it's wise to use the URI library to parse URLs; that's what it was created for, after all.
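The approach can be wrapped in a small helper. This is a sketch, not part of Curb: filename_from_url is a hypothetical name, and the fallback used for URLs that have no file component is an assumption.

```ruby
require 'uri'

# Hypothetical helper: extracts the last path segment of a URL, ignoring
# any query string or fragment. Falls back to a default name (an assumption)
# when the URL has no path or only "/".
def filename_from_url(url, fallback = 'index.html')
  name = File.basename(URI.parse(url).path.to_s)
  (name.empty? || name == '/') ? fallback : name
end

puts filename_from_url('http://google.com/robots.txt?1234567890')
```

The result could then be used as the target of File.open when writing c.body_str to disk.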

How do I open a web page and write it to a file in ruby?

If I run a simple script using OpenURI, I can access a web page. The results get written to the terminal.
Normally I would use bash redirection to write the results to a file.
How do I use ruby to write the results of an OpenURI call to a file?
require 'open-uri'
open("file_to_write.html", "wb") do |file|
URI.open("http://www.example.com/") do |uri|
file.write(uri.read)
end
end
Note: In Ruby < 2.5 you must use open(url) instead of URI.open(url). See https://bugs.ruby-lang.org/issues/15893
The pickaxe to the rescue. (this used to be a good page, but is no longer working)
Try this instead: Open an IO stream from a local file or url

Open an IO stream from a local file or url

I know there are libs in other languages that can take a string that contains either a path to a local file or a url and open it as a readable IO stream.
Is there an easy way to do this in ruby?
open-uri is part of the Ruby standard library, and it lets you open a URL as well as a local file. For a local path you get a File object; for a URL you get a StringIO or a Tempfile. All of them respond to methods like read and readlines. On Ruby 2.5 and later, call URI.open; older versions patch Kernel#open instead.
require 'open-uri'
file_contents = URI.open('local-file.txt') { |f| f.read }
web_contents = URI.open('http://www.stackoverflow.com') { |f| f.read }

Resources