Open a local file with open-uri - ruby

I am doing data scraping with Ruby and Nokogiri. Is it possible to download and parse a local file in my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives error as:
No such file or directory # rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html

If you want to parse a local file with Nokogiri you can do it like this.
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)

When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.

This is how I do it as according to the documentation.
f = File.open("//home/nav/Desktop/Scraping/scrap1.html")
doc = Nokogiri::HTML(f)
f.close

I would make use of Mechanize and save the file locally, then parse it with Nokogiri like so:
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
file.save!("#{Rails.root}/tmp/")
# Read the file
page = Nokogiri::HTML::Reader(File.open(file))
Hope that helps!

Related

Is it possible to download an YAML contained in a webpage using Page Objects / Capybara / Selenium Webdriver?

I have a webpage that just displays a YAML file.
Is there anyway I can get that YAML file so I can save it using a cucumber project (that uses Page Object, Capybara and Selenium Webdriver and it's in Ruby).
Thanks
So, I just had to get the body and save its contents to a YAML file.
body = #browser.find_element(:tag_name => 'body')
f = File.new "#{path}/info.yml"
f.write body.text
Instead of viewing the YAML file in the browser and then having to parse the YAML out of the HTML, you can directly download the YAML file instead:
Determine the location of the YAML file
Use Open-URI to download the file
Assuming that the page has a link to the YAML file like:
Invoice Number
Then you can download the file and save it to disk using:
require 'capybara'
require 'open-uri'
# Specify where to save the file
save_location = 'C:\some\file\path\name.yaml'
# Go to the page that has the link to the YAML file
page = Capybara::Session.new(:selenium)
page.visit('http://www.your_page.com')
# Determine the location of the YAML file
yaml_link = page.find('#invoice')
yaml_location = yaml_link['href']
# Save the file to disk
File.open(save_location, "w") do |file|
file.puts open(yaml_location).read
end

Open URLs from CSV

I am using Ruby 2.1.0p0 on Mac OS.
I'm parsing a CSV file and grabbing all the URLs, then using Nokogiri and OpenURI to scrape them which is where I'm getting stuck.
When I try to use an each loop to run through the URLs array, I get this error:
initialize': No such file or directory # rb_sysopen - URL (Errno::ENOENT)
When I manually create an array, and then run through it I get no error. I've tried to_s, URI::encode, and everything I could think of and find on Stack Overflow.
I can copy and paste the URL from the CSV or from the terminal after using puts on the array and it opens in my browser no problem. I try to open it with Nokogiri it's not happening.
Here's my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'csv'
events = Array.new
CSV.foreach('productfeed.csv') do |row|
events.push URI::encode(row[0]).to_s
end
events.each do |event|
page = Nokogiri::HTML(open("#{event}"))
#eventually, going to find info on the page, and scrape it, but not there yet.
#something to show I didn't get an error
puts "open = success"
end
Please help! I am completely out of ideas.
It looks like you're processing the header row, where on of those values is literally "URL". That's not a valid URI so open-uri won't touch it.
There's a headers option for the CSV module that will make use of the headers automatically. Try turning that on and referring to row["URL"]
I tried doing the same thing and found it to work better using a text file.
Here is what I did.
#!/usr/bin/python
#import webbrowser module and time module
import webbrowser
import time
#open text file as "dataFile" and verify there is data in said file
dataFile = open('/home/user/Desktop/urls.txt','r')
if dataFile > 1:
print("Data file opened successfully")
else:
print("!!!!NO DATA IN FILE!!!!")
exit()
#read file line by line, remove any spaces/newlines, and open link in chromium-browser
for lines in dataFile:
url = str(lines.strip())
print("Opening " + url)
webbrowser.get('chromium-browser').open_new_tab(url)
#close file and exit
print("Closing Data File")
dataFile.close()
#wait two seconds before printing "Data file closed".
#this is purely for visual effect.
time.sleep(2)
print("Data file closed")
#after opener has run, user is prompted to press enter key to exit.
raw_input("\n\nURL Opener has run. Press the enter key to exit.")
exit()
Hope this helps!

how to parse XML file remotely from FTP with nokogiri gem, without downloading

require 'net/ftp'
require 'nokogiri'
server = "xxxxxx"
user = "xxxxx"
password = "xxxxx"
ftp = Net::FTP.new(server, user, password)
files = ftp.nlst('File*.xml')
files.each do |file|
ftp.getbinaryfile(file)
doc = Nokogiri::XML(open(file))
# some operations with doc
end
With the code above I'm able to parse/read XML file, because it first downloads a file.
But how can I parse remote XML file without downloading it?
The code above is a part of rake task that loads rails environment when run.
UPDATE:
I'm not going to create any file. I will import info into the mongodb using mongoid.
If you simply want to avoid using a temporary local file, it is possible to to fetch the file contents direct as a String, and process in memory, by supplying nil as the local file name:
files.each do |file|
xml_string = ftp.getbinaryfile( file, nil )
doc = Nokogiri::XML( xml_string )
# some operations with doc
end
This still does an FTP fetch of the contents, and XML parsing happens at the client.
It is not really possible to avoid fetching the data in some form or other, and if FTP is the only protocol you have available, then that means copying data over the network using an FTP get. However, it is possible, but far more complicated, to add capabilities to your FTP (or other net-based) server, and return the data in some other form. That could include Nokogiri parsing done remotely on the server, but you'd still need to serialise the end result, fetch it and deserialise it.

How do I open a web page and write it to a file in ruby?

If I run a simple script using OpenURI, I can access a web page. The results get written to the terminal.
Normally I would use bash redirection to write the results to a file.
How do I use ruby to write the results of an OpenURI call to a file?
require 'open-uri'
open("file_to_write.html", "wb") do |file|
URI.open("http://www.example.com/") do |uri|
file.write(uri.read)
end
end
Note: In Ruby < 2.5 you must use open(url) instead of URI.open(url). See https://bugs.ruby-lang.org/issues/15893
The pickaxe to the rescue. (this used to be a good page, but is no longer working)
Try this instead: Open an IO stream from a local file or url

Getting webpage content with Ruby -- I'm having troubles

I want to get the content off this* page. Everything I've looked up gives the solution of parsing CSS elements; but, that page has none.
Here's the only code that I found that looked like it should work:
file = File.open('http://hiscore.runescape.com/index_lite.ws?player=zezima', "r")
contents = file.read
puts contents
Error:
tracker.rb:1:in 'initialize': Invalid argument - http://hiscore.runescape.com/index_lite.ws?player=zezima (Errno::EINVAL)
from tracker.rb:1:in 'open'
from tracker.rb:1
*http://hiscore.runescape.com/index_lite.ws?player=zezima
If you try to format this as a link in the post it doesn't recognize the underscore (_) in the URL for some reason.
You really want to use open() provided by the Kernel class which can read from URIs you just need to require the OpenURI library first:
require 'open-uri'
Used like so:
require 'open-uri'
file = open('http://hiscore.runescape.com/index_lite.ws?player=zezima')
contents = file.read
puts contents
This related SO thread covers the same question:
Open an IO stream from a local file or url
The appropriate way to fetch the content of a website is through the NET::HTTP module in Ruby:
require 'uri'
require 'net/http'
url = "http://hiscore.runescape.com/index_lite.ws?player=zezima"
r = Net::HTTP.get_response(URI.parse(url).host, URI.parse(url).path)
File.open() does not support URIs.
Best wishes,
Fabian
Please use open-uri, its support both uri and local files
require 'open-uri'
contents = open('http://www.google.com') {|f| f.read }

Resources