Open URLs from CSV - ruby

I am using Ruby 2.1.0p0 on Mac OS.
I'm parsing a CSV file and grabbing all the URLs, then using Nokogiri and OpenURI to scrape them which is where I'm getting stuck.
When I try to use an each loop to run through the URLs array, I get this error:
initialize': No such file or directory # rb_sysopen - URL (Errno::ENOENT)
When I manually create an array, and then run through it I get no error. I've tried to_s, URI::encode, and everything I could think of and find on Stack Overflow.
I can copy and paste the URL from the CSV or from the terminal after using puts on the array and it opens in my browser no problem. I try to open it with Nokogiri it's not happening.
Here's my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'csv'
events = Array.new
CSV.foreach('productfeed.csv') do |row|
events.push URI::encode(row[0]).to_s
end
events.each do |event|
page = Nokogiri::HTML(open("#{event}"))
#eventually, going to find info on the page, and scrape it, but not there yet.
#something to show I didn't get an error
puts "open = success"
end
Please help! I am completely out of ideas.

It looks like you're processing the header row, where on of those values is literally "URL". That's not a valid URI so open-uri won't touch it.
There's a headers option for the CSV module that will make use of the headers automatically. Try turning that on and referring to row["URL"]

I tried doing the same thing and found it to work better using a text file.
Here is what I did.
#!/usr/bin/python
#import webbrowser module and time module
import webbrowser
import time
#open text file as "dataFile" and verify there is data in said file
dataFile = open('/home/user/Desktop/urls.txt','r')
if dataFile > 1:
print("Data file opened successfully")
else:
print("!!!!NO DATA IN FILE!!!!")
exit()
#read file line by line, remove any spaces/newlines, and open link in chromium-browser
for lines in dataFile:
url = str(lines.strip())
print("Opening " + url)
webbrowser.get('chromium-browser').open_new_tab(url)
#close file and exit
print("Closing Data File")
dataFile.close()
#wait two seconds before printing "Data file closed".
#this is purely for visual effect.
time.sleep(2)
print("Data file closed")
#after opener has run, user is prompted to press enter key to exit.
raw_input("\n\nURL Opener has run. Press the enter key to exit.")
exit()
Hope this helps!

Related

Ruby - How to add EOF marker into a PDF file or otherwise bypass PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm using the Mechanize ruby gem to click a button on the web to download a PDF file and save it to the local file system.
URL = "www.my-site.com"
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here
page = agent.get(URL)
form = page.forms.first
button = page.form.button_with(:value => "Some Button Text")
local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)
But when I try to read this PDF file using the PDF::Reader gem, I get an error "PDF does not contain EOF marker".
reader = PDF::Reader.new(local_file) # this also happens if I try to use PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io) depending on the different pluggable_parser configurations mentioned above
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker
I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader gem is complaining about it missing an EOF marker.
So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?
Thanks.
Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader
Related Docs:
http://mechanize.rubyforge.org/Mechanize/File.html
http://mechanize.rubyforge.org/Mechanize/Download.html
http://mechanize.rubyforge.org/Mechanize/FileSaver.html
https://github.com/yob/pdf-reader
EDIT:
I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking stuff that I can't seem to figure out how to get rid of. I want to isolate the PDF content and then parse that, but still running into issues. Here is the full script I am using:
https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf
The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org
The add HTML data at the end of the response.
However, you could truncate the response by searching for the first substring %EOF and removing all the data after that.
i.e.:
pdf_data = result.body
pdf_data.slice!(0, pdf_data.index("%EOL").to_i + 4)
if(pdf_data.length <= 4)
# handle error
else
# save/send pdf_data
end

Ruby: Reading from a file written to by the system process

I'm trying to open a tmpfile in the system $EDITOR, write to it, and then read in the output. I can get it to work, but I am wondering why calling file.read returns an empty string (when the file does have content)
Basically I'd like to know the correct way of reading the file once it has been written to.
require 'tempfile'
file = Tempfile.new("note")
system("$EDITOR #{file.path}")
file.rewind
puts file.read # this puts out an empty string "" .. why?
puts IO.read(file.path) # this puts out the contents of the file
Yes, I will be running this in an ensure block to nuke the file once used ;)
I was running this on ruby 2.2.2 and using vim.
Make sure you are calling open on the file object before attempting to read it in:
require 'tempfile'
file = Tempfile.new("note")
system("$EDITOR #{file.path}")
file.open
puts file.read
file.close
file.unlink
This will also let you avoid calling rewind on the file, since your process hasn't written any bytes to it at the time you open it.
I believe IO.read will always open the file for you, which is why it worked in that case. Whereas calling .read on an IO-like object does not always open the file for you.

Changes to loaded files are not executed

I'm debugging a program, and it's not executing the new puts in the loaded files. The main file is running, and changes on that file are executed. So I'm skeptical any changes are loaded.
What could I be doing wrong?
The first file works fine. its the second file... the newmods.rb file that the prompt isn't changing when run, and any new 'puts' statements are not taking.
I'm using load instead of require_relative. the file doesn't compile with require_relative.
File 1
#!/usr/bin/env ruby
load "/Users/abcde/Ruby/learning/shorterZork/monsters.rb"
load "/Users/abcde/Ruby/learning/shorterZork/room.rb"
load "/Users/abcde/Ruby/learning/shorterZork/avatar.rb"
load "/Users/abcde/Ruby/learning/shorterZork/newmods.rb"
load "/Users/abcde/Ruby/learning/shorterZork/items.rb"
def main()
game_loop(avatar)
end
def game_loop(avatar)
puts "You are in the: #{#room_name}"
puts "#{avatar.name} has #{avatar.life}!"
health_bar(avatar)
# Check for avatar's life.
#avatar_life(avatar)
#puts #room_monsters.to_s
monster = #room_monsters
test_encounter(avatar, monster)
# need to put the meat of the game here.
# puts "I need something here."
#puts "--------------------------------------------This is the main loop."
prompt; action = gets.chomp
if action.downcase == "help"
help()
..etc...etc.
end
main()
File 2 newmods.rb
# This includes any modules that might be nice to use.
load "/Users/abcde/Ruby/learning/shorterZork/monsters.rb"
load "/Users/abcde/Ruby/learning/shorterZork/room.rb"
load "/Users/abcde/Ruby/learning/shorterZork/avatar.rb"
load "/Users/abcde/Ruby/learning/shorterZork/items.rb"
#require 'items'
#require 'monsters'
#require 'avatar'
#require 'room'
#require 'shortZork'
def prompt()
print ">> "
end
From my experience, common mistakes in such cases are
The edit displayed on the editor/IDE is not saved to the file.
There are different versions of a file, and the edit is made on the version that is not used.
Joy. Just when you reach out for help the answer slaps you in the face.
The ruby load is loading a file from a different path than the current expected path. The username is different, so...

RubyZip Unzipping .docx, modifying, and zipping back up throws Errno::EACCESS error

So, I'm using Nokogiri and Rubyzip to unzip a .docx file, modify the word/docoument.xml file in it (in this case just change every element wrapped in to say "Dreams!"), and then zip it back up.
require 'nokogiri'
require 'zip'
zip = Zip::File.open("apple.docx")
doc = zip.find_entry("word/document.xml")
xml = Nokogiri::XML.parse(doc.get_input_stream)
inputs = xml.root.xpath("//w:t")
inputs.each{|element| element.content = "DREAMS!"}
zip.get_output_stream("word/document.xml", "w") {|f| f.write(xml.to_s)}
zip.close
Running the code through IRB line by line works perfectly and makes the changes to the .docx file as I needed, but if I run the script from the command line
ruby xmltodoc.rb
I receive the following error:
C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:416:in `rename': Permission denied - (C:/Users/Bane/De
sktop/apple.docx20150326-6016-k9ff1n, apple.docx) (Errno::EACCES)
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:416:in `on_success_replace'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:308:in `commit'
from C:/Ruby193/lib/ruby/gems/1.9.1/gems/rubyzip-1.1.7/lib/zip/file.rb:332:in `close'
from ./xmltodoc.rb:15:in `<main>'
All users on my computer have all permissions for that .docx file. The file also doesn't have any special settings--just a new file with a paragraph. This error only shows up on Windows, but the script works perfectly on Mac and Ubuntu. Running Powershell as Admin throws the same error. Any ideas?
On my Windows 7 system the following works.
require 'nokogiri'
require 'zip'
Zip::File.open("#{File.dirname(__FILE__)}/apple.docx") do |zipfile|
doc = zipfile.read("word/document.xml")
xml = Nokogiri::XML.parse(doc)
inputs = xml.root.xpath("//w:t")
inputs.each{|element| element.content = "DREAMS!"}
zipfile.get_output_stream("word/document.xml") {|f| f.write(xml.to_s)}
end
Instead you also could use the gem docx, here is an example, the names of the bookmarks are in dutch because, well that's the language my MS Office is in.
require 'docx'
# Create a Docx::Document object for our existing docx file
doc = Docx::Document.open('C:\Users\Gebruiker\test.docx'.gsub(/\\/,'/'))
# Insert a single line of text after one of our bookmarks
# p doc.bookmarks['bladwijzer1'].methods
doc.bookmarks['bladwijzer1'].insert_text_after("Hello world.")
# Insert multiple lines of text at our bookmark
doc.bookmarks['bladwijzer3'].insert_multiple_lines(['Hello', 'World', 'foo'])
# Save document to specified path
doc.save('example-edited.docx')

Open a local file with open-uri

I am doing data scraping with Ruby and Nokogiri. Is it possible to download and parse a local file in my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives error as:
No such file or directory # rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html
If you want to parse a local file with Nokogiri you can do it like this.
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)
When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.
This is how I do it as according to the documentation.
f = File.open("//home/nav/Desktop/Scraping/scrap1.html")
doc = Nokogiri::HTML(f)
f.close
I would make use of Mechanize and save the file locally, then parse it with Nokogiri like so:
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
file.save!("#{Rails.root}/tmp/")
# Read the file
page = Nokogiri::HTML::Reader(File.open(file))
Hope that helps!

Resources