Ruby Cucumber PDF reader - ruby

I'm running tests that render and check a PDF. It works, but the PDFs have a date stamp in the filename, and I'm looking for a way to always open the file generated today. I've tried the Date.today approach with no luck: PDF::Reader doesn't see the result as a valid filename. Here is my code so you can see what I'm trying to do:
today = Date.today
Given /^I open the saved PDF and confirm the VRM is "(.*?)"$/ do |vrm|
filename = 'C:\Users\user\Downloads\vehicle_summary_VRM_#{today}.pdf'
PDF::Reader.open(filename) do |reader|
reader.pages.each do |page|
expect(reader.page(1)).to have_content vrm
puts page.text
end
end
end
I get the following exception: input must be an IO-like object or a filename (ArgumentError)
Any ideas?
Thanks

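Single-quoted strings don't interpolate #{today}, so change the single quotes in:
filename = 'C:\Users\user\Downloads\vehicle_summary_VRM_#{today}.pdf'
to double quotes. Backslashes are escape characters inside double quotes (\u in \user is even a syntax error), so double them or use forward slashes:
filename = "C:\\Users\\user\\Downloads\\vehicle_summary_VRM_#{today}.pdf"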
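A minimal sketch of the difference, assuming the date in the filename uses the default Date#to_s format (YYYY-MM-DD). The helper name summary_pdf_path is made up for illustration; if the real files use another format, Date#strftime would be needed instead.

```ruby
require 'date'

# Hypothetical helper: builds the expected download path for a given date.
# Forward slashes avoid the backslash-escaping problem and work on Windows too.
def summary_pdf_path(date = Date.today, dir = 'C:/Users/user/Downloads')
  File.join(dir, "vehicle_summary_VRM_#{date}.pdf")
end

single = 'vehicle_summary_VRM_#{Date.today}.pdf' # single quotes: no interpolation
puts single                                      # the literal text #{Date.today} stays in the name
puts summary_pdf_path(Date.new(2024, 1, 31))
```

Calling summary_pdf_path with no arguments inside the step definition also ensures the date is evaluated when the step runs rather than when the file is loaded.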

Related

Replace text String from the shell disabling any regular expression

I need to replace a large set of broken HTML links in a file. For that, I need a find/replace with every kind of regular expression disabled, i.e. the basic Find/Replace you would do in Notepad.
I came across a Ruby script which should do exactly that:
ruby -p -i -e "gsub('Home', 'NEWLINK')" test.txt
However, the file test.txt is not changed, nor is any output returned. (I don't know much about Ruby, so I might just be missing something obvious.)
Is there any other tool which does what I need?
Edit: I'd expect that the following test.txt file:
Home
....is changed to:
NEWLINK
Thanks
Instead of a regular expression, consider using an HTML parser, which actually understands HTML and won't leave you with a broken HTML document.
# link_parser.rb
require 'bundler/inline'
gemfile do
  source 'https://rubygems.org'
  gem 'nokogiri'
end
fn = ARGV[0]
if fn && File.exist?(fn)
  puts "Processing #{fn}..."
  # 'r+' opens the file for both reading and writing
  File.open(fn, 'r+') do |file|
    doc = Nokogiri::HTML(file)
    links = doc.css('a[href="index.php?option=com_content&view=article&id=130&catid=111&Itemid=324"]')
    if links.any?
      links.each do |link|
        # attributes are set with hash-style access
        link['href'] = "NEWLINK"
      end
      file.rewind
      file.write(doc.to_s)
      # drop leftover bytes if the new document is shorter than the old one
      file.truncate(file.pos)
      puts "#{links.length} links replaced"
    else
      puts "No links found"
    end
  end
else
  puts "File not found."
end
ruby link_parser.rb path/to/file.html
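If a plain, literal find/replace is still what you want, note that String#gsub treats a String pattern (as opposed to a Regexp) literally, so characters such as ? and . need no escaping. A minimal sketch, using a made-up HTML snippet and link:

```ruby
# Made-up sample data for illustration
html     = '<a href="index.php?option=com_content&view=article&id=130">Home</a>'
old_link = 'index.php?option=com_content&view=article&id=130'

# The pattern is a String, so "?" and "." are matched as plain characters
replaced = html.gsub(old_link, 'NEWLINK')
puts replaced
```

One caveat: backslash sequences such as \1 in the replacement string are still interpreted as back-references, so a replacement containing backslashes is safer passed via the block form, e.g. gsub(old_link) { new_link }.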

Ruby - How to add EOF marker into a PDF file or otherwise bypass PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm using the Mechanize ruby gem to click a button on the web to download a PDF file and save it to the local file system.
URL = "www.my-site.com"
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here
page = agent.get(URL)
form = page.forms.first
button = page.form.button_with(:value => "Some Button Text")
local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)
But when I try to read this PDF file using the PDF::Reader gem, I get an error "PDF does not contain EOF marker".
reader = PDF::Reader.new(local_file) # this also happens if I try to use PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io) depending on the different pluggable_parser configurations mentioned above
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker
I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader gem is complaining about it missing an EOF marker.
So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?
Thanks.
Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader
Related Docs:
http://mechanize.rubyforge.org/Mechanize/File.html
http://mechanize.rubyforge.org/Mechanize/Download.html
http://mechanize.rubyforge.org/Mechanize/FileSaver.html
https://github.com/yob/pdf-reader
EDIT:
I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking stuff that I can't seem to figure out how to get rid of. I want to isolate the PDF content and then parse that, but still running into issues. Here is the full script I am using:
https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf
The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org
They add HTML data at the end of the response.
However, you could truncate the response by searching for the first %%EOF marker and removing all the data after it.
i.e.:
pdf_data = response.body
eof_index = pdf_data.index("%%EOF")
if eof_index.nil?
  # handle error: no EOF marker found
else
  # keep everything up to and including the marker ("%%EOF" is 5 characters)
  pdf_data = pdf_data[0, eof_index + 5]
  # save/send pdf_data
end
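The truncation idea can be sketched as a small standalone helper. The name strip_after_eof and the sample data are made up for illustration. This variant keeps everything up to the last %%EOF rather than the first, because incrementally updated PDFs can contain several markers; that assumes the trailing HTML does not itself contain the marker.

```ruby
# Hypothetical helper: returns the data up to and including the last "%%EOF",
# or nil when the data contains no marker at all.
def strip_after_eof(data)
  marker = "%%EOF"
  idx = data.rindex(marker)
  return nil if idx.nil?
  data[0, idx + marker.length]
end

# Made-up stand-in for a downloaded response body with HTML appended
sample = "%PDF-1.4\n...objects...\n%%EOF\n<html><body>extra</body></html>"
clean  = strip_after_eof(sample)
puts clean.end_with?("%%EOF")
```

The cleaned string can then be written to disk in binary mode, or wrapped in a StringIO and handed to PDF::Reader.new, which takes an IO-like object or a filename.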

Open URLs from CSV

I am using Ruby 2.1.0p0 on Mac OS.
I'm parsing a CSV file and grabbing all the URLs, then using Nokogiri and OpenURI to scrape them which is where I'm getting stuck.
When I try to use an each loop to run through the URLs array, I get this error:
initialize': No such file or directory # rb_sysopen - URL (Errno::ENOENT)
When I manually create an array, and then run through it I get no error. I've tried to_s, URI::encode, and everything I could think of and find on Stack Overflow.
I can copy and paste the URL from the CSV, or from the terminal after using puts on the array, and it opens in my browser with no problem. But when I try to open it with Nokogiri, it fails.
Here's my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'csv'
events = Array.new
CSV.foreach('productfeed.csv') do |row|
events.push URI::encode(row[0]).to_s
end
events.each do |event|
page = Nokogiri::HTML(open("#{event}"))
#eventually, going to find info on the page, and scrape it, but not there yet.
#something to show I didn't get an error
puts "open = success"
end
Please help! I am completely out of ideas.
It looks like you're processing the header row, where one of the values is literally "URL". That's not a valid URI, so open-uri won't touch it and the call falls through to opening a local file named URL.
There's a headers option for the CSV module that will make use of the headers automatically. Try turning that on and referring to row["URL"]
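A minimal sketch of the headers option, assuming the column is named URL as the error message suggests. The feed contents here are invented; the real data lives in productfeed.csv.

```ruby
require 'csv'

# Invented stand-in for the contents of productfeed.csv
feed = <<~FEED
  URL,Name
  http://example.com/a,First
  http://example.com/b,Second
FEED

# With headers: true the first row is consumed as column names,
# so it is never yielded as data and rows can be indexed by name
urls = CSV.parse(feed, headers: true).map { |row| row["URL"] }
puts urls.inspect
```

The same option works when reading from disk: CSV.foreach('productfeed.csv', headers: true) { |row| events.push row["URL"] }.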
I tried doing the same thing and found it to work better using a text file.
Here is what I did.
#!/usr/bin/env python3
# import the os, time, and webbrowser modules
import os
import time
import webbrowser
# open text file as "dataFile" and verify there is data in said file
dataFile = open('/home/user/Desktop/urls.txt', 'r')
if os.path.getsize(dataFile.name) > 0:
    print("Data file opened successfully")
else:
    print("!!!!NO DATA IN FILE!!!!")
    exit()
# read file line by line, remove any spaces/newlines, and open link in chromium-browser
for line in dataFile:
    url = line.strip()
    print("Opening " + url)
    webbrowser.get('chromium-browser').open_new_tab(url)
# close file and exit
print("Closing Data File")
dataFile.close()
# wait two seconds before printing "Data file closed".
# this is purely for visual effect.
time.sleep(2)
print("Data file closed")
# after opener has run, user is prompted to press enter key to exit.
input("\n\nURL Opener has run. Press the enter key to exit.")
exit()
Hope this helps!

ruby rss module not reading full path

I am downloading an rss file posted as xml, and saving it with the rss extension.
I then use the rss module to read it as an rss file. The issue I have is the following:
If I create the file (page.rss) with an implicit path and I use just
that filename to process it with the rss parsing function, everything
is fine (downloaded_file = 'page.rss')
If I explicitly enter the full path into the script by hand (downloaded_file = "E:/Libraries/Documents/Android dev/page.rss"), everything works fine as well.
But if I "calculate" the value of the absolute path with: downloaded_file = File.join(Dir.pwd, 'page.rss') the rss function fails. The value of the variable is apparently the same ("E:/Libraries/Documents/Android dev/page.rss") but there must be an invisible difference. I would like to be able to use the 'calculated' absolute path. I am sure there is a subtle difference in the way this string is interpreted by the rss function. How can I elucidate it?
Thanks for any suggestion.
Here is my script:
require 'rss'
require 'open-uri'
url = 'http://tutorialspoint.com/android/sampleXML.xml'
downloaded_file = File.join(Dir.pwd, 'page.rss') # FAILS
puts "Path = #{downloaded_file}"#=> "E:/Libraries/Documents/Android dev/page.rss"
downloaded_file = 'page.rss' # WORKS
#downloaded_file = "E:/Libraries/Documents/Android dev/page.rss" # WORKS
puts "Used path/filename: #{downloaded_file}"
File.open(downloaded_file, 'wb') do |file| # Download url content into rss file
file << open(url).read
end
rss = RSS::Parser.parse(downloaded_file, false) # Read rss from downloaded_file
puts "Title: #{rss.channel.title}"
NEW ANSWER
Okay, so your downloaded_file string has been marked as tainted, and RSS::Parser won't open a tainted file string for some reason (see rss/parser.rb around line 105 for more details). Note that taint tracking was deprecated in Ruby 2.7 and untaint was removed in Ruby 3.2, so this applies to older Rubies. The solution is either to untaint the downloaded_file string before you call parse, e.g.:
RSS::Parser.parse(downloaded_file.untaint, false)
or to just open the file for the parser, e.g.:
RSS::Parser.parse(File.open(downloaded_file), false)
I'd never run into this issue before, so thanks! I'd heard of object tainting before, but I never really had any use to look into it. There is a bit more information about it here: What are tainted objects, and when should we untaint them?.
PREVIOUS ANSWER
Dir.pwd is going to change depending on where you call the script from. Unless you are calling the script from E:/Libraries/Documents/Android dev, the filepath will be off.
It's better to build your filepath from the location of your script itself. To do so you can add:
ROOT = File.expand_path('..', __FILE__)
downloaded_file = File.join(ROOT, 'page.rss')
# or just downloaded_file = File.expand_path('../page.rss', __FILE__)

Ruby System Call Executing Before Script Finishes

I have a Ruby script that produces a Latex document using an erb template. After the .tex file has been generated, I'd like to make a system call to compile the document with pdflatex. Here are the bones of the script:
class Book
# initialize the class, query a database to get attributes, create the book, etc.
end
my_book = Book.new
tex_file = File.open("/path/to/raw/tex/template")
template = ERB.new(tex_file.read)
f = File.new("/path/to/tex/output.tex")
f.puts template.result
system "pdflatex /path/to/tex/output.tex"
The system line puts me in interactive tex input mode, as if the document were empty. If I remove the call, the document is generated as normal. How can I ensure that the system call isn't made until after the document is generated? In the meantime I'm just using a bash script that calls the ruby script and then pdflatex to get around the issue.
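File.new opens a new stream, and the buffered output isn't guaranteed to be written to disk until the stream is closed, which happens when the script ends or when you close it manually. Also note that File.new needs the 'w' mode to open the file for writing.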
This should work:
...
f = File.new("/path/to/tex/output.tex", 'w')
f.puts template.result
f.close
system "pdflatex /path/to/tex/output.tex"
Or a more friendly way:
...
File.open("/path/to/tex/output.tex", 'w') do |f|
f.puts template.result
end
system "pdflatex /path/to/tex/output.tex"
File.open with a block opens the stream, makes it accessible via the block variable (f in this example), and closes it automatically when the block finishes. The 'w' mode opens or creates the file; if the file already exists, it is truncated (its content is erased).
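A minimal sketch of the block form in action, using a temporary directory and a one-line made-up template in place of the real paths. Reading the file back stands in for the pdflatex call, since any program started after the block sees the complete file.

```ruby
require 'erb'
require 'tmpdir'

# Made-up one-line template and value for illustration
template = ERB.new("\\title{<%= title %>}")
title    = "My Book"
contents = nil

Dir.mktmpdir do |dir|
  path = File.join(dir, "output.tex")

  # The stream is flushed and closed as soon as the block returns
  File.open(path, 'w') do |f|
    f.puts template.result(binding)
  end

  # Anything run from here on, such as system("pdflatex", path), sees the full file
  contents = File.read(path)
end

puts contents
```

Passing the command and the path to system as separate arguments, as in the comment, avoids shell quoting problems when the path contains spaces.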

Resources