Ruby - How to add EOF marker into a PDF file or otherwise bypass PDF::Reader::MalformedPDFError: PDF does not contain EOF marker

I'm using the Mechanize ruby gem to click a button on the web to download a PDF file and save it to the local file system.
URL = "www.my-site.com"
agent = Mechanize.new
agent.pluggable_parser.pdf = Mechanize::File # FYI I have also tried Mechanize::FileSaver and Mechanize::Download here
page = agent.get(URL)
form = page.forms.first
button = page.form.button_with(:value => "Some Button Text")
local_file = "path/to/file.pdf"
response = agent.submit(form, button)
response.save_as(local_file)
But when I try to read this PDF file using the PDF::Reader gem, I get an error "PDF does not contain EOF marker".
reader = PDF::Reader.new(local_file) # this also happens if I try to use PDF::Reader.new(response.body) and PDF::Reader.new(response.body_io) depending on the different pluggable_parser configurations mentioned above
#> PDF::Reader::MalformedPDFError: PDF does not contain EOF marker
I'm able to save the PDF locally and view it and it looks fine, but the PDF::Reader gem is complaining about it missing an EOF marker.
So my question is: is there a way I could add an EOF marker into the PDF or something to get around this error so I can parse the PDF?
Thanks.
Related (unanswered) question: PDF does not contain EOF marker (PDF::Reader::MalformedPDFError) with pdf-reader
Related Docs:
http://mechanize.rubyforge.org/Mechanize/File.html
http://mechanize.rubyforge.org/Mechanize/Download.html
http://mechanize.rubyforge.org/Mechanize/FileSaver.html
https://github.com/yob/pdf-reader
EDIT:
I found the EOF marker somewhere in the middle of the downloaded file contents, followed by some HTML-looking content that I can't figure out how to get rid of. I want to isolate the PDF content and then parse that, but I am still running into issues. Here is the full script I am using:
https://gist.github.com/s2t2/c6766846d024edd696586b2bc7fee0bf

The issue seems to be with the website you're accessing: http://employmentsummary.abaquestionnaire.org
They append HTML data to the end of the response.
However, you could truncate the response by searching for the first occurrence of the %%EOF marker and removing all the data after it.
For example:
pdf_data = response.body
# index returns nil when the marker is missing
eof_index = pdf_data.index("%%EOF")
if eof_index.nil?
# handle error
else
# keep everything up to and including the 5-character marker
pdf_data = pdf_data[0, eof_index + 5]
# save/send pdf_data
end
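The truncation idea can also be wrapped in a small helper. The following is only a sketch: `truncate_after_eof` and `PDF_EOF_MARKER` are hypothetical names, the sample string is made up, and the helper cuts after the last `%%EOF` rather than the first, on the assumption that the trailing HTML does not itself contain that marker (a PDF saved with incremental updates may legitimately contain several).

```ruby
require 'stringio'

# Hypothetical constant: the PDF end-of-file marker as a binary string.
PDF_EOF_MARKER = '%%EOF'.b

# Hypothetical helper: returns the bytes up to and including the last
# "%%EOF" marker, or nil when the data contains no marker at all.
def truncate_after_eof(data)
  bytes = data.b # binary copy, so indexes are byte offsets
  eof_index = bytes.rindex(PDF_EOF_MARKER)
  return nil if eof_index.nil?

  bytes[0, eof_index + PDF_EOF_MARKER.bytesize]
end

# Made-up sample: PDF-like content followed by trailing HTML.
sample = "%PDF-1.4\n...objects...\n%%EOF\n<html>trailing junk</html>"
clean = truncate_after_eof(sample)
puts clean.end_with?('%%EOF') # prints true

# An IO-like object that a PDF parser could read from.
pdf_io = StringIO.new(clean)
```

With the pdf-reader gem loaded, `PDF::Reader.new(pdf_io)` would be the next step to try, since the question already passes an IO object (`response.body_io`) to that constructor.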

Related

How to replace the first few bytes of a file in Ruby without opening the whole file?

I have a 30MB XML file that contains some gibberish in the beginning, and so typically I have to remove that in order for Nokogiri to be able to parse the XML document properly.
Here's what I currently have:
contents = File.open(file_path).read
if contents[0..123].include? 'authenticate_response'
fixed_contents = File.open(file_path).read[123..-1]
File.open(file_path, 'w') { |f| f.write(fixed_contents) }
end
However, this actually causes the ruby script to open up the large XML file twice. Once to read the first 123 characters, and another time to read everything but the first 123 characters.
To solve the first issue, I was able to accomplish this:
contents = File.open(file_path).read(123)
However, now I need to remove these characters from the file without reading the entire file. How can I "trim" the beginning of this file without having to open the entire thing in memory?
You can open the file once, then read and check the "garbage", and finally pass the opened file directly to Nokogiri for parsing. That way, you only need to read the file once and don't need to write it at all.
File.open(file_path) do |xml_file|
if xml_file.read(123).include? 'authenticate_response'
# header found, nothing to do
else
# no header found. We rewind and let nokogiri parse the whole file
xml_file.rewind
end
xml = Nokogiri::XML.parse(xml_file)
# Now do whatever you want with the parsed XML document
end
Please refer to the documentation of IO#read, IO#rewind and Nokogiri::XML::Document.parse for details about those methods.
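The read-then-rewind pattern can be tried without Nokogiri by using in-memory IO objects. This is a minimal sketch: `skip_header` is a hypothetical helper, and the 123-byte header filled with `#` characters is made-up sample data standing in for the real "gibberish".

```ruby
require 'stringio'

# Hypothetical helper: leaves io positioned just past a fixed-size header
# when that header contains marker, otherwise rewinds to the start.
# Returns io so it can be handed straight to a parser.
def skip_header(io, header_size, marker)
  header = io.read(header_size) # nil when io is empty
  io.rewind unless header && header.include?(marker)
  io
end

# Made-up 123-byte header followed by a tiny XML document.
junk = 'authenticate_response'.ljust(123, '#')
with_header = StringIO.new(junk + '<root><item/></root>')
without_header = StringIO.new('<root><item/></root>')

# Both calls print <root><item/></root>
puts skip_header(with_header, 123, 'authenticate_response').read
puts skip_header(without_header, 123, 'authenticate_response').read
```

Because `IO#read(length)` advances the read position, whatever consumes the IO afterwards starts right after the header, and nothing has to be written back to disk.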

How to create, read and transform an XML file with Ruby

I am downloading an XML record from Musicbrainz.org, applying an XSLT transformation and outputting a new and different XML record.
I am running into one issue that I wonder if it is a limitation with my approach, XSLT transformations or applying Ruby to text.
I download the record:
require 'open-uri'
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
File.open('mb_record.xml', 'w').write(mb_metadata)
This works fine.
Then I want to transform that record. First I tried using Nokogiri:
# mb_metadata to transformed record
mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this transforms the input document with the template.xslt
puts template.transform(mb_record)
If I run this on its own, it works. However, if I download the record and then run this in the same script, it doesn't: it produces a transformed record that contains only some inserts, and no element from the original XML file is transformed.
So I thought this might be an issue with Nokogiri and then I tried using the Ruby/XSLT gem:
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
Again, if I'm running this on a local file it works, but if I download it and try to transform it it doesn't work - it produces the following error:
xslt.xml = 'mb_record.xml'
Both methods work fine if I just run them on a file which has been downloaded already.
So what's the issue? Is it a naming problem, an XSLT issue, or something else?
Here's the whole script:
#!/usr/bin/env ruby
# encoding: UTF-8
require 'rubygems' if RUBY_VERSION >= '1.9'
require 'pathname'
require 'httpclient'
require 'xml/xslt'
require 'nokogiri'
require 'open-uri'
# DOWNLOAD RECORD FROM MusicBrainz.org - this works
mb_metadata = open('http://musicbrainz.org/ws/2/release/?query=barcode:744861082927', 'User-Agent' => 'MarcBrainz marc4brainz#gmail.com').read
#puts record
File.open('mb_record.xml', 'w').write(mb_metadata)
# mb_metadata to transformed record - this works on a saved file but not if the file is created earlier in this file .
#
#mb_record = Nokogiri::XML(File.read('mb_record.xml'))
#if we have the xslt document locally this introduces it
#template = Nokogiri::XSLT(File.read('mb_to_marc.xsl'))
# this is supposed to transform the input document with the template.xslt
#puts template.transform(mb_record)
# TRYING ANOTHER TACK
# This works if acting on a saved file, i.e. if I comment out the nokogiri lines above and just run the lines below - the xml printed by 'print out' is correctly transformed by the xslt to produce more xml.
# I added 'sleep 3' to see if that would help but it doesn't make a difference.
xslt = XML::XSLT.new()
xslt.xml = 'mb_record.xml'
xslt.xsl = 'mb_to_marc.xsl'
out = xslt.serve()
print out;
File.open('mb_record.xml', 'w').write(mb_metadata)
is better written as
File.write('mb_record.xml', mb_metadata)
The first will result in a file that hasn't been closed, and possibly not flushed to the disk, which can mean the file has no contents, or only partial contents.
The second writes the file and immediately flushes and closes it.
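Both safe ways of writing the file can be compared in a short sketch. The helper names `write_with_block` and `write_in_one_call` are hypothetical, and the XML string is made-up sample data rather than a real MusicBrainz response.

```ruby
require 'tmpdir'

# Hypothetical helper: block form of File.open. The file is flushed and
# closed when the block returns, so it is safe to read back right away.
def write_with_block(path, data)
  File.open(path, 'w') { |f| f.write(data) }
  File.read(path)
end

# Hypothetical helper: File.write opens, writes, and closes in one call.
def write_in_one_call(path, data)
  File.write(path, data)
  File.read(path)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'mb_record.xml')
  data = '<metadata><release-list count="1"/></metadata>' # made-up sample
  puts write_with_block(path, data) == data   # prints true
  puts write_in_one_call(path, data) == data  # prints true
end
```

In both variants the file is closed before the next step reads it, which is the property the unclosed `File.open(...).write(...)` call does not guarantee.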

Open URLs from CSV

I am using Ruby 2.1.0p0 on Mac OS.
I'm parsing a CSV file and grabbing all the URLs, then using Nokogiri and OpenURI to scrape them which is where I'm getting stuck.
When I try to use an each loop to run through the URLs array, I get this error:
`initialize': No such file or directory @ rb_sysopen - URL (Errno::ENOENT)
When I manually create an array, and then run through it I get no error. I've tried to_s, URI::encode, and everything I could think of and find on Stack Overflow.
I can copy and paste the URL from the CSV, or from the terminal after using puts on the array, and it opens in my browser with no problem. But when I try to open it with Nokogiri, it fails.
Here's my code:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'uri'
require 'csv'
events = Array.new
CSV.foreach('productfeed.csv') do |row|
events.push URI::encode(row[0]).to_s
end
events.each do |event|
page = Nokogiri::HTML(open("#{event}"))
#eventually, going to find info on the page, and scrape it, but not there yet.
#something to show I didn't get an error
puts "open = success"
end
Please help! I am completely out of ideas.
It looks like you're processing the header row, where one of those values is literally "URL". That's not a valid URI, so open-uri won't touch it.
There's a headers option for the CSV module that will make use of the headers automatically. Try turning that on and referring to row["URL"].
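As a rough illustration of the headers option, the sketch below parses an inline CSV string instead of productfeed.csv. The "URL" column name comes from the error message above; the "Name" column and the example.com addresses are made up.

```ruby
require 'csv'

# Made-up feed: only the "URL" header is taken from the question.
csv_text = <<~FEED
  URL,Name
  http://example.com/a,First event
  http://example.com/b,Second event
FEED

# With headers: true the first row supplies the column names and is not
# yielded as data, so the literal string "URL" never reaches open.
urls = CSV.parse(csv_text, headers: true).map { |row| row['URL'] }
puts urls.inspect
```

For the file in the question, the same option can be passed to the existing loop, as in `CSV.foreach('productfeed.csv', headers: true) { |row| events.push row['URL'] }`.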
I tried doing the same thing and found it to work better using a text file.
Here is what I did.
#!/usr/bin/python
# import webbrowser module and time module
import webbrowser
import time

# open text file as "dataFile", read its lines, and verify there is data in said file
dataFile = open('/home/user/Desktop/urls.txt', 'r')
lines = dataFile.readlines()
if len(lines) > 0:
    print("Data file opened successfully")
else:
    print("!!!!NO DATA IN FILE!!!!")
    dataFile.close()
    exit()

# go through the lines, remove any surrounding spaces/newlines, and open each link in chromium-browser
for line in lines:
    url = line.strip()
    print("Opening " + url)
    webbrowser.get('chromium-browser').open_new_tab(url)

# close file
print("Closing Data File")
dataFile.close()
# wait two seconds before printing "Data file closed".
# this is purely for visual effect.
time.sleep(2)
print("Data file closed")
# after opener has run, user is prompted to press enter key to exit.
# raw_input is Python 2; on Python 3 use input() instead.
raw_input("\n\nURL Opener has run. Press the enter key to exit.")
exit()
Hope this helps!

Open a local file with open-uri

I am doing data scraping with Ruby and Nokogiri. Is it possible to download and parse a local file on my computer?
I have:
require 'open-uri'
url = "file:///home/nav/Desktop/Scraping/scrap1.html"
It gives this error:
No such file or directory @ rb_sysopen - file:\home/nav/Desktop/Scraping/scrap1.html
If you want to parse a local file with Nokogiri you can do it like this.
file = File.read('/home/nav/Desktop/Scraping/scrap1.html')
doc = Nokogiri::HTML(file)
When you open a local file in a browser, the URL in the address bar is displayed as:
file:///Users/7stud/Desktop/accounts.txt
But that doesn't mean you use that format in a Ruby script. Your Ruby script doesn't send the file name to a browser and then ask the browser to retrieve the file. Your Ruby script searches your file system directly.
The same is true for URLs: your Ruby script doesn't ask your browser to go retrieve a page from the internet, Ruby retrieves the page itself by sending a request using your system's network interface. After all, a browser and a Ruby program are both just computer programs. What your browser can do over a network, a Ruby program can do, too.
This works for me:
require 'open-uri'
text = open('./data.txt').read
puts text
You have to get your path right, though. The only reason I can think of to use open() is if you had an array of filenames and URLs mixed together. If that isn't your situation, see new2code's answer.
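If the input really does arrive as a file:// URL, one option is to convert it to a plain path before reading. The sketch below is an assumption-laden illustration: `local_path` is a hypothetical helper, it does not decode percent-escapes, and `URI.parse` raises an error for paths that contain spaces.

```ruby
require 'uri'
require 'tempfile'

# Hypothetical helper: returns the file-system path for a file:// URL
# and returns any other string unchanged.
def local_path(location)
  uri = URI.parse(location)
  uri.scheme == 'file' ? uri.path : location
end

puts local_path('file:///home/nav/Desktop/Scraping/scrap1.html')
# prints /home/nav/Desktop/Scraping/scrap1.html

# Round trip with a temporary file, so the example does not depend on
# any file that exists only on the asker's machine.
Tempfile.create(['scrap', '.html']) do |tmp|
  tmp.write('<html><body>hello</body></html>')
  tmp.flush
  puts File.read(local_path("file://#{tmp.path}"))
end
```

The string returned by `File.read` can then be passed to `Nokogiri::HTML` exactly as in the first answer.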
This is how I do it, according to the documentation.
f = File.open("/home/nav/Desktop/Scraping/scrap1.html")
doc = Nokogiri::HTML(f)
f.close
I would make use of Mechanize and save the file locally, then parse it with Nokogiri like so:
# Save the file
agent = Mechanize.new
agent.pluggable_parser.default = Mechanize::Download
current_url = 'http://www.example.com'
file = agent.get(current_url)
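# save! expects a file name, not a directory; "page.html" is a hypothetical name
local_path = "#{Rails.root}/tmp/page.html"
file.save!(local_path)
# Read the file
page = Nokogiri::HTML(File.read(local_path))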
Hope that helps!

Uploading Images through Sinatra

I'm using the example code from this page:
http://www.wooptoot.com/file-upload-with-sinatra
When I try to upload an image file (png or jpg), it uploads successfully and I can see the file in the proper directory, but it gets corrupted in the process. I cannot open the image. Doing a diff with the original files, I see several newlines that are missing in the uploaded version.
I'm running Ruby 1.9.3p392 on Windows.
Edit:
I tried a test outside the context of Sinatra
File.open('57-new.jpg', "wb") do |f|
f.write(File.open('57.jpg', 'rb').read)
end
That works. The only difference is the addition of the binary flags. When using Sinatra I can set the binary flag on the write operation, but I'm not sure how I can set it on the read since I seem to be passed a file object by the request.
File.open('uploads/' + params['myfile'][:filename], "wb") do |f|
f.write(params['myfile'][:tempfile].read)
end
Okay, so it looks like all I needed was the binary flag when opening the new file.
File.open('uploads/' + params['myfile'][:filename], "wb") do |f|
f.write(params['myfile'][:tempfile].read)
end
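The effect of the binary flag can be checked outside Sinatra with a small round trip. This is only a sketch: `copy_binary` is a hypothetical helper and the byte string is made-up sample data that mimics the start of a PNG file, including the CR LF sequence that text mode on Windows can alter.

```ruby
require 'tmpdir'

# Hypothetical helper: copies src to dest with both files opened in
# binary mode, so no newline translation is applied on either side.
def copy_binary(src, dest)
  File.open(src, 'rb') do |input|
    File.open(dest, 'wb') { |output| IO.copy_stream(input, output) }
  end
end

Dir.mktmpdir do |dir|
  src = File.join(dir, 'original.png')
  dest = File.join(dir, 'uploaded.png')
  # Made-up bytes resembling a PNG signature.
  File.binwrite(src, "\x89PNG\r\n\x1A\n".b)

  copy_binary(src, dest)
  puts File.binread(dest) == File.binread(src) # prints true
end
```

In the upload handler, `params['myfile'][:tempfile]` plays the role of the input, which is why opening only the destination with "wb" was enough there.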
