Has anyone found a way to test PDFs with Ruby from within the browser? I have tried a few different approaches, and the only one that has worked is to save the PDF off and read it with the pdf-reader gem. That only seems to work on PDFs that, when the link is clicked, open a dialog box with options to open or save the file. Unfortunately, I have not found a way to do anything similar with PDFs that open directly in the browser, with no dialog offering to save them. Any ideas?
Maybe testing it in the browser isn't the best way. When you say "test the PDF", what exactly are you trying to do? I wouldn't test the PDF in the browser if I were you.
Try docsplit if you want to verify its contents.
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text via OCR if necessary, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
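For example, a minimal sketch of checking a PDF's contents with docsplit (the file name and expected text are placeholders; Docsplit writes one .txt per input file into the output directory):

require 'docsplit'

# Extract the PDF's text into tmp/text/statement.txt, then assert on it.
Docsplit.extract_text(['statement.pdf'], :output => 'tmp/text')
text = File.read('tmp/text/statement.txt')
raise 'expected content missing' unless text.include?('Account Summary')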
You are not inventing a browser or a PDF generator.
Use unit tests to check that your back-end modules can take data in and write PDF out, then serve the PDF from a website and let the browser do its thing. Test (as what Rails calls a "functional test") that the MVC stack produces a web page containing a link to the PDF, and you are done.
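As a rough illustration, a functional test along those lines might look like this (controller name, route, and link target are assumptions, not taken from the question):

class ReportsControllerTest < ActionController::TestCase
  def test_page_links_to_the_generated_pdf
    get :show, :id => 1
    assert_response :success
    # The rendered page should contain an anchor pointing at the PDF.
    assert_select 'a[href=?]', '/reports/1.pdf'
  end
end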
You can use the mechanize gem to download an online PDF (even one that opens within the browser) to your computer, and then read it with the pdf-reader gem.
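A sketch of that approach (the URL is a placeholder; the gems are mechanize and pdf-reader):

require 'mechanize'
require 'pdf-reader'

agent = Mechanize.new
# Treat the response as a download rather than trying to parse it as HTML.
agent.pluggable_parser.default = Mechanize::Download
agent.get('http://example.com/report').save('report.pdf')

# Read the saved file back and print each page's text.
PDF::Reader.new('report.pdf').pages.each do |page|
  puts page.text
end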
I am trying to read a PDF as text, and I can write it back out with junk in it, which is fine, as I have a parser component to get the bits I need.
My question is: how can I read specific parts of the PDF and ignore the rest?
If your PDF is well formatted, you can do it using text scraping, but that means you need to open the PDF file, and it must be visible on screen for native scraping to work.
I need to download a PDF from a website that does not provide a link ending in .pdf, using Ruby. Manually, when I click the link to download the PDF, it takes me to a new page, and the dialog box to save/open the file appears after some time.
Please help me download the file.
The link: http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
You can do this:
require 'open-uri'

# open-uri patches Kernel#open to accept URLs (on Ruby 3+ use URI.open instead).
File.open('my_file_name.pdf', "wb") do |file|
  file.write open('http://someurl.com/2013-1-2/somefile/download').read
end
I have been doing this for my projects and it works.
If you just need a simple Ruby script to do it, I'd just run wget, like this: exec 'wget "http://path.to.the.file/and/some/params"'
At that point, though, you might as well run wget directly.
The other way is to just run a GET against the page you know the PDF is at:

# Net::HTTP.get takes a host and a path here, not a full URL.
source = Net::HTTP.get("the.website.com", "/and/some/params")
There are a number of other HTTP clients you could use, but as long as you make a GET request to the endpoint the PDF lives at, it should give you the raw data. Then you can just write that data to a file with a .pdf name, and you'll have the PDF.
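Putting that together, a minimal sketch (same placeholder host and path as above):

require 'net/http'

source = Net::HTTP.get('the.website.com', '/and/some/params')
# The raw response body is the PDF; write it out in binary mode.
File.open('downloaded.pdf', 'wb') { |file| file.write(source) }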
In your case, I ran the following commands to get the PDF:
wget http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/func-download/129/chk,d8c4644b0f086a04d8d363cb86fb1647/no_html,1/
mv index.html thefile.pdf
Then open the PDF. Note that these are Linux commands. If you want to get the file with a Ruby script, you could use something like what I mentioned previously.
Update:
There is an added complication that was not stated initially: the URL to the PDF changes every time the PDF is updated. To make this work, you will probably need to do some web scraping; I suggest nokogiri. That way you can look at the page where the download lives and then perform a GET request on the desired URL. Furthermore, the server that hosts the PDF is misconfigured and breaks Chrome within a few seconds of opening the page.
How to solve this: I went to the site and refreshed it, then broke the connection to the server (press the X where the refresh button would otherwise be). Then right-click next to the download link and select "Inspect element". Browse the DOM to find something definitively identifying (like an id). Thankfully, I found <strong id="telecharger"> Download</strong>. This means you can use something like page.css('strong#telecharger')[0].parent['href'] to get the URL, and then perform a GET request as described above. I don't have time to write the script for you (too much work to do), but this should be enough to solve the problem.
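A sketch of that scraping approach (the page URL is an assumption; the selector comes from the element found above):

require 'nokogiri'
require 'open-uri'

# Load the page that contains the download link, then pull the href
# off the anchor wrapping the <strong id="telecharger"> element.
page = Nokogiri::HTML(open('http://www.lawcommission.gov.np/en/documents/prevailing-laws/constitution/'))
pdf_url = page.css('strong#telecharger')[0].parent['href']

File.open('constitution.pdf', 'wb') do |file|
  file.write open(pdf_url).read
end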
I have a site - www.jcrocetta.com.
On this site I have two PDF files. One has blurred data and the other is clear; both were created with pdftk.
To blur out some personal data in the PDF I used Inkscape, but Inkscape only opens/edits one PDF page at a time. After I made my edits in Inkscape, I saved the results as .pdf files. At that point I had three separate PDF files, pages 1 through 3. I then used pdftk to concatenate the three files into one.
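The pdftk step was along these lines (file names here are illustrative):

pdftk page1.pdf page2.pdf page3.pdf cat output final.pdf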
The final pdftk-produced files are on www.jcrocetta.com. Just click the public information button.
In Chrome, viewing inline works fine.
Downloading the file from Firefox works fine too.
But viewing inline in Firefox renders blank pages. How can I fix this?
Also, I know that PDF files not produced with pdftk render correctly in both Chrome and Firefox.
Thanks for your help.
Firefox has a lovely new feature: it now uses the PDF.js library to render PDF files, instead of calling out to an Adobe Reader plugin or forcing you to save the file to disk. Unfortunately, it seems that PDF.js isn't quite perfect yet. A quick search shows that other people have the same issue, but the only "solution" I've seen offered boils down to "file a bug report at https://github.com/mozilla/pdf.js/issues or https://bugzilla.mozilla.org/enter_bug.cgi?product=Firefox&component=PDF+Viewer".
Also: do the three individual PDF files render in Firefox, before you use pdftk to concatenate them?
Cross-posted from the Cukes Google Group:
I have experimented with a number of methods of saving screenshots, but settled on the method that is built into watir-webdriver. No matter which method I have used, I have not been able to successfully embed a link to this image in the Cucumber HTML report.
In c:\ruby\cucumber\project_name\features\support\hooks.rb, I'm using:
After do |scenario|
  if scenario.failed?
    @browser.driver.save_screenshot("screenshot.png")
    embed("screenshot.png", "image/png")
  end
end
A link with the text "Screenshot" IS added to the report, but the URL is the project directory path ("c:\ruby\cucumber\project_name") rather than a direct link to the file ("c:\ruby\cucumber\project_name\screenshot.png"). I have tried a number of different image formats and direct paths using Dir.pwd, with the same results each time. What am I missing?
Thanks
Windows XP
Ruby 1.8.7
watir-webdriver (0.2.4)
cucumber (0.10.3)
Aslak:
Try this:
After do |scenario|
  if scenario.failed?
    encoded_img = @browser.driver.screenshot_as(:base64)
    embed("data:image/png;base64,#{encoded_img}", 'image/png')
  end
end
Aslak
Adam:
Aslak was able to see the embedded image in the file that I emailed him, while I was still unable to do so in IE 8. I tried it out in Firefox 3.6 and the image appears as expected. The problem may have originally been with the embedding method itself (or rather, my use of it), but using Aslak's base64 solution it only fails to work in Internet Explorer.
Aslak:
I believe Base64-encoding of images in HTML pages [1] works in all decent browsers (sorry, IE is not one of them). However, it should work in IE: http://dean.edwards.name/weblog/2005/06/base64-ie/ (but maybe they broke it in IE8, or maybe it only works with gifs, or maybe IE needs a special kind of base64 encoding, or maybe you should just ditch IE).
If being able to read Cucumber HTML reports with screenshots in IE is really important to you, you could always write each image to disk:
png = @browser.driver.screenshot_as(:png)
# Or use some GUID library to make a unique filename - scenario names are not guaranteed to be unique.
path = (0..16).to_a.map { rand(16).to_s(16) }.join + '.png'
File.open(path, 'wb') { |io| io.write(png) }
embed(path, 'image/png')
Obviously you have to make sure the relative path you pass to embed is right (depending on where you write the HTML itself).
[1] http://en.wikipedia.org/wiki/Data_URI_scheme
HTH, Aslak
I would like to save a web page programmatically.
I don't mean merely save the HTML. I would also like to automatically store all associated files (images, CSS files, maybe embedded SWF, etc.), and hopefully rewrite the links for local browsing.
The intended usage is a personal bookmarks application, in which link content is cached in case the original copy is taken down.
Take a look at wget, specifically the -p flag
-p, --page-requisites
    This option causes Wget to download all the files that are necessary to
    properly display a given HTML page. This includes such things as inlined
    images, sounds, and referenced stylesheets.
The following command:
wget -p http://<site>/1.html
will download 1.html and all files it requires.
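Since the goal here includes local browsing, it is also worth adding wget's link-rewriting flags; a sketch using standard wget options (not from the original answer):

# -p fetch page requisites, -k rewrite links to point at the local copies,
# -E save HTML files with an .html extension
wget -p -k -E http://<site>/1.html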
On Windows: you can run IE as a COM object and pull everything out.
Alternatively, you could take the source of Mozilla.
In Java, there is Lobo.
Or use commons-httpclient and write a lot of code.
You could try the MHTML format (which is what IE uses): http://en.wikipedia.org/wiki/MHTML
In other words, you'd be downloading each object (image, CSS, etc.) to your computer, and then "embedding" them, via Base64, into a single file.
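A rough Ruby sketch of that embedding idea (file names are placeholders):

require 'base64'

# Inline a previously downloaded image into the page as a data URI.
data = File.open('logo.png', 'rb') { |f| f.read }
encoded = Base64.encode64(data).gsub("\n", '')
File.open('page.html', 'w') do |f|
  f.write "<img src=\"data:image/png;base64,#{encoded}\">"
end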