(Unit)test pdf generation

(Unit)test pdf generation - magento

We implemented a magento module https://github.com/firegento/firegento-pdf/ and I plan to write tests for the module.
The problem is: The extension generates pdfs.
Is there any framework, or whatever to test pdfs? It would be totally fine if I can check for text in the pdf. I don't want to check the correct placement.
Andy ideas?
This looks promising but feels oversized. http://webcheatsheet.com/php/reading_clean_text_from_pdf.php

I use PdfBox for a similar module, a Java based command line utility that extracts text from a PDF and optionally converts it to HTML: http://pdfbox.apache.org/commandline/#extractText
To use it within PHPUnit tests, I wrote a PHP interface for the relevant PdfBox methods: https://github.com/schmengler/PdfBox
Example
use SGH\PdfBox;
//$pdf = GENERATED_PDF;
$converter = new PdfBox;
$converter->setPathToPdfBox('/usr/bin/pdfbox-app-1.7.0.jar');
$text = $converter->textFromPdfStream($pdf);
Further reading: Unit Test Generated PDFs with PHPUnit and PDFBox

Maybe you could use Inkscape to convert it into SVG and make asserts on some SVG Nodes.
That would do if you only want to check the text or some simple formatting.
$ inkscape -f invoice.pdf --export-plain-svg=thepdf.svg
For the correct position you need to be a bit fuzzy, though.
Nevertheless the PDF source can be plain enough to be checked with simple strpos().

You have to resign yourself to testing "we sent the right commands to Magento". Testing the output PDF would cause fragility.
Mock the PDF-writing library, and test that your code called the library correctly. This has the added benefit of speed, but requires a little more discipline. If a PDF output fails a manual inspection, you MUST fix that test-first, to keep your mocks honest.

Related

Examine generated ReST from sphinx extension

I'm using a Sphinx extension (this one) which generates some ReST markup dynamically. Sphinx uses that ReST to generate html documentation.
I want to examine and run doctests on the generated ReST markup. Normally I use
sphinx-build -E -b html docs dist/docs
to generate the html output, but there is no rst "builder" equivalent to the html one.
How can I examine the generated ReST markup?

Use the Sphinx extension sphinx.ext.doctest, following its syntax and markup.
Then run make doctest to run the doctests.
Update
In response to your comments and taking the suggestion from #mzjn, it sounds like you want to generate an intermediate set of documentation that is in reStructuredText. The Sphinx extension restbuilder might be want you seek.
From that point, make doctest might be want you want.

Multiple Converters/Generators in Jekyll?

I would like to write a Jekyll plugin that makes all posts available in PDF format by utilizing Kramdown's LaTeX export capabilities. For each post in Markdown format, I'd like to end up with the normal .html post along with a .tex file containing the LaTeX markup and finally a .pdf.
Following the documentation for creating plugins, I see two ways of approaching the problem, either with a Converter or with a Generator.
Converter plugins seem to run after the built-in Converters, so the .markdown files have all been converted to .html by the time they reach the Converter.
When I try to implement a Generator, I am able to use fileutils to write a file successfully, but by the end of Jekyll's cycle, that file has been removed. It seems there's a StaticFile class which you can use to register new output files with Jekyll, but I cannot find any real guidance on how to use it.

If you take a look at the ThumbGenerator class in this: https://github.com/matthewowen/jekyll-slideshow/blob/master/_plugins/jekyll_slideshow.rb you'll seen a similar example. This particular plugin makes thumbnail sized versions of all images in the site. Hopefully it gives a useful guide to how you can interact with Jekyll's StaticFile class (though I'm not a Ruby pro, so forgive any poor style).
Unfortunately, there isn't really documentation for this - I gleaned it from reading through the source.
I wrote this a few months ago and don't particularly remember the details (which is why I gave an example rather than a workthrough), but if this doesn't get you on the right track let me know and I'll try to help.

I try to do the same but with direct html->pdf conversion.
It did not work inside a gitlab-ci pipeline at this time, nonetheless it work on my workstation (see here) with a third possibility : a hook !
(here with pdfkit)
require 'pdfkit'
module Jekyll
Jekyll::Hooks.register :site, :post_write do |post|
post.posts.docs.each do |post|
filename = post.site.dest + post.id + ".pdf"
dirname = File.dirname(filename)
Dir.mkdir(dirname) unless File.exists?(dirname)
kit = PDFKit.new(post.content, :page_size => 'Letter')
kit.stylesheets << './css/bootstrap.min.css'
kit.to_file(filename)
end
end
end

Graphical user visualization for Ruby testing

What kind of (small) tool can we use in order to render a graphical result from testing?
Actually, I would like to display a graphic instead of this test (for example):
Finished in 3.44 seconds
5 examples, 0 failures
Maybe a JS graph for instance, where each bugged test could be in red...
Thanks.

If you use a Continuous Integration tool such as Hudson you can chart failures over time.

Well Rspec comes with a html formatter. you could just pass the option to command line like
rspec -f html -o index.html
You could also inherit RSpec formatter to write your own formatters.

Maybe you are looking for fuubar, it's not JS but shows a nice progress bar in colors. I also use it to have a quick overview of passing or failing tests.

Generate line graph for any benchmark?

I had spent so many hours failing to find a line graph generator for my benchmark results that I just wanted to plug in. I tried quite a few like Google's chart API but it still seemed confusing or not graceful looking, I am clueless.
Examples of benchmark images I wished to make something like are this:
What specific applications /web services do you recommend for generating something even close to this? I want something "neat".

You can use python mathplotlib, which generates beautiful graphs like:
(Source code)

I use gnuplot. It is not a lib, but a separate executable file. You can output plotting data to one file, and plotting commands in another - script file, which refer to data file. Then call gnuplot with this script file.
Another way is to use qwt. It is a real library, but it depends on Qt. If you already use Qt in your project, it is very straigth way to plot graphs. If not, then just use gnuplot

Methods of Parsing Large PDF Files

I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.

I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
Try to sanitize the file first by re-distilling it.
Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge about the PDF fileformat internals... I suspect you don't have much experience of that yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest to use Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest to first try pdftotext.exe (from the XPDF folks). There are other, a bit more inconvenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages but only 1-10 (for proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameter. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses "custom encoding vector") you should ask a new question, describing the details of your findings so far. Then you need to resort bigger calibres to shoot down the problem.

At the very least, could anyone point
me to a Ruby PDF library for this
task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially that you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.

This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text in predetermined locations. It may not be possible to determine what are rows and what are columns, but it may depend on the PDF itself.
The java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFbox.

Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.

Maybe the Prawn ruby library? link text

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio