How to parse Word documents with Ruby?

Does anyone know of a library that I can use on OS X/Linux to parse Word files and output the content as HTML?
I've had a look at win32ole but as far as I can see it's for Windows only, although I could be wrong.
Any suggestions?

The Word document format (ignoring docx for the moment) is terrible and was constantly changing. IMHO that is why there are so few (read: zero) Ruby libraries out there to parse them.
What I recommend doing is using JRuby and some of the established Java libraries for reading the doc format. Google should help you out there: http://schmidt.devlib.org/java/libraries-word.html.
There is also a Java project for reading Microsoft file formats, POI (http://poi.apache.org/), and it does have Ruby bindings (http://poi.apache.org/poi-ruby.html), but I'm not sure how up to date those are. Their site says the Ruby bindings are for 1.8.2...

Related

ZXing recognizes barcode only in Java version

I'm trying to bulk scan some JPG files with barcodes on them. I've used the Ruby bindings for the C++ port of ZXing. When this file is scanned by the web version of ZXing (https://zxing.org/w/decode.jspx), everything turns out fine. When I try to scan it in Ruby (using https://github.com/glassechidna/zxing_cpp.rb), nothing is recognized. I already tried cranking up the contrast, but it did not help. It's not my Ruby setup, because it works for loads of other nearly identical codes. The only thing I can think of is some difference between the Java version and the C++ port, but this is a complete shot in the dark; I only started using ZXing today.
Could anyone get this code recognized in ruby? Thank you very much.
The gem you're using and/or its dependencies are out of date. If you still want to use Ruby for your project, you can try using one of the online services in the comments for the decoding. You could either use the mechanize gem or roll your own using other Ruby HTTP tools such as httparty or Ruby's Net::HTTP.
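As a rough illustration of the Net::HTTP route, here is a minimal sketch. It assumes the online decoder accepts a GET request with the image URL in a query parameter; the endpoint path and the parameter name "u" are assumptions, not something confirmed above, so check them against whichever service you pick. The helper names are made up for this example.

```ruby
require 'net/http'
require 'uri'

# Assumption: the decoder takes GET requests with the image URL in a
# query parameter named "u". Adjust both values for the service you use.
DECODER_ENDPOINT = 'https://zxing.org/w/decode'

# Build the request URI for one publicly reachable image URL.
def decode_request_uri(image_url, endpoint: DECODER_ENDPOINT, param: 'u')
  uri = URI(endpoint)
  uri.query = URI.encode_www_form(param => image_url)
  uri
end

# Fetch the decoder's response page. Extracting the decoded text from the
# returned HTML is left out because it depends on the service's markup.
def fetch_decoder_page(image_url)
  response = Net::HTTP.get_response(decode_request_uri(image_url))
  raise "decoder returned HTTP #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end
```

Keeping the URI construction separate from the network call makes it easy to point the same code at a different service.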

Convert MediaWiki Markup to Textile Markup

I have a problem :0
At my place of work we have two wiki systems and I have been charged with finding a way of migrating from a MediaWiki to a redmine wiki -- only problem is they use different markup languages (WikiText vs Textile) and a possible solution (Pandoc) only goes the other way :0 Any suggestions on how to do this would be greatly appreciated!!!
The MediaWiki to Redmine Migration Tool (MRMT) has just been released.
It migrates the whole history with the correct user assigned to each revision.
Besides a basic Pandoc translation it also adds some helpful replacements that will very likely be necessary in any migration of that kind.
The development version of pandoc now has a mediawiki reader. It doesn't support all of mediawiki syntax (e.g. templates), and it is not very well tested, but you could try it out.
You would need to install the development version of pandoc from source to do this. Install the Haskell Platform, then follow the instructions here.
(These instructions assume a *nix build environment.)
You will probably want to use some scripting to adjust the result, e.g. making links with title "wikilink" into proper redmine wikilinks. It is easiest to do this at the level of the pandoc AST, rather than in the textile result. The document on Scripting with pandoc on the pandoc website may be of help here.
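The AST-level adjustment could look roughly like the sketch below, which works on pandoc's JSON representation of the document. It assumes a pandoc version whose Link nodes have the shape [attributes, inlines, [url, title]] and that wiki links carry the title "wikilink" as described above; older or newer versions may differ. The function names are made up for this example, and labels containing nested formatting are flattened only at the top level.

```ruby
require 'json'

# Flatten a list of pandoc inline nodes into plain text (top level only).
def inline_text(inlines)
  inlines.map do |node|
    case node['t']
    when 'Str' then node['c']
    when 'Space', 'SoftBreak' then ' '
    else ''
    end
  end.join
end

# Walk the JSON AST and replace every Link titled "wikilink" with a raw
# textile inline holding a Redmine-style [[Page]] or [[Page|label]] link.
def rewrite_wikilinks(node)
  case node
  when Array
    node.map { |child| rewrite_wikilinks(child) }
  when Hash
    if node['t'] == 'Link' && node['c'].is_a?(Array) && node['c'].length == 3
      _attr, inlines, (url, title) = node['c']
      if title == 'wikilink'
        label = inline_text(inlines)
        markup = label == url ? "[[#{url}]]" : "[[#{url}|#{label}]]"
        return { 't' => 'RawInline', 'c' => ['textile', markup] }
      end
    end
    node.transform_values { |value| rewrite_wikilinks(value) }
  else
    node
  end
end

# Read a JSON AST from input and write the rewritten AST to output.
def run_filter(input = $stdin, output = $stdout)
  output.puts JSON.generate(rewrite_wikilinks(JSON.parse(input.read)))
end
```

With a call to run_filter added at the end of the script, it can sit in a pipeline between "pandoc -f mediawiki -t json" and "pandoc -f json -t textile".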
Another approach is to scrape the HTML your MediaWiki installation produces, and use pandoc to convert that to textile. This approach typically requires a lot of preprocessing and postprocessing, though.
You could also try using one of the various alternative mediawiki parsers, producing HTML or DocBook and converting that to textile using pandoc.

Extracting source code from a document for testing

I wrote tutorials for some of my Ruby gems. They are Markdown (Kramdown) text documents. To ensure the integrity of the source code in the tutorials as development of the gems continues, I want to extract the source code from the tutorial documents and run tests to ensure the code is correct and working. Before reinventing the wheel I searched, but found nothing on this kind of problem. Is there any software that can help me solve my problem? Ruby software would be cool, but I'm not particular about the language. I'm sure I can't be the first person to encounter this problem.
The other option is to have only placeholders in the tutorial documents, keep all the code in external files, and then populate the documents prior to publishing. This would mean a lot more loose files but would be significantly easier to implement.
Org mode in emacs can do that, but it means you can't write in Markdown.
Are you after something like Ruby DocTest?
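If nothing off the shelf fits, the extraction step itself is small. Below is a minimal sketch, assuming the tutorials mark their examples with fenced code blocks (tilde or backtick fences) tagged with a language name. It does not handle indented code blocks, and extract_code_blocks and run_block are hypothetical helper names, not part of any gem.

```ruby
require 'open3'
require 'rbconfig'

# Opening fence: three or more tildes or backticks, optional language tag.
FENCE = /\A(~{3,}|`{3,})\s*(\w+)?\s*\z/

# Return the bodies of all fenced blocks tagged with the given language.
def extract_code_blocks(markdown, language: 'ruby')
  blocks = []
  current = nil # lines of the open block, or nil when outside a block
  fence = nil
  wanted = false

  markdown.each_line do |line|
    stripped = line.chomp
    if current.nil?
      if (match = FENCE.match(stripped))
        fence = match[1]
        wanted = (match[2] == language)
        current = []
      end
    elsif stripped.start_with?(fence) && stripped.delete(fence[0]).strip.empty?
      # Closing fence: only fence characters, at least as many as the opener.
      blocks << current.join if wanted
      current = nil
    else
      current << line
    end
  end
  blocks
end

# Run one snippet in a fresh Ruby process; returns [succeeded, output].
def run_block(code)
  output, status = Open3.capture2e(RbConfig.ruby, '-e', code)
  [status.success?, output]
end
```

A test task could then loop over extract_code_blocks(File.read(path)) and fail the build when run_block reports a non-zero exit status for any snippet.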

Additional settings for wkhtmltopdf?

I am converting some docs to PDF using wkhtmltopdf (currently using Perl and the command-line version). Is it possible to change the "PDF Producer", "PDF Version" and "Fast Web View" fields? The current defaults are "wkhtmltopdf", "1.4 (Acrobat 5.x)", and "No", respectively. I didn't see anything on the wiki page.
Pass --extended-help on the command line to see the supported features.
Not sure if those specific params are supported or not.
I patched wkhtmltopdf to support an additional flag recently, and it would be quite easy to add parameters to change those. I don't believe they are supported currently, though.
PDF Producer: Nope. Most apps want folks to know that particular app generated the PDF.
PDF Version: Nope, but trivial. The version number at the beginning of the file is just a courtesy really. What exactly are you after with this? Chances are you don't really need it. The PDF generated isn't going to acquire any features automagically just because the PDF claims to be this version or that. It's only really used so a viewer opening a newer PDF can say something like "I don't support this version, some stuff might not work". Because everything will work regardless (unless someone happens to have a VERY old version of Acrobat/Reader), I don't see the issue.
Fast Web View: Nope, and decidedly non-trivial. "Fast Web View" means everything needed to display the first page of the PDF is sorted to the front of the file, and there are various "hints" on where an app downloading the PDF can find this or that. It's not just a flag, not by a long stretch.
Zero for three. Sorry.
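To see what two of those fields actually reflect in a generated file, here is a minimal Ruby sketch (an inspection aid only, it changes nothing). It reads the version from the "%PDF-x.y" header at the start of the file and looks for a "/Linearized" key in the first 1024 bytes, which is where a file prepared for Fast Web View carries its linearization dictionary. The check is a heuristic, and pdf_header_info is a made-up helper name.

```ruby
# Inspect the first 1024 bytes of a PDF for its header version and for a
# linearization dictionary (the structure behind "Fast Web View").
def pdf_header_info(path)
  head = File.binread(path, 1024) || ''
  {
    # Version as written in the "%PDF-x.y" header, or nil if not found.
    version: head[/%PDF-(\d+\.\d+)/, 1],
    # Heuristic: a linearized file has /Linearized in its first object.
    linearized_hint: head.include?('/Linearized')
  }
end
```

Calling pdf_header_info("out.pdf") on a file produced by wkhtmltopdf should, going by the defaults quoted in the question, report version "1.4" and no linearization hint.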

Good metadata image dump utilities?

I'm looking for the best tool out there to extract any and all metadata embedded within the most popular image file formats (JPEG and PNG specifically). I would like to know about whatever is in there (XMP, Exif, IPTC/IIM, etc.). Ideally I am looking for an all-in-one solution that I can run from a command line, but am interested to hear about any other tools in this area that are of value.
I have found the following, each with advantages/disadvantages:
ExifTool is good, but the output is a little more roughshod than I would like.
DumpImage from the Metadata Working Group has good formatting of the metadata it does find, but doesn't support PNG.
I have recently released Binspector, the tool I ended up writing to answer this question to my own satisfaction. The basic premise of the tool is that it takes a format grammar and uses it to analyze a binary file. As long as the format grammar and the binary file are well-formed, one can inspect and analyze innumerable binary files and formats.
Code is hosted on GitHub, and a blog for the tool is here. (The overview post for the tool is here.)
As you did not mention a preferred programming language, I'll take PHP as an example.
There is an Exif extension for PHP which can be used to easily retrieve metadata from an image.
http://www.php.net/manual/en/function.exif-read-data.php
You could easily create a script that you can call from the command line. I must add that the extension only seems to provide support for JPEG and TIFF images.
You could try the official Adobe XMP SDK. It is available for download at:
http://www.adobe.com/devnet/xmp.html
This is the complete SDK to read/write/manipulate metadata across a variety of formats.
In the SDK package there is one particular sample that might be of interest to you. Go to the "samples" folder and build the samples as per the documentation (available in the package). Look for the sample executable "DumpFile". This dumps all the metadata in the file to the console.
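For the PNG side of the question, where tool support is thinner, the container is simple enough to walk by hand. The sketch below is a minimal Ruby example that lists the chunks of a PNG and decodes uncompressed tEXt entries. It does not verify CRCs or decode zTXt, iTXt or eXIf payloads, and the function names are made up for this example.

```ruby
PNG_SIGNATURE = "\x89PNG\r\n\x1A\n".b

# List every chunk as { type:, length:, data: }. Each chunk is laid out as
# a 4-byte big-endian length, a 4-byte type, the payload, and a 4-byte CRC.
def png_chunks(data)
  data = data.b
  raise ArgumentError, 'not a PNG file' unless data.start_with?(PNG_SIGNATURE)

  chunks = []
  offset = PNG_SIGNATURE.bytesize
  while offset + 8 <= data.bytesize
    length, type = data.byteslice(offset, 8).unpack('Na4')
    payload = data.byteslice(offset + 8, length) || ''.b
    chunks << { type: type, length: length, data: payload }
    break if type == 'IEND'

    offset += 12 + length # length field + type + payload + CRC
  end
  chunks
end

# Decode tEXt chunks: a Latin-1 keyword, a NUL byte, then Latin-1 text.
def png_text_entries(data)
  png_chunks(data).select { |chunk| chunk[:type] == 'tEXt' }.map do |chunk|
    keyword, text = chunk[:data].split("\0", 2)
    [keyword, text.to_s].map { |s| s.force_encoding('ISO-8859-1').encode('UTF-8') }
  end.to_h
end
```

Running png_chunks(File.binread(path)) and looking for tEXt, zTXt, iTXt and eXIf types gives a quick overview of which metadata containers are present before reaching for a heavier tool.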
