TeXtoGIF for XeTeX

I need to extend TeXtoGIF to be able to handle XeTeX, ConTeXt, and other more modern implementations of TeX (the point being that they can handle multiple fonts). Unfortunately, XeTeX in particular does not produce standard DVI output, so my modifications break.
Please see the diff of changes at GitHub. My changes to the codebase are as follows:
Introduce a variable $cmdTeX to hold the TeX engine (LaTeX, XeLaTeX, etc.)
Add the option -xetex (or anything beginning with an x, really) to specify xelatex as the engine
Substitute the hard-coded latex call with the variable $cmdTeX.
I see two options for fixing this issue:
Coerce XeLaTeX to produce standard DVI output, which, IIRC, isn't possible.
Find another sequence of commands (probably a different use of Ghostscript, which is why I included the tag, but so be it) to work with the PDF output directly instead of DVI.
So I guess the question boils down to:
How can I convert a PDF into GIF without using graphical software?
which probably isn't a good fit for SO any more, IMHO.

It sounds like what you have is a patch you would like to submit to the author. Have you contacted him? Unfortunately his software doesn't (appear to) include a license, so it may be hard to proceed from a legal standpoint. Most of the time in the open source world, if you encounter a non-responsive (or unwilling) author, you can do as you have already done: fork and patch. At that point you can choose to publish your new version, possibly with a new name, conforming to the author's license (if there is one).
From a software standpoint, the code is rather ancient, written for Perl 4. Because Perl has excellent backwards compatibility it will probably still work, but the question is: do you really want to? It may depend on your use case. The original author was making GIFs to use in web pages. If this is what you are doing, you might want to look at MathJax, which lets you use LaTeX right in your browser/HTML directly.

Rather than adding this to my question, I'm posting it as an answer: it turned out to be a valid solution to my overall issue and should be recorded as such.
I should also note that someone over at TeX.SX pointed me to the standalone class, which provides a convert option that, using -shell-escape, can do just about everything I need. Thus,
\documentclass[convert={density=6000,
size=1920x1600,
outext=.png},
border=1cm]{standalone}
\usepackage{fontspec}
\setmainfont{Zapfino}
\pagestyle{empty}
\begin{document}
it's all text
\end{document}
%%% Local Variables:
%%% mode: latex
%%% TeX-engine: xetex
%%% TeX-master: t
%%% End:
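For completeness: with shell escape enabled and ImageMagick's convert on the PATH, compiling is then just (test.tex being a placeholder file name):
xelatex -shell-escape test.tex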

ConTeXt is not a modern TeX implementation (like LuaTeX, for instance). It's a macro package for several engines.
Since you want to support specific engines (e.g. XeTeX) and particular macro packages (e.g. ConTeXt), MathJax is not an option. You have to run the TeX engine, create a PDF and post-process that PDF. I don't know why you chose GIF as a format; the vector format SVG would produce much prettier results, and PNG would be my second choice.
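Just to illustrate the SVG route, a sketch using poppler's pdftocairo (one of several tools that can do this); only the first page is converted here, since a single SVG file holds one page:
pdftocairo -svg -f 1 -l 1 input.pdf output.svg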
Since you are not very specific about your input, I assume you deal with multi-page input files. You can use Ghostscript to convert the PDF to a series of images.
As you said, you require GIF. According to gs -h, Ghostscript does not support GIF output, so we convert to PNG first:
gs \
-sDEVICE=png256 \
-dNOPAUSE \
-dBATCH \
-dSAFER \
-dTextAlphaBits=4 \
-q \
-r300x300 \
-sOutputFile=output-%02d.png input.pdf
Then use GraphicsMagick or ImageMagick to convert the PNGs to GIFs:
mogrify -format gif output-*.png
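For what it's worth, ImageMagick's convert can also drive Ghostscript behind the scenes and go from PDF to GIF in one step, although the two-step route above gives more control over the rendering. Note that a multi-page PDF then ends up as one multi-frame GIF:
# rasterize all pages of input.pdf at 300 dpi into a single (multi-frame) GIF
convert -density 300 input.pdf output.gif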

Related

Conversion between knitr and sweave

This might have been asked before, but until now I couldn't find a really helpful answer for me.
I am using RStudio with knitr, and a colleague of mine with whom I need to cooperate uses the Sweave format. Is there a good way to convert a script back and forth between these two?
I have already found "Sweave2knitr" and hoped this would output an .Rmd file with all chunks changed (<<>> to {} etc.), but this is not the case. My main problem is that I would also need the option to convert from .Rmd back to .Rnw so that my colleague can also re-edit my revisions.
Thanks a lot!
To process the code chunks and convert the .Rnw file to .tex, you use the knit() function in the knitr package rather than Sweave().
R -e 'library(knitr);knit("my_file.Rnw")'
Sweave2knitr() is for converting old Sweave-based .Rnw files to the knitr syntax.
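Following the same pattern as the knit() call above, that conversion can be run from the command line like this (my_file.Rnw being a placeholder file name):
R -e 'library(knitr);Sweave2knitr("my_file.Rnw")'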
In the program defaults (RStudio's options), change the 'Weave Rnw files using' setting to either Sweave or knitr.
The Rnw format is really LaTeX with some modifications, whereas the Rmd format is Markdown with some modifications. There are two main flavours of Rnw, the one used by Sweave being the original, and the one used by knitr being a modification of it, but they are very similar.
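To illustrate, a code chunk in an Rnw file (both flavours use the same delimiters) and its Rmd counterpart look roughly like this (chunk label and contents are placeholders):
<<summary-chunk, echo=TRUE>>=
summary(cars)
@
versus, in Rmd:
```{r summary-chunk, echo=TRUE}
summary(cars)
```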
It's not hard to change Sweave flavoured Rnw to knitr flavoured Rnw (that's what Sweave2knitr does), but changing either one to Rmd would require extensive changes, and probably isn't feasible: certainly I'd expect a lot of manual work after the change.
So for your joint work with a co-author, I would recommend that you settle on a single format, and just use that. I would choose Rmd for this: it's much easier for your co-author to learn Markdown than for you to learn LaTeX. (If you already know LaTeX, that might push the choice the other way.)

Automatic gettext translation generator for testing (pseudolocalization)

I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily-available tool that'd eat gettext .po/.pot file(s) and spew out such translation. Still, I think the idea is pretty obvious, so there must be something out there, already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses a gettext-compatible library. The only thing the tool should be aware of is that there could be HTML fragments in the translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux Cyrillic, you just have to write a program or script that reads Latin text and outputs the faux-Cyrillic version, and feed that program to msgfilter instead of sed.
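For example, a rough sketch of such a filter built from plain sed substitutions (only a handful of letters mapped here, assuming a UTF-8 locale; extend the mapping as needed):
msgfilter -i django.po -o django-pseudo.po \
  sed -e 's/A/Д/g; s/E/Е/g; s/R/Я/g; s/a/а/g; s/e/е/g; s/o/о/g'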
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
I haven't tried this myself yet, but I found the podebug tool from the Translate Toolkit. Based on the documentation (its flipped and unicode rewrite options), this looks like exactly the tool I wished for.
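Based on the documentation, the call should look roughly like this (untested; the output file name is arbitrary):
podebug --rewrite=flipped django.po django-pseudo.po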

How does man.cgi deal with cat vs. mdoc man pages?

I want to start a site with a collection of BSD man-pages, similar to man.cgi, but static HTML, and which includes all the stuff from the ports trees, too.
I've tried unpacking man/ from all the OpenBSD packages for a recent release, and I've noticed that although some packages provide mdoc pages, in man/man?/page.?, some others only provide terminal formatted pages in man/cat?/page.0.
I can use groff -mdoc -Thtml or mandoc -Txhtml for the mdoc files in man/man?/, but how do I convert the cat files from man/cat?/ into XHTML?
How do those man.cgi scripts at FreeBSD.org and NetBSD.org do this?
In MirBSD we’re delivering all online manpages as static HTML (the actual web CGI is thus very small), and use a crafty script to convert the output of nroff -Tcol foo.1 | col -x to XHTML/1.1 – although, for this to work, we had to tweak nroff(1) and the mdoc and man macropackages (and ms and me etc.) slightly. We only ship all manpages from base, as well as the historic BSD docs, though.
Also, GNU gnroff has no -Tcol, but -Tascii will work – but if you want to use this with gnroff output, you might need to change the regexps accordingly.
Be extra careful when editing this file: it contains normal UTF-8 stuff as well as extra control characters and invalid byte sequences; if you’re not careful, your editor will corrupt it. (I’m using jupp myself.)
For more live feedback on this, feel free to visit the MirBSD IRC channel.
As to your original goal: I suggest only harvesting manpages from binary packages, because they often get changed during compilation (for example by AC_SUBST in autoconf), or are even generated only as part of the package build.
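Coming back to the cat-page side of your question, a very rough generic sketch (not the actual MirBSD script) of the basic idea: strip the backspace overstriking, escape the XML special characters, and wrap the result in a pre element (a real conversion would still need a proper XHTML skeleton around it):
{ printf '<pre>\n'
  col -bx < page.0 |
    sed -e 's/&/\&amp;/g' -e 's/</\&lt;/g' -e 's/>/\&gt;/g'
  printf '</pre>\n'
} > page.xhtml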

Methods of Parsing Large PDF Files

I have a very large PDF file (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy and paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's only 200 MB -- for simple text in tables that's huuuuge, unless you have 200,000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge of the PDF file format internals... I suspect you don't have much experience with that yet).
If "2." works, you may already be halfway done. If it works, you also know that doing it programmatically with Ruby is a job that can, in principle, be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest to use Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink compared to the input.)
Extract text from PDF:
I suggest first trying pdftotext.exe (from the Xpdf folks). There are other, somewhat less convenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages but only 1-10 (for proof of concept, to see if it works at all). To extract from every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question describing the details of your findings so far. Then you need to resort to bigger calibres to shoot down the problem.
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure if any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text at predetermined locations. It may not be possible to determine which parts are rows and which are columns, but that depends on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby with iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?

Visual Basic 6: Convert TIF pages from compression type LZW, JPEG, etc. to Group4

Does anyone know of an open source library for Visual Basic 6 that converts pages in a TIF file from, let's say, LZW to Group4 format?
Thanks.
An edit of my answer to your other question!
ImageMagick is an excellent free open source image manipulation package. There is an OLE (ActiveX) control which you could use from VB6. I've never tried it myself - I always use the ImageMagick command line. I understand the control just takes the normal command lines anyway.
The command-line element for changing the TIFF compression would be -compress. Something like the line below should write in Group4 format (air code based on the tutorial and manual):
convert myfile.tiff -compress Group4 myGroup4file.tiff
Choices for the -compress argument qualifier include: None, Group4, and LZW.
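To verify the result, ImageMagick's identify can report the compression actually stored in the output file (%C is its format escape for the compression type):
identify -format "%C\n" myGroup4file.tiff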
EDIT: ImageMagick is licensed under the GPL: if you use the control and redistribute your program, it's possible your program would have to be free and open source. Apparently it has not yet been legally tested whether dynamic linking to a GPL library or control invokes the GPL. You could always launch the ImageMagick command line, which to me should be safe [I am not a lawyer].
EDIT2: The ImageMagick website says it uses the GPL, but the license wording doesn't look like GPL 1, 2 or 3 to me. It also contains this: "For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof"
The FreeImage library has worked well for me in the past.
http://freeimage.sourceforge.net/

Resources