I want to start a site with a collection of BSD man-pages, similar to man.cgi, but static HTML, and which includes all the stuff from the ports trees, too.
I've tried unpacking man/ from all the OpenBSD packages for a recent release, and I've noticed that although some packages provide mdoc pages, in man/man?/page.?, some others only provide terminal formatted pages in man/cat?/page.0.
I can use groff -mdoc -Thtml or mandoc -Txhtml for the mdoc files in man/man?/, but how do I convert the cat files from man/cat?/ into XHTML?
How do those man.cgi scripts at FreeBSD.org and NetBSD.org do this?
In MirBSD we’re delivering all online manpages as static HTML (the actual web CGI is thus very small), and use a crafty script to convert the output of nroff -Tcol foo.1 | col -x to XHTML/1.1 – although, for this to work, we had to tweak nroff(1) and the mdoc and man macropackages (and ms and me etc.) slightly. We only ship all manpages from base, as well as the historic BSD docs, though.
Also, GNU gnroff has no -Tcol, but -Tascii will work – but if you want to use this with gnroff output, you might need to change the regexps accordingly.
Be extra careful when editing this file: it contains normal UTF-8 stuff as well as extra control characters and invalid byte sequences; if you’re not careful, your editor will corrupt it. (I’m using jupp myself.)
For more live feedback on this, feel free to visit the MirBSD IRC channel.
As to your original goal: I suggest harvesting manpages only from binary packages, because they often get changed during compilation, such as by AC_SUBST in autoconf, or are even generated only as part of the package build.
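For the cat pages specifically: files in man/cat?/ are nroff terminal output, where bold is typically encoded as a character, a backspace and the same character again, and underline as an underscore, a backspace and the character. The following is a minimal Python sketch of turning those overstrike sequences into HTML. It is not the MirBSD script; the function names are made up for illustration, and it assumes the pages contain no control sequences other than backspaces.

```python
import html


def parse_overstrike(line):
    """Return a list of [char, style] cells; style is "", "b" or "i"."""
    cells = []
    overstrike = False
    for ch in line:
        if ch == "\b":
            # A backspace means the next character lands on the previous cell.
            overstrike = bool(cells)
            continue
        if overstrike:
            prev, style = cells[-1]
            if prev == "_" and ch != "_":
                cells[-1] = [ch, "i"]      # _ BS x  -> underlined x
            elif ch == "_" and prev != "_":
                cells[-1] = [prev, "i"]    # x BS _  -> underlined x
            elif prev == ch:
                cells[-1] = [ch, "b"]      # x BS x  -> bold x
            else:
                cells[-1] = [ch, style]    # anything else: last char wins
            overstrike = False
        else:
            cells.append([ch, ""])
    return cells


def cat_line_to_html(line):
    """Render one line, opening and closing <b>/<i> as the style changes."""
    out = []
    current = ""
    for ch, style in parse_overstrike(line):
        if style != current:
            if current:
                out.append(f"</{current}>")
            if style:
                out.append(f"<{style}>")
            current = style
        out.append(html.escape(ch))
    if current:
        out.append(f"</{current}>")
    return "".join(out)


def cat_to_html(text):
    """Wrap a whole cat page in <pre> so the column layout survives."""
    body = "\n".join(cat_line_to_html(line) for line in text.splitlines())
    return f"<pre>{body}</pre>"
```

Cross-references such as foo(1) could be turned into links in a second pass over the plain cells, which is presumably where most of the regexps come in.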
I like reading the PoC||GTFO issues, and one thing I found remarkable when I first discovered them was the "polyglot" nature of their PDF files.
Let me explain: take their 8th issue, for example. You can unzip files from it, or run it as a script to perform the encryption they talk about. Even better (worse?), you can play their 9th issue as a music file!
I currently write a small script every week, and each time a little one-page PDF in LaTeX to explain it, so I would really enjoy being able to create the same kind of PDF files. Sadly, although their first issue (partly) explains how to include zip files, it does so through three small sketches of command lines without actual explanations.
So my question is basically :
how can one create such a polyglot PDF file containing stuff like a zip as well as being a shell script which may be run using arguments just like normal scripts?
I'm asking here about the process of creation, not just an explanation of how this is possible. Ideally, there would already be some scripts or programs that make it easy to create such PDF files.
I've tried to search the net for the keywords "polyglot files" and others of the kind and wasn't able to find any useful matches. Maybe this process has another name?
I've already read the presentation by Julia Wolf, which explains how things work, but I sadly haven't had time to apply that knowledge in the real world, because I'm not used to playing with file headers and the way a PDF is constructed.
EDIT:
Okay, I've read more and found the 7th edition of PoC||GTFO to be really informative concerning this subject. I may end up being able to create my own scripts to do such polyglot PDF files if I have some more time to consider it.
I played around with polyglots myself after attending Ange's talks and also talking to him in person. You really need to understand the file formats to be able to nest them into each other.
However, long story short, here are some links I found extremely useful for creating polyglots:
Some older Google Code Trunk
PoC of the polyglot stuff
The second link (to GitHub) in particular will help you create polyglots, and also understand how they work and how they are implemented. Since it is mostly Python and very cleanly written, it is useful and easy to follow.
I feel dissecting some file formats would be a good place to start. You can find many file format specifications for different file types through Google, but they can be a tough read and will likely take you some time to translate into whatever language you are using.
PDF: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf
ELF: https://www.cs.cmu.edu/afs/cs/academic/class/15213-s00/doc/elf.pdf
ZIP: http://kat.sdf.org/zip_file_format.txt
The language(s) you select will need a way to read and write raw bytes (not just ascii alphanumeric), so perhaps C would be good for more direct access to memory. Some Python tricks could help with open sourcing the scripts easily.
To dissect the files, you may want to build a tool kinda like https://github.com/kvesel/zipbrk/ to take them apart, then put them all back together in a polyglot format. For example, zip does not require the section headers to be at the start (or even contiguous for that matter), and PDF magic number can appear in multiple places within the file as well. I also believe I recall a polyglot tool being included in one of the PoC||GTFO publishings (maybe issue 8 or 2??) as a polyglot in the pdf file.
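As a concrete illustration of the point about ZIP headers not needing to sit at the start of the file, here is a minimal Python sketch that appends a ZIP archive to a copy of a PDF. It relies on the documented behaviour of the standard zipfile module, where mode "a" on a file that is not a ZIP appends a new archive to it. The function name and its arguments are hypothetical. This is the simplest possible layout, not the technique used in PoC||GTFO; some PDF readers may object to data after the end of the PDF, and making the file a shell script as well is not covered.

```python
import shutil
import zipfile


def make_pdf_zip_polyglot(pdf_path, out_path, members):
    """Copy pdf_path to out_path and append a ZIP archive to the copy.

    members maps archive member names to their contents as bytes.
    """
    shutil.copyfile(pdf_path, out_path)
    # Mode "a" on a non-ZIP file appends a fresh archive after the existing
    # bytes; the recorded offsets are relative to the start of the whole file,
    # so ZIP tools that locate the central directory from the end can read it.
    with zipfile.ZipFile(out_path, "a", zipfile.ZIP_DEFLATED) as zf:
        for name, data in members.items():
            zf.writestr(name, data)
```

For example, make_pdf_zip_polyglot("weekly.pdf", "weekly-polyglot.pdf", {"script.sh": b"echo hi\n"}) would produce a file that still begins with the PDF header while unzip can list script.sh.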
Don't forget the hackers bible! :)
https://nostarch.com/gtfo
I'm currently in the process of making a site i18n-aware, marking hardcoded strings as translatable.
I wonder if there's any automated tool that would let me browse the site and quickly see which strings are marked and which still aren't. I saw a few projects like django-i18n-helper that try to highlight translated strings using HTML facilities, but this doesn't work well with JavaScript.
So I thought FДЦЖ CУЯILLIC, 𝔅𝔩𝔞𝔠𝔨𝔩𝔢𝔱𝔱𝔢𝔯 or ʇxǝʇ uʍop-ǝpısdn (or something along those lines) should do the trick. Easy to distinguish visually, still readable, yet doesn't depend on any rich text formatting besides Unicode support.
The problem is, I can't find any readily available tool that would eat gettext .po/.pot file(s) and spew out such a translation. Still, I think the idea is pretty obvious, so there must be something out there already.
In my case I'm using Python/Django, but I suppose this question applies to anything that uses a gettext-compatible library. The only thing the tool should be aware of is that there could be HTML fragments in translation strings.
The msgfilter program will let you run your translations through any program you want. It works especially well with GNU sed.
For example, to turn all your translations into uppercase (HTML is mostly case-insensitive, so this should work):
msgfilter -i django.po sed -e 's/\(.*\)/\U\1/'
The only strings in your app that have lowercase letters in them would then be the hardcoded ones.
If you really want to do faux cyrillic, you just have to write a program or script that reads Latin and outputs that, and feed that program to msgfilter instead of sed.
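Such a filter could look roughly like the Python sketch below. msgfilter hands each translation to the filter on standard input and takes the filter's standard output as the new translation. The letter mapping, the list of protected patterns (HTML tags, printf-style and brace placeholders, entities) and the --filter flag are all assumptions made for illustration, not part of gettext; adjust the regexp to whatever markup your strings actually contain.

```python
import re
import sys

# Latin letters replaced by look-alike Cyrillic ones (illustrative choice).
FAUX = str.maketrans({
    "A": "Д", "E": "Э", "N": "И", "R": "Я",
    "U": "Ц", "X": "Ж", "Y": "У",
})

# Fragments that must pass through untouched: HTML tags, %(name)s / %s / %d
# placeholders, {name} placeholders and character entities.
PROTECTED = re.compile(r"(<[^>]*>|%\(\w+\)[sd]|%[sd]|\{\w*\}|&#?\w+;)")


def fauxify(text):
    """Uppercase and 'cyrillicize' text, leaving protected fragments alone."""
    parts = PROTECTED.split(text)
    # With one capturing group, odd indexes hold the protected matches.
    return "".join(
        part if index % 2 else part.upper().translate(FAUX)
        for index, part in enumerate(parts)
    )


def main():
    sys.stdout.write(fauxify(sys.stdin.read()))


# The explicit flag keeps the module importable without touching stdin.
if __name__ == "__main__" and sys.argv[1:] == ["--filter"]:
    main()
```

Saved as, say, fauxify.py, it would be used as: msgfilter -i django.po -o pseudo.po python3 fauxify.py --filter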
If your distribution has a talkfilters package, it might provide a few programs that might be useful in this specific case. All of these should work as msgfilter filters. (My personal favorite is chef. Bork bork bork!)
I haven't tried this myself yet, but I found the podebug tool from the Translate Toolkit. Based on the documentation (the flipped and unicode rewrite options), this looks like exactly the tool I wished for.
I am looking for a (preferably) command line utility to stamp/watermark unicode text content into a PDF document.
I tried PDF Stamp and a couple of others that I found over the net, but to no avail with Greek characters (e.g. ΓΔΘΛ become ÃÄÈË).
Many thanks for any help!
With sufficiently "odd" characters, you generally need to specify a font and an encoding. I suspect that at least one of the tools you experimented with has the capability to define such things.
Reading their docs, it looks like PDFStamp will let you specify a font, but not an encoding. That doesn't bode well. It might always pick "Identity-H" for system fonts... worth trying.
I must admit, I'm surprised. "Disappointed" even. Have you contacted their email support?
Once upon a time, iText shipped with a number of command line tools that were mostly intended as examples but were nonetheless useful. I suspect you could dig them out of the SVN archive on SourceForge and get them to build again, if your Java-fu is up to the task. Just be sure to use BaseFont.IDENTITY_H whenever you're given a choice of encodings for a font.
I'm trying to script the creation of a hybrid (iso/joliet/hfs) iso with hdiutil. I can, for example, build an iso that hides things on the Mac side like so:
hdiutil makehybrid -o foo.iso -hfs -joliet -iso -hide-hfs "{foo/bar.txt,foo/other.rtf}" foo
That's just an example of course, but the point is that I can get it to hide, say, seven or eight example files I specify like that, with spaces in the filenames and various dots and underscores.
But for my actual real-deal script I need to list in the neighborhood of 70 files, which does not seem to work when I test it. The whole string is being passed in correctly, I know this because when you turn on '-verbose' it prints the string and says it doesn't match anything.
So my best guess is it has something to do with the length of the string passed in, but I don't see anything in the docs indicating that. Any ideas? Think it's a bug? An alternative way of accomplishing this?
This is on Mac OS X 10.5.8, btw.
Two [UPDATE, make it Three] (untested) suggestions:
use the -plistin option to specify all the parameters;
(better) try organizing the files to be hidden into directories, if necessary, so you can easily hide them by directory-specific globs rather than having to spell out each file.
[UPDATE] you could try using mkisofs from cdrtools to make the ISO image. MacPorts has a supported port of it. It could be that the code in hdiutil was originally based on an earlier version. In any case, you have the advantage of access to the source code and perhaps figuring out what the limitations are.
P.S. There seem to be a couple of minor nits with the MacPorts port. In particular, the man pages are installed in the wrong directory. [UPDATE: fixed in 3.00_1]
I have a very large PDF File (200,000 KB or more) which contains a series of pages containing nothing but tables. I'd like to somehow parse this information using Ruby, and import the resultant data into a MySQL database.
Does anyone know of any methods for pulling this data out of the PDF? The data is formatted in the following manner:
Name | Address | Cash Reported | Year Reported | Holder Name
Sometimes the Name field overflows into the address field, in which case the remaining columns are displayed on the following line.
Due to the irregular format, I've been stuck on figuring this out. At the very least, could anyone point me to a Ruby PDF library for this task?
UPDATE: I accidentally provided incorrect information! The actual size of the file is 300 MB, or 300,000 KB. I made the change above to reflect this.
I assume you can copy'n'paste text snippets without problems when your PDF is opened in Acrobat Reader or some other PDF Viewer?
Before trying to parse and extract text from such monster files programmatically (even if it's 200 MByte only -- for simple text in tables that's huuuuge, unless you have 200000 pages...), I would proceed like this:
1. Try to sanitize the file first by re-distilling it.
2. Try with different CLI tools to extract the text into a .txt file.
This is a matter of minutes. Writing a Ruby program to do this certainly is a matter of hours, days or weeks (depending on your knowledge about the PDF fileformat internals... I suspect you don't have much experience of that yet).
If "2." works, you may halfway be done already. If it works, you also know that doing it programmatically with Ruby is a job that can in principle be solved. If "2." doesn't work, you know it may be extremely hard to achieve programmatically.
Sanitize the 'Monster.pdf':
I suggest using Ghostscript. You can also use Adobe Acrobat Distiller if you have access to it.
gswin32c.exe ^
-o Monster-PDF-sanitized.pdf ^
-sDEVICE=pdfwrite ^
-f Monster.pdf
(I'm curious how much that single command will make your output PDF shrink if compared to the input.)
Extract text from PDF:
I suggest to first try pdftotext.exe (from the XPDF folks). There are other, a bit more inconvenient methods available too, but this might do the job already:
pdftotext.exe ^
-f 1 ^
-l 10 ^
-layout ^
-eol dos ^
-enc Latin1 ^
-nopgbrk ^
Monster-PDF-sanitized.pdf ^
first-10-pages-from-Monster-PDF-sanitized.txt
This will not extract all pages but only pages 1-10 (as a proof of concept, to see if it works at all). To extract every page, just leave off the -f 1 -l 10 parameters. You may need to tweak the encoding by changing the parameter to -enc ASCII7 (or UTF-8, UCS-2).
If this doesn't work the quick'n'easy way (because, as sometimes happens, some font in the original PDF uses a "custom encoding vector"), you should ask a new question, describing the details of your findings so far. Then you need to resort to bigger calibres to shoot down the problem.
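If the extraction does work, the -layout output can be split into columns with very little code. Below is a minimal sketch in Python (the same logic ports directly to Ruby). It assumes that columns are separated by runs of two or more spaces, and that an overflowing Name simply pushes the remaining columns onto the next line, so fields are accumulated until a full row of five has been seen. Both assumptions, and all names in the code, are guesses based on the description in the question and need checking against the real text.

```python
import re

COLUMNS = ["name", "address", "cash_reported", "year_reported", "holder_name"]
GAP = re.compile(r"\s{2,}")  # assumed column separator: two or more spaces


def parse_rows(lines):
    """Turn pdftotext -layout lines into a list of dicts, one per table row."""
    rows = []
    pending = []
    for line in lines:
        fields = [field for field in GAP.split(line.strip()) if field]
        if not fields:
            continue  # blank line
        # A short line is assumed to be the start (or rest) of a wrapped row.
        pending.extend(fields)
        if len(pending) >= len(COLUMNS):
            rows.append(dict(zip(COLUMNS, pending[:len(COLUMNS)])))
            pending = []
    return rows
```

Each resulting dict maps straight onto one INSERT into the MySQL table. Header lines repeated on every page would need to be filtered out first.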
At the very least, could anyone point me to a Ruby PDF library for this task?
If you haven't done so, you should check out the two previous questions: "Ruby: Reading PDF files," and "ruby pdf parsing gem/library." PDF::Reader, PDF::Toolkit, and Docsplit are some of the relatively popular suggested libraries. There is even a suggestion of using JRuby and some Java PDF library parser.
I'm not sure whether any of these solutions is actually suitable for your problem, especially since you are dealing with such huge PDF files. So unless someone offers a more informative answer, perhaps you should select a library or two and take them for a test drive.
This will be a difficult task, as rendered PDFs have no concept of tabular layout, just lines and text in predetermined locations. It may not be possible to determine what are rows and what are columns, but it may depend on the PDF itself.
The Java libraries are the most robust, and may do more than just extract text. So I would look into JRuby and iText or PDFBox.
Check whether there is any structured content in the PDF. I wrote a blog article explaining this at http://www.jpedal.org/PDFblog/?p=410
If not, you will need to build it.
Maybe the Prawn Ruby library?