Programmatically copy annotations from one PDF to another with Ruby

I have a large number of books in PDF format, mostly from publishers like the Pragmatic Programmers, and I have made many annotations (comments, highlights, notes, etc.) in the PDFs that I'd like to preserve. This becomes a problem when the publisher updates the PDF and I download the update. I'd like a way to merge my annotations (which are basically changes) with the new version. Opening each one and manually copying and pasting is obviously a ridiculously tedious chore, so:
Is there any way to do this programmatically? I'd prefer to use Ruby, but I'd be almost as happy with a shell script-based solution, and I'd even be willing to learn Python or use Perl if those were the only options.

Related

Why are so many file formats disguised ZIP files? [closed]

Over the years I've had many opportunities to "reverse engineer" proprietary files, and I noticed that many of them are "disguised" ZIP files which just pack standard XML, HTML, config, and raw text files. However, I don't understand why the developers would do that.
A few examples of these "disguised" file formats, off the top of my head:
PPTX, XLSX, DOCX and probably all of Microsoft's file formats
EPUB
JAR, WAR, though these I can understand, as they are meant to be archives
There are many other file formats of this sort, and sometimes even companies that really don't want their data files to be publicly readable rely on this disguised ZIP format to store data (like game saves).
What are the technological advantages of ZIP files over custom file types?
Is there a name for the practice of building a (sometimes proprietary) new file format on top of ZIP?
If you want your new file format to be interoperable with other applications, you'll need to define your format completely. Building on top of other standards, such as ZIP, XML, and HTML, cuts out a large part of the documentation and maintenance effort.
The format designer is usually also the first implementor. Using existing standards means they can use existing, known-correct and working tools to create and read files. This means the Microsoft Office file format designers, for example, don't need to debug serializing and deserializing logic, since they're already using industry-proven XML.
Using a compressed archive instead of plain archiving such as TAR means your format automatically reduces the required storage when possible. ZIP is an ISO standard and patent-free (as long as it's not encrypted with a strong algorithm), so the designer and implementor don't need to pay for a license, unlike, say, RAR.
Implementing the consuming application on different hardware or platforms may require rewriting a large part of the code unless it's built on top of already popular standards. An EPUB reader, for example, can be patched together from a ZIP reader library (usually built into various frameworks) and an HTML viewer. That's near-zero effort on the developer's side, who can then focus on other features. Since the framework and CPU are likely optimized to handle ZIP compression, they usually perform much better than a custom compression format. Another rarely considered factor is security and reliability. A custom archiving format may seemingly work faster or compress more efficiently, but on real-world data it might crash, or worse, silently return wrong reads, which can result in a security breach or incorrect results.
As for companies not wanting their files to be read, there are plenty of solutions that can be built on top of ZIP. AES encryption is available as an open standard for ZIP under AE-x. If they don't need to hide the entire structure, just the values, they can encrypt individual entries in the XML/JSON or individual files. EPUB DRM can be broken easily, but that would happen regardless of whether the ebook used a non-ZIP-based format.
I don't think there's a specific name for building a new format on top of ZIP. When you want to store a string, you pick one of the available text encoding standards; if you want to keep the value secret, you encrypt it with yet another encryption standard rather than inventing a new encoding scheme. What those designers are doing is simply reusing existing standards, and they're not just using ZIP; they're also using XML, Unicode, various image formats, etc.
About Microsoft formats being ZIP: well, not all of them. Pre-2007 Office files aren't, which is partly the reason behind the difficulties of implementing and improving those formats (another reason is that Microsoft deliberately prevented people from doing it in the first place by not documenting them). XLSB is ZIP, but instead of XML it uses binary serialization, which speeds up saving and opening; afterward, though, it operates about as fast and as memory-efficient as an XLSX file. ACCDB, like its precursor MDB, isn't a ZIP file; databases in general are allergic to being compressed. Visio transitioned more slowly: Visio 2010 used the XML-based VDX (not compressed), then 2013 added VSDX (XML- and ZIP-based), while Project and Publisher don't seem to be moving to a new format soon. XPS, NuGet, and Appx are ZIP, but csproj, vbproj, etc. aren't. MSI installers are archives, but they're not ZIP files.
It's interesting that you stopped at JAR and WAR, because continuing on, Android APK files are ZIP files (which may themselves contain the contents of the JARs they reference), as is the overarching AAB. On iOS, IPA files are ZIP too. The LibreOffice default formats ODT, ODS, and ODP are all ZIP- and XML-based, designed around the same time as Microsoft Office's new formats.
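The pattern described above is easy to demonstrate with nothing but a standard library: a ZIP-based format is just named entries (often XML) behind the familiar "PK" magic bytes. Here's a minimal sketch that builds a toy container whose entry names merely mimic the OOXML layout (they're illustrative, not a valid DOCX), then reads it back like any ordinary ZIP:

```python
# Sketch: a toy ZIP-based "document format", using only the standard library.
# Entry names mimic the OOXML layout but are illustrative, not a real DOCX.
import io
import zipfile
import xml.etree.ElementTree as ET

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("word/document.xml",
               "<document><body><p>Hello</p></body></document>")

data = buf.getvalue()
is_zip = data[:2] == b"PK"   # ZIP local-file-header magic: any unzip tool opens this

with zipfile.ZipFile(io.BytesIO(data)) as z:
    names = z.namelist()
    text = ET.fromstring(z.read("word/document.xml")).find("body/p").text
```

Renaming a real .docx or .epub to .zip and opening it shows exactly this structure, which is the whole point: every existing ZIP and XML tool works on the new format for free.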

Preparing PDFs for use as Prawn templates

We've got a system that takes in a large variety of PDFs from unknown sources, and then uses them as templates for new PDFs generated by Prawn.
Occasionally some PDFs don't work as templates for Prawn: they either trigger a generic Prawn error ("Prawn::Errors::TemplateError => Error reading template file. If you are sure it's a valid PDF, it may be a bug.") or the resulting PDF comes out malformed.
(It's a known issue that some PDFs don't work as templates in Prawn, so I'm not trying to address that here:
[1]
[2])
If I take any of the problematic PDFs, and manually re-save them on my Mac using Preview > Save As [new PDF], I can then always use them as Prawn templates without any problem.
My question is: is there some (open source) server-side utility I can use that might be able to do the same thing, i.e. process problematic PDFs into something Prawn can use?
Yarin, it at least partially depends on why the PDFs don't work in the first place. If you can use them after re-saving with Apple's (quite bad) Preview PDF code, you should be able to get the same result using a number of different tactics:
-) Use an actual PDF library to open and save the PDF files (libraries from Adobe and Global Graphics come to mind). These are typically commercial products but (I know the Adobe library the best) they do allow you to open a file and save it, performing a number of optimisations in the process. The Adobe libraries are currently licensed through a company called DataLogics (http://www.datalogics.com)
-) Use a commercial product that embeds these libraries. callas pdfToolbox comes to mind (warning, I'm affiliated with this product). This basically gives you the same possibilities as the previous point, but in a somewhat easier to use package (command-line use for example).
-) Use an open source product. I'm not very well positioned to provide useful links for that.
There is another approach that may work depending on your workflow and files. In graphic arts, bad files are sometimes "made better" by a process called re-distilling: you convert the PDF file to PostScript and re-distill the PostScript into PDF again. Because this rewrites the whole file structure, it often fixes fundamental problems. However, it also comes with risks, as you're going through a different file format. Tools such as Ghostscript (watch the licensing conditions) may allow you to do this.
Given that your files seem to be fixed simply by re-saving in Preview, I would think a redistilling approach would be overly dangerous and overkill. I would look into finding a good PDF library that can automatically open and save your files.

I need a wrapper (or alternative) for Office Open XML Presentations / PowerPoint

I recently automated the creation of PowerPoint presentations in a site I'm making. I found the Office Interop libraries extremely simple to use.
Office isn't built for this kind of thing in a webserver environment, so I'm looking at creating the PowerPoints using Office Open XML, but it's extremely complex. For example, I downloaded some code to create a blank presentation with some text. This code was around 300 lines! Using the Office Interop libraries I could do the same thing in just a couple of lines of code.
I don't have time, nor do I want to attempt to learn how to interact with the Open XML libraries, so I'm hoping someone has made a wrapper for them. So far all my searching has only given me one result, Aspose Slides for .NET. This looks really promising, but it also looks rather expensive.
Has anyone ever used a decent wrapper or alternative before?
If you are looking at automating the creation of PowerPoint presentation files, I'd say you should continue with Open XML; there's nothing better than it. Everything else is either paid or doesn't offer the entire gamut of functionality that Open XML provides.
If you find creating a blank file tedious, you could save an empty file somewhere and use that as a template for performing further operations on it.
The only thing close to a wrapper for PowerPoint I've found is the Open XML PowerTools. It includes a PresentationBuilder class which can be used for some specific tasks, like combining slides from multiple PowerPoint documents into a new document. Although it's pretty limited in its functionality, you could extend the class.
However, I've come to the conclusion that there just is not a good wrapper out there so I've had to do what everybody pretty much recommends and that is using the Open XML SDK Productivity Tool and the Reflect code button.
I put together a basic presentation, used Reflect Code, and put the result into a class. Yes, it's a lot of lines of code and it's not the most elegant solution, but it does work. From there I can extend or modify that class to do the specific things I need to do with each slide. The Productivity Tool is a big help for figuring out the code needed to do specific things. I try to keep it simple and just do one or two things at a time, Reflect Code, then look at the code to see what it does.
You could try SoftArtisans PowerPointWriter; it has a template mode that allows you to start with an existing PowerPoint file with a few placeholders, and merge your data with your presentation with as little as 5 lines of code.
Disclaimer: I work for SoftArtisans

Better way to edit complex UIs than using IB

For simple UIs, IB is a great tool to edit controls and outlets.
If UIs get more complex and contain many bindings, things tend to get opaque. On one side you edit source code; on the other you edit XIBs. Xcode's search feature finds certain names in XIBs, but not all. For example, Xcode doesn't find properties of bindings in XIBs.
Thus, I wonder whether better ways to edit UIs exist.
If UIs could be - optionally - specified using XML, one could easily search and replace all occurrences of a given name [or even dynamically generate XML specifications].
I feel Adobe Flex's UI editor - either visual or using the nicely integrated XML editor - combines both worlds in a good way: the XML editor is fully aware of defined names and provides a helpful auto-completer.
How should complex UIs be managed using Xcode?
IB has a lot of problems. It's also still the best tool for the job in most cases. (The same can be said of Xcode generally.) As much as possible, keep your nib files simple and avoid really fancy or complex bindings.
If you find cases where Xcode's Refactor...Rename tool does not correctly modify nib files, you should open a bug at bugreport.apple.com.
You always have the option of examining/editing the nib files directly using the command line tools like ibtool. I use it at times to inspect complex bindings.
Beyond the graphical editors, Xcode and even Interface Builder are really just faces for a collection of Unix command-line tools. You can always dig as far under the hood as you wish.
Nib files are just plist files, which are just a specific XML schema, so you can edit them directly if you wish. However, they are much more complex than Flex files, for obvious reasons.
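Inspecting such property lists programmatically is straightforward. A minimal sketch with Python's standard library plistlib, using made-up keys for illustration (compiled nibs are typically binary plists, which plistlib also parses, or which `plutil -convert xml1` will turn into XML first):

```python
# Sketch: parse an XML property list with the standard library.
# The keys below are illustrative, not taken from a real nib; compiled nibs
# are usually binary plists, which plistlib.loads handles as well.
import plistlib

xml_plist = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
 "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>IBDocumentLocation</key>
    <string>69 14 356 240 0 0 1440 878</string>
    <key>IBOpenObjects</key>
    <array><integer>5</integer></array>
</dict>
</plist>
"""

doc = plistlib.loads(xml_plist)      # returns a plain dict of native types
location = doc["IBDocumentLocation"]
open_objects = doc["IBOpenObjects"]
```

This is also the basis for scripted search-and-replace across a project's UI files when the IDE's search falls short.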

MS Excel automation without macros in the generated reports. Any thoughts?

I know that the web is full of questions like this one, but I still haven't been able to apply the answers I can find to my situation.
I realize there is VBA, but I always disliked having the program/macro living inside the Excel file, with the resulting bloat, security warnings, etc. I'm thinking along the lines of a VBScript that works on a set of Excel files while leaving them macro-free. Now, I've been able to "paint the first column blue" for all files in a directory following this approach, but I need to do more complex operations (charts, pivot tables, etc.), which would be much harder (impossible?) with VBScript than with VBA.
For this specific example knowing how to remove all macros from all files after processing would be enough, but all suggestions are welcome. Any good references? Any advice on how to best approach external batch processing of Excel files will be appreciated.
Thanks!
PS: I eagerly tried Mark Hammond's great PyWin32 package, but the lack of documentation and interpreter feedback discouraged me.
You could put your macros in a separate Excel file.
Almost anything you can do in VBA to automate Excel you can do in VBScript (or any other script/language that supports COM).
Once you have created an instance of Excel.Application you can pretty much drop your VBA into a VBS and go from there.
If it's the Excel/VBA capability that you're looking to use then you could always start by creating all of the code that will interact with the Excel files you're wanting to work on within an Excel file - a kind of master file that is separated from the regular files, as suggested by Karsten W.
This gives you the freedom to write Excel/VBA.
Then you can call your master workbook (which can be configured to run your code when the book is opened, for example) from a VB script, batch file, Task Scheduler, etc.
If you want to get fancy, you can even use VBA in your master file to create/modify/delete custom macros/VBA modules in any of the target files that you're processing.
The info for just about all of the techniques I'm describing came from the built-in Excel VBA reference docs, but it certainly helps to be familiar with the specific programming tasks you're tackling. I'd advise that the best approach is to put together your tasks (e.g., make a column blue, update/sort data, etc.) one by one and then worry about the automation at the end.
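On the "remove all macros from all files after processing" point: modern macro-enabled workbooks (.xlsm) are ZIP containers, and the VBA lives in a single xl/vbaProject.bin entry, so one option is to rewrite the archive without that entry. A stdlib Python sketch on a toy archive; note that a faithful .xlsm-to-.xlsx conversion also needs [Content_Types].xml and the workbook relationships adjusted, which this deliberately skips:

```python
# Sketch: drop the VBA project from a macro-enabled workbook by rewriting the
# ZIP container without xl/vbaProject.bin. A faithful .xlsm -> .xlsx conversion
# also needs [Content_Types].xml and the workbook relationships updated; this
# toy version only removes the macro payload itself.
import io
import zipfile

def strip_macros(xlsm_bytes):
    src = zipfile.ZipFile(io.BytesIO(xlsm_bytes))
    out = io.BytesIO()
    with zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for name in src.namelist():
            if name != "xl/vbaProject.bin":
                dst.writestr(name, src.read(name))
    return out.getvalue()

# toy stand-in for a real .xlsm
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("[Content_Types].xml", "<Types/>")
    z.writestr("xl/workbook.xml", "<workbook/>")
    z.writestr("xl/vbaProject.bin", b"\x00fake-vba\x00")

clean = strip_macros(buf.getvalue())
names = zipfile.ZipFile(io.BytesIO(clean)).namelist()
```

For production use, the simpler and safer route is the one implied above: have Excel itself do it by saving the processed file in the .xlsx format, which cannot carry macros at all.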
