We've got a system that takes in a large variety of PDFs from unknown sources, and then uses them as templates for new PDFs generated by Prawn.
Occasionally some PDFs don't work as templates for Prawn: they either trigger a generic Prawn error ("Prawn::Errors::TemplateError => Error reading template file. If you are sure it's a valid PDF, it may be a bug.") or the resulting PDF comes out malformed.
(It's a known issue that some PDFs don't work as templates in Prawn, so I'm not trying to address that here: [1] [2])
If I take any of the problematic PDFs, and manually re-save them on my Mac using Preview > Save As [new PDF], I can then always use them as Prawn templates without any problem.
My question is, is there some (open source) server-side utility I can use that might be able to do the same thing, i.e. process problematic PDFs into something Prawn can use?
Yarin, it at least partially depends on why the PDFs don't work in the first place. If you can use them after re-saving with Apple's (quite bad) Preview PDF code, you should be able to get the same result using a number of different tactics:
-) Use an actual PDF library to open and save the PDF files (libraries from Adobe and Global Graphics come to mind). These are typically commercial products, but they do allow you to open a file and save it, performing a number of optimisations in the process (I know the Adobe library best). The Adobe libraries are currently licensed through a company called DataLogics (http://www.datalogics.com).
-) Use a commercial product that embeds these libraries. callas pdfToolbox comes to mind (warning, I'm affiliated with this product). This basically gives you the same possibilities as the previous point, but in a somewhat easier to use package (command-line use for example).
-) Use an open source product. I'm not very well positioned to provide useful links for that.
There is another approach that may work depending on your workflow and files. In graphic arts, bad files are sometimes "made better" by a process called re-distilling: you convert the PDF file to PostScript and re-distill the PostScript into PDF again. Because this rewrites the whole file structure, it often fixes fundamental problems. However, it also comes with risks, as you're going through a different file format. Libraries such as Ghostscript (watch the licensing conditions) may allow you to do this.
Given that your files seem to be fixed simply by using preview, I would think a redistilling approach would be overly dangerous and overkill. I would look into finding a good PDF library that can automatically open and save your files.
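If you do end up trying the Ghostscript route server-side, a minimal sketch (assuming Ghostscript is installed, and with made-up file names) would be to push each uploaded PDF through the pdfwrite device, which interprets the file and writes a brand-new one, much as Preview's "Save As" does:

    # Hypothetical helper: rewrite a problematic PDF with Ghostscript before
    # handing it to Prawn as a template. File names are placeholders.
    source  = 'uploaded.pdf'
    cleaned = 'cleaned.pdf'

    ok = system('gs', '-dBATCH', '-dNOPAUSE', '-sDEVICE=pdfwrite',
                "-sOutputFile=#{cleaned}", source)
    raise "Ghostscript could not rewrite #{source}" unless ok

Whether the rewritten file then works as a Prawn template still depends on what was wrong with it in the first place.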
We are in the 21st century and still there is no good way to tag photos and videos? There is always a dependency on some tool... Is there no way to make the file autonomous with respect to its tags?
Video files, for example, are not friendly to tags. Some video formats do not allow tagging at all. Some tools keep the metadata in their own external representation, and when you copy the original file to some new destination, the metadata of the file in the destination is lost. Also, this metadata may only be seen by that proprietary tool and is not visible to other tools (e.g. tags added by Adobe products are not visible/searchable in Windows Explorer).
Is there a universal way to tag any file including video files so that
searching for files having a certain tag is possible in any tool
when a file is copied, the tags are transferred with it
when the file is edited in any tool and re-saved, the tags are not lost...?
There is no universal way at this point, and there may never be one.
Probably the closest we've got is the file tagging provided by popular OSes, based on a file-system feature called 'forking'. By this means, Windows and macOS provide the ability to easily add metadata (including keywords) to any file on the file system without changing the file's content. One serious drawback of this feature is that it does not cross file-system boundaries, i.e. if you simply upload a file to the web, or copy it to a different type of file system, the metadata will be lost. There are ways to copy such metadata, but that requires consideration and use of appropriate tools.
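To make the mechanism concrete: on macOS, the same idea is exposed through extended attributes, which the built-in xattr tool can read and write without touching the file's content. A rough sketch (the attribute name com.example.tags is made up for illustration; Finder's own tags live in a different, Apple-defined attribute):

    # Hypothetical tagging via macOS extended attributes; the file name and
    # attribute name are placeholders.
    file = 'holiday.jpg'
    system('xattr', '-w', 'com.example.tags', 'beach,sunset', file)  # write tags
    puts `xattr -p com.example.tags #{file}`                         # read them back

As noted above, these attributes survive a local copy on the same file system, but they are typically dropped when the file is uploaded or moved to a file system that doesn't support them.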
Through the years I've had many opportunities to "reverse engineer" proprietary files, and I noticed that many times these are "disguised" ZIP files which just pack standard XML, HTML, config and raw text files. However, I don't understand why the developers would do that.
A few examples of these "disguised" file formats, off the top of my head:
PPTX, XLS, DOCX and probably all Microsoft's file formats
EPUB
JAR, WAR, though these I can understand, as they are meant to be archives
There are many other file formats of this sort, and sometimes even companies that really don't want their data files to be publicly read rely on this disguised ZIP to store data (like game saves).
What are the technological advantages of ZIP files over custom file types?
Is there a name for the practice of building a (sometimes proprietary) new file format on top of ZIP?
If you want your new file format to be interoperable with other applications, you'll need to define your format completely. Building on top of other standards such as ZIP, XML, and HTML cuts down a large part of the documentation and maintenance effort.
The format designer is usually also the first implementor. Using existing standards means they can use existing, known-to-be-correct and working tools to create and read files. This means the Microsoft Office file format designers, for example, don't need to debug serialization and deserialization logic, since they're already using industry-proven XML.
Using a compressed archive instead of a plain archive format such as TAR means your format automatically reduces the required storage when possible. ZIP is an ISO standard and patent-free (as long as it's not encrypted with a strong algorithm), so the designer and implementor don't need to pay for a license, unlike with, say, RAR.
Implementing the consuming application on different hardware or platforms may require rewriting a large part of the code unless it's built on top of already popular standards. An EPUB reader, for example, can be patched together from a ZIP reader library (which is usually built into various frameworks) and an HTML viewer. That's near-zero effort on the developer's side, who can then focus on other features. Since the framework and CPU are likely optimized to handle ZIP compression, they usually perform much better than a custom compression format.
Another rarely considered factor is security and reliability. A custom archiving format may seemingly work faster or compress more efficiently, but on real-world data it might crash, or worse, return wrong reads, which can result in a security breach or incorrect results.
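The "patched together from existing libraries" point is easy to see in practice: any stock ZIP library can look inside these containers. A small sketch in Ruby (assuming the rubyzip gem; report.docx is just a hypothetical ZIP-based Office file):

    # List the entries inside a ZIP-based document with an off-the-shelf library.
    require 'zip'

    Zip::File.open('report.docx') do |zip|
      zip.each { |entry| puts entry.name }  # e.g. [Content_Types].xml, word/document.xml, ...
    end

No format-specific parsing code is needed just to get at the pieces; the format designer inherits all of that for free.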
As for companies not wanting their files to be read, there are plenty of solutions that can be built on top of ZIP. AES encryption is available as an open standard for ZIP under AE-x. If they don't need to hide the entire structure, just the values, they can encrypt the individual entries in the XML/JSON or the individual files. EPUB DRMs can be broken easily, but that would happen regardless of whether the ebook used a non-ZIP-based format.
I don't think there's a specific name for building a new format on top of ZIP. When you want to store a string, you pick one of the available text-encoding standards; if you want to keep the value secret, you encrypt it with yet another encryption standard rather than inventing a new encoding scheme. What those designers are doing is simply taking existing standards, and they're not just using ZIP; they're also using XML, Unicode, various image formats, etc.
About Microsoft formats being ZIP: well, not all of them. Pre-2007 Office files aren't, which is partly the reason behind the difficulties of implementing and improving those formats (another reason is that Microsoft deliberately prevented people from doing so in the first place by not documenting them). XLSB is ZIP, but instead of XML it uses binary serialization, which speeds up saving and opening; once loaded, though, it operates as fast and as memory-efficiently as an XLSX file. ACCDB, like its precursor MDB, isn't a ZIP file; databases in general are allergic to being compressed. Visio transitioned more slowly: Visio 2010 used the XML-based VDX (not compressed), then 2013 added VSDX (XML- and ZIP-based), while Project and Publisher don't seem to be moving to a new format any time soon. XPS, NuGet, and APPX are ZIP, but csproj, vbproj, etc. aren't. MSI installers are archives, but they're not ZIP files.
It's interesting you stopped at JAR & WAR, because continuing on, Android APK files are ZIP files (and may themselves contain the contents of the JARs they reference), as is the overarching AAB. On iOS, IPA files are ZIP too. The LibreOffice default formats ODT, ODS, and ODP are all ZIP- and XML-based, designed around the same time as Microsoft Office's new formats.
I have a large number of books in PDF format, mostly from publishers like the Pragmatic Programmers, and I have made many annotations (comments, highlights, notes, etc.) in the PDFs that I'd like to preserve. This becomes a problem when the publisher updates the PDF and I download the update. I'd like a way to merge my annotations (which are basically changes) with the new version. Opening each one and manually copying and pasting is obviously a ridiculously tedious chore, so:
Is there any way to do this programmatically? I'd prefer to use Ruby, but I'd be almost as happy with a shell script-based solution, and I'd even be willing to learn Python or use Perl if those were the only options.
I have a program which creates certain save files during its use. Technically they are XML files; however, I don't want to use the .xml extension, as I will be modifying the shell so that my program opens when the files are double-clicked in Explorer.
Is there any guidance on what file extensions I can effectively "invent"? I can't find any official guidelines anywhere.
I want to use .senx, but I have no idea whether this is safe to do.
There is no "official" registry of file extensions (although there probably should be).
An Internet search will reveal several different sites that contain unofficial listings of the file extensions in common use by various applications, but if you go by that, there are hardly any file extensions still available to choose from.
The important thing is to figure out which applications your target audience is likely to have installed, and then make sure that your custom file extension doesn't conflict with any of those. (If you must do so, you also must provide the user with an option to revert your file extension associations, and preferably make it a configurable option during installation.)
Remember that there's no reason you should have to limit yourself to three or even four character file extensions. You can use as many as you need, which exponentially increases the likelihood that your choice will be unique. For example, Visual Studio persists its environment settings in a .vssettings file; it's very unlikely any other application will conflict with that any time soon.
In fact, this is Microsoft's official advice:
Do Not Use Short File Name Extensions
Long file name extensions offer the following advantages:
The limited length of short extensions makes them prone to extension collisions. An extension collision occurs when the same extension is used to classify multiple file types. Using long extensions significantly decreases the chances of a collision.
Short file names tend to be somewhat cryptic. Long extensions tend to be more meaningful because additional information can be embedded in the extension.
For more information, see file name extensions.
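Once you've picked something reasonably unique like .senx, wiring up the double-click behaviour is a separate, mechanical step done in whatever language your installer uses. A rough sketch of a per-user association on Windows, here in Ruby with its win32/registry library (the ProgID and install path below are made up for illustration):

    # Hypothetical per-user file association for a made-up .senx extension.
    # Written under HKCU\Software\Classes so no admin rights are required.
    require 'win32/registry'

    app = 'C:\\Program Files\\SenxApp\\SenxApp.exe'  # placeholder install path

    Win32::Registry::HKEY_CURRENT_USER.create('Software\\Classes\\.senx') do |reg|
      reg.write('', Win32::Registry::REG_SZ, 'SenxApp.Document')
    end

    Win32::Registry::HKEY_CURRENT_USER.create(
      'Software\\Classes\\SenxApp.Document\\shell\\open\\command') do |reg|
      reg.write('', Win32::Registry::REG_SZ, "\"#{app}\" \"%1\"")
    end

As noted above, the installer should also offer a way to undo this association.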
I want to dynamically load (AJAX) the text from some Microsoft Word files into a webpage. So I might have a link to essays I've written and upon mouseover have it load the first few sentences in a tooltip.
Only if you have a parser. I think the new format is a ZIP archive containing XML, but the old one is just binary.
There are some parsers out there.
I know of wvWare but it seems it's outdated. (http://wvware.sourceforge.net/)
This is maybe something worth looking at: http://poi.apache.org/hwpf/index.html
And yeah, forgot to mention how to do this. :-)
First you need to make the JavaScript ask for the data through Ajax. The server side has to take care of the parsing and return the text to the JavaScript. This will be a pain in the ass. I haven't done this myself and have never tried the parsers I linked, so I'm not sure if they suit you. Images, stylesheets, etc.... not sure if that will be usable.
At least, good luck.
For security reasons, it is not possible to directly load a local file (such as a Word document) into the page using JavaScript alone. The user will need to upload the file to the server, where you will want to parse it, and then you can load whatever result you like into the page using Ajax.
It sounds like you mean to upload your files (e.g. essays) to your server to allow users to download them, and want to create a server-side page that will parse the files and print the first few lines (so it can be called by an AJAX method that displays a preview on hover).
To suggest a tool for this, we'll need to know whether these are "old" Word format (Office 2003 - extension is .doc) or "new" Word format (Office 2007 - extension is .docx).
It will also be good to know what you're using to create your pages server-side, since different document-reading tools support different programming languages. If you're using Java to read .doc files, you can use the tool we use at my place of work, which is POI (http://poi.apache.org/). If you're using something else, try searching Google for {read <extension> in <language>}, e.g. {read .docx in ruby}.
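If your server side happens to be Ruby, a rough sketch of that "read .docx in ruby" route (assuming the rubyzip and nokogiri gems, and a hypothetical essay.docx) could look like this:

    # Pull a short text preview out of a .docx: the file is a ZIP archive and
    # the body text lives in <w:t> elements inside word/document.xml.
    require 'zip'
    require 'nokogiri'

    def docx_preview(path, sentence_count = 3)
      xml = Zip::File.open(path) { |zip| zip.read('word/document.xml') }
      doc = Nokogiri::XML(xml)
      text = doc.xpath('//w:t').map(&:text).join(' ')
      text.split(/(?<=[.!?])\s+/).first(sentence_count).join(' ')
    end

    puts docx_preview('essay.docx')

The old binary .doc format is a different story; for that you'd still want one of the parsers mentioned above.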
If all of this is Greek to you and you have no prior experience with developing custom server-side web code, this is probably going to be unnecessarily painful and you should consider an alternative (like manually creating a 3-line text "preview" page for each regular page, and then just showing that).