Why are many file formats disguised ZIP files? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 2 years ago.
Over the years I've had many opportunities to "reverse engineer" proprietary files, and I noticed that many of them are "disguised" ZIP files which just pack standard XML, HTML, config, and raw text files. However, I don't understand why the developers would do that.
A few examples off the top of my head of these "disguised" file formats are:
PPTX, XLSX, DOCX and probably all Microsoft's file formats
EPUB
JAR and WAR, though these I can understand, as they are meant to be archives
There are many other file formats of this sort, and sometimes even companies that really don't want their data files to be publicly readable rely on these disguised ZIPs to store data (like game saves).
What are the technological advantages of ZIP files over custom file types?
Is there a name for the practice of building a (sometimes proprietary) new file format on top of ZIP?

If you want your new file format to be interoperable with other applications, you'll need to define your format completely. Building on top of other standards, such as ZIP, XML, and HTML, cuts down a large part of the documentation and maintenance effort.
The format designer is usually also the first implementor. Using existing standards means they can use existing tools, known to be correct and working, to create and read files. This means the Microsoft Office file format designers, for example, don't need to debug serialization and deserialization logic, since they're already using industry-proven XML.
Using a compressed archive instead of a plain archive format such as TAR means your format automatically reduces the required storage when possible. ZIP is an ISO standard and patent-free (as long as it's not encrypted with a strong algorithm), so the designer and implementor don't need to pay for a license, unlike with, say, RAR.
Implementing the consuming application on different hardware or platforms may require rewriting a large part of the code unless it's built on top of already popular standards. An EPUB reader, for example, can be patched together from a ZIP reader library (which is usually built into various frameworks) and an HTML viewer. That's near-zero effort on the developer's side, who can then focus on other features. Since the framework and CPU are likely optimized to handle ZIP compression, they usually perform much better than a custom compression format.
Another rarely considered factor is security and reliability. A custom archiving format may seemingly work faster or compress more efficiently, but on real-world data it might crash or, worse, return wrong reads, which can result in a security breach or incorrect results.
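As a rough illustration of how little glue code this takes, here is a minimal sketch (C#, using .NET's built-in System.IO.Compression; the file name is just a placeholder) that lists the entries of any ZIP-based document, whether it's a DOCX, EPUB, or JAR:

    using System;
    using System.IO.Compression; // ships with modern .NET

    class ListZipEntries
    {
        static void Main()
        {
            // "sample.docx" is a placeholder; any ZIP-based file works (EPUB, JAR, XLSX, ...)
            using (ZipArchive archive = ZipFile.OpenRead("sample.docx"))
            {
                foreach (ZipArchiveEntry entry in archive.Entries)
                    Console.WriteLine($"{entry.FullName} ({entry.Length} bytes)");
            }
        }
    }

Renaming the file to .zip and opening it in any archive tool shows the same thing; a DOCX, for instance, is mostly XML parts under a word/ folder plus a [Content_Types].xml manifest.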
As for companies not wanting their files to be read, plenty of solutions can be built on top of ZIP. AES encryption is available as an open standard for ZIP under AE-x. If they don't need to hide the entire structure, just the values, they can encrypt individual values in the XML/JSON or individual files. EPUB DRMs can be broken easily, but that would happen regardless of whether the ebook used a non-ZIP-based format.
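For instance, here is a minimal sketch of encrypting a single value before it is written into one of the archived XML/JSON entries (C#, using the built-in AES implementation; the key handling is deliberately simplified and not production-grade):

    using System;
    using System.Security.Cryptography;
    using System.Text;

    class EncryptValue
    {
        // Encrypts one string value; the surrounding ZIP/XML structure stays readable.
        static byte[] Encrypt(string plaintext, byte[] key)
        {
            using (Aes aes = Aes.Create())
            {
                aes.Key = key;
                aes.GenerateIV();
                using (ICryptoTransform enc = aes.CreateEncryptor())
                {
                    byte[] data = Encoding.UTF8.GetBytes(plaintext);
                    byte[] cipher = enc.TransformFinalBlock(data, 0, data.Length);
                    // Prepend the IV so the value can be decrypted later.
                    byte[] result = new byte[aes.IV.Length + cipher.Length];
                    Buffer.BlockCopy(aes.IV, 0, result, 0, aes.IV.Length);
                    Buffer.BlockCopy(cipher, 0, result, aes.IV.Length, cipher.Length);
                    return result;
                }
            }
        }

        static void Main()
        {
            byte[] key = new byte[32];                      // placeholder key management
            using (var rng = RandomNumberGenerator.Create())
                rng.GetBytes(key);

            string protectedValue = Convert.ToBase64String(Encrypt("player-save-data", key));
            Console.WriteLine(protectedValue);              // this Base64 string is what goes into the entry
        }
    }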
I don't think there's a specific name for building a new format on top of ZIP. When you want to store a string, you pick one of the available text encoding standards; if you want to keep the value secret, you encrypt it with yet another encryption standard rather than inventing a new encoding scheme. What those designers are doing is simply reusing existing standards, and they're not just using ZIP: they're also using XML, Unicode, various image formats, etc.
About Microsoft formats being ZIP: well, not all of them. Pre-2007 Office files aren't, which is partly the reason behind the difficulties of implementing and improving support for them (another reason is that Microsoft deliberately prevented people from doing so in the first place by not documenting them). XLSB is ZIP, but instead of XML it uses binary serialization, which speeds up saving and opening; once opened, it operates about as fast and as memory-efficiently as an XLSX file. ACCDB, like its precursor MDB, isn't a ZIP file; databases, in general, are allergic to being compressed. Visio transitioned more slowly: Visio 2010 uses the XML-based, uncompressed VDX, then 2013 added VSDX (XML and ZIP based), while Project and Publisher don't seem to be moving to a new format any time soon. XPS, NuGet, and APPX packages are ZIP, but csproj, vbproj, etc. aren't. MSI installers are archives, but they're not ZIP files.
It's interesting you stopped at JAR and WAR, because continuing on, Android APK files are ZIP files (and may themselves contain the contents of the JARs they reference), as is the overarching AAB. On iOS, IPA files are ZIP too. The default LibreOffice formats, ODT, ODS, and ODP, are all ZIP and XML based, designed around the same time as Microsoft Office's new formats.

Related

Is there an archive file format that supports being split into multiple parts and can be unpacked natively on MS Windows?

Some archive file formats, e.g. ZIP (see Section 8 in https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT), support being split into multiple parts of a limited size. ZIP files can be opened natively on recent versions of Microsoft Windows, but it seems that Windows cannot open split ZIP files natively, only with special tools like 7-Zip. I would like to use this "split archive" functionality in a web app I'm writing, in which the created archives should be opened by a large audience of "average" computer users. So my question is: is there an archive file format (like ZIP) that supports being split into multiple parts and can be unpacked without installing additional software on recent versions of Microsoft Windows? And ideally on other widely used operating systems as well.
Background: my final goal is to export a directory structure that is split over multiple web servers to a single local directory tree. My current idea is to have each web server produce one part of the split archive, provide all of them as some sort of JavaScript multi-file download, and then have one archive (in multiple parts) on the user's computer that just needs to be unpacked. An alternative idea for this final goal was to use JavaScript's File System Access API (https://developer.mozilla.org/en-US/docs/Web/API/File_System_Access_API), but it is not supported on Firefox, which is a showstopper.
CAB archives partly meet this purpose (see this library's page for an example; it says that through it, archives can even be extracted directly from an HTTP(S)/FTP server). Since the library relies on .NET, it could even be used on Linux through Mono/Wine, which is crucial if your servers aren't running Windows... because the archive must be created on the server, right?
Your bigger problem is that a split archive can't be created in parallel on multiple servers, if only because of LZX's dictionary. Each server would have to create the whole set of archives and send only the ones it is supposed to send, and you don't have ANY guarantee that these archive sets would be identical on every server.
The best way is probably to create the whole archive on ONE server, then distribute each part (or the whole split archive...) to your various servers through a replication-like mechanism.
Otherwise, you can also make individual archives that each contain only a subset of the directory tree (you'll have to partition the files across servers), but that won't meet your requirements, since it would be a collection of individual archives and not one big split archive.
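If that compromise is acceptable, here is a rough sketch of the idea (C#, modern .NET; the root path, server count, and partitioning rule are all placeholders):

    using System;
    using System.IO;
    using System.IO.Compression;
    using System.Linq;

    class PartitionIntoZips
    {
        static void Main()
        {
            string root = "/data/tree";   // placeholder source directory
            int serverCount = 3;          // placeholder number of servers

            var files = Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories).ToList();

            for (int server = 0; server < serverCount; server++)
            {
                using (ZipArchive zip = ZipFile.Open($"part-{server}.zip", ZipArchiveMode.Create))
                {
                    // Naive deterministic partition: file index modulo server count.
                    foreach (var file in files.Where((_, i) => i % serverCount == server))
                        zip.CreateEntryFromFile(file, Path.GetRelativePath(root, file));
                }
            }
        }
    }

Each part-N.zip is then an ordinary, independently extractable ZIP, which is exactly why it is not a true split archive.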
Some clarifications may be needed:
Do you absolutely need a system with no client besides the browser? Or can you use other protocols, as long as they exist natively on Windows (like the FTP / SSH clients that are now provided by default)?
What is the real purpose behind this request? To distribute load across all servers? To avoid overly large single-file downloads (e.g. a 30 GB archive) in case of transfer failure? Or both?
If the problem is file size, why not rely on resumable downloads?

Is there a way to tag video files independently of any tool?

We are in the 21st century and still there is no good way to tag photos and videos? There is always a dependency on some tool... Is there no way to make the file autonomous with respect to its tags?
Video files, for example, are not friendly to tags. Some video formats do not allow tagging at all. Some tools keep the metadata in their own external representation, and when you copy the original file to some new destination, the metadata of the file at the destination is lost. Also, this metadata may only be seen by that proprietary tool and is not visible to other tools (e.g. tags added by Adobe products are not visible/searchable in Windows Explorer).
Is there a universal way to tag any file including video files so that
searching for files having a certain tag is possible in any tool
when a file is copied, the tags are transferred with it
when the file is edited in any tool and re-saved, the tags are not lost...?
There is no universal way at this point, if there ever will be one.
Probably the closest we have is the file tagging provided by popular OSes, based on a file-system feature called 'forking'. By this means, Windows and Mac provide the ability to easily add metadata (including keywords) to any file on the file system without changing the file's content. One serious drawback of this feature is that it does not cross file-system boundaries, i.e. if you simply upload a file to the web, or copy it to a different type of filesystem, the metadata will be lost. There are ways to copy such metadata, but that requires care and the use of appropriate tools.
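As an illustration of how fragile this is, here is a small sketch that writes keywords into an NTFS alternate data stream next to a video file (C#; it assumes .NET Core or later on Windows with NTFS, since the older .NET Framework rejects colons in stream paths, and the stream name "tags" is an arbitrary choice). The tags silently disappear if the file is copied to FAT32, zipped, or uploaded:

    using System;
    using System.IO;

    class AdsTags
    {
        static void Main()
        {
            string video = "holiday.mp4";         // placeholder file
            string tagStream = video + ":tags";   // NTFS alternate data stream ("fork")

            // Write keywords into the fork; the main file's bytes are untouched.
            File.WriteAllText(tagStream, "beach;2023;family");

            // Read them back (NTFS only; lost when the file leaves the file system).
            Console.WriteLine(File.ReadAllText(tagStream));
        }
    }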

How do I debug a KML file?

I am editing KML files for maps of history and science that already appear on http://climateviewer.org/. I am editing them in Sublime Text and/or Notepad, since all I am doing is editing text, deleting extended data, and switching links and references from my old web site MyReadingMapped to the new site, which has far better technology. You can see images of the maps I made at http://climateviewer.org/myreadingmapped/
BTW, I am not a programmer or developer of software, but rather a retired marketing communications professional who understands just enough coding to make these changes and can do some html as well.
The problem I am having is that, of the 30 or so files I have edited so far, 4 have a parsing error that consistently involves closing a Placemark, yet there appears to be nothing wrong with the code. I am testing the files by uploading them to Google Earth to get the error statements. So far I have fixed many problems, but I can't seem to solve this one. Jim Lee, ClimateViewer's creator, tells me to debug them.
How do I debug them and is it something I would be able to learn without formal training?
There are several tools available to debug a KML file, which is simply an XML file that must conform to the rules of the KML specification. As an XML file, all start and end tags must match. In addition, the tags are case-sensitive.
The easiest trick is using a web browser to validate it. Simply rename the KML file to an XML file (rename the .kml extension to .xml), then drag the .xml file onto an open web browser. Parsing errors will be identified with row and column numbers.
Next, you can upload the KML file to KML Validator to get a list of potential errors that need to be fixed or run the standalone command-line XmlValidator tool.
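If you prefer to check files locally, the same well-formedness check a browser performs takes only a few lines with any XML parser. A minimal sketch in C# (the file name is a placeholder):

    using System;
    using System.Xml;

    class KmlWellFormedCheck
    {
        static void Main()
        {
            try
            {
                // Stream through the document; any mismatched, unclosed,
                // or wrongly-cased tag throws an XmlException.
                using (XmlReader reader = XmlReader.Create("map.kml"))
                    while (reader.Read()) { }

                Console.WriteLine("Well-formed XML.");
            }
            catch (XmlException ex)
            {
                Console.WriteLine($"Parse error at line {ex.LineNumber}, column {ex.LinePosition}: {ex.Message}");
            }
        }
    }

Note this only checks XML well-formedness, not conformance to the KML schema; the validators above cover the latter.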
Additional tips to fix KML files are described here along with details about KML validation.

Preparing PDFs for use as Prawn templates

We've got a system that takes in a large variety of PDFs from unknown sources, and then uses them as templates for new PDFs generated by Prawn.
Occasionally some PDFs don't work as templates for Prawn: they either trigger a generic Prawn error ("Prawn::Errors::TemplateError => Error reading template file. If you are sure it's a valid PDF, it may be a bug.") or the resulting PDF comes out malformed.
(It's a known issue that some PDFs don't work as templates in Prawn, so I'm not trying to address that here:
[1]
[2])
If I take any of the problematic PDFs, and manually re-save them on my Mac using Preview > Save As [new PDF], I can then always use them as Prawn templates without any problem.
My question is: is there some (open source) server-side utility I can use that might be able to do the same thing, i.e. process problematic PDFs into something Prawn can use?
Yarin, it at least partially depends on why the PDFs don't work in the first place. If you can use them after re-saving with Apple's (quite bad) preview PDF code, you should be able to get the same result using a number of different tactics:
-) Use an actual PDF library to open and save the PDF files (libraries from Adobe and Global Graphics come to mind). These are typically commercial products but (I know the Adobe library the best) they do allow you to open a file and save it, performing a number of optimisations in the process. The Adobe libraries are currently licensed through a company called DataLogics (http://www.datalogics.com)
-) Use a commercial product that embeds these libraries. callas pdfToolbox comes to mind (warning, I'm affiliated with this product). This basically gives you the same possibilities as the previous point, but in a somewhat easier to use package (command-line use for example).
-) Use an open source product. I'm not very well positioned to provide useful links for that.
There is another approach that may work depending on your workflow and files. In graphic arts, bad files are sometimes "made better" by a process called re-distilling: you basically convert the PDF file to PostScript and re-distill the PostScript into PDF again. Because this rewrites the whole file structure, it often fixes fundamental problems. However, it also comes with risks, as you're going through a different file format. Tools such as Ghostscript (watch the licensing conditions) may allow you to do this.
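For completeness, here is a sketch of that kind of full rewrite using Ghostscript's pdfwrite device, driven from code (strictly speaking this rewrites PDF to PDF rather than going through PostScript, but it rebuilds the whole file structure in a similar way). The example shells out from C# purely for illustration; your own stack (e.g. Ruby around Prawn) would run the same command line, and it assumes the gs binary is installed and on the PATH:

    using System.Diagnostics;

    class GhostscriptRewrite
    {
        static void Main()
        {
            // Equivalent command line: gs -o rebuilt.pdf -sDEVICE=pdfwrite original.pdf
            var psi = new ProcessStartInfo
            {
                FileName = "gs",                                          // assumes Ghostscript on PATH
                Arguments = "-o rebuilt.pdf -sDEVICE=pdfwrite original.pdf",
                UseShellExecute = false
            };
            using (Process gs = Process.Start(psi))
                gs.WaitForExit();
        }
    }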
Given that your files seem to be fixed simply by using preview, I would think a redistilling approach would be overly dangerous and overkill. I would look into finding a good PDF library that can automatically open and save your files.

Bundling a program and external files into a single executable?

This question is kinda similar to this one, but not exactly. I have a game engine in C#, and I'm working with some people who want to use my engine. Originally I designed the engine so that all the assets are external: non-programmers can create art, music, XML settings, etc., and anyone can modify an existing game and share it with others. Basically the whole thing, including the engine itself, is open source.
The group I'm working with (one of only two projects currently using my engine) wants to lock down their assets so they can't be modified. Although it's against my principles, I don't want to turn them away, both because I've already been working with them for a while and because the market is very small (both for engines like mine and for users of those engines).
The Actual Question
Is there a way, maybe some available software, that can take an exe and a bunch of other arbitrary files and smash them into a single exe that isn't just an archive? I would like the final exe to behave as if it runs the first exe with some command-line parameters that refer to the bundled files. For example, running bundle.exe would be just like running original.exe --project_path=/project, but the project files are inside the bundle and cannot be retrieved from it.
My original exe is written in C#. I doubt that matters.
You could pack these files as embedded resources.
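A minimal sketch of what that looks like on the C# side (the resource name follows the usual DefaultNamespace.Folder.FileName convention and is hypothetical; the files are added to the project with Build Action set to Embedded Resource):

    using System.IO;
    using System.Reflection;

    static class EmbeddedAssets
    {
        // Reads a file that was compiled into the assembly as an Embedded Resource.
        // Error handling (a missing resource returns a null stream) is left out of this sketch.
        public static string ReadText(string resourceName)
        {
            Assembly assembly = Assembly.GetExecutingAssembly();
            using (Stream stream = assembly.GetManifestResourceStream(resourceName))
            using (var reader = new StreamReader(stream))
                return reader.ReadToEnd();
        }
    }

    // Usage (hypothetical resource name):
    //   string settings = EmbeddedAssets.ReadText("MyGame.Assets.settings.xml");

Keep in mind that embedded resources are not real protection: anyone with a .NET decompiler or resource viewer can still extract them, so this hides the assets from casual editing rather than truly securing them.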
