Reverse engineering iWork '13 formats

Prior versions of Apple's iWork suite used a very simple document format:
documents were bundles of resources (folders, zipped or not)
the bundle contained an index.apxl[z] file describing the document structure in a proprietary but fairly easy-to-understand schema
iWork '13 has completely redone the format. Documents are still bundles, but what was in the index XML file is now encoded in a set of binary files with type suffix .iwa packed into Index.zip.
In Keynote, for example, there are the following iwa files:
AnnotationAuthorStorage.iwa
CalculationEngine.iwa
Document.iwa
DocumentStylesheet.iwa
MasterSlide-{n}.iwa
Metadata.iwa
Slide{m}.iwa
ThemeStylesheet.iwa
ViewState.iwa
Tables/DataList.iwa
for MasterSlides 1…n and Slides 1…m
The purpose of each of these is quite clear from their names. The files even appear uncompressed, with essentially all of the content text directly visible as strings among the binary blobs (albeit with what looks like RTF/NSAttributedString-related garbage mixed in with the readable ASCII characters).
I have posted the unpacked Index of a simple example Keynote document here: https://github.com/jrk/iwork-13-format.
However, the overall file format is non-obvious to me. Apple has a long history of using simple, platform-standard formats like plists for encoding most of their documents, but there is no clear type tag at the start of the files, and it is not obvious to me what these iwa files are.
Do these files ring any bells? Is there evidence they are in some reasonably comprehensible serialization format?
Rummaging through the Keynote app runtime and class dumps with F-Script, the only evidence I've found is for some use of Protocol Buffers in the serialization classes which seem to be used for iWork, e.g.: https://github.com/nst/iOS-Runtime-Headers/blob/master/PrivateFrameworks/iWorkImport.framework/TSPArchiverBase.h.
Quickly piping a few of the files through protoc --decode_raw with the first 0…16 bytes lopped off produced nothing obviously usable.

I've done some work reverse engineering the format and published my results here. I've written up a description of the format and provided a sample project as well.
Basically, the .iwa files are Protobuf streams compressed using Snappy.
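If you want to poke at the payload yourself, here is a rough sketch of the unpacking step, assuming the chunk layout described above: a 4-byte header (0x00 plus a 3-byte little-endian length) followed by a raw Snappy block without the usual stream header. The offsets are assumptions to verify against your own files rather than a definitive parser, and it needs the python-snappy package:

    # Unpack one .iwa file into its decompressed Protobuf stream.
    import sys
    import snappy  # pip install python-snappy

    path = sys.argv[1]                        # e.g. Slide1.iwa
    with open(path, "rb") as f:
        raw = f.read()

    out = bytearray()
    i = 0
    while i < len(raw):
        assert raw[i] == 0x00                 # assumed chunk-type byte
        length = int.from_bytes(raw[i + 1:i + 4], "little")
        out += snappy.uncompress(raw[i + 4:i + 4 + length])
        i += 4 + length

    # The result is a stream of Protobuf messages; dump it to disk so you can
    # experiment with protoc --decode_raw on pieces of it.
    with open(path + ".bin", "wb") as f:
        f.write(out)
    print(f"wrote {len(out)} decompressed bytes to {path}.bin")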
Hope this helps!

Interesting project, I like it! Here is what I have found so far.
The first 4 bytes of each of the iwa files appear to be a length, with a tweak. So it looks like there will not be any 'magic' to verify file type.
Look at Slide1.iwa:
First 4 bytes are 00 79 02 00
File size is 637 bytes
Take the first 00 off and reverse the remaining bytes: 00 02 79
0x000279 == 633
637 - 633 = 4, which is exactly the 4-byte header that holds the size.
This checks out for the 4 files I looked at: Slide1.iwa, Slide2.iwa, Document.iwa, DocumentStylesheet.iwa
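Here is a quick sketch to check that interpretation (a leading 0x00, then a 3-byte little-endian length covering everything after the 4-byte header) against as many files as you like:

    import os
    import sys

    # Compare the assumed header (0x00 + 3-byte little-endian length) with the
    # actual file size for every .iwa passed on the command line.
    for path in sys.argv[1:]:
        with open(path, "rb") as f:
            header = f.read(4)
        declared = int.from_bytes(header[1:4], "little")
        actual = os.path.getsize(path)
        status = "match" if declared == actual - 4 else "MISMATCH"
        print(f"{path}: header says {declared}, size minus 4 is {actual - 4} ({status})")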

Related

How can an executable be this small in file size?

I've been generating payloads with Metasploit and experimenting with the different templates, and one of the templates you can generate your payload with is exe-small. The payload I've been generating is windows/meterpreter/reverse_tcp: using the normal exe template it comes out at around 72 KB, while exe-small outputs a payload of only about 2.4 KB. Why is this? And how could I apply this to my own programming?
The smallest possible PE file is just 97 bytes, and it does nothing (it just returns).
The smallest runnable executable today is 133 bytes, because Windows requires kernel32 to be loaded; executing a PE file with no imports is not possible.
At that size it can already download a payload from the Internet by specifying a UNC path in the import table.
To achieve such a small executable, you have to:
implement it in assembler, mainly to get rid of the C runtime
decrease the file alignment, which is 1024 by default
remove the DOS stub that prints the message "This program cannot be run in DOS mode"
merge some of the PE parts into the MZ header
remove the data directory
The full description is available in a larger research blog post called TinyPE.
For EXEs this small, most of the space is typically used by the icon. The icon resource usually contains several sizes and color depths, which you can strip out if you do not mind an "old, rusty" icon, or no icon at all.
Signing the EXE also adds about 4 KB.
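If you want to see where the bytes of a particular EXE actually go, something along these lines with the pefile package gives a rough breakdown (a sketch; icon data is not reported separately but lives inside the resource section/directory, and the security directory is where an Authenticode signature is counted):

    # Rough breakdown of a PE file: header size, per-section sizes on disk,
    # and the data directories that are present (resources, signature, ...).
    import sys
    import pefile  # pip install pefile

    pe = pefile.PE(sys.argv[1])

    print(f"headers: {pe.OPTIONAL_HEADER.SizeOfHeaders} bytes")
    for section in pe.sections:
        name = section.Name.rstrip(b"\x00").decode(errors="replace")
        print(f"section {name}: {section.SizeOfRawData} bytes on disk")
    for entry in pe.OPTIONAL_HEADER.DATA_DIRECTORY:
        if entry.Size:
            print(f"{entry.name}: {entry.Size} bytes")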
As an example of a small EXE, see Never10 by GRC. There is a details page which highlights the above points in its last paragraph:
https://www.grc.com/never10/details.htm
A final note: I'm a bit annoyed that “Never10” is as large as it is at
85 kbyte. The digital signature increases the application's size by
4k, but the high-resolution and high-color icons Microsoft now
requires takes up 56k! So without all that annoying overhead, the app
would be a respectable 25k. And, yes, of course I wrote it in
assembly language.
Disclaimer: I am not affiliated with grc in any way.
There is little need for an executable to be big, except when it contains what I call code spam: code that is not actually critical to the functionality of the program. This is true for other files too; compare a manually written HTML page to one generated by FrontPage. That's code spam.
I remember my good old DOS programs that were a few KB in size and performed practically any task needed in the OS. One of my .exes (actually a .com) was only 20 bytes.
Think of it this way: just as a large majority of the files in a Windows installation can be removed and the OS will still work perfectly, large parts of the code in an .exe are either unused, serve a purpose unrelated to the program's actual objective, or are added intentionally (see below).
The peak of this aberration is the code added nowadays to the .exe files of some games that use advanced copy protection, which can make the files as large as dozens of MB; the code actually needed to run the game is under 10% of the total.
A file size of 72 KB, as in your example, is plenty to do practically anything to a Windows OS.
To apply this to your own programming, that is, to make very small .exes, keep things simple. Don't add unnecessary code just for the looks of it, or because you think you might use that part of the program at some point.

What was the "raw bytes format" that random.org used to have

The "raw bytes module" on random.org is disabled (http://www.random.org/bytes/), and I need to know the output format they used to have.
More specificly: I have a program using 1024 raw hex bytes from random.org and I need to implement this feature.
What was the format they used to have? Possible formats I can think of are:
0x1a 0x1b
#1a #1b
1a 1b
1a1b
.......
There are many possibilities... Who can help me?
You can use http://archive.org/web/ to retrieve an old web page that doesn't exist anymore.
For example: http://web.archive.org/web/*/http://www.random.org/bytes/ (sorry, but there is a silly problem with the * in the URL even when it is percent-encoded, so copy and paste it rather than clicking on it)
Then pick a snapshot from a date that looks right.
If it's not the one you need, go back and pick another snapshot.
For example, I chose 2014/03/01 and voilà.
If you're lucky you can even use the form on the archived page: here's the result with the default params.
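If you would rather script it than click around, the Wayback Machine also has an availability API that returns the snapshot closest to a given date; a small sketch (the timestamp is just an example):

    # Ask the Wayback Machine for the snapshot of the random.org bytes page
    # closest to March 2014 and print its URL.
    import json
    import urllib.request

    api = ("https://archive.org/wayback/available"
           "?url=http://www.random.org/bytes/&timestamp=20140301")
    with urllib.request.urlopen(api) as resp:
        info = json.load(resp)

    closest = info.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        print("closest snapshot:", closest["url"], "from", closest["timestamp"])
    else:
        print("no snapshot found")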

In-depth understanding of binary files

I am learning C++, specifically binary file structure and manipulation, and since I am totally new to the subject of binary files, bits, bytes, and hexadecimal numbers, I decided to take a step back and establish a solid understanding of the basics.
As an experiment, I wrote two words ("blue thief") into a .txt file.
The reason for this is that when I open the file in a hex editor, I want to understand how the information is really stored as hex. Don't get me wrong, I am not trying to make a living out of reading hex dumps all day, only to reach a minimum level of understanding of a binary file's composition. I also know that different file types have different structures, but just for the sake of understanding I wanted to know exactly how the words "blue thief" and a single ' ' (space) were converted into those byte values.
One more thing: I have heard that binary files contain three kinds of information: a header, a format ("fmt") section, and the data. Does that only apply to multimedia files like audio and video? Because I can't see anything in this file other than what looks like the data itself.
The characters in your text file are encoded in a Windows extension of ASCII: one byte for each character that you see in Notepad. What you see is what you get.
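You can reproduce what the hex editor shows with a couple of lines (a sketch; the file name is made up):

    # Write the same two words to a file, then dump each byte as hex next to
    # the character it encodes; the values match what a hex editor displays.
    with open("blue_thief.txt", "w", encoding="ascii") as f:
        f.write("blue thief")

    with open("blue_thief.txt", "rb") as f:
        for byte in f.read():
            print(f"{byte:02x}  {chr(byte)!r}")
    # prints 62 'b', 6c 'l', 75 'u', 65 'e', 20 ' ', 74 't', 68 'h', 69 'i',
    # 65 'e', 66 'f' (one pair per line)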
Generally, a hard distinction is made between text and binary files on Windows systems. On Unix/Linux systems, the distinction is fuzzier... you could argue that there is no distinction, in fact.
On Windows systems, the distinction is enforced by file extensions. All files with the extension ".TXT" are assumed to be text files (i.e., to contain only hex codes that represent visible onscreen characters, where "visible" includes whitespace).
Binary files are a whole different kettle of fish. Most, as you mention, include some sort of header describing how the data that follows is encoded. These headers can vary tremendously in size depending on the type of data (again, assumed to be indicated by the extension on Windows systems, as well as on Unix). A simple example is the WAV format for uncompressed audio. If you open a WAV file in your hex editor, you'll see that the first four bytes are "RIFF": this is a marker, often called a "magic number" even though it is readable as text, indicating that the contents are an audio file. Newer versions of the WAV specification have complicated this somewhat, but originally the WAV header was just the "RIFF" tag plus a dozen or so bytes indicating the sample rate of the data that follows. (You can see this by comparing the raw data in a track on an audio CD to the WAV file created by ripping an uncompressed copy of that track at 44.1 kHz: the data should be the same, with just a header section added at the start of the WAV file.)
Executable files (compiled programs) are a special type of binary file, but they follow roughly the same scheme of a header followed by data in a prescribed format. In this case, though, the "data" is executable machine code, and the header indicates, among other things, what operating system the file runs on. (For example, most Linux executables begin with the characters "ELF".)
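A quick way to see such markers for yourself (a sketch covering the signatures mentioned above, plus the MZ signature of Windows executables):

    # Sniff the first few bytes of a file and report a familiar magic number:
    # "RIFF" for WAV/RIFF containers, 0x7f "ELF" for Linux executables, and
    # "MZ" for Windows executables.
    import sys

    with open(sys.argv[1], "rb") as f:
        head = f.read(4)

    if head[:4] == b"RIFF":
        print("RIFF container (e.g. a WAV file)")
    elif head[:4] == b"\x7fELF":
        print("Linux ELF executable")
    elif head[:2] == b"MZ":
        print("Windows MZ/PE executable")
    else:
        print("no familiar magic number:", head.hex(" "))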

File size always remains same as 12288 bytes for small files

I am using Wp7IsolatedStorageExplorer to retrieve some small files (say, 20 bytes at most) stored in IsolatedStorage. But every time I download a file, the file size comes out as 12288 bytes (for small files). Is IsolatedStorageExplorer appending something at the end, or is this how small files are stored in IsolatedStorage by default?
Initially I thought this might be an indication of the underlying FAT implementation.
However, having looked a bit deeper, and having read your answers to the comments, my guess is that this is just a UI issue in IsolatedStorageExplorer: if you look at the source (http://wp7explorer.codeplex.com/SourceControl/changeset/view/63791#1114123), it seems to use 12288 as the chunk size for its networking layer.

Determining end of JPEG (when merged with another file)

I'm creating a program that "hides" an encrypted file at the end of a JPEG. The problem is that, when retrieving this encrypted file again, I need to be able to determine where the JPEG it was stored in ends. At first I thought this wouldn't be a problem, because I could just run through the file checking for 0xFF and 0xD9, the bytes JPEG uses to end the image. However, I'm noticing that in quite a few JPEGs this combination of bytes is not exclusive to the end, so my program thinks the image has ended partway through it.
I'm thinking there must be a definitive way of marking that a JPEG has finished, otherwise appending a load of bytes to the end of the file would obviously corrupt it... Is there a practical way to do this?
You should read the JFIF file format specification.
Well, there are always two places in the file that you can find with 100% reliability: the beginning and the end. So, when you add the hidden file, also append 4 bytes that store the original length of the JPEG plus a special signature that is always distinct. When reading it back, first seek to (end - 8) and read the length and the signature, then just seek to that position.
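A minimal sketch of that approach (the 4-byte signature is made up; pick anything unlikely to occur by accident):

    import struct

    MAGIC = b"HIDE"  # hypothetical signature

    def embed(jpeg_path, payload, out_path):
        # Append the payload, then an 8-byte trailer: original JPEG length + signature.
        with open(jpeg_path, "rb") as f:
            jpeg = f.read()
        with open(out_path, "wb") as f:
            f.write(jpeg)
            f.write(payload)
            f.write(struct.pack("<I", len(jpeg)))
            f.write(MAGIC)

    def extract(path):
        # Read the trailer from the end to find where the JPEG stops.
        with open(path, "rb") as f:
            data = f.read()
        if data[-4:] != MAGIC:
            raise ValueError("no hidden payload found")
        jpeg_len = struct.unpack("<I", data[-8:-4])[0]
        return data[jpeg_len:-8]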
You should read my answer on this question "Detect Eof for JPG images".
You're likely running into the thumbnail in the header. When moving through the file you should find that most marker segments contain a length indicator (there's a reference for which do and which don't); you can skip the bytes within those segments, as the true EOI marker will not be inside them.
Within the actual JPEG compressed data, any FF byte should be followed either by 00 (the zero byte is then discarded) or by FE to mark a comment (which has a length indicator and can be skipped as described above).
Theoretically, the only way to encounter a false EOI reading in the compressed data is within such a comment.
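Here is a sketch of that scanning approach for finding the true EOI offset; it handles the common cases (length-prefixed segments, byte stuffing, restart markers) but is not a full JPEG parser:

    def find_jpeg_end(path):
        # Return the offset just past the real EOI marker (FF D9), skipping
        # length-prefixed segments so a thumbnail's FF D9 inside a header
        # segment is not mistaken for the end of the image.
        with open(path, "rb") as f:
            data = f.read()
        if data[:2] != b"\xff\xd8":
            raise ValueError("not a JPEG (missing SOI)")
        i = 2
        while i < len(data) - 1:
            if data[i] != 0xFF:
                i += 1                          # inside entropy-coded data
                continue
            marker = data[i + 1]
            if marker == 0xD9:                  # EOI: the real end of the image
                return i + 2
            if marker == 0xFF:                  # fill byte, resync on the next FF
                i += 1
                continue
            if marker == 0x00 or 0xD0 <= marker <= 0xD7:
                i += 2                          # stuffed byte or restart marker
                continue
            # every other marker segment carries a 2-byte big-endian length
            seg_len = int.from_bytes(data[i + 2:i + 4], "big")
            i += 2 + seg_len
        raise ValueError("no EOI marker found")

Calling find_jpeg_end on the combined file gives the offset where the appended data begins, without being fooled by an FF D9 that happens to sit inside a length-prefixed header segment.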
