Is there a performance rationale for mixing a binary file's endianness?

I'm writing a parser for the most common geographic data storage type, a collection of files called a "shapefile". This is my first project where I've had to think about endianness.
It turns out that the geometry storage is mixed endian; some parts of the file are big endian, but most of it is little endian. The shapefile standard is described here.
Is there a discernible performance rationale, or was it simply born out of historical context? If so, do you happen to know what that historical context is?
The integers and double-precision integers that make up the data description fields in the file header (identified below) and record contents in the main file are in little endian (PC or Intel®) byte order. The integers and double-precision floating point numbers that make up the rest of the file and file management are in big endian (Sun® or Motorola®) byte order.

While there doesn't seem to be a definitive answer, what I've seen is a mixture of "confusion while trying to create a format that works on all platforms" and "a lot of poorly designed formats date from that era". More info here: https://gis.stackexchange.com/questions/18969/oddities-in-the-shapefile-technical-specification
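Regardless of the rationale, handling the mix in a parser is mechanical: you just switch byte order per field. Below is a minimal sketch in Python's standard struct module that reads the 100-byte .shp main file header, with field offsets taken from the ESRI whitepaper quoted above; a real parser would of course validate the file code and version.

```python
import struct

def read_shp_header(path):
    """Read the 100-byte main file header of a .shp file.

    A single header mixes byte orders: the file code and file length
    are big endian, while the version, shape type and bounding box
    are little endian.
    """
    with open(path, "rb") as f:
        header = f.read(100)
    file_code, = struct.unpack(">i", header[0:4])        # big endian, always 9994
    file_length, = struct.unpack(">i", header[24:28])     # big endian, in 16-bit words
    version, shape_type = struct.unpack("<2i", header[28:36])     # little endian
    xmin, ymin, xmax, ymax = struct.unpack("<4d", header[36:68])  # little endian doubles
    return file_code, file_length, version, shape_type, (xmin, ymin, xmax, ymax)
```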

Related

Comparison of protobuf and arrow

Both are language-neutral and platform-neutral data exchange libraries. I wonder what the differences between them are and which library is suited to which situations.
They are intended for two different problems. Protobuf is designed to create a common "on the wire" or "disk" format for data.
Arrow is designed to create a common "in memory" format for the data.
Of course, the next question is: what does this mean?
In Protobuf, if an application wants to work with the data, it first deserializes the data into some kind of "in memory" representation. This must be done because the Protobuf format is not easily compatible with CPU instructions. For example, Protobuf packs unsigned integers into varints. These have a variable number of bytes, and each field is prefixed by a key that is itself a varint whose 3 least significant bits hold the wire type. You cannot take two unsigned integers and just add them without first converting them to some kind of "in memory" representation.
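To make the varint point concrete, here is a minimal sketch of base-128 varint encoding and decoding (the helper names are invented for illustration; the protobuf runtime ships its own optimized versions, and signed fields additionally go through ZigZag encoding, which is omitted here):

```python
def encode_varint(value: int) -> bytes:
    """Encode an unsigned integer as a base-128 varint, least significant group first."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    """Decode a base-128 varint back into an unsigned integer."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:
            break
    return result

assert encode_varint(300) == b"\xac\x02"   # the classic example from the protobuf docs
assert decode_varint(b"\xac\x02") == 300
```

The encoded length depends on the value, which is exactly why you cannot do arithmetic on the wire bytes directly.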
Now, protoc does have libraries for every language that convert to an "in memory" representation for that language. However, this "in memory" representation is not common. You cannot take a Protobuf message, deserialize it into C# (using protoc-generated code) and then operate on those in-memory bytes from Java without doing some kind of C#->Java marshalling of the data.
Arrow, on the other hand, solves this problem. If you have an Arrow table in C#, you can map that memory into a different language and start working on it without any kind of language-to-language marshalling of the data. This zero-copy sharing allows for efficient hand-off between languages. Python has been employing tricks like this (e.g. the array protocol) for a while now and it works great for data analysis.
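As a tiny illustration of the zero-copy idea (a sketch that assumes the pyarrow package; Array.to_numpy(zero_copy_only=True) raises rather than silently copying if a copy would be required):

```python
import pyarrow as pa

# Build an Arrow array; its contents live in Arrow's standard columnar buffers.
arr = pa.array([1.0, 2.0, 3.0, 4.0])

# Get a NumPy view over those same buffers: no bytes are copied or marshalled.
view = arr.to_numpy(zero_copy_only=True)
print(view)   # [1. 2. 3. 4.]
```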
However, Arrow is not always the greatest format for over-the-wire transmission because it can be inefficient. Those varints I mentioned before help Protobuf cut down on message size. Also, Protobuf tags each field so it can save space when there are many optional fields. In fact, Arrow uses Protobuf & gRPC for over-the-wire transmission of metadata in Arrow Flight (an RPC framework).

Algorithm to compress a lot of small strings?

I am looking for an algorithm to compress small ASCII strings. They contain mostly letters, but can also contain numbers and, rarely, special characters. They will be small, about 50-100 bytes on average, 250 max.
Examples:
Android show EditText.setError() above the EditText and not below it
ImageView CENTER_CROP dont work
Prevent an app to show on recent application list on android kitkat 4.4.2
Image can't save validable in android
Android 4.4 SMS - Not receiving sentIntents
Imported android-map-extensions version 2.0 now my R.java file is missing
GCM registering but not receiving messages on pre 4.0.4. devices
I want to compress the titles one by one, not many titles together and I don't care much about CPU and memory usage.
You can use Huffman coding with a shared Huffman tree among all texts you want to compress.
While you would typically construct a separate Huffman tree for each string to be compressed, that would require a lot of storage overhead, which should be avoided here. That is also the major problem with using a standard compression scheme in your case: most of them carry some overhead that kills your compression efficiency for very short strings, and those that don't have a (big) overhead are typically less efficient in general.
When constructing a Huffman tree which is later used for compression and decompression, you typically use the texts which will be compressed to decide which character is encoded with which bits. Since in your case the texts to be compressed seem to be unknown in advance, you need some "pseudo" texts to build the tree, maybe from a dictionary of the target language or from previously observed user data.
Then construct the Huffman tree and store it once in your application; either hardcode it into the binary or provide it in the form of a file. Then you can compress and decompress any texts using this tree. Whenever you decide to change the tree since you gain better experience on which texts are compressed, the compressed string representation also changes. It might be a good idea to introduce versioning and store the tree version together with each string you compress.
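A minimal sketch of the shared-tree idea in Python follows (the function names are invented for illustration; a real implementation would also serialize and version the table as described above, handle characters missing from the reference text, and pack the bits into bytes instead of keeping a '0'/'1' string):

```python
import heapq
from collections import Counter
from itertools import count

def build_code(reference_text: str) -> dict:
    """Build a shared Huffman code table from representative "pseudo" texts."""
    tie = count()  # tie-breaker so heapq never has to compare dicts
    heap = [(freq, next(tie), {ch: ""}) for ch, freq in Counter(reference_text).items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in c1.items()}
        merged.update({ch: "1" + code for ch, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie), merged))
    return heap[0][2]

def encode(text: str, code: dict) -> str:
    return "".join(code[ch] for ch in text)

def decode(bits: str, code: dict) -> str:
    reverse = {v: k for k, v in code.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in reverse:   # Huffman codes are prefix-free, so greedy matching works
            out.append(reverse[buf])
            buf = ""
    return "".join(out)

# The table is built once from sample data and shipped with the application.
table = build_code("Android show EditText setError above the EditText and not below it")
compressed = encode("Android not below", table)
assert decode(compressed, table) == "Android not below"
```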
Another improvement you might think about is to use multi-character Huffman encoding. Instead of compressing the texts character by character, you could find frequent syllables or words and put them into the tree too; then they require even less bits in the compressed string. This however requires a little bit more complicated compression algorithm, but it might be well worth the effort.
To process a string of bits in the compression and decompression routine in C++(*), I recommend either boost::dynamic_bitset or std::vector<bool>. Both internally pack multiple bits into bytes.
(*) The question once had the c++ tag, so the OP obviously wanted to implement it in C++. Since the general problem is not specific to a programming language, the tag was removed, but I kept the C++-specific part of the answer.

Efficient way to encode bit-vectors?

I'm currently using run-length encoding for encoding bit-vectors, and the current run time is 2log(i), where i is the size of the run. Is there another way of doing it to bring it down to log(i)?
Thanks.
The most efficient way of encoding a bit vector is to exploit any specific properties of the bit source. If it is totally random, there is no real noticeable gain (in fact, a totally random stream of bits cannot be compressed in any way).
If you can find properties in your bit stream, you could try to define a collection of vectors that form the basis of a vector space. In that case, the result will be very efficient.
We'll need a few more details on your bit stream.
(Edit)
Just a few more details to understand the previous statement:
"a totally random stream of bits cannot be compressed in any way"
It is not possible to compress a totally random vector of bits if by "compress" we mean the "transformed/compressed stream" plus the "basis definition" plus the decompression program. But in most cases the decompression program (and often the basis too) is embedded in the client software. Thus, only the "compressed stream" is needed.
A good explanation (and funny story) about this is Patrick Craig's $5000 compression challenge.
More scientific: information theory, especially the section on entropy.
And, finally, the full story.
But whatever the solution is, if you have an unknown number of unknown streams to compress, you won't be able to do anything. You have to find a pattern.

Papers on fast validation of UTF-8

Are there any papers on state-of-the-art UTF-8 validators/decoders? I've seen implementations "in the wild" that use clever loops to process up to 8 bytes per iteration in common cases (e.g. all 7-bit ASCII input).
I don't know about papers; it's probably too specific and narrow a subject for strictly scientific analysis, and is rather an engineering problem. You can start by looking at how this is handled in different libraries. Some solutions use language-specific tricks while others are very general. For Java, you can start with the code of UTF8ByteBufferReader, a part of Javolution. I have found it to be much faster than the character set converters built into the language. I believe (but I'm not sure) that the latter use a common piece of code for many encodings plus encoding-specific data files, whereas Javolution has code designed specifically for UTF-8.
There are also techniques for specific tasks. For example, if you only need to calculate how many bytes a UTF-8 character takes as you parse the text, you can use a table of 256 values indexed by the first byte of the UTF-8 encoded character; skipping over characters or calculating a string's length in characters this way is much faster than using bit operations and conditionals.
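A minimal sketch of that 256-entry table in Python (illustrative only: it looks at lead bytes and does not validate continuation bytes or reject overlong sequences):

```python
# One entry per possible lead byte; 0 marks bytes that cannot start a character
# (continuation bytes 0x80-0xBF and the invalid range 0xF8-0xFF).
UTF8_CHAR_LEN = [0] * 256
for b in range(0x00, 0x80): UTF8_CHAR_LEN[b] = 1   # 0xxxxxxx: ASCII
for b in range(0xC0, 0xE0): UTF8_CHAR_LEN[b] = 2   # 110xxxxx
for b in range(0xE0, 0xF0): UTF8_CHAR_LEN[b] = 3   # 1110xxxx
for b in range(0xF0, 0xF8): UTF8_CHAR_LEN[b] = 4   # 11110xxx

def char_count(data: bytes) -> int:
    """Count characters by hopping from lead byte to lead byte."""
    i = n = 0
    while i < len(data):
        step = UTF8_CHAR_LEN[data[i]]
        if step == 0:
            raise ValueError("invalid lead byte at offset %d" % i)
        i += step
        n += 1
    return n

assert char_count("héllo €".encode("utf-8")) == 7
```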
For some situations, e.g. if you can spare some memory and you know that most characters you encounter will be from the Basic Multilingual Plane, you could try even more aggressive lookup tables: first calculate the length in bytes by the method described above and, if it's 1 or 2 bytes (maybe 3 makes sense too), look up the decoded char in a table. Remember, however, to benchmark this and any other algorithm you try, as it need not be faster at all (bit operations are quite fast, and with a big lookup table you lose locality of reference, plus the offset calculation isn't completely free either).
Anyway, I suggest you start by looking at the Javolution code or another similar library.

How does PDF417 barcode decoding recover from damaged labels?

I recently learned about PDF417 barcodes and I was astonished that I can still read the barcode after I ripped it in half and scanned only a fragment of the original label.
How can the barcode decoding be that robust? Which (types of) algorithms are used during encoding and decoding?
EDIT: I understand the general philosophy of introducing redundancy to create robustness, but I'm interested in more details, i.e. how this is done with PDF417.
The PDF417 format allows for varying levels of duplication/redundancy in its content. The level of redundancy used affects how much of the barcode can be obscured or removed while still leaving the contents readable.
PDF417 by itself does not use anything in particular; it is a specification for encoding data.
I think there is a confusion between the barcode format and the data it conveys.
The various barcode formats (PDF417, Aztec, DataMatrix) specify a way to encode data, be it numerical, alphabetic or binary... the exact content though is left unspecified.
From what I have seen, Reed-Solomon is often the algorithm used for redundancy. The exact level of redundancy is up to you with this algorithm, and from what I've dealt with there are libraries for it in at least Java and C.
Now, it is up to you to specify what the exact content of your barcode should be, including the algorithm used for redundancy and the parameters used by this algorithm. And of course you'll need to work hand in hand with those who are going to decode it :)
Note: QR codes seem slightly different, with explicit zones for redundancy data.
I don't know PDF417 specifically. I know that QR codes use Reed-Solomon error correction, which is an oversampling technique. To get the concept: suppose you have a polynomial of degree 6. Technically, you need seven points to describe this polynomial uniquely, so you can perfectly transmit the information about the whole polynomial with just seven points. However, if one of those seven is corrupted, you lose the information entirely. To work around this issue, you evaluate the polynomial at a larger number of points and transmit those as well. As long as at least seven of the bunch arrive intact, that is enough to reconstruct your original information.
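Here is a small worked example of that oversampling idea in Python, using exact Lagrange interpolation over the rationals. This only illustrates the polynomial intuition; real Reed-Solomon codes do the same thing with finite-field arithmetic and can also locate which symbols are corrupted, not just rebuild missing ones.

```python
from fractions import Fraction

def poly_eval(coeffs, x):
    """Evaluate a polynomial given as a list of coefficients, lowest degree first."""
    return sum(c * x**i for i, c in enumerate(coeffs))

def lagrange_value(points, x):
    """Reconstruct the unique polynomial through `points` and evaluate it at x."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# A degree-6 "message" polynomial: its 7 coefficients carry the payload.
coeffs = [3, -1, 4, 1, -5, 9, 2]

# Oversample: transmit 10 points even though 7 determine the polynomial.
sent = [(x, poly_eval(coeffs, x)) for x in range(10)]

# Damage: three samples are lost in transit; any 7 survivors are enough.
survivors = [sent[i] for i in (0, 2, 3, 5, 6, 8, 9)]

# The receiver can still recover every original sample, e.g. the lost one at x = 1.
assert lagrange_value(survivors, 1) == poly_eval(coeffs, 1)
```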
In other words, you trade space for robustness, by introducing more and more redundancy. Nothing new here.
I do not think the trade-off between space and robustness is any different here than anywhere else. Think RAID, let's say RAID 5: you can yank a disk out of the array and the data is still available. The price? An extra disk. In terms of the barcode, the price is the extra space the label occupies.

Resources