Does choice of endianness have an impact on file size in UTF-16, and why?

Does choice of endianness have an impact on file size in UTF-16, and why? I can't figure out how this could be the case, but wanted to be sure. Thanks

Related

Are there any advantages to network byte order in a new protocol?

(I know many people are going to be tempted to close this question; please don't; I'm asking for concrete technical answers, if any exist.)
"Network byte order" is big-endian for reasons that cannot be asked on stackoverflow. Lots of old protocols use that order and can't be changed but I wonder if there are any technical reasons to choose big endian for a new protocol.
I would think little endian is better, because 99.99% of processors in use are little endian (ARM can technically do both, but in reality it is always set to little endian). So I was surprised to see that CBOR, a relatively recent protocol, uses big endian. Is there an advantage that I haven't thought of?
It boils down to human factors: It is easier to read a multi-byte integer in a hex dump if it is encoded with the most significant byte(s) first. For example, the CBOR representation of 0x1234 (4,660) is the byte sequence 19 12 34. If you are looking for the value 0x1234, it is easier to spot it that way.
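To make the hex-dump point concrete, here is a minimal sketch (not a full CBOR encoder, just the 16-bit unsigned-integer case): the 0x19 marker byte is followed by the value most significant byte first, which is exactly the order a hex dump displays.

```cpp
#include <cstdint>
#include <cstdio>

// Encode a 16-bit unsigned integer the way CBOR does: a 0x19 marker byte
// ("uint16 follows") and then the value most significant byte first.
int main() {
    std::uint16_t value = 0x1234;                  // 4,660
    unsigned char encoded[3] = {
        0x19,                                      // CBOR marker for a 16-bit unsigned int
        static_cast<unsigned char>(value >> 8),    // most significant byte first
        static_cast<unsigned char>(value & 0xFF)   // least significant byte last
    };
    for (unsigned char b : encoded)
        std::printf("%02x ", b);                   // prints: 19 12 34
    std::printf("\n");
}
```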
TL;DR:
I've been in the field for over 40 years now, so there's a lot of history behind this. Even the definition of a "byte" has changed over that many years, so this may take a bit of an open mind to understand how this evolved.
Dumps of binary information weren't always in bytes, nor in hexadecimal. On the PDP-11 (with 16-bit words and 8-bit bytes), for example, word-wide dumps in octal notation were common. This was useful because of the machine architecture, which included 8 registers and 8 addressing modes, so machine-language dumps in octal were easier to decode than hex.

The reason behind endianness?

I was wondering why some architectures use little-endian and others big-endian. I remember reading somewhere that it has to do with performance; however, I don't understand how endianness can influence it. Also, I know that:
The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses.
Which seems like a nice feature, but even so, many systems use big-endian, which probably means big-endian has some advantages too (if so, which?).
I'm sure there's more to it, most probably digging down to the hardware level. Would love to know the details.
I've looked around the net a bit for more information on this question, and there is quite a range of answers and reasonings to explain why big- or little-endian ordering may be preferable. I'll do my best to explain here what I found:
Little-endian
The obvious advantage of little-endianness is the one you already mentioned in your question: the fact that a given value can be read from the same memory address at different widths. As the Wikipedia article on the topic states:
Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers.
Because of this, multiple-precision math routines are easier to write, because the byte significance always corresponds to the memory address, whereas with big-endian numbers this is not the case. This seems to be the argument for little-endianness that is quoted over and over again; because of its prevalence, I would have to assume that the benefits of this ordering are relatively significant.
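As a minimal sketch of that property (purely illustrative, not taken from any of the answers above): on a little-endian machine the least significant byte sits at the lowest address, so narrower reads of the same address still yield a small value unchanged.

```cpp
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
    std::uint32_t value = 0x42;                    // small value stored in a 32-bit slot
    unsigned char bytes[4];
    std::memcpy(bytes, &value, sizeof value);      // raw in-memory byte layout

    // Read the same starting address back at 8-bit and 16-bit widths.
    std::uint8_t  as8;
    std::uint16_t as16;
    std::memcpy(&as8,  bytes, sizeof as8);
    std::memcpy(&as16, bytes, sizeof as16);

    // On a little-endian machine all three prints show 42 (hex);
    // on a big-endian machine the narrower reads would show 0.
    std::cout << std::hex << static_cast<unsigned>(as8) << ' '
              << as16 << ' ' << value << '\n';
}
```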
Another interesting explanation that I found concerns addition and subtraction. When adding or subtracting multi-byte numbers, the least significant byte must be fetched first to see whether there is a carry into the more significant bytes. Because the least significant byte is read first in little-endian numbers, the system can parallelize: it can begin the calculation on this byte while fetching the following byte(s).
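To illustrate the carry argument (a rough sketch with a made-up function name): multi-word addition naturally walks from the least significant word upward, which is exactly the order in which a little-endian layout stores the words in memory.

```cpp
#include <cstddef>
#include <cstdint>

// Add two multi-word integers stored least-significant word first
// (the natural in-memory order on a little-endian machine).
// Returns the carry out of the most significant word.
std::uint32_t add_multiword(const std::uint32_t* a, const std::uint32_t* b,
                            std::uint32_t* sum, std::size_t words) {
    std::uint64_t carry = 0;
    for (std::size_t i = 0; i < words; ++i) {      // LSW first: the carry only flows upward
        std::uint64_t t = std::uint64_t(a[i]) + b[i] + carry;
        sum[i] = static_cast<std::uint32_t>(t);
        carry  = t >> 32;
    }
    return static_cast<std::uint32_t>(carry);
}
```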
Big-endian
Going back to the Wikipedia article, the stated advantage of big-endian numbers is that the size of the number can be more easily estimated, because the most significant digit comes first. Related to this is the fact that it is simple to tell whether a number is positive or negative by examining the sign bit in the byte at offset 0 (the first, most significant byte).
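A small sketch of that sign-check point (illustrative only): with a big-endian byte order the sign bit of a 32-bit two's-complement integer lives in the byte at offset 0, so one look at the first byte settles the question.

```cpp
#include <cstdint>
#include <iostream>

int main() {
    std::int32_t value = -5;
    std::uint32_t u = static_cast<std::uint32_t>(value);

    // Build the big-endian byte layout by hand for illustration.
    unsigned char bigEndian[4] = {
        static_cast<unsigned char>((u >> 24) & 0xFF),  // offset 0: the sign bit lives here
        static_cast<unsigned char>((u >> 16) & 0xFF),
        static_cast<unsigned char>((u >>  8) & 0xFF),
        static_cast<unsigned char>( u        & 0xFF)
    };

    bool negative = (bigEndian[0] & 0x80) != 0;        // top bit of the first byte
    std::cout << (negative ? "negative" : "non-negative") << '\n';
}
```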
What is also stated when discussing the benefits of big-endianness is that the bytes are ordered the way most people order base-10 digits: most significant first. This is said to be advantageous performance-wise when converting from binary to decimal.
While all these arguments are interesting (at least I think so), their applicability to modern processors is another matter. In particular, the addition/subtraction argument was most valid on 8 bit systems...
For my money, little-endianness seems to make the most sense, and it is by far the most common ordering when you look at all the devices that use it. I think the reason big-endianness is still used is more a matter of legacy than of performance. Perhaps at one time the designers of a given architecture decided that big-endianness was preferable to little-endianness, and as the architecture evolved over the years the endianness stayed the same.
The parallel I draw here is with JPEG, which is a big-endian format despite the fact that virtually all the machines that consume it are little-endian. While one can ask what the benefits of JPEG being big-endian are, I would venture to say that for all intents and purposes the performance arguments mentioned above don't make a shred of difference. The fact is that JPEG was designed that way, and as long as it remains in use, that way it shall stay.
I would assume that it was once the hardware designers of the first processors who decided which endianness would best integrate with their preferred/existing/planned micro-architecture for the chips they were developing from scratch.
Once established, and for compatibility reasons, the endianness was more or less carried on to later generations of hardware, which would support the 'legacy' argument for why both kinds still exist today.

Detect encoding of a string in C/C++

Given a string in the form of a pointer to an array of bytes (chars), how can I detect the encoding of the string in C/C++ (I use Visual Studio 2008)? I did a search, but most of the samples are in C#.
Thanks
Assuming you know the length of the input array, you can make the following guesses (a rough sketch of these checks follows the list):
First, check whether the first few bytes match any well-known Unicode byte order marks (BOMs). If they do, you're done!
Next, search for '\0' before the last byte. If you find one, you might be dealing with UTF-16 or UTF-32. If you find multiple consecutive '\0's, it's probably UTF-32.
If any character is from 0x80 to 0xff, it's certainly not ASCII or UTF-7. If you are restricting your input to some variant of Unicode, you can assume it's UTF-8. Otherwise, you have to do some guessing to determine which multi-byte character set it is. That will not be fun.
At this point it is either: ASCII, UTF-7, Base64, or ranges of UTF-16 or UTF-32 that just happen to not use the top bit and do not have any null characters.
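Here is a minimal sketch of those checks in C++ (the helper name and the returned labels are made up for illustration; a real detector would also validate UTF-8 sequences and distinguish 16-bit from 32-bit byte orders):

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper: a rough first pass over a byte buffer, following the
// steps above (BOM check, then embedded NUL scan, then high-bit scan).
std::string guess_encoding(const unsigned char* data, std::size_t len) {
    // 1. Byte order marks (check the longer UTF-32 marks before UTF-16).
    if (len >= 3 && std::memcmp(data, "\xEF\xBB\xBF", 3) == 0) return "UTF-8 (BOM)";
    if (len >= 4 && std::memcmp(data, "\xFF\xFE\x00\x00", 4) == 0) return "UTF-32LE (BOM)";
    if (len >= 4 && std::memcmp(data, "\x00\x00\xFE\xFF", 4) == 0) return "UTF-32BE (BOM)";
    if (len >= 2 && std::memcmp(data, "\xFF\xFE", 2) == 0) return "UTF-16LE (BOM)";
    if (len >= 2 && std::memcmp(data, "\xFE\xFF", 2) == 0) return "UTF-16BE (BOM)";

    // 2. Embedded NUL bytes suggest a wide (UTF-16/UTF-32) encoding.
    bool sawNul = false, sawDoubleNul = false, sawHighBit = false;
    for (std::size_t i = 0; i + 1 < len; ++i) {
        if (data[i] == 0x00) {
            sawNul = true;
            if (data[i + 1] == 0x00) sawDoubleNul = true;
        }
        if (data[i] >= 0x80) sawHighBit = true;
    }
    if (len && data[len - 1] >= 0x80) sawHighBit = true;
    if (sawDoubleNul) return "probably UTF-32 (no BOM)";
    if (sawNul)       return "probably UTF-16 (no BOM)";

    // 3. High bytes rule out ASCII and UTF-7; assume UTF-8 if the input is Unicode-only.
    if (sawHighBit) return "possibly UTF-8 (or some other multi-byte charset)";
    return "ASCII-compatible (ASCII, UTF-7, or BOM-less UTF-16/32 without NULs or high bytes)";
}
```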
It's not an easy problem to solve, and it generally relies on heuristics to take a best guess at what the input encoding is, which can be tripped up by relatively innocuous inputs. For example, take a look at this Wikipedia article and The Notepad file encoding Redux for more details.
If you're looking for a Windows-only solution with minimal dependencies, you can look at using a combination of IsTextUnicode and MLang's DetectInputCodePage to attempt character set detection.
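As a hedged sketch of the IsTextUnicode half (Windows-only; the MLang/DetectInputCodepage part needs COM setup and is omitted here):

```cpp
#include <windows.h>
#include <iostream>

int main() {
    // UTF-16LE "Hello" with a leading BOM, hard-coded as a test buffer.
    const char buffer[] = "\xFF\xFEH\0e\0l\0l\0o\0";
    int flags = IS_TEXT_UNICODE_UNICODE_MASK;           // ask for the standard Unicode tests
    BOOL looksUnicode = IsTextUnicode(buffer, sizeof(buffer) - 1, &flags);
    std::cout << (looksUnicode ? "looks like UTF-16" : "not obviously UTF-16") << '\n';
}
```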
If you are looking for portability, but don't mind taking on a fairly large dependency in the form of ICU, then you can make use of its character set detection routines to achieve the same thing in a portable manner.
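A short sketch of what that looks like with ICU4C's ucsdet API (assuming the usual unicode/ucsdet.h header; error handling kept minimal):

```cpp
#include <unicode/ucsdet.h>
#include <iostream>

int main() {
    const char sample[] = "Some text whose encoding we want to guess";
    UErrorCode status = U_ZERO_ERROR;

    UCharsetDetector* det = ucsdet_open(&status);
    ucsdet_setText(det, sample, sizeof(sample) - 1, &status);
    const UCharsetMatch* match = ucsdet_detect(det, &status);   // best single guess

    if (U_SUCCESS(status) && match != nullptr) {
        std::cout << "Detected: " << ucsdet_getName(match, &status)
                  << " (confidence " << ucsdet_getConfidence(match, &status) << ")\n";
    }
    ucsdet_close(det);
}
```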
I have written a small C++ library for detecting text file encoding. It uses Qt, but it could just as easily be implemented using only the standard library.
It works by measuring symbol occurrence statistics and comparing them to pre-computed reference values for different encodings and languages. As a result, it detects not only the encoding but also the language of the text. The downside is that pre-computed statistics must be provided for the target language in order for that language to be detected properly.
https://github.com/VioletGiraffe/text-encoding-detector

Difference between "ASCII" and "Binary" formats in PETSc

I wanted to know what the difference is between binary format and ASCII format. The thing is, I need to use PETSc to do some matrix manipulations, and all my matrices are stored in text files.
PETSc has a different set of rules for dealing with each of these formats. I don't know what these formats are, let alone which format my text file is in.
Is there a way to convert one format to another?
This is an elementary question; a detailed answer will really help me in understanding this.
To answer your direct question, the difference between ASCII and binary is semantics.
ASCII is binary interpreted as text. Only a small subset of byte values can be interpreted as intelligible characters (decimal 32 to 126); everything else is either a special character (such as a line feed or a system bell) or something else entirely. Byte values above that range can be letters in other alphabets, depending on the code page.
You can interpret general binary data as ASCII format, but if it's not ASCII text it may not mean anything to you.
As a general rule of thumb, if you open your file in a text editor (such as Notepad, not Microsoft Word) and it seems to consist entirely, or at least primarily, of letters, numbers, and spaces, then your file can probably be safely interpreted as ASCII. If you open it in a text editor and it looks like noise, it probably needs to be interpreted as raw binary.
I am not very familiar with the program you're asking about, but were I in your situation I would consult its documentation to figure out what format the "binary" data stream is supposed to be in. There should be a detailed description, or an included utility for generating your binary data. If you generated the data yourself, it's probably in ASCII format.
If your matrices are in text files and your program only reads from binary files, you are probably out of luck.
Binary formats are just the raw bytes of whatever data structure it uses internally (or a serialization format).
You have little hope of turning text into binary without the help of the program itself.
Look for an import format if the program has one.
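For PETSc specifically, one common approach is to write a small converter yourself: fill a Mat from your text data and write it out through a binary viewer, which MatLoad() can read back later. A rough sketch under that assumption (the file name and matrix contents are made up, and the text-parsing loop and error checking are omitted):

```cpp
#include <petscmat.h>

// Sketch: build a tiny example matrix and dump it in PETSc's binary format.
int main(int argc, char** argv) {
    PetscInitialize(&argc, &argv, NULL, NULL);

    Mat A;
    MatCreate(PETSC_COMM_WORLD, &A);
    MatSetSizes(A, PETSC_DECIDE, PETSC_DECIDE, 3, 3);
    MatSetFromOptions(A);
    MatSetUp(A);

    // In a real converter these values would come from parsing the text file.
    for (PetscInt i = 0; i < 3; ++i)
        MatSetValue(A, i, i, (PetscScalar)(i + 1), INSERT_VALUES);
    MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
    MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);

    PetscViewer viewer;
    PetscViewerBinaryOpen(PETSC_COMM_WORLD, "matrix.dat", FILE_MODE_WRITE, &viewer);
    MatView(A, viewer);                 // writes the PETSc binary representation
    PetscViewerDestroy(&viewer);

    MatDestroy(&A);
    PetscFinalize();
    return 0;
}
```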

Identifying Algorithms in Binaries

Does any of you know a technique for identifying algorithms in already-compiled files, e.g. by testing the disassembly for known patterns?
The little information I have is that there is some (not exported) code in a library that decompresses the content of a byte[], but I have no clue how that works.
I have some files which I believe to be compressed in that unknown way, and it looks as if the files come without any compression header or trailer. I assume there's no encryption, but as long as I don't know how to decompress them, they're worth nothing to me.
The library I have is an ARM9 binary for low-capacity targets.
EDIT:
It's a lossless compression, storing binary data or plain text.
You could go a couple of directions: static analysis with something like IDA Pro, or load it into GDB or an emulator and follow the code that way. They may be XOR'ing the data to hide the algorithm, since there are already many good lossless compression techniques.
Decompression algorithms spend most of their time in tight loops. You might start by looking for loops (decrement register, jump backwards if not zero).
Given that it's a small target, you have a good chance of decoding it by hand. Though it looks hard now, once you dive into it you'll find that you can identify various programming structures yourself.
You might also consider decompiling it to a higher-level language, which would be easier to read than assembly, though still hard if you don't know how it was compiled.
http://www.google.com/search?q=arm%20decompiler
-Adam
The reliable way to do this is to disassemble the library and read the resulting assembly code for the decompression routine (and perhaps step through it in a debugger) to see exactly what it is doing.
However, you might be able to look at the magic number of the compressed file and figure out what kind of compression was used. If it's compressed with DEFLATE (in a zlib wrapper), for example, the first two bytes will typically be hexadecimal 78 9c; if with bzip2, 42 5a; if with gzip, 1f 8b.
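A quick sketch of checking for those signatures yourself (the extra 0x01/0xda second bytes are other common zlib header variants; this is illustrative, not exhaustive):

```cpp
#include <cstdio>

// Peek at the first bytes of a file and report any of the common
// compression signatures mentioned above.
int main(int argc, char** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s <file>\n", argv[0]); return 1; }

    std::FILE* f = std::fopen(argv[1], "rb");
    if (!f) { std::perror("fopen"); return 1; }

    unsigned char magic[2] = {0, 0};
    std::fread(magic, 1, sizeof magic, f);
    std::fclose(f);

    if (magic[0] == 0x78 && (magic[1] == 0x9c || magic[1] == 0x01 || magic[1] == 0xda))
        std::puts("looks like a zlib (DEFLATE) stream");
    else if (magic[0] == 0x42 && magic[1] == 0x5a)
        std::puts("looks like bzip2");
    else if (magic[0] == 0x1f && magic[1] == 0x8b)
        std::puts("looks like gzip");
    else
        std::puts("no well-known compression signature found");
    return 0;
}
```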
In my experience, most of the time such files are compressed using plain old DEFLATE. You can try using zlib to open them, starting from different offsets to compensate for custom headers. The problem is that zlib by default expects its own header. In Python (and I guess other implementations have this feature as well), you can pass -15 to zlib.decompress as the history buffer size (i.e. zlib.decompress(data, -15)), which causes it to decompress raw deflated data without zlib's headers.
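For reference, a rough C++ counterpart of that Python call using zlib's C API (the function name inflate_raw is made up; trying different starting offsets, as suggested above, is left to the caller):

```cpp
#include <zlib.h>
#include <cstddef>
#include <stdexcept>
#include <string>

// Inflate a raw DEFLATE stream (no zlib/gzip header), the C-level equivalent of
// Python's zlib.decompress(data, -15). The negative windowBits value tells zlib
// not to expect its usual two-byte header and trailing checksum.
std::string inflate_raw(const unsigned char* data, std::size_t len) {
    z_stream strm{};                      // zero-initialized: zalloc/zfree/opaque are null
    if (inflateInit2(&strm, -15) != Z_OK)
        throw std::runtime_error("inflateInit2 failed");

    std::string out;
    unsigned char buf[4096];
    strm.next_in  = const_cast<unsigned char*>(data);
    strm.avail_in = static_cast<uInt>(len);

    int ret;
    do {
        strm.next_out  = buf;
        strm.avail_out = sizeof buf;
        ret = inflate(&strm, Z_NO_FLUSH);
        if (ret != Z_OK && ret != Z_STREAM_END) {
            inflateEnd(&strm);
            throw std::runtime_error("inflate failed: not raw deflate data?");
        }
        out.append(reinterpret_cast<char*>(buf), sizeof buf - strm.avail_out);
    } while (ret != Z_STREAM_END);

    inflateEnd(&strm);
    return out;
}
```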
Reverse engineering by reading the assembly may raise copyright issues. In particular, doing this in order to write a decompression program of your own is almost as bad, from a copyright standpoint, as just reusing the assembly itself, but the latter is much easier. So, if your motivation is just to be able to write your own decompression utility, you might be better off simply porting the assembly you have.
