Default byte order of QDataStream - performance

The default byte order of QDataStream is BigEndian, according to its documentation. AFAIK this means that data written on a little-endian platform is converted from little- to big-endian, and data read is converted from big- back to little-endian.
Most platforms supported by Qt, especially x86/x64, are little-endian. A linked discussion states that Qt's CI doesn't even include any big-endian devices. This would mean that the default configuration of QDataStream requires lots of endianness conversions on Qt's major platforms, which would be pretty sub-optimal.
Did I miss something obvious? If not, why didn't Qt use LittleEndian as a default to support major platforms more efficiently?
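(For reference, the byte order can be overridden per stream via the documented setByteOrder call, so this is about the choice of default rather than a hard limitation. A minimal sketch, assuming Qt is available:)

    #include <QByteArray>
    #include <QDataStream>
    #include <QIODevice>

    int main() {
        QByteArray buffer;
        QDataStream out(&buffer, QIODevice::WriteOnly);
        // Opt out of the big-endian default; writer and reader must agree.
        out.setByteOrder(QDataStream::LittleEndian);
        out << quint32(0xDEADBEEF);

        QDataStream in(buffer);
        in.setByteOrder(QDataStream::LittleEndian);
        quint32 value;
        in >> value;  // no byte swapping needed on x86/x64
        return 0;
    }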

Related

Why protobuf doesn't support alphanumeric type?

I'm wondering why protobuf doesn't implement support for a commonly used alphanumeric type?
This would make it possible to encode characters in fewer bits each (fewer still if case-insensitive) quite effectively, without any sort of compression being involved.
Is it something that Protobuf developers are planning to implement in future?
In today's global world, the number of times when "alphanumeric" means the 62 characters in the range 0-9, A-Z and a-z is fairly minimal. If we just consider the basic multilingual plane, there are about 48k code units (which is to say: over 70% of the available range) that are counted as "alphanumeric" - and a fairly standard (although even this may be suboptimal in some locales) way of encoding them is UTF-8, which protobuf already uses for the string type.
I cannot see much advantage in using a dedicated wire-type for this category of data, and any additional wire-type would have the issue that it would need support added in multiple libraries, because an unknown wire-type renders the stream unreadable to down-level parsers: you cannot even skip over unwanted data if you don't know the wire-type (the wire-type defines the skip rules).
Of course, since you also have the bytes type available, you can feel free to do anything bespoke you want inside that.
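For instance, a bespoke 6-bit packing over a 64-symbol alphabet inside a bytes field could look like the sketch below; the alphabet choice and the pack6 helper are invented for illustration, not anything protobuf itself provides:

    #include <cstdint>
    #include <string>
    #include <vector>

    // Map each symbol of a 64-symbol alphabet (62 alphanumerics plus two
    // spares) to a 6-bit code and bit-pack the codes: 4 characters fit in
    // 3 bytes instead of 4. Assumes the input uses only alphabet symbols.
    std::vector<uint8_t> pack6(const std::string& text) {
        static const std::string alphabet =
            "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
            "abcdefghijklmnopqrstuvwxyz+-";
        std::vector<uint8_t> out;
        uint32_t bits = 0;
        int nbits = 0;
        for (char c : text) {
            bits = (bits << 6) | static_cast<uint32_t>(alphabet.find(c));
            nbits += 6;
            while (nbits >= 8) {
                nbits -= 8;
                out.push_back(static_cast<uint8_t>(bits >> nbits));
            }
        }
        if (nbits > 0)  // flush the trailing partial byte, zero-padded
            out.push_back(static_cast<uint8_t>(bits << (8 - nbits)));
        return out;
    }

The resulting buffer would then travel in an ordinary bytes field, invisible to the wire format.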

The reason behind endianness?

I was wondering why some architectures use little-endian and others big-endian. I remember reading somewhere that it has to do with performance, but I don't understand how endianness can influence it. Also, I know that:
The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses.
Which seems a nice feature, but, even so, many systems use big-endian, which probably means big-endian has some advantages too (if so, which?).
I'm sure there's more to it, most probably digging down to the hardware level. Would love to know the details.
I've looked around the net a bit for more information on this question and there is quite a range of answers and reasoning to explain why big- or little-endian ordering may be preferable. I'll do my best to explain here what I found:
Little-endian
The obvious advantage to little-endianness is what you mentioned already in your question... the fact that a given value can be read at varying widths from the same memory address. As the Wikipedia article on the topic states:
Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers.
Because of this, multi-precision math routines are easier to write, because byte significance always corresponds to memory address, whereas with big-endian numbers this is not the case. This seems to be the argument for little-endianness that is quoted over and over again; because of its prevalence I would have to assume that the benefits of this ordering are relatively significant.
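To make that concrete, a little sketch of the same-address property (this is the kind of thing a code optimizer might exploit, not portable logic you would want in application code):

    #include <cassert>
    #include <cstdint>
    #include <cstring>

    int main() {
        uint32_t wide = 0x42;  // small value stored in a 32-bit slot
        uint8_t narrow;
        // On a little-endian machine the least significant byte sits at
        // the lowest address, so a 1-byte read at the same address still
        // sees the value.
        std::memcpy(&narrow, &wide, sizeof(narrow));
        assert(narrow == 0x42);  // holds on little-endian, not big-endian
        return 0;
    }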
Another interesting explanation that I found concerns addition and subtraction. When adding or subtracting multi-byte numbers, the least significant byte must be fetched first to see if there is a carryover to more significant bytes. Because the least-significant byte is read first in little-endian numbers, the system can parallelize and begin calculation on this byte while fetching the following byte(s).
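A sketch of that fetch order, as a hypothetical byte-array adder rather than any particular CPU's circuitry: the carry ripples in the same direction the bytes are stored and fetched:

    #include <cstddef>
    #include <cstdint>

    // Add two little-endian multi-byte integers of equal width n.
    // Byte 0 is the least significant, so the loop walks addresses
    // upward while the carry propagates toward higher significance.
    void add_le(const uint8_t* a, const uint8_t* b, uint8_t* sum, size_t n) {
        unsigned carry = 0;
        for (size_t i = 0; i < n; ++i) {
            unsigned s = a[i] + b[i] + carry;
            sum[i] = static_cast<uint8_t>(s);
            carry = s >> 8;
        }
    }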
Big-endian
Going back to the Wikipedia article, the stated advantage of big-endian numbers is that the size of the number can be more easily estimated because the most significant digit comes first. Related to this is the fact that it is simple to tell whether a number is positive or negative by examining the sign bit in the byte at the lowest address.
What is also stated when discussing the benefits of big-endianness is that the binary digits are ordered as most people order base-10 digits. This is advantageous performance-wise when converting from binary to decimal.
While all these arguments are interesting (at least I think so), their applicability to modern processors is another matter. In particular, the addition/subtraction argument was most valid on 8-bit systems...
For my money, little-endianness seems to make the most sense and is by far the most common when looking at all the devices which use it. I think that the reason why big-endianness is still used is more a matter of legacy than performance. Perhaps at one time the designers of a given architecture decided that big-endianness was preferable to little-endianness, and as the architecture evolved over the years the endianness stayed the same.
The parallel I draw here is with JPEG, which is a big-endian format despite the fact that virtually all the machines that consume it are little-endian. While one can ask what the benefits of JPEG being big-endian are, I would venture to say that for all intents and purposes the performance arguments mentioned above don't make a shred of difference. The fact is that JPEG was designed that way, and so long as it remains in use, that way it shall stay.
I would assume that it was once the hardware designers of the first processors who decided which endianness would best integrate with their preferred/existing/planned micro-architecture for the chips they were developing from scratch.
Once established, and for compatibility reasons, the endianness was more or less carried on to later generations of hardware; which would support the 'legacy' argument for why still both kinds exist today.

What are the key differences between Apache Thrift, Google Protocol Buffers, MessagePack, ASN.1 and Apache Avro?

All of these provide binary serialization, RPC frameworks and IDL. I'm interested in the key differences between them and their characteristics (performance, ease of use, programming language support).
If you know of any other similar technologies, please mention them in an answer.
ASN.1 is an ISO/IEC standard. It has a very readable source language and a variety of back-ends, both binary and human-readable. Being an international standard (and an old one at that!) the source language is a bit kitchen-sinkish (in about the same way that the Atlantic Ocean is a bit wet) but it is extremely well-specified and has a decent amount of support. (You can probably find an ASN.1 library for any language you name if you dig hard enough, and if not there are good C language libraries available that you can use in FFIs.) It is, being a standardized language, obsessively documented and has a few good tutorials available as well.
Thrift is not a standard. It is originally from Facebook, was later open-sourced, and is currently a top-level Apache project. It is not well-documented -- especially at the tutorial level -- and to my (admittedly brief) glance doesn't appear to add anything that other, previous efforts don't already do (and in some cases do better). To be fair to it, it supports a rather impressive number of languages out of the box, including a few of the higher-profile non-mainstream ones. The IDL is also vaguely C-like.
Protocol Buffers is not a standard. It is a Google product that is being released to the wider community. It is a bit limited in terms of languages supported out of the box (it only supports C++, Python and Java) but it does have a lot of third-party support for other languages (of highly variable quality). Google does pretty much all of their work using Protocol Buffers, so it is a battle-tested, battle-hardened protocol (albeit not as battle-hardened as ASN.1 is). It has much better documentation than Thrift, but, being a Google product, it is highly likely to be unstable (in the sense of ever-changing, not in the sense of unreliable). The IDL is also C-like.
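For a taste of that C-like IDL, here is a minimal message definition (in the current proto3 syntax; the message and field names are invented):

    // hypothetical trip.proto
    syntax = "proto3";

    message Trip {
      int64 id = 1;     // field numbers, not names, go on the wire
      string note = 2;
    }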
All of the above systems use a schema defined in some kind of IDL to generate code for a target language that is then used in encoding and decoding. Avro does not. Avro's typing is dynamic and its schema data is used at runtime directly, both to encode and to decode (which has some obvious costs in processing, but also some obvious benefits vis-à-vis dynamic languages and freedom from tagging types, etc.). Its schema uses JSON, which makes supporting Avro in a new language a bit easier to manage if there's already a JSON library. Again, as with most wheel-reinventing protocol description systems, Avro is not standardized.
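As an illustration of that JSON schema format, a minimal record (names invented) looks like this:

    {
      "type": "record",
      "name": "Trip",
      "fields": [
        {"name": "id",   "type": "long"},
        {"name": "note", "type": "string"}
      ]
    }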
Personally, despite my love/hate relationship with it, I'd probably use ASN.1 for most RPC and message transmission purposes, although it doesn't really have an RPC stack (you'd have to make one, but IOCs make that simple enough).
We just did an internal study on serializers, here are some results (for my future reference too!)
Thrift = serialization + RPC stack
The biggest difference is that Thrift is not just a serialization protocol, it's a full-blown RPC stack, like a modern-day SOAP stack. So after the serialization, the objects can (but are not required to) be sent between machines over TCP/IP. In SOAP, you started with a WSDL document that fully described the available services (remote methods) and the expected arguments/objects; those objects were sent via XML. In Thrift, the .thrift file fully describes the available methods and expected parameter objects, and the objects are serialized via one of the available serializers (with Compact Protocol, an efficient binary protocol, being the most popular in production).
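A sketch of what such a .thrift file looks like, with invented names, defining both a data type and a remote method in one place:

    // hypothetical trip.thrift
    struct Trip {
      1: required i64 id,
      2: optional string note
    }

    service TripService {
      Trip getTrip(1: i64 id)
    }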
ASN.1 = Grand daddy
ASN.1 was designed by telecom folks in the '80s and is awkward to use due to limited library support compared to the recent serializers that emerged from CompSci folks. There are two common flavors: DER (binary) encoding and PEM (ASCII; essentially Base64-wrapped DER). Both are fast, but DER is the faster and more size-efficient of the two. In fact ASN.1 DER can easily keep up with (and sometimes beat) serializers that were designed 30 years after it, a testament to its well-engineered design. It's very compact, smaller than Protocol Buffers and Thrift, beaten only by Avro. The issue is library support, and right now Bouncy Castle seems to be the best option for C#/Java. ASN.1 is king in security and crypto systems and isn't going to go away, so don't worry about 'future proofing'. Just get a good library...
MessagePack = middle of the pack
It's not bad but it's neither the fastest, nor the smallest nor the best supported. No production reason to choose it.
Common
Beyond that, they are fairly similar. Most are variants of the basic Type-Length-Value (TLV) principle.
Protocol Buffers (Google originated), Avro (Apache based, used in Hadoop), Thrift (Facebook originated, now an Apache project) and ASN.1 (telecom originated) all involve some level of code generation, where you first express your data in a serializer-specific format, then the serializer "compiler" generates source code for your language via the code-gen phase. Your app source then uses these code-gen classes for IO. Note that certain implementations (e.g. Microsoft's Avro library or Marc Gravell's protobuf-net) let you directly decorate your app-level POCO/POJO objects, and the library then uses those decorated classes directly instead of any code-gen classes. We've seen this offer a performance boost, since it eliminates an object copy stage (from application-level POCO/POJO fields to code-gen fields).
Some results and a live project to play with
This project (https://github.com/sidshetye/SerializersCompare) compares important serializers in the C# world. The Java folks already have something similar.
1000 iterations per serializer, average times listed
Sorting result by size
Name                 Bytes   Time (ms)
--------------------------------------
Avro (cheating)        133      0.0142
Avro                   133      0.0568
Avro MSFT              141      0.0051
Thrift (cheating)      148      0.0069
Thrift                 148      0.1470
ProtoBuf               155      0.0077
MessagePack            230      0.0296
ServiceStackJSV        258      0.0159
Json.NET BSON          286      0.0381
ServiceStackJson       290      0.0164
Json.NET               290      0.0333
XmlSerializer          571      0.1025
Binary Formatter       748      0.0344
Serialized via ASN.1 DER encoding to 148 bytes in 0.0674ms (hacked experiment!)
Adding to the performance perspective, Uber recently evaluated several of these libraries on their engineering blog:
https://eng.uber.com/trip-data-squeeze/
The winner for them? MessagePack + zlib for compression
Our goal was to find the combination of encoding protocol and compression algorithm with the most compact result at the highest speed. We tested encoding protocol and compression algorithm combinations on 2,219 pseudorandom anonymized trips from Uber New York City (put in a text file as JSON).
The lesson here is that your requirements drive which library is right for you. For Uber, an IDL-based protocol was ruled out by the schemaless nature of their message passing, which eliminated a bunch of options. Also, for them it's not only raw encoding/decoding time that comes into play, but also the size of data at rest.
(The post includes two charts, Size Results and Speed Results, comparing the combinations; see the link above.)
The one big thing about ASN.1 is that it is designed for specification, not implementation. Therefore it is very good at hiding/ignoring implementation details in any "real" programming language.
It's the job of the ASN.1 compiler to apply encoding rules to the asn1 file and generate executable code from both of them. The encoding rules might be given in Encoding Control Notation (ECN) or might be one of the standardized ones such as BER/DER, PER, or XER/EXER.
That is: ASN.1 provides the types and structures, the encoding rules define the on-the-wire encoding, and, last but not least, the compiler translates all of it into your programming language.
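As a small illustration of that separation, the module below (names invented) defines only types; whether they travel as BER, PER or XER is decided separately:

    Example DEFINITIONS AUTOMATIC TAGS ::= BEGIN
        Trip ::= SEQUENCE {
            id   INTEGER,
            note UTF8String
        }
    END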
The free compilers support C, C++, C#, Java, and Erlang to my knowledge. The (much too expensive and patent/license-ridden) commercial compilers are very versatile, usually absolutely up-to-date, and sometimes support even more languages, but see their sites (OSS Nokalva, Marben etc.).
It is surprisingly easy to specify an interface between parties of totally different programming cultures (e.g. "embedded" people and "server farmers") using these techniques: an asn.1 file, an encoding rule such as BER, and e.g. a UML interaction diagram. No worries about how it is implemented; let everyone use "their thing"! For me it has worked very well.
Btw: at OSS Nokalva's site you may find at least two free-to-download books about ASN.1 (one by Larmouth, the other by Dubuisson).
IMHO, most of the other products try only to be yet another RPC stub generator, pumping a lot of air into the serialization issue. Well, if one needs that, one might be fine. But to me, they look like reinventions of Sun RPC (from the late '80s) -- but hey, that worked fine, too.
Microsoft's Bond (https://github.com/Microsoft/bond) is very impressive in performance, functionality and documentation. However, it does not support many target platforms as of now (13 Feb 2015); I can only assume that is because it is very new. Currently it supports Python, C# and C++. It's being used by MS everywhere. I tried it; to me as a C# developer, using Bond is better than using protobuf. I have used Thrift as well, and the only problem I faced was with the documentation -- I had to try many things to understand how things are done.
A few resources on Bond: https://news.ycombinator.com/item?id=8866694 , https://news.ycombinator.com/item?id=8866848 , https://microsoft.github.io/bond/why_bond.html
For performance, one data point is the jvm-serializers benchmark -- it's quite specific (small messages), but it might help if you are on the Java platform. I think performance in general will often not be the most important difference. Also: NEVER take authors' words as gospel; many advertised claims are bogus (the msgpack site for example has some dubious claims; it may be fast, but the information is very sketchy and the use case not very realistic).
One big difference is whether a schema must be used (PB and Thrift at least; with Avro it may be optional; ASN.1 I think also; MsgPack, not necessarily).
Also: in my opinion it is good to be able to use layered, modular design; that is, RPC layer should not dictate data format, serialization. Unfortunately most candidates do tightly bundle these.
Finally, when choosing a data format, nowadays performance does not preclude the use of textual formats. There are blazing fast JSON parsers (and pretty fast streaming XML parsers); and when considering interoperability with scripting languages and ease of use, binary formats and protocols may not be the best choice.

Any pointers to fix the Unix millennium bug or Y2k38 problem?

I've been reviewing the year 2038 problem (Unix Millennium Bug).
I read the Wikipedia article about it, which mentions a possible solution for the problem.
Now I would like to change the time_t data type to an unsigned 32-bit integer, which would keep the clock valid until 2106. I have Linux kernel 2.6.23 with RTPatch on PowerPC.
Is there any patch available that changes time_t to an unsigned 32-bit integer on PowerPC, or any other patch that resolves this bug?
time_t is actually defined by your libc implementation, not by the kernel itself.
The kernel provides various mechanisms for obtaining the current time (in the form of system calls), many of which already support more than 32 bits of precision. The problem is actually your libc implementation (glibc on most desktop Linux distributions), which, after fetching the time from the kernel, returns it to your application as a 32-bit signed integer.
While one could theoretically change the definition of time_t in your libc implementation, in practice it would be fairly complicated: such a change would alter the ABI of libc, in turn requiring that every application using libc also be recompiled from source.
The easiest solution instead is to upgrade your system to a 64-bit distribution, where time_t is already defined as a 64-bit type, avoiding the problem altogether.
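A trivial way to check what your toolchain already gives you (just a diagnostic sketch, not a fix in itself):

    #include <cstdio>
    #include <ctime>

    int main() {
        // On a 64-bit distribution time_t is 8 bytes, pushing the overflow
        // from 2038-01-19 (2^31 - 1 seconds after the epoch) billions of
        // years into the future.
        std::printf("sizeof(time_t) = %zu bytes\n", sizeof(std::time_t));
        return 0;
    }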
About the 64-bit distribution suggested here, may I note all the issues with implementing that. There are many 32-bit non-PAE computers in the embedded industry. Replacing these with 64-bit computers is going to be a LARGE problem. Everyone is used to desktops that get replaced/upgraded frequently. All Linux OS suppliers need to get serious about providing a different option. It's not like a 32-bit computer is flawed or useless or will wear out in 16 years. It doesn't take a 64-bit computer to monitor analog inputs, control equipment, and report alarms.

Why the creators of Windows and Linux systems chose different ways to support Unicode?

As far as I know, Linux chose backward compatibility via UTF-8, whereas Windows added completely new API functions for UTF-16 (ending with "W"). Could these decisions have been different? Which one has proved better?
UTF-16 is pretty much a loss, the worst of both worlds. It is neither compact (for the typical case of ASCII characters), nor does it map each code unit to a character. This hasn't really bitten anyone too badly yet since characters outside of the Basic Multilingual Plane are still rarely-used, but it sure is ugly.
POSIX (Linux et al.) has some "w" APIs too, based on the wchar_t type. On platforms other than Windows this typically corresponds to UTF-32 rather than UTF-16, which is nice for easy string manipulation but incredibly bloated.
But in-memory APIs aren't really that important. What causes much more difficulty is file storage and on-the-wire protocols, where data is exchanged between applications with different charset traditions.
Here, compactness beats ease of indexing: UTF-8 has clearly proven to be the best format for this by far, and Windows's poor support of UTF-8 causes real difficulties. Windows is the last modern operating system to still have locale-specific default encodings; everyone else has moved to UTF-8 by default.
Whilst I seriously hope Microsoft will reconsider this for future versions, as it causes huge and unnecessary problems even within the Windows-only world, it's understandable how it happened.
The thinking back in the old days when WinNT was being designed was that UCS-2 was it for Unicode. There wasn't going to be anything outside the 16-bit character range. Everyone would use UCS-2 in-memory and naturally it would be easiest to save this content directly from memory. This is why Windows called that format “Unicode”, and still to this day calls UTF-16LE just “Unicode” in UI like save boxes, despite it being totally misleading.
UTF-8 wasn't even standardised until Unicode 2.0 (along with the extended character range and the surrogates that made UTF-16 what it is today). By then Microsoft were on to WinNT4, at which point it was far too late to change strategy. In short, Microsoft were unlucky to be designing a new OS from scratch around the time when Unicode was in its infancy.
Windows chose to support Unicode with UTF-16 and the attendant ANSI/Unicode ("A"/"W") function variants way way way way WAAAAAAY back in the early '90s (Windows NT 3.1 came out in 1993), before Linux ever had the notion of Unicode support.
Linux has been able to learn from best practices built on Windows and other Unicode capable platforms.
Many people would agree today that UTF-8 is the better encoding for size reasons, unless you know you will be dealing almost exclusively with characters (CJK text, for example) that encode to two bytes in UTF-16 but three in UTF-8, where UTF-16 is the more space-efficient choice.
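To make the size trade-off concrete, a minimal sketch comparing the encoded length of plain ASCII text (for the CJK range the ratio flips, as noted above):

    #include <cstdio>
    #include <string>

    int main() {
        // ASCII text: UTF-8 spends 1 byte per character, UTF-16 spends 2.
        std::string    utf8  = "hello";
        std::u16string utf16 = u"hello";

        std::printf("UTF-8 : %zu bytes\n", utf8.size());                     // 5
        std::printf("UTF-16: %zu bytes\n", utf16.size() * sizeof(char16_t)); // 10
        return 0;
    }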

Resources