Is there really such a thing as a char or short in modern programming? - performance

I've been learning to program for the Mac over the past few months (I have experience in other languages). Obviously that has meant learning the Objective-C language and thus the plainer C it is predicated on. So I have stumbled on this quote, which refers to the C/C++ language in general, not just the Mac platform.
With C and C++, prefer use of int over char and short. The main reason behind this is that C and C++ perform arithmetic operations and parameter passing at integer level. If you have an integer value that can fit in a byte, you should still consider using an int to hold the number. If you use a char, the compiler will first convert the values into integer, perform the operations and then convert the result back to char.
So my question is, is this the case in the Mac desktop and iPhone OS environments? I understand that when talking about these environments we're actually talking about 3-4 different architectures (PPC, i386, ARM and the A4 ARM variant), so there may not be a single answer.
Nevertheless, does the general principle hold that on modern 32-bit/64-bit systems, 1-2 byte variables that don't align with the machine's natural 4-byte words don't deliver much of the efficiency we might expect?
For instance, a plain old C array of 100,000 chars is smaller than the same 100,000 ints by a factor of four, but if, during an enumeration, reading each element involves a cast/boxing/unboxing of sorts, will we see overall lower 'performance' despite the saved memory overhead?

The processor is very very fast compared to the memory speed. It will always pay to store values in memory as chars or shorts (though to avoid porting problems you should use int8_t and int16_t). Less cache will be used, and there will be fewer memory accesses.
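As a rough sketch of that point (a hypothetical function, assuming the values genuinely fit in 8 bits): the data is stored narrow to keep the working set small, while the arithmetic still happens at int width after the usual integer promotions, which costs essentially nothing compared to a cache miss.

    #include <cstdint>
    #include <cstddef>

    // 100,000 int8_t values occupy ~100 KB; the same data as int would be ~400 KB.
    // Each element is promoted to int before the addition, as the quote describes.
    int64_t sum_bytes(const int8_t *values, std::size_t n)
    {
        int64_t total = 0;
        for (std::size_t i = 0; i < n; ++i)
            total += values[i];   // int8_t -> int promotion, then the add
        return total;
    }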

Can't speak for PPC/ARM/A4, but x86 has the ability to operate on data as if it were 8-bit, 16-bit, or 32-bit (64-bit on x86_64 in 64-bit mode), although I'm not sure whether the compiler would take advantage of those instructions. Even when using a 32-bit load, the compiler could AND the data with a mask that clears the upper 16/24 bits, which would be relatively fast.
Likely, the ability to fit far more data into the cache would at least cancel out the speed difference... although the only way to know for sure would be to actually profile the code.
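As a tiny sketch of the masking idea (a hypothetical helper, just to show the operation being described):

    #include <cstdint>

    // Extract the low 8 bits of a 32-bit word with a single AND; this is the
    // kind of thing a compiler can emit instead of a dedicated byte-sized
    // access (on x86 it will typically just use a zero-extending load, movzx).
    uint32_t low_byte(uint32_t word)
    {
        return word & 0xFFu;   // clears the upper 24 bits
    }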

Of course there is a need to use data types smaller than the register size of the target machine. Imagine you are storing text encoded as UTF-8 or ASCII in memory, where each character is mostly a byte in size; do you want to store the characters as 64-bit quantities?
The advice you are quoting is a warning not to over-optimize.
You have to balance the savings in space against the computational performance of your choice.
I wouldn't worry too much about it; today's CPUs are complicated enough that it's hard to make this kind of judgement on your own. Choose the obvious data type and let the compiler worry about the rest.

The addressing model of the x86 architecture is that the basic unit of memory is the 8-bit byte.
This is to simplify operations on character strings and decimal arithmetic.
Then, in order to have useful sizes of integers, the instruction set allows using these in units of 1, 2, 4, and (more recently) 8 bytes.
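A quick way to see those units from C++, using the fixed-width types from <cstdint> (a trivial sketch; it prints 1 2 4 8):

    #include <cstdint>
    #include <cstdio>

    int main()
    {
        // The 1/2/4/8-byte integer units the instruction set works in.
        std::printf("%zu %zu %zu %zu\n",
                    sizeof(int8_t), sizeof(int16_t),
                    sizeof(int32_t), sizeof(int64_t));
        return 0;
    }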

A fact to remember is that most software development targets processors different from the ones most of us here deal with on a day-to-day basis.
C and assembler are common languages for these.
About ten billion CPUs were manufactured in 2008. About 98% of new CPUs produced each year are embedded.

Related

Go: pointer bit stealing technique

In the well-known book The Art of Multiprocessor Programming by Herlihy and Shavit, some of the lock-free and wait-free algorithms use Java's AtomicMarkableReference<T> type. It allows performing a single atomic CAS operation on a pair consisting of a T reference and a boolean mark.
There is no similar type in the C/C++/Go standard libraries, but at least in C++ it's possible to model it using a bit-stealing approach (see the C++ example). On the x86_64 architecture only 48 of a pointer's 64 bits are actually used, so one can store arbitrary data in the remaining 16 bits and work with the whole pointer and the data atomically.
As far as I understand, there are two requirements to implement this approach:
Pointers must be aligned.
The pointer's low bits must be clear (if you want to store something like a bool there, those bits must not already be occupied).
Are these requirements met in Go? Are there any working examples of bit stealing technique in Go?
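For illustration, a minimal C++ sketch of the bit-stealing approach the question describes (a hypothetical AtomicMarkablePtr, not the linked example): it steals the low alignment bit rather than the upper address bits, so it assumes T is at least 2-byte aligned and bit 0 of the pointer is always zero.

    #include <atomic>
    #include <cstdint>

    // Pointer and boolean mark packed into one word so they can be CASed
    // together, mimicking Java's AtomicMarkableReference<T>.
    template <typename T>
    class AtomicMarkablePtr {
        std::atomic<std::uintptr_t> word_{0};

        static std::uintptr_t pack(T* p, bool mark) {
            // Assumes p is aligned, so its low bit is free to hold the mark.
            return reinterpret_cast<std::uintptr_t>(p) | (mark ? 1u : 0u);
        }

    public:
        void store(T* p, bool mark) { word_.store(pack(p, mark)); }

        T*   ptr()  const { return reinterpret_cast<T*>(word_.load() & ~std::uintptr_t(1)); }
        bool mark() const { return (word_.load() & 1u) != 0; }

        // Single atomic CAS on the (pointer, mark) pair.
        bool compare_and_set(T* expected_ptr, bool expected_mark,
                             T* new_ptr, bool new_mark) {
            std::uintptr_t expected = pack(expected_ptr, expected_mark);
            return word_.compare_exchange_strong(expected, pack(new_ptr, new_mark));
        }
    };

Stealing an alignment bit rather than the upper 16 address bits keeps the trick independent of how much of the 64-bit address space the architecture actually uses.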

Are there any advantages to network byte order in a new protocol?

(I know many people are going to be tempted to close this question; please don't; I'm asking for concrete technical answers, if any exist.)
"Network byte order" is big-endian for reasons that cannot be asked on stackoverflow. Lots of old protocols use that order and can't be changed but I wonder if there are any technical reasons to choose big endian for a new protocol.
I would think little endian is better, because 99.99% of processors in use are little endian (ARM can technically do both, but in reality it is always set to little endian). So I was surprised to see that CBOR, a relatively recent protocol, uses big endian. Is there an advantage that I haven't thought of?
It boils down to human factors: It is easier to read a multi-byte integer in a hex dump if it is encoded with the most significant byte(s) first. For example, the CBOR representation of 0x1234 (4,660) is the byte sequence 19 12 34. If you are looking for the value 0x1234, it is easier to spot it that way.
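For illustration, a small C++ sketch of that encoding (hypothetical put_be16 helper; 0x19 is the CBOR header byte meaning an unsigned integer with a 16-bit argument follows):

    #include <cstdint>
    #include <cstdio>

    // Write a 16-bit value most-significant-byte first (network byte order).
    void put_be16(uint8_t *out, uint16_t v)
    {
        out[0] = (uint8_t)(v >> 8);    // most significant byte first
        out[1] = (uint8_t)(v & 0xFF);
    }

    int main()
    {
        uint8_t buf[3] = { 0x19 };     // CBOR header: unsigned int, 16-bit argument
        put_be16(buf + 1, 0x1234);
        std::printf("%02X %02X %02X\n",
                    (unsigned)buf[0], (unsigned)buf[1], (unsigned)buf[2]);  // 19 12 34
        return 0;
    }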
TLDR;
I've been in the field for over 40 years now, so there's a lot of history behind this. Even the definition of a "byte" has changed over that many years, so this may take a bit of an open mind to understand how this evolved.
Dumps of binary information weren't always in bytes, nor in hexadecimal. For example, on the PDP-11 (with 16-bit words and 8-bit bytes), octal notation for word-wide dumps was common. This was useful because of the machine architecture, which included 8 registers and 8 addressing modes, so machine-language dumps in octal were easier to decode than hex.

Is fixed point math faster than floating point?

Years ago, in the early 1990s, I built graphics packages that optimized calculations based on fixed-point arithmetic and pre-computed tables for cos and sin, and scaled equations for sqrt and log approximation using Newton's approximation methods. These advanced techniques seem to have become part of the graphics and built-in math processors. About 5 years ago, I took a numerical analysis class that touched on some of the old techniques. I've been coding for almost 30 years and rarely ever see those old fixed-point optimizations in use, even after working on GPGPU applications for world-class particle accelerator experiments. Are fixed-point methods still useful anywhere in the software industry, or is the usefulness of that knowledge now gone forever?
Fixed point is marginally useful on platforms that do not support any kind of decimal type of their own; for example, I implemented a 24-bit fixed point type for the PIC16F series microcontrollers (more on why I chose fixed point later).
However, almost every modern CPU supports floating point at the microcode or hardware level, so there isn't much need for fixed point.
Fixed-point numbers are limited in the range they can represent - consider a 64-bit (32.32) fixed point vs. a 64-bit floating point: the 64-bit fixed-point number has a decimal resolution of 1/(2^32), while the floating-point number has a decimal resolution of up to 1/(2^53); the fixed-point number can represent values as high as 2^31, while the floating-point number can represent numbers up to 2^223. And if you need more, most modern CPUs support 80-bit floating-point values.
Of course, the biggest downfall of floating point is limited precision in extreme cases - e.g. in fixed point, it would require fewer bits to represent 9000000000000000000000000000000.00000000000000000000000000000002. Of course, with floating point, you get better precision for average uses of decimal arithmetic, and I have yet to see an application where decimal arithmetic is as extreme as the above example yet also does not overflow the equivalent fixed-point size.
The reason I implemented a fixed-point library for the PIC16F rather than use an existing floating point library was code size, not speed: the 16F88 has 384 bytes of usable RAM and room for 4095 instructions total. To add two fixed point numbers of predefined width, I inlined integer addition with carry-out in my code (the fixed point doesn't move anyway); to multiply two fixed point numbers, I used a simple shift-and-add function with extended 32-bit fixed point, even though that isn't the fastest multiplication approach, in order to save even more code.
So, when I had need of only one or two basic arithmetic operations, I was able to add them without using up all of the program storage. For comparison, a freely available floating point library on that platform was about 60% of the total storage on the device. In contrast, software floating point libraries are mostly just wrappers around a few arithmetic operations, and in my experience, they are mostly all-or-nothing, so cutting the code size in half because you only need half of the functions doesn't work so well.
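As a rough illustration of what such fixed-point add and multiply look like in general, here is a generic 16.16 sketch (not the PIC16F code described above); the sign is split off so the shift-and-add loop only sees non-negative magnitudes:

    #include <cstdint>

    // Generic 16.16 fixed point: values are int32_t with 16 fractional bits.
    typedef int32_t fix16;

    // Addition needs nothing special: the binary point sits in the same place
    // in both operands, so a plain integer add is all that is required.
    static inline fix16 fix_add(fix16 a, fix16 b) { return a + b; }

    // Multiplication by shift-and-add. The final >> 16 moves the binary point
    // back to 16 fractional bits.
    static fix16 fix_mul(fix16 a, fix16 b)
    {
        bool negative = (a < 0) != (b < 0);
        uint64_t ua = (a < 0) ? (uint64_t)(-(int64_t)a) : (uint64_t)a;
        uint64_t ub = (b < 0) ? (uint64_t)(-(int64_t)b) : (uint64_t)b;

        uint64_t acc = 0;
        while (ub) {
            if (ub & 1) acc += ua;   // add the shifted multiplicand
            ua <<= 1;
            ub >>= 1;
        }
        acc >>= 16;                  // renormalize
        return negative ? (fix16)(-(int64_t)acc) : (fix16)acc;
    }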
Fixed point generally doesn't provide much of an advantage in speed though, because of its limited representation range: how many bits would you need to represent 1.7E+/-308 with 15 digits of precision, the same as a 64-bit double? If my calculations are correct, you'd need somewhere around 2020 bits. I'd bet the performance of that wouldn't be so good.
Thirty years ago, when hardware floating point was relatively rare, very special-purpose fixed-point (or even scaled integer) arithmetic could provide significant gains in performance over doing software-based floating point, but only if the allowable range of values could be efficiently represented with scaled-integer arithmetic (the original Doom used this approach when no coprocessor was available, such as on my 486sx-25 in 1992 - typing this on an overclocked hyperthreaded Core i7 running at 4.0GHz with a GeForce card that has over 1000 independent floating point compute units, it just seems wrong somehow, although I'm not sure which - the 486, or the i7...).
Floating point is more general purpose due to the range of values it can represent, and with it implemented in hardware on both CPUs and GPUs, it beats fixed point in every way, unless you really need more than 80-bit floating point precision at the expense of huge fixed-point sizes and very slow code.
Well, I have been coding for 2 decades, and in my experience there are 3 main reasons to use fixed point:
No FPU available
Fixed point is still valid for DSPs, MCUs, FPGAs and chip design in general. Also, no floating-point unit can work without a fixed-point core unit, so all bigdecimal libs must use fixed point too... Graphics cards also use fixed point a lot (normalized device coordinates).
insufficient FPU precision
If you go into astronomic computations you will very soon hit the extremes and the need to handle them. For example, simple Newtonian/d'Alembert integration or atmospheric ray-tracing hits the precision barriers pretty fast at large scales and low granularity. I usually use an array of floating-point doubles to remedy that. For situations where the input/output range is known, fixed point is usually the better choice. See some examples of hitting the FPU barrier:
Is it possible to make realistic n-body solar system simulation in matter of size and mass?
ray and ellipsoid intersection accuracy improvement
speed
Back in the old days the FPU was really slow (especially on the x86 architecture) due to the interface and API it used. An interrupt was generated for each FPU instruction, not to mention the operand and result transfer process... So a few bit-shift operations in the CPU ALU were usually faster.
Nowadays this is not true anymore, and ALU and FPU speeds are comparable. For example, here are my measurements of CPU/FPU operations (in a small Win32 C++ app):
fcpu(0) = 3.194877 GHz // tested on first core of AMD-A8-5500 APU 3.2GHz Win7 x64 bit
CPU 32bit integer arithmetic:
add = 387.465 MIPS
sub = 376.333 MIPS
mul = 386.926 MIPS
div = 245.571 MIPS
mod = 243.869 MIPS
FPU 32bit float arithmetic:
add = 377.332 MFLOPS
sub = 385.444 MFLOPS
mul = 383.854 MFLOPS
div = 367.520 MFLOPS
FPU 64bit double arithmetic:
add = 385.038 MFLOPS
sub = 261.488 MFLOPS
mul = 353.601 MFLOPS
div = 309.282 MFLOPS
The values vary with time, but comparing between data types they are almost identical. Just a few years back doubles were slower due to 2x bigger data transfers. But there are other platforms where the speed difference may still be valid.

Pascal variables representation on the machine

I was given a question in my homework.
How are Pascal variables represented on the machine?
For example: in C it can be different on different machines and compilers; in Java there is a VM, so the programmer can assume he will get the exact same representation on different machines.
I have been googling for a while and could not find an answer about Pascal. The question is about the original version of Pascal, if that changes anything.
Thank you!
Original (J&W) Pascal is extremely rare, and most people don't know it. It got cleaned up a bit into the ISO 7185 standard, though those changes IIRC mostly affect scoping and type equivalence, not the kinds of types.
Original (non-UCSD, over a decade before the Borland/Turbo dialects) Pascal has nearly no machine-dependent types: just one integer type, INTEGER, one floating-point type, REAL, and non-integer ordinal types like enums, boolean and char. Char was not guaranteed to be 8-bit; the width depended on the machine word.
Pascal shows its mainframe roots here, where words had exotic sizes like 60 bits, didn't allow subword access (say, byte-level access, though even that is a stretch, since those machines might not have had the concept of a byte), and multiple chars were packed into machine words (see packed arrays below). C came several years later and targeted minis, so it avoided the worst of that legacy.
The integer type is the biggest type in the system, quite often the biggest type that the machine can handle conveniently. Smaller integer sizes are constructed with subranges; there are no unsigned types, but these can be defined with the relevant subranges (and it is up to the compiler/VM to implement them efficiently),
e.g. BYTE = 0..255;
Arrays can be packed, and must be unpacked before use (with pack() and unpack()).
There is no string type; typically a packed fixed-size array of char is used, right-padded with spaces to signal the end of the string (so trailing spaces are hard to represent, but it is only a convention with little runtime support, so in exceptional cases you simply make an exception).
Unions contain all components as separate fields (no overlap) and are always named.
It had pointers, but you couldn't take addresses of arbitrary symbols, and new pointers could only be created with NEW.
So in general original Pascal is what you would call a reasonably "safe" language, though it was not fully designed as such (and AFAIK not 100% safe in theory). It was also much more suitable for VMing than TP (and that happened with UCSD, albeit for a subset).
Pascal and its successors can be considered a reconnaissance of concepts that were later popularized by Java.
In C it can be different on different machines and compilers; so it can in Pascal too, of course.

The reason behind endianness?

I was wondering why some architectures use little-endian and others big-endian. I remember reading somewhere that it has to do with performance; however, I don't understand how endianness can influence it. Also I know that:
The little-endian system has the property that the same value can be read from memory at different lengths without using different addresses.
Which seems a nice feature, but, even so, many systems use big-endian, which probably means big-endian has some advantages too (if so, which?).
I'm sure there's more to it, most probably digging down to the hardware level. Would love to know the details.
I've looked around the net a bit for more information on this question and there is quite a range of answers and reasonings to explain why big- or little-endian ordering may be preferable. I'll do my best to explain here what I found:
Little-endian
The obvious advantage to little-endianness is what you mentioned already in your question... the fact that a given value can be read from the same memory address at a varying number of widths. As the Wikipedia article on the topic states:
Although this little-endian property is rarely used directly by high-level programmers, it is often employed by code optimizers as well as by assembly language programmers.
Because of this, mathematical routines involving multiple precisions are easier to write, because the byte significance always corresponds to the memory address, whereas with big-endian numbers this is not the case. This seems to be the argument for little-endianness that is quoted over and over again... because of its prevalence I would have to assume that the benefits of this ordering are relatively significant.
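A tiny sketch of that same-address-at-different-widths property (memcpy is used to sidestep strict aliasing; the output assumes a little-endian machine):

    #include <cstdint>
    #include <cstdio>
    #include <cstring>

    int main()
    {
        uint32_t value = 0x00000042;             // small value stored in 32 bits
        uint8_t  as8;
        uint16_t as16;
        std::memcpy(&as8,  &value, sizeof as8);  // read the first byte
        std::memcpy(&as16, &value, sizeof as16); // read the first two bytes
        // On a little-endian machine all three reads give 66 (0x42),
        // because the low-order bytes live at the lowest addresses.
        std::printf("%u %u %u\n", (unsigned)as8, (unsigned)as16, (unsigned)value);
        return 0;
    }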
Another interesting explanation that I found concerns addition and subtraction. When adding or subtracting multi-byte numbers, the least significant byte must be fetched first to see if there is a carry into the more significant bytes. Because the least significant byte is read first in little-endian numbers, the system can parallelize and begin calculation on this byte while fetching the following byte(s).
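Sketched in C++ with a hypothetical add_le helper, where the operands are stored as little-endian byte arrays:

    #include <cstdint>
    #include <cstddef>

    // Add two multi-byte numbers stored least-significant-byte first. The loop
    // starts at the lowest address, which is the least significant byte, so the
    // carry propagates in the same order the bytes are fetched.
    void add_le(const uint8_t *a, const uint8_t *b, uint8_t *out, std::size_t n)
    {
        unsigned carry = 0;
        for (std::size_t i = 0; i < n; ++i) {       // byte 0 first = LSB first
            unsigned sum = a[i] + b[i] + carry;
            out[i] = (uint8_t)(sum & 0xFF);
            carry  = sum >> 8;
        }
    }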
Big-endian
Going back to the Wikipedia article, the stated advantage of big-endian numbers is that the size of the number can be more easily estimated because the most significant digit comes first. Related to this is the fact that it is simple to tell whether a number is positive or negative by examining the most significant bit of the byte at the lowest address.
What is also stated when discussing the benefits of big-endianness is that the binary digits are ordered as most people order base-10 digits. This is advantageous performance-wise when converting from binary to decimal.
While all these arguments are interesting (at least I think so), their applicability to modern processors is another matter. In particular, the addition/subtraction argument was most valid on 8 bit systems...
For my money, little-endianness seems to make the most sense and is by far the most common when looking at all the devices that use it. I think the reason big-endianness is still used is more a matter of legacy than performance. Perhaps at one time the designers of a given architecture decided that big-endianness was preferable to little-endianness, and as the architecture evolved over the years the endianness stayed the same.
The parallel I draw here is with JPEG, which is a big-endian format despite the fact that virtually all the machines that consume it are little-endian. While one can ask what the benefits of JPEG being big-endian are, I would venture to say that for all intents and purposes the performance arguments mentioned above don't make a shred of difference. The fact is that JPEG was designed that way, and as long as it remains in use, that way it shall stay.
I would assume that it was once the hardware designers of the first processors who decided which endianness would best integrate with the preferred/existing/planned micro-architecture of the chips they were developing from scratch.
Once established, and for compatibility reasons, the endianness was more or less carried on to later generations of hardware; which would support the 'legacy' argument for why still both kinds exist today.
