How fast is the procedure to convert from using big endian to little endian?

Very fast. It's a single machine language opcode on most architectures. Even on ancient hardware it would execute in only 2-3 clock cycles.

The speed greatly depends on the implementation and the language. Inlined machine code is extremely fast but an implementation running in an interpreted language may be orders of magnitude slower. If it's not inlined, procedure call overhead may take considerably more time than the actual byte swap.
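
As a minimal sketch of what the conversion looks like in C, the function below swaps the bytes of a 32-bit value with shifts and masks. The name bswap32 is made up for this illustration. Optimizing compilers typically recognize this pattern and reduce it to a single byte-swap instruction where the target has one, but that depends on the compiler and flags, so check the generated assembly rather than assume it.

```c
#include <stdint.h>

/* Portable 32-bit byte swap: reverses the order of the four bytes.
   bswap32 is a hypothetical helper name, not a standard function. */
static inline uint32_t bswap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |   /* byte 0 -> byte 3 */
           ((x & 0x0000FF00u) << 8)  |   /* byte 1 -> byte 2 */
           ((x & 0x00FF0000u) >> 8)  |   /* byte 2 -> byte 1 */
           ((x & 0xFF000000u) >> 24);    /* byte 3 -> byte 0 */
}
```

Because the function is declared static inline, a call such as bswap32(value) inside a hot loop can avoid the procedure call overhead mentioned above.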


Calculate the theoretical speed of a program?

I've written two different ways of doing the same thing. I would like to compare which one will execute faster. Of course it is always possible to benchmark, but how a program benchmarks can be different from machine to machine and can be affected by many outside factors. How could I calculate which is faster without benchmarking? My thought would be that you would sum the times of all the operations done in the program. Is this a standard thing to do? It seems like when you benchmark there is a lot of room for error.
"My thought would be that you would sum the times of all the operations done in the program."
Yes, but you can't easily / reliably figure out those times by any method other than benchmarking.
The problem is that these times depend on the dynamic context of what happened previously in your program (or even system-wide). CPUs are complex beasts, and cache effects (data cache and instruction cache) are often a major factor. So is branch prediction; see the question "Why is it faster to process a sorted array than an unsorted array?"
Static analysis of a small loop in assembly language is possible. For example, I can accurately predict how many cycles per iteration a simple loop can run on Intel Haswell, assuming no cache misses, based on Agner Fog's microarchitecture PDF and instruction tables. Going beyond that involves more and more guesswork.
Performance in a high-level interpreted language like Ruby might be somewhat predictable for experts that spend a lot of time tuning code in it, but almost certainly not "this will take this number of microseconds", only "this is probably a bit or a lot faster than that".
Algorithmic complexity will give you a theoretical speed comparison for an algorithm.
Your question is about an arbitrary program, but a program is more than a collection of algorithms.
The execution speed of a program depends on the context it is running in (I/O, operating system (multitasking or not), hardware).
So there is no other method than statistics on a bunch of measurements, which is the definition of a benchmark.
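
"Statistics on a bunch of measurements" can be sketched in C roughly as follows: time the same function several times and report the median, which is less sensitive to outliers than a single run. The names median and bench_median_seconds are hypothetical, and clock() measures CPU time at a coarse resolution, so treat this as an illustration of the idea rather than a rigorous benchmark harness.

```c
#include <stdlib.h>
#include <time.h>

/* Comparison callback for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of n samples (n must be > 0); sorts the array in place. */
double median(double *samples, size_t n)
{
    qsort(samples, n, sizeof samples[0], cmp_double);
    return (n % 2) ? samples[n / 2]
                   : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

/* Call fn() `runs` times, timing each call with clock(), and return
   the median CPU time in seconds. Returns -1.0 if runs is 0 or the
   sample buffer cannot be allocated. */
double bench_median_seconds(void (*fn)(void), size_t runs)
{
    double *samples = malloc(runs * sizeof *samples);
    if (samples == NULL || runs == 0) {
        free(samples);
        return -1.0;
    }
    for (size_t i = 0; i < runs; i++) {
        clock_t start = clock();
        fn();
        samples[i] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    double m = median(samples, runs);
    free(samples);
    return m;
}
```

To compare two implementations, call bench_median_seconds on each with the same number of runs on the same machine. The numbers are only meaningful relative to each other.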

Endianness conversion cost on architectures

Today there are two kinds of CPU architectures, big endian and little endian. So data needs to be converted between the two representations. Each CPU architecture, and its instruction set in particular, is different, and each allows for different implementations for changing endianness. Some CPUs contain specific instructions while others do not.
My question is this: with today's architectures, is it more efficient to have the data in BE and convert on LE architectures, or the other way around, in order to minimize data conversion latency and thus maximize throughput in a BE to LE communication?
Second question: can the cost of conversion be quantified? Is there any data on the matter?
How costly is this conversion in Java from a byte array? Is there data on specific JVMs on specific architectures? Is it different on Dalvik?
AFAIK the relevant architectures for this would be x64, ARM, MIPS and JVM/Dalvik. Am I missing any?
Storing data is one of the most basic things computers do, so why are only a few architectures relevant? Every architecture whose word size is bigger than an octet must either follow a specific endianness (such as x86/x86_64) or be bi-endian (ARM, MIPS...).
However, even if the architecture is bi-endian, the endianness must be set at startup, and the CPU only works in that mode until the mode is changed and everything is restarted. The CPU can't operate directly on data in the reversed endianness. Therefore data should always be in the native endianness unless you only copy the value and send it back without any processing, or you do only very simple endian-agnostic operations such as bitwise operations.
Network activity is much slower than the CPU, not even comparable to RAM speed. You'll hardly ever see any difference in speed from the conversion; most likely you will be memory bound.
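
For the byte array case raised in the question, here is a hedged C sketch (the question asks about Java, but the technique is the same): assemble the value from individual bytes with shifts, which gives the right result regardless of the host's endianness. The names load_be32 and store_be32 are made up for this example.

```c
#include <stdint.h>

/* Read a big-endian 32-bit value from a byte buffer.
   Works the same on little-endian and big-endian hosts because it
   never reinterprets memory; it only combines individual bytes. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) |
           ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |
            (uint32_t)p[3];
}

/* Write a 32-bit value to a byte buffer in big-endian order. */
void store_be32(unsigned char *p, uint32_t v)
{
    p[0] = (unsigned char)(v >> 24);
    p[1] = (unsigned char)(v >> 16);
    p[2] = (unsigned char)(v >> 8);
    p[3] = (unsigned char)v;
}
```

The work per value is a handful of shifts and ORs, which is why the conversion is rarely the bottleneck next to network or memory latency.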

Where is the bottleneck when using 64-bit variables and 32-bit systems?

A colleague and I recently were arguing about whether or not it is a good idea to use 64-bit variables in 32-bit code. I took a side saying "It might be dangerous and slow us down somewhere", he said "Nah. Nothing bad will happen." So who is right?
The bitness of parts of our ecosystem is as follows:
Windows: either 32-bit or 64-bit
Compiler: 32-bit only (Delphi...)
Processor: modern Intel ones... that's 64-bit, isn't it?
Let's say we have some variables which use some 64-bit type built into the language, with Delphi that would be Int64. Is it safe (performance-wise) to scatter the code with calculations based on these?
It sounds like you are doing some premature optimization.
Even if it would cause a very slight performance hit, your users are more and more likely to have 64-bit machines as the software ages. So it would be a self-correcting problem, if it even is a problem.
If the compiler supports them, go ahead and use them. If you find a slowdown in a tight loop with lots of calculations, then perhaps consider changing it.
It's bad. The data takes twice the memory, so it's like running on a processor with half the cache. It also requires 2 instructions to load, 2 instructions to add, and 2 instructions to store a pair of numbers. You can test this with simple programs like a prime number sieve. These will generally run measurably slower using the longer types on a 32bit machine, and even on a 64bit machine when you get larger than the cache. The worst part is that once you sprinkle this stuff all over your code, you'll have a systemic performance problem that will be hard to correct later. And yes, the same is even true of using 32 bit integers where shorts could be used, but only with regard to memory performance - a 32bit CPU can do 32bit math just as fast as 16bit - so don't worry about that (or your spreadsheet will only support 65K rows for example).
GCC (and I doubt your Delphi compiler is much better) compiles its long long datatype so that it uses the EAX and ECX registers. Loading both already takes longer than loading only one of them (whether it actually takes double the time depends on the cache). It then emits multiple instructions where you would only need one, e.g. for ADD/SUB/IDIV, if you were operating on a single 32-bit value. Then follow the stores for EAX and ECX, which again take longer than storing only one of them.
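
To make the "multiple instructions" point concrete, here is a rough C model of what a 32-bit target has to do for one 64-bit addition: add the low halves, detect the carry, then add the high halves plus the carry. The u64_pair type and the function names are invented for this sketch; a real compiler would use an add followed by an add-with-carry instruction rather than an explicit comparison.

```c
#include <stdint.h>

/* A 64-bit value held as two 32-bit halves, as on a 32-bit CPU. */
typedef struct {
    uint32_t lo;
    uint32_t hi;
} u64_pair;

/* 64-bit addition built from 32-bit operations. */
u64_pair add64_via_32(u64_pair a, u64_pair b)
{
    u64_pair r;
    r.lo = a.lo + b.lo;               /* low add, wraps modulo 2^32  */
    uint32_t carry = (r.lo < a.lo);   /* 1 if the low add overflowed */
    r.hi = a.hi + b.hi + carry;       /* high add plus the carry     */
    return r;
}

/* Helpers to convert to and from a native 64-bit integer. */
u64_pair from_u64(uint64_t v)
{
    u64_pair p = { (uint32_t)v, (uint32_t)(v >> 32) };
    return p;
}

uint64_t to_u64(u64_pair p)
{
    return ((uint64_t)p.hi << 32) | p.lo;
}
```

Every value also occupies two registers and two memory words, which is where the extra loads and stores come from.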
On x64 architecture, if you only need 32 bit wide integers, then using 32 bit wide integers will result in faster code, even when running 64 bit code.
Ask yourself this:
Does this variable ever need to store a value that is bigger than 32 bits? Then choose from the following:
If the answer is "no", use an unsigned int.
If the answer is "yes", use an unsigned 64-bit int. 64-bit ints aren't very memory expensive and are cheap to add or subtract (and reasonably cheap to multiply or divide).
If the answer is "yes, but only on 64-bit systems" (for example the length of a buffer might be > 4GB on 64-bit but never on 32-bit), use a "size_t". On common platforms this is 32-bit on 32-bit systems and 64-bit on 64-bit systems.
Write your program to be correct, and let the optimiser / performance tests help you get it fast afterwards. It's more expensive for you to write it fast first and fix it later than for you to write it correct first and make it fast later.

Are 64 bit programs bigger and faster than 32 bit versions?

I suppose I am focussing on x86, but I am generally interested in the move from 32 to 64 bit.
Logically, I can see that constants and pointers, in some cases, will be larger so programs are likely to be larger. And the desire to allocate memory on word boundaries for efficiency would mean more white-space between allocations.
I have also heard that 32 bit mode on the x86 has to flush its cache when context switching due to possible overlapping 4G address spaces.
So, what are the real benefits of 64 bit?
And as a supplementary question, would 128 bit be even better?
I have just written my first 32/64 bit program. It makes linked lists/trees of 16 byte (32b version) or 32 byte (64b version) objects and does a lot of printing to stderr - not a really useful program, and not something typical, but it is my first.
Size: 81128(32b) v 83672(64b) - so not much difference
Speed: 17s(32b) v 24s(64b) - running on 32 bit OS (OS-X 10.5.8)
I note that a new hybrid x32 ABI (Application Binary Interface) is being developed that is 64b but uses 32b pointers. For some tests it results in smaller code and faster execution than either 32b or 64b.
I typically see a 30% speed improvement for compute-intensive code on x86-64 compared to x86. This is most likely due to the fact that we have 16 x 64 bit general purpose registers and 16 x SSE registers instead of 8 x 32 bit general purpose registers and 8 x SSE registers. This is with the Intel ICC compiler (11.1) on an x86-64 Linux - results with other compilers (e.g. gcc), or with other operating systems (e.g. Windows), may be different of course.
Unless you need to access more memory than 32b addressing will allow you, the benefits will be small, if any.
When running on a 64b CPU, you get the same memory interface no matter if you are running 32b or 64b code (you are using the same cache and the same bus).
While x64 architecture has a few more registers which allows easier optimizations, this is often counteracted by the fact pointers are now larger and using any structures with pointers results in a higher memory traffic. I would estimate the increase in the overall memory usage for a 64b application compared to a 32b one to be around 15-30 %.
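
A small C sketch of why pointer-heavy structures grow: the tree node below holds two pointers and an int. The struct layout and helper names are assumptions for illustration. On a typical 32-bit build it comes to 12 bytes, and on a typical LP64 build to 24 bytes once padding is included, though exact sizes depend on the compiler and ABI.

```c
#include <stddef.h>

/* A binary tree node: two pointers plus a small payload. */
struct node {
    struct node *left;
    struct node *right;
    int key;
};

/* Total size of one node, including any padding the ABI adds. */
size_t node_size(void)
{
    return sizeof(struct node);
}

/* Bytes of each node spent on pointers alone. */
size_t node_pointer_bytes(void)
{
    return 2 * sizeof(struct node *);
}
```

Comparing node_size() from a 32-bit build and a 64-bit build of the same source shows the extra memory traffic directly. Data with few pointers grows much less, which is consistent with an overall increase well below 2x.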
Regardless of the benefits, I would suggest that you always compile your program for the system's default word size (32-bit or 64-bit), since if you compile a library as a 32-bit binary and provide it on a 64-bit system, you will force anyone who wants to link with your library to provide their library (and any other library dependencies) as a 32-bit binary, when the 64-bit version is the default available. This can be quite a nuisance for everyone. When in doubt, provide both versions of your library.
As to the practical benefits of 64-bit... the most obvious is that you get a bigger address space, so if you mmap a file, you can address more of it at once (and load larger files into memory). Another benefit is that, assuming the compiler does a good job of optimizing, many of your arithmetic operations can be parallelized (for example, placing two pairs of 32-bit numbers in two registers and performing two adds in a single add operation), and big number computations will run more quickly. That said, the whole 64-bit vs 32-bit thing won't help you with asymptotic complexity at all, so if you are looking to optimize your code, you should probably be looking at the algorithms rather than the constant factors like this.
Please disregard my statement about the parallelized addition. This is not performed by an ordinary add statement... I was confusing that with some of the vectorized/SSE instructions. A more accurate benefit, aside from the larger address space, is that there are more general purpose registers, which means more local variables can be maintained in the CPU register file, which is much faster to access, than if you place the variables in the program stack (which usually means going out to the L1 cache).
I'm coding a chess engine named foolsmate. The best move extraction using a minimax-based tree search to depth 9 (from a certain position) took:
on Win32 configuration: ~17.0s;
after switching to x64 configuration: ~10.3s;
That is roughly a 39% reduction in run time (about a 1.65x speedup)!
In addition to having more registers, 64-bit has SSE2 by default. This means that you can indeed perform some calculations in parallel. The SSE extensions had other goodies too. But I guess the main benefit is not having to check for the presence of the extensions. If it's x64, it has SSE2 available. ...If my memory serves me correctly.
In the specific case of x86 to x86_64, the 64 bit program will be about the same size, if not slightly smaller, use a bit more memory, and run faster. Mostly this is because x86_64 doesn't just have 64 bit registers, it also has twice as many. x86 does not have enough registers to make compiled languages as efficient as they could be, so x86 code spends a lot of instructions and memory bandwidth shifting data back and forth between registers and memory. x86_64 has much less of that, and so it takes a little less space and runs faster. Floating point and bit-twiddling vector instructions are also much more efficient in x86_64.
In general, though, 64 bit code is not necessarily any faster, and is usually larger, both for code and memory usage at runtime.
The only justification for moving your application to 64 bit is the need for more memory, in applications like large databases or ERP applications with at least hundreds of concurrent users, where the 2 GB limit will be exceeded fairly quickly when the application caches for better performance. This is especially the case on Windows, where integer and long are still 32 bit (there is a new type, _int64); only pointers are 64 bit. In fact WOW64 is highly optimised on Windows x64, so 32 bit applications run with a low penalty on 64 bit Windows. My experience on Windows x64 is that the 32 bit version of an application runs 10-15% faster than the 64 bit version, since in the former case, at least for proprietary in-memory databases, you can use pointer arithmetic for maintaining the b-tree (the most processor-intensive part of database systems). Another case is computation-intensive applications that require large decimals for accuracy beyond what double affords on a 32 or 64 bit operating system: these can use _int64 natively instead of through software emulation. Of course large disk-based databases will also show an improvement over 32 bit, simply due to the ability to use large memory for caching query plans and so on.
Any applications that require CPU usage such as transcoding, display performance and media rendering, whether it be audio or visual, will certainly require (at this point) and benefit from using 64 bit versus 32 bit due to the CPU's ability to deal with the sheer amount of data being thrown at it. It's not so much a question of address space as it is the way the data is being dealt with. A 64 bit processor, given 64 bit code, is going to perform better, especially with mathematically difficult things like transcoding and VoIP data - in fact, any sort of 'math' applications should benefit by the usage of 64 bit CPUs and operating systems. Prove me wrong.
On my machine, the same h265 encode runs almost twice as fast using virtulDub_x64 (with the x64 h265 library) as with virtulDub_x32 (the regular x32 h265 library). That's probably because operations on longint (64-bit) numbers (e.g. add) can be done in a single instruction on x64, but need two on 32-bit: add the lower part, and then add (with carry) the higher part. So unless the integer maths is limited to 32-bit integers, most of it will take more time under x32.

64-bits and Memory Bandwidth

Mason asked about the advantages of a 64-bit processor.
Well, an obvious disadvantage is that you have to move more bits around. And given that memory accesses are a serious issue these days[1], moving around twice as much memory for a fair number of operations can't be a good thing.
But how bad is the effect of this, really? And what makes up for it? Or should I be running all my small apps on 32-bit machines?
I should mention that I'm considering, in particular, the case where one has a choice of running 32- or 64-bit on the same machine, so in either mode the bandwidth to main memory is the same.
[1]: And even fifteen years ago, for that matter. I remember talk as far back as that about good cache behaviour, and also particularly that the Alpha CPUs that won all the benchmarks had a giant, for the time, 8 MB of L2 cache.
Whether your app should be 64-bit depends a lot on what kind of computation it does. If you need to process very large data sets, you obviously need 64-bit pointers. If not, you need to know whether your app spends relatively more time doing arithmetic or memory accesses. On x86-64, the general purpose registers are not only twice as wide, there are twice as many and they are more "general purpose". This means that 64-bit code can have much better integer op performance. However, if your code doesn't need the extra register space, you'll probably see better performance by using smaller pointers and data, due to increased cache effectiveness. If your app is dominated by floating point operations, there probably isn't much point in making it 32-bit, because most of the memory accesses will be for wide vectors anyways, and having the extra SSE registers will help.
Most 64-bit programming environments use the "LP64" model, meaning that only pointers and long int variables (if you're a C/C++ programmer) are 64 bits. Integers (ints) remain 32-bits unless you're in the "ILP64" model, which is fairly uncommon.
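
One way to see which model a given toolchain uses is to look at the sizes of int, long and pointers. The sketch below assumes 8-bit bytes, and the function name data_model is made up for this example. Besides LP64 and ILP64, it also reports LLP64, the model used by 64-bit Windows, where long stays 32 bits and only pointers (and long long) are 64 bits.

```c
#include <stddef.h>

/* Classify the C data model from the sizes of int, long and pointers.
   Assumes 8-bit bytes; returns "other" for anything unrecognized. */
const char *data_model(void)
{
    const size_t i = sizeof(int);
    const size_t l = sizeof(long);
    const size_t p = sizeof(void *);

    if (i == 4 && l == 4 && p == 4) return "ILP32"; /* typical 32-bit  */
    if (i == 4 && l == 8 && p == 8) return "LP64";  /* 64-bit Unix     */
    if (i == 4 && l == 4 && p == 8) return "LLP64"; /* 64-bit Windows  */
    if (i == 8 && l == 8 && p == 8) return "ILP64"; /* uncommon        */
    return "other";
}
```

In all of the 64-bit models except ILP64, plain int stays at 32 bits, which is why code that mostly uses int sees little change in data size.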
I only bring it up because most int variables aren't being used for size_t-like purposes--that is, they stay within ranges comfortably held by 32 bits. For variables of that nature, you'll never be able to tell the difference.
If you're doing numerical or data-heavy work with > 4GB of data, you'll need 64 bits anyways. If you're not, you won't notice the difference, unless you're in the habit of using longs where most would use ints.
I think you're starting off with a bad assumption here. You say:
"moving around twice as much memory for a fair number of operations can't be a good thing"
and the first question to ask is "why not"? In a true 64 bit machine, the data path is 64 bits wide, and so moving 64 bits takes exactly (to a first approximation) as many cycles as moving 32 bits on a 32 bit machine. So, if you need to move 128 bytes, it takes half as many cycles as it would take on a 32 bit machine (16 transfers of 8 bytes instead of 32 transfers of 4 bytes).
