Understanding response time of memory accesses

Understanding response time of memory accesses - caching

I performed, as a part of an academic research, the following experiment:
buff = mmap(NULL, BUFFSIZE, PROT_READ | PROT_WRITE, MAP_ANONYMOUS | MAP_PRIVATE | HUGEPAGES, -1, 0);
lineAddr = buff;
for (int i = 0; i < BUFFSIZE; i++)
clflush(&(buff[i]));
for (int i = 0; i < LINES; i ++){
srand(rdtscp());
result = memaccesstime(lineAddr);
lineAddr = (void*)((uint64_t)lineAddr + (rand()%20+3)*(8*sizeof(void*)));
resultArr[i] = result;
}
MemAccessTime function returns the response time in cpu ticks.
static inline uint32_t memaccesstime(void *v) {
uint32_t rv;
asm volatile (
"mfence\n"
"lfence\n"
"rdtscp\n"
"mov %%eax, %%esi\n"
"mov (%1), %%eax\n"
"rdtscp\n"
"sub %%esi, %%eax\n"
: "=&a" (rv): "r" (v): "ecx", "edx", "esi");
return rv;
}
So the steps are:
Allocated a long range of memory (with mmap()).
clflush() all the line (with for loop)
Running over random lines (with steps between 3 to 23) and measured the response time.
The results:
Results
Please help me understand the results better.
Why after short number of samples, the response time is plunging?
Notes:
The MSR register 0x1a4 value is 0xF (but behavior is the same with 0x0)
I've chosen random steps to avoid the "stride" prefetcher.
Is there any other hardware (or software) prefetcher that could be responsible for those results?

Related

Using perf_event_open in AMD for Counting and Sampling Memory Load

I want to count and sample memory load operations that happen in a chunk of code while running it on an AMD machine. I know that the following code can be used to count memory loads happening in a code chunk on Intel machine.
int
main(int argc, char **argv)
{
struct perf_event_attr pe;
long long count;
int fd;
memset(&pe, 0, sizeof(struct perf_event_attr));
pe.type = PERF_TYPE_RAW;
pe.size = sizeof(struct perf_event_attr);
/* event number of MEM_UOPS_RETIRED.ALL_LOADS
event is 81d0 */
pe.config = 0x81d0;
pe.disabled = 1;
pe.exclude_kernel = 1;
pe.exclude_hv = 1;
fd = perf_event_open(&pe, 0, -1, -1, 0);
if (fd == -1) {
fprintf(stderr, "Error opening %llx\n", pe.config);
exit(EXIT_FAILURE);
}
ioctl(fd, PERF_EVENT_IOC_RESET, 0);
ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
// code chunk to be profiled is here
ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
read(fd, &count, sizeof(long long));
printf("Number of load operations: %lld\n", count);
close(fd);
}
However, I don't know how to get the count of the same event from AMD machine. I have read AMD's document on performance counter and IBS in https://www.amd.com/system/files/TechDocs/24593.pdf sections 13.2 and 13.3. Yet, there is no mention of any hardware event number for memory load that can be passed to pe.config in the above code.
So, how can I count or sample memory load operations in an AMD machine by using perf_event_open?
PS: The AMD processor that I am using is AMD EPYC 7551 32-Core.

Optimizing a logic AND operation for x86

I am trying to optimize an algorithm that masks an array. The initial code looks like this:
void mask(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *m)
{
unsigned int rep=size_x*size_y;
while (rep--)
{
*(source++) &= *(m++);
}
}
I have tried to do Loop Unrolling + prefetching
void mask_LU4(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
_mm_prefetch(&source[16], _MM_HINT_T0);
_mm_prefetch(&mask[16], _MM_HINT_T0);
source[0] &= mask[0];
source[1] &= mask[1];
source[2] &= mask[2];
source[3] &= mask[3];
source += 4;
mask += 4;
}
}
and use intrinsics
void fmask_SIMD(unsigned int size_x, unsigned int size_y, uint32_t *source, uint32_t *mask)
{ // in place
unsigned int rep;
__m128i *s,*m ;
s = (__m128i *) source;
m = (__m128i *) mask;
rep= size_x* size_y;
rep/= 4 ;
while (rep--)
{
*s = _mm_and_si128(*s,*m);
source+=4;mask+=4;
s = (__m128i *) source;
m = (__m128i *) mask;
}
}
However the performance is the same. I have tried to perform sw prefetch both to the SIMD and Loop Unrolling version and it I couldn't see any improvement. Any ideas on how I could optimize this algorithm?
P.S.1: I am using gcc 4.8.1 and I compile with -march=native and -Ofast.
P.S.2: I am using an Intel Core i5 3470 #3.2Ghz, Ivy bridge architecture. L1 DCache 4X32KB (8-way), L2 4x256, L3 6MB, RAM-DDR3 4Gb (Dual Channel, DRAM #798,1Mhz)

Your operation is memory bandwidth bound. However, that does not necessarily mean your operation is achieving the maximum memory bandwidth. To get closer to the maximum memory bandwidth you need to use multiple threads. Using OpenMP (add -fopenmp to GCC's options) you can do this:
#pragma omp parallel for
for(int i=0; i<rep; i++) { source[i] &= m[i]; }
If you wanted to not modify the source but use a different destination then you can use stream instructions like this:
#pragma omp parallel for
for(int i=0; i<rep/4; i++) {
__m128i m4 = _mm_load_si128((__m128i*)&m[4*i]);
__m128i s4 = _mm_load_si128((__m128i*)&source[4*i]);
s4 = _mm_and_si128(s4,m4);
_mm_stream_si128((__m128i*i)&dest[4*i], s4);
}
This would not be any faster than using the same destination and source. However, if you already planned to use a destination not equal to the source this would likely be faster (for some value of rep) than using _mm_store_si128.

Your problem might be memory bound, but that does not mean you cannot process more per cycle. Usually when you have a low payload operation (like yous here, it's only an AND after all), it makes sense to combine many loads and stores. In most CPUs most loads will be combined by the L2 cache into a single cache line load (especially if they're consecutive). I'd suggest increasing the loop unrolling to at least 4 SIMD packets here, along with the prefetching. You'd still be memory bound, but you will get fewer cache misses, giving you a slightly better performance.

Inline ASM: Use of MMX returns NaN seconds on timer

Problem
I am trying to find out whether mmx or xmm registers are faster for copying elements of an array to another array (I know about memcpy() but I need this function for a very specific purpose).
My souce code is below. The relevant function is copyarray(). I can use either mmx or xmm registers with movq or movsd respectively, and the result is correct. However, when I use mmx registers, any timer I use (either clock() or QueryPerformanceCounter) to time the operations returns NaN.
Compiled with: gcc -std=c99 -O2 -m32 -msse3 -mincoming-stack-boundary=2 -mfpmath=sse,387 -masm=intel copyasm.c -o copyasm.exe
This is a very strange bug and I cannot figure out why using mmx registers would cause a timer to return NaN seconds, while using xmm registers in exactly the same code returns a valid time value
EDIT
Results using xmm registers:
Elapsed time: 0.000000 seconds, Gigabytes copied per second: inf GB
Residual = 0.000000
0.937437 0.330424 0.883267 0.118717 0.962493 0.584826 0.344371 0.423719
0.937437 0.330424 0.883267 0.118717 0.962493 0.584826 0.344371 0.423719
Results using mmx register:
Elapsed time: nan seconds, Gigabytes copied per second: inf GB
Residual = 0.000000
0.000000 0.754173 0.615345 0.634724 0.611286 0.547655 0.729637 0.942381
0.935759 0.754173 0.615345 0.634724 0.611286 0.547655 0.729637 0.942381
Source Code
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <x86intrin.h>
#include <windows.h>
inline double
__attribute__ ((gnu_inline))
__attribute__ ((aligned(64))) copyarray(
double* restrict dst,
const double* restrict src,
const int n)
{
// int i = n;
// do {
// *dst++ = *src++;
// i--;
// } while(i);
__asm__ __volatile__
(
"mov ecx, %[n] \n\t"
"mov edi, %[dst] \n\t"
"mov esi, %[src] \n\t"
"xor eax, eax \n\t"
"sub ecx,1 \n\t"
"L%=: \n\t"
"movq mm0, QWORD PTR [esi+ecx*8] \n\t"
"movq QWORD PTR [edi+ecx*8], mm0 \n\t"
"sub ecx, 1 \n\t"
"jge L%= \n\t"
: // no outputs
: // inputs
[dst] "m" (dst),
[src] "m" (src),
[n] "g" (n)
: // register clobber
"eax","ecx","edi","esi",
"mm0"
);
}
void printarray(double* restrict a, int n)
{
for(int i = 0; i < n; ++i) {
printf(" %f ", *(a++));
}
printf("\n");
}
double residual(const double* restrict dst,
const double* restrict src,
const int n)
{
double residual = 0.0;
for(int i = 0; i < n; ++i)
residual += *(dst++) - *(src++);
return(residual);
}
int main()
{
double *A = NULL;
double *B = NULL;
int n = 8;
double memops;
double time3;
clock_t time1;
// LARGE_INTEGER frequency, time1, time2;
// QueryPerformanceFrequency(&frequency);
int trials = 1 << 0;
A = _mm_malloc(n*sizeof(*A), 64);
B = _mm_malloc(n*sizeof(*B), 64);
srand(time(NULL));
for(int i = 0; i < n; ++i)
*(A+i) = (double) rand()/RAND_MAX;
// QueryPerformanceCounter(&time1);
time1 = clock();
for(int i = 0; i < trials; ++i)
copyarray(B,A,n);
// QueryPerformanceCounter(&time2);
// time3 = (double)(time2.QuadPart - time1.QuadPart) / frequency.QuadPart;
time3 = (double) (clock() - time1)/CLOCKS_PER_SEC;
memops = (double) trials*n/time3*sizeof(*A)/1.0e9;
printf("Elapsed time: %f seconds, Gigabytes copied per second: %f GB\n",time3, memops);
printf("Residual = %f\n",residual(B,A,n));
printarray(A,n);
printarray(B,n);
_mm_free(A);
_mm_free(B);
}

You have to be careful when mixing MMX with floating point - use SSE instead if possible. If you must use MMX then read the section titled "MMX - State Management" on this page - note the requirement for the emms instruction after any MMX instructions before you next perform any floating point operations.

Userland interrupt timer access such as via KeQueryInterruptTime (or similar)

Is there a "Nt" or similar (i.e. non-kernelmode-driver) function equivalent for KeQueryInterruptTime or anything similar? There seems to be no such thing as NtQueryInterruptTime, at least I've not found it.
What I want is some kind of reasonably accurate and reliable, monotonic timer (thus not QPC) which is reasonably efficient and doesn't have surprises as an overflowing 32-bit counter, and no unnecessary "smartness", no time zones, or complicated structures.
So ideally, I want something like timeGetTime with a 64 bit value. It doesn't even have to be the same timer.
There exists GetTickCount64 starting with Vista, which would be acceptable as such, but I'd not like to break XP support only for such a stupid reason.
Reading the quadword at 0x7FFE0008 as indicated here ... well, works ... and it proves that indeed the actual internal counter is 64 bits under XP (it's also as fast as it could possibly get), but meh... let's not talk about what a kind of nasty hack it is to read some unknown, hardcoded memory location.
There must certainly be something in between calling an artificially stupefied (scaling a 64 bit counter down to 32 bits) high-level API function and reading a raw memory address?

Here's an example of a thread-safe wrapper for GetTickCount() extending the tick count value to 64 bits and in that being equivalent to GetTickCount64().
To avoid undesired counter roll overs, make sure to call this function a few times every 49.7 days. You can even have a dedicated thread whose only purpose would be to call this function and then sleep some 20 days in an infinite loop.
ULONGLONG MyGetTickCount64(void)
{
static volatile LONGLONG Count = 0;
LONGLONG curCount1, curCount2;
LONGLONG tmp;
curCount1 = InterlockedCompareExchange64(&Count, 0, 0);
curCount2 = curCount1 & 0xFFFFFFFF00000000;
curCount2 |= GetTickCount();
if ((ULONG)curCount2 < (ULONG)curCount1)
{
curCount2 += 0x100000000;
}
tmp = InterlockedCompareExchange64(&Count, curCount2, curCount1);
if (tmp == curCount1)
{
return curCount2;
}
else
{
return tmp;
}
}
EDIT: And here's a complete application that tests MyGetTickCount64().
// Compiled with Open Watcom C 1.9: wcl386.exe /we /wx /q gettick.c
#include <windows.h>
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
//
// The below code is an ugly implementation of InterlockedCompareExchange64()
// that is apparently missing in Open Watcom C 1.9.
// It must work with MSVC++ too, however.
//
UINT8 Cmpxchg8bData[] =
{
0x55, // push ebp
0x89, 0xE5, // mov ebp, esp
0x57, // push edi
0x51, // push ecx
0x53, // push ebx
0x8B, 0x7D, 0x10, // mov edi, [ebp + 0x10]
0x8B, 0x07, // mov eax, [edi]
0x8B, 0x57, 0x04, // mov edx, [edi + 0x4]
0x8B, 0x7D, 0x0C, // mov edi, [ebp + 0xc]
0x8B, 0x1F, // mov ebx, [edi]
0x8B, 0x4F, 0x04, // mov ecx, [edi + 0x4]
0x8B, 0x7D, 0x08, // mov edi, [ebp + 0x8]
0xF0, // lock:
0x0F, 0xC7, 0x0F, // cmpxchg8b [edi]
0x5B, // pop ebx
0x59, // pop ecx
0x5F, // pop edi
0x5D, // pop ebp
0xC3 // ret
};
LONGLONG (__cdecl *Cmpxchg8b)(LONGLONG volatile* Dest, LONGLONG* Exch, LONGLONG* Comp) =
(LONGLONG (__cdecl *)(LONGLONG volatile*, LONGLONG*, LONGLONG*))Cmpxchg8bData;
LONGLONG MyInterlockedCompareExchange64(LONGLONG volatile* Destination,
LONGLONG Exchange,
LONGLONG Comparand)
{
return Cmpxchg8b(Destination, &Exchange, &Comparand);
}
#ifdef InterlockedCompareExchange64
#undef InterlockedCompareExchange64
#endif
#define InterlockedCompareExchange64(Destination, Exchange, Comparand) \
MyInterlockedCompareExchange64(Destination, Exchange, Comparand)
//
// This stuff makes a thread-safe printf().
// We don't want characters output by one thread to be mixed
// with characters output by another. We want printf() to be
// "atomic".
// We use a critical section around vprintf() to achieve "atomicity".
//
static CRITICAL_SECTION PrintfCriticalSection;
int ts_printf(const char* Format, ...)
{
int count;
va_list ap;
EnterCriticalSection(&PrintfCriticalSection);
va_start(ap, Format);
count = vprintf(Format, ap);
va_end(ap);
LeaveCriticalSection(&PrintfCriticalSection);
return count;
}
#define TICK_COUNT_10MS_INCREMENT 0x800000
//
// This is the simulated tick counter.
// Its low 32 bits are going to be returned by
// our, simulated, GetTickCount().
//
// TICK_COUNT_10MS_INCREMENT is what the counter is
// incremented by every time. The value is so chosen
// that the counter quickly overflows in its
// low 32 bits.
//
static volatile LONGLONG SimulatedTickCount = 0;
//
// This is our simulated 32-bit GetTickCount()
// that returns a count that often overflows.
//
ULONG SimulatedGetTickCount(void)
{
return (ULONG)SimulatedTickCount;
}
//
// This thread function will increment the simulated tick counter
// whose value's low 32 bits we'll be reading in SimulatedGetTickCount().
//
DWORD WINAPI SimulatedTickThread(LPVOID lpParameter)
{
UNREFERENCED_PARAMETER(lpParameter);
for (;;)
{
LONGLONG c;
Sleep(10);
// Get the counter value, add TICK_COUNT_10MS_INCREMENT to it and
// store the result back.
c = InterlockedCompareExchange64(&SimulatedTickCount, 0, 0);
InterlockedCompareExchange64(&SimulatedTickCount, c + TICK_COUNT_10MS_INCREMENT, c) != c);
}
return 0;
}
volatile LONG CountOfObserved32bitOverflows = 0;
volatile LONG CountOfObservedUpdateRaces = 0;
//
// This prints statistics that includes the true 64-bit value of
// SimulatedTickCount that we can't get from SimulatedGetTickCount() as it
// returns only its lower 32 bits.
//
// The stats also include:
// - the number of times that MyGetTickCount64() observes an overflow of
// SimulatedGetTickCount()
// - the number of times MyGetTickCount64() fails to update its internal
// counter because of a concurrent update in another thread.
//
void PrintStats(void)
{
LONGLONG true64bitCounter = InterlockedCompareExchange64(&SimulatedTickCount, 0, 0);
ts_printf(" 0x%08X`%08X <- true 64-bit count; ovfs: ~%d; races: %d\n",
(ULONG)(true64bitCounter >> 32),
(ULONG)true64bitCounter,
CountOfObserved32bitOverflows,
CountOfObservedUpdateRaces);
}
//
// This is our poor man's implementation of GetTickCount64()
// on top of GetTickCount().
//
// It's thread safe.
//
// When used with actual GetTickCount() instead of SimulatedGetTickCount()
// it must be called at least a few times in 49.7 days to ensure that
// it doesn't miss any overflows in GetTickCount()'s return value.
//
ULONGLONG MyGetTickCount64(void)
{
static volatile LONGLONG Count = 0;
LONGLONG curCount1, curCount2;
LONGLONG tmp;
curCount1 = InterlockedCompareExchange64(&Count, 0, 0);
curCount2 = curCount1 & 0xFFFFFFFF00000000;
curCount2 |= SimulatedGetTickCount();
if ((ULONG)curCount2 < (ULONG)curCount1)
{
curCount2 += 0x100000000;
InterlockedIncrement(&CountOfObserved32bitOverflows);
}
tmp = InterlockedCompareExchange64(&Count, curCount2, curCount1);
if (tmp != curCount1)
{
curCount2 = tmp;
InterlockedIncrement(&CountOfObservedUpdateRaces);
}
return curCount2;
}
//
// This is an error counter. If a thread that uses MyGetTickCount64() notices
// any problem with what MyGetTickCount64() returns, it bumps up this error
// counter and stops. If one of threads sees a non-zero value in this
// counter due to an error in another thread, it stops as well.
//
volatile LONG Error = 0;
//
// This is a thread function that will be using MyGetTickCount64(),
// validating its return value and printing some stats once in a while.
//
// This function is meant to execute concurrently in multiple threads
// to create race conditions inside of MyGetTickCount64() and test it.
//
DWORD WINAPI TickUserThread(LPVOID lpParameter)
{
DWORD user = (DWORD)lpParameter; // thread number
ULONGLONG ticks[4];
ticks[3] = ticks[2] = ticks[1] = MyGetTickCount64();
while (!Error)
{
ticks[0] = ticks[1];
ticks[1] = MyGetTickCount64();
// Every ~100 ms sleep a little (slightly lowers CPU load, to about 90%)
if (ticks[1] > ticks[2] + TICK_COUNT_10MS_INCREMENT * 10L)
{
ticks[2] = ticks[1];
Sleep(1 + rand() % 20);
}
// Every ~1000 ms print the last value from MyGetTickCount64().
// Thread 1 also prints stats here.
if (ticks[1] > ticks[3] + TICK_COUNT_10MS_INCREMENT * 100L)
{
ticks[3] = ticks[1];
ts_printf("%u:0x%08X`%08X\n", user, (ULONG)(ticks[1] >> 32), (ULONG)ticks[1]);
if (user == 1)
{
PrintStats();
}
}
if (ticks[0] > ticks[1])
{
ts_printf("%u:Non-monotonic tick counts: 0x%016llX > 0x%016llX!\n",
user,
ticks[0],
ticks[1]);
PrintStats();
InterlockedIncrement(&Error);
return -1;
}
else if (ticks[0] + 0x100000000 <= ticks[1])
{
ts_printf("%u:Too big tick count jump: 0x%016llX -> 0x%016llX!\n",
user,
ticks[0],
ticks[1]);
PrintStats();
InterlockedIncrement(&Error);
return -1;
}
Sleep(0); // be nice, yield to other threads.
}
return 0;
}
//
// This prints stats upon Ctrl+C and terminates the program.
//
BOOL WINAPI ConsoleEventHandler(DWORD Event)
{
if (Event == CTRL_C_EVENT)
{
PrintStats();
}
return FALSE;
}
int main(void)
{
HANDLE simulatedTickThreadHandle;
HANDLE tickUserThreadHandle;
DWORD dummy;
// This is for the missing InterlockedCompareExchange64() workaround.
VirtualProtect(Cmpxchg8bData, sizeof(Cmpxchg8bData), PAGE_EXECUTE_READWRITE, &dummy);
InitializeCriticalSection(&PrintfCriticalSection);
if (!SetConsoleCtrlHandler(&ConsoleEventHandler, TRUE))
{
ts_printf("SetConsoleCtrlHandler(&ConsoleEventHandler) failed with error 0x%X\n", GetLastError());
return -1;
}
// Start the tick simulator thread.
simulatedTickThreadHandle = CreateThread(NULL, 0, &SimulatedTickThread, NULL, 0, NULL);
if (simulatedTickThreadHandle == NULL)
{
ts_printf("CreateThread(&SimulatedTickThread) failed with error 0x%X\n", GetLastError());
return -1;
}
// Start one thread that'll be using MyGetTickCount64().
tickUserThreadHandle = CreateThread(NULL, 0, &TickUserThread, (LPVOID)2, 0, NULL);
if (tickUserThreadHandle == NULL)
{
ts_printf("CreateThread(&TickUserThread) failed with error 0x%X\n", GetLastError());
return -1;
}
// The other thread using MyGetTickCount64() will be the main thread.
TickUserThread((LPVOID)1);
//
// The app terminates upon any error condition detected in TickUserThread()
// in any of the threads or by Ctrl+C.
//
return 0;
}
As a test I've been running this test app under Windows XP for 5+ hours on an otherwise idle machine that has 2 CPUs (idle, to avoid potential long starvation times and therefore avoid missing counter overflows that occur every 5 seconds) and it's still doing well.
Here's the latest output from the console:
2:0x00000E1B`C8800000
1:0x00000E1B`FA800000
0x00000E1B`FA800000 <- true 64-bit count; ovfs: ~3824; races: 110858
As you can see, MyGetTickCount64() has observed 3824 32-bit overflows and failed to update the value of Count with its second InterlockedCompareExchange64() 110858 times. So, overflows indeed occur and the last number means that the variable is, in fact, being concurrently updated by the two threads.
You can also see that the 64-bit tick counts that the two threads receive from MyGetTickCount64() in TickUserThread() don't have anything missing in the top 32 bits and are pretty close to the actual 64-bit tick count in SimulatedTickCount, whose 32 low bits are returned by SimulatedGetTickCount(). 0x00000E1BC8800000 is visually behind 0x00000E1BFA800000 due to thread scheduling and infrequent stat prints, it's behind by exactly 100*TICK_COUNT_10MS_INCREMENT, or 1 second. Internally, of course, the difference is much smaller.
Now, on availability of InterlockedCompareExchange64()... It's a bit odd that it's officially available since Windows Vista and Windows Server 2003. Server 2003 is in fact build from the same code base as Windows XP.
But the most important thing here is that this function is built on top of the Pentium CMPXCHG8B instruction that's been available since 1998 or earlier (1), (2). And I can see this instruction in my Windows XP's (SP3) binaries. It's in ntkrnlpa.exe/ntoskrnl.exe (the kernel) and ntdll.dll (the DLL that exports kernel's NtXxxx() functions that everything's built upon). Look for a byte sequence of 0xF0, 0x0F, 0xC7 and disassemble the code around that place to see that these bytes aren't there coincidentally.
You can check availability of this instruction through the CPUID instruction (EDX bit 8 of CPUID function 0x00000001 and function 0x80000001) and refuse to run instead of crashing if the instruction isn't there, but these days you're unlikely to find a machine that doesn't support this instruction. If you do, it won't be a good machine for Windows XP and probably your application as well anyways.

Thanks to Google Books which kindly offered the relevant literature for free, I came up with an easy and fast implementation of GetTickCount64 which works perfectly well on pre-Vista systems too (and it still is somewhat less nasty than reading a value from a hardcoded memory address).
It is in fact as easy as calling interrupt 0x2A, which maps to KiGetTickCount. In GCC inline assembly, this gives:
static __inline__ __attribute__((always_inline)) unsigned long long get_tick_count64()
{
unsigned long long ret;
__asm__ __volatile__ ("int $0x2a" : "=A"(ret) : : );
return ret;
}
Due to the way KiGetTickCount works, the function should probably better be called GetTickCount46, as it performs a right shift by 18, returning 46 bits, not 64. Though the same is true for the original Vista version, too.
Note that KiGetTickCount clobbers edx, this is relevant if you plan to implement your own faster implementation of the 32-bit version (must add edx to the clobber list in that case!).

Here's another approach, a variant of Alex's wrapper but using only 32-bit interlocks. It only actually returns a 60-bit number, but that's still good for about thirty-six million years. :-)
It does need to be called more often, at least once every three days. That shouldn't normally be a major drawback.
ULONGLONG MyTickCount64(void)
{
static volatile DWORD count = 0xFFFFFFFF;
DWORD previous_count, current_tick32, previous_count_zone, current_tick32_zone;
ULONGLONG current_tick64;
previous_count = InterlockedCompareExchange(&count, 0, 0);
current_tick32 = GetTickCount();
if (previous_count == 0xFFFFFFFF)
{
// count has never been written
DWORD initial_count;
initial_count = current_tick32 >> 28;
previous_count = InterlockedCompareExchange(&count, initial_count, 0xFFFFFFFF);
if (previous_count == 0xFFFFFFFF)
{ // This thread wrote the initial value for count
previous_count = initial_count;
}
else if (previous_count != initial_count)
{ // Another thread wrote the initial value for count,
// and it differs from the one we calculated
current_tick32 = GetTickCount();
}
}
previous_count_zone = previous_count & 15;
current_tick32_zone = current_tick32 >> 28;
if (current_tick32_zone == previous_count_zone)
{
// The top four bits of the 32-bit tick count haven't changed since count was last written.
current_tick64 = previous_count;
current_tick64 <<= 28;
current_tick64 += current_tick32 & 0x0FFFFFFF;
return current_tick64;
}
if (current_tick32_zone == previous_count_zone + 1 || (current_tick32_zone == 0 && previous_count_zone == 15))
{
// The top four bits of the 32-bit tick count have been incremented since count was last written.
InterlockedCompareExchange(&count, previous_count + 1, previous_count);
current_tick64 = previous_count + 1;
current_tick64 <<= 28;
current_tick64 += current_tick32 & 0x0FFFFFFF;
return current_tick64;
}
// Oops, we weren't called often enough, we're stuck
return 0xFFFFFFFF;
}

Fastest sort of fixed length 6 int array

Answering to another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 integers?
As the question is very low level:
we can't assume libraries are available (and the call itself has its cost), only plain C
to avoid emptying instruction pipeline (that has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow breaking (like those hidden behind sequence points in && or ||).
room is constrained and minimizing registers and memory use is an issue, ideally in place sort is probably best.
Really this question is a kind of Golf where the goal is not to minimize source length but execution time. I call it 'Zening' code as used in the title of the book Zen of Code optimization by Michael Abrash and its sequels.
As for why it is interesting, there is several layers:
the example is simple and easy to understand and measure, not much C skill involved
it shows effects of choice of a good algorithm for the problem, but also effects of the compiler and underlying hardware.
Here is my reference (naive, not optimized) implementation and my test set.
#include <stdio.h>
static __inline__ int sort6(int * d){
char j, i, imin;
int tmp;
for (j = 0 ; j < 5 ; j++){
imin = j;
for (i = j + 1; i < 6 ; i++){
if (d[i] < d[imin]){
imin = i;
}
}
tmp = d[j];
d[j] = d[imin];
d[imin] = tmp;
}
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
return x;
}
int main(int argc, char ** argv){
int i;
int d[6][5] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
    sort6(d[i]);
    /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
     *  d[i][0], d[i][6], d[i][7],
      *  d[i][8], d[i][9], d[i][10]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}
Raw results
As number of variants is becoming large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those showed above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested by behavior on different target architecture/compilers. (OK guys, put it in answers, I will +1 every contributor of a new resultset).
I gave the answer to Daniel Stutzbach (for golfing) one year ago as he was at the source of the fastest solution at that time (sorting networks).
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2
Direct call to qsort library function : 689.38
Naive implementation (insertion sort) : 285.70
Insertion Sort (Daniel Stutzbach) : 142.12
Insertion Sort Unrolled : 125.47
Rank Order : 102.26
Rank Order with registers : 58.03
Sorting Networks (Daniel Stutzbach) : 111.68
Sorting Networks (Paul R) : 66.36
Sorting Networks 12 with Fast Swap : 58.86
Sorting Networks 12 reordered Swap : 53.74
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.54
Reordered Sorting Network w/ fast swap V2 : 33.63
Inlined Bubble Sort (Paolo Bonzini) : 48.85
Unrolled Insertion Sort (Paolo Bonzini) : 75.30
Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1
Direct call to qsort library function : 705.93
Naive implementation (insertion sort) : 135.60
Insertion Sort (Daniel Stutzbach) : 142.11
Insertion Sort Unrolled : 126.75
Rank Order : 46.42
Rank Order with registers : 43.58
Sorting Networks (Daniel Stutzbach) : 115.57
Sorting Networks (Paul R) : 64.44
Sorting Networks 12 with Fast Swap : 61.98
Sorting Networks 12 reordered Swap : 54.67
Sorting Networks 12 reordered Simple Swap : 31.54
Reordered Sorting Network w/ fast swap : 31.24
Reordered Sorting Network w/ fast swap V2 : 33.07
Inlined Bubble Sort (Paolo Bonzini) : 45.79
Unrolled Insertion Sort (Paolo Bonzini) : 80.15
I included both -O1 and -O2 results because surprisingly for several programs O2 is less efficient than O1. I wonder what specific optimization has this effect ?
Comments on proposed solutions
Insertion Sort (Daniel Stutzbach)
As expected minimizing branches is indeed a good idea.
Sorting Networks (Daniel Stutzbach)
Better than insertion sort. I wondered if the main effect was not get from avoiding the external loop. I gave it a try by unrolled insertion sort to check and indeed we get roughly the same figures (code is here).
Sorting Networks (Paul R)
The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing ? Fast max ?
Sorting Networks 12 SWAP with Fast Swap
As suggested by Daniel Stutzbach, I combined his 12 swap sorting network with branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%) as could be expected using 1 less swap.
It is also interesting to notice that the branchless swap seems to be much (4 times) less efficient than the simple one using if on PPC architecture.
Calling Library qsort
To give another reference point I also tried as suggested to just call library qsort (code is here). As expected it is much slower : 10 to 30 times slower... as it became obvious with the new test suite, the main problem seems to be the initial load of the library after the first call, and it compares not so poorly with other version. It is just between 3 and 20 times slower on my Linux. On some architecture used for tests by others it seems even to be faster (I'm really surprised by that one, as library qsort use a more complex API).
Rank order
Rex Kerr proposed another completely different method : for each item of the array compute directly its final position. This is efficient because computing rank order do not need branch. The drawback of this method is that it takes three times the amount of memory of the array (one copy of array and variables to store rank orders). The performance results are very surprising (and interesting). On my reference architecture with 32 bits OS and Intel Core2 Quad E8300, cycle count was slightly below 1000 (like sorting networks with branching swap). But when compiled and executed on my 64 bits box (Intel Core2 Duo) it performed much better : it became the fastest so far. I finally found out the true reason. My 32bits box use gcc 4.4.1 and my 64bits box gcc 4.4.3 and the last one seems much better at optimizing this particular code (there was very little difference for other proposals).
update:
As published figures above shows this effect was still enhanced by later versions of gcc and Rank Order became consistently twice as fast as any other alternative.
Sorting Networks 12 with reordered Swap
The amazing efficiency of the Rex Kerr proposal with gcc 4.4.3 made me wonder : how could a program with 3 times as much memory usage be faster than branchless sorting networks? My hypothesis was that it had less dependencies of the kind read after write, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder swaps to minimize read after write dependencies. More simply put: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to be finished before performing the second one because both access to a common memory cell. When you do SWAP(1, 2); SWAP(4, 5);the processor can execute both in parallel. I tried it and it works as expected, the sorting networks is running about 10% faster.
Sorting Networks 12 with Simple Swap
One year after the original post Steinar H. Gunderson suggested, that we should not try to outsmart the compiler and keep the swap code simple. It's indeed a good idea as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly code that can still spare some more cycles. The most surprising (it says volumes on programmer's psychology) is that one year ago none of used tried that version of swap. Code I used to test is here. Others suggested other ways to write a C fast swap, but it yields the same performances as the simple one with a decent compiler.
The "best" code is now as follow:
static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x)
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
const int b = max(d[x], d[y]); \
d[x] = a; d[y] = b; }
SWAP(1, 2);
SWAP(4, 5);
SWAP(0, 2);
SWAP(3, 5);
SWAP(0, 1);
SWAP(3, 4);
SWAP(1, 4);
SWAP(0, 3);
SWAP(2, 5);
SWAP(1, 3);
SWAP(2, 4);
SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}
If we believe our test set (and, yes it is quite poor, it's mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That put each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible ?

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.
Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted dat, so it will be the better choice if there's an above-average chance of almost-sorted data.
The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.
Here's an insertion sort implementation:
static __inline__ int sort6(int *d){
int i, j;
for (i = 1; i < 6; i++) {
int tmp = d[i];
for (j = i; j >= 1 && tmp < d[j-1]; j--)
d[j] = d[j-1];
d[j] = tmp;
}
}
Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:
static __inline__ int sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
SWAP(1, 2);
SWAP(0, 2);
SWAP(0, 1);
SWAP(4, 5);
SWAP(3, 5);
SWAP(3, 4);
SWAP(0, 3);
SWAP(1, 4);
SWAP(2, 5);
SWAP(2, 4);
SWAP(1, 3);
SWAP(2, 3);
#undef SWAP
}

Here's an implementation using sorting networks:
inline void Sort2(int *p0, int *p1)
{
const int temp = min(*p0, *p1);
*p1 = max(*p0, *p1);
*p0 = temp;
}
inline void Sort3(int *p0, int *p1, int *p2)
{
Sort2(p0, p1);
Sort2(p1, p2);
Sort2(p0, p1);
}
inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
Sort2(p0, p1);
Sort2(p2, p3);
Sort2(p0, p2);
Sort2(p1, p3);
Sort2(p1, p2);
}
inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
Sort3(p0, p1, p2);
Sort3(p3, p4, p5);
Sort2(p0, p3);
Sort2(p2, p5);
Sort4(p1, p2, p3, p4);
}
You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each, in total). I leave this as an exercise for the reader.
Note that this implementation lends itself easily to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).
See also: Fast algorithm implementation to sort very small list

Since these are integers and compares are fast, why not compute the rank order of each directly:
inline void sort6(int *d) {
int e[6];
memcpy(e,d,6*sizeof(int));
int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
int o5 = 15-(o0+o1+o2+o3+o4);
d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}

Looks like I got to the party a year late, but here we go...
Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads at stores to be as close as possible to there the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.
I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.
I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).
#include <stdio.h>
static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
register int x0,x1,x2,x3,x4,x5;
x1 = d[1];
x2 = d[2];
SWAP(x1, x2);
x4 = d[4];
x5 = d[5];
SWAP(x4, x5);
x0 = d[0];
SWAP(x0, x2);
x3 = d[3];
SWAP(x3, x5);
SWAP(x0, x1);
SWAP(x3, x4);
SWAP(x1, x4);
SWAP(x0, x3);
d[0] = x0;
SWAP(x2, x5);
d[5] = x5;
SWAP(x1, x3);
d[1] = x1;
SWAP(x2, x4);
d[4] = x4;
SWAP(x2, x3);
d[2] = x2;
d[3] = x3;
#undef SWAP
#undef min
#undef max
}
static __inline__ unsigned long long rdtsc(void)
{
unsigned long long int x;
__asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
return x;
}
void ran_fill(int n, int *a) {
static int seed = 76521;
while (n--) *a++ = (seed = seed *1812433253 + 12345);
}
#define NTESTS 4096
int main() {
int i;
int d[6*NTESTS];
ran_fill(6*NTESTS, d);
unsigned long long cycles = rdtsc();
for (i = 0; i < 6*NTESTS ; i+=6) {
sort6_fast(d+i);
}
cycles = rdtsc() - cycles;
printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);
for (i = 0; i < 6*NTESTS ; i+=6) {
if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
printf("d%d : %d %d %d %d %d %d\n", i,
d[i+0], d[i+1], d[i+2],
d[i+3], d[i+4], d[i+5]);
}
return 0;
}
I changed modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well), here are the results on some different architectures. I attempted testing on an AMD cpu but rdtsc isn't reliable on the X6 1100T I have available.
Clarkdale (i5-650)
==================
Direct call to qsort library function 635.14 575.65 581.61 577.76 521.12
Naive implementation (insertion sort) 538.30 135.36 134.89 240.62 101.23
Insertion Sort (Daniel Stutzbach) 424.48 159.85 160.76 152.01 151.92
Insertion Sort Unrolled 339.16 125.16 125.81 129.93 123.16
Rank Order 184.34 106.58 54.74 93.24 94.09
Rank Order with registers 127.45 104.65 53.79 98.05 97.95
Sorting Networks (Daniel Stutzbach) 269.77 130.56 128.15 126.70 127.30
Sorting Networks (Paul R) 551.64 103.20 64.57 73.68 73.51
Sorting Networks 12 with Fast Swap 321.74 61.61 63.90 67.92 67.76
Sorting Networks 12 reordered Swap 318.75 60.69 65.90 70.25 70.06
Reordered Sorting Network w/ fast swap 145.91 34.17 32.66 32.22 32.18
Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function 870.01 736.39 723.39 725.48 721.85
Naive implementation (insertion sort) 503.67 174.09 182.13 284.41 191.10
Insertion Sort (Daniel Stutzbach) 345.32 152.84 157.67 151.23 150.96
Insertion Sort Unrolled 316.20 133.03 129.86 118.96 105.06
Rank Order 164.37 138.32 46.29 99.87 99.81
Rank Order with registers 115.44 116.02 44.04 116.04 116.03
Sorting Networks (Daniel Stutzbach) 230.35 114.31 119.15 110.51 111.45
Sorting Networks (Paul R) 498.94 77.24 63.98 62.17 65.67
Sorting Networks 12 with Fast Swap 315.98 59.41 58.36 60.29 55.15
Sorting Networks 12 reordered Swap 307.67 55.78 51.48 51.67 50.74
Reordered Sorting Network w/ fast swap 149.68 31.46 30.91 31.54 31.58
Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function 559.97 451.88 464.84 491.35 458.11
Naive implementation (insertion sort) 341.15 160.26 160.45 154.40 106.54
Insertion Sort (Daniel Stutzbach) 284.17 136.74 132.69 123.85 121.77
Insertion Sort Unrolled 239.40 110.49 114.81 110.79 117.30
Rank Order 114.24 76.42 45.31 36.96 36.73
Rank Order with registers 105.09 32.31 48.54 32.51 33.29
Sorting Networks (Daniel Stutzbach) 210.56 115.68 116.69 107.05 124.08
Sorting Networks (Paul R) 364.03 66.02 61.64 45.70 44.19
Sorting Networks 12 with Fast Swap 246.97 41.36 59.03 41.66 38.98
Sorting Networks 12 reordered Swap 235.39 38.84 47.36 38.61 37.29
Reordered Sorting Network w/ fast swap 115.58 27.23 27.75 27.25 26.54
Nehalem (Xeon E5640)
====================
Direct call to qsort library function 911.62 890.88 681.80 876.03 872.89
Naive implementation (insertion sort) 457.69 236.87 127.68 388.74 175.28
Insertion Sort (Daniel Stutzbach) 317.89 279.74 147.78 247.97 245.09
Insertion Sort Unrolled 259.63 220.60 116.55 221.66 212.93
Rank Order 140.62 197.04 52.10 163.66 153.63
Rank Order with registers 84.83 96.78 50.93 109.96 54.73
Sorting Networks (Daniel Stutzbach) 214.59 220.94 118.68 120.60 116.09
Sorting Networks (Paul R) 459.17 163.76 56.40 61.83 58.69
Sorting Networks 12 with Fast Swap 284.58 95.01 50.66 53.19 55.47
Sorting Networks 12 reordered Swap 281.20 96.72 44.15 56.38 54.57
Reordered Sorting Network w/ fast swap 128.34 50.87 26.87 27.91 28.02

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.
That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to
#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }
and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...
I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.
This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.
It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.
I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.
The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...
PUBLIC simd_sort_6
.DATA
ALIGN 16
pass1_shuffle OWORD 0F0E0D0C0B0A09080706040503010200h
pass1_add OWORD 0F0E0D0C0B0A09080706050503020200h
pass2_shuffle OWORD 0F0E0D0C0B0A09080706030405000102h
pass2_and OWORD 00000000000000000000FE00FEFE00FEh
pass2_add OWORD 0F0E0D0C0B0A09080706050405020102h
pass3_shuffle OWORD 0F0E0D0C0B0A09080706020304050001h
pass3_and OWORD 00000000000000000000FDFFFFFDFFFFh
pass3_add OWORD 0F0E0D0C0B0A09080706050404050101h
pass4_shuffle OWORD 0F0E0D0C0B0A09080706050100020403h
pass4_and OWORD 0000000000000000000000FDFD00FDFDh
pass4_add OWORD 0F0E0D0C0B0A09080706050403020403h
pass5_shuffle OWORD 0F0E0D0C0B0A09080706050201040300h
pass5_and OWORD 0000000000000000000000FEFEFEFE00h
pass5_add OWORD 0F0E0D0C0B0A09080706050403040300h
pass6_shuffle OWORD 0F0E0D0C0B0A09080706050402030100h
pass6_add OWORD 0F0E0D0C0B0A09080706050403030100h
.CODE
simd_sort_6 PROC FRAME
.endprolog
; pxor xmm4, xmm4
; pinsrd xmm4, dword ptr [rcx], 0
; pinsrb xmm4, byte ptr [rcx + 4], 4
; pinsrb xmm4, byte ptr [rcx + 5], 5
; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer. Same on extract
; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
movd xmm4, dword ptr [rcx]
pinsrw xmm4, word ptr [rcx + 4], 2 ; word 2 = bytes 4 and 5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass1_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass1_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass2_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass2_and]
paddb xmm5, oword ptr [pass2_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass3_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass3_and]
paddb xmm5, oword ptr [pass3_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass4_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass4_and]
paddb xmm5, oword ptr [pass4_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass5_shuffle]
pcmpgtb xmm5, xmm4
pand xmm5, oword ptr [pass5_and]
paddb xmm5, oword ptr [pass5_add]
pshufb xmm4, xmm5
movdqa xmm5, xmm4
pshufb xmm5, oword ptr [pass6_shuffle]
pcmpgtb xmm5, xmm4
paddb xmm5, oword ptr [pass6_add]
pshufb xmm4, xmm5
;pextrd dword ptr [rcx], xmm4, 0 ; benchmarked with this
;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
;pextrb byte ptr [rcx + 5], xmm4, 5
movd dword ptr [rcx], xmm4
pextrw word ptr [rcx + 4], xmm4, 2 ; x86 is little-endian, so this is the right order
ret
simd_sort_6 ENDP
END
You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code:
void simd_sort_6(char *values);

While I really like the swap macro provided:
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }
I see an improvement (which a good compiler might make):
#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }
We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.

Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:
#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }
(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.
Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)
By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two have some variation in the results, can't say really which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.
Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);
and here is the insertion sort:
//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
int t;
t = d[1]; ITER(0);
t = d[2]; ITER(1); ITER(0);
t = d[3]; ITER(2); ITER(1); ITER(0);
t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:
MOV r6, r2
CMP r6, r1
MOVLT r2, r1
MOVLT r1, r6
CMP r6, r0
MOVLT r1, r0
MOVLT r0, r6
For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch code, just increase the iterations of the test, use 8 test cases to avoid polluting results with mods and replace the x86 specific rdtsc):
Direct call to qsort library function : 101
Naive implementation (insertion sort) : 299
Insertion Sort (Daniel Stutzbach) : 108
Insertion Sort Unrolled : 51
Sorting Networks (Daniel Stutzbach) : 26
Sorting Networks (Paul R) : 85
Sorting Networks 12 with Fast Swap : 117
Sorting Networks 12 reordered Swap : 116
Rank Order : 56

An XOR swap may be useful in your swapping functions.
void xorSwap (int *x, int *y) {
if (*x != *y) {
*x ^= *y;
*y ^= *x;
*x ^= *y;
}
}
The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.
Here's the last set of times:
**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)** : 85
**Insertion Sort Unrolled** : 97
**Sorting Networks (Daniel Stutzbach)** : 457
**Sorting Networks (Paul R)** : 179
**Sorting Networks 12 with Fast Swap** : 238
**Sorting Networks 12 reordered Swap** : 236
**Rank Order** : 116

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.
void shellsort (int *valp)
{
int c,a,*cp,*ip=valp,*ep=valp+5;
c=*valp; a=*(valp+4);if (c>a) {*valp= a;*(valp+4)=c;}
c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}
cp=ip;
do
{
c=*cp;
a=*(cp+1);
do
{
if (c<a) break;
*cp=a;
*(cp+1)=c;
cp-=1;
c=*cp;
} while (cp>=valp);
ip+=1;
cp=ip;
} while (ip<ep);
}
On my HP dv7-3010so laptop with a dual-core Athlon M300 # 2 Ghz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6!/720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.
I do not have a 64-bit environment to compile on.

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.
Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).
static __inline__ int sort6(int * d) {
char j, i;
int tmp;
for (inc = 4; inc > 0; inc -= 3) {
for (i = inc; i < 5; i++) {
tmp = a[i];
j = i;
while (j >= inc && a[j - inc] > tmp) {
a[j] = a[j - inc];
j -= inc;
}
a[j] = tmp;
}
}
}
I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...
According to Wikipedia this can even be combined with sorting networks:
Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead-ends so that others wouldn't try it. Among this was my first solution, which attempted to ensure that x1>x2 was calculated once. After optimization, it is no faster than the other, simple versions.
I added a looping version of rank order sort, since my own application of this study is for sorting 2-8 items, so since there are a variable number of arguments, a loop is necessary. This is also why I ignored the sorting network solutions.
The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.
Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.
static inline void sort6_insertion_sort_avx(int* d) {
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
__m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
__m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
__m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
INT_MAX, INT_MAX, INT_MAX, INT_MAX);
__m256i val, gt, permute;
unsigned j;
// 8 / 32 = 2^-2
#define ITER(I) \
val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
gt = _mm256_cmpgt_epi32(sorted, val);\
permute = _mm256_blendv_epi8(index, shlpermute, gt);\
j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
ITER(1);
ITER(2);
ITER(3);
ITER(4);
ITER(5);
int x[8];
_mm256_storeu_si256((__m256i*)x, sorted);
d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}
Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices
static inline void sort6_rank_order_avx(int* d) {
__m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
__m256i one = _mm256_set1_epi32(1);
__m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
__m256i rot = src;
__m256i index = _mm256_setzero_si256();
__m256i gt, permute;
__m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
__m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
__m256i srcIx = dstIx;
__m256i eq = one;
__m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
rot = _mm256_permutevar8x32_epi32(rot, ror);\
gt = _mm256_cmpgt_epi32(src, rot);\
index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
_mm256_cmpeq_epi32(src, rot)));\
eq = _mm256_insert_epi32(eq, 0, I)
INC(0);
INC(1);
INC(2);
INC(3);
INC(4);
int e[6];
e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
int i[8];
_mm256_storeu_si256((__m256i*)i, index);
d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}
The repo can be found here: https://github.com/eyepatchParrot/sort6/

This question is becoming quite old, but I actually had to solve the same problem these days: fast agorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).
The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).
Sorting 3 values
The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:
void sort3(int* array)
{
if (array[1] < array[0]) {
if (array[2] < array[0]) {
if (array[2] < array[1]) {
std::swap(array[0], array[2]);
} else {
int tmp = array[0];
array[0] = array[1];
array[1] = array[2];
array[2] = tmp;
}
} else {
std::swap(array[0], array[1]);
}
} else {
if (array[2] < array[1]) {
if (array[2] < array[0]) {
int tmp = array[2];
array[2] = array[1];
array[1] = array[0];
array[0] = tmp;
} else {
std::swap(array[1], array[2]);
}
}
}
}
It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.
Sorting 4 values
This one calls sort3 then performs an unrolled insertion sort with the last element of the array:
void sort4(int* array)
{
// Sort the first 3 elements
sort3(array);
// Insert the 4th element with insertion sort
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
}
}
}
}
This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...
Sorting 6 values
This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:
Sort everything but the first and the last elements of the array.
Swap the first and the elements of the array if the first is greater than the last.
Insert the first element into the sorted sequence from the front then the last element from the back.
After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been insert in the 3rd position, then the last one can't be inserted lower than the 4th position.
void sort6(int* array)
{
// Sort everything but first and last elements
sort4(array+1);
// Switch first and last elements if needed
if (array[5] < array[0]) {
std::swap(array[0], array[5]);
}
// Insert first element from the front
if (array[1] < array[0]) {
std::swap(array[0], array[1]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
}
}
}
}
// Insert last element from the back
if (array[5] < array[4]) {
std::swap(array[4], array[5]);
if (array[4] < array[3]) {
std::swap(array[3], array[4]);
if (array[3] < array[2]) {
std::swap(array[2], array[3]);
if (array[2] < array[1]) {
std::swap(array[1], array[2]);
}
}
}
}
}
My tests on every permutation of 6 values ever show that this algorithms always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.
I hope that this helps, even if this question may not represent an actual problem anymore :)
EDIT: after putting it in the provided benchmark, it is cleary slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:
#define MIN(x, y) (x<y?x:y)
#define MAX(x, y) (x<y?y:x)
template<class IterType>
inline void sort6_iterator(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
const auto b = MAX(*(it + x), *(it + y)); \
*(it + x) = a; *(it + y) = b; }
SWAP(1, 2) SWAP(4, 5)
SWAP(0, 2) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3)
SWAP(2, 5) SWAP(1, 3)
SWAP(2, 4)
SWAP(2, 3)
#undef SWAP
}
I passed this function a std::vector's iterator in my timing code.
I suspect (from comments like this and elsewhere) that using iterators gives g++ certain assurances about what can and can't happen to the memory that the iterator refers to, which it otherwise wouldn't have and it is these assurances that allow g++ to better optimize the sorting code (e.g. with pointers, the compiler can't be sure that all pointers are pointing to different memory locations). If I remember correctly, this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.
Moreover, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.1 Note that since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).
template<class IterType>
inline void sort6_iterator_local(IterType it)
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
const auto b = MAX(data##x, data##y); \
data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a) auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
DD2(1,2) SWAP(1, 2)
DD2(4,5) SWAP(4, 5)
DD1(0) SWAP(0, 2)
DD1(3) SWAP(3, 5)
SWAP(0, 1) SWAP(3, 4)
SWAP(1, 4) SWAP(0, 3) CB(0)
SWAP(2, 5) CB(5)
SWAP(1, 3) CB(1)
SWAP(2, 4) CB(4)
SWAP(2, 3) CB(2) CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}
Note that defining SWAP() as follows sometimes results in slightly better performance although most of the time it results in slightly worse performance or a negligible difference in performance.
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
data##y = MAX(data##x, data##y); \
data##x = a; }
If you just want a sorting algorithm that on primitive data types, gcc -O3 is consistently good at optimizing no matter what context the call to the sorting function appears in1 then, depending on how you pass the input, try one of the following two algorithms:
template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a) register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a) *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
Or if you want to pass the variables by reference then use this (the below function differs from the above in its first 5 lines):
template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a) register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a) e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
DD2(1,2) SORT2(1,2)
DD2(4,5) SORT2(4,5)
DD1(0) SORT2(0,2)
DD1(3) SORT2(3,5)
SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
SORT2(1,4) SORT2(0,3) CB1(0)
SORT2(2,4) CB1(4)
SORT2(1,3) CB1(1)
SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}
The reason for using the register keyword is because this is one of the few times that you know that you want these values in registers. Without register, the compiler will figure this out most of the time but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword since it's more likely to slow your code than speed it up.
Also, note the use of templates. This is done on purpose since, even with the inline keyword, template functions are generally much more aggressively optimized by gcc than vanilla C functions (this has to do with gcc needing to deal with function pointers for vanilla C functions but not with template functions).
While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.

I believe there are two parts to your question.
The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many) which allows you to compute exact min, max, average and standard deviation of compares and swaps. Have a runner-up or two handy as well.
The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.
I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there fetch latencies are refreshingly low which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so which is right where it should be (there are two different inner loops in my sorting algorithm, they are 10 instructions/22 bytes and 9/22 long respectively). Assuming the code doesn't contain any divs you can be sure it will be blindingly fast.

I know this is an old question.
But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,
It's not fast as it uses 114 of each,
could reduce it to 75 pretty simply like so -> pastebin
But then it's not purely min max anymore.
What might work is doing min/max on multiple integers at once with AVX
PMINSW reference
#include <stdio.h>
static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
"=A" (x));
return x;
}
#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))
static __inline__ void sort6(int * in) {
const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];
in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );
const int
AB = MAX(A, B),
AC = MAX(A, C),
AD = MAX(A, D),
AE = MAX(A, E),
AF = MAX(A, F),
BC = MAX(B, C),
BD = MAX(B, D),
BE = MAX(B, E),
BF = MAX(B, F),
CD = MAX(C, D),
CE = MAX(C, E),
CF = MAX(C, F),
DE = MAX(D, E),
DF = MAX(D, F),
EF = MAX(E, F);
in[1] = MIN4 (
MIN4( AB, AC, AD, AE ),
MIN4( AF, BC, BD, BE ),
MIN4( BF, CD, CE, CF ),
MIN3( DE, DF, EF)
);
const int
ABC = MAX(AB,C),
ABD = MAX(AB,D),
ABE = MAX(AB,E),
ABF = MAX(AB,F),
ACD = MAX(AC,D),
ACE = MAX(AC,E),
ACF = MAX(AC,F),
ADE = MAX(AD,E),
ADF = MAX(AD,F),
AEF = MAX(AE,F),
BCD = MAX(BC,D),
BCE = MAX(BC,E),
BCF = MAX(BC,F),
BDE = MAX(BD,E),
BDF = MAX(BD,F),
BEF = MAX(BE,F),
CDE = MAX(CD,E),
CDF = MAX(CD,F),
CEF = MAX(CE,F),
DEF = MAX(DE,F);
in[2] = MIN( MIN4 (
MIN4( ABC, ABD, ABE, ABF ),
MIN4( ACD, ACE, ACF, ADE ),
MIN4( ADF, AEF, BCD, BCE ),
MIN4( BCF, BDE, BDF, BEF )),
MIN4( CDE, CDF, CEF, DEF )
);
const int
ABCD = MAX(ABC,D),
ABCE = MAX(ABC,E),
ABCF = MAX(ABC,F),
ABDE = MAX(ABD,E),
ABDF = MAX(ABD,F),
ABEF = MAX(ABE,F),
ACDE = MAX(ACD,E),
ACDF = MAX(ACD,F),
ACEF = MAX(ACE,F),
ADEF = MAX(ADE,F),
BCDE = MAX(BCD,E),
BCDF = MAX(BCD,F),
BCEF = MAX(BCE,F),
BDEF = MAX(BDE,F),
CDEF = MAX(CDE,F);
in[3] = MIN4 (
MIN4( ABCD, ABCE, ABCF, ABDE ),
MIN4( ABDF, ABEF, ACDE, ACDF ),
MIN4( ACEF, ADEF, BCDE, BCDF ),
MIN3( BCEF, BDEF, CDEF )
);
const int
ABCDE= MAX(ABCD,E),
ABCDF= MAX(ABCD,F),
ABCEF= MAX(ABCE,F),
ABDEF= MAX(ABDE,F),
ACDEF= MAX(ACDE,F),
BCDEF= MAX(BCDE,F);
in[4]= MIN (
MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
MIN ( ACDEF, BCDEF )
);
in[5] = MAX(ABCDE,F);
}
int main(int argc, char ** argv) {
int d[6][6] = {
{1, 2, 3, 4, 5, 6},
{6, 5, 4, 3, 2, 1},
{100, 2, 300, 4, 500, 6},
{100, 2, 3, 4, 500, 6},
{1, 200, 3, 4, 5, 600},
{1, 1, 2, 1, 2, 1}
};
unsigned long long cycles = rdtsc();
for (int i = 0; i < 6; i++) {
sort6(d[i]);
}
cycles = rdtsc() - cycles;
printf("Time is %d\n", (unsigned)cycles);
for (int i = 0; i < 6; i++) {
printf("d%d : %d %d %d %d %d %d\n", i,
d[i][0], d[i][1], d[i][2],
d[i][3], d[i][4], d[i][5]);
}
}
EDIT:
Rank order solution inspired by Rex Kerr's,
Much faster than the mess above
static void sort6(int *o) {
const int
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
BC = B>C, BD = B>D, BE = B>E,
CD = C>D, CE = C>E,
DE = D>E,
a = AB + AC + AD + AE + (A>F),
b = 1 - AB + BC + BD + BE + (B>F),
c = 2 - AC - BC + CD + CE + (C>F),
d = 3 - AD - BD - CD + DE + (D>F),
e = 4 - AE - BE - CE - DE + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}

I thought I'd try an unrolled Ford-Johnson merge-insertion sort, which achieves the minimum possible number of comparisons (ceil(log2(6!)) = 10) and no swaps.
It doesn't compete, though (I got a slightly better timing than the worst sorting networks solution sort6_sorting_network_v1).
It loads the values into six registers, then performs 8 to 10 comparisons
to decide which of the 720=6!
cases it's in, then writes the registers back in the appropriate one
of those 720 orders (separate code for each case).
There are no swaps or reordering of anything until the final write-back. I haven't looked at the generated assembly code.
static inline void sort6_ford_johnson_unrolled(int *D) {
register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
#define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
#define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
: c<b ? abcdef(a,c,b,d,e,f) \
: abcdef(a,b,c,d,e,f))
#define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
: abcdef(a,c,b,e,d,f) \
: c<e ? abcdef(a,b,c,e,d,f) \
: abcdef(a,b,e,c,d,f))
#define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
: abedf_cd(a,e,c,d,b,f) \
: e<d ? abedf_cd(a,b,c,d,e,f) \
: abdef_cd(a,b,c,d,e,f))
#define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
: b<f ? abdf_cd_ef(a,b,e,f,c,d) \
: abdf_cd_ef(e,f,a,b,c,d))
#define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
: abd_cd_ef(c,d,a,b,e,f))
#define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
: ab_cd_ef(a,b,c,d,f,e))
#define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
: ab_cd(a,b,d,c,e,f))
a<b ? ab(a,b,c,d,e,f)
: ab(b,a,c,d,e,f);
#undef ab
#undef ab_cd
#undef ab_cd_ef
#undef abd_cd_ef
#undef abdf_cd_ef
#undef abedf_cd
#undef abdef_cd
#undef abcdef
}
TEST(ford_johnson_unrolled, "Unrolled Ford-Johnson Merge-Insertion sort");

Try 'merging sorted list' sort. :) Use two array. Fastest for small and big array.
If you concating, you only check where insert. Other bigger values you not need compare (cmp = a-b>0).
For 4 numbers, you can use system 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort use 6 cmp (6). Lots of cmp for big numbers slower code.
This code use 5 cmp (not MSL sort):
if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}
Principial MSL
9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8
js code
function sortListMerge_2a(cmp)
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end = arr_count;
//var str = '';
cycles = 0;
if (end>3)
{
stepmax = ((end - start + 1) >> 1) << 1;
m = 1;
n = 2;
for (step=1;step<stepmax;step<<=1) //bounds 1-1, 2-2, 4-4, 8-8...
{
a = start;
while (a<end)
{
b = a + step;
c = a + step + step;
b = b<end ? b : end;
c = c<end ? c : end;
i = a;
j = b;
k = i;
while (i<b && j<c)
{
if (cmp(arr[m][i],arr[m][j])>0)
{arr[n][k] = arr[m][j]; j++; k++;}
else {arr[n][k] = arr[m][i]; i++; k++;}
}
while (i<b)
{arr[n][k] = arr[m][i]; i++; k++;
}
while (j<c)
{arr[n][k] = arr[m][j]; j++; k++;
}
a = c;
}
tmp = m; m = n; n = tmp;
}
return m;
}
else
{
// sort 3 items
sort10(cmp);
return m;
}
}

Maybe I am late to the party, but at least my contribution is a new approach.
The code really should be inlined
even if inlined, there are too many branches
the analysing part is basically O(N(N-1)) which seems OK for N=6
the code could be more effective if the cost of swap would be higher (irt the cost of compare)
I trust on static functions being inlined.
The method is related to rank-sort
instead of ranks, the relative ranks (offsets) are used.
the sum of the ranks is zero for every cycle in any permutation group.
instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).
Update: changed the code a bit, some people use C++ compilers to compile C code ...
#include <stdio.h>
#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif
static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);
int wsort6(int *arr);
void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", *arr);
}
fprintf(stderr,"\n");
}
void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
fprintf(stderr, " %3d", (int) *arr);
}
fprintf(stderr,"\n");
}
static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;
for (top = 0; top < cnt; top++ ) {
for (bot = 0; bot < top; bot++ ) {
if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
}
}
return ;
}
/* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){
dif[0] = (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}
static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;
Dif difs[cnt];
#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);
#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif
nswap = 0;
for (idx=0; idx < cnt; idx++) {
int newval;
int step,cyc;
if ( !difs[idx] ) continue;
newval = arr[idx];
cyc = 0;
src = idx;
do {
int oldval;
step = difs[src];
difs[src] =0;
dst = src + step;
cyc += step ;
if(dst == idx+1)idx=dst;
oldval = arr[dst];
#if (DUMP_IT&1)
fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
, nswap, cyc, step, idx, oldval, newval
, src, dst, difs[dst], arr[dst]
, newval );
do_print_a("Array ", arr, cnt);
do_print_d("Steps ", difs, cnt);
#endif
arr[dst] = newval;
newval = oldval;
nswap++;
src = dst;
} while( cyc);
}
return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}

//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
int t[6] = {0};
int r1,r2;
r1=0;
r1 += (a[0] > a[1]);
r1 += (a[0] > a[2]);
r1 += (a[0] > a[3]);
r1 += (a[0] > a[4]);
r1 += (a[0] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[0];
r2=0;
r2 += (a[1] > a[0]);
r2 += (a[1] > a[2]);
r2 += (a[1] > a[3]);
r2 += (a[1] > a[4]);
r2 += (a[1] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[1];
r1=0;
r1 += (a[2] > a[0]);
r1 += (a[2] > a[1]);
r1 += (a[2] > a[3]);
r1 += (a[2] > a[4]);
r1 += (a[2] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[2];
r2=0;
r2 += (a[3] > a[0]);
r2 += (a[3] > a[1]);
r2 += (a[3] > a[2]);
r2 += (a[3] > a[4]);
r2 += (a[3] > a[5]);
while(t[r2]){r2++;}
t[r2] = a[3];
r1=0;
r1 += (a[4] > a[0]);
r1 += (a[4] > a[1]);
r1 += (a[4] > a[2]);
r1 += (a[4] > a[3]);
r1 += (a[4] > a[5]);
while(t[r1]){r1++;}
t[r1] = a[4];
r2=0;
r2 += (a[5] > a[0]);
r2 += (a[5] > a[1]);
r2 += (a[5] > a[2]);
r2 += (a[5] > a[3]);
r2 += (a[5] > a[4]);
while(t[r2]){r2++;}
t[r2] = a[5];
a[0]=t[0];
a[1]=t[1];
a[2]=t[2];
a[3]=t[3];
a[4]=t[4];
a[5]=t[5];
}
static __inline__ void sort6(int* a)
{
#define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
register int t;
wire( 0, 1); wire( 2, 3); wire( 4, 5);
wire( 3, 5); wire( 0, 2); wire( 1, 4);
wire( 4, 5); wire( 2, 3); wire( 0, 1);
wire( 3, 4); wire( 1, 2);
wire( 2, 3);
#undef wire
}

Well, if it's only 6 elements and you can leverage parallelism, want to minimize conditional branching, etc. Why you don't generate all the combinations and test for order? I would venture that in some architectures, it can be pretty fast (as long as you have the memory preallocated)

Sort 4 items with usage cmp==0.
Numbers of cmp is ~4.34 (FF native have ~4.52), but take 3x time than merging lists. But better less cmp operations, if you have big numbers or big text.
Edit: repaired bug
Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm
function sort4DG(cmp,start,end,n) // sort 4
{
var n = typeof(n) !=='undefined' ? n : 1;
var cmp = typeof(cmp) !=='undefined' ? cmp : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end = typeof(end) !=='undefined' ? end : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
{
swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
if (cc[13]>0)
{
swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
return n;
}
else {
cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3])); // new cc23 | c03 //repaired
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
}
else {
if (cc[12]>0)
{
swap(n,i+1,i+2);
cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
if (cc[23]>0)
{
swap(n,i+2,i+3);
return n;
}
return n;
}
else {
return n;
}
}
return n;
}

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Understanding response time of memory accesses - caching

Related

Using perf_event_open in AMD for Counting and Sampling Memory Load

Optimizing a logic AND operation for x86

Inline ASM: Use of MMX returns NaN seconds on timer

Userland interrupt timer access such as via KeQueryInterruptTime (or similar)

Fastest sort of fixed length 6 int array

Categories

Resources