How do x86 jump instructions check their respective flags?

As I understand it, conditional jumps check status flags that a prior instruction such as CMP has set. For example:
CMP AX,DX ; Set compare flags
JGE DONE ; Go to DONE label if AX >= DX
...
DONE:
...
How are the flags actually checked in this situation? If I'm not mistaken, the flags are individual bits in a special register; so are the bits checked one at a time, or all at once? In C pseudocode:
unsigned flags = 0; /* reset all flags */
/* define the flags */
const unsigned GREATER = 1<<1;
const unsigned EQUAL = 1<<2;
const unsigned LESS = 1<<3;
unsigned AX = 4; /* initialize AX */
unsigned DX = 3; /* initialize DX */
/* CMP AX,DX */
int result = AX - DX;
if (result > 0) {
    flags |= GREATER;
} else if (result == 0) {
    flags |= EQUAL;
} else if (result < 0) {
    flags |= LESS;
}
/* -------------------------------- */
/* JGE Method 1 */
if (flags & GREATER) {
    goto DONE;
}
if (flags & EQUAL) {
    goto DONE;
}
/* or JGE Method 2 */
if (flags & (GREATER|EQUAL)) {
    goto DONE;
}
Don't look too deeply into the flag-setting code -- I know real x86 flags are set as a by-product of operations rather than explicitly -- my concern is how those flags are actually checked: bit by bit, or with an encompassing bitmask?

Well, it depends on how the machine is built.
Let me point you to a PDF of Harry Porter's Relay Computer, a particularly simple example of a computer.
Look at slide 115, where he shows how the instructions are processed.
Basically there's a big blob of combinational logic implementing a Finite State Machine that controls how each instruction is executed.
Most of the stepping in the FSM is concerned with moving data and addresses between registers, using the address and data busses.
If you're wondering what combinational logic is: it is a block of circuitry with a set of boolean inputs and a set of boolean outputs, where each output is a boolean function of some of the inputs. The block itself has no memory; a machine gets memory by feeding some of the outputs back into the inputs.
So to answer your question: in the context of that computer, it most likely tests all the condition bits at the same time, as a single boolean expression.

Related

Insertion Sort of uint16_t halfwords?

This is the first code for sorting whole signed words (int32_t):
I tried to change it to sort uint16_t unsigned halfwords. I'm almost done with the code, but something is missing. The problem is in the sort itself. The architecture is ARM (ARM mode, not Thumb); this is what I have so far:
sort:   STMFD SP!,{R4-R6,LR}
        MOV R2,#1            // for (unsigned i = 1;
L1:     CMP R2,R1
        BHS L4               // i < toSort.size();
        MOV R3,R2            // for (unsigned j = i
L2:     SUBS R3,R3,#1        // - 1; --j)
        BLO L3               // j != -1;
        ADD R6,R0,R3,LSL #2
        LDRH R4,[R6]
        LDRH R5,[R6,#2]
        CMP R5,R4            // if (toSort[j+1] < toSort[j])
        STRHT R4,[R6,#2]
        STRHT R5,[R6]        // swap(toSort[j], toSort[j+1]);
        BLT L2
L3:     ADD R2,R2,#1         // else break; ++i)
        B L1
L4:     LDMFD SP!,{R4-R6,PC}
void insertionSort(vector<int>& toSort)
{
    for (int i = 1; i < toSort.size(); ++i)
        for (int j = i - 1; j >= 0; --j)
            if (toSort[j+1] < toSort[j])
                swap(toSort[j], toSort[j+1]);
            else
                break;
}
This is the C++ code that the assembly should implement.
Based on the original, your changes look reasonable, except that you need to shift by #1 instead of #2 in ADD R6,R0,R3,LSL #2.  The #2 is for 4-byte integers; #1 would be for 2-byte integers.  This is a shift amount, and the shift count is the power of 2 by which you want to scale the index.  For 4-byte integers we want to multiply/scale by 4, which is 2^2, whereas for 2-byte integers we want to multiply/scale by 2, which is 2^1.  Shifting left in binary multiplies by that power of 2.
However, I find it very difficult to believe that the original insertion sort actually works.  (Did you test it independently of your modifications?)
In particular, as I said, the swap is happening before the condition test; further, your instructor told you that the STRH should be conditional, which is another way of describing the same problem.
So, that code is not following the insertion sort as described in the comments.
The intent of the if-statement within the inner loop is to detect when the insertion point has been reached and immediately stop the inner loop; otherwise, if the insertion point hasn't been reached, swap the elements and continue the inner loop.
That code is instead doing the following:
swap( toSort[j], toSort[j+1] );
if ( toSort[j+1] >= toSort[j] )
    break;
With that being at the end of the inner loop.
Or more specifically:
conditionCodes = toSort[j+1] < toSort[j];
swap( toSort[j], toSort[j+1] );
if ( conditionCodes indicate "less than" )
    continue;
break;
The effect of this is that when the proper stopping point is reached, the loop does stop, but it will also have made one unwanted swap.
You can stop that swap from happening in the last iteration either by changing the flow of control so that those store instructions are not executed, or, as @fuz says, on ARM you can make the store instructions themselves conditional, which is probably what the original author intended, given how far apart the CMP and BLT are.  The conditional execution of those stores was evidently lost somewhere along the way.

C++11 on modern Intel: am I crazy or are non-atomic aligned 64-bit load/store actually atomic?

Can I base a mission-critical application on the results of this test, that 100 threads reading a pointer set a billion times by a main thread never see a tear?
Any other potential problems doing this besides tearing?
Here's a stand-alone demo that compiles with g++ -g tear.cxx -o tear -pthread.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>
using namespace std;
void* pvTearTest;
atomic<int> iTears( 0 );
void TearTest( void ) {
    while (1) {
        void* pv = (void*) pvTearTest;
        intptr_t i = (intptr_t) pv;
        if ( ( i >> 32 ) != ( i & 0xFFFFFFFF ) ) {
            printf( "tear: pv = %p\n", pv );
            iTears++;
        }
        if ( ( i >> 32 ) == 999999999 )
            break;
    }
}
int main( int argc, char** argv ) {
    printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );
    vector<thread> athr;
    // Create lots of threads and have them do the test simultaneously.
    for ( int i = 0; i < 100; i++ )
        athr.emplace_back( TearTest );
    for ( int i = 0; i < 1000000000; i++ )
        pvTearTest = (void*) (intptr_t)
            ( ( i % (1L<<32) ) * 0x100000001 );
    for ( auto& thr: athr )
        thr.join();
    if ( iTears )
        printf( "%d tears\n", iTears.load() );
    else
        printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}
The actual application is a malloc()'ed and sometimes realloc()'d array (size is power of two; realloc doubles storage) that many child threads will absolutely be hammering in a mission-critical but also high-performance-critical way.
From time to time a thread will need to add a new entry to the array, and will do so by setting the next array entry to point to something, then increment an atomic<int> iCount. Finally it will add data to some data structures that would cause other threads to attempt to dereference that cell.
It all seems fine (except that I'm not positive the increment of the count is assured of happening before the following non-atomic updates)... except for one thing: realloc() will typically change the address of the array, and it also frees the old one, a pointer to which may still be visible to other threads.
OK, so instead of realloc(), I malloc() a new array, manually copy the contents, and set the pointer to the new array. I would free the old array, but I realize other threads may still be accessing it: they read the array base; I free it; a third thread allocates that memory and writes something else there; the first thread then adds the indexed offset to the stale base and expects a valid pointer. I'm happy to leak the old arrays, though. (Given the doubling growth, all old arrays combined are about the same size as the current array, so the overhead is simply an extra 16 bytes per item, in memory that is soon never referenced again.)
So, here's the crux of the question: once I allocate the bigger array, can I write its base address with a non-atomic write, in utter safety? Or, despite my billion-access test, do I actually have to make it atomic<> and thus slow down every worker thread that reads it?
(As this is surely environment dependent, we're talking 2012-or-later Intel, g++ 4 to 9, and Red Hat of 2012 or later.)
EDIT: here is a modified test program that matches my planned scenario much more closely, with only a small number of writes. I've also added a count of the reads. I see when switching from void* to atomic I go from 2240 reads/sec to 660 reads/sec (with optimization disabled). The machine language for the read is shown after the source.
#include <atomic>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>
using namespace std;
chrono::time_point<chrono::high_resolution_clock> tp1, tp2;
// void*: 1169.093u 0.027s 2:26.75 796.6% 0+0k 0+0io 0pf+0w
// atomic<void*>: 6656.864u 0.348s 13:56.18 796.1% 0+0k 0+0io 0pf+0w
// Different definitions of the target variable.
atomic<void*> pvTearTest;
//void* pvTearTest;
// Children sum the tears they find, and at end, total checks performed.
atomic<int> iTears( 0 );
atomic<uint64_t> iReads( 0 );
bool bEnd = false; // main thr sets true; children all finish.
void TearTest( void ) {
    uint64_t i;
    for ( i = 0; ! bEnd; i++ ) {
        intptr_t iTearTest = (intptr_t) (void*) pvTearTest;
        // Make sure top 4 and bottom 4 bytes are the same. If not it's a tear.
        if ( ( iTearTest >> 32 ) != ( iTearTest & 0xFFFFFFFF ) ) {
            printf( "tear: pv = %lx\n", iTearTest );
            iTears++;
        }
        // Output periodically to prove we're seeing changing values.
        if ( ( (i+1) % 50000000 ) == 0 )
            printf( "got: pv = %lx\n", iTearTest );
    }
    iReads += i;
}
int main( int argc, char** argv ) {
    printf( "\n\nTEAR TEST: are normal pointer read/writes atomic?\n" );
    vector<thread> athr;
    // Create lots of threads and have them do the test simultaneously.
    for ( int i = 0; i < 100; i++ )
        athr.emplace_back( TearTest );
    tp1 = chrono::high_resolution_clock::now();
#if 0
    // Change target as fast as possible for fixed number of updates.
    for ( int i = 0; i < 1000000000; i++ )
        pvTearTest = (void*) (intptr_t)
            ( ( i % (1L<<32) ) * 0x100000001 );
#else
    // More like our actual app: change target only periodically, for fixed time.
    for ( int i = 0; i < 100; i++ ) {
        pvTearTest.store( (void*) (intptr_t) ( ( i % (1L<<32) ) * 0x100000001 ),
                          std::memory_order_release );
        this_thread::sleep_for(10ms);
    }
#endif
    bEnd = true;
    for ( auto& thr: athr )
        thr.join();
    tp2 = chrono::high_resolution_clock::now();
    chrono::duration<double> dur = tp2 - tp1;
    printf( "%lu reads in %.4f secs: %.2f reads/usec\n",
            iReads.load(), dur.count(), iReads.load() / dur.count() / 1000000 );
    if ( iTears )
        printf( "%d tears\n", iTears.load() );
    else
        printf( "\n\nTEAR TEST: SUCCESS, no tears\n" );
}
Dump of assembler code for function TearTest():
0x0000000000401256 <+0>: push %rbp
0x0000000000401257 <+1>: mov %rsp,%rbp
0x000000000040125a <+4>: sub $0x10,%rsp
0x000000000040125e <+8>: movq $0x0,-0x8(%rbp)
0x0000000000401266 <+16>: movzbl 0x6e83(%rip),%eax # 0x4080f0 <bEnd>
0x000000000040126d <+23>: test %al,%al
0x000000000040126f <+25>: jne 0x40130c <TearTest()+182>
=> 0x0000000000401275 <+31>: mov $0x4080d8,%edi
0x000000000040127a <+36>: callq 0x40193a <std::atomic<void*>::operator void*() const>
0x000000000040127f <+41>: mov %rax,-0x10(%rbp)
0x0000000000401283 <+45>: mov -0x10(%rbp),%rax
0x0000000000401287 <+49>: sar $0x20,%rax
0x000000000040128b <+53>: mov -0x10(%rbp),%rdx
0x000000000040128f <+57>: mov %edx,%edx
0x0000000000401291 <+59>: cmp %rdx,%rax
0x0000000000401294 <+62>: je 0x4012bb <TearTest()+101>
0x0000000000401296 <+64>: mov -0x10(%rbp),%rax
0x000000000040129a <+68>: mov %rax,%rsi
0x000000000040129d <+71>: mov $0x40401a,%edi
0x00000000004012a2 <+76>: mov $0x0,%eax
0x00000000004012a7 <+81>: callq 0x401040 <printf@plt>
0x00000000004012ac <+86>: mov $0x0,%esi
0x00000000004012b1 <+91>: mov $0x4080e0,%edi
0x00000000004012b6 <+96>: callq 0x401954 <std::__atomic_base<int>::operator++(int)>
0x00000000004012bb <+101>: mov -0x8(%rbp),%rax
0x00000000004012bf <+105>: lea 0x1(%rax),%rcx
0x00000000004012c3 <+109>: movabs $0xabcc77118461cefd,%rdx
0x00000000004012cd <+119>: mov %rcx,%rax
0x00000000004012d0 <+122>: mul %rdx
0x00000000004012d3 <+125>: mov %rdx,%rax
0x00000000004012d6 <+128>: shr $0x19,%rax
0x00000000004012da <+132>: imul $0x2faf080,%rax,%rax
0x00000000004012e1 <+139>: sub %rax,%rcx
0x00000000004012e4 <+142>: mov %rcx,%rax
0x00000000004012e7 <+145>: test %rax,%rax
0x00000000004012ea <+148>: jne 0x401302 <TearTest()+172>
0x00000000004012ec <+150>: mov -0x10(%rbp),%rax
0x00000000004012f0 <+154>: mov %rax,%rsi
0x00000000004012f3 <+157>: mov $0x40402a,%edi
0x00000000004012f8 <+162>: mov $0x0,%eax
0x00000000004012fd <+167>: callq 0x401040 <printf@plt>
0x0000000000401302 <+172>: addq $0x1,-0x8(%rbp)
0x0000000000401307 <+177>: jmpq 0x401266 <TearTest()+16>
0x000000000040130c <+182>: mov -0x8(%rbp),%rax
0x0000000000401310 <+186>: mov %rax,%rsi
0x0000000000401313 <+189>: mov $0x4080e8,%edi
0x0000000000401318 <+194>: callq 0x401984 <std::__atomic_base<unsigned long>::operator+=(unsigned long)>
0x000000000040131d <+199>: nop
0x000000000040131e <+200>: leaveq
0x000000000040131f <+201>: retq
Yes, on x86 aligned loads are atomic, BUT this is an architectural detail that you should NOT rely on!
Since you are writing C++ code, you have to abide by the rules of the C++ standard, i.e., you have to use atomics instead of volatile. The fact
that volatile has been part of that language long before the introduction
of threads in C++11 should be a strong enough indication that volatile was
never designed or intended to be used for multi-threading. It is important to
note that in C++ volatile is something fundamentally different from volatile
in languages like Java or C# (in these languages volatile is in
fact related to the memory model and therefore much more like an atomic in C++).
In C++, volatile is used for what is often referred to as "unusual memory".
This is typically memory that can be read or modified outside the current process,
for example when using memory mapped I/O. volatile forces the compiler to
execute all operations in the exact order as specified. This prevents
some optimizations that would be perfectly legal for atomics, while also allowing
some optimizations that are actually illegal for atomics. For example:
volatile int x;
int y;
volatile int z;
x = 1;
y = 2;
z = 3;
z = 4;
...
int a = x;
int b = x;
int c = y;
int d = z;
In this example, there are two assignments to z, and two read operations on x.
If x and z were atomics instead of volatile, the compiler would be free to treat
the first store as irrelevant and simply remove it. Likewise it could just reuse the
value returned by the first load of x, effectively generating code like int b = a.
But since x and z are volatile, these optimizations are not possible. Instead,
the compiler has to ensure that all volatile operations are executed in the
exact order as specified, i.e., the volatile operations cannot be reordered with
respect to each other. However, this does not prevent the compiler from reordering
non-volatile operations. For example, the operations on y could freely be moved
up or down - something that would not be possible if x and z were atomics. So
if you were to try implementing a lock based on a volatile variable, the compiler
could simply (and legally) move some code outside your critical section.
Last but not least it should be noted that marking a variable as volatile does
not prevent it from participating in a data race. In those rare cases where you
have some "unusual memory" (and therefore really require volatile) that is
also accessed by multiple threads, you have to use volatile atomics.
Since aligned loads are actually atomic on x86, the compiler will translate an atomic.load() call to a simple mov instruction, so an atomic load is not slower than reading a volatile variable. An atomic.store() is actually slower than writing a volatile variable, but for good reasons, since in contrast to the volatile write it is by default sequentially consistent. You can relax the memory orders, but you really have to know what you are doing!!
If you want to learn more about the C++ memory model, I can recommend this paper: Memory Models for C/C++ Programmers

Convert bit vector to one bit

Is there an efficient way to get 0x00000001 or 0xFFFFFFFF for non-zero unsigned integer values, and 0 for zero, without branching?
I want to test several masks and create another mask based on that. Basically, I want to optimize the following code:
unsigned getMask(unsigned x, unsigned masks[4])
{
    return (x & masks[0] ? 1 : 0) | (x & masks[1] ? 2 : 0) |
           (x & masks[2] ? 4 : 0) | (x & masks[3] ? 8 : 0);
}
I know that some optimizing compilers can handle this, but even if that's the case, how exactly do they do it? I looked through the Bit twiddling hacks page, but found only a description of conditional setting/clearing of a mask using a boolean condition, so the conversion from int to bool should be done outside the method.
If there is no generic way to solve this, how can I do that efficiently using x86 assembler code?
x86 SSE2 can do this in a few instructions, the most important being movmskps which extracts the top bit of each 4-byte element of a SIMD vector into an integer bitmap.
Intel's intrinsics guide is pretty good, see also the SSE tag wiki
#include <immintrin.h>
static inline
unsigned getMask(unsigned x, unsigned masks[4])
{
    __m128i vx = _mm_set1_epi32(x);
    __m128i vm = _mm_load_si128((const __m128i*)masks); // or loadu if this can inline where masks[] isn't aligned
    __m128i vand = _mm_and_si128(vx, vm);               // ("and" renamed: it's an operator keyword in C++)
    __m128i eqzero = _mm_cmpeq_epi32(vand, _mm_setzero_si128()); // vector of 0 or -1 elems
    unsigned zeromask = _mm_movemask_ps(_mm_castsi128_ps(eqzero));
    return zeromask ^ 0xf; // flip the low 4 bits
}
Until AVX512, there's no SIMD cmpneq, so the best option is scalar XOR after extracting a mask. (We want to just flip the low 4 bits, not all of them with a NOT.)
The usual way to do this in x86 is:
test eax, eax
setne al
You can use !! to coerce a value to 0 or 1 and rewrite the expression like this
return !!(x & masks[0]) | (!!(x & masks[1]) << 1) |
(!!(x & masks[2]) << 2) | (!!(x & masks[3]) << 3);

Userland interrupt timer access such as via KeQueryInterruptTime (or similar)

Is there a "Nt" or similar (i.e. non-kernelmode-driver) function equivalent for KeQueryInterruptTime or anything similar? There seems to be no such thing as NtQueryInterruptTime, at least I've not found it.
What I want is a reasonably accurate and reliable monotonic timer (thus not QPC) that is reasonably efficient and has no surprises such as an overflowing 32-bit counter, no unnecessary "smartness", no time zones, and no complicated structures.
So ideally, I want something like timeGetTime with a 64 bit value. It doesn't even have to be the same timer.
There exists GetTickCount64 starting with Vista, which would be acceptable as such, but I'd not like to break XP support only for such a stupid reason.
Reading the quadword at 0x7FFE0008 as indicated here ... well, works ... and it proves that indeed the actual internal counter is 64 bits under XP (it's also as fast as it could possibly get), but meh... let's not talk about what a kind of nasty hack it is to read some unknown, hardcoded memory location.
There must certainly be something in between calling an artificially stupefied (scaling a 64 bit counter down to 32 bits) high-level API function and reading a raw memory address?
Here's an example of a thread-safe wrapper for GetTickCount() that extends the tick count to 64 bits, making it equivalent to GetTickCount64().
To avoid missing counter rollovers, make sure to call this function at least a few times every 49.7 days. You can even have a dedicated thread whose only purpose is to call this function and then sleep for some 20 days in an infinite loop.
ULONGLONG MyGetTickCount64(void)
{
    static volatile LONGLONG Count = 0;
    LONGLONG curCount1, curCount2;
    LONGLONG tmp;
    curCount1 = InterlockedCompareExchange64(&Count, 0, 0);
    curCount2 = curCount1 & 0xFFFFFFFF00000000;
    curCount2 |= GetTickCount();
    if ((ULONG)curCount2 < (ULONG)curCount1)
    {
        curCount2 += 0x100000000;
    }
    tmp = InterlockedCompareExchange64(&Count, curCount2, curCount1);
    if (tmp == curCount1)
    {
        return curCount2;
    }
    else
    {
        return tmp;
    }
}
EDIT: And here's a complete application that tests MyGetTickCount64().
// Compiled with Open Watcom C 1.9: wcl386.exe /we /wx /q gettick.c
#include <windows.h>
#include <stdio.h>
#include <stdarg.h>
#include <stdlib.h>
//
// The below code is an ugly implementation of InterlockedCompareExchange64()
// that is apparently missing in Open Watcom C 1.9.
// It must work with MSVC++ too, however.
//
UINT8 Cmpxchg8bData[] =
{
0x55, // push ebp
0x89, 0xE5, // mov ebp, esp
0x57, // push edi
0x51, // push ecx
0x53, // push ebx
0x8B, 0x7D, 0x10, // mov edi, [ebp + 0x10]
0x8B, 0x07, // mov eax, [edi]
0x8B, 0x57, 0x04, // mov edx, [edi + 0x4]
0x8B, 0x7D, 0x0C, // mov edi, [ebp + 0xc]
0x8B, 0x1F, // mov ebx, [edi]
0x8B, 0x4F, 0x04, // mov ecx, [edi + 0x4]
0x8B, 0x7D, 0x08, // mov edi, [ebp + 0x8]
0xF0, // lock:
0x0F, 0xC7, 0x0F, // cmpxchg8b [edi]
0x5B, // pop ebx
0x59, // pop ecx
0x5F, // pop edi
0x5D, // pop ebp
0xC3 // ret
};
LONGLONG (__cdecl *Cmpxchg8b)(LONGLONG volatile* Dest, LONGLONG* Exch, LONGLONG* Comp) =
(LONGLONG (__cdecl *)(LONGLONG volatile*, LONGLONG*, LONGLONG*))Cmpxchg8bData;
LONGLONG MyInterlockedCompareExchange64(LONGLONG volatile* Destination,
LONGLONG Exchange,
LONGLONG Comparand)
{
return Cmpxchg8b(Destination, &Exchange, &Comparand);
}
#ifdef InterlockedCompareExchange64
#undef InterlockedCompareExchange64
#endif
#define InterlockedCompareExchange64(Destination, Exchange, Comparand) \
MyInterlockedCompareExchange64(Destination, Exchange, Comparand)
//
// This stuff makes a thread-safe printf().
// We don't want characters output by one thread to be mixed
// with characters output by another. We want printf() to be
// "atomic".
// We use a critical section around vprintf() to achieve "atomicity".
//
static CRITICAL_SECTION PrintfCriticalSection;
int ts_printf(const char* Format, ...)
{
int count;
va_list ap;
EnterCriticalSection(&PrintfCriticalSection);
va_start(ap, Format);
count = vprintf(Format, ap);
va_end(ap);
LeaveCriticalSection(&PrintfCriticalSection);
return count;
}
#define TICK_COUNT_10MS_INCREMENT 0x800000
//
// This is the simulated tick counter.
// Its low 32 bits are going to be returned by
// our, simulated, GetTickCount().
//
// TICK_COUNT_10MS_INCREMENT is what the counter is
// incremented by every time. The value is so chosen
// that the counter quickly overflows in its
// low 32 bits.
//
static volatile LONGLONG SimulatedTickCount = 0;
//
// This is our simulated 32-bit GetTickCount()
// that returns a count that often overflows.
//
ULONG SimulatedGetTickCount(void)
{
return (ULONG)SimulatedTickCount;
}
//
// This thread function will increment the simulated tick counter
// whose value's low 32 bits we'll be reading in SimulatedGetTickCount().
//
DWORD WINAPI SimulatedTickThread(LPVOID lpParameter)
{
UNREFERENCED_PARAMETER(lpParameter);
for (;;)
{
LONGLONG c;
Sleep(10);
// Get the counter value, add TICK_COUNT_10MS_INCREMENT to it and
// store the result back.
c = InterlockedCompareExchange64(&SimulatedTickCount, 0, 0);
InterlockedCompareExchange64(&SimulatedTickCount, c + TICK_COUNT_10MS_INCREMENT, c);
}
return 0;
}
volatile LONG CountOfObserved32bitOverflows = 0;
volatile LONG CountOfObservedUpdateRaces = 0;
//
// This prints statistics that includes the true 64-bit value of
// SimulatedTickCount that we can't get from SimulatedGetTickCount() as it
// returns only its lower 32 bits.
//
// The stats also include:
// - the number of times that MyGetTickCount64() observes an overflow of
// SimulatedGetTickCount()
// - the number of times MyGetTickCount64() fails to update its internal
// counter because of a concurrent update in another thread.
//
void PrintStats(void)
{
LONGLONG true64bitCounter = InterlockedCompareExchange64(&SimulatedTickCount, 0, 0);
ts_printf(" 0x%08X`%08X <- true 64-bit count; ovfs: ~%d; races: %d\n",
(ULONG)(true64bitCounter >> 32),
(ULONG)true64bitCounter,
CountOfObserved32bitOverflows,
CountOfObservedUpdateRaces);
}
//
// This is our poor man's implementation of GetTickCount64()
// on top of GetTickCount().
//
// It's thread safe.
//
// When used with actual GetTickCount() instead of SimulatedGetTickCount()
// it must be called at least a few times in 49.7 days to ensure that
// it doesn't miss any overflows in GetTickCount()'s return value.
//
ULONGLONG MyGetTickCount64(void)
{
static volatile LONGLONG Count = 0;
LONGLONG curCount1, curCount2;
LONGLONG tmp;
curCount1 = InterlockedCompareExchange64(&Count, 0, 0);
curCount2 = curCount1 & 0xFFFFFFFF00000000;
curCount2 |= SimulatedGetTickCount();
if ((ULONG)curCount2 < (ULONG)curCount1)
{
curCount2 += 0x100000000;
InterlockedIncrement(&CountOfObserved32bitOverflows);
}
tmp = InterlockedCompareExchange64(&Count, curCount2, curCount1);
if (tmp != curCount1)
{
curCount2 = tmp;
InterlockedIncrement(&CountOfObservedUpdateRaces);
}
return curCount2;
}
//
// This is an error counter. If a thread that uses MyGetTickCount64() notices
// any problem with what MyGetTickCount64() returns, it bumps up this error
// counter and stops. If one of threads sees a non-zero value in this
// counter due to an error in another thread, it stops as well.
//
volatile LONG Error = 0;
//
// This is a thread function that will be using MyGetTickCount64(),
// validating its return value and printing some stats once in a while.
//
// This function is meant to execute concurrently in multiple threads
// to create race conditions inside of MyGetTickCount64() and test it.
//
DWORD WINAPI TickUserThread(LPVOID lpParameter)
{
DWORD user = (DWORD)lpParameter; // thread number
ULONGLONG ticks[4];
ticks[3] = ticks[2] = ticks[1] = MyGetTickCount64();
while (!Error)
{
ticks[0] = ticks[1];
ticks[1] = MyGetTickCount64();
// Every ~100 ms sleep a little (slightly lowers CPU load, to about 90%)
if (ticks[1] > ticks[2] + TICK_COUNT_10MS_INCREMENT * 10L)
{
ticks[2] = ticks[1];
Sleep(1 + rand() % 20);
}
// Every ~1000 ms print the last value from MyGetTickCount64().
// Thread 1 also prints stats here.
if (ticks[1] > ticks[3] + TICK_COUNT_10MS_INCREMENT * 100L)
{
ticks[3] = ticks[1];
ts_printf("%u:0x%08X`%08X\n", user, (ULONG)(ticks[1] >> 32), (ULONG)ticks[1]);
if (user == 1)
{
PrintStats();
}
}
if (ticks[0] > ticks[1])
{
ts_printf("%u:Non-monotonic tick counts: 0x%016llX > 0x%016llX!\n",
user,
ticks[0],
ticks[1]);
PrintStats();
InterlockedIncrement(&Error);
return -1;
}
else if (ticks[0] + 0x100000000 <= ticks[1])
{
ts_printf("%u:Too big tick count jump: 0x%016llX -> 0x%016llX!\n",
user,
ticks[0],
ticks[1]);
PrintStats();
InterlockedIncrement(&Error);
return -1;
}
Sleep(0); // be nice, yield to other threads.
}
return 0;
}
//
// This prints stats upon Ctrl+C and terminates the program.
//
BOOL WINAPI ConsoleEventHandler(DWORD Event)
{
if (Event == CTRL_C_EVENT)
{
PrintStats();
}
return FALSE;
}
int main(void)
{
HANDLE simulatedTickThreadHandle;
HANDLE tickUserThreadHandle;
DWORD dummy;
// This is for the missing InterlockedCompareExchange64() workaround.
VirtualProtect(Cmpxchg8bData, sizeof(Cmpxchg8bData), PAGE_EXECUTE_READWRITE, &dummy);
InitializeCriticalSection(&PrintfCriticalSection);
if (!SetConsoleCtrlHandler(&ConsoleEventHandler, TRUE))
{
ts_printf("SetConsoleCtrlHandler(&ConsoleEventHandler) failed with error 0x%X\n", GetLastError());
return -1;
}
// Start the tick simulator thread.
simulatedTickThreadHandle = CreateThread(NULL, 0, &SimulatedTickThread, NULL, 0, NULL);
if (simulatedTickThreadHandle == NULL)
{
ts_printf("CreateThread(&SimulatedTickThread) failed with error 0x%X\n", GetLastError());
return -1;
}
// Start one thread that'll be using MyGetTickCount64().
tickUserThreadHandle = CreateThread(NULL, 0, &TickUserThread, (LPVOID)2, 0, NULL);
if (tickUserThreadHandle == NULL)
{
ts_printf("CreateThread(&TickUserThread) failed with error 0x%X\n", GetLastError());
return -1;
}
// The other thread using MyGetTickCount64() will be the main thread.
TickUserThread((LPVOID)1);
//
// The app terminates upon any error condition detected in TickUserThread()
// in any of the threads or by Ctrl+C.
//
return 0;
}
As a test I've been running this test app under Windows XP for 5+ hours on an otherwise idle machine that has 2 CPUs (idle, to avoid potential long starvation times and therefore avoid missing counter overflows that occur every 5 seconds) and it's still doing well.
Here's the latest output from the console:
2:0x00000E1B`C8800000
1:0x00000E1B`FA800000
0x00000E1B`FA800000 <- true 64-bit count; ovfs: ~3824; races: 110858
As you can see, MyGetTickCount64() has observed 3824 32-bit overflows and failed to update the value of Count with its second InterlockedCompareExchange64() 110858 times. So, overflows indeed occur and the last number means that the variable is, in fact, being concurrently updated by the two threads.
You can also see that the 64-bit tick counts that the two threads receive from MyGetTickCount64() in TickUserThread() don't have anything missing in the top 32 bits and are pretty close to the actual 64-bit tick count in SimulatedTickCount, whose 32 low bits are returned by SimulatedGetTickCount(). 0x00000E1BC8800000 is visually behind 0x00000E1BFA800000 due to thread scheduling and infrequent stat prints, it's behind by exactly 100*TICK_COUNT_10MS_INCREMENT, or 1 second. Internally, of course, the difference is much smaller.
Now, on the availability of InterlockedCompareExchange64()... It's a bit odd that it's officially available only since Windows Vista and Windows Server 2003, given that Server 2003 is in fact built from the same code base as Windows XP.
But the most important thing here is that this function is built on top of the Pentium CMPXCHG8B instruction that's been available since 1998 or earlier (1), (2). And I can see this instruction in my Windows XP's (SP3) binaries. It's in ntkrnlpa.exe/ntoskrnl.exe (the kernel) and ntdll.dll (the DLL that exports kernel's NtXxxx() functions that everything's built upon). Look for a byte sequence of 0xF0, 0x0F, 0xC7 and disassemble the code around that place to see that these bytes aren't there coincidentally.
You can check the availability of this instruction through the CPUID instruction (EDX bit 8 of CPUID function 0x00000001 and function 0x80000001) and refuse to run, instead of crashing, if the instruction isn't there. But these days you're unlikely to find a machine that doesn't support it, and if you do, it won't be a good machine for Windows XP, and probably not for your application either.
Thanks to Google Books which kindly offered the relevant literature for free, I came up with an easy and fast implementation of GetTickCount64 which works perfectly well on pre-Vista systems too (and it still is somewhat less nasty than reading a value from a hardcoded memory address).
It is in fact as easy as calling interrupt 0x2A, which maps to KiGetTickCount. In GCC inline assembly, this gives:
static __inline__ __attribute__((always_inline)) unsigned long long get_tick_count64()
{
    unsigned long long ret;
    __asm__ __volatile__ ("int $0x2a" : "=A"(ret) : : );
    return ret;
}
Due to the way KiGetTickCount works, the function should probably be called GetTickCount46 instead, as it performs a right shift by 18 and therefore returns 46 bits, not 64. The same is true of the original Vista version, too.
Note that KiGetTickCount clobbers EDX; this is relevant if you plan to write your own, faster implementation of the 32-bit version (you must add edx to the clobber list in that case!).
Here's another approach, a variant of Alex's wrapper but using only 32-bit interlocked operations. It only actually returns a 60-bit number, but that's still good for about thirty-six million years. :-)
It does need to be called more often, at least once every three days. That shouldn't normally be a major drawback.
ULONGLONG MyTickCount64(void)
{
    static volatile DWORD count = 0xFFFFFFFF;
    DWORD previous_count, current_tick32, previous_count_zone, current_tick32_zone;
    ULONGLONG current_tick64;

    previous_count = InterlockedCompareExchange(&count, 0, 0);
    current_tick32 = GetTickCount();

    if (previous_count == 0xFFFFFFFF)
    {
        // count has never been written
        DWORD initial_count;
        initial_count = current_tick32 >> 28;
        previous_count = InterlockedCompareExchange(&count, initial_count, 0xFFFFFFFF);

        if (previous_count == 0xFFFFFFFF)
        {   // This thread wrote the initial value for count
            previous_count = initial_count;
        }
        else if (previous_count != initial_count)
        {   // Another thread wrote the initial value for count,
            // and it differs from the one we calculated
            current_tick32 = GetTickCount();
        }
    }

    previous_count_zone = previous_count & 15;
    current_tick32_zone = current_tick32 >> 28;

    if (current_tick32_zone == previous_count_zone)
    {
        // The top four bits of the 32-bit tick count haven't changed
        // since count was last written.
        current_tick64 = previous_count;
        current_tick64 <<= 28;
        current_tick64 += current_tick32 & 0x0FFFFFFF;
        return current_tick64;
    }

    if (current_tick32_zone == previous_count_zone + 1 ||
        (current_tick32_zone == 0 && previous_count_zone == 15))
    {
        // The top four bits of the 32-bit tick count have been incremented
        // since count was last written.
        InterlockedCompareExchange(&count, previous_count + 1, previous_count);
        current_tick64 = previous_count + 1;
        current_tick64 <<= 28;
        current_tick64 += current_tick32 & 0x0FFFFFFF;
        return current_tick64;
    }

    // Oops, we weren't called often enough, we're stuck
    return 0xFFFFFFFF;
}

OS X video memory from x86_64 assembly

I am working on a C++ app on Intel Mac OS X 10.6.x. I have a variable which contains pixel data obtained with the OpenGL call glReadPixels. I want to do some operations directly on the pixel data using x86_64 assembly instructions. The assembly routine works fine in test programs, but when I try to use it on the pixel data, it only sees zeroes at the memory location pointed to by the pixel data variable. I am guessing this is because I am trying to access video memory directly from x86_64 assembly. Is there a way to access video memory directly from x86_64 assembly? Otherwise, how can I resolve this situation?
Appreciate any pointers. Thanks in advance.
See below for a code sample that flips the last n and first n bytes. The same code works fine in a test program.
void Flip(void *b, unsigned int w, unsigned int h)
{
    __asm {
        mov r8, rdi   // rdi = b
        mov r9, rsi   // w
        mov r10, rdx  // h
        mov r11, 0    // h <- 0
        mov r12, 0    // w <- 0
    outloop:
        ------------
        .............
        .............
}
This isn't really an answer but the comments bit is too short to post this.
Your inline assembly is problematic, in multiple ways:
it assumes that, by the time the compiler reaches the inline block, the function arguments are still in the argument registers. This isn't guaranteed.
it uses MS-VC++ style inline assembly; I'm unsure about Clang on OS X, but gcc proper refuses to even compile this.
It'd also be good to have a complete (compilable) source fragment. Something like (32bit code):
int mogrifyFramebufferContents(unsigned char *fb, int width, int height)
{
    int i, sum;

    glReadPixels(1, 1, width, height, GL_RGBA, GL_UNSIGNED_BYTE, fb);
    for (i = 0, sum = 0; i < 4 * width * height; i++)
        sum += fb[i];
    printf("sum over fb via C: %d\n", sum);
    asm("xorl %0, %0\n\t"
        "xorl %1, %1\n\t"
        "0:\n\t"
        "movzbl (%2, %1), %%ebx\n\t"   /* zero-extend to match the unsigned C sum */
        "addl %%ebx, %0\n\t"
        "incl %1\n\t"
        "cmpl %3, %1\n\t"
        "jl 0b"
        : "=&r"(sum), "=&r"(i)         /* early-clobber: written before inputs are read */
        : "r"(fb), "r"(4 * width * height)
        : "cc", "%ebx");
    printf("sum over fb via inline asm: %d\n", sum);
    return sum;
}
If I haven't made an off-by-one error, this code should produce the same output for C and assembly. Try something similar right at the place where you access the read data from C, and compare the results; or even single-step through the generated assembly with a debugger.
A stackoverflow search for "gcc inline assembly syntax" will give you a starting point for what the %0...%3 placeholders mean, and/or how register assignment constraints like "=r"(sum) above work.

Resources