In C++, I am using ReadProcessMemory to obtain the values of some numbers. There are five different numbers, stored one after another in memory. Is it faster to do five single RPM calls that read each number individually, or is it faster to do a single RPM call that reads an entire structure built to contain these five numbers? For instance,
// L1 is the DWORD64 address of the first number in the target process
int a = Mem.Read<int>(L1);
int b = Mem.Read<int>(L1 + 4);
int c = Mem.Read<int>(L1 + 8);
int d = Mem.Read<int>(L1 + 12);
int e = Mem.Read<int>(L1 + 16);
or
struct numbers{
    int a;
    int b;
    int c;
    int d;
    int e;
};
numbers n = Mem.Read<numbers>(L1);
Thanks!
Using ReadProcessMemory to read memory is "slow" so one large operation is going to be faster than many small reads.
If we ignore the fact that ReadProcessMemory is probably a system call that has to perform a context switch to kernel mode and access the pages of a different process (that might be paged out etc.) you can still make assumptions about the speed of one large vs many small reads.
Imagine you are reading from your own process. Many calls to memcpy with a small buffer are slower than one large copy operation. Just calling a function has some overhead: a call instruction, setting up the stack, doing the actual work, and then restoring the stack and returning. A typical memcpy implementation is often optimized for larger reads and will do a little extra work up front to align the source address so it can read the majority of the memory in 4- or 8-byte chunks (maybe even larger on newer CPUs).
I would normally say that you need to measure with a timer if you are not sure about performance but in this case it is a no-brainer, just do the single read.
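As a rough illustration of the single-read approach, here is a minimal sketch. Note that `read_remote` is a hypothetical stand-in for a ReadProcessMemory wrapper (it only copies from a local buffer and counts calls, so the example stays portable), and `Numbers`, `read_individually` and `read_batched` are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for a ReadProcessMemory wrapper. Each call models
// one expensive round trip into the kernel.
static int g_read_calls = 0;

bool read_remote(const void* remote, void* local, std::size_t size) {
    assert(remote != nullptr && local != nullptr);
    ++g_read_calls;
    std::memcpy(local, remote, size);
    return true;
}

// Five adjacent ints, laid out like the struct in the question.
struct Numbers {
    int a;
    int b;
    int c;
    int d;
    int e;
};
static_assert(sizeof(Numbers) == 5 * sizeof(int), "no padding expected");

// Five separate reads: five round trips.
Numbers read_individually(const int* remote) {
    Numbers n{};
    read_remote(remote + 0, &n.a, sizeof(int));
    read_remote(remote + 1, &n.b, sizeof(int));
    read_remote(remote + 2, &n.c, sizeof(int));
    read_remote(remote + 3, &n.d, sizeof(int));
    read_remote(remote + 4, &n.e, sizeof(int));
    return n;
}

// One read of the whole struct: a single round trip for the same data.
Numbers read_batched(const int* remote) {
    Numbers n{};
    read_remote(remote, &n, sizeof(n));
    return n;
}
```

Both functions return the same values; the batched version simply pays the per-call cost once instead of five times.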
Using this question as basis I implemented a pseudo-random number generator with a global state:
__global uint global_random_state;

void set_random_seed(uint seed){
    global_random_state = seed;
}

uint get_random_number(uint range){
    uint seed = global_random_state + get_global_id(0);
    uint t = seed ^ (seed << 11);
    uint result = seed ^ (seed >> 19) ^ (t ^ (t >> 8));
    global_random_state = result; /* race condition? */
    return result % range;
}
Since these functions will be used from multiple threads, there will be a race condition present when writing to global_random_state.
This might actually help the system to be more unpredictable, so it seems like a good thing, but I'd like to know if there are any consequences to this that might not surface immediately. Are there any side-effects inside the GPU which might cause problems later on when the kernel is run?
In theory you want atom_cmpxchg for correctness here (or the equivalent on your GPGPU platform). However, a grave note of warning: having the entire machine serialize through a single cacheline is going to strangle your performance. Atomics on the same address must form a queue and wait. Atomics on different locations can parallelize (more details at the end).
Generally, algorithms that leverage random variables on GPGPU will keep a separate copy of the random number generator for each work item. This enables each work item to cache and potentially reuse its own random state without glutting the bus with memory traffic on every new random number. Search for "OpenCL Monte Carlo" together with "Simulation" or "Example" for samples. CUDA has some nice examples too.
Another option is to use a random generator that allows one to skip ahead, and have different work items move forward in the sequence by different amounts. This can be more compute intensive, but the tradeoff is that you don't strain the memory hierarchy as much.
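To make the "each work item keeps its own generator" idea concrete, here is a minimal host-side C++ sketch. It swaps in Marsaglia's 32-bit xorshift (shifts 13, 17, 5) instead of the generator from the question, and the names `WorkItemRng`, `make_rng` and `next_random`, as well as the way the work-item id is mixed into the seed, are assumptions made for this example. Seeding like this does not guarantee statistically independent streams.

```cpp
#include <cassert>
#include <cstdint>

// One private generator state per work item: nothing shared, nothing to race on.
struct WorkItemRng {
    std::uint32_t state;
};

// Assumption: mix the work-item id into the seed so the streams differ.
// The multiplier is an arbitrary odd constant chosen for this sketch.
WorkItemRng make_rng(std::uint32_t seed, std::uint32_t work_item_id) {
    std::uint32_t s = seed ^ (work_item_id * 2654435761u);
    if (s == 0) {
        s = 1;  // xorshift must not start from an all-zero state
    }
    return WorkItemRng{s};
}

// One xorshift32 step. Only the caller's private state is updated.
std::uint32_t next_random(WorkItemRng& rng, std::uint32_t range) {
    assert(range > 0);
    std::uint32_t x = rng.state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    rng.state = x;
    return x % range;
}
```

In a kernel, the equivalent would be to build the state once from the seed and get_global_id(0), keep it in a private variable, and never write it back to global memory.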
More gory details on atomics: (1) GPU cache atomics are designed to expect contiguous arrays and atomic ALUs are per bank, (2) each dword in a cacheline will be processed by the same atomic ALU each time, and (3) neighboring cachelines will hash to different banks. So, if every clock you are doing atomics on contiguous cachelines of data, then the work should be perfectly spread out (or statistically so). Conversely, if every work item atomically modifies the same 32 bits, then the cache system cannot serve all 16/32/64 work items (whatever width your system uses) in the same atomic ALU slot. It must break the operation up into 16/32/64 separate atomic operations and apply them one after another (by (2) above). In a system with 512 ALUs for processing atomics, you would be using just one of those ALUs each clock (the same one). Spread the work out and you can use all 512 per clock.
Let's imagine we have a software developer whose goal is to achieve the absolute maximum CPU performance.
Today's CPUs have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that allow us to sum, multiply, or otherwise operate on arrays of items (e.g., multiply 8 integers in one CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD module, plus the overhead of converting vector types to primitive types (sorry, I am only familiar with C#'s Vector). (We are not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the main registers of the CPU are used only for sending data to and receiving data from the SIMD registers, and the main ALU blocks used for general-purpose calculations are idle during this time.
And here is my question: will using SIMD instructions load the main CPU blocks? For example, if we have a huge amount of different calculations (let's imagine 40% of them are best run on SIMD and 60% are better run as usual), will SIMD allow us to gain a performance boost in this way: 100% of all cores' performance + n% of SIMD's performance boost?
I'm asking because, for example, with GPGPU we can use the GPU for parallel calculations while the CPU is used only for sending and receiving data, so it is idle all the time and we can use its performance for latency-sensitive tasks.
Looks like this is a question about out-of-order execution? Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to dispatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will still have to wait for each operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>

typedef __m256 float8;
#define mul8f _mm256_mul_ps

void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
    for(int i = 0; i < count; ++i)
    {
        a[i] = mul8f(a[i], b[i]);
        b[i] = mul8f(b[i], c[i]);
    }
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. Exactly the same CPU hardware is used to compute either float or float8 types. There are simply a couple of bits in the machine code encoding that specify the choice between float/__m128/__m256.
e.g. https://godbolt.org/z/xTcLrf
It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern is it ok to have unsynchronized read-write to a variable? Or is it possible the data would be corrupted, and neither the read nor the write would be correct and one should always synchronize in this case?
I want to say that I've learned it is the latter case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that, and I'm having a hard time finding the answer on Google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it?), so that the worst that could happen is stale data, i.e. the following:
WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread: copy value of memory to register
WriteThread: write new value to memory
At which point the read thread has stale data.
Usually memory is read or written in atomic units determined by the CPU architecture (32-bit and 64-bit items aligned on 32-bit and 64-bit boundaries are common these days).
In this case, what happens depends on the amount of data being written.
Let's consider the case of 32 bit atomic read/write cells.
If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is faster typically to use architecturally defined locks instead.
If two threads write more than an atomic unit (e.g., they both write a 128 bit value), then in fact the atomic unit sized pieces of the values written will be stored in an absolutely well defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the value from the second thread, or a mix of atomic-unit-sized pieces from both threads.
Similar ideas hold for one thread reading, and one thread writing in atomic units, and larger.
Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.
The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32-bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread read the value between the time the lower word was incremented and the upper word was incremented, they get a value with an un-incremented upper word, but the lower word set to 0 -- a value completely different from what it would have been either before or after the increment is complete.
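The 64-bit counter example can be replayed deterministically in a single thread. The following is only a sketch of that interleaving, not a real data race: `SplitCounter` and the helper functions are names invented for this illustration, and the "reader" is simply a read placed between the two halves of the increment.

```cpp
#include <cassert>
#include <cstdint>

// A 64-bit counter stored as two 32-bit words, as a 32-bit CPU would handle it.
struct SplitCounter {
    std::uint32_t low;
    std::uint32_t high;
};

// What a reader sees if it loads both words at this instant.
std::uint64_t read_value(const SplitCounter& c) {
    return (static_cast<std::uint64_t>(c.high) << 32) | c.low;
}

// First half of the non-atomic increment: bump the low word.
// Returns true when the low word wrapped around to zero (a carry is pending).
bool increment_low(SplitCounter& c) {
    c.low += 1;
    return c.low == 0;
}

// Second half: propagate the pending carry into the high word.
void propagate_carry(SplitCounter& c) {
    c.high += 1;
}

// Replays the interleaving from the answer: the read happens after the low
// word was incremented but before the carry reached the high word.
std::uint64_t increment_with_reader_in_between(SplitCounter& c) {
    const bool carry = increment_low(c);
    const std::uint64_t observed = read_value(c);  // the other thread reads here
    if (carry) {
        propagate_carry(c);
    }
    return observed;
}
```

Starting from low = 0xffffffff and high = 0, the reader observes 0, which is neither the old value 0x00000000ffffffff nor the new value 0x0000000100000000.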
As I hinted in Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:
DANGER, WILL ROBINSON!
The following code boosts its priority to realtime to achieve somewhat more consistent results. Doing so requires admin privileges. Be careful if you run the code on a dual- or single-core system, since your machine will lock up for the duration of the test run.
#include <windows.h>
#include <stdio.h>

const int RUNFOR = 5000;

volatile bool terminating = false;
volatile int value;

// Repeatedly writes this thread's id to the shared variable and counts how
// often the immediate read-back sees the other thread's value instead.
static DWORD WINAPI CountErrors(LPVOID parm)
{
    const int id = (int)(INT_PTR) parm; // pointer-sized cast, safe on x64
    int errors = 0;

    while(!terminating)
    {
        value = id;
        if(value != id)
            errors++;
    }
    printf("\tThread %08X: %d errors\n", (unsigned) id, errors);
    return 0;
}

static void RunTest(int affinity1, int affinity2)
{
    terminating = false;
    DWORD dummy;
    HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
    HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);

    SetThreadAffinityMask(t1, affinity1);
    SetThreadAffinityMask(t2, affinity2);
    ResumeThread(t1);
    ResumeThread(t2);

    printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
    Sleep(RUNFOR);
    terminating = true;

    // wait until both threads have picked up the "terminating" flag and exited
    WaitForSingleObject(t1, INFINITE);
    WaitForSingleObject(t2, INFINITE);
    CloseHandle(t1);
    CloseHandle(t2);
}

int main()
{
    SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
    RunTest(1, 2); // core 1 & 2
    RunTest(1, 4); // core 1 & 3
    RunTest(4, 8); // core 3 & 4
    RunTest(1, 8); // core 1 & 4
}
On my quad-core Intel Q6600 system (which IIRC has two sets of cores where each set shares an L2 cache - that would explain the results anyway ;)), I get the following results:
Running test for 5000 milliseconds with affinity 1 and 2
Thread 00002000: 351883 errors
Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
Thread 00001000: 48073 errors
Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
Thread 00002000: 337199 errors
Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
Thread 00001000: 55736 errors
Thread 00002000: 72441 errors
When we have a program that requires lots of operations over large data sets, and the operations on each of the data elements are independent, OpenCL can be a good choice to make it faster. I have a program like the following:
while( function(b,c)!=TRUE)
{
[X,Y] = function1(BigData);
M = functionA(X);
b = function2(M);
N = functionB(Y);
c = function3(N);
}
Here function1 is applied to each element of BigData and produces another two big data sets (X, Y). function2 and function3 are then applied individually to each element of the data derived from X and Y (via functionA and functionB), respectively.
Since all of these functions operate on each element of the data sets independently, using a GPU might make it faster. So I came up with the following:
while( function(b,c)!=TRUE)
{
//[X,Y] = function1(BigData);
1. load kernel1 and BigData on the GPU. each of the thread will work on one of the data
element and save the result on X and Y on GPU.
//M = functionA(X);
2a. load kernel2 on GPU. Each of the threads will work on one of the
data elements of X and save the result on M on GPU.
(workItems=n1, workgroup size=y1)
//b = function2(M);
2b. load kernel2 (Same kernel) on GPU. Each of the threads will work on
one of the data elements of M and save the result on B on GPU
(workItems=n2, workgroup size=y2)
3. read the data B on host variable b
//N = functionB(Y);
4a. load kernel3 on GPU. Each of the threads will work on one of the
data element of Y and save the result on N on GPU.
(workItems=n1, workgroup size=y1)
//c = function3(N);
4b. load kernel3 (Same kernel) on GPU. Each of the threads will work
on one of the data element of N and save the result on C on GPU
(workItems=n2, workgroup size=y2)
5. read the data C on host variable c
}
However, the overhead involved in this code seems significant to me (I have implemented a test program and run it on a GPU). And if the kernels need some sort of synchronization, it might end up even slower.
I also believe this workflow is fairly common. So what is the best practice for using OpenCL to speed up a program like this?
I don't think there's a general problem with the way you've split up the problem into kernels, although it's hard to say as you haven't been very specific. How often do you expect your while loop to run?
If your kernels do negligible work but the outer loop is doing a lot of iterations, you may wish to combine the kernels into one, and do some number of iterations within the kernel itself, if that works for your problem.
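As a host-side analogy for combining kernels, here is a small C++ sketch. The two per-element stages are hypothetical stand-ins for functionA and function2 (their arithmetic is made up), and each loop plays the role of one kernel launch; real OpenCL kernels would of course look different.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-element stages standing in for functionA and function2.
float stage_a(float x) { return x * 2.0f + 1.0f; }
float stage_2(float m) { return m * m; }

// Two passes: the intermediate array M is written out and read back,
// like two separate kernel launches with a global-memory round trip.
std::vector<float> run_two_passes(const std::vector<float>& x) {
    std::vector<float> m(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        m[i] = stage_a(x[i]);  // "kernel 1"
    }
    std::vector<float> b(m.size());
    for (std::size_t i = 0; i < m.size(); ++i) {
        b[i] = stage_2(m[i]);  // "kernel 2"
    }
    return b;
}

// Fused: one pass, the intermediate value never leaves a local variable.
std::vector<float> run_fused(const std::vector<float>& x) {
    std::vector<float> b(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        const float m = stage_a(x[i]);
        b[i] = stage_2(m);
    }
    return b;
}
```

Fusing like this only works when each output element depends on just its own input element, which is what the question describes.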
Otherwise:
If you're getting unexpectedly bad performance, you most likely need to be looking at the efficiency of each of your kernels, and possibly their data access patterns. Unless neighbouring work items are reading/writing neighbouring data (ideally: 16 work items read 4 bytes each from a 64-byte cache line at a time) you're probably wasting memory bandwidth. If your kernels contain lots of conditionals or non-constant loop iterations, that will cost you, etc.
You don't specify what kind of runtimes you're getting, on what kind of job size (Tens? Thousands? Millions of arithmetic ops? How big are your data sets?), or on what hardware (Compute card? Laptop IGPU?). "Significant overhead" can mean a lot of different things. 5ms? 1 second?
Intel, nVidia and AMD all publish optimisation guides - have you read these?
I tested the speed of memcpy() and noticed that the speed drops dramatically at i*4KB. The results are as follows: the Y-axis is the speed (MB/second) and the X-axis is the size of the buffer for memcpy(), increasing from 1KB to 2MB. Subfigure 2 and Subfigure 3 detail the ranges 1KB-150KB and 1KB-32KB.
Environment:
CPU : Intel(R) Xeon(R) CPU E5620 @ 2.40GHz
OS : 2.6.35-22-generic #33-Ubuntu
GCC compiler flags : -O3 -msse4 -DINTEL_SSE4 -Wall -std=c99
I guess it must be related to caches, but I can't find a reason from the following cache-unfriendly cases:
Why is my program slow when looping over exactly 8192 elements?
Why is transposing a matrix of 512x512 much slower than transposing a matrix of 513x513?
In those two cases, the performance degradation is caused by unfriendly loops which read scattered bytes into the cache, wasting the rest of the space of each cache line.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>

/* Prints the memcpy throughput in MB/s for one buffer size. */
void memcpy_speed(unsigned long buf_size, unsigned long iters){
    struct timeval start, end;
    unsigned char * pbuff_1;
    unsigned char * pbuff_2;

    pbuff_1 = malloc(buf_size);
    pbuff_2 = malloc(buf_size);

    gettimeofday(&start, NULL);
    for(unsigned long i = 0; i < iters; ++i){
        memcpy(pbuff_2, pbuff_1, buf_size);
    }
    gettimeofday(&end, NULL);
    /* bytes per microsecond, scaled from 10^6 bytes/s to 2^20 bytes/s */
    printf("%5.3f\n", ((buf_size*iters)/(1.024*1.024))/((end.tv_sec - \
        start.tv_sec)*1000*1000+(end.tv_usec - start.tv_usec)));
    free(pbuff_1);
    free(pbuff_2);
}
UPDATE
Considering suggestions from @usr, @ChrisW and @Leeor, I redid the test more precisely, and the graph below shows the results. The buffer size ranges from 26KB to 38KB, and I tested it every 64B (26KB, 26KB+64B, 26KB+128B, ......, 38KB). Each test loops 100,000 times in about 0.15 seconds. The interesting thing is that the drop occurs not only exactly at 4KB boundaries, but also at 4*i+2 KB, with a much smaller amplitude.
PS
@Leeor offered a way to fill the drop: adding a 2KB dummy buffer between pbuff_1 and pbuff_2. It works, but I am not sure about Leeor's explanation.
Memory is usually organized in 4k pages (although there's also support for larger sizes). The virtual address space your program sees may be contiguous, but it's not necessarily the case in physical memory. The OS, which maintains a mapping of virtual to physical addresses (in the page map) would usually try to keep the physical pages together as well but that's not always possible and they may be fractured (especially on long usage where they may be swapped occasionally).
When your memory stream crosses a 4k page boundary, the CPU needs to stop and go fetch a new translation - if it already saw the page, it may be cached in the TLB, and the access is optimized to be the fastest, but if this is the first access (or if you have too many pages for the TLBs to hold on to), the CPU will have to stall the memory access and start a page walk over the page map entries - that's relatively long as each level is in fact a memory read by itself (on virtual machines it's even longer as each level may need a full pagewalk on the host).
Your memcpy function may have another issue: when first allocating memory, the OS would just add the pages to the page map, but mark them as unaccessed and unmodified due to internal optimizations. The first access may not only invoke a page walk, but possibly also an assist telling the OS that the page is going to be used (and stored into, for the target buffer pages), which would take an expensive transition to some OS handler.
In order to eliminate this noise, allocate the buffers once, perform several repetitions of the copy, and calculate the amortized time. That, on the other hand, would give you "warm" performance (i.e. after having the caches warmed up), so you'll see the cache sizes reflected in your graphs. If you want to get a "cold" effect while not suffering from paging latencies, you might want to flush the caches between iterations (just make sure you don't time that).
EDIT
Reread the question, and you seem to be doing a correct measurement. The problem with my explanation is that it should show a gradual increase after 4k*i, since on every such drop you pay the penalty again, but then should enjoy the free ride until the next 4k. It doesn't explain why there are such "spikes" and after them the speed returns to normal.
I think you are facing a similar issue to the critical stride issue linked in your question - when your buffer size is a nice round 4k, both buffers will align to the same sets in the cache and thrash each other. Your L1 is 32k, so it doesn't seem like an issue at first, but assuming the data L1 has 8 ways it's in fact a 4k wrap-around to the same sets, and you have 2*4k blocks with the exact same alignment (assuming the allocation was done contiguously) so they overlap on the same sets. It's enough that the LRU doesn't work exactly as you expect and you'll keep having conflicts.
To check this, I'd try to malloc a dummy buffer between pbuff_1 and pbuff_2, make it 2k large, and hope that it breaks the alignment.
EDIT2:
Ok, since this works, it's time to elaborate a little. Say you assign two 4k arrays at ranges 0x1000-0x1fff and 0x2000-0x2fff. Set 0 in your L1 will contain the lines at 0x1000 and 0x2000, set 1 will contain 0x1040 and 0x2040, and so on. At these sizes you don't have any issue with thrashing yet; they can all coexist without overflowing the associativity of the cache. However, every time you perform an iteration you have a load and a store accessing the same set. I'm guessing this may cause a conflict in the HW. Worse, you'll need multiple iterations to copy a single line, meaning that you have a congestion of 8 loads + 8 stores (less if you vectorize, but still a lot), all directed at the same poor set. I'm pretty sure there are a bunch of collisions hiding there.
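The set mapping described above can be checked with a few lines of arithmetic. This is just a sketch under the geometry assumed in this answer (32 KiB L1 data cache, 8 ways) plus an assumed 64-byte line size; the constant and function names are made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed L1 data cache geometry: 32 KiB, 8-way, 64-byte lines.
constexpr std::uint64_t kCacheBytes = 32 * 1024;
constexpr std::uint64_t kWays = 8;
constexpr std::uint64_t kLineBytes = 64;
constexpr std::uint64_t kSets = kCacheBytes / (kWays * kLineBytes);  // 64 sets

// The set index is the cache line number modulo the number of sets.
constexpr std::uint64_t set_index(std::uint64_t address) {
    return (address / kLineBytes) % kSets;
}

// Addresses that are a multiple of kSets * kLineBytes (4 KiB here) apart
// land in the same set.
constexpr bool same_set(std::uint64_t a, std::uint64_t b) {
    return set_index(a) == set_index(b);
}

static_assert(kSets == 64, "32 KiB / (8 ways * 64 B) = 64 sets");
```

With these numbers, 0x1000 and 0x2000 both map to set 0, while shifting the second buffer by 2 KiB (to 0x2800) moves its first line to set 32, which is why the dummy buffer breaks the alignment.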
I also see that Intel optimization guide has something to say specifically about that (see 3.6.8.2):
4-KByte memory aliasing occurs when the code accesses two different
memory locations with a 4-KByte offset between them. The 4-KByte
aliasing situation can manifest in a memory copy routine where the
addresses of the source buffer and destination buffer maintain a
constant offset and the constant offset happens to be a multiple of
the byte increment from one iteration to the next.
...
loads have to wait until stores have been retired before they can
continue. For example at offset 16, the load of the next iteration is
4-KByte aliased current iteration store, therefore the loop must wait
until the store operation completes, making the entire loop
serialized. The amount of time needed to wait decreases with larger
offset until offset of 96 resolves the issue (as there is no pending
stores by the time of the load with same address).
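A simplified way to express the condition from the quoted passage in code is shown below. It assumes that "4-KByte aliased" means two different addresses whose low 12 bits are equal, and it models only the address check, not the actual store buffer behaviour; `aliases_4k` and `next_load_aliases_store` are names invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// True when two different addresses share their low 12 bits, i.e. they are
// a non-zero multiple of 4 KiB apart.
constexpr bool aliases_4k(std::uint64_t a, std::uint64_t b) {
    return a != b && ((a ^ b) & 0xFFFu) == 0;
}

// Simplified model of a copy loop that advances by `step` bytes per iteration:
// does the load of the next iteration (src + step) alias the store of the
// current iteration (dst)?
constexpr bool next_load_aliases_store(std::uint64_t src, std::uint64_t dst,
                                       std::uint64_t step) {
    return aliases_4k(src + step, dst);
}
```

For example, with 16-byte steps and a destination that sits 4 KiB + 16 bytes after the source, the next load aliases the current store, which matches the "offset 16" case in the quote.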
I expect it's because:
When the block size is a 4KB multiple, then malloc allocates new pages from the O/S.
When the block size is not a 4KB multiple, then malloc allocates a range from its (already allocated) heap.
When the pages are allocated from the O/S then they are 'cold': touching them for the first time is very expensive.
My guess is that, if you do a single memcpy before the first gettimeofday then that will 'warm' the allocated memory and you won't see this problem. Instead of doing an initial memcpy, even writing one byte into each allocated 4KB page might be enough to pre-warm the page.
Usually when I want a performance test like yours I code it as:
// Run it once to pre-warm the cache
runTest();
// Repeat
startTimer();
for (int i = count; i; --i)
runTest();
stopTimer();
// use a larger count if the duration is less than a few seconds
// repeat test 3 times to ensure that results are consistent
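A concrete version of this pattern for the memcpy test might look like the sketch below. It assumes 4 KiB pages, and `prewarm` and `time_memcpy` are hypothetical helper names; the fill byte 0xAB is arbitrary. An aggressive optimizer could in principle still simplify the repeated copies, so treat the numbers with care.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Write one byte into every page so the OS maps it before timing starts.
// Assumption: 4 KiB pages.
void prewarm(unsigned char* buf, std::size_t size, std::size_t page_size = 4096) {
    for (std::size_t i = 0; i < size; i += page_size) {
        buf[i] = 0;
    }
}

// Returns the seconds spent on `iters` copies of `size` bytes, measured only
// after both buffers have been touched and one untimed copy has run.
double time_memcpy(std::size_t size, unsigned long iters) {
    assert(size > 0);
    unsigned char* src = static_cast<unsigned char*>(std::malloc(size));
    unsigned char* dst = static_cast<unsigned char*>(std::malloc(size));
    assert(src != nullptr && dst != nullptr);

    std::memset(src, 0xAB, size);   // initializes and touches every source page
    prewarm(dst, size);             // touches every destination page
    std::memcpy(dst, src, size);    // untimed run to warm caches and TLB

    const auto start = std::chrono::steady_clock::now();
    for (unsigned long i = 0; i < iters; ++i) {
        std::memcpy(dst, src, size);
    }
    const auto stop = std::chrono::steady_clock::now();

    assert(dst[size - 1] == 0xAB);  // sanity check that the copy happened
    std::free(src);
    std::free(dst);
    return std::chrono::duration<double>(stop - start).count();
}
```

Dividing size * iters by the returned time gives the throughput; as noted above, use a larger iteration count if the duration is too short to be reliable.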
Since you are looping many times, I think arguments about pages not being mapped are irrelevant. In my opinion, what you are seeing is the effect of the hardware prefetcher being unwilling to cross a page boundary, in order not to cause (potentially unnecessary) page faults.