My pc has a 10th gen Core i7 vPRO with virtualization enabled. 8 cores + 8 virtual cores. (i7-10875H, Comet Lake)
Each physical core is split into pairs, so Core 1 hosts virtual cores 0 & 1, core 2 hosts virtual cores 2 & 3. I've noticed that in task manager, the first item of each core pair seems to be the preferred core, judging by the higher usage. I do set some affinities manually for certain heavy programs but I always set these in groups of 4, either from 0-3, 4-7, 8-11, 12-15, and never mismatch different logical processors.
I'm wondering why this behaviour happens - do the even numbered cores equate to physical cores, which could be slightly faster? If so, would I get slightly better clock speeds without virtualisation if I'm running programs that don't have a high thread count?
In general (for "scheduler theory"):
if you care about performance, spread the tasks across physical cores where possible. This prevents a "2 tasks run slower because they're sharing a physical core, while a whole physical core is idle" situation.
if you care about power consumption and not performance, make tasks use logical processors in the same physical core where possible. This may allow you to put entire core/s into a very power efficient "do nothing" state.
if you care about security (and not performance or power consumption), don't let unrelated tasks use logical processors in the same physical core at all (because information, like what kinds of instructions are currently being used, can be "leaked" from one logical processor to another logical processor in the same physical core). Note that it would be fine for related tasks to use logical processors in the same physical core (e.g. 2 threads that belong to the same process and trust each other, but not threads that belong to different processes that don't trust each other).
Of course a good OS would know the preference for each task (whether each task cares about performance, power consumption or security), and would make intelligent decisions to handle a mixture of tasks with different preferences. Sadly there are no good operating systems - most operating systems and APIs were designed in the 1990s or earlier (back when SMP was just starting and all CPUs were identical anyway) and lack the information about tasks that would be necessary to make intelligent decisions; so they assume performance is the only thing that matters for all tasks, leading to the "tasks spread across physical cores where possible, even when it's not ideal" behavior you're seeing.
My guess is that's due to hyperthreading.
Hyperthreading doesn't double CPU capacity (according to Intel, it adds ~30% on average), so it makes sense to spread the work among physical cores first, and use hyperthreading as a last resort when the overall CPU demand starts exceeding 50%.
Fun fact: if hyperthreading adds ~30%, a reported 50% overall CPU load on a hyperthreaded system is in fact roughly 75% of what the CPU can deliver (1 / 1.3 ≈ 77%), and the remaining reported 50% equates to the remaining ~25%.
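To make the arithmetic behind those percentages explicit, here is a rough sketch. It assumes a simplified model that does not come from Intel or from the OS: there are 2 logical processors per core, the scheduler fills one logical processor per physical core first, a lone thread on a core counts as 1.0 unit of throughput, and a second thread on the same core adds only `smtGain` (0.3 for the ~30% figure). `effectiveLoad` is a made-up helper name.

```cpp
#include <algorithm>

// Fraction of the machine's real throughput in use, given the load the OS
// reports (0.0 .. 1.0 of all logical processors busy).
// Assumed model: physical cores are filled first, and the second hyperthread
// on a core adds only `smtGain` extra work.
double effectiveLoad(double reportedLoad, double smtGain = 0.3) {
    reportedLoad = std::min(1.0, std::max(0.0, reportedLoad));
    const double capacityPerCore = 1.0 + smtGain;                          // both siblings busy
    const double firstThreads  = std::min(reportedLoad, 0.5) * 2.0;        // share of cores running >= 1 thread
    const double secondThreads = std::max(reportedLoad - 0.5, 0.0) * 2.0;  // share of cores running 2 threads
    return (firstThreads + secondThreads * smtGain) / capacityPerCore;
}
```

Under these assumptions `effectiveLoad(0.5)` comes out at about 0.77, i.e. a machine that reports 50% load has already used roughly three quarters of what it can deliver.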
If we query the OS to see how logical processors are assigned to cores [1], we will see a situation like this:
Core 0: mask 0x3
Core 1: mask 0xc
Core 2: mask 0x30
Core 3: mask 0xc0
. . .
That means logical processors 0 and 1 are on core 0, 2 and 3 on core 1, etc.
You can disable hyperthreading in the BIOS. But since it adds performance, it's a nice-to-have feature. You just need to be careful not to pin work such that it competes for the same core.
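As an illustration of how the per-core masks can be used to avoid pinning two busy threads onto the same core, here is a small sketch. The helper names are invented for this example, and it assumes a single processor group (at most 64 logical processors) with the masks already collected into a vector.

```cpp
#include <cstdint>
#include <vector>

// Indices of the logical processors whose bits are set in a core's mask,
// e.g. 0xc -> {2, 3}.
std::vector<int> logicalProcessors(std::uint64_t coreMask) {
    std::vector<int> ids;
    for (int bit = 0; bit < 64; ++bit)
        if (coreMask & (std::uint64_t{1} << bit))
            ids.push_back(bit);
    return ids;
}

// Affinity mask that selects only the lowest logical processor of each core,
// so that threads restricted to it never share a physical core.
std::uint64_t onePerCoreMask(const std::vector<std::uint64_t>& coreMasks) {
    std::uint64_t result = 0;
    for (std::uint64_t m : coreMasks)
        result |= m & (~m + 1);   // isolate the lowest set bit of this core's mask
    return result;
}

// True if two logical processors (each in 0..63) are siblings on the same core.
bool shareCore(const std::vector<std::uint64_t>& coreMasks, int cpuA, int cpuB) {
    for (std::uint64_t m : coreMasks)
        if (((m >> cpuA) & 1) && ((m >> cpuB) & 1))
            return true;
    return false;
}
```

With the four masks listed above, `onePerCoreMask` yields 0x55 (logical processors 0, 2, 4 and 6), which could then be handed to an affinity call such as SetThreadAffinityMask.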
[1] To check core assignment I use the small C program below. The information might also be available via WMIC.
#include <stdio.h>
#include <stdlib.h>
#undef _WIN32_WINNT
#define _WIN32_WINNT 0x601   /* Windows 7+, needed for GetLogicalProcessorInformationEx */
#include <Windows.h>

int main(void) {
    DWORD len = 65536;
    char *buf = (char*)malloc(len);
    if (!buf)
        return 1;
    /* On success, len is updated to the number of bytes actually written. */
    if (!GetLogicalProcessorInformationEx(RelationProcessorCore, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len)) {
        return GetLastError();
    }
    /* The records are variable-sized, so step through the buffer by info->Size. */
    size_t i = 0;
    for (DWORD n = 0; n < len; i++) {
        PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info = (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)(buf + n);
        if (info->Relationship == RelationProcessorCore) {
            printf("Core %zu:", i);
            for (int j = 0; j < info->Processor.GroupCount; j++)
                printf(" mask 0x%llx", (unsigned long long)info->Processor.GroupMask[j].Mask);
            printf("\n");
        }
        n += info->Size;
    }
    free(buf);
    return 0;
}
Let's imagine a software developer whose goal is to get the absolute maximum performance out of a CPU.
Today's CPUs have many cores, we can load data into cache for faster processing, and we also have SIMD instructions (AVX, for example) that let us add, multiply, or otherwise operate on an array of items at once (e.g. multiply 8 integers in one CPU clock). The disadvantage of these instructions is the cost of sending data and instructions to the SIMD module, plus the overhead of converting vector types to primitive types (sorry, I'm only familiar with C#'s Vector). (We're not looking at code complexity for now.)
As far as I understand, while we are using SIMD, the main CPU registers are used only for sending data to and receiving data from the SIMD registers, and the main ALU blocks used for general-purpose calculations are idle during that time.
And here is my question: does using SIMD instructions load the main CPU blocks? For example, if we have a huge number of different calculations (let's imagine 40% of them are best run on SIMD and 60% of them are better run as usual), will SIMD give us a performance boost like this: 100% of all cores' performance + n% of SIMD's boost?
I'm asking because, for example, with GPGPU we can use the GPU for parallel calculations while the CPU is used only for sending and receiving data, so it is idle most of the time and we can use its performance for latency-sensitive tasks.
Looks like this is a question about out-of-order execution? Modern x64 CPUs have a number of execution ports, and each can dispatch a new instruction per clock cycle (so about 8 CPU ops can run in parallel on an Intel Skylake). Some of those ports handle memory loads/stores, some handle integer arithmetic, and some handle the SIMD instructions.
So for example, you may be able to dispatch 2 AVX float mults, an AVX bitwise op, 2 AVX loads, a single AVX store, and a couple of bits of pointer arithmetic on the general purpose registers in a single cycle [you will still have to wait for each operation to complete - the latency]. So in theory, as long as there aren't horrific dependency chains in the code, with some care you should be able to keep each of those ports busy (or at least, that's the basic aim!).
Simple Rule 1: The busier you can keep the execution ports, the faster your code goes. This should be self evident. If you can keep 8 ports busy, you're doing 8 times more than if you can only keep 1 busy. In general though, it's mostly not worth worrying about (yes, there are always exceptions to the rule)
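To illustrate what a "dependency chain" means for Rule 1, here is a minimal sketch in plain scalar C++ (no intrinsics). Both function names are invented for this example, and whether the second version is actually faster depends on the compiler and CPU, so treat it as something to measure rather than a guarantee.

```cpp
#include <cstddef>

// One accumulator: every add depends on the previous one, so the loop is
// limited by the latency of a single floating-point add.
float sumSingleChain(const float* data, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += data[i];
    return sum;
}

// Four independent accumulators: the four adds in each iteration do not depend
// on each other, so the out-of-order core is free to overlap them.
float sumFourChains(const float* data, std::size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += data[i];
        s1 += data[i + 1];
        s2 += data[i + 2];
        s3 += data[i + 3];
    }
    for (; i < n; ++i)      // leftover elements when n is not a multiple of 4
        s0 += data[i];
    return (s0 + s1) + (s2 + s3);
}
```

Note that the two versions add the values in a different order, so for general floating-point input the results can differ in the last bits; that is also why a compiler will not normally make this change on its own without a fast-math style option.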
Simple Rule 2: When the SIMD execution ports are in use, the ALU doesn't suddenly become idle [A slight terminology error on your part here: The ALU is simply the bit of the CPU that does arithmetic. The computation for general purpose ops is done on an ALU, but it's also correct to call a SIMD unit an ALU. What you meant to ask is: do the general purpose parts of the CPU power down when SIMD units are in use? To which the answer is no... ]. Consider this AVX2 optimised method (which does nothing interesting!)
#include <immintrin.h>
typedef __m256 float8;
#define mul8f _mm256_mul_ps
void computeThing(float8 a[], float8 b[], float8 c[], int count)
{
for(int i = 0; i < count; ++i)
{
a[i] = mul8f(a[i], b[i]);
b[i] = mul8f(b[i], c[i]);
}
}
Since there are no dependencies between a, b, and c (which I should really be explicit about by specifying __restrict), then the two SIMD multiply instructions can both be dispatched in a single clock cycle (since there are two execution ports that can handle floating point multiply).
The General Purpose ALU doesn't suddenly power down here - The general purpose registers & instructions are still being used!
1. to compute memory addresses (for: a[i], b[i], c[i])
2. to load/store into those memory locations
3. to increment the loop counter
4. to test if the count has been reached?
It just so happens that we are also making use of the SIMD units to do a couple of multiplications...
Simple Rule 3: For floating point operations, using 'float' or '__m256' makes next to no difference. Exactly the same CPU hardware computes both float and float8 types. There are simply a couple of bits in the machine code encoding that specify the choice between float/__m128/__m256.
i.e. https://godbolt.org/z/xTcLrf
It's my understanding that if two threads are reading from the same piece of memory, and no thread is writing to that memory, then the operation is safe. However, I'm not sure what happens if one thread is reading and the other is writing. What would happen? Is the result undefined? Or would the read just be stale? If a stale read is not a concern is it ok to have unsynchronized read-write to a variable? Or is it possible the data would be corrupted, and neither the read nor the write would be correct and one should always synchronize in this case?
I want to say that I've learned it is the latter case, that a race on memory access leaves the state undefined... but I don't remember where I may have learned that and I'm having a hard time finding the answer on Google. My intuition is that a variable is operated on in registers, and that true (as in hardware) concurrency is impossible (or is it?), so that the worst that could happen is stale data, i.e. the following:
WriteThread: copy value from memory to register
WriteThread: update value in register
ReadThread: copy value of memory to register
WriteThread: write new value to memory
At which point the read thread has stale data.
Usually memory is read or written in atomic units determined by the CPU architecture (32-bit and 64-bit items aligned on 32-bit and 64-bit boundaries are common these days).
In this case, what happens depends on the amount of data being written.
Let's consider the case of 32 bit atomic read/write cells.
If two threads write 32 bits into such an aligned cell, then it is absolutely well defined what happens: one of the two written values is retained. Unfortunately for you (well, the program), you don't know which value. By extremely clever programming, you can actually use this atomicity of reads and writes to build synchronization algorithms (e.g., Dekker's algorithm), but it is faster typically to use architecturally defined locks instead.
If two threads write more than an atomic unit (e.g., they both write a 128 bit value), then in fact the atomic unit sized pieces of the values written will be stored in an absolutely well defined way, but you won't know which pieces of which value get written in what order. So what may end up in storage is the value from the first thread, the second thread, or mixes of the bits in atomic unit sizes from both threads.
Similar ideas hold for one thread reading, and one thread writing in atomic units, and larger.
Basically, you don't want to do unsynchronized reads and writes to memory locations, because you won't know the outcome, even though it may be very well defined by the architecture.
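For C++ specifically (C++11 and later), the language goes further than the hardware: an unsynchronized read and write of the same non-atomic variable from two threads is a data race and the behaviour is undefined, regardless of what the CPU would do. A minimal sketch of the synchronized alternative, using std::atomic, is below; `countWithAtomics` is just an illustrative name.

```cpp
#include <atomic>
#include <thread>

// Two threads increment one shared counter. Because the counter is a
// std::atomic, each fetch_add is a single indivisible read-modify-write.
long countWithAtomics(int incrementsPerThread) {
    std::atomic<long> counter{0};
    auto work = [&counter, incrementsPerThread] {
        for (int i = 0; i < incrementsPerThread; ++i)
            counter.fetch_add(1, std::memory_order_relaxed);
    };
    std::thread a(work);
    std::thread b(work);
    a.join();   // joining also makes both threads' updates visible here
    b.join();
    return counter.load();
}
```

The result is always exactly twice `incrementsPerThread`; with a plain `long` and `counter++` instead, increments could be lost and the program would have undefined behaviour.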
The result is undefined. Corrupted data is entirely possible. For an obvious example, consider a 64-bit value being manipulated by a 32-bit processor. Let's assume the value is a simple counter, and we increment it when the lower 32-bits contain 0xffffffff. The increment produces 0x00000000. When we detect that, we increment the upper word. If, however, some other thread read the value between the time the lower word was incremented and the upper word was incremented, they get a value with an un-incremented upper word, but the lower word set to 0 -- a value completely different from what it would have been either before or after the increment is complete.
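The counter example can be replayed step by step. The sketch below is a single-threaded simulation (no real race is involved) that performs the increment in two 32-bit halves and records what a reader would observe if it looked between the two steps. The struct and function names are invented for illustration.

```cpp
#include <cstdint>

struct TornIncrement {
    std::uint64_t before;   // value before the increment starts
    std::uint64_t between;  // what a reader sees after only the low word was bumped
    std::uint64_t after;    // value once the carry has reached the high word
};

// Replays a 64-bit increment the way a 32-bit CPU does it: low word first,
// then the carry into the high word as a separate step.
TornIncrement incrementInTwoSteps(std::uint64_t value) {
    std::uint32_t low  = static_cast<std::uint32_t>(value);
    std::uint32_t high = static_cast<std::uint32_t>(value >> 32);
    TornIncrement t;
    t.before = value;
    low += 1;                                            // step 1: wraps to 0 on 0xffffffff
    t.between = (static_cast<std::uint64_t>(high) << 32) | low;
    if (low == 0)
        high += 1;                                       // step 2: propagate the carry
    t.after = (static_cast<std::uint64_t>(high) << 32) | low;
    return t;
}
```

Starting from 0x00000000ffffffff, the in-between value is 0, which is neither the old value nor the new one (0x0000000100000000). Making the counter a `std::atomic<std::uint64_t>` is one way to rule out such a torn read in C++.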
As I hinted in Ira Baxter's answer, CPU cache also plays a part on multicore systems. Consider the following test code:
DANGER WILL ROBINSON!
The following code boosts priority to realtime to achieve somewhat more consistent results. Doing so requires admin privileges. Be careful if running the code on dual- or single-core systems, since your machine will lock up for the duration of the test run.
#include <windows.h>
#include <stdio.h>
const int RUNFOR = 5000;
volatile bool terminating = false;
volatile int value;
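static DWORD WINAPI CountErrors(LPVOID parm)
{
    const int id = (int)(INT_PTR)parm;   // thread tag passed through the pointer argument
    int errors = 0;
    while(!terminating)
    {
        value = id;
        if(value != id)   // the other thread overwrote 'value' between our store and load
            errors++;
    }
    printf("\tThread %08X: %d errors\n", (unsigned)id, errors);
    return 0;
}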
static void RunTest(int affinity1, int affinity2)
{
terminating = false;
DWORD dummy;
HANDLE t1 = CreateThread(0, 0, CountErrors, (void*)0x1000, CREATE_SUSPENDED, &dummy);
HANDLE t2 = CreateThread(0, 0, CountErrors, (void*)0x2000, CREATE_SUSPENDED, &dummy);
SetThreadAffinityMask(t1, affinity1);
SetThreadAffinityMask(t2, affinity2);
ResumeThread(t1);
ResumeThread(t2);
printf("Running test for %d milliseconds with affinity %d and %d\n", RUNFOR, affinity1, affinity2);
Sleep(RUNFOR);
terminating = true;
Sleep(100); // let threads have a chance of picking up the "terminating" flag.
}
int main()
{
SetPriorityClass(GetCurrentProcess(), REALTIME_PRIORITY_CLASS);
RunTest(1, 2); // core 1 & 2
RunTest(1, 4); // core 1 & 3
RunTest(4, 8); // core 3 & 4
RunTest(1, 8); // core 1 & 4
}
On my Quad-core intel Q6600 system (which iirc has two sets of cores where each set share L2 cache - would explain the results anyway ;)), I get the following results:
Running test for 5000 milliseconds with affinity 1 and 2
Thread 00002000: 351883 errors
Thread 00001000: 343523 errors
Running test for 5000 milliseconds with affinity 1 and 4
Thread 00001000: 48073 errors
Thread 00002000: 59813 errors
Running test for 5000 milliseconds with affinity 4 and 8
Thread 00002000: 337199 errors
Thread 00001000: 335467 errors
Running test for 5000 milliseconds with affinity 1 and 8
Thread 00001000: 55736 errors
Thread 00002000: 72441 errors
We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.
Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.
Any ideas why this might be happening? The code that follows is compiled in Release in VS2015, and reports average time to complete each memcpy at:
64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E
32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.
We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?
We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.
#include <cstdlib>   // malloc
#include <cstring>   // memcpy
#include <Windows.h>
#include <iostream>
//Prevent the memcpy from being optimized out of the for loop
_declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}
int main()
{
const int SIZE_OF_BLOCKS = 25000000;
const int NUMBER_ITERATIONS = 100;
void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
while (true)
{
LONGLONG total = 0;
LONGLONG max = 0;
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
for (int i = 0; i < NUMBER_ITERATIONS; ++i)
{
QueryPerformanceCounter(&StartingTime);
MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
total += ElapsedMicroseconds.QuadPart;
max = max(ElapsedMicroseconds.QuadPart, max);
}
std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
}
getchar();
}
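For reference, the reported times can be turned into throughput figures. This is only a back-of-the-envelope helper (the name is made up), and it counts just the 25,000,000 bytes copied per call, not the separate read and write traffic that a memcpy generates.

```cpp
// Copy throughput in GB/s (10^9 bytes per second) for one memcpy of `bytes`
// bytes that took `milliseconds`. Only the bytes copied are counted, not the
// read traffic plus the write traffic.
double copyThroughputGBps(double bytes, double milliseconds) {
    return bytes / (milliseconds * 1e-3) / 1e9;
}
```

With the numbers above, 2.2 ms corresponds to roughly 11.4 GB/s copied on Skylake, and 4.5 ms to roughly 5.6 GB/s on Broadwell-E.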
Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).
(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including @BeeOnRope's testing of multiple store streams)
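The max_concurrency / latency relationship can be written down directly. The sketch below is a deliberately simplified model: it assumes 64-byte cache lines and that the line-fill buffers are the only limit, and the latency values you plug in are your own assumptions or measurements, not figures from this answer. The function name is invented.

```cpp
// Upper bound on single-core bandwidth in GB/s when at most `buffersInFlight`
// cache lines can be outstanding and each takes `latencyNs` to arrive.
// Bytes per nanosecond is numerically equal to GB/s (10^9 bytes per second).
double latencyBoundGBps(int buffersInFlight, double latencyNs, int lineBytes = 64) {
    return buffersInFlight * lineBytes / latencyNs;
}
```

For example, with a hypothetical 80 ns memory latency, 10 buffers give 8 GB/s and 12 buffers give 9.6 GB/s; raising the latency to a hypothetical 100 ns drops the 10-buffer case to 6.4 GB/s. That is the mechanism by which a higher-latency chip ends up with lower single-threaded bandwidth.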
Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).
SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)
(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)
A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).
I finally got VTune (evaluation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.
It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.
I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.
Whenever a cache miss occurs, is it possible to know the address of that missed cache line? Are there any hardware performance counters in modern processors that can provide such information?
Yes, on modern Intel hardware there are precise memory sampling events that track not only the address of the instruction, but the data address as well. These events also include a great deal of other information, such as which level of the cache hierarchy the memory access was satisfied in, the total latency and so on.
You can use perf mem to sample this information and produce a report.
For example, the following program:
#include <stddef.h>
#define SIZE (100 * 1024 * 1024)
int p[SIZE] = {1};
void do_writes(volatile int *p) {
for (size_t i = 0; i < SIZE; i += 5) {
p[i] = 42;
}
}
void do_reads(volatile int *p) {
volatile int sink;
for (size_t i = 0; i < SIZE; i += 5) {
sink = p[i];
}
}
int main(int argc, char **argv) {
do_writes(p);
do_reads(p);
}
compiled with:
g++ -g -O1 -march=native perf-mem-test.cpp -o perf-mem-test
and run with:
sudo perf mem record -U ./perf-mem-test && sudo perf mem report
Produces a report of memory accesses sorted by latency like this:
The Data Symbol column shows which address the load was targeting - most here show up as something like p+0xa0658b4, which means an offset of 0xa0658b4 from the start of p. That makes sense, as the code is reading and writing p. The list is sorted by "local weight", which is the access latency in reference cycles [1].
Note that the information recorded is only a sample of memory accesses: recording every miss would usually be way too much information. Furthermore, it only records loads with a latency of 30 cycles or more by default, but you can apparently tweak this with command line arguments.
If you're only interested in accesses that miss in all levels of cache, you're looking for the "Local RAM hit" lines [2]. Perhaps you can restrict your sampling to only cache misses - I'm pretty sure the Intel memory sampling stuff supports that, and I think you can tell perf mem to look at only misses.
Finally, note that here I'm using the -U argument after record which instructs perf mem to only record userspace events. By default it will include kernel events, which may or may not be useful for you. For the example program, there are many kernel events associated with copying the p array from the binary into writable process memory.
Keep in mind that I specifically arranged my program such that the global array p ended up in the initialized .data section (the binary is ~400 MB!), so that it shows up with the right symbol in the listing. The vast majority of the time your process is going to be accessing dynamically allocated or stack memory, which will just give you a raw address. Whether you can map this back to a meaningful object depends on if you track enough information to make that possible.
[1] I think it's in reference cycles, but I could be wrong and the kernel may have already converted it to nanoseconds?
[2] The "Local" and "hit" parts here refer to the fact that we hit the RAM attached to the current core, i.e., we didn't have to go to the RAM associated with another socket in a multi-socket NUMA configuration.
If you want to know the exact virtual or physical address of every cache miss on a particular processor, that would be very hard and sometimes impossible. But you are more likely to be interested in expensive memory access patterns; those patterns that incur large latencies because they miss in one or more levels of the cache subsystem. Note that it is important to keep in mind that a cache miss on one processor might be a cache hit on another depending on design details of each processor and depending also on the operating system.
There are several ways to find such patterns, two are commonly used. One is to use a simulator such as gem5 or Sniper. Another is to use hardware performance events. Events that represent cache misses are available but they do not provide any details on why or where a miss occurred. However, using a profiler, you can approximately associate cache misses as reported by the corresponding hardware performance events with the instructions that caused them which in turn can be mapped back to locations in the source code using debug information. Examples of such profilers include Intel VTune Amplifier and AMD CodeXL. The results produced by simulators and profilers may not be accurate and so you have to be careful when interpreting them.
I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
const __global char* pSrc,
__global __write_only char* pDst,
int length)
{
const int tid = get_global_id(0);
if(tid < length) {
pDst[tid] = pSrc[tid];
}
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Ivy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
I think your main problem is the 2048*2048 work groups you are using. The OpenCL drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program on a GPU, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. see question: What should this size be? I have used 64 below as an example. 64 happens to be a decent number on most hardware.
const ::size_t myOptimalGroupSize = 64;  // plain size_t; cl::size_t in cl.hpp is a class template, not an integer type
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(myOptimalGroupSize),
NULL,
&event);
event.wait();
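One thing to watch when picking an explicit work group size: in OpenCL 1.x the global size has to be a multiple of the work group size, otherwise the enqueue fails. 2048*2048 happens to be divisible by 64, but for an arbitrary length the global size can be padded up, and the existing `if(tid < length)` check in the kernel makes the extra work items harmless. A minimal helper for that, with an invented name, might look like this:

```cpp
#include <cstddef>

// Smallest multiple of `groupSize` that is >= `workItems`.
// `groupSize` must be non-zero.
std::size_t roundUpToMultiple(std::size_t workItems, std::size_t groupSize) {
    return (workItems + groupSize - 1) / groupSize * groupSize;
}
```

The padded value would then be used for the global range, e.g. `cl::NDRange(roundUpToMultiple(length, 64))`, while `length` itself is still passed to the kernel as the bound.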
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my limited experience with OpenCL on CPU, I have never reached the performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel on a CPU is to divide the block to copy into a small number of large sub-blocks, and let each thread copy one sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different threads can access contiguous memory regions efficiently (coalescing).
To do an efficient copy on CPU with OpenCL, use a loop inside your kernel to copy contiguous data. And then use a workgroup size not larger than the number of available cores.
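To show the "few large sub-blocks, one per thread" idea concretely, here is a sketch using plain C++ threads rather than OpenCL or OpenMP (a deliberate substitution, to keep the example self-contained). `parallelCopy` is a hypothetical helper, and no claim is made that it beats a single memcpy on any particular machine.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Copies `size` bytes from src to dst using up to `threadCount` threads, each
// of which copies one contiguous sub-block with a plain memcpy.
void parallelCopy(char* dst, const char* src, std::size_t size, unsigned threadCount) {
    threadCount = std::max(1u, threadCount);
    const std::size_t chunk = (size + threadCount - 1) / threadCount;  // ceil(size / threadCount)
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t begin = std::min(size, t * chunk);
        const std::size_t end   = std::min(size, begin + chunk);
        if (begin == end)
            break;                            // nothing left for the remaining threads
        workers.emplace_back([=] { std::memcpy(dst + begin, src + begin, end - begin); });
    }
    for (std::thread& w : workers)
        w.join();
}
```

A reasonable starting point for `threadCount` is `std::thread::hardware_concurrency()`, though for a memory-bound copy a smaller number of threads may already saturate the available bandwidth.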
I believe it is the cl::NDRange(1) which is telling the runtime to use single-item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; in the C++ API you can do the same by passing cl::NullRange as the local size. This should be faster on the CPU; it certainly will be on a GPU.