While I was practicing lock-based vs. lock-free concurrency, I realised that my lock-based function takes less time than a function with no synchronisation:
// A simple money transfer function
void transfer(unsigned int from, unsigned int to, unsigned int amount)
{
lock_guard<mutex> lock(m); // removing this line makes it run slower!
accounts[from] -= amount;
accounts[to] += amount;
}
Here is the coliru link of the full code. What might be the reason?
Note: I am aware of the data-race issue!
UPDATE: My observation was just based on Coliru. So I run the same code on a MacBook Pro (Retina, Mid 2012) 2,6 GHz Intel Core i7 16 GB 1600 MHz DDR3 and realised that no-locking one runs twice faster than the lock-based one. But the lock-based one runs a little bit faster on Coliru consistently.
Related
We've got a simple memory throughput benchmark. All it does is memcpy repeatedly for a large block of memory.
Looking at the results (compiled for 64-bit) on a few different machines, Skylake machines do significantly better than Broadwell-E, keeping OS (Win10-64), processor speed, and RAM speed (DDR4-2133) the same. We're not talking a few percentage points, but rather a factor of about 2. Skylake is configured dual-channel, and the results for Broadwell-E don't vary for dual/triple/quad-channel.
Any ideas why this might be happening? The code that follows is compiled in Release in VS2015, and reports average time to complete each memcpy at:
64-bit: 2.2ms for Skylake vs 4.5ms for Broadwell-E
32-bit: 2.2ms for Skylake vs 3.5ms for Broadwell-E.
We can get greater memory throughput on a quad-channel Broadwell-E build by utilizing multiple threads, and that's nice, but to see such a drastic difference for single-threaded memory access is frustrating. Any thoughts on why the difference is so pronounced?
We've also used various benchmarking software, and they validate what this simple example shows - single-threaded memory throughput is way better on Skylake.
#include <memory>
#include <Windows.h>
#include <iostream>
//Prevent the memcpy from being optimized out of the for loop
_declspec(noinline) void MemoryCopy(void *destinationMemoryBlock, void *sourceMemoryBlock, size_t size)
{
memcpy(destinationMemoryBlock, sourceMemoryBlock, size);
}
int main()
{
const int SIZE_OF_BLOCKS = 25000000;
const int NUMBER_ITERATIONS = 100;
void* sourceMemoryBlock = malloc(SIZE_OF_BLOCKS);
void* destinationMemoryBlock = malloc(SIZE_OF_BLOCKS);
LARGE_INTEGER Frequency;
QueryPerformanceFrequency(&Frequency);
while (true)
{
LONGLONG total = 0;
LONGLONG max = 0;
LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds;
for (int i = 0; i < NUMBER_ITERATIONS; ++i)
{
QueryPerformanceCounter(&StartingTime);
MemoryCopy(destinationMemoryBlock, sourceMemoryBlock, SIZE_OF_BLOCKS);
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
total += ElapsedMicroseconds.QuadPart;
max = max(ElapsedMicroseconds.QuadPart, max);
}
std::cout << "Average is " << total*1.0 / NUMBER_ITERATIONS / 1000.0 << "ms" << std::endl;
std::cout << "Max is " << max / 1000.0 << "ms" << std::endl;
}
getchar();
}
Single-threaded memory bandwidth on modern CPUs is limited by max_concurrency / latency of the transfers from L1D to the rest of the system, not by DRAM-controller bottlenecks. Each core has 10 Line-Fill Buffers (LFBs) which track outstanding requests to/from L1D. (And 16 "superqueue" entries which track lines to/from L2).
(Update: experiments show that Skylake probably has 12 LFBs, up from 10 in Broadwell. e.g. Fig7 in the ZombieLoad paper, and other performance experiments including #BeeOnRope's testing of multiple store streams)
Intel's many-core chips have higher latency to L3 / memory than quad-core or dual-core desktop / laptop chips, so single-threaded memory bandwidth is actually much worse on a big Xeon, even though the max aggregate bandwidth with many threads is much better. They have many more hops on the ring bus that connects cores, memory controllers, and the System Agent (PCIe and so on).
SKX (Skylake-server / AVX512, including the i9 "high-end desktop" chips) is really bad for this: L3 / memory latency is significantly higher than for Broadwell-E / Broadwell-EP, so single-threaded bandwidth is even worse than on a Broadwell with a similar core count. (SKX uses a mesh instead of a ring bus because that scales better, see this for details on both. But apparently the constant factors are bad in the new design; maybe future generations will have better L3 bandwidth/latency for small / medium core counts. The private per-core L2 is bumped up to 1MiB though, so maybe L3 is intentionally slow to save power.)
(Skylake-client (SKL) like in the question, and later quad/hex-core desktop/laptop chips like Kaby Lake and Coffee Lake, still use the simpler ring-bus layout. Only the server chips changed. We don't yet know for sure what Ice Lake client will do.)
A quad or dual core chip only needs a couple threads (especially if the cores + uncore (L3) are clocked high) to saturate its memory bandwidth, and a Skylake with fast DDR4 dual channel has quite a lot of bandwidth.
For more about this, see the Latency-bound Platforms section of this answer about x86 memory bandwidth. (And read the other parts for memcpy/memset with SIMD loops vs. rep movs/rep stos, and NT stores vs. regular RFO stores, and more.)
Also related: What Every Programmer Should Know About Memory? (2017 update on what's still true and what's changed in that excellent article from 2007).
I finally got VTune (evalutation) up and running. It gives a DRAM bound score of .602 (between 0 and 1) on Broadwell-E and .324 on Skylake, with a huge part of the Broadwell-E delay coming from Memory Latency. Given that the memory sticks are the same speed (except dual-channel configured in Skylake and quad-channel in Broadwell-E), my best guess is that something about the memory controller in Skylake is just tremendously better.
It makes buying into the Broadwell-E architecture a much tougher call, and requires that you really need the extra cores to even consider it.
I also got L3/TLB miss counts. On Broadwell-E, TLB miss count was about 20% higher, and L3 miss count about 36% higher.
I don't think this is really an answer for "why" so I won't mark it as such, but is as close as I think I'll get to one for the time being. Thanks for all the helpful comments along the way.
I am a newbie to CUDA. I simply tried to sort an array using Thrust.
clock_t start_time = clock();
thrust::host_vector<int> h_vec(10);
thrust::generate(h_vec.begin(), h_vec.end(), rand);
thrust::device_vector<int> d_vec = h_vec;
thrust::sort(d_vec.begin(), d_vec.end());
//thrust::sort(h_vec.begin(), h_vec.end());
clock_t stop_time = clock();
printf("%f\n", (double)(stop_time - start_time) / CLOCKS_PER_SEC);
Time took to sort d_vec is 7.4s, and time took to sort h_vec is 0.4s
I am assuming its parallel computation on device memory, so shouldn't it be faster ?
Probably the main problem is context creation time: the first CUDA call will initialize the CUDA context which takes some time, see here. Therefore you should start measuring time only after the first CUDA call.
In general you can only expect speed-up with GPU code compared to CPU code if the degree of parallelism is high enough. The vector size of 10 as in the example code is definitely too small to achieve speed-up. With a vector size >> 10000 you can expect to fully utilize a modern GPU.
You should also think about measuring only the time for sorting without the copy d_vec = h_vec, since often you will work with the device vector in the next step. Then you can consider the copy operation as a one time setup cost. (However if sorting is the only operation on device it is of course reasonable to include the memcopy in the measurement.)
As a test, I am trying to crunch as much GFLOPS from the GPU as possible, just to see how far we can go with compute via RenderScript.
For this I use a GPU-cache-friendly kernel that will (hopefully) not be bounded on memory access for testing purposes:
#pragma rs_fp_relaxed
rs_allocation input;
float __attribute__((kernel)) compute(float in, int x)
{
float sum = 0;
if (x < 64) return 0;
for (int i = 0; i < 64; i++) {
sum += rsGetElementAt_float(input, x - i);
}
return sum;
}
On the Java side I just call the kernel a couple of times:
for (int i = 0; i < 1024; i++) {
m_script.forEach_compute(m_inAllocation, m_outAllocation);
}
With allocation sizes of 1M floats this maxes around 1-2 GFLOPS on a GPU that should max around 100 GFLOPS (Snapdragon 600, APQ8064AB), that is 50x - 100x less compute performance !.
I have tried unrolling the loop (10% difference), using larger or smaller sums (<5% diff), different allocation sizes (<5% diff), 1D or 2D allocations (no diff), but come nowhere near the amount of GFLOPS that should be possible on the device. I even am thinking that the entire kernel is only running on the CPUs.
In similar sense, looking at the results of an RenderScript benchmark application (https://compubench.com/result.jsp?benchmark=compu20, the top of the line devices only achieve around 60M pixels/s on a Gaussian blur. A 5x5 blur in naive (non-seperable) implementation takes around 50 FLOPS/pixel, resulting in 3 GFLOPS as opposed to the 300 GFLOPS these GPUs have.
Any thoughts?
(see e.g. http://kyokojap.myweb.hinet.net/gpu_gflops/ for an overview of device capabilities)
EDIT:
Using the OpenCL libs that are available on the device (Samsung S4, 4.4.2) I have rewritten the RenderScript test program to OpenCL and run it via the NDK. With basically the same setup (1M float buffers and running the kernel 1024 times) I can now get around 25 GFLOPS, that is 10x the RenderScript performance, and 4x from the theoretical device maximum.
For RenderScript there is no way of knowing if a kernel is running on the GPU. So:
if the RenderScript kernel does run on the GPU, why is it so slow?
if the kernel is not running on the GPU, which devices do run RenderScript on the GPU (aside from most probably the Nexus line)?
Thanks.
What device are you using? Not all devices are shipping with GPU drivers yet.
Also, that kernel will be memory bound, since you've got a 1:1 arithmetic to load ratio.
I'm looking into OpenCL, and I'm a little confused why this kernel is running so slowly, compared to how I would expect it to run. Here's the kernel:
__kernel void copy(
const __global char* pSrc,
__global __write_only char* pDst,
int length)
{
const int tid = get_global_id(0);
if(tid < length) {
pDst[tid] = pSrc[tid];
}
}
I've created the buffers in the following way:
char* out = new char[2048*2048];
cl::Buffer(
context,
CL_MEM_USE_HOST_PTR | CL_MEM_WRITE_ONLY,
length,
out);
Ditto for the input buffer, except that I've initialized the in pointer to random values. Finally, I run the kernel this way:
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(1),
NULL,
&event);
event.wait();
On average, the time is around 75 milliseconds, as calculated by:
cl_ulong startTime = event.getProfilingInfo<CL_PROFILING_COMMAND_START>();
cl_ulong endTime = event.getProfilingInfo<CL_PROFILING_COMMAND_END>();
std::cout << (endTime - startTime) * SECONDS_PER_NANO / SECONDS_PER_MILLI << "\n";
I'm running Windows 7, with an Intel i5-3450 chip (Sandy Bridge architecture). For comparison, the "direct" way of doing the copy takes less than 5 milliseconds. I don't think the event.getProfilingInfo includes the communication time between the host and device. Thoughts?
EDIT:
At the suggestion of ananthonline, I changed the kernel to use float4s instead of chars, and that dropped the average run time to about 50 millis. Still not as fast as I would have hoped, but an improvement. Thanks ananthonline!
I think your main problem is the 2048*2048 work groups you are using. The opencl drivers on your system have to manage a lot more overhead if you have this many single-item work groups. This would be especially bad if you were to execute this program using a gpu, because you would get a very low level of saturation of the hardware.
Optimization: call your kernel with larger work groups. You don't even have to change your existing kernel. see question: What should this size be? I have used 64 below as an example. 64 happens to be a decent number on most hardware.
cl::size_t myOptimalGroupSize = 64;
cl::Event event;
queue.enqueueNDRangeKernel(
kernel,
cl::NullRange,
cl::NDRange(length),
cl::NDRange(myOptimalGroupSize),
NULL,
&event);
event.wait();
You should also get your kernel to do more than copy a single value. I have given an answer to a similar question about global memory over here.
CPUs are very different from GPUs. Running this on an x86 CPU, the best way to achieve decent performance would be to use double16 (the largest data type) instead of char or float4 (as suggested by someone else).
In my little experience with OpenCL on CPU, I have never reached performance levels that I could get with an OpenMP parallelization.
The best way to do a copy in parallel with a CPU would be to divide the block to copy into a small number of large sub-block, and let each thread copy a sub-block.
The GPU approach is orthogonal: each thread participates in the copy of the same block.
This is because on GPUs, different thread can access contiguous memory regions efficicently (coalescing).
To do an efficient copy on CPU with OpenCL, use a loop inside your kernel to copy contiguous data. And then use a workgroup size not larger than the number of available cores.
I believe it is the cl::NDRange(1) which is telling the runtime to use single item work groups. This is not efficient. In the C API you can pass NULL for this to leave the work group size up to the runtime; there should be a way to do that in the C++ API as well (perhaps also just NULL). This should be faster on the CPU; it certainly will be on a GPU.
I use MathLink to send and receive independent mma expressions from a C++ application as strings.
std::string expression[N];
// ...
for(int i = 0; i < N; ++i) {
MLPutFunction(l, "EnterTextPacket", 1);
MLPutString(l, expression[i].c_str());
MLEndPacket(l);
// Check Packet ...
const char* result;
MLGetString(l, &result);
// process result ...
MLDisownString(l, result);
}
I would expect that MLDisownString frees the used memory except that it doesn't.
Any ideas?
Ok. Posting this as an answer, because I believe the odds you are using version 5 or below are pretty low:
`As of Version 6.0, MLDisownString() has been superseded by MLReleaseString()`
Check it here
First of all, I should point out such parameter as $HistoryLength. Setting it to zero often allows to reduce memory requirements considerably:
$HistoryLength = 0
At the same time, it is known problem with the MathKernel process that it accumulates system memory in long computations and does not release it.
The only way to ultimately solve the problem it to restart the kernel when it takes too much memory or when the amount of available free physical memory becomes too small. This task can be automatized.
If you have not tried Mathematica 8 yet, it may be worth a try, since, according to Oliver Ruebenkoenig:
For version 8 the memory allocator has
been rewritten and improved.
(What a small sentence for such a huge
endeavor and such a fine execution)
But I have not tried the version 8 yet and cannot say anything on it.