Details about the X5650 processor are at https://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20X5650%20-%20AT80614004320AD%20(BX80614X5650).html
Important notes:
L3 cache size: 12288 KB
Cache line size: 64 bytes
Consider the following two functions, each of which increments every value in an array by 100.
void incrementVector1(INT4* v, int n) {
  for (int k = 0; k < 100; ++k) {
    for (int i = 0; i < n; ++i) {
      v[i] = v[i] + 1;
    }
  }
}
void incrementVector2(INT4* v, int n) {
  for (int i = 0; i < n; ++i) {
    for (int k = 0; k < 100; ++k) {
      v[i] = v[i] + 1;
    }
  }
}
The following data, collected using the perf utility, captures runtime information for executing each of these functions on the Intel Xeon X5650 processor for various data sizes. In this data:
• the program vector1.bin executes the function incrementVector1;
• the program vector2.bin executes the function incrementVector2;
• the programs take a command-line argument which sets the value of n;
• both programs begin by allocating an array of size n and initializing all elements to 0;
• LLC-loads means “last level cache loads”, the number of accesses to L3;
• LLC-load-misses means “last level cache misses”, the number of L3 cache misses.
Runtime performance of vector1.bin.
Performance counter stats for ’./vector1.bin 1000000’:
230,070 LLC-loads
3,280 LLC-load-misses # 1.43% of all LL-cache references
0.383542737 seconds time elapsed
Performance counter stats for ’./vector1.bin 3000000’:
669,884 LLC-loads
242,876 LLC-load-misses # 36.26% of all LL-cache references
1.156663301 seconds time elapsed
Performance counter stats for ’./vector1.bin 5000000’:
1,234,031 LLC-loads
898,577 LLC-load-misses # 72.82% of all LL-cache references
1.941832434 seconds time elapsed
Performance counter stats for ’./vector1.bin 7000000’:
1,620,026 LLC-loads
1,142,275 LLC-load-misses # 70.51% of all LL-cache references
2.621428714 seconds time elapsed
Performance counter stats for ’./vector1.bin 9000000’:
2,068,028 LLC-loads
1,422,269 LLC-load-misses # 68.77% of all LL-cache references
3.308037628 seconds time elapsed
Runtime performance of vector2.bin.
Performance counter stats for ’./vector2.bin 1000000’:
16,464 LLC-loads
1,168 LLC-load-misses # 7.049% of all LL-cache references
0.319311959 seconds time elapsed
Performance counter stats for ’./vector2.bin 3000000’:
42,052 LLC-loads
17,027 LLC-load-misses # 40.49% of all LL-cache references
0.954854798 seconds time elapsed
Performance counter stats for ’./vector2.bin 5000000’:
63,991 LLC-loads
38,459 LLC-load-misses # 60.10% of all LL-cache references
1.593526338 seconds time elapsed
Performance counter stats for ’./vector2.bin 7000000’:
99,773 LLC-loads
56,481 LLC-load-misses # 56.61% of all LL-cache references
2.198810471 seconds time elapsed
Performance counter stats for ’./vector2.bin 9000000’:
120,456 LLC-loads
76,951 LLC-load-misses # 63.88% of all LL-cache references
2.772653964 seconds time elapsed
I have two questions:
Consider the cache miss rates for vector1.bin. Between the vector sizes 1000000 and 5000000, the cache miss rate drastically increases. What is the cause of this increase in cache miss rate?
Consider the cache miss rates for both programs. Notice that the miss rate between the two programs is roughly equal for any particular array size. Why is that?
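A quick arithmetic check relevant to both questions is to compare each run's working set with the 12288 KB L3 listed above. The sketch below does that comparison; it assumes INT4 is a 4-byte integer type (the type is not defined in this excerpt), so the exact byte counts are illustrative.

#include <stdio.h>

/* Back-of-the-envelope working-set check, assuming INT4 is 4 bytes:
   compare n * 4 bytes against the 12288 KB L3 of the X5650. */
int main(void) {
    const long l3_bytes = 12288L * 1024L;
    const long sizes[] = {1000000L, 3000000L, 5000000L, 7000000L, 9000000L};
    for (int i = 0; i < 5; i++) {
        long ws = sizes[i] * 4L;   /* n elements * 4 bytes each */
        printf("n=%ld  working set = %ld KB  (%s the L3)\n",
               sizes[i], ws / 1024L, ws <= l3_bytes ? "fits in" : "exceeds");
    }
    return 0;
}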
Related
In the following question, we're talking about an algorithm which transposes a matrix of complex values (struct complex { double real = 0.0; double imag = 0.0; };). Owing to a special data layout, there is a stride of n*n between the rows, which means that loading a subsequent row causes the eviction of the previously loaded row. All runs have been done using 1 thread only.
I'm trying to understand why my 'optimized' transpose function, which makes use of 2D blocking, is performing badly (coming from: 2D blocking with unique matrix transpose problem), so I'm using performance counters and cache simulators to get a reading on what's going wrong.
According to my analysis, if n=500 is the size of the matrix, b=4 is my block size, and c=4 is my cache-line size in elements (a 64-byte line holds four 16-byte complex values), we have for the naive algorithm,
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
for (auto i3 = std::size_t{}; i3 < n3; ++i3)
{
mat_out(i3, i1) = mat_in(i1, i3);
}
}
Number of cache-references: (read) n*n + (write) n*n
Number of cache-misses: (read) n*n / c + (write) n*n
Rate of misses: 62.5 %.
Sure enough, I'm getting the same output as per cachegrind:
==21470== Cachegrind, a cache and branch-prediction profiler
==21470== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21470== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21470== Command: ./benchmark/benchmarking_transpose_vslices_dir2_naive 500
==21470==
--21470-- warning: L3 cache found, using its data for the LL simulation.
--21470-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21470-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21470==
==21470== I refs: 30,130,879,636
==21470== I1 misses: 7,666
==21470== LLi misses: 6,286
==21470== I1 miss rate: 0.00%
==21470== LLi miss rate: 0.00%
==21470==
==21470== D refs: 13,285,386,487 (6,705,198,115 rd + 6,580,188,372 wr)
==21470== D1 misses: 8,177,337,186 (1,626,402,679 rd + 6,550,934,507 wr)
==21470== LLd misses: 3,301,064,720 (1,625,156,375 rd + 1,675,908,345 wr)
==21470== D1 miss rate: 61.6% ( 24.3% + 99.6% )
==21470== LLd miss rate: 24.8% ( 24.2% + 25.5% )
==21470==
==21470== LL refs: 8,177,344,852 (1,626,410,345 rd + 6,550,934,507 wr)
==21470== LL misses: 3,301,071,006 (1,625,162,661 rd + 1,675,908,345 wr)
==21470== LL miss rate: 7.6% ( 4.4% + 25.5% )
Now for the implementation with blocking, I expect,
Hint: the following code is without remainder loops. The container intermediate_result, sized b x b, is used to prevent cache-thrashing, as per the suggestion by @JérômeRichard.
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
{
for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
{
for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
{
intermediate_result(i3, i1) = mat_in(bi1 + i1, bi3 + i3);
}
}
for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
{
#pragma omp simd safelen(8)
for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
{
mat_out(bi3 + i1, bi1 + i3) = intermediate_result(i1, i3);
}
}
}
}
Number of cache-references: (read) b*b + (write) b*b
Number of cache-misses: (read) b*b / c + (write) b*b / c
Rate of misses: 25 %.
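Both the 62.5 % and the 25 % figures follow from the same line-granularity model (c is the line size in elements, i.e. four 16-byte complex values per 64-byte line). A minimal sketch of the arithmetic:

#include <stdio.h>

/* Expected miss rates under the simple line-granularity model used above. */
int main(void) {
    double n = 500, b = 4, c = 4;

    /* Naive: reads miss once per line (n*n/c), writes miss every time (n*n). */
    double naive = (n * n / c + n * n) / (n * n + n * n);
    /* Blocked: within each b x b block, reads and writes each miss once per line. */
    double blocked = (b * b / c + b * b / c) / (b * b + b * b);

    printf("naive   : %.1f %%\n", 100.0 * naive);    /* 62.5 */
    printf("blocked : %.1f %%\n", 100.0 * blocked);  /* 25.0 */
    return 0;
}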
Once again, cachegrind gives me the following report:
==21473== Cachegrind, a cache and branch-prediction profiler
==21473== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21473== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21473== Command: ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4
==21473==
--21473-- warning: L3 cache found, using its data for the LL simulation.
--21473-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21473-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21473==
==21473== I refs: 157,135,137,350
==21473== I1 misses: 11,057
==21473== LLi misses: 9,604
==21473== I1 miss rate: 0.00%
==21473== LLi miss rate: 0.00%
==21473==
==21473== D refs: 43,995,141,079 (29,709,076,051 rd + 14,286,065,028 wr)
==21473== D1 misses: 3,307,834,114 ( 1,631,898,173 rd + 1,675,935,941 wr)
==21473== LLd misses: 3,301,066,570 ( 1,625,157,620 rd + 1,675,908,950 wr)
==21473== D1 miss rate: 7.5% ( 5.5% + 11.7% )
==21473== LLd miss rate: 7.5% ( 5.5% + 11.7% )
==21473==
==21473== LL refs: 3,307,845,171 ( 1,631,909,230 rd + 1,675,935,941 wr)
==21473== LL misses: 3,301,076,174 ( 1,625,167,224 rd + 1,675,908,950 wr)
==21473== LL miss rate: 1.6% ( 0.9% + 11.7% )
I cannot explain this discrepancy at this point, except to speculate that this might be because of prefetching.
Now, when I profile the same naive implementation using perf (with option "-d"), I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_naive 500':
91.122,33 msec task-clock # 0,933 CPUs utilized
870.939 context-switches # 0,010 M/sec
17 cpu-migrations # 0,000 K/sec
50.807.083 page-faults # 0,558 M/sec
354.169.268.894 cycles # 3,887 GHz
217.031.159.494 instructions # 0,61 insn per cycle
34.980.334.095 branches # 383,883 M/sec
148.578.378 branch-misses # 0,42% of all branches
58.473.530.591 L1-dcache-loads # 641,704 M/sec
12.636.479.302 L1-dcache-load-misses # 21,61% of all L1-dcache hits
440.543.654 LLC-loads # 4,835 M/sec
276.733.102 LLC-load-misses # 62,82% of all LL-cache hits
97,705649040 seconds time elapsed
45,526653000 seconds user
47,295247000 seconds sys
When I do the same for the implementation with 2D-blocking, I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_best 500 4':
79.865,16 msec task-clock # 0,932 CPUs utilized
766.200 context-switches # 0,010 M/sec
12 cpu-migrations # 0,000 K/sec
50.807.088 page-faults # 0,636 M/sec
310.452.015.452 cycles # 3,887 GHz
343.399.743.845 instructions # 1,11 insn per cycle
51.889.725.247 branches # 649,717 M/sec
133.541.902 branch-misses # 0,26% of all branches
81.279.037.114 L1-dcache-loads # 1017,703 M/sec
7.722.318.725 L1-dcache-load-misses # 9,50% of all L1-dcache hits
399.149.174 LLC-loads # 4,998 M/sec
123.134.807 LLC-load-misses # 30,85% of all LL-cache hits
85,660207381 seconds time elapsed
34,524170000 seconds user
46,884443000 seconds sys
Questions:
Why is there a strong difference in the output here for L1D and LLC?
Why are we seeing such a bad L3 cache-miss rate (according to perf) in the case of the blocking algorithm? This is obviously exacerbated when I start using 6 cores.
Any tips on how to detect cache-thrashing will also be appreciated.
Thanks in advance for your time and help, I'm glad to provide additional information upon request.
Additional Info:
The processor used for testing here is the (Coffee Lake) Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
CPU with 6 cores operating at 2.80 GHz - 4.00 GHz
L1 6x 32 KiB 8-way set associative (64 sets)
L2 6x 256 KiB 4-way set associative (1024 sets)
shared L3 9 MiB 12-way set associative (12288 sets)
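To relate these cache sizes to the n = 500 transpose above: one matrix of 16-byte complex values is about 3.8 MiB, so input plus output nearly fill the 9 MiB L3, while a single row is only a few KiB. A small sketch of that arithmetic, assuming the struct complex layout from the problem statement:

#include <stdio.h>

/* Working set of one 500x500 matrix of 16-byte complex values,
   compared against the i5-8400 cache sizes listed above. */
int main(void) {
    const double KiB = 1024.0, MiB = 1024.0 * 1024.0;
    double matrix = 500.0 * 500.0 * 16.0;                    /* bytes per matrix */
    printf("one matrix : %.2f MiB\n", matrix / MiB);         /* ~3.81 MiB */
    printf("in + out   : %.2f MiB (L3 is 9 MiB)\n", 2.0 * matrix / MiB);
    printf("one row    : %.2f KiB (L1d is 32 KiB per core)\n",
           500.0 * 16.0 / KiB);                              /* ~7.81 KiB */
    return 0;
}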
I'm having a hard time interpreting Intel performance events reporting.
Consider the following simple program that mainly reads/writes memory:
#include <stdint.h>
#include <stdio.h>
volatile uint32_t a;
volatile uint32_t b;
int main() {
printf("&a=%p\n&b=%p\n", &a, &b);
for(size_t i = 0; i < 1000000000LL; i++) {
a ^= (uint32_t) i;
b += (uint32_t) i;
b ^= a;
}
return 0;
}
I compile it with gcc -O2 and run under perf:
# gcc -g -O2 a.c
# perf stat -a ./a.out
&a=0x55a4bcf5f038
&b=0x55a4bcf5f034
Performance counter stats for 'system wide':
32,646.97 msec cpu-clock # 15.974 CPUs utilized
374 context-switches # 0.011 K/sec
1 cpu-migrations # 0.000 K/sec
1 page-faults # 0.000 K/sec
10,176,974,023 cycles # 0.312 GHz
13,010,322,410 instructions # 1.28 insn per cycle
1,002,214,919 branches # 30.699 M/sec
123,960 branch-misses # 0.01% of all branches
2.043727462 seconds time elapsed
# perf record -a ./a.out
&a=0x5589cc1fd038
&b=0x5589cc1fd034
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.997 MB perf.data (9269 samples) ]
# perf annotate
The result of perf annotate (annotated for memory loads/stores by me):
Percent│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ xor %eax,%eax
│ nop
│ a ^= (uint32_t) i;
│28: mov a,%edx // 32-bit load
│ xor %eax,%edx
9.74 │ mov %edx,a // 32-bit store
│ b += (uint32_t) i;
12.12 │ mov b,%edx // 32-bit load
8.79 │ add %eax,%edx
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ add $0x1,%rax
│ b += (uint32_t) i;
18.69 │ mov %edx,b // 32-bit store
│ b ^= a;
0.04 │ mov a,%ecx // 32-bit load
22.39 │ mov b,%edx // 32-bit load
8.92 │ xor %ecx,%edx
19.31 │ mov %edx,b // 32-bit store
│ for(size_t i = 0; i < 1000000000LL; i ++) {
│ cmp $0x3b9aca00,%rax
│ ↑ jne 28
│ }
│ return 0;
│ }
│ xor %eax,%eax
│ add $0x8,%rsp
│ ← retq
My observations:
From the 1.28 insn per cycle I conclude that the program is mainly memory-bound.
a and b appear to be located in the same cache line, adjacent to each other.
My Question:
Shouldn't CPU time be more consistent for the various memory loads and stores?
Why is CPU time of the 1st memory load (mov a,%edx) zero?
Why is the time of the 3rd load mov a,%ecx 0.04%, while the one just next to it mov b,%edx 22.39%?
Why do some instructions take 0 time? The loop consists of 13 instructions, so each instruction must contribute some observable time.
Notes:
OS: Linux 4.19.0-amd64, CPU: Intel Core i9-9900K, 100% idle system (also tested on i7-7700, same result).
Not exactly "memory" bound, but bound on the latency of store-forwarding. The i9-9900K and i7-7700 have exactly the same microarchitecture for each core, so that's not surprising :P https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake. (Except possibly for improvements in hardware mitigation of Meltdown, and possibly fixing the loop buffer (LSD).)
Remember that when a perf event counter overflows and triggers a sample, the out-of-order superscalar CPU has to choose exactly one of the in-flight instructions to "blame" for this cycles event. Often this is the oldest un-retired instruction in the ROB, or the one after. Be very suspicious of cycles event samples over very small scales.
Perf never blames a load that was slow to produce a result; it usually blames the instruction that was waiting for it (in this case an xor or add; here, sometimes the store consuming the result of that xor). These aren't cache-miss loads; store-forwarding latency is only about 3 to 5 cycles on Skylake (variable, and shorter if you don't try too soon: Loop with function call faster than an empty loop), so you do have loads completing at about 2 per 3 to 5 cycles.
You have two dependency chains through memory:
• The longer one involves two RMWs of b. It is twice as long and will be the overall bottleneck for the loop.
• The other involves one RMW of a (with an extra read each iteration, which can happen in parallel with the read that's part of the next a ^= i;).
The dep chain for i only involves registers and can run far ahead of the others; it's no surprise that add $0x1,%rax has no counts. Its execution cost is totally hidden in the shadow of waiting for loads.
I'm a bit surprised there are significant counts for mov %edx,a. Perhaps it sometimes has to wait for the older store uops involving b to run on the CPU's single store-data port. (Uops are dispatched to ports according to oldest-ready first. How are x86 uops scheduled, exactly?)
Uops can't retire until all previous uops have executed, so it could just be getting some skew from the store at the bottom of the loop. Uops retire in groups of 4, so if the mov %edx,b does retire, the already-executed cmp/jcc, the mov load of a, and the xor %eax,%edx can retire with it. Those are not part of the dep chain that waits for b, so they're always going to be sitting in the ROB waiting to retire whenever the b store is ready to retire. (This is guesswork about how mov %edx,a could be getting counts, despite not being part of a real bottleneck.)
The store-address uops should all run far ahead of the loop because they don't have to wait for previous iterations: RIP-relative addressing (see footnote 1) is ready right away. And they can run on port 7, or compete with loads for ports 2 or 3. Same for the loads: they can execute right away and detect what store they're waiting for, with the load buffer monitoring it and ready to report when the data becomes ready after the store-data uop does eventually run.
Presumably the front-end will eventually bottleneck on allocating load buffer entries, and that's what will limit how many uops can be in the back-end, not ROB or RS size.
Footnote 1: Your annotated output only shows a, not a(%rip), so that's odd; it doesn't matter whether you somehow got it to use 32-bit absolute addressing, or whether it's just a disassembly quirk failing to show RIP-relative.
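To put numbers on the store-forwarding bottleneck: the perf stat output above shows about 10.18e9 cycles for 1e9 iterations, i.e. roughly 10 cycles per iteration, which lines up with two back-to-back store-forwarding round trips (plus the ALU ops) on the b chain. Below is a minimal experiment sketch, not from the original post, that keeps b in a register for the whole loop; one would then expect the remaining a chain (one store-forwarding round trip per iteration) to dominate, cutting the cycles per iteration roughly in half.

#include <stdint.h>
#include <stdio.h>

volatile uint32_t a;   /* keep the RMW chain on 'a' through memory */
uint32_t b_result;     /* result sink so the loop's work stays observable */

int main() {
    uint32_t b = 0;    /* 'b' can now live in a register inside the loop */
    for (size_t i = 0; i < 1000000000LL; i++) {
        a ^= (uint32_t) i;   /* unchanged: one store-forwarding round trip per iteration */
        b += (uint32_t) i;   /* no longer a memory RMW */
        b ^= a;              /* still reads the volatile 'a' once per iteration */
    }
    b_result = b;
    printf("b=%u\n", b_result);
    return 0;
}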
I am trying to parallelize a Monte Carlo simulation by using OpenCL. I use MWC64X as a uniform random number generator. The code runs well on different Intel CPUs: the output of the parallel computation is very close to the sequential one.
Using OpenCL device: Intel(R) Xeon(R) CPU E5-2630L v3 @ 1.80GHz
Literal influence running time: 0.029048 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.029762 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.029742 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.02971 seconds ra seqInfl= 0.4771
Literal influence running time: 0.029225 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.04992 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.034636 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.049079 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.024442 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.04946 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.049071 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.053117 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.051642 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.052052 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.052118 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.051998 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.052069 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.71728 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:4781 influence0:0.478100 sum2:4781 influence2:0.478100 sum6:0 influence6:0.000000 sum10:0 sum12:0 influence12:0.000000 sum7:0 influence7:0.000000 influence10:0.000000 sum4:5962 influence4:0.596200 sum8:7971 influence8:0.797100 sum1:4781 influence1:0.478100 sum3:4781 influence3:0.478100 sum13:0 influence13:0.000000 sum11:1261 influence11:0.126100 sum9:0 influence9:0.000000 sum14:0 influence14:0.000000 sum5:0 influence5:0.000000 sum15:0 influence15:0.000000 sum16:0 influence16:0.000000
Parallel influence running time: 0.054391 seconds
Parallel maxInfluence Literal: trust57-4 Infl=0.7971
However, when I run the code on GeForce GTX 1080 Ti, with NVIDIA-SMI 430.40 and CUDA 10.1 and OpenCL 1.2 CUDA installed, the output is as below:
Using OpenCL device: GeForce GTX 1080 Ti
Influence:
Literal influence running time: 0.011119 seconds r1 seqInfl= 0.4771
Literal influence running time: 0.011238 seconds r2 seqInfl= 0.4771
Literal influence running time: 0.011408 seconds r3 seqInfl= 0.4771
Literal influence running time: 0.01109 seconds ra seqInfl= 0.4771
Literal influence running time: 0.011132 seconds trust1-57 seqInfl= 0.6001
Literal influence running time: 0.018978 seconds trust110-1 seqInfl= 0
Literal influence running time: 0.013093 seconds trust4-57 seqInfl= 0
Literal influence running time: 0.018968 seconds trust57-110 seqInfl= 0
Literal influence running time: 0.009105 seconds trust57-4 seqInfl= 0.8026
Literal influence running time: 0.018753 seconds trust33-1 seqInfl= 0
Literal influence running time: 0.018583 seconds trust57-33 seqInfl= 0
Literal influence running time: 0.02005 seconds trust4-1 seqInfl= 0.1208
Literal influence running time: 0.01957 seconds trust57-1 seqInfl= 0
Literal influence running time: 0.019686 seconds trust57-64 seqInfl= 0
Literal influence running time: 0.019632 seconds trust64-1 seqInfl= 0
Literal influence running time: 0.019687 seconds trust57-7 seqInfl= 0
Literal influence running time: 0.019859 seconds trust7-1 seqInfl= 0
Total number of literals: 17
Sequential influence running time: 0.272032 seconds
Sequential maxInfluence Literal: trust57-4 0.8026
index1= 17 size= 51 dim1_size= 6
sum0:10000 sum1:10000 sum2:10000 sum3:10000 sum4:10000 sum5:0 sum6:0 sum7:0 sum8:10000 sum9:0 sum10:0 sum11:0 sum12:0 sum13:0 sum14:0 sum15:0 sum16:0
Parallel influence running time: 0.193581 seconds
The "Influence" value equals sum*1.0/10000, thus the parallel influence only composes of 1 and 0, which is incorrect (in GPU runs) and doesn't happen when parallelizing on a Intel CPU.
When I check the output of the random number generator if(flag==0) printf("randint=%u",randint);, it seems the outputs are all zero on GPU. Below is the clinfo and the .cl code:
Device Name GeForce GTX 1080 Ti
Device Vendor NVIDIA Corporation
Device Vendor ID 0x10de
Device Version OpenCL 1.2 CUDA
Driver Version 430.40
Device OpenCL C Version OpenCL C 1.2
Device Type GPU
Device Topology (NV) PCI-E, 68:00.0
Device Profile FULL_PROFILE
Device Available Yes
Compiler Available Yes
Linker Available Yes
Max compute units 28
Max clock frequency 1721MHz
Compute Capability (NV) 6.1
Device Partition (core)
Max number of sub-devices 1
Supported partition types None
Max work item dimensions 3
Max work item sizes 1024x1024x64
Max work group size 1024
Preferred work group size multiple 32
Warp size (NV) 32
Preferred / native vector sizes
char 1 / 1
short 1 / 1
int 1 / 1
long 1 / 1
half 0 / 0 (n/a)
float 1 / 1
double 1 / 1 (cl_khr_fp64)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Correctly-rounded divide and sqrt operations Yes
Double-precision Floating-point support (cl_khr_fp64)
Denormals Yes
Infinity and NANs Yes
Round to nearest Yes
Round to zero Yes
Round to infinity Yes
IEEE754-2008 fused multiply-add Yes
Support is emulated in software No
Address bits 64, Little-Endian
Global memory size 11720130560 (10.92GiB)
Error Correction support No
Max memory allocation 2930032640 (2.729GiB)
Unified memory for Host and Device No
Integrated memory (NV) No
Minimum alignment for any data type 128 bytes
Alignment of base address 4096 bits (512 bytes)
Global Memory cache type Read/Write
Global Memory cache size 458752 (448KiB)
Global Memory cache line size 128 bytes
Image support Yes
Max number of samplers per kernel 32
Max size for 1D images from buffer 134217728 pixels
Max 1D or 2D image array size 2048 images
Max 2D image size 16384x32768 pixels
Max 3D image size 16384x16384x16384 pixels
Max number of read image args 256
Max number of write image args 16
Local memory type Local
Local memory size 49152 (48KiB)
Registers per block (NV) 65536
Max number of constant args 9
Max constant buffer size 65536 (64KiB)
Max size of kernel argument 4352 (4.25KiB)
Queue properties
Out-of-order execution Yes
Profiling Yes
Prefer user sync for interop No
Profiling timer resolution 1000ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels No
Kernel execution timeout (NV) Yes
Concurrent copy and kernel execution (NV) Yes
Number of async copy engines 2
printf() buffer size 1048576 (1024KiB)
#define N 70 // N > index, which is the total number of literals
#define BASE 4294967296UL
//! Represents the state of a particular generator
typedef struct{ uint x; uint c; } mwc64x_state_t;
enum{ MWC64X_A = 4294883355U };
enum{ MWC64X_M = 18446383549859758079UL };
void MWC64X_Step(mwc64x_state_t *s)
{
uint X=s->x, C=s->c;
uint Xn=MWC64X_A*X+C;
uint carry=(uint)(Xn<C); // The (Xn<C) will be zero or one for scalar
uint Cn=mad_hi(MWC64X_A,X,carry);
s->x=Xn;
s->c=Cn;
}
//! Return a 32-bit integer in the range [0..2^32)
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
MWC64X_Step(s);
return res;
}
__kernel void setInfluence(const int literals, const int size, const int dim1_size, __global int* lambdas, __global float* lambdap, __global int* dim2_size, __global float* influence){
int flag=get_global_id(0);
int sum=0;
int count=10000;
int assignment[N];
//or try to get newlambda like original version does
if(flag < literals){
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
float rand=randint*1.0/BASE;
//if(flag==0)
// printf("randint=%u",randint);
if(lambdap[j]<rand)
assignment[lambdas[j]]=0;
else
assignment[lambdas[j]]=1;
}
//the true case
assignment[flag]=1;
int valuet=0;
int index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuet=1;
break;
}
}
//the false case
assignment[flag]=0;
int valuef=0;
index=0;
for(int m=0; m<dim1_size; m++){
int valueMono=1;
for(int n=0; n<dim2_size[m]; n++){
if(assignment[lambdas[index+n]]==0){
valueMono=0;
index+=dim2_size[m];
break;
}
}
if(valueMono==1){
valuef=1;
break;
}
}
sum += valuet-valuef;
}
influence[flag] = 1.0*sum/count;
printf("sum%d:%d\t", flag, sum);
}
}
What might be the problem when running the code on GPU? Is it MWC64X? According to its author, it can perform well on NVIDIA GPUs. If so, how can I fix it; if not, what might be the problem?
(This started out as a comment; it turns out this was the source of the problem, so I'm turning it into an answer.)
You're not initialising your mwc64x_state_t rng; variable before reading from it, so any results will be undefined:
mwc64x_state_t rng;
for(int i=0; i<count; i++){
for(int j=0; j<size; j++){
uint randint=MWC64X_NextUint(&rng);
Where MWC64X_NextUint() immediately reads from the rng state before updating it:
uint MWC64X_NextUint(mwc64x_state_t *s)
{
uint res=s->x ^ s->c;
Note that you will probably want to seed your RNG differently for each work-item, otherwise you will get nasty correlation artifacts in your results.
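As a concrete illustration of that note, the state could be derived deterministically from a host-supplied seed and the work-item id before the sampling loops. This is only a sketch: seed would be a new kernel argument (not in the original code), the mixing constants are arbitrary, and the skip-ahead seeding routine shipped with the full MWC64X package is the better option when it is available.

// Illustrative per-work-item seeding (not the library's official routine).
// 'seed' is assumed to be an extra kernel argument provided by the host;
// the constants below are arbitrary and only ensure distinct, non-zero states.
uint gid = (uint)get_global_id(0);
mwc64x_state_t rng;
rng.x = seed ^ (gid * 2654435761u);   // multiplicative mix of the work-item id
rng.c = (seed >> 16) + gid + 1u;      // +1 keeps the state away from (0, 0)
// ...then run the sampling loops exactly as before, using this seeded rng.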
All use-cases of pseudo-random numbers are a next-level challenge on truly [PARALLEL] computing platforms (not languages, platforms).
Either there is some source-of-randomness, which gets us into trouble once massively-parallel requests have to be handled fairly in a truly [PARALLEL] fashion (hardware resources may help here, yet at the cost of not being able to reproduce the same behaviour "outside" of this very platform and moment-in-time, unless such a source is software-operated with some seed-injection feature that sets up the "just"-pseudo-random algorithm producing a pure-[SERIAL] sequence of "just"-pseudo-random numbers).
Or, there is some "shared" generator of pseudo-random numbers, which enjoys a higher system-wide level of entropy (good for the resulting "quality" of pseudo-randomness) but at the cost of pure-serial dependence (no parallel execution is possible; the serial sequence gets served one request after another) and of having close to zero chance of repeatable runs (a must for reproducible science) that deliver repeatably the same sequences, needed for testing and for method-validation cases.
RESUME:
The code may employ work-item-"private" pseudo-random generating functions (privacy is a must, both for the parallel code-execution and for the mutual independence (non-intervening processes) of generating these pseudo-random numbers), yet each instance must be a) independently initialised, so as to provide the expected level of randomness achievable in parallelised code-runs, and b) initialised in a repeatably reproducible manner, for the sake of re-running the tests at different times, often on different OpenCL target computing-platforms.
For __kernel-s that do not rely on hardware-specific sources-of-randomness, meeting conditions a && b will suffice for receiving repeatably reproducible (same) results for testing "in vitro", while still providing a reasonably random method for generating results during generic production-level use-case code-runs "in vivo".
The comparison of net run-times (benchmarked above) seems to show that Amdahl's-law add-on overhead costs, plus a tail-end effect of the atomicity-of-work, finally decided the outcome: the net run-time was ~3.6x faster on the XEON than on the GPU:
index1 = 17
size = 51
dim1_size = 6
sum0: 4781 influence0: 0.478100
sum2: 4781 influence2: 0.478100
sum6: 0 influence6: 0.000000
sum10: 0 influence10: 0.000000
sum12: 0 influence12: 0.000000
sum7: 0 influence7: 0.000000
sum4: 5962 influence4: 0.596200
sum8: 7971 influence8: 0.797100
sum1: 4781 influence1: 0.478100
sum3: 4781 influence3: 0.478100
sum13: 0 influence13: 0.000000
sum11: 1261 influence11: 0.126100
sum9: 0 influence9: 0.000000
sum14: 0 influence14: 0.000000
sum5: 0 influence5: 0.000000
sum15: 0 influence15: 0.000000
sum16: 0 influence16: 0.000000
Parallel influence running time: 0.054391 seconds on XEON E5-2630L v3 @ 1.80GHz using OpenCL
index1 = 17
size = 51
dim1_size = 6
sum0: 10000
sum1: 10000
sum2: 10000
sum3: 10000
sum4: 10000
sum5: 0
sum6: 0
sum7: 0
sum8: 10000
sum9: 0
sum10: 0
sum11: 0
sum12: 0
sum13: 0
sum14: 0
sum15: 0
sum16: 0
Parallel influence running time: 0.193581 seconds on GeForce GTX 1080 Ti using OpenCL
I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?
The formulas do not seem to be implemented in builtin-stat.c (where default event sets for perf stat are defined); they are apparently calculated (and averaged with stddev) in perf_stat__print_shadow_stats() (and some stats are collected into arrays in perf_stat__update_shadow_stats()):
http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L626
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
if (total) {
ratio = avg / total;
print_metric(ctxp, NULL, "%7.2f ",
"insn per cycle", ratio);
} else {
print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
}
Branch misses are from print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too like HW_CACHE_MISSES / HW_CACHE_REFERENCES and some more detailed (perf stat -d mode).
Stalled percents are computed as HW_STALLED_CYCLES_FRONTEND / HW_CPU_CYCLES and HW_STALLED_CYCLES_BACKEND / HW_CPU_CYCLES
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats was updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK & SW_CPU_CLOCK; the exact difference between the two has remained unclear since 2010 on LKML and 2014 at SO):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
update_stats(&runtime_nsecs_stats[cpu], count[0]);
There are also several formulas for transactions (perf stat -T mode).
"CPU utilized" is from task-clock or cpu-clock / walltime_nsecs_stats, where walltime is calculated by the perf stat itself (in userspace using clock from the wall (astronomic time, ):
static inline unsigned long long rdclock(void)
{
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
...
/*
* Enable counters and exec the command:
*/
t0 = rdclock();
clock_gettime(CLOCK_MONOTONIC, &ref_time);
if (forks) {
....
}
t1 = rdclock();
update_stats(&walltime_nsecs_stats, t1 - t0);
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake, IDF2015; #22 in Gregg's Methodology List), described in 2016 by Andi Kleen, https://lwn.net/Articles/688335/ "Add top down metrics to perf stat" (the perf stat --topdown -I 1000 cmd mode).
And finally, if there is no exact formula for the event currently being printed, there is a universal "%c/sec" (K/sec or M/sec) metric (http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845): anything divided by the runtime in nanoseconds (taken from the task-clock or cpu-clock events, if they were present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
char unit = 'M';
char unit_buf[10];
total = avg_stats(&runtime_nsecs_stats[cpu]);
if (total)
ratio = 1000.0 * avg / total;
if (ratio < 0.001) {
ratio *= 1000;
unit = 'K';
}
snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
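Applying these formulas to the sample output at the top of this question reproduces the printed metrics. A small sketch that redoes the arithmetic for insn per cycle, GHz, and the generic M/sec figure for cache-references:

#include <stdio.h>

/* Re-derive perf stat's shadow metrics from the raw counts shown above. */
int main(void) {
    double task_clock_ms = 1080267.226401;      /* msec of task-clock      */
    double cycles        = 1592123216789.0;     /* HW_CPU_CYCLES           */
    double instructions  = 871190006655.0;      /* HW_INSTRUCTIONS         */
    double cache_refs    = 3697548810.0;        /* HW_CACHE_REFERENCES     */

    double ns = task_clock_ms * 1e6;            /* runtime in nanoseconds  */
    printf("insn per cycle : %.2f\n", instructions / cycles);         /* 0.55  */
    printf("GHz            : %.3f\n", cycles / ns);                   /* 1.474 */
    printf("cache-refs     : %.3f M/sec\n", 1000.0 * cache_refs / ns);/* 3.423 */
    return 0;
}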
The program for finding prime numbers using OpenCL 1.1 gave the following benchmarks:
Device: CPU
Realtime: approx. 3 sec
Usertime: approx. 32 sec
Device: GPU
Realtime: approx. 37 sec
Usertime: approx. 32 sec
Why is the usertime of execution by GPU not less than that of CPU? Is data/task parallelization not occurring?
System specifications: 64-bit CentOS 5.3 system with two ATI Radeon 5970 graphics cards + Intel Core i7 processor (12 cores)
Your kernel is rather inefficient; I have an adjusted one below for you to consider. As to why it runs better on a CPU device:
Using your algorithm, the work items take varying amounts of time to execute. They will take longer as the numbers tested grow larger. A work group on a GPU will not finish until all of its items are finished, so some of the hardware will be left idle until the last item is done. On a CPU, it behaves more like a loop iterating over the kernel items, so the difference in cycles needed to compute each item won't drastically affect the performance.
'A' is not used by the kernel. It should not be copied unless it is used. It looks like you wanted to test A[i] rather than 'i' itself, though.
I think the GPU would be much better at FFT-based prime calculations, or even a sieve algorithm.
// The kernel signature below is reconstructed for completeness; kernel and
// parameter names are illustrative.
__kernel void primeCheck(__global int *B)
{
    int t;
    int i = get_global_id(0);
    int end = (int)sqrt((float)i);   // sqrt operates on floating-point types
    if (i % 2 == 0) {
        B[i] = 0;                    // even numbers are not prime (the case i==2 is ignored here)
    } else {
        B[i] = 1;                    // assuming only that it should be non-zero
    }
    for (t = 3; (t <= end) && (B[i] > 0); t += 2) {
        if (i % t == 0) {
            B[i] = 0;                // found an odd divisor, so i is composite
        }
    }
}
}