Determine CPUID as listed in the Intel Intrinsics Guide

In the Intel Intrinsics Guide there is a 'Latency and Throughput Information' table at the bottom of several intrinsics, listing the performance for several CPUIDs.
For example, the table in the Intrinsics Guide looks as follows for the intrinsic _mm_hadd_pd:
CPUID(s)              Parameters  Latency  Throughput
0F_03                             13       4
06_2A                 xmm1, xmm2  5        2
06_25/2C/1A/1E/1F/2E  xmm1, xmm2  5        2
06_17/1D              xmm1, xmm2  6        1
06_0F                 xmm1, xmm2  5        2
Now: how do I determine which ID my CPU has?
I'm using Kubuntu 12.04 and tried with sudo dmidecode -t 4 and also with the little program cpuid from the Ubuntu packages, but their output isn't really useful.
I cannot find any of the strings listed in the Intrinsics Guide anywhere in the output of the commands above.

You can get that information with the CPUID instruction. Quoting the Intel documentation:

The extended family, bit positions 20 through 27, is used in conjunction with the family code, specified in bit positions 8 through 11, to indicate whether the processor belongs to the Intel386, Intel486, Pentium, Pentium Pro or Pentium 4 family of processors. P6 family processors include all processors based on the Pentium Pro processor architecture and have an extended family equal to 00h and a family code equal to 06h. Pentium 4 family processors include all processors based on the Intel NetBurst® microarchitecture and have an extended family equal to 00h and a family code equal to 0Fh.
The extended model, specified in bit positions 16 through 19, in conjunction with the model number, specified in bits 4 through 7, is used to identify the model of the processor within the processor's family.

See page 22 of "Intel Processor Identification and the CPUID Instruction" for further details.
The CPUID listed in the Intrinsics Guide is then "family_model".
The following code should do the job:
#include <stdio.h>

int main() {
    int ebx = 0, ecx = 0, edx = 0, eax = 1;   // EAX = 1 requests leaf 1 (version information)
    __asm__("cpuid" : "=b"(ebx), "=c"(ecx), "=d"(edx), "=a"(eax) : "a"(eax));
    int model                = (eax & 0x0F0)     >> 4;   // bits 7:4
    int extended_model       = (eax & 0xF0000)   >> 12;  // bits 19:16, shifted to sit above the model
    int family_code          = (eax & 0xF00)     >> 8;   // bits 11:8
    int extended_family_code = (eax & 0xFF00000) >> 16;  // bits 27:20, shifted to sit above the family code
    printf("%x %x %x %x\n", eax, ebx, ecx, edx);
    printf("CPUID: %02x_%02x\n", extended_family_code | family_code, extended_model | model);
    return 0;
}
For my computer I get:
CPUID: 06_25
Hope it helps.
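If you'd rather not write the inline asm by hand, here is a minimal alternative sketch (not part of the original answer; it assumes GCC or Clang, which ship the <cpuid.h> helper header):
#include <stdio.h>
#include <cpuid.h>   /* GCC/Clang wrapper for the CPUID instruction */

int main(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))   /* leaf 1: version information */
        return 1;
    unsigned int family     = (eax >> 8)  & 0xF;
    unsigned int ext_family = (eax >> 20) & 0xFF;
    unsigned int model      = (eax >> 4)  & 0xF;
    unsigned int ext_model  = (eax >> 16) & 0xF;
    printf("CPUID: %02x_%02x\n", (ext_family << 4) | family, (ext_model << 4) | model);
    return 0;
}
On Linux you can also cross-check the result against /proc/cpuinfo: a CPU reported there as "cpu family : 6" and "model : 37" (decimal) is exactly 06_25.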

Related

Better understanding of timing and pipelining [duplicate]

This question already has answers here:
What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand? (1 answer)
How to get the CPU cycle count in x86_64 from C++? (5 answers)
How many CPU cycles are needed for each assembly instruction? (5 answers)
Assembly - How to score a CPU instruction by latency and throughput (1 answer)
Closed 1 year ago.
In this code, I'm just looping through the set of instructions a bunch of times. Regardless of how many iterations (100, 1000, 1000000), the timing using RDTSC shows (outputs) 6 clock cycles for the loop. I'm on a Coffee Lake i9-9900K.
There are 13 instructions in the loop, so I would have thought the minimum RDTSC delta would have been 13.
Would someone be able to educate me as to how this seems to run twice as fast as I expected? I'm clearly misunderstanding something basic, or I've made a ridiculous mistake.
Thank you!
rng.SetFloatScale(2.0f / 8.0f);
00C010AE vmovups ymm4,ymmword ptr [__ymm#3e0000003e0000003e0000003e0000003e0000003e0000003e0000003e000000 (0C02160h)]
Vec8f sum = 0;
const size_t loopLen = 1000;
auto start = __rdtsc();
00C010BB rdtsc
00C010BD mov esi,eax
sum += rng.NextScaledFloats();
00C010F0 vpslld ymm0,ymm2,xmm5
00C010F4 vpxor ymm1,ymm0,ymm2
00C010F8 vpsrld ymm0,ymm1,xmm6
00C010FC vpxor ymm1,ymm0,ymm1
00C01100 vpslld ymm0,ymm1,xmm7
00C01104 vpxor ymm2,ymm0,ymm1
00C01108 vpand ymm0,ymm2,ymmword ptr [__ymm#007fffff007fffff007fffff007fffff007fffff007fffff007fffff007fffff (0C02140h)]
00C01110 vpor ymm0,ymm0,ymmword ptr [__ymm#4000000040000000400000004000000040000000400000004000000040000000 (0C021A0h)]
00C01118 vmovups ymm1,ymm4
00C0111C vfmsub213ps ymm1,ymm0,ymmword ptr [__ymm#3e8000003e8000003e8000003e8000003e8000003e8000003e8000003e800000 (0C02180h)]
00C01125 vaddps ymm3,ymm1,ymm3
for (size_t i = 0; i < loopLen; i++)
00C01129 sub eax,1
00C0112C jne main+80h (0C010F0h)
auto end = __rdtsc();
00C0112E rdtsc
00C01130 mov edi,eax
00C01132 mov ecx,edx
printf("\n\nAverage: %f\nAverage RDTSC: %ld\n", fsum, (end - start) / loopLen);

Why is my SSE assembly slower in release builds?

I've been playing around with some x64 assembly and the XMM registers to do some float math, and I'm seeing some performance that is puzzling me.
As a self-learning exercise, I wrote some SSE assembly to approximate the 'sin' function (using the Taylor series), and called this from some basic C++ in a loop to compare to the standard library version. Code is below, and I've pasted the output for some typical runs after that. (I'm not looking for a critique of the code or approach here, just trying to understand the perf numbers).
What I don't get is why the "Release" build, where the actual running assembly is identical (I've stepped through the debugger to double-check), is consistently about 40 - 50 cycles slower. (Uncommenting the LFENCE instructions adds about 100 cycles to both Debug and Release, so the delta remains the same.) As a bonus question, why is the very first iteration typically in the thousands!!
I get this stuff is very complex and subtly impacted by numerous factors, but everything that pops in my head as a potential cause here just doesn't make sense.
I've checked the MXCSR flags in both runs, and they are identical across builds too (with the default value of 1F80h, which has all exceptions masked).
Any idea what would cause this? What further analysis could I do to figure this out at an even deeper level?
Assembly
_RDATA segment
pi real4 3.141592654
rf3 real4 0.1666666667
rf5 real4 0.008333333333
rf7 real4 0.0001984126984
_RDATA ends
_TEXT segment
; float CalcSin(float rads, int* cycles)
CalcSin PROC
; "leaf" function - doesn't use the stack or any non-volatile registers
mov r8, rdx ; Save the 'cycles' pointer into R8
rdtsc ; Get current CPU cycles in EDX:EAX
; lfence ; Ensure timer is taken before executing the below
mov ecx, eax ; Save the low 32 bits of the timer into ECX
movss xmm2, xmm0
mulss xmm2, xmm2 ; X^2
movss xmm3, xmm0
mulss xmm3, xmm2 ; x^3
movss xmm4, rf3 ; 1/3!
mulss xmm4, xmm3 ; x^3 / 3!
subss xmm0, xmm4 ; x - x^3 / 3!
mulss xmm3, xmm2 ; x^5
movss xmm4, rf5 ; 1/5!
mulss xmm4, xmm3 ; x^5 / 5!
addss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5!
mulss xmm3, xmm2 ; x^7
movss xmm4, rf7 ; 1/7!
mulss xmm4, xmm3 ; x^7 / 7!
subss xmm0, xmm4 ; x - x^3 / 3! + x^5 / 5! - x^7 / 7!
; lfence ; Ensure above completes before taking the timer again
rdtsc ; Get the timer now
sub eax, ecx ; Get the difference in cycles
mov dword ptr [r8], eax
ret
CalcSin ENDP
_TEXT ends
END
C++
#include <stdio.h>
#include <math.h>
#include <vector>
const float PI = 3.141592654f;
extern "C" float CalcSin(float rads, int* cycles);
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
return 0;
}
With a "Debug" build (I'm using Visual Studio 2019), I typically see the below timing reported:
Sin(0.00314159) = 0.00314159. Took 3816 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 18 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 18 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 18 cycles
C library = 1.00000000
The exact same code with a "Release" build, I typically see the below:
Sin(0.00314159) = 0.00314159. Took 4426 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 70 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 62 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 64 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 62 cycles
C library = 1.00000000
====UPDATE 1====
I changed the code to load the constants as immediates instead of referencing the .rdata segment, as Peter mentioned, and this got rid of the slow first iteration; i.e., I replaced the commented-out line with the two lines following it:
; movss xmm4, rf5 ; 1/5!
mov eax, 3C088889h ; 1/5! float representation
movd xmm4, eax
Warming up the CPU didn't help, but I did notice the first iteration in Release was now just as fast as Debug, and the rest were still slow. As the printf isn't called until after the first calculation, I wondered if this had an impact. I changed the code to just store the results as it ran and print them once complete, and now Release is just as fast. i.e.
Updated C++ code
extern "C" float CalcSin(float rads, int* cycles);
std::vector<float> values;
std::vector<int> rdtsc;
void DoCalcs(float rads) {
int cycles;
float result = CalcSin(rads, &cycles);
values.push_back(result);
rdtsc.push_back(cycles);
// printf("Sin(%.8f) = %.8f. Took %d cycles\n", rads, result, cycles);
// printf("C library = %.8f\n", sin(rads));
}
int main(int argc, char* argv[]) {
std::vector<float> inputs{PI / 1000, PI / 2 - PI / 1000, PI / 4, 0.0001f, PI / 2};
for (auto val : inputs) {
DoCalcs(val);
}
auto cycle_iter = rdtsc.begin();
auto value_iter = values.begin();
for (auto& input : inputs) {
printf("Sin(%.8f) = %.8f. Took %d cycles\n", input, *value_iter++, *cycle_iter++);
printf("C library = %.8f\n", sin(input));
}
return 0;
}
And now Release is pretty much identical to debug, i.e. around 18 - 24 cycles consistently on each call.
I'm not sure what the printf call is doing in Release builds, or maybe it's the way it was linked/optimized with Release settings, but it's strange that it negatively impacted the identical, separate assembly calls the way it did.
Sin(0.00314159) = 0.00314159. Took 18 cycles
C library = 0.00314159
Sin(1.56765473) = 0.99984086. Took 18 cycles
C library = 0.99999507
Sin(0.78539819) = 0.70710647. Took 24 cycles
C library = 0.70710680
Sin(0.00010000) = 0.00010000. Took 20 cycles
C library = 0.00010000
Sin(1.57079637) = 0.99984306. Took 24 cycles
C library = 1.00000000
====UPDATE 2====
To rule out CPU frequency ramp-up/down, I went in and tweaked a few BIOS settings (disabled Turbo, set a consistent core voltage, etc.), and can now see via the "AI Suite" ASUS app for the motherboard that the CPU runs at a consistent 3600MHz. (I'm running an Intel Core i9-9900K @ 3.6GHz on Windows 10 x64.)
After setting that... still no change.
Next thing that occurred to me is that with the printf I have a call out to the C runtime library between each loop, which is a different DLL between Debug and Release builds. To remove any other variation I started building from the command line instead of VS. Compiling with maximum speed optimizations and the release CRT DLLs (/O2 and /MD respectively), I still see the same slow-down. Switching to the debug CRT DLLs, I see some improvement. If I switch to statically linking the CRT, then it doesn't matter whether I use the debug or release versions, or whether I compile with optimizations or not; I regularly see the 24 cycles per call, i.e.
ml64 /c ..\x64simd.asm
cl.exe /Od /MT /Feapp.exe ..\main.cpp x64simd.obj
>app.exe
Sin(0.00314159) = 0.00314159. Took 24 cycles
Sin(1.56765473) = 0.99984086. Took 24 cycles
Sin(0.78539819) = 0.70710647. Took 24 cycles
Sin(0.00010000) = 0.00010000. Took 24 cycles
Sin(1.57079637) = 0.99984306. Took 24 cycles
So it's definitely something in calling out to the CRT Release DLLs causing the slow-down. I'm still puzzled as to why, especially as the Debug build in VS is also using CRT via DLLs.
You're timing in reference cycles with rdtsc, not core clock cycles. It's probably the same speed both times, in core clock cycles, but with the CPU running at different frequencies.
Probably a debug build gives the CPU time to ramp up to max turbo (more core cycles per reference cycle) before your function gets called. Because the calling code compiles to slower asm. And especially with MSVC, a debug build adds extra stuff like poisoning the stack frame to catch use of uninitialized vars. And also overhead for incremental linking.
None of this slows down your hand-written function itself, it's just "warm up" that you neglected to do manually in your microbenchmark.
See How to get the CPU cycle count in x86_64 from C++? for lots more details about RDTSC.
A factor of ~3 between idle CPU clock and max-turbo (or some higher clock) is very plausible for modern x86 CPUs. My i7-6700k idles at 0.8GHz with a rated frequency of 4.0GHz and a max single-core turbo of 4.2. But many laptop CPUs have a much lower non-turbo max (and might only ramp to non-turbo initially, not max turbo right away, depending on the energy_performance_preference HW governor, or especially a software governor on older CPUs.)
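For illustration, a minimal warm-up sketch (my own, not from the question or answer; WarmUpCpu and the roughly 0.1-second spin length are arbitrary choices) that could be called once before the first timed CalcSin call:
#include <intrin.h>   // __rdtsc with MSVC; use <x86intrin.h> with GCC/Clang

static void WarmUpCpu()
{
    // Spin on real work for a few hundred million reference cycles
    // (roughly 0.1 s at a few GHz) so the core ramps up from idle to its
    // sustained/turbo frequency before the real measurements start.
    volatile float sink = 0.0f;
    unsigned long long start = __rdtsc();
    while (__rdtsc() - start < 300000000ULL)
        sink += 1.0f;
}
Calling WarmUpCpu() at the top of main(), before the DoCalcs loop, should remove most of the frequency-ramp effect; the dTLB/cache effects discussed below are separate.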
As a bonus question, why is the very first iteration typically in the thousands!!
Probably dTLB miss and cache miss for loading rf3 from data memory. You could try loading those from C (by declaring extern volatile float rf3) to prime the TLB + cache for that block of constants, assuming they're all in the same cache line.
Possibly also an I-cache miss after the rdtsc, but the first load is probably before the end of an I-cache line so those could happen in parallel. (Putting the rdtsc inside your asm function means we probably aren't waiting for an iTLB miss or i-cache miss inside the timed region to even fetch the first byte of the function).
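A minimal sketch of that priming idea (mine, not from the answer; it assumes the constants are exported with PUBLIC rf3, rf5, rf7 in the .asm so the C++ side can link against them, and PrimeConstants is a made-up name):
// Assumes the .asm adds:  PUBLIC rf3, rf5, rf7
extern "C" volatile float rf3, rf5, rf7;   // defined in _RDATA in the assembly file

static void PrimeConstants()
{
    // A dummy read pulls in the cache line and dTLB entry holding the
    // constants before the timed region runs.
    volatile float sink = rf3 + rf5 + rf7;
    (void)sink;
}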
Code review:
Don't use movss between XMM registers unless you want to blend the low 4 bytes into the old value of the destination. Use movaps xmm2, xmm0 to copy the whole register; it's much more efficient.
movaps can be handled by register renaming without needing any back-end execution unit, vs. movss only running on one execution unit in Intel CPUs, port 5. https://agner.org/optimize/. Also, movaps avoids a false dependency on the old value of the register because it overwrites the full reg, allowing out-of-order exec to work properly.
movss xmm, [mem] is fine, though: as a load it zero-extends into the full register.

Why does re-initializing a register inside an unrolled ADD loop make it run faster even with more instructions inside the loop?

I have the following code:
#include <iostream>
#include <chrono>
#define ITERATIONS "10000"
int main()
{
/*
======================================
The first case: the MOV is outside the loop.
======================================
*/
auto t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $200, %ebx\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time1:\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time1\n");
auto t2 = std::chrono::high_resolution_clock::now();
auto time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << time;
/*
======================================
The second case: the MOV is inside the loop (faster).
======================================
*/
t1 = std::chrono::high_resolution_clock::now();
asm("mov $100, %eax\n"
"mov $" ITERATIONS ", %ecx\n"
"lp_test_time2:\n"
" mov $200, %ebx\n"
" add %eax, %ebx\n" // 1
" add %eax, %ebx\n" // 2
" add %eax, %ebx\n" // 3
" add %eax, %ebx\n" // 4
" add %eax, %ebx\n" // 5
"loop lp_test_time2\n");
t2 = std::chrono::high_resolution_clock::now();
time = std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
std::cout << '\n' << time << '\n';
}
The first case
I compiled it with
gcc version 9.2.0 (GCC)
Target: x86_64-pc-linux-gnu
gcc -Wall -Wextra -pedantic -O0 -o proc proc.cpp
and its output is
14474
5837
I also compiled it with Clang with the same result.
So, why is the second case faster (almost a 3x speedup)? Is it actually related to some microarchitectural details? If it matters, I have an AMD CPU: “AMD A9-9410 RADEON R5, 5 COMPUTE CORES 2C+3G”.
mov $200, %ebx inside the loop breaks the loop-carried dependency chain through ebx, allowing out-of-order execution to overlap the chain of 5 add instructions across multiple iterations.
Without it, the chain of add instructions bottlenecks the loop on the latency of the add (1 cycle) critical path, instead of the throughput (4/cycle on Excavator, improved from
2/cycle on Steamroller). Your CPU is an Excavator core.
AMD since Bulldozer has an efficient loop instruction (only 1 uop), unlike Intel CPUs where loop would bottleneck either loop at 1 iteration per 7 cycles. (https://agner.org/optimize/ for instruction tables, microarch guide, and more details on everything in this answer.)
With loop and mov taking slots in the front-end (and back-end execution units) away from add, a 3x instead of 4x speedup looks about right.
See this answer for an intro to how CPUs find and exploit Instruction Level Parallelism (ILP).
See Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for some in-depth details about overlapping independent dep chains.
BTW, 10k iterations is not many. Your CPU might not even ramp up out of idle speed in that time. Or might jump to max speed for most of the 2nd loop but none of the first. So be careful with microbenchmarks like this.
Also, your inline asm is unsafe because you forgot to declare clobbers on EAX, EBX, and ECX. You step on the compiler's registers without telling it. Normally you should always compile with optimization enabled, but your code would probably break if you did that.
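For reference, here's a minimal sketch (mine, not from the question) of the first block with the clobbers declared. Note that extended asm needs %% before register names, and a numeric local label sidesteps duplicate-label problems if the compiler ever duplicates the block:
asm volatile(
    "mov $100, %%eax\n"
    "mov $200, %%ebx\n"
    "mov $" ITERATIONS ", %%ecx\n"
    "1:\n"                         // numeric local label instead of lp_test_time1
    " add %%eax, %%ebx\n"
    " add %%eax, %%ebx\n"
    " add %%eax, %%ebx\n"
    " add %%eax, %%ebx\n"
    " add %%eax, %%ebx\n"
    "loop 1b\n"
    :                              // no outputs
    :                              // no inputs
    : "eax", "ebx", "ecx", "cc");  // registers (and flags) the asm modifies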

Use experimental devices support in PlaidML?

I want to use PlaidML to speed up deep learning training on my Mac Pro computer. After installing PlaidML, I run "plaidml-setup", and received the following message:
PlaidML Setup (0.3.5)
Thanks for using PlaidML!
Some Notes:
* Bugs and other issues: https://github.com/plaidml/plaidml
* Questions: https://stackoverflow.com/questions/tagged/plaidml
* Say hello: https://groups.google.com/forum/#!forum/plaidml-dev
* PlaidML is licensed under the GNU AGPLv3
Default Config Devices:
No devices.
Experimental Config Devices:
llvm_cpu.0 : CPU (LLVM)
opencl_amd_amd_radeon_pro_555_compute_engine.0 : AMD AMD Radeon Pro 555 Compute Engine (OpenCL)
metal_amd_radeon_pro_460.0 : AMD Radeon Pro 460 (Metal)
opencl_intel_intel(r)_hd_graphics_630.0 : Intel Inc. Intel(R) HD Graphics 630 (OpenCL)
opencl_cpu.0 : Intel CPU (OpenCL)
metal_intel(r)_hd_graphics_unknown.0 : Intel(R) HD Graphics Unknown (Metal)
Using experimental devices can cause poor performance, crashes, and other nastiness.
Enable experimental device support? (y,n)[n]:
Why does it say this is 'experimental devices'? Is this normal to configure PlaidML on Mac Pro?
Should I click "yes" to proceed the setup?
EDIT:
After I click 'yes', I was presented with another set of options:
Multiple devices detected (You can override by setting PLAIDML_DEVICE_IDS).
Please choose a default device:
1 : llvm_cpu.0
2 : opencl_amd_amd_radeon_pro_555_compute_engine.0
3 : metal_amd_radeon_pro_460.0
4 : opencl_intel_intel(r)_hd_graphics_630.0
5 : opencl_cpu.0
6 : metal_intel(r)_hd_graphics_unknown.0
Default device? (1,2,3,4,5,6)[1]:
Which one should I choose? Or it doesn't matter?
What version of macOS are you running? What year is the machine? I suspect that for older machines or macOS < 10.14 you don't see a default because PlaidML has heeded Apple's deprecation of OpenGL/CL in 10.14 in favor of Metal.
FWIW, on my machine I see similar options, except the metal devices are listed under "Default Config Devices."
As for what each of these options means, briefly (okay, maybe I got carried away) explained:
You can train/run ML models on CPUs or GPUs. CPUs aren't as well suited to the pipelines of matrix math that are common in ML applications. Modern CPUs have Streaming SIMD Extensions (SIMD means Single Instruction, Multiple Data), or SSE. These allow you to do a more limited set of matrix-like operations. For example, when adding two vectors, instead of considering each pair of elements and adding them one by one, SIMD allows you to add many numbers at once. To see this, compile the following code with clang -O3 -march=native:
#include <array>
auto add(std::array<float, 64> a, std::array<float, 64> b) {
std::array<float, 64> output;
for (size_t i = 0; i < 64; i++) {
output[i] = a[i] + b[i];
}
return output;
}
We can see two different compilations depending on whether we pass -mno-sse (which as you might guess, produces a binary that works on CPUs without SSE). With SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
vmovups zmm0, zmmword ptr [rsp + 8]
vaddps zmm0, zmm0, zmmword ptr [rsp + 264]
vmovups zmmword ptr [rdi], zmm0
vmovups zmm0, zmmword ptr [rsp + 72]
vaddps zmm0, zmm0, zmmword ptr [rsp + 328]
vmovups zmmword ptr [rdi + 64], zmm0
vmovups zmm0, zmmword ptr [rsp + 136]
vaddps zmm0, zmm0, zmmword ptr [rsp + 392]
vmovups zmmword ptr [rdi + 128], zmm0
vmovups zmm0, zmmword ptr [rsp + 200]
vaddps zmm0, zmm0, zmmword ptr [rsp + 456]
vmovups zmmword ptr [rdi + 192], zmm0
vzeroupper
ret
Without SSE:
add(std::array<float, 64ul>, std::array<float, 64ul>):
mov rax, rdi
lea rcx, [rsp + 264]
lea rdx, [rsp + 8]
xor esi, esi
.LBB0_1:
fld dword ptr [rdx + 4*rsi]
fadd dword ptr [rcx + 4*rsi]
fstp dword ptr [rax + 4*rsi]
fld dword ptr [rdx + 4*rsi + 4]
fadd dword ptr [rcx + 4*rsi + 4]
fstp dword ptr [rax + 4*rsi + 4]
fld dword ptr [rdx + 4*rsi + 8]
fadd dword ptr [rcx + 4*rsi + 8]
fstp dword ptr [rax + 4*rsi + 8]
fld dword ptr [rdx + 4*rsi + 12]
fadd dword ptr [rcx + 4*rsi + 12]
fstp dword ptr [rax + 4*rsi + 12]
fld dword ptr [rdx + 4*rsi + 16]
fadd dword ptr [rcx + 4*rsi + 16]
fstp dword ptr [rax + 4*rsi + 16]
fld dword ptr [rdx + 4*rsi + 20]
fadd dword ptr [rcx + 4*rsi + 20]
fstp dword ptr [rax + 4*rsi + 20]
fld dword ptr [rdx + 4*rsi + 24]
fadd dword ptr [rcx + 4*rsi + 24]
fstp dword ptr [rax + 4*rsi + 24]
fld dword ptr [rdx + 4*rsi + 28]
fadd dword ptr [rcx + 4*rsi + 28]
fstp dword ptr [rax + 4*rsi + 28]
add rsi, 8
cmp rsi, 64
jne .LBB0_1
ret
You don't need to deeply understand what's going on here, but notice the instructions that begin with v in the SSE binary. Those are AVX instructions. And zmm0 is an AVX-512 register that can hold 16 floats (AVX-512 provides 512-bit registers; floats are 32 bits). LLVM takes advantage of this, and instead of adding the numbers element by element (like we wrote in our original code) it does them 16 at a time. You see 4 variations of the following assembly one after the other (pay attention to the math inside the parentheses):
vmovups zmm0, zmmword ptr [rsp + (8 + 64*N)]
vaddps zmm0, zmm0, zmmword ptr [rsp + (8 + 4*64 + 64*N)]
vmovups zmmword ptr [rdi + (64*N)], zmm0
The math here requires a bit of knowledge about the System V call ABI. Simply put, ignore the 8 +. [rsp + 64*N] gets you a[16*N] to a[16*(N+1)], exclusive. [rsp + (4*64 + 64*N)] skips all of a (a is 64 floats each of size 4 bytes) and gets you b[16*N] to b[16*(N+1)], exclusive. And [rdi + (64*N)] is output[16*N] to output[16*(N+1)], exclusive. So this effectively translates to the following pseudocode:
std::array<float, 16> temp = {a[16*N], a[16*N+1], ..., a[16*N+15]};
temp += {b[16*N], b[16*N+1], ..., b[16*N+15]};
{output[16*N], output[16*N+1], ..., output[16*N+15]} = temp;
So indeed, we see that AVX-512 (an extension to SIMD) allows us to do the addition in chunks of 16 numbers at a time. Compare this quickly to the -mno-sse version. It should be clear that it's doing a lot more work. Again we have a pattern of instructions (although this time it's in a loop):
fld dword ptr [rdx + 4*rsi + 4*N]
fadd dword ptr [rcx + 4*rsi + 4*N]
fstp dword ptr [rax + 4*rsi + 4*N]
There are eight of these (with N ranging from 0 to 8, exclusive). This is wrapped in a loop which repeats 8 times (8 * 8 = 64, the array length). You should be able to guess what's going on here. It's very similar to above, except we work on one number at a time instead of 16. fld is similar to vmovups, fadd is similar to vaddps. The pseudocode for this would look more like the code we actually wrote:
float temp = a[loop_num*8 + N];
temp += b[loop_num*8 + N];
output[loop_num*8 + N] = temp;
Hopefully, it is intuitive that it will be much more efficient to do things 16 at a time than 1 at a time.
There are also fancy linear algebra frameworks like blas, which can squeeze just about all the performance you can get out of a CPU when it comes to math.
GPUs work a bit differently. A gross simplification would be to think of a GPU as a device with huge SIMD instructions (particularly suited for floating point operations). So instead of working 16 at a time, imagine just handing it an entire image and in one operation it can apply a pixel-filter to it (like changing the brightness or saturation).
So what does that tangent have to do with anything?
AVX instructions make it somewhat reasonable to run some code on the CPU. All the options you see with _cpu in them will only run on the CPU. llvm_cpu will likely use techniques similar to those clang used above (clang uses LLVM behind the scenes) to compile all of the math necessary to run/train your ML models. Given that modern CPUs are multicore, this can be as much as a 16 * number_of_cores speedup.
OpenCL is an open standard for writing math computations and easily running them on various hardware (including GPUs). OpenCL also can be emulated by CPUs (admittedly at a much slower rate--remember CPUs can only do 16x, GPUs can do much more).
Metal is Apple's replacement for OpenGL/CL. It accomplishes similar things, but is macOS specific (and closed source).
The only difference left to comment on is "Intel(R) HD Graphics 630" vs "AMD Radeon 460." Your computer has two GPUs. The first one is an integrated graphics card. The integrated here means that your Intel CPU has a little GPU embedded inside of it. It isn't as performant as a discrete GPU (one that's separate from the CPU, often found in card form factors for desktops), but it gets the job done for certain less intensive graphics tasks (and typically is more power efficient). Your AMD Radeon 460 is a discrete GPU. It will likely be the most powerful piece of hardware you have for this task.
So with that in mind, I predict the devices will be, fastest to slowest:
metal_amd_radeon_pro_460.0 - Discrete GPUs are fast, Apple has optimized Metal to work very well on new Macs
opencl_amd_amd_radeon_pro_555_compute_engine.0 - This still uses the discrete GPU, but OpenCL has been neglected a bit and is now deprecated on macOS, so it likely won't be as fast
metal_intel(r)_hd_graphics_unknown.0 - Integrated GPUs are better than CPUs, Apple has optimized Metal
opencl_intel_intel(r)_hd_graphics_630.0 - ditto regarding the other OpenCL (except this is an integrated not discrete GPU)
llvm_cpu.0 - This uses the CPU, but LLVM is pretty good at writing efficient SIMD code.
opencl_cpu.0 - This emulates (2) and (4) except using your CPU, which will be much slower. Additionally, it likely doesn't have all the fancy algorithms LLVM uses to output efficient SIMD code.
But all this is speculation; you can test it with pip install plaidbench plaidml-keras keras. For each device, run plaidml-setup (selecting that device) and then run plaidbench keras mobilenet (or any of the other benchmarks). Here are the results I see on my machine:
| device | execution (s) | fps | correctness |
|------------------------------|---------------|--------|-------------|
| Metal AMD Radeon Pro 560 | 9.009 | 112.53 | PASS |
| OpenCL AMD Radeon Pro 560 | 18.339 | 93.29 | PASS |
| OpenCL Intel HD Graphics 630 | 23.204 | 60.18 | FAIL |
| Metal Intel HD Graphics 630 | 24.809 | 41.27 | PASS |
| LLVM CPU | 66.072 | 16.82 | PASS |
| OpenCL CPU Emulation | 155.639 | 6.71 | FAIL |
I've renamed the devices to have prettier names, but their mapping to the identifiers should be obvious.
Execution time is time it took to run the model (lower is better) and FPS is the FPS that the execution achieved (higher is better).
We note that the order is generally what we expected. The discrete GPU is faster than the integrated GPU, which is faster than the CPU. An important thing to call out is that OpenCL on the integrated GPU and the CPU emulation failed the correctness check. The CPU emulation was only off by about 7%, but the integrated GPU was off by about 77%. You probably only want to choose a device that passes the correctness check on your machine (it's possible--but not guaranteed--that the backend or device itself is buggy if it fails that check).
tl;dr Use metal + discrete GPU (AMD Radeon). It is the fastest device you have available. Using anything CPU-based will only spin up your fans and consume a ton of power (and take forever to finish/train).
Yes you absolutely need experimental support to use PlaidML, period. After that, you want to choose
3: metal_amd_radeon_pro_460.0
or anything that says "metal" and "radeon" (or NVIDIA, if you have that and prefer it). There is little point in using the Intel HD Graphics (even if you can, by choosing 6 : metal_intel(r)_hd_graphics_unknown.0), since it's inferior to a discrete GPU.
Apple has deprecated OpenCL in favor of its Metal framework, and the OpenCL PlaidML setups have recently been getting Fail errors in plaidbench. For example, if you use an opencl driver, you will be guaranteed a Fail error when you run
plaidbench keras mobilenet
You will most likely get a Success with a metal driver.

Checking if TWO SSE registers are not both zero without destroying them

I want to test if two SSE registers are not both zero without destroying them.
This is the code I currently have:
uint8_t *src; // Assume it is initialized and 16-byte aligned
__m128i xmm0, xmm1, xmm2;
xmm0 = _mm_load_si128((__m128i const*)&src[i]); // Need to preserve xmm0 & xmm1
xmm1 = _mm_load_si128((__m128i const*)&src[i+16]);
xmm2 = _mm_or_si128(xmm0, xmm1);
if (!_mm_testz_si128(xmm2, xmm2)) { // Test both are not zero
}
Is this the best way (using up to SSE 4.2)?
I learned something useful from this question. Let's first look at some scalar code
extern void foo2(int x, int y);
void foo(int x, int y) {
if((x || y)!=0) foo2(x,y);
}
Compile this with gcc -O3 -S -masm=intel test.c and the important assembly is
mov eax, edi ; edi = x, esi = y -> copy x into eax
or eax, esi ; eax = x | y and set zero flag in FLAGS if zero
jne .L4 ; jump not zero
Now let's look at testing SIMD registers for zero. Unlike scalar code there is no SIMD FLAGS register. However, with SSE4.1 there are SIMD test instructions which can set the zero flag (and carry flag) in the scalar FLAGS register.
#include <immintrin.h>
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
__m128i z = _mm_or_si128(x,y);
if (!_mm_testz_si128(z,z)) foo2(x,y);
}
Compile with c99 -msse4.1 -O3 -masm=intel -S test_SSE.c and the important assembly is
movdqa xmm2, xmm0 ; xmm0 = x, xmm1 = y, copy x into xmm2
por xmm2, xmm1 ; xmm2 = x | y
ptest xmm2, xmm2 ; set zero flag if zero
jne .L4 ; jump not zero
Notice that this takes one more instruction because the packed bit-wise OR does not set the zero flag. Notice also that both the scalar version and the SIMD version need to use an additional register (eax in the scalar case and xmm2 in the SIMD case). So to answer your question your current solution is the best you can do.
However, if you do not have a processor with SSE4.1 or better, you have to fall back to an alternative which only needs SSE2: _mm_movemask_epi8.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(_mm_or_si128(x,y))) foo2(x,y);
}
The important assembly is
movdqa xmm2, xmm0
por xmm2, xmm1
pmovmskb eax, xmm2
test eax, eax
jne .L4
Notice that this needs one more instruction than with the SSE4.1 ptest instruction.
Until now I had been using the pmovmskb instruction, because its latency on pre-Sandy Bridge processors is better than that of ptest. However, that was before Haswell: on Haswell the latency of pmovmskb is worse than the latency of ptest. They both have the same throughput. But in this case this is not really important. What's important (which I did not realize before) is that pmovmskb does not set the FLAGS register and so it requires another instruction. So now I'll be using ptest in my critical loop. Thank you for your question.
Edit: as suggested by the OP there is a way this can be done without using another SSE register.
extern void foo2(__m128i x, __m128i y);
void foo(__m128i x, __m128i y) {
if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) foo2(x,y);
}
The relevant assembly from GCC is:
pmovmskb eax, xmm0
pmovmskb edx, xmm1
or edx, eax
jne .L4
Instead of using another xmm register this uses two scalar registers.
Note that fewer instructions do not necessarily mean better performance. Which of these solutions is best? You have to test each of them to find out.
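For example, a rough test harness might look like this (a sketch of my own, not from the answer: the buffer size, repeat count, and use of clock() are arbitrary choices; compile with something like g++ -O2 -msse4.1):
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    (void)argv;
    enum { N = 1 << 20, REPS = 1000 };
    static alignas(16) uint8_t src[N];
    // Fill at run time (argc is normally 1, so this stays all zeros) so the
    // compiler can't constant-fold the loops away.
    for (size_t i = 0; i < N; i++) src[i] = (uint8_t)(argc - 1);

    size_t hits = 0;
    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i + 32 <= N; i += 32) {
            __m128i x = _mm_load_si128((const __m128i*)&src[i]);
            __m128i y = _mm_load_si128((const __m128i*)&src[i + 16]);
            __m128i z = _mm_or_si128(x, y);
            if (!_mm_testz_si128(z, z)) hits++;               // SSE4.1 ptest version
        }
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++)
        for (size_t i = 0; i + 32 <= N; i += 32) {
            __m128i x = _mm_load_si128((const __m128i*)&src[i]);
            __m128i y = _mm_load_si128((const __m128i*)&src[i + 16]);
            if (_mm_movemask_epi8(x) | _mm_movemask_epi8(y)) hits++;  // SSE2 pmovmskb version
        }
    clock_t t2 = clock();

    printf("ptest:    %.3f s\nmovemask: %.3f s\n(hits=%zu)\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC, hits);
    return 0;
}
Keep the warm-up caveats from the other questions in mind when interpreting the numbers.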
If you use C/C++, you cannot control the individual CPU registers. If you want full control, you must use assembler.
