I'm trying to measure TLB misses in my laptop with the following configuration:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 61
model name : Intel(R) Core(TM) i5-5200U CPU # 2.20GHz
stepping : 4
microcode : 0x1d
cpu MHz : 1593.625
cache size : 3072 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 2
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 20
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap xsaveopt
bugs :
bogomips : 4389.43
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
The above is for processor 0 , with similar results for processor 1,2 and 3.
Here's my result for trying to measure TLB misses:
perf stat -B -e dTLB-load-misses sleep 2
Performance counter stats for 'sleep 2':
0 dTLB-load-misses
2.001923304 seconds time elapsed
Not sure how to interpret this. Any insights? I read somewhere that perf doesn't work well on Sandy Bridge laptops...
Related
Problem:
I need to install Qt Creator for my university course on my mac and it doesn't open. I've tried reinstalling different versions a lot of times but it gives me a Segmentation Fault everytime.
Everything else is updated to the latest Version and i don't know what to try anymore.
Any tips and solutions would be appreciated.
This is one of the error messages:
DebuggerItem \"/Applications/Xcode.app/Contents/Developer/usr/bin/lldb\" ({8999bbd1-bcb9-4fd1-843a-47fec36eb8b6}) read from \"/Users/Samy/.config/QtProject/qtcreator/debuggers.xml\" dropped since the command is not executable."
zsh: segmentation fault /Applications/Qt\ Creator.app/Contents/MacOS/Qt\ Creator
Here is the Report if I try opening the App:
Process: Qt Creator [54924]
Path: /Applications/Qt Creator.app/Contents/MacOS/Qt Creator
Identifier: org.qt-project.qtcreator
Version: 4.11.0 (4.11.0)
Code Type: X86-64 (Native)
Parent Process: ??? [1]
Responsible: Qt Creator [54924]
User ID: 501
Date/Time: 2020-01-08 22:12:42.699 +0100
OS Version: Mac OS X 10.15 (19A602)
Report Version: 12
Anonymous UUID: 103F9158-D5E0-F1B8-BA89-222AB7C6F587
Sleep/Wake UUID: 3FB53B09-90D6-4FB8-AB3A-2D663E75701E
Time Awake Since Boot: 170000 seconds
Time Since Wake: 15000 seconds
System Integrity Protection: enabled
Crashed Thread: 0 Dispatch queue: com.apple.main-thread
Exception Type: EXC_BAD_ACCESS (SIGSEGV)
Exception Codes: KERN_INVALID_ADDRESS at 0x0000000000000000
Exception Note: EXC_CORPSE_NOTIFY
Termination Signal: Segmentation fault: 11
Termination Reason: Namespace SIGNAL, Code 0xb
Terminating Process: exc handler [54924]
VM Regions Near 0:
-->
__TEXT 000000010d082000-000000010d096000 [ 80K] r-x/rwx SM=COW /Applications/Qt Creator.app/Contents/MacOS/Qt Creator
Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0 libUtils.4.11.0.dylib 0x000000010d1ab7b0 QtPrivate::QFunctorSlotObject<Utils::FancyMainWindowPrivate::FancyMainWindowPrivate(Utils::FancyMainWindow*)::$_2, 1, QtPrivate::List<bool>, void>::impl(int, QtPrivate::QSlotObjectBase*, QObject*, void**, bool*) + 48
1 org.qt-project.QtCore 0x000000010e3369b5 0x10e10f000 + 2259381
2 org.qt-project.QtWidgets 0x000000010d3f040e QAction::setChecked(bool) + 174
3 libDebugger.dylib 0x0000000114a1a2ef Utils::DebuggerMainWindow::restorePersistentSettings() + 1151
4 libDebugger.dylib 0x0000000114a19c5d Utils::DebuggerMainWindow::DebuggerMainWindow() + 925
5 libDebugger.dylib 0x0000000114a218fb Utils::Perspective::Perspective(QString const&, QString const&, QString const&, QString const&) + 267
6 libDebugger.dylib 0x0000000114a24d25 Debugger::Internal::DebuggerPluginPrivate::DebuggerPluginPrivate(QStringList const&) + 1781
7 libDebugger.dylib 0x0000000114a36747 Debugger::Internal::DebuggerPlugin::initialize(QStringList const&, QString*) + 39
8 libExtensionSystem.4.11.0.dylib 0x000000010d0cbecb ExtensionSystem::Internal::PluginSpecPrivate::initializePlugin() + 107
9 libExtensionSystem.4.11.0.dylib 0x000000010d0be6a0 ExtensionSystem::Internal::PluginManagerPrivate::loadPlugin(ExtensionSystem::PluginSpec*, ExtensionSystem::PluginSpec::State) + 656
10 libExtensionSystem.4.11.0.dylib 0x000000010d0b6f90 ExtensionSystem::Internal::PluginManagerPrivate::loadPlugins() + 592
11 org.qt-project.qtcreator 0x000000010d08de28 main + 14520
12 libdyld.dylib 0x00007fff6e91b405 start + 1
Thread 1:
0 libsystem_pthread.dylib 0x00007fff6eb245b4 start_wqthread + 0
Thread 2:
0 libsystem_pthread.dylib 0x00007fff6eb245b4 start_wqthread + 0
Thread 3:
0 libsystem_pthread.dylib 0x00007fff6eb245b4 start_wqthread + 0
Thread 4:
0 libsystem_pthread.dylib 0x00007fff6eb245b4 start_wqthread + 0
Thread 5:
0 libsystem_pthread.dylib 0x00007fff6eb245b4 start_wqthread + 0
Thread 6:: com.apple.CFSocket.private
0 libsystem_kernel.dylib 0x00007fff6ea6b7c6 __select + 10
1 com.apple.CoreFoundation 0x00007fff3764a92a __CFSocketManager + 632
2 libsystem_pthread.dylib 0x00007fff6eb27d76 _pthread_start + 125
3 libsystem_pthread.dylib 0x00007fff6eb245d7 thread_start + 15
Thread 0 crashed with X86 Thread State (64-bit):
rax: 0x0000000000000000 rbx: 0x0000000000000000 rcx: 0x00007ffee2b7cf50 rdx: 0x0000600000d5a3c0
rdi: 0x0000000000000000 rsi: 0x000060000184bba0 rbp: 0x00007ffee2b7ce70 rsp: 0x00007ffee2b7ce60
r8: 0x0000000000000000 r9: 0x0000000000000000 r10: 0x000000011c7d7be0 r11: 0x000000010d5461d0
r12: 0x0000600003c9bea0 r13: 0x000000011c805d40 r14: 0x0000000000000001 r15: 0x0000600000d5a3c0
rip: 0x000000010d1ab7b0 rfl: 0x0000000000010246 cr2: 0x0000000000000000
Logical CPU: 2
Error Code: 0x00000004 (no mapping for user data write)
Trap Number: 14
[Number jibberish]`enter code here`
External Modification Summary:
Calls made by other processes targeting this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by this process:
task_for_pid: 0
thread_create: 0
thread_set_state: 0
Calls made by all processes on this machine:
task_for_pid: 114998
thread_create: 0
thread_set_state: 0
VM Region Summary:
ReadOnly portion of Libraries: Total=780.0M resident=0K(0%) swapped_out_or_unallocated=780.0M(100%)
Writable regions: Total=573.9M written=0K(0%) resident=0K(0%) swapped_out=0K(0%) unallocated=573.9M(100%)
VIRTUAL REGION
REGION TYPE SIZE COUNT (non-coalesced)
=========== ======= =======
Accelerate framework 128K 1
Activity Tracing 256K 1
CG backing stores 248K 2
CoreImage 8K 2
CoreUI image data 72K 1
JS VM register file 4360K 4
JS VM register file (reserved) 2040K 1 reserved VM address space (unallocated)
Kernel Alloc Once 8K 1
MALLOC 167.2M 43
MALLOC guard page 16K 4
MALLOC_NANO (reserved) 384.0M 1 reserved VM address space (unallocated)
STACK GUARD 56.0M 7
Stack 11.0M 7
VM_ALLOCATE 132K 9
WebAssembly memory 4096K 1
__DATA 39.4M 428
__DATA_CONST 41K 3
__FONT_DATA 4K 1
__GLSLBUILTINS 5176K 1
__LINKEDIT 382.9M 106
__OBJC_RO 31.8M 1
__OBJC_RW 1764K 2
__TEXT 397.2M 410
__UNICODE 564K 1
mapped file 332.3M 43
shared memory 640K 15
=========== ======= =======
TOTAL 1.8G 1096
TOTAL, minus reserved VM space 1.4G 1096
Model: MacBookAir7,2, BootROM 190.0.0.0.0, 2 processors, Dual-Core Intel Core i5, 1,6 GHz, 8 GB, SMC 2.27f2
Graphics: kHW_IntelHDGraphics6000Item, Intel HD Graphics 6000, spdisplays_builtin
Memory Module: BANK 0/DIMM0, 4 GB, DDR3, 1600 MHz, 0x02FE, -
Memory Module: BANK 1/DIMM0, 4 GB, DDR3, 1600 MHz, 0x02FE, -
AirPort: spairport_wireless_card_type_airport_extreme (0x14E4, 0x117), Broadcom BCM43xx 1.0 (7.77.105.1 AirPortDriverBrcmNIC-1429)
Bluetooth: Version 7.0.0f8, 3 services, 18 devices, 1 incoming serial ports
Network Service: Wi-Fi, AirPort, en0
Serial ATA Device: APPLE SSD SM0128G, 121,33 GB
USB Device: USB 3.0 Bus
USB Device: BRCM20702 Hub
USB Device: Bluetooth USB Host Controller
Thunderbolt Bus: MacBook Air, Apple Inc., 27.2
I have written a benchmark to compute memory bandwidth:
#include <benchmark/benchmark.h>
double sum_array(double* v, long n)
{
double s = 0;
for (long i =0 ; i < n; ++i) {
s += v[i];
}
return s;
}
void BM_MemoryBandwidth(benchmark::State& state) {
long n = state.range(0);
double* v = (double*) malloc(state.range(0)*sizeof(double));
for (auto _ : state) {
benchmark::DoNotOptimize(sum_array(v, n));
}
free(v);
state.SetComplexityN(state.range(0));
state.SetBytesProcessed(int64_t(state.range(0))*int64_t(state.iterations())*sizeof(double));
}
BENCHMARK(BM_MemoryBandwidth)->RangeMultiplier(2)->Range(1<<5, 1<<23)->Complexity(benchmark::oN);
BENCHMARK_MAIN();
I compile with
g++-9 -masm=intel -fverbose-asm -S -g -O3 -ffast-math -march=native --std=c++17 -I/usr/local/include memory_bandwidth.cpp
This produces a bunch of moves from RAM, and then some addpd instructions which perf says are hot, so I go into the generated asm and remove them, then assemble and link via
$ g++-9 -c memory_bandwidth.s -o memory_bandwidth.o
$ g++-9 memory_bandwidth.o -o memory_bandwidth.x -L/usr/local/lib -lbenchmark -lbenchmark_main -pthread -fPIC
At this point, get a perf output that I expect: Movement of data into xmm registers, increments of the pointer, and a jmp at the end of the loop:
All fine and well up to here. Now here's where things get weird:
I inquire of my hardware what the memory bandwidth is:
$ sudo lshw -class memory
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
vendor: AMI
physical id: 1
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
So I should be getting at most 8 bytes * 2.4 GHz = 19.2 gigabytes/second.
But instead I get 48 gigabytes/second:
-------------------------------------------------------------------------------------
Benchmark Time CPU Iterations UserCounters...
-------------------------------------------------------------------------------------
BM_MemoryBandwidth/32 6.43 ns 6.43 ns 108045392 bytes_per_second=37.0706G/s
BM_MemoryBandwidth/64 11.6 ns 11.6 ns 60101462 bytes_per_second=40.9842G/s
BM_MemoryBandwidth/128 21.4 ns 21.4 ns 32667394 bytes_per_second=44.5464G/s
BM_MemoryBandwidth/256 47.6 ns 47.6 ns 14712204 bytes_per_second=40.0884G/s
BM_MemoryBandwidth/512 86.9 ns 86.9 ns 8057225 bytes_per_second=43.9169G/s
BM_MemoryBandwidth/1024 165 ns 165 ns 4233063 bytes_per_second=46.1437G/s
BM_MemoryBandwidth/2048 322 ns 322 ns 2173012 bytes_per_second=47.356G/s
BM_MemoryBandwidth/4096 636 ns 636 ns 1099074 bytes_per_second=47.9781G/s
BM_MemoryBandwidth/8192 1264 ns 1264 ns 553898 bytes_per_second=48.3047G/s
BM_MemoryBandwidth/16384 2524 ns 2524 ns 277224 bytes_per_second=48.3688G/s
BM_MemoryBandwidth/32768 5035 ns 5035 ns 138843 bytes_per_second=48.4882G/s
BM_MemoryBandwidth/65536 10058 ns 10058 ns 69578 bytes_per_second=48.5455G/s
BM_MemoryBandwidth/131072 20103 ns 20102 ns 34832 bytes_per_second=48.5802G/s
BM_MemoryBandwidth/262144 40185 ns 40185 ns 17420 bytes_per_second=48.6035G/s
BM_MemoryBandwidth/524288 80351 ns 80347 ns 8708 bytes_per_second=48.6171G/s
BM_MemoryBandwidth/1048576 160855 ns 160851 ns 4353 bytes_per_second=48.5699G/s
BM_MemoryBandwidth/2097152 321657 ns 321643 ns 2177 bytes_per_second=48.5787G/s
BM_MemoryBandwidth/4194304 648490 ns 648454 ns 1005 bytes_per_second=48.1915G/s
BM_MemoryBandwidth/8388608 1307549 ns 1307485 ns 502 bytes_per_second=47.8017G/s
BM_MemoryBandwidth_BigO 0.16 N 0.16 N
BM_MemoryBandwidth_RMS 1 % 1 %
What am I misunderstanding about memory bandwidth that has made my calculations come out wrong by more than a factor of 2?
(Also, this is kinda an insane workflow to empirically determine how much memory bandwidth I have. Is there a better way?)
Full asm for sum_array after removing add instructions:
_Z9sum_arrayPdl:
.LVL0:
.LFB3624:
.file 1 "example_code/memory_bandwidth.cpp"
.loc 1 5 1 view -0
.cfi_startproc
.loc 1 6 5 view .LVU1
.loc 1 7 5 view .LVU2
.LBB1545:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 is_stmt 0 view .LVU3
test rsi, rsi # n
jle .L7 #,
lea rax, -1[rsi] # tmp105,
cmp rax, 1 # tmp105,
jbe .L8 #,
mov rdx, rsi # bnd.299, n
shr rdx # bnd.299
sal rdx, 4 # tmp107,
mov rax, rdi # ivtmp.311, v
add rdx, rdi # _44, v
pxor xmm0, xmm0 # vect_s_10.306
.LVL1:
.p2align 4,,10
.p2align 3
.L5:
.loc 1 8 9 is_stmt 1 discriminator 2 view .LVU4
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 discriminator 2 view .LVU5
movupd xmm2, XMMWORD PTR [rax] # tmp115, MEM[base: _24, offset: 0B]
add rax, 16 # ivtmp.311,
.loc 1 8 11 discriminator 2 view .LVU6
cmp rax, rdx # ivtmp.311, _44
jne .L5 #,
movapd xmm1, xmm0 # tmp110, vect_s_10.306
unpckhpd xmm1, xmm0 # tmp110, vect_s_10.306
mov rax, rsi # tmp.301, n
and rax, -2 # tmp.301,
test sil, 1 # n,
je .L10 #,
.L3:
.LVL2:
.loc 1 8 9 is_stmt 1 view .LVU7
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU8
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_3
.LVL3:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 5 view .LVU9
inc rax # i
.LVL4:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 24 view .LVU10
cmp rsi, rax # n, i
jle .L1 #,
.loc 1 8 9 is_stmt 1 view .LVU11
# example_code/memory_bandwidth.cpp:8: s += v[i];
.loc 1 8 11 is_stmt 0 view .LVU12
addsd xmm0, QWORD PTR [rdi+rax*8] # <retval>, *_6
.LVL5:
.loc 1 8 11 view .LVU13
ret
.LVL6:
.p2align 4,,10
.p2align 3
.L7:
.loc 1 8 11 view .LVU14
.LBE1545:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU15
pxor xmm0, xmm0 # <retval>
.loc 1 10 5 is_stmt 1 view .LVU16
.LVL7:
.L1:
# example_code/memory_bandwidth.cpp:11: }
.loc 1 11 1 is_stmt 0 view .LVU17
ret
.p2align 4,,10
.p2align 3
.L10:
.loc 1 11 1 view .LVU18
ret
.LVL8:
.L8:
.LBB1546:
# example_code/memory_bandwidth.cpp:7: for (long i =0 ; i < n; ++i) {
.loc 1 7 15 view .LVU19
xor eax, eax # tmp.301
.LBE1546:
# example_code/memory_bandwidth.cpp:6: double s = 0;
.loc 1 6 12 view .LVU20
pxor xmm0, xmm0 # <retval>
jmp .L3 #
.cfi_endproc
.LFE3624:
.size _Z9sum_arrayPdl, .-_Z9sum_arrayPdl
.section .text.startup,"ax",#progbits
.p2align 4
.globl main
.type main, #function
Full output of lshw -class memory:
*-firmware
description: BIOS
vendor: American Megatrends Inc.
physical id: 0
version: 1.90
date: 10/21/2016
size: 64KiB
capacity: 15MiB
capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification uefi
*-memory
description: System Memory
physical id: 3c
slot: System board or motherboard
size: 16GiB
*-bank:0
description: [empty]
physical id: 0
slot: ChannelA-DIMM0
*-bank:1
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 1
serial: 00000000
slot: ChannelA-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
*-bank:2
description: [empty]
physical id: 2
slot: ChannelB-DIMM0
*-bank:3
description: DIMM DDR4 Synchronous 2400 MHz (0.4 ns)
product: CMU16GX4M2A2400C16
vendor: AMI
physical id: 3
serial: 00000000
slot: ChannelB-DIMM1
size: 8GiB
width: 64 bits
clock: 2400MHz (0.4ns)
Is the CPU relevant here? Well here's the specs:
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 2
On-line CPU(s) list: 0,1
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 94
Model name: Intel(R) Pentium(R) CPU G4400 # 3.30GHz
Stepping: 3
CPU MHz: 3168.660
CPU max MHz: 3300.0000
CPU min MHz: 800.0000
BogoMIPS: 6624.00
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0,1
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust erms invpcid rdseed smap clflushopt intel_pt xsaveopt xsavec xgetbv1 xsaves dtherm arat pln pts hwp hwp_notify hwp_act_window hwp_epp flush_l1d
The data produced by the clang compile is much more intelligible. The performance monotonically decreases until it hits 19.8Gb/s as the vector gets much larger than cache:
Here's the benchmark output:
It looks like from your hardware description that you have two DIMM slots that are placed into two channels. This interleaves memory between the two DIMM chips, so that memory accesses will be reading from both chips. (One possibility is that bytes 0-7 are in DIMM1 and bytes 8-15 are in DIMM2, but this depends on the hardware implementation.) This doubles the memory bandwidth because you're accessing two hardware chips instead of one.
Some systems support three or four channels, further increasing the maximum bandwidth.
I have implemented scalar matrix addition kernel.
#include <stdio.h>
#include <time.h>
//#include <x86intrin.h>
//loops and iterations:
#define N 128
#define M N
#define NUM_LOOP 1000000
float __attribute__(( aligned(32))) A[N][M],
__attribute__(( aligned(32))) B[N][M],
__attribute__(( aligned(32))) C[N][M];
int main()
{
int w=0, i, j;
struct timespec tStart, tEnd;//used to record the processiing time
double tTotal , tBest=10000;//minimum of toltal time will asign to the best time
do{
clock_gettime(CLOCK_MONOTONIC,&tStart);
for( i=0;i<N;i++){
for(j=0;j<M;j++){
C[i][j]= A[i][j] + B[i][j];
}
}
clock_gettime(CLOCK_MONOTONIC,&tEnd);
tTotal = (tEnd.tv_sec - tStart.tv_sec);
tTotal += (tEnd.tv_nsec - tStart.tv_nsec) / 1000000000.0;
if(tTotal<tBest)
tBest=tTotal;
} while(w++ < NUM_LOOP);
printf(" The best time: %lf sec in %d repetition for %dX%d matrix\n",tBest,w, N, M);
return 0;
}
In this case, I've compiled the program with different compiler flag and the assembly output of the inner loop is as follows:
gcc -O2 msse4.2: The best time: 0.000024 sec in 406490 repetition for 128X128 matrix
movss xmm1, DWORD PTR A[rcx+rax]
addss xmm1, DWORD PTR B[rcx+rax]
movss DWORD PTR C[rcx+rax], xmm1
gcc -O2 -mavx: The best time: 0.000009 sec in 1000001 repetition for 128X128 matrix
vmovss xmm1, DWORD PTR A[rcx+rax]
vaddss xmm1, xmm1, DWORD PTR B[rcx+rax]
vmovss DWORD PTR C[rcx+rax], xmm1
AVX version gcc -O2 -mavx:
__m256 vec256;
for(i=0;i<N;i++){
for(j=0;j<M;j+=8){
vec256 = _mm256_add_ps( _mm256_load_ps(&A[i+1][j]) , _mm256_load_ps(&B[i+1][j]));
_mm256_store_ps(&C[i+1][j], vec256);
}
}
SSE version gcc -O2 -sse4.2::
__m128 vec128;
for(i=0;i<N;i++){
for(j=0;j<M;j+=4){
vec128= _mm_add_ps( _mm_load_ps(&A[i][j]) , _mm_load_ps(&B[i][j]));
_mm_store_ps(&C[i][j], vec128);
}
}
In scalar program the speedup of -mavx over msse4.2 is 2.7x. I know the avx improved the ISA efficiently and it might be because of these improvements. But when I implemented the program in intrinsics for both AVX and SSE the speedup is a factor of 3x. The question is: AVX scalar is 2.7x faster than SSE when I vectorized it the speed up is 3x (matrix size is 128x128 for this question). Does it make any sense While using AVX and SSE in scalar mode yield, a 2.7x speedup. but vectorized method must be better because I process eight elements in AVX compared to four elements in SSE. All programs have less than 4.5% of cache misses as perf stat reported.
using gcc -O2 , linux mint, skylake
UPDATE: Briefly, Scalar-AVX is 2.7x faster than Scalar-SSE but AVX-256 is only 3x faster than SSE-128 while it's vectorized. I think it might be because of pipelining. in scalar I have 3 vec-ALU that might not be useable in vectorized mode. I might compare apples to oranges instead of apples to apples and this might be the point that I can not understand the reason.
The problem you are observing is explained here. On Skylake systems if the upper half of an AVX register is dirty then there is false dependency for non-vex encoded SSE operations on the upper half of the AVX register. In your case it seems there is a bug in your version of glibc 2.23. On my Skylake system with Ubuntu 16.10 and glibc 2.24 I don't have the problem. You can use
__asm__ __volatile__ ( "vzeroupper" : : : );
to clean the upper half of the AVX register. I don't think you can use an intrinsic such as _mm256_zeroupper to fix this because GCC will say it's SSE code and not recognize the intrinsic. The options -mvzeroupper won't work either because GCC one again thinks it's SSE code and will not emit the vzeroupper instruction.
BTW, it's Microsoft's fault that the hardware has this problem.
Update:
Other people are apparently encountering this problem on Skylake. It has been observed after printf, memset, and clock_gettime.
If your goal is to compare 128-bit operations with 256-bit operations could consider using -mprefer-avx128 -mavx (which is particularly useful on AMD). But then you would be comparing AVX256 vs AVX128 and not AVX256 vs SSE. AVX128 and SSE both use 128-bit operations but their implementations are different. If you benchmark you should mention which one you used.
I run 25000 clients that just upload log to server every 1 second. The Server crashes in the process.From the log file, we found that the cause of the crash was the JVM crash.The Error log show :
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f032964f085, pid=2043, tid=0x00007f02955cd700
#
# JRE version: Java(TM) SE Runtime Environment (8.0_111-b14) (build 1.8.0_111-b14)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.111-b14 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5c4085] G1ParScanThreadState::copy_to_survivor_space(InCSetState, oopDesc*, markOopDesc*)+0x45
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid2043.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
My JVM Arguments and System and more infos :
VM Arguments:
jvm_args: -Xms256M -Xmx16G -XX:+UseG1GC -Dfile.encoding=UTF8 -Dserver_log_dir=/var/log/kaa -Dserver_log_sufix= -Dserver_home_dir=/usr/lib/kaa-node -Dcom.sun.management.jmxremote=true -Dcom.sun.management.jmxremote.port=7091 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false
java_command: org.kaaproject.kaa.server.node.KaaNodeApplication
java_class_path (initial): /usr/lib/kaa-node/conf:/usr/lib/kaa-node/lib/spring-context-4.2.5.RELEASE.jar:/usr/lib/kaa-node/lib/curator-client-2.9.0.jar:/usr/lib/kaa-node/lib/spring-jdbc-4.2.5.RELEASE.jar:/usr/lib/kaa-node/lib/javax.annotation-api-1.2.jar:/usr/lib/kaa-node/lib/log4j-over-slf4j-1.7.7.jar:/usr/lib/kaa-node/lib/jna-4.0.0.jar:/usr/lib/kaa-node/lib/jetty-server-9.2.2.v20140723.jar:/usr/lib/kaa-node/lib/fastutil-6.5.7.jar:/usr/lib/kaa-node/lib/application-action-0.0.64.jar:/usr/lib/kaa-node/lib/commons-collections-3.2.1.jar:/usr/lib/kaa-node/lib/joda-time-2.2.jar:/usr/lib/kaa-node/lib/httpcore-4.3.2.jar:/usr/lib/kaa-node/lib/velocity-1.7.jar:/usr/lib/kaa-node/lib/spring-tx-4.2.5.RELEASE.jar:/usr/lib/kaa-node/lib/hibernate-entitymanager-4.3.11.Final.jar:/usr/lib/kaa-node/lib/jetty-http-9.2.2.v20140723.jar:/usr/lib/kaa-node/lib/commons-cli-1.2.jar:/usr/lib/kaa-node/lib/gwt-client-0.2.1.jar:/usr/lib/kaa-node/lib/swagger-annotations-1.5.9.jar:/usr/lib/kaa-node/lib/jandex-1.1.0.Final.jar:/usr/lib/kaa-node/lib/core-0.10.0.jar:/usr/lib/kaa-node/lib/jetty-security-9.2.2.v20140723.jar:/usr/lib/kaa-node/lib/commons-compress-1.8.jar:/usr/lib/kaa-node/lib/jackson-core-asl-1.9.13.jar:/usr/lib/kaa-node/lib/netty-codec-4.0.34.Final.jar:/usr/lib/kaa-node/lib/dao-0.10.0.jar:/usr/lib/kaa-node/lib/file-appender-0.10.0.jar:/usr/lib/kaa-node/lib/spring-security-web-3.2.9.RELEASE.jar:/usr/lib/kaa-node/lib/gwtquery-1.4.2.jar:/usr/lib/kaa-node/lib/facebook-verifier-0.10.0.jar:/usr/lib/kaa-node/lib/transport-0.10.0-tcp.jar:/usr/lib/kaa-node/lib/spring-data-mongodb-1.9.4.RELEASE.jar:/usr/lib/kaa-node/lib/cassandra-driver-extras-3.0.0.jar:/usr/lib/kaa-node/lib/cassandra-all-3.4.jar:/usr/lib/kaa-node/lib/jcl-over-slf4j-1.7.21.jar:/usr/lib/kaa-node/lib/ant-launcher-1.9.4.jar:/usr/lib/kaa-node/lib/hamcrest-core-1.3.jar:/usr/lib/kaa-node/lib/aspectjrt-1.7.4.jar:/usr/lib/kaa-node/lib/guava-18.0.jar:/usr/lib/kaa-node/lib/spring-beans-4.2.5.RELEASE.jar:/usr/lib/kaa-node/lib/kaa-node-0.10
Launcher Type: SUN_STANDARD
Environment Variables:
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games
SHELL=/bin/bash
SYSTEM:
OS:DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=14.04
DISTRIB_CODENAME=trusty
DISTRIB_DESCRIPTION="Ubuntu 14.04.5 LTS"
uname:Linux 4.4.0-31-generic #50~14.04.1-Ubuntu SMP Wed Jul 13 01:07:32 UTC 2016 x86_64
libc:glibc 2.19 NPTL 2.19
rlimit: STACK 8192k, CORE 0k, NPROC 32768, NOFILE 65536, AS infinity
load average:23.81 21.57 24.39
/proc/meminfo:
MemTotal: 32629180 kB
MemFree: 11245384 kB
MemAvailable: 16204112 kB
Buffers: 116504 kB
Cached: 5084432 kB
SwapCached: 0 kB
Active: 9205152 kB
Inactive: 3234744 kB
Active(anon): 7260192 kB
Inactive(anon): 1048 kB
Active(file): 1944960 kB
Inactive(file): 3233696 kB
Unevictable: 8523068 kB
Mlocked: 8523068 kB
SwapTotal: 0 kB
SwapFree: 0 kB
Dirty: 3096 kB
Writeback: 0 kB
AnonPages: 15762160 kB
Mapped: 168560 kB
Shmem: 1384 kB
Slab: 280612 kB
SReclaimable: 181816 kB
SUnreclaim: 98796 kB
KernelStack: 17856 kB
PageTables: 36468 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 16314588 kB
Committed_AS: 18242600 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 15398912 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 70872 kB
DirectMap2M: 2754560 kB
DirectMap1G: 30408704 kB
CPU:total 8 (4 cores per cpu, 2 threads per core) family 6 model 60 stepping 3, cmov, cx8, fxsr, mmx, sse, sse2, sse3, ssse3, sse4.1, sse4.2, popcnt, avx, avx2, aes, clmul, erms, lzcnt, ht, tsc, tscinvbit, bmi1, bmi2
/proc/cpuinfo:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3800.109
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3860.859
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 2
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3799.968
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 4
initial apicid : 4
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 3
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3799.968
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 6
initial apicid : 6
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 4
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3893.906
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 1
initial apicid : 1
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 5
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3800.109
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 1
cpu cores : 4
apicid : 3
initial apicid : 3
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 6
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3799.968
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 2
cpu cores : 4
apicid : 5
initial apicid : 5
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 60
model name : Intel(R) Core(TM) i7-4790 CPU # 3.60GHz
stepping : 3
microcode : 0x1d
cpu MHz : 3799.968
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt dtherm ida arat pln pts
bugs :
bogomips : 7183.28
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Memory: 4k page, physical 32629180k(11245384k free), swap 0k(0k free)
vm_info: Java HotSpot(TM) 64-Bit Server VM (25.111-b14) for linux-amd64 JRE (1.8.0_111-b14), built on Sep 22 2016 16:14:03 by "java_re" with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
time: Fri Nov 25 05:01:22 2016
elapsed time: 21112 seconds (0d 5h 51m 52s)
In my limited experience with JVM. So I am searching for a long time on net and find related errors at Oracle site.But I didn't find a solution from it. From my error log:
Memory: 4k page, physical 32629180k(11245384k free), swap 0k(0k free)
show the physical memory occupied too much.This can be caused by any bug that corrupts heap memory. It could be an issue with GC, with the compiler, with bad native code.
If you don't use any native libraries that might have corrupted the heap, this a a bug in the JVM. You should check whether Oracle already knows about the bug and (if not) file a bug report.
The name of the problematic frame (G1ParScanThreadState::copy_to_survivor_space) strongly suggest that the garbage collector (GC) crashes. So for a workaround until the bug is fixed you can try any of the following:
Monitor the garbage collector and make sure the memory usage doesn't increase over time and the garbage collector doesn't use too much CPU time
Change the garbage collector parameters (see Java's command line
parameters)
Switch to a different garbage collector (see Java's command line parameters)
As you're trying to work around a bug, it's trial and error.
In the Intel Intrinsics Guide there are 'Latency and Throughput Information' at the bottom of several Intrinsics, listing the performance for several CPUID(s).
For example, the table in the Intrinsics Guide looks as follows for the Intrinsic _mm_hadd_pd:
CPUID(s) Parameters Latency Throughput
0F_03 13 4
06_2A xmm1, xmm2 5 2
06_25/2C/1A/1E/1F/2E xmm1, xmm2 5 2
06_17/1D xmm1, xmm2 6 1
06_0F xmm1, xmm2 5 2
Now: How do I determine, what ID my CPU has?
I'm using Kubuntu 12.04 and tried with sudo dmidecode -t 4 and also with the little program cpuid from the Ubuntu packages, but their output isn't really useful.
I cannot find any of the strings listed in the Intrinsics Guide anywhere in the output of the commands above.
you can get that information using CPUID instruction, where
The extended family, bit positions 20 through
27 are used in conjunction with the family
code, specified in bit positions 8 through 11, to indicate whether the processor belongs to
the Intel386, Intel486, Pentium, Pentium Pro or Pentium 4 family of processors. P6
family processors include all processors based on the Pentium Pro processor architecture
and have an extended family equal to 00h
and a family code equal to 06h. Pentium 4
family processors include all processors based on the Intel NetBurst® microarchitecture
and have an extended family equal to 00h and a family code equal to 0Fh.
The extended model specified in bit positi
ons 16 through 19, in conjunction with the
model number specified in bits 4 though 7 are
used to identify the model of the processor
within the processor’s family.
see page 22 in Intel Processor Identification and the CPUID Instruction for futher details.
Actual CPUID is then "family_model".
The following code should do the job:
#include "stdio.h"
int main () {
int ebx = 0, ecx = 0, edx = 0, eax = 1;
__asm__ ("cpuid": "=b" (ebx), "=c" (ecx), "=d" (edx), "=a" (eax):"a" (eax));
int model = (eax & 0x0FF) >> 4;
int extended_model = (eax & 0xF0000) >> 12;
int family_code = (eax & 0xF00) >> 8;
int extended_family_code = (eax & 0xFF00000) >> 16;
printf ("%x %x %x %x \n", eax, ebx, ecx, edx);
printf ("CPUID: %02x %x\n", extended_family_code | family_code, extended_model | model);
return 0;
}
For my computer I get:
CPUID: 06_25
hope it helps.