Max performance with cores/threads

I have a single computer with Debian installed that gives me the following output from lscpu:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 32
On-line CPU(s) list: 0-31
Thread(s) per core: 2
Core(s) per socket: 8
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
Stepping: 7
CPU MHz: 2711.791
CPU max MHz: 2800.0000
CPU min MHz: 1200.0000
BogoMIPS: 4000.03
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 20480K
NUMA node0 CPU(s): 0-7,16-23
NUMA node1 CPU(s): 8-15,24-31
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb kaiser tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts
I want to maximize the performance of a code compiled with the Intel compiler, MKL, and BLAS by tuning the mpirun and OpenMP parameters. How can I exploit this machine to get the best performance?
What I tried:
mpirun -np 16 code (doesn't use all resources according to htop)
mpirun -np 32 code (worst case)
export OMP_NUM_THREADS=2 ; mpirun -np 16 (doesn't use all resources according to htop)
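A common hybrid layout on a 2-socket, 8-cores-per-socket machine like this is one MPI rank per socket with one OpenMP thread per physical core. A minimal sketch, assuming Open MPI's mpirun (Intel MPI uses different binding options) and a code linked against the threaded MKL:
$ export OMP_NUM_THREADS=8        # 8 physical cores per socket
$ export OMP_PLACES=cores         # one place per physical core, not per hyper-thread
$ export OMP_PROC_BIND=close
$ mpirun -np 2 --map-by socket --bind-to socket ./code
Whether 2 ranks x 8 threads, 4 x 4, or 16 x 1 is fastest depends on the code; the extra 16 hyper-threads rarely help a dense MKL/BLAS workload, so filling all 32 logical CPUs is not necessarily the goal. If the Intel OpenMP runtime is used, KMP_AFFINITY=compact is a common alternative for the thread pinning.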

Related

CLI command to determine if a Mac machine is running on Apple Silicon or Intel?

On an M1 Mac Mini, if I do uname -i, I get:
uname: illegal option -- -
usage: uname [-amnprsv]
If I do uname -m, I get:
x86_64
If I do uname -a, I get:
Darwin bevrymemacmini.local 20.5.0 Darwin Kernel Version 20.5.0: Sat May 8 05:10:31 PDT 2021; root:xnu-7195.121.3~9/RELEASE_ARM64_T8101 x86_64
Only in uname -a do I get the information that it is running on Apple Silicon via the ARM64 part.
Is there a better way of determining which architecture I am running on Mac machines?
Use the command sysctl -a | grep machdep.cpu to get a detailed dump on the CPU.
The command sysctl -n machdep.cpu.brand_string will just return the Intel model number and speed.
Output from my Mac using Intel:
machdep.cpu.xsave.extended_state: xxxxxxxxxxxxxxxx
machdep.cpu.xsave.extended_state1: xxxxxxxxxxxxx
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.sensor: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.mwait.sub_Cstates: 286396448
machdep.cpu.cache.L2_associativity: 8
machdep.cpu.cache.linesize: 64
machdep.cpu.cache.size: 256
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.events_number: 8
machdep.cpu.arch_perf.fixed_number: 4
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.arch_perf.number: 8
machdep.cpu.arch_perf.version: 5
machdep.cpu.arch_perf.width: 48
machdep.cpu.address_bits.physical: 39
machdep.cpu.address_bits.virtual: 48
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.tsc_ccc.numerator: 104
machdep.cpu.brand: 0
machdep.cpu.brand_string: Intel(R) Core(TM) i5-1038NG7 CPU @ 2.00GHz
machdep.cpu.core_count: 4
machdep.cpu.cores_per_package: 8
machdep.cpu.extfamily: 0
machdep.cpu.extfeature_bits: xxxxxxxxxxxxxxxxxx
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.extmodel: 7
machdep.cpu.family: 6
machdep.cpu.feature_bits: xxxxxxxxxxxxxxxxxxxxxx
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.leaf7_feature_bits: xxxxxxxxxxxxxxxxxxxxxx
machdep.cpu.leaf7_feature_bits_edx: xxxxxxxxxxxxxxx
machdep.cpu.leaf7_features: RDWRFSGS TSC_THREAD_OFFSET SGX BMI1 AVX2 FDPEO SMEP BMI2 ERMS INVPCID FPU_CSDS AVX512F AVX512DQ RDSEED ADX SMAP AVX512IFMA CLFSOPT IPT AVX512CD SHA AVX512BW AVX512VL AVX512VBMI UMIP PKU GFNI VAES VPCLMULQDQ AVX512VNNI AVX512BITALG AVX512VPOPCNTDQ RDPID SGXLC FSREPMOV MDCLEAR IBRS STIBP L1DF ACAPMSR SSBD
machdep.cpu.logical_per_package: 16
machdep.cpu.max_basic: 27
machdep.cpu.max_ext: xxxxxxxxxxxx
machdep.cpu.microcode_version: 166
machdep.cpu.model: 126
machdep.cpu.processor_flag: 7
machdep.cpu.signature: xxxxxxxxx
machdep.cpu.stepping: 5
machdep.cpu.thread_count: 8
machdep.cpu.vendor: GenuineIntel
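As a side note, on Apple Silicon a shell running under Rosetta 2 reports x86_64 from uname -m, which would explain the x86_64 the question saw on an M1 Mac Mini; two sysctl keys can disambiguate:
$ sysctl -in sysctl.proc_translated   # 1 = running under Rosetta, 0 = native, empty on Intel Macs
$ sysctl -in hw.optional.arm64        # 1 on Apple Silicon, empty on Intel Macs
uname -m itself prints arm64 when the shell runs natively on Apple Silicon.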

How to enable KVM on a Mac for Qemu?

I'm virtualizing a machine for the first time on my Mac with QEMU (for a university assignment, so it's not possible to change the tool).
We have to compare some measurements between a VM running on KVM and one without KVM.
I tried to start the KVM machine by calling qemu-system-x86_64 my.qcow2 -enable-kvm but I'm getting this error:
qemu-system-x86_64: -machine accel=kvm: No accelerator found
I checked sysctl -a | grep machdep.cpu.features and that's my output:
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR
PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE
SSE3 PCLMULQDQ DTES64 MON DSCPL VMX EST TM2 SSSE3 FMA CX16 TPR PDCM
SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64
TSCTMR AVX1.0 RDRAND F16C
As VMX is listed, I assume my MacBook supports KVM, and as far as I understood it should be enabled by default.
So why am I getting this error, and does anybody have a solution?
By the way, my MacBook Pro is a Retina, 13-inch, Mid 2014 model running 10.14.1 (18B75).
KVM is the Linux hypervisor implementation; that isn't going to work on macOS.
Recent QEMU versions have support for the macOS Hypervisor framework; use accel=hvf for that.
For example:
qemu-system-x86_64 -m 2G -hda ubuntu.20.qcow2 -accel hvf
Make sure your command doesn't include -enable-kvm or kvm=on in -cpu.
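To confirm that the host supports the Hypervisor framework before launching QEMU, you can query the kernel:
$ sysctl kern.hv_support   # 1 means Hypervisor.framework is available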
This worked for me:
$ qemu-system-x86_64 -m 2048 -vga virtio -display cocoa,show-cursor=on -usb -device usb-tablet -cdrom ~/VMs/isos/ubuntu-18.10-live-server-amd64.iso -drive file=~/VMs/qemu/ubuntu-server-18.04.qcow2,if=virtio -accel hvf -cpu Penryn,vendor=GenuineIntel

tf.matmul() is 35% slower than np.dot() when calculating on the CPU

I compiled TensorFlow 1.3 from source and was unpleasantly surprised by its performance. Following the community's comments, I managed to reduce numpy's advantage over TensorFlow from 45% to 35% when calculating on the CPU. But still, the difference is huge. The benchmark code is given below:
#! /usr/bin/env python3
import sys
import time
import numpy as np
import tensorflow as tf
print('Python', sys.version)
print('TensorFlow', tf.__version__)
gDType = np.float64
size = 8192
# Numpy calculation
rand_array = np.random.uniform(0, 1, (size, size))
timer0 = time.time()
res = np.dot(np.dot(rand_array, rand_array), rand_array)
print("numpy multiply: %f" % (time.time() - timer0))
# TensorFlow calculation
x = tf.Variable(tf.random_uniform(shape=(size, size), minval=0, maxval=1, dtype=gDType), dtype=gDType, name='x')
x3 = tf.matmul(tf.matmul(x, x), x)
# Avoid optimizing away redundant nodes
config = tf.ConfigProto(graph_options=tf.GraphOptions(optimizer_options=tf.OptimizerOptions(opt_level=tf.OptimizerOptions.L0)))
sess = tf.Session(config=config)
# sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Exclude delays caused by initialization of the graph
timer0 = time.time()
sess.run(x3.op)
print("tensorflow multiply 1 pass: %f" % (time.time() - timer0))
timer0 = time.time()
sess.run(x3.op)
print("tensorflow multiply 2 pass: %f" % (time.time() - timer0))
Here is the output of the script:
$ ./matmul_benchmark.py
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609]
TensorFlow 1.3.0
numpy multiply: 37.464786
tensorflow multiply 1 pass: 61.245776
tensorflow multiply 2 pass: 49.944690
While running, the script consumes about 4 GB of RAM, so you might want to reduce the size variable to 4096.
The comparison shows numpy ahead by 35% (50 s vs. 37 s).
Please tell me, is there any mistake in this test?
PS: my Sandy Bridge CPU flags:
$ lscpu | grep Flags
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr
pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss ht syscall nx
rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable
nonstop_tsc aperfmperf pni pclmulqdq ssse3 cx16 **sse4_1 sse4_2** popcnt aes
xsave **avx** hypervisor lahf_lm epb xsaveopt dtherm ida arat pln pts
Intel has added optimizations to TensorFlow for Xeon and Xeon Phi through the Math Kernel Library for Deep Neural Networks. When compiling TensorFlow 1.3+, you should consider adding the --config=mkl option. It only supports Linux, though. I'm not sure how much speedup it will give you for the benchmark you are doing.
Some numpy distributions already include MKL support. For example, in Anaconda 2.5 and later, MKL is available by default.
The first session.run takes longer because it includes initialization calls.
Are you using an optimized numpy (np.__config__.get_info())? Is your TensorFlow compiled with all optimizations? (build -c opt --config=opt)
Numpy and TensorFlow manage memory separately; the default behavior of session.run is to copy the result into the numpy runtime. You could keep all data in the TF runtime for lower overhead.
Here's a version that avoids common pitfalls (it cuts the overhead of needlessly copying the result back into numpy) -- Testing GPU with tensorflow matrix multiplication
In the best-case scenario I get 11 T ops/sec on a GPU and 1.1 T ops/sec on a Xeon v3 (vs. something like 0.5 T ops/sec with conda numpy).
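To check which BLAS a given numpy build links against (relevant to the MKL point above), one quick way is:
$ python3 -c "import numpy as np; np.__config__.show()"
An MKL-linked numpy (e.g. from Anaconda) will typically list mkl libraries in this output; a plain reference-BLAS build will not.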

How to set gcc option -march?

I got the help for gcc -march by typing the gcc --target-help command:
-march=CPU[,+EXTENSION...]
generate code for CPU and EXTENSION, CPU is one of: i8086,
i186, i286, i386, i486, pentium, pentiumpro, pentiumii,
pentiumiii, pentium4, prescott, nocona, core, core2,
corei7, l1om, k6, k6_2, athlon, k8, amdfam10, generic32,
generic64 EXTENSION is combination of: 8087, 287, 387,
no87, mmx, nommx, sse, sse2, sse3, ssse3, sse4.1, sse4.2,
sse4, nosse, avx, noavx, vmx, smx, xsave, movbe, ept, aes,
pclmul, fma, clflush, syscall, rdtscp, 3dnow, 3dnowa,
sse4a, svme, abm, padlock, fma4, xop, lwp
I tried to set -march=i686+nommx and -march=i686,+nommx, but neither is correct; gcc reported: error: bad value (i686,+nommx) for -march= switch
I want to build my program for an i686 target without MMX. How should I set the -march option?
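Note that the -march=CPU[,+EXTENSION...] syntax above comes from the assembler's help; the GCC driver itself toggles instruction-set extensions with separate -m flags rather than appending them to -march. A sketch of what is probably intended (file names are placeholders):
$ gcc -march=i686 -mno-mmx -o prog prog.c   # add -m32 if compiling on a 64-bit host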

gcc doesn't want to use AVX on mac

So I have this brand new MacBook Pro with an Intel Core i7 processor, and sysctl machdep.cpu.features gives
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 CX16 TPR PDCM SSE4.1 SSE4.2 xAPIC POPCNT AES PCID XSAVE OSXSAVE TSCTMR AVX1.0 RDRAND F16C
yet when I run gcc (4.7.2 from MacPorts), it doesn't #define __AVX__. What's wrong? (Mac OS X 10.8.2)
It depends on the compiler flags you are using whether __AVX__ and __SSEx__ will be defined.
So if you are using g++ -march=corei7-avx the macro will be defined. -march=native should also suffice, if gcc is able to detect your CPU correctly (it usually is).
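One way to confirm which of these macros a given set of flags defines is to dump the compiler's predefined macros, for example:
$ gcc -march=native -dM -E -x c /dev/null | grep -i avx
If __AVX__ shows up in that list, the flags enable AVX code generation.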
On my i7 MBP 13" (mid 2010) running 10.6.8, the current MacPorts gcc 4.7.3 and 4.8.2 do define __AVX__ when -mavx is specified. However, they crash compiling code that uses boost::simd (available via www.metascale.org).
MacPorts clang-3.3 has no such issues, but takes way longer to compile (with or without -mavx, compared to gcc >= 4.7 WITHOUT -mavx).
