Impact of Meltdown/Spectre patches? - macos

Has anyone timed the impact of the Spectre/Meltdown patches for OS X or macOS?
I just updated my El Capitan-based Mac Pro (#1) and noticed a 6% slowdown (#2) on my main benchmark: compiling 500 C files (about half a million lines) from the Terminal.
The update is the "Security Update 2018-001", which addresses Meltdown and Spectre according to https://support.apple.com/en-us/HT208465
Stan
#1: Early 2009, 16 cores, 20 GB RAM.
#2: Tests repeated 3 times, files brought into memory first; the best of the three runs shows the 6% slowdown (using the average of the times instead still shows 6%). No other "apps" running.
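A minimal sketch of this kind of timing run, assuming a Makefile-driven build (the source path and targets below are placeholders, not the actual project):
# Warm the file cache, then time a clean parallel build three times
# and keep the best (lowest) wall-clock time.
cat src/*.c > /dev/null
for run in 1 2 3; do
make clean > /dev/null
time make -j"$(sysctl -n hw.ncpu)" > /dev/null
done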

Related

How to get better performance in a ProxmoxVE + CEPH cluster

We have been running ProxmoxVE since 5.0 (now on 6.4-15) and we have noticed a decline in performance whenever there is heavy reading/writing.
We have 9 nodes, 7 with CEPH and 56 OSDs (8 on each node). The OSDs are hard drives (HDD), WD Gold or better (4-12 TB). Nodes have 64 or 128 GB RAM and dual-Xeon mainboards (various models).
We already tried simple tests like "ceph tell osd.* bench", getting a stable 110 MB/s data transfer to each OSD, with a ±10 MB/s spread during normal operations. Apply/commit latency is normally below 55 ms, with a couple of OSDs reaching 100 ms and one-third below 20 ms.
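A minimal sketch of the checks we run (standard Ceph CLI commands; the numbers above are simply what we look at in this output):
# Ask every OSD to write synthetic data (1 GiB by default) and report throughput
ceph tell osd.* bench
# Per-OSD commit/apply latency, to spot outliers
ceph osd perf
# Overall health and usage
ceph -s
ceph df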
The front network and back network are both 1 Gbps (separated into VLANs). We are trying to move to 10 Gbps, but we ran into problems we are still trying to solve (unstable OSD disconnections).
The pool is defined as "replicated" with 3 copies (2 needed to keep running). The total amount of disk space is currently 305 TB (72% used); reweighting is in use because some OSDs were getting much more data than others.
Virtual machines run on the same 9 nodes, most are not CPU intensive:
Avg. VM CPU Usage < 6%
Avg. Node CPU Usage < 4.5%
Peak VM CPU Usage 40%
Peak Node CPU Usage 30%
But I/O Wait is a different story:
Avg. Node IO Delay 11
Max. Node IO delay 38
Disk write load is around 4 MB/s on average, with peaks up to 20 MB/s.
Anyone with experience in getting better Proxmox+CEPH performance?
Thank you all in advance for taking the time to read,
Ruben.
Here are some Ceph pointers you could follow:
Get some good NVMe drives (one or two per server; with 8 HDDs per server, one should be enough) and use them for DB/WAL (make sure they have power-loss protection).
The "ceph tell osd.* bench" test is not that relevant for real-world load; I suggest trying some fio tests instead (see the sketch after this list).
Set osd_memory_target to at least 8 GB of RAM per OSD.
To save some writes on your HDDs (data is not replicated X times), create your RBD pool as an EC (erasure-coded) pool, but please do some research on that first because there are tradeoffs: recovery takes extra CPU work.
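A few hedged command sketches for the points above; device names, pool names, and the EC parameters are examples only, so try them on a lab cluster first:
# fio: 4k random-write test against a raw device (this DESTROYS data on /dev/sdX)
fio --name=randwrite --filename=/dev/sdX --rw=randwrite --bs=4k \
    --ioengine=libaio --iodepth=32 --direct=1 --runtime=60 --time_based
# Raise the per-OSD memory target to 8 GiB
ceph config set osd osd_memory_target 8589934592
# Erasure-coded data pool for RBD (k=4, m=2 is only an example profile);
# RBD still needs a replicated pool for its metadata.
ceph osd erasure-code-profile set ec-4-2 k=4 m=2 crush-failure-domain=host
ceph osd pool create rbd_data 128 128 erasure ec-4-2
ceph osd pool set rbd_data allow_ec_overwrites true
rbd create --size 100G --data-pool rbd_data rbd/myimage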
All in all, hyper-converged clusters are good for training, small projects, and medium projects without too big a workload on them... Keep in mind that planning is gold.
Just my 2 cents,
B.

Different speedups on different machines for same program

I have a Dell laptop with an i7 5th gen processor (1000 MHz), 4 logical cores, 2 physical cores, and 16 GB RAM. I am taking a course on high performance computing, for which I have to draw speed-up graphs.
Compared to a desktop machine (800 MHz, i5 5th gen, 8 GB RAM) with 4 physical and 4 logical cores, for the same program my laptop takes ~3 seconds while the desktop takes around 12 seconds. Ideally, since the laptop is only 1.25 times faster in clock speed, its time should have been around 9 to 10 seconds.
This would not have bothered me if I had got a similar speedup on both machines. But on my laptop, using 4 threads, the speedup is roughly 1.3, while on my desktop, with the same number of threads, it is roughly 3.5. If my laptop were simply faster, the parallel program should have reflected that as well, yet it was only ~1.3 times faster. What could be the reason?

Why can't my ultraportable laptop CPU maintain peak performance in HPC

I have developed a high performance Cholesky factorization routine, which should have peak performance at around 10.5 GFLOPs on a single CPU (without hyperthreading). But there is some phenomenon which I don't understand when I test its performance. In my experiment, I measured the performance with increasing matrix dimension N, from 250 up to 10000.
In my algorithm I have applied cache blocking (with a tuned blocking factor), and data are always accessed with unit stride during computation, so cache performance is optimal; TLB and paging problems are eliminated.
I have 8GB available RAM, and the maximum memory footprint during experiment is under 800MB, so no swapping comes across;
During the experiment, no resource-demanding process like a web browser is running at the same time. Only a really cheap background process runs, recording CPU frequency and CPU temperature every 2 s.
I would expect the performance (in GFLOPs) to stay at around 10.5 for whatever N I am testing, but a significant performance drop is observed in the middle of the experiment, as shown in the first figure.
CPU frequency and CPU temperature are shown in the 2nd and 3rd figures. The experiment finishes in 400 s. The temperature was 51 °C when the experiment started and quickly rose to 72 °C when the CPU got busy. After that it grew slowly to a peak of 78 °C. The CPU frequency was basically stable and did not drop when the temperature got high.
So, my question is:
Since the CPU frequency did not drop, why does performance suffer?
How exactly does temperature affect CPU performance? Does the increase from 72 °C to 78 °C really make things worse?
CPU info
System: Ubuntu 14.04 LTS
Laptop model: Lenovo-YOGA-3-Pro-1370
Processor: Intel Core M-5Y71 CPU @ 1.20 GHz × 2
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0,1
Off-line CPU(s) list: 2,3
Thread(s) per core: 1
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 61
Stepping: 4
CPU MHz: 1474.484
BogoMIPS: 2799.91
Virtualisation: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0,1
CPU 0, 1
driver: intel_pstate
CPUs which run at the same hardware frequency: 0, 1
CPUs which need to have their frequency coordinated by software: 0, 1
maximum transition latency: 0.97 ms.
hardware limits: 500 MHz - 2.90 GHz
available cpufreq governors: performance, powersave
current policy: frequency should be within 500 MHz and 2.90 GHz.
The governor "performance" may decide which speed to use
within this range.
current CPU frequency is 1.40 GHz.
boost state support:
Supported: yes
Active: yes
update 1 (control experiment)
In my original experiment, the CPU is kept busy from N = 250 to N = 10000. Many people (primarily those who saw this post before re-editing) suspected that overheating of the CPU was the major reason for the performance hit. So I went back, installed the lm-sensors Linux package to track such information, and indeed the CPU temperature rose.
But to complete the picture, I did a control experiment. This time, I give the CPU a cooling period between each N, achieved by making the program pause for a number of seconds at the start of each iteration of the loop over N:
for N between 250 and 2500, the cooling time is 5s;
for N between 2750 and 5000, the cooling time is 20s;
for N between 5250 and 7500, the cooling time is 40s;
finally for N between 7750 and 10000, the cooling time is 60s.
Note that the cooling time is much larger than the time spent on computation. For N = 10000, only 30 s are needed for the Cholesky factorization at peak performance, but I ask for a 60 s cooling time.
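A sketch of the driver loop for this control experiment (the benchmark binary name chol_bench is a placeholder for the actual Cholesky routine, which is not shown here):
# Pause before each problem size so the CPU can cool down,
# using the schedule listed above.
for N in $(seq 250 250 10000); do
if   [ "$N" -le 2500 ]; then pause=5
elif [ "$N" -le 5000 ]; then pause=20
elif [ "$N" -le 7500 ]; then pause=40
else pause=60
fi
sleep "$pause"
./chol_bench "$N"
done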
This is certainly a very uninteresting setting for high performance computing: we want our machine to work at peak performance all the time until a very large task is completed, so this kind of halt makes no sense. But it helps to better understand the effect of temperature on performance.
This time, peak performance is achieved for all N, just as theory predicts! The periodic pattern in CPU frequency and temperature is the result of cooling and boost. Temperature still has an increasing trend, simply because the workload gets bigger as N increases; this also justifies the longer cooling times, as I have done.
The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying: basically it says that the computer gets tired in HPC, so we can't get the expected performance gain. Then what is the point of developing HPC algorithms?
OK, here is the new set of plots. I don't know why I could not upload the 6th figure; SO simply does not let me submit the edit when I add it, so I'm sorry I can't attach the figure for CPU frequency.
update 2 (how I measure CPU frequency and temperature)
Thanks to Zboson for adding the x86 tag. The following bash commands are what I used for measurement:
while true
do
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq >> cpu0_freq.txt ## parameter "freq0"
cat /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq >> cpu1_freq.txt ## parameter "freq1"
sensors | grep "Core 0" >> cpu0_temp.txt ## parameter "temp0"
sensors | grep "Core 1" >> cpu1_temp.txt ## parameter "temp1"
sleep 2
done
Since I did not pin the computation to one core, the operating system alternates between the two cores, so it makes more sense to take
freq[i] <- max (freq0[i], freq1[i])
temp[i] <- max (temp0[i], temp1[i])
as the overall measurement.
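For example, for the frequency logs (one kHz sample per line, same order in both files), the element-wise max can be taken with:
paste cpu0_freq.txt cpu1_freq.txt | awk '{ print ($1 > $2 ? $1 : $2) }' > freq_max.txt
(The temperature logs need the numeric field extracted from the sensors output first; the idea is the same.)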
TL;DR: Your conclusion is correct. Your CPU's sustained performance is nowhere near its peak. This is normal: the peak performance is only available as a short-term "bonus" for bursty interactive workloads, above its rated sustained performance, given the lightweight heat sink, fan, and power delivery.
You can develop / test on this machine, but benchmarking will be hard. You'll want to run on a cluster, server, or desktop, or at least a gaming / workstation laptop.
From the CPU info you posted, you have a dual-core-with-hyperthreading Intel Core M with a rated sustainable frequency of 1.20 GHz, Broadwell generation. Its max turbo is 2.9 GHz, and its TDP-up sustainable frequency is 1.4 GHz (at 6 W).
For short bursts, it can run much faster and produce much more heat than its cooling system is required to handle. This is what Intel's "turbo" feature is all about. It lets low-power ultraportable laptops like yours have snappy UI performance in things like web browsers, because interactive CPU load is almost always bursty.
Desktop/server CPUs (Xeon and i5/i7, but not i3) still have turbo, but the sustained frequency is much closer to the max turbo. e.g. a Haswell i7-4790k has a sustained "rated" frequency of 4.0 GHz. At that frequency and below, it won't use (and convert to heat) more than its rated TDP of 88 W. Thus, it needs a cooling system that can handle 88 W. When power/current/temperature allow, it can clock up to 4.4 GHz and use more than 88 W of power. (The sliding window for calculating the power history that keeps the sustained power within 88 W is sometimes configurable in the BIOS, e.g. 20 s or 5 s. Depending on what code is running, 4.4 GHz might not push the electrical current demand anywhere near peak, e.g. code with lots of branch mispredicts that's still limited by CPU frequency but doesn't come anywhere near saturating the 256-bit AVX FP units the way Prime95 would.)
Your laptop's max turbo is a factor of 2.4x above its rated frequency; that high-end Haswell desktop CPU can only clock up by 1.1x. Its max sustained frequency is already pretty close to its peak limits, because it's rated to need a good cooling system that can keep up with that kind of heat production, and a solid power supply that can deliver that much current.
The purpose of Core M is to have a CPU that can limit itself to ultra-low power levels (rated TDP of 4.5 W at 1.2 GHz, 6 W at 1.4 GHz). So the laptop manufacturer can safely design a cooling and power-delivery system that's small and light, and only handles that much power. The "Scenario Design Power" is only 3.5 W, and that's supposed to represent the thermal requirements of real-world code, not max-power stuff like Prime95.
Even a "normal" ULV laptop CPU is rated for 15W sustained, and high power gaming/workstation laptop CPUs at 45W. And of course laptop vendors put those CPUs into machines with beefier heat-sinks and fans. See a table on wikipedia, and compare desktop / server CPUs (also on the same page).
"The achievement of peak performance seems to rule out all effects other than temperature. But this is really annoying: basically it says that the computer gets tired in HPC, so we can't get the expected performance gain. Then what is the point of developing HPC algorithms?"
The point is to run them on hardware that isn't so badly thermally limited! An ultra-low-power CPU like a Core M makes a decent dev platform, but not a good HPC compute platform.
Even a laptop with an xxxxM CPU, rather than an xxxxU CPU, will do OK (e.g. a "gaming" or "workstation" laptop that's designed to run CPU-intensive stuff for sustained periods). Or, in the Skylake family, "xxxxH" or "HK" parts are the 45 W mobile CPUs, at least quad-core.
Further reading:
Modern Microprocessors: A 90-Minute Guide! - general background, including the "power wall" that Pentium 4 ran into.
Power Delivery in a Modern Processor (https://www.realworldtech.com/power-delivery/) - a really deep technical dive into CPU / motherboard design and the challenges of delivering stable low voltage to very bursty demands, and reacting quickly to the CPU requesting more / less voltage as it changes frequency.

What prevents a Windows CPU from going above 100%?

Today I took my first UNIX lesson, so please bear with me if some stupid questions come up.
In class the tutor just ran
~$: yes "hello, world"
twice, and the CPU went above 100% (to 1.36, actually) before he killed the two yes processes.
He said that on Solaris the CPU could go to 400% and keep working: slow, but it never crashes.
What is this CPU percentage? If it's a percentage, how come it goes beyond 100%?
And I have never observed any CPU percentage above 100% on Windows; whenever it hits 80% the machine is as slow as a worm. Is there some Windows OS limitation so that it won't go beyond 100%?
Neither Unix nor Windows can utilize a CPU more than 100%... For multi-core / hyperthreaded machines, the percentage can be calculated either as the sum, as Solaris seems to do (thus going above 100%), or as the average, as Windows does (thus never going above 100%).
The 1.36 is NOT the same as CPU utilization but it is the "load" which is calculated differently - for a nice explanation see http://en.wikipedia.org/wiki/Load_%28computing%29
It's a question of how the percentage is calculated. You either sum up each core and show a total, or you show an average over all cores.
If Solaris shows 400%, that's 4 cores at 100%; if 1 core is at 100%, it shows 100%.
On Windows, 100% means 4 cores at 100%; if 1 core is at 100%, it shows 25%.
The definition of CPU percentage is simply different for multi-core systems. Windows reports the average, Solaris the sum. So if all cores in a quad-core system are busy, Windows will display 100% and Solaris 400%. That doesn't mean those 400% are somehow faster than the 100% on Windows; it's just a display convention.
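A toy illustration of the two conventions, assuming we already have per-core busy percentages (here a quad-core with one core fully busy):
cores="100 3 2 1"
echo "$cores" | awk '{
for (i = 1; i <= NF; i++) sum += $i
printf "Solaris-style (sum):     %d%%\n", sum        # 106%
printf "Windows-style (average): %d%%\n", sum / NF   # 26%
}'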

Why does some Ruby code run twice as fast on a 2.53GHz than on a 2.2GHz Core 2 Duo processor?

(This question attempts to find out why the running of a program can be different on different processors, so it is related to the performance aspect of programming.)
The following program will take 3.6 seconds to run on a Macbook that has 2.2GHz Core 2 Duo, and 1.8 seconds to run on a Macbook Pro that has 2.53GHz Core 2 Duo. Why is that?
That's a bit weird... why does the speed double when the CPU is only 15% faster in clock speed? I double-checked the CPU meter to make sure neither of the 2 cores was at 100% usage (i.e. the CPU was not busy running something else). Could it be because one is Mac OS X Leopard and one is Mac OS X Snow Leopard (64-bit)? Both are running Ruby 1.9.2.
p RUBY_VERSION
p RUBY_DESCRIPTION if defined? RUBY_DESCRIPTION
n = 9_999_999
p n
t = 0; 1.upto(n) {|i| t += i if i%3==0 || i%5==0}; p t
The following are just output of the program:
On the 2.2 GHz Core 2 Duo: (Update: Macbook identifier MacBook3,1, therefore probably an Intel Core 2 Duo (T7300/T7500))
$ time ruby 1.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18 revision 29036) [i386-darwin9.8.0]"
9999999
23333331666668
real 0m3.784s
user 0m3.751s
sys 0m0.021s
On the 2.53 GHz Intel Core 2 Duo: (Update: Macbook identifier MacBookPro5,4, therefore probably an Intel Core 2 Duo (Penryn) with 3 MB on-chip L2 cache)
$ time ruby 1.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18 revision 29036) [x86_64-darwin10.4.0]"
9999999
23333331666668
real 0m1.893s
user 0m1.809s
sys 0m0.012s
Test run on Windows 7:
time_start = Time.now
p RUBY_VERSION
p RUBY_DESCRIPTION if defined? RUBY_DESCRIPTION
n = 9_999_999
p n
t = 0; 1.upto(n) {|i| t += i if i%3==0 || i%5==0}; p t
print "Took #{Time.now - time_start} seconds to run\n"
Intel Q6600 Quad Core 2.4GHz running Windows 7, 64-bit:
C:\> ruby try.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18) [i386-mingw32]"
9999999
23333331666668
Took 3.248186 seconds to run
Intel 920 i7 2.67GHz running Windows 7, 64-bit:
C:\> ruby try.rb
"1.9.2"
"ruby 1.9.2p0 (2010-08-18) [i386-mingw32]"
9999999
23333331666668
Took 2.044117 seconds to run
It is also strange that a 2.67 GHz i7 is slower than a 2.53 GHz Core 2 Duo.
I suspect that Ruby switches to an arbitrary-precision integer implementation later (i.e. at larger values) on the 64-bit OS.
Quoting the Fixnum Ruby doc:
A Fixnum holds Integer values that can be represented in a native machine word (minus 1 bit). If any operation on a Fixnum exceeds this range, the value is automatically converted to a Bignum.
Here, a native machine word is technically 64 bits, but the interpreter is compiled to run on 32-bit processors.
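A quick way to see this from the shell, assuming a Ruby 1.9-era interpreter where Fixnum and Bignum are still distinct classes (modern Rubies report Integer for both):
# The final sum above (~2.3e13) overflows a 31-bit Fixnum but not a 63-bit one,
# so a 32-bit (i386) build pays Bignum overhead once the running total passes that point.
ruby -e 'p 23333331666668.class'   # => Bignum on i386 builds, Fixnum on x86_64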
why does the speed double when the CPU is only 15% faster in clock speed?
Quite simply because the performance of a computer is not determined solely by CPU clock speed.
Other things to consider are:
CPU architectures, including e.g. the number of cores on a CPU, or the general ability to run multiple instructions in parallel
other clock speeds in the system (memory, FSB)
CPU cache sizes
installed memory chips (some are faster than others)
additionally installed hardware (it might slow down the system through hardware interrupts)
different operating systems
32-bit vs. 64-bit systems
I'm sure there are a lot more things to add to the above list. I won't elaborate further, but if anyone feels like it, please feel free to add to it.
In our CI environment we have a lot of "pizza box" computers that are supposed to be identical. They have the same hardware, were installed at the same time, and should be generally identical; they're even placed in "thermally equivalent" locations. They're not identical, and the variation can be quite stunning.
The only conclusion I have come up with is that different CPU bins have different thresholds for thermal stepping; some of the "best" chips hold up better. I also suspect other "minor" hardware faults/variations are playing a role here. Maybe the slow boxes have slightly different components that play less well together?
There are tools out there that will show you if your CPU is throttling for thermal reasons.
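For example, on Linux one rough way to watch for it during a run (package names and sysfs paths vary by distribution) is:
# Current core clocks plus temperatures, refreshed every second
watch -n1 'grep "cpu MHz" /proc/cpuinfo; sensors | grep -i core'
# Or, for per-core frequency, turbo, and package-power detail (needs root)
sudo turbostat --interval 1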
I don't know much Ruby, but your code doesn't look multithreaded; if that's the case, it won't take advantage of multiple cores. There can also be large differences between two CPU models: smaller process sizes, larger caches, better SIMD instruction sets, faster memory access, etc. Compiler and OS differences can cause large swings in performance between Windows and Linux, and the same goes for x86 vs x64. Plus, Core i7s support Hyper-Threading, which in some cases makes a single-threaded app slower.
Just as an example, if that 2.2 GHz CPU is an Intel Core 2 E4500, it has the following specs:
Clock: 2.2 GHz
L2 Cache: 2 MB
FSB: 800 MT/s
Process Size: 65 nm
vs. a T9400, which is likely what's in your MacBook Pro:
Clock: 2.53 GHz
L2 Cache: 6 MB
FSB: 1066 MT/s
Process Size: 45 nm
Plus you're running it on an x64 build of Darwin. All those things could definitely add up to making a trivial little script execute much faster.
I didn't read the code, but it is really hard to compare across two different computers.
You need exactly the same OS, the same processes, and the same amount of memory.
If you change the processor family (i7, Core 2 Duo, P4, P4-D), the clock frequency says nothing about one processor's abilities against another family's. You can only compare within the same family (a newer processor might invest cycles in core management rather than computation, for example).
