NUMA: Win10 CPU utilization

I develop a multithreaded, CPU-intensive application. Until now this application has been tested on multicore (but single-CPU) systems such as an i7-6800K, and it worked well under Linux and Windows. A newly observed phenomenon is that it does not run well on certain server hardware (2 x Xeon E5-2660 v3):
When 40 threads are active, CPU utilization drops to 5-10 %. This server has two physical CPUs and supports NUMA. The application was not written with the NUMA model in mind, so we certainly have lots of accesses to non-local memory, and that should be improved. But the question is: "Can low displayed CPU utilization be caused by slow memory access?"
I believe this is the case, but a colleague said that the CPU utilization would nevertheless stay at 100 %. This is important because if he is right, then the trouble does not come from memory misplacement. I don't know how Windows 10 counts CPU utilization, so I hope that somebody knows from practical experience with server hardware whether the displayed CPU utilization drops when the memory controllers are congested.

Related

Fully Utilise CPU and Memory

I have a system with a 2.8 GHz processor, 20 physical cores, 40 logical cores, 128 GB of RAM and a 4 TB hard drive.
Scenario:
I am running 3 independent Python-based processes/scripts that read data from a file and write it to a database. They take a long time, yet they do not use 100% of the CPU and memory, not even 40%.
Why is it so? (I think it depends upon OS)
How can I configure it to utilise CPU and Memory more?
I am using Windows 8.1.
Take a look at processor affinity and process priority:
https://msdn.microsoft.com/en-us/library/system.diagnostics.processthread.processoraffinity(v=vs.110).aspx
https://msdn.microsoft.com/en-us/library/system.diagnostics.process.priorityclass(v=vs.110).aspx
A process (including a python script) isn't going to use any more cores than it has running threads. So if your python script is single-threaded, it's only going to use a single core.
Further, disk and database operations will stall the process while blocked on I/O and network. (Effective CPU usage == 0).
In other words, your program may not be "cpu bound" if it's doing a lot of I/O.
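A quick way to check whether a script is CPU-bound or mostly waiting is to compare its CPU time with its wall-clock time. The following is only a minimal sketch: `cpu_fraction`, `simulated_io` and `simulated_compute` are hypothetical names, and the sleep and the loop are stand-ins for real file/database calls and real computation.

```python
import time


def cpu_fraction(task):
    """Run task() and return (wall_seconds, cpu_seconds, cpu_seconds / wall_seconds)."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()  # CPU time of this process; excludes time spent sleeping
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu, (cpu / wall if wall > 0 else 0.0)


def simulated_io():
    # Stand-in for a blocking disk or database call: the process waits and uses almost no CPU.
    time.sleep(0.2)


def simulated_compute():
    # Stand-in for CPU-bound work: a tight arithmetic loop.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


if __name__ == "__main__":
    for name, task in [("io", simulated_io), ("compute", simulated_compute)]:
        wall, cpu, frac = cpu_fraction(task)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s cpu/wall={frac:.2f}")
```

A ratio close to 1 suggests the work keeps one core busy; a ratio close to 0 suggests the time goes to waiting on I/O, and more cores alone will not help.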
I'm not sure what your programs do, but if the problem at hand can be parallelized (split up into multiple independent tasks), then it might lend itself to having more threads or processes to take advantage of the extra hardware you have. But it's tricky and very hard to get this right and get the performance gain.
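As a rough illustration of splitting the work into independent tasks, here is a sketch using Python's standard `concurrent.futures` module. It assumes the job can be cut into chunks that do not depend on each other; `load_chunk` and `run_all` are hypothetical names, and the sleep stands in for the real file read and database write.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def load_chunk(chunk_id):
    """Hypothetical stand-in for 'read part of a file, write it to the database'."""
    time.sleep(0.1)  # simulated blocking I/O; a real task would do file/DB calls here
    return chunk_id, "done"


def run_all(chunk_ids, max_workers=4):
    """Process independent chunks concurrently and return results in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(load_chunk, chunk_ids))


if __name__ == "__main__":
    start = time.perf_counter()
    results = run_all(range(8))
    print(results, f"{time.perf_counter() - start:.2f}s")
```

Threads fit work that mostly waits on I/O. For CPU-heavy pure-Python work, standard CPython builds are limited by the global interpreter lock, and `ProcessPoolExecutor` (same interface) is the usual alternative. Whether either helps depends on where the time actually goes.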

DRAM and its effects on real world performance

After learning a little about how computer programs run, I had some thoughts concerning the CPU and RAM. The YouTube videos I watched (Linus Tech Tips and others) all seem to show that increasing RAM speed (frequency) does not bring much of a performance improvement in real-world applications and games on a general desktop computer. My first question is: why is this? Is it because of the high hit rates (95% and above) of the CPU cache on most modern CPUs, which in turn leads to less and less need for the CPU to reach out to RAM? Also, in which situations would a faster RAM frequency be beneficial?
NOTE: this is a very broad question, and the answers can vary widely depending on the architecture/OS running the system. I am answering from a best-judgement standpoint on how these things generally work.
Why is there not a larger performance difference between different RAM clock speeds?
I would imagine that the clock speed of the computer's RAM matters less than the speed of the CPU cache, because:
The CPU gets its instructions and data from the cache, not straight from RAM.
With the larger cache sizes of modern CPUs, it is less often necessary to go out to RAM.
When a cache miss does go out to RAM, the memory controller fetches the data while the core stalls or, thanks to out-of-order execution, keeps working on other instructions. No DMA device is involved, and the OS does not switch to a different process for it.
Besides that, other parts of the memory path (the memory controller and its buses) could be the chokepoint that limits the overall transfer rate.
In which situations would faster RAM frequency be beneficial?
I would say that overall, any one of the core pieces of hardware involved with memory and its use and transfer (CPU, CPU cache, the various memory pipelines, the various memory transfer devices (DMA, etc.), the RAM itself) can cause a chokepoint where faster RAM might or might not affect the overall performance. It is really a case-by-case issue.
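To put rough numbers on the hit-rate argument from the question, here is a back-of-the-envelope sketch. It assumes a simplified one-level model in which every access is either a cache hit or a RAM access. The function names and all latencies are hypothetical illustration values, not measurements, and the model ignores bandwidth, prefetching and multiple cache levels.

```python
def average_access_ns(hit_rate, cache_ns, ram_ns):
    """Simplified one-level model: hits cost cache_ns, misses cost ram_ns."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * ram_ns


def improvement_percent(hit_rate, cache_ns, slow_ram_ns, fast_ram_ns):
    """Percentage reduction in average access time when RAM latency drops."""
    before = average_access_ns(hit_rate, cache_ns, slow_ram_ns)
    after = average_access_ns(hit_rate, cache_ns, fast_ram_ns)
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    # Hypothetical latencies, not measurements: 4 ns cache, 80 ns vs 70 ns RAM.
    for hit_rate in (0.90, 0.95, 0.99):
        gain = improvement_percent(hit_rate, 4.0, 80.0, 70.0)
        print(f"hit rate {hit_rate:.0%}: average access time improves by {gain:.1f}%")
```

With these made-up numbers, RAM that is 12.5% faster improves the average access time by roughly 6% at a 95% hit rate and by roughly 2% at a 99% hit rate. The higher the hit rate, the less of the RAM improvement is visible, which is in line with the reasoning above. Workloads with poor cache locality sit at the other end.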

How to reduce time taken for large calculations in MATLAB

When using the desktop PCs in my university (which have 4 GB of RAM), calculations in MATLAB are fairly speedy, but on my laptop (which also has 4 GB of RAM), the exact same calculations take ages. My laptop is much more modern, so I assume it has at least a similar clock speed to the desktops.
For example, I have written a program that calculates the solid angle subtended by 50 disks at 500 points. On the desktop PCs this calculation takes about 15 seconds; on my laptop it takes about 5 minutes.
Is there a way to reduce the time taken to perform these calculations? E.g., can I allocate more RAM to MATLAB, or can I boot up my PC in a way that optimises it for using MATLAB? I'm thinking that if the processor on my laptop is also doing calculations to run other programs, this will slow down the MATLAB calculations. I've closed all other applications, but I know there's probably a lot of stuff going on that I can't see. Can I boot my laptop up in a way that will have fewer of these things going on in the background?
I can't modify the code to make it more efficient.
Thanks!
You might run some of my benchmarks which, along with example results, can be found via:
http://www.roylongbottom.org.uk/
The CPU core design used at a particular point in time is the same on Pentiums, Celerons, Core 2s, Xeons and others. The only differences are L2/L3 cache sizes and external memory bus speeds. So you can compare most results with similar-vintage 2 GHz CPUs. Things to try, besides simple number-crunching tests:
1 - Try a memory test, such as my BusSpeed, to show that caches are being used and that RAM is not dead slow.
2 - Assuming Windows, check in Task Manager that the offending program is the one using the most CPU time, and that CPU utilisation is around zero when the program is not running.
3 - Check that the CPU temperature is not too high, for example with SpeedFan (free download).
4 - If the disk light is flashing, too much RAM might be in use, with some of it being swapped in and out. The Task Manager Performance tab would show this. Increasing RAM demands can be checked by some of my reliability tests.
There are many things that go into computing power besides RAM. You mention processor speed, but there is also number of cores, GPU capability and more. Programs like MATLAB are designed to take advantage of features like parallelism.
Summary: You can't compare only RAM between two machines and expect to know how they will perform with respect to one another.
Side note: 4 GB is not very much RAM for a modern laptop.
Firstly you should perform a CPU performance benchmark on both computers.
Modern operating systems usually apply their most aggressive power management schemes when running on a laptop. This usually means turning off one or more cores, or setting them to a very low frequency. For example, while on battery, a quad-core CPU that normally runs at 2.0 GHz could be throttled down to 700 MHz on one core while the other three are basically put to sleep. (Remark: these numbers are not taken from a real example.)
The OS manages the CPU frequency in a dynamic way, tweaking it on the order of seconds. You will need a software monitoring tool that actually asks for the CPU frequency every second (without doing busy work itself) in order to know if this is the case.
Plugging in the laptop will make the OS use a less aggressive power management scheme.
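If no benchmark tool is at hand, even a crude timing loop run on both machines (and on the laptop both plugged in and on battery) gives a first indication. The sketch below is written in Python rather than MATLAB purely for illustration; `cpu_benchmark` is a hypothetical name, the loop is single-threaded, and it only says something about one core running this particular loop.

```python
import time


def cpu_benchmark(iterations=2_000_000, repeats=5):
    """Time a fixed arithmetic loop several times; return the seconds taken by each run."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        acc = 0.0
        for i in range(1, iterations + 1):
            acc += (i % 7) * 0.5
        timings.append(time.perf_counter() - start)
    return timings


if __name__ == "__main__":
    runs = cpu_benchmark()
    print("best: %.3fs  worst: %.3fs" % (min(runs), max(runs)))
```

If the laptop is several times slower than the desktop on a loop like this, or noticeably slower on battery than when plugged in, the slowdown is probably not specific to MATLAB. This is only a hint, not a proper benchmark.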
(If this is found to be unrelated to MATLAB, please "flag" this post and ask moderator to move this question to the SuperUser site.)

Memory-intense jobs scaling poorly on multi-core cloud instances (ec2, gce, rackspace)?

Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory intense jobs (2.5GB in my case)?
When I run jobs locally on my quad xeon chip, the difference between using 1 core and all 4 cores is about a 25% slowdown with all cores. This is to be expected from what I understand; a drop in clock rate as the cores get used up is part of the multi-core chip design.
But when I run the jobs on a multicore virtual instance, I see a slowdown of 2x - 4x in processing time between using 1 core and using all cores. I've seen this on GCE, EC2, and Rackspace instances, and I have tested many different instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in Fortran. I did not write them, and I'm not really a Fortran guy, so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without the need to communicate with each other, i.e., they are embarrassingly parallel. They each take about 2.5GB in memory.
So my best guess so far is that jobs that use up this much memory take a big hit by the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE, because they have a single-core instance that actually runs the jobs as fast as my laptop's chip, and instances are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The Linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instance, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're running into either kernel contention or contention for, e.g., memory I/O. It would be interesting to see whether various Fortran compiler options might help optimize memory access patterns; for example, enabling SSE2 load/store instructions or other optimizations. You might also compare results with GCC's and Intel's Fortran compilers.
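To see what those per-copy slowdowns mean for overall throughput, the four cases above can be turned into a small calculation. This is just arithmetic on the numbers in the question; `aggregate_speedup` is a hypothetical helper, and N = 4 for the cloud instance is an assumption made only for the example.

```python
def aggregate_speedup(copies, per_copy_slowdown):
    """Throughput of `copies` parallel jobs relative to running one job at a time.

    per_copy_slowdown is each copy's run time divided by the single-copy run time.
    """
    return copies / per_copy_slowdown


if __name__ == "__main__":
    cases = [
        ("local, 4 copies at 1.25x", 4, 1.25),
        ("cloud, 4 copies at 2x (N = 4 assumed)", 4, 2.0),
        ("cloud, 4 copies at 4x (N = 4 assumed)", 4, 4.0),
    ]
    for label, copies, slowdown in cases:
        print(f"{label}: {aggregate_speedup(copies, slowdown):.2f}x the single-job throughput")
```

On the local machine, four copies still deliver 3.2 times the single-job throughput. With a 4x per-copy slowdown on a 4-core instance the factor drops to 1.0, meaning the extra cores buy nothing, which matches the observation that single-core instances are the better deal for this workload.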

Huge CPU usage difference for the same .NET 4 program on different machines

I have two servers, one running a Core i7 920 (8 logical CPUs at 2.8 GHz), the other running a Xeon X3430 (4 logical CPUs at 2.4 GHz). For the same .NET 4 application, CPU usage on the first machine is 6%; on the second machine it is 50%! I wonder what makes this huge difference. And how can I diagnose the cause of the issue?
It's not just the CPU that matters: are you saturating I/O? Is the faster machine so much faster that it produces data more quickly than the I/O can absorb, leaving its CPU mostly waiting, whereas the slower machine is rattling along with its CPU fully utilised?
Locking might also have a part to play. A simple test app I wrote a long time ago showed large performance differences between single-core and quad-core systems. (The single core was a lot faster; I think .NET optimised away locks for it, whereas the quad core suffered.)
In short, unless there's a fair bit more information on the problem, no-one can give you anything other than guesses as to the cause.
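One small thing that can be checked from the numbers alone: the two percentages are not directly comparable, because the machines have different numbers of logical CPUs. Assuming both figures are overall utilisation as shown by Task Manager (an assumption, since the question does not say how they were measured), they can be converted to "busy logical CPUs". `busy_cores` is a hypothetical helper name.

```python
def busy_cores(total_percent, logical_cpus):
    """Convert an overall CPU percentage into the equivalent number of fully busy logical CPUs."""
    return total_percent / 100.0 * logical_cpus


if __name__ == "__main__":
    print("i7 920:     %.2f busy logical CPUs" % busy_cores(6, 8))
    print("Xeon X3430: %.2f busy logical CPUs" % busy_cores(50, 4))
```

That gives about 0.48 busy logical CPUs on the i7 against 2.0 on the Xeon, so the gap is roughly 4x rather than the 8x the raw percentages suggest. A hyper-threaded logical CPU is also not a full core, so this remains a rough comparison, and the remaining difference still needs one of the explanations above.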

Resources