mpirun performance analysis

I'm running mpirun (OpenMPI) with 86 processes on 12 CPUs and 2 GPUs on Ubuntu 18.04. The application being run trains neural networks.
After a day or so of training, the iterations slow down dramatically. The code works fine on a single thread, network traffic (file reads) is well within spec, and the CPUs and GPUs show no excessive load.
So I think the problem is with mpirun.
Are there non-intrusive tools available to show the performance of the MPI runs? I've been looking at Performance Co-Pilot, but I don't see any MPI profiling in the software itself.

Callgrind and kcachegrind might be useful. A brief look here [1] may help you as well.
[1] https://www.open-mpi.org/faq/?category=debugging#parallel-debuggers
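Another lightweight option is the MPI standard's PMPI profiling interface: every MPI call has an untimed PMPI_ twin, so you can interpose a small timing shim as a shared library without touching the application. A minimal sketch that only times MPI_Allreduce (the file and library names are made up; extend the same pattern to whichever calls your framework actually uses):

```cpp
// mpi_timer.cpp - minimal sketch of a profiling shim using MPI's standard
// PMPI interface: intercept a call, time it, forward to the real library.
#include <mpi.h>
#include <cstdio>

static double total_seconds = 0.0;
static long long call_count = 0;

// Our definition shadows the library's MPI_Allreduce; PMPI_Allreduce is
// the untimed entry point every compliant MPI implementation provides.
extern "C" int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                             MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    total_seconds += MPI_Wtime() - t0;
    ++call_count;
    return rc;
}

// Dump per-rank totals when the application shuts MPI down.
extern "C" int MPI_Finalize(void) {
    int rank = -1;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    std::fprintf(stderr, "[rank %d] MPI_Allreduce: %lld calls, %.2f s total\n",
                 rank, call_count, total_seconds);
    return PMPI_Finalize();
}
```

Build it with `mpicxx -shared -fPIC -o libmpitimer.so mpi_timer.cpp` and run with `mpirun -x LD_PRELOAD=./libmpitimer.so ...` (OpenMPI's -x exports the variable to all ranks). If the reported totals grow much faster on day two than on day one, the time is going into MPI communication rather than compute. Ready-made profilers such as mpiP are built on this same interface.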

Related

Why isn't some resource maxed out to 100% during Visual Studio compile?

I'm compiling a large solution of about 100 C++ projects with Visual Studio. During compilation, none of Memory, CPU, Disk, or Ethernet is utilized anywhere near 100% (according to the Task Manager Performance tab). CPU is often as low as 25%, and Memory and Disk utilization seem to be as low as 5-10%.
So if no resource is utilized at 100%, what's the bottleneck? What's limiting my compile speed? I was honestly expecting it to be CPU. But it seems that it's not.
Am I perhaps measuring incorrectly? What should I expect to be the limiting resource when compiling? How can I speed things up? And if something else is the limit (like RAM acting as I/O via a cache or something), what's the right tool/method to measure the bottleneck?
Additional info: I've definitely set the maximum number of parallel projects to build to 8, and multi-processor compilation is enabled for all the Visual C++ projects. My machine has 8 logical processors, so I really don't think I'm just maxing out one core. That would present itself as 12.5% usage on my machine (and I do see that often with single-threaded applications).
Memory-wise, maybe your application just doesn't use that much memory.
As for the CPU usage, your program might be working on one thread, or to be more specific, on one single core of your CPU;
so if you have a quad-core CPU, your application won't use anything above 25%.
As for the network usage, I think Task Manager shows your computer's Ethernet capacity, so maybe you have an internet speed of 10 Mb/s but your Ethernet is capable of 50 Mb/s.
Here's a link I just looked up: https://askleo.com/why_wont_my_program_use_more_than_25_of_the_cpu/
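To see that cap for yourself, a single-threaded busy loop pins exactly one core; on a quad-core machine Task Manager will report the process at roughly 25% (a throwaway sketch):

```cpp
// spin_one_core.cpp - pins a single core at 100%; on a quad-core box,
// Task Manager will show the process at roughly 25% total CPU.
#include <cstdio>

int main() {
    std::puts("Spinning one thread; watch Task Manager. Ctrl+C to stop.");
    volatile unsigned long long counter = 0;  // volatile: don't optimize away
    for (;;) counter = counter + 1;           // pure CPU work on one thread
}
```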
Great question.
Just setting the compilation to run all projects in parallel gets you the same result as VasiliyGalkin describes: too much work for your setup.
But due to the way VS compiles each project, you need a certain overlap, so limit the number of parallel projects to 2-3 depending on the actual PC you run it on. If your PC is a monster with 16+ cores you might be able to go 1-2 up. You might be happy with the result, or find that it still doesn't fully use your CPU due to other limits in VS.
This article gives an in-depth analysis of why it's slow; the conclusion is that you need to set up your compilation to fit VS's idea of the world.
A brief summary of the article:
I would guess your setup has multi-processor compilation off, which leaves most of the CPU idle. Turning it on, and turning "enable minimal rebuild" off, improves things, but still only within one project at a time, because the compilation times of your individual units are very uneven due to different compilation flags / precompiled headers (see the article for more). Fixing that evens out the per-unit times, and applying the three changes one after the other gives much better utilization. Now add max parallel projects of 2 or 3 to use all remaining capacity.
Ideally, VS would offer an option to use X threads; then the problem would mostly go away, since no more threads would be started than are usable, and it would simply pick the next task from the next project whenever resources free up.
Memory is very unlikely to be the bottleneck for compilation. Normally it's a CPU-intensive process. The fact that your CPU sits at 25% instead of 100% may indicate that your projects are compiled sequentially, using 1 core out of 4.
Try checking the "maximum number of parallel builds" setting in the Visual Studio menu: Options -> Projects and Solutions -> Build and Run.
You may find the screenshots attached to my recent question on a similar topic useful; it's the opposite problem, though: too much going on in parallel instead of too little :)

Compare performance of 2 machines

Our IT team is going to get our machines upgraded, and we've been given 2 machines: one is a Quad Core i7 3.4 GHz 64-bit machine with 16 GB RAM; the other is just an upgraded machine with a Dual Core 2 GHz 64-bit CPU and 8 GB RAM. Both have Windows 7 Professional on them.
Now we are asked to test and see which one performs better (basically, whether the quad core one performs substantially better than the dual core one).
We mainly use Visual Studio 2010 as the development tool. Is there a way by which we can compare the 2 machines performance using Visual Studio (or any other way).
Is there some sort of code I can use to quantify the performance difference between the 2 machines?
Please let me know if you need some more information on this.
Thanks,
Abhi.
I think this one belongs on ServerFault, but I'll give it a shot.
Cores:
Visual Studio won't specifically benefit from multicore processors. To my knowledge, it doesn't use multithreaded compilation by default (though a savvy developer can make this happen through clever launching of MSBuild), so it won't take advantage of multiple cores. However, if the developers are running several apps in parallel - say, Photoshop, Office, etc - VS will be less likely to have to share core time if more cores are available.
Memory
8GB of RAM is plenty these days. I use three different dev boxes, two with 8GB and one with 24GB, and I don't see a significant difference in compilation time or IDE responsiveness. Caveat: if you're working with very large projects, more RAM will improve virtual memory swapping performance. Some large C++ apps out there, with hundreds of source files and embedded resources, can suck up a LOT of compilation time and memory.
Windows 7
Excellent choice.
CPU
Clock speed and the size and speed of the on-chip cache will have the most noticeable impact on performance.
Also, make sure your video card/chipset is up to date, as that can be a UI speed bottleneck.
In short: RAM and CPU clock - and, to some extent, hard drive speed - are the most important factors.
This article has what appears to be a comprehensive overview of benchmarking processes, but I can't speak to the validity of their approach or the quality of the tools they recommend.
Not sure what you need exactly, but the Windows Experience Index (the one visible to the end user) uses an API called WinSAT: Windows System Assessment Tool. The documentation entry point is available here: Windows System Assessment Tool.
Here is an example: How to get the Windows Experience Index
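If you'd rather answer the "some sort of code" part directly, a tiny fixed workload compiled once and run on both machines gives a crude but repeatable comparison; a minimal single-threaded sketch (the iteration count is arbitrary):

```cpp
// microbench.cpp - crude single-core benchmark: time a fixed amount of
// floating-point work; run the same binary on both machines and compare.
#include <chrono>
#include <cmath>
#include <cstdio>

int main() {
    const long long iterations = 200000000LL;
    volatile double sink = 0.0;  // volatile keeps the loop from being optimized away

    auto start = std::chrono::steady_clock::now();
    for (long long i = 1; i <= iterations; ++i)
        sink = sink + std::sqrt(static_cast<double>(i));
    auto stop = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(stop - start).count();
    std::printf("%.2f s (%.1f Msqrt/s)\n", secs, iterations / secs / 1e6);
    return 0;
}
```

Note that this exercises one core only, so it will mostly reflect the clock-speed difference; to see the quad core's real advantage you'd want a multithreaded variant, or simply time a full build of your actual solution on each machine.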

How to check if app is cpu-bound or memory-bound?

I've got an application that does little computational CPU work but lots of memory access (allocating objects and moving them around; there's little numeric or arithmetic code).
How can I measure the share of time I'm spending on memory access latency (due to cache misses), with the CPU sitting idle?
I should note that the app runs on a Hyper-V guest; I'm not sure whether that will pose any difficulties, but it might.
You could always profile your application to see where it spends most of the time.
You can learn a lot about your application's behaviour and data access patterns this way.
If you are using Linux, you have a wide range of available tools for profiling, like:
OProfile
sysprof
valgrind + kcachegrind
EDIT:
For a more exact measurement of the processor performance as well as memory accesses, you could also try the AMD CodeAnalyst Performance Analyzer. Here are instructions on how to use it with Intel processors, though I haven't tried it myself.
Another tool that you might also find useful is the Intel Performance Tuning Utility.
Unless there's latency built into the system, just run the application for some time on a dedicated machine and check the CPU counters. If the app uses 100% of the CPU core it can access, it's CPU-bound. Otherwise, it's spending time on other things, like memory allocation and I/O.
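If you can experiment with code, one rough way to see what memory-bound looks like is to compare a cache-friendly walk over a large array with a cache-hostile one: the work is identical, so any gap is cache-miss latency. A sketch (array size and seed are arbitrary):

```cpp
// cache_demo.cpp - the same number of loads, sequential vs. random order;
// a large gap in timings means the loop is dominated by memory latency.
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Build a permutation that is one big cycle (Sattolo's algorithm), so a
// pointer chase visits every element in random order.
static std::vector<size_t> random_cycle(size_t n) {
    std::vector<size_t> p(n);
    std::iota(p.begin(), p.end(), 0);
    std::mt19937_64 rng(42);
    for (size_t i = n - 1; i > 0; --i)
        std::swap(p[i], p[rng() % i]);
    return p;
}

static double chase(const std::vector<size_t>& next) {
    auto t0 = std::chrono::steady_clock::now();
    size_t i = 0;
    for (size_t n = 0; n < next.size(); ++n)
        i = next[i];                       // each load depends on the last
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = i;              // keep the chase from being elided
    (void)sink;
    return std::chrono::duration<double>(t1 - t0).count();
}

int main() {
    const size_t N = size_t(1) << 24;      // ~128 MB array, far bigger than cache
    std::vector<size_t> seq(N);
    for (size_t i = 0; i < N; ++i) seq[i] = (i + 1) % N;  // sequential cycle

    std::printf("sequential chase: %.3f s\n", chase(seq));
    std::printf("random chase:     %.3f s\n", chase(random_cycle(N)));
}
```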

Slowing down computer for debugging intermittent defect

Is there a way to slow down my development computer to try to replicate a defect that only occurs intermittently on a slow machine?
(For what it's worth, Ableton Live has a CPU usage simulation feature, but I've never seen something like this for debuggers.)
This tool provides a decent CPU stress capability. I assume that's what you mean by "slow down". :)
The currently popular stability test programs are:
Prime95 (this program's torture test)
3DMark2001
CPU Stability Test
SiSoft Sandra
Quake and other games
Folding@Home
SETI@home
Genome@home
This is from the stress testing documentation for Prime95.
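If you'd rather have a controllable, homemade load than a full torture test, a few busy-looping threads with a duty cycle get you most of the way; a minimal sketch (period and duty cycle are arbitrary knobs):

```cpp
// cpuload.cpp - crude adjustable CPU hog: every thread burns the CPU for
// `busy` out of every `period`, so 75/100 ms leaves roughly 25% of each
// core for the application under test. Ctrl+C to stop.
#include <chrono>
#include <thread>
#include <vector>

int main() {
    using clock = std::chrono::steady_clock;
    const auto period = std::chrono::milliseconds(100);
    const auto busy   = std::chrono::milliseconds(75);   // ~75% duty cycle

    unsigned n = std::thread::hardware_concurrency();
    if (n == 0) n = 4;  // hardware_concurrency may report 0; guess instead

    std::vector<std::thread> workers;
    for (unsigned i = 0; i < n; ++i)
        workers.emplace_back([&] {
            for (;;) {
                const auto start = clock::now();
                while (clock::now() - start < busy) { /* spin */ }
                std::this_thread::sleep_for(period - busy);
            }
        });
    for (auto& w : workers) w.join();  // runs until interrupted
}
```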
An old question, but I'd also suggest VMware Workstation. You can allocate more or fewer resources to virtual machines, and you can record/playback the machine's execution, so you can catch the bug in the act and then step through it at your leisure.

How to benchmark virtual machines

I am trying to perform a fair comparison of XenServer vs ESX and one comparison I would like to make is performance with multiple VMs. Does anyone know how to go about benchmarking VM performance in a fair way?
On each server I would like to run a fixed number of XP/Vista VMs (for example 8) and have some measure of how quickly each one runs when under load. Ideally I would like some benchmark of the overall system (CPU/Memory/Disk/Network) rather than just one aspect.
It seems to me this is actually a very tricky thing to do while getting any meaningful results, so I would be grateful for any suggestions!
I would also be interested to see any existing reports or comparisons that have been published (preferably independent rather than vendor biased!)
As a general answer, VMware (together with other virtualization vendors in the SPEC Virtualization sub-committee) has put together a hypervisor benchmarking suite called VMmark that is available for download. The VMmark website discusses why this benchmark may be useful for comparing hypervisors, including an FAQ and a whitepaper describing the benchmark.
That said, if you are looking for something very specific (e.g., how will it perform under your workload), you may have to roll your own variants of VMmark, especially if you are not trying to do the sorts of things that VMmark benchmarks (e.g., web servers, database servers, file servers, etc.) Nonetheless, the methodology behind its development should be of interest.
Disclaimer: I work for VMware, though not on VMmark.
I don't see why you can't use common benchmarks inside the VMs: WinSAT, PassMark, Futuremark, SiSoftware, etc. Host the VMs on different hosts and see how it goes.
As an aside, benchmarks that don't closely match your intended usage may actually hinder your evaluation. Depending on the importance of getting this right, you may have to build your own to make it relevant.
Why do you want to bench?
How about some anecdotal evidence?
I'm going to assume this is a test environment, because you want to benchmark on XP/Vista. Please correct me if I'm wrong.
My current test environment is about 20 VMs with varying OSes (2000/XP/Vista/Vista64/Server 2008/Server 2003) in different configurations on a dual quad-core Xeon machine with 8 GB RAM (looking to upgrade to 16 GB soon), and the slowest machines of all are the Vista ones, primarily due to heavy disk access (even with Windows Defender disabled).
Recommendations
- Hardware RAID. Too painful to run Vista VMs otherwise.
- More RAM.
If you're benchmarking and looking to run Vista VMs, I would suggest putting your focus on benchmarking disk access. If there are going to be performance differences elsewhere I doubt they would be of anything significant.
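Along those lines, if you want a quick, portable disk check to run inside each VM, timing a large sequential write is a crude start; a sketch (file name and sizes are arbitrary, and the OS write cache will flatter the number, as noted in the comments):

```cpp
// diskbench.cpp - crude sequential-write check: write 1 GB in 1 MB chunks
// and report MB/s. Caveat: this includes the OS write cache; a serious
// test must flush to physical disk (fsync/FlushFileBuffers) or use
// unbuffered (O_DIRECT-style) I/O.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const size_t chunk = 1 << 20;            // 1 MB per write
    const size_t total = size_t(1) << 30;    // 1 GB in total
    std::vector<char> buf(chunk, 'x');

    std::FILE* f = std::fopen("bench.tmp", "wb");
    if (!f) { std::perror("fopen"); return 1; }

    const auto t0 = std::chrono::steady_clock::now();
    for (size_t written = 0; written < total; written += chunk)
        std::fwrite(buf.data(), 1, chunk, f);
    std::fflush(f);
    std::fclose(f);
    const auto t1 = std::chrono::steady_clock::now();

    const double secs = std::chrono::duration<double>(t1 - t0).count();
    std::printf("%.1f MB/s\n", total / 1e6 / secs);
    std::remove("bench.tmp");
    return 0;
}
```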
I recently came across VMware's ESX Performance documentation - http://www.vmware.com/pdf/VI3.5_Performance.pdf. There is quite a bit in there on improving performance and benchmarking.
