Is there a significant performance impact when using MPI on a single N-core machine compared to N single-core machines (HPC nodes)?
I am running WRF (mesoscale weather modeling) with MPI on a single 64-core AMD Ryzen machine, and for some reason performance is lower than I expected. It seems odd, because I would assume that with all ranks on the same machine, communication between them would be much faster than across nodes, and so overall performance would be better.
thanks
Related
I have some Go code I am benchmarking on my MacBook (Intel Core i5 processor with two physical cores).
Go's runtime.NumCPU() yields 4, because it counts "virtual cores".
I don't know much about virtual cores in this context, but my benchmarks seem to indicate a multiprocessing speedup of only 2x when I configure my code using
runtime.GOMAXPROCS(runtime.NumCPU())
I get the same performance if I use 2 instead of 4 cores. I would post the code, but I think it's largely irrelevant to my questions, which are:
1) is this normal?
2) why, if it is, do multiple virtual cores benefit a machine like my MacBook?
Update:
In case it matters, my code spawns the same number of goroutines as whatever runtime.GOMAXPROCS() is set to. The tasks are fully parallel and have no interdependencies or shared state. It's running as a natively compiled binary.
1) is this normal?
If you mean the virtual cores showing up in runtime.NumCPU(), then yes, at least in the sense that programs written in C as well as those running on top of other runtimes like the JVM will see the same number of CPUs. If you mean the performance, see below.
2) why, if it is, do multiple virtual cores benefit a machine like my MacBook?
It's a complicated matter that depends on the workload. The workloads where hyper-threading (HT) shows the most benefit are typically highly parallel ones, like 3D rendering and certain kinds of data compression. In other workloads the benefit may be absent, and the impact of HT on performance may even be negative (due to the contention and context-switching overhead of running more threads than there are physical cores). Reading the Wikipedia article on hyper-threading can elucidate the matter further.
Here is a sample benchmark that compares the performance of the same CPU with and without HT. Note how HT does not always improve performance and in some cases actually decreases it.
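One way to sanity-check the 2x ceiling yourself is to time the same CPU-bound work at a few GOMAXPROCS settings. Below is a minimal sketch, assuming a synthetic stand-in task (busyWork is hypothetical, not your code); on a 2-core/4-thread machine you would typically see a big drop going from 1 to 2 and little or no further drop going from 2 to 4.
    package main

    import (
        "fmt"
        "runtime"
        "sync"
        "time"
    )

    // busyWork is a stand-in for one of your independent, CPU-bound tasks.
    func busyWork(n int) float64 {
        s := 0.0
        for i := 1; i <= n; i++ {
            s += 1.0 / float64(i)
        }
        return s
    }

    // runTasks runs `tasks` copies of busyWork with GOMAXPROCS set to procs
    // and returns the wall-clock time for the whole batch.
    func runTasks(procs, tasks int) time.Duration {
        runtime.GOMAXPROCS(procs)
        results := make([]float64, tasks) // keep results so the work is observable
        start := time.Now()
        var wg sync.WaitGroup
        for t := 0; t < tasks; t++ {
            wg.Add(1)
            go func(i int) {
                defer wg.Done()
                results[i] = busyWork(50_000_000)
            }(t)
        }
        wg.Wait()
        return time.Since(start)
    }

    func main() {
        for _, procs := range []int{1, 2, 4} {
            fmt.Printf("GOMAXPROCS=%d: %v\n", procs, runTasks(procs, 4))
        }
    }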
I'm compiling the same program on two different machines and then running tests to compare performance.
There is a difference in the power of the two machines: one is a MacBook Pro with four 2.3 GHz cores, the other is a Dell server with twelve 2.9 GHz cores.
However, the Mac runs the test programs in less time!
The only difference in the compilation is that I run g++-mp-4.8 on the Mac, and g++-4.8 on the other.
EDIT: There is NO parallel computing going on, and my process was the only one running on the server. Also, I've updated the number of cores on the Dell.
EDIT 2: I ran three tests of increasing complexity, the times obtained were, in the format (Dell,Mac) in seconds: (1.67,0.56), (45,35), (120,103). These differences are quite substantial!
EDIT 3: Regarding the actual processor speed, we went over this with the system administrator and still came up with no good explanation. Here is the spec for the MacBook processor:
http://ark.intel.com/fr/products/71459/intel-core-i7-3630qm-processor-6m-cache-up-to-3_40-ghz
and here for the server:
http://ark.intel.com/fr/products/64589/Intel-Xeon-Processor-E5-2667-15M-Cache-2_90-GHz-8_00-GTs-Intel-QPI
I would like to highlight a feature that particularly skews results of single-threaded code on mobile processors:
Note that while there's a 500 MHz difference in base clock speed (the question mentioned 2.3 GHz, so are we looking at the same CPU?), there's only a 100 MHz difference in single-threaded speed when Turbo Boost is running at maximum.
The Core i7 also uses faster DDR than its server counterpart; server memory normally runs at a lower clock speed and is buffered to support much larger RAM capacities. Normally the Xeon's extra memory channels and larger L3 cache make up for this, but different workloads use cache and main memory differently.
Of course, generational improvements can make a difference as well. How much Ivy Bridge gains over Sandy Bridge varies greatly with the application.
A final possibility is that the program isn't CPU-bound. The I/O subsystem, GPGPU speed, etc. can affect performance by multiple orders of magnitude for applications that exercise them.
The compilers are practically identical (the -mp suffix just signifies that this GCC version was installed via MacPorts).
The performance difference you observed comes from the different CPUs: the server has a "Sandy Bridge" CPU running at 3.5 GHz, while the MacBook has a newer "Ivy Bridge" CPU running at 3.4 GHz (single-thread Turbo Boost speeds).
Between Sandy Bridge and Ivy Bridge is just a "tick" in Intel parlance, meaning the manufacturing process changed (from 32 nm to 22 nm) with almost no changes to the microarchitecture. Still, there are some changes in Ivy Bridge that improve IPC (instructions per clock cycle) for some workloads. In particular, the throughput of division operations, both integer and floating-point, was doubled. (For more changes, see the review on AnandTech: http://www.anandtech.com/show/5626/ivy-bridge-preview-core-i7-3770k/2 )
As your workload contains lots of divisions, this fits your results quite nicely: the "small" test case shows the largest improvement, while in the larger test cases the improved core performance is probably overshadowed by memory access, which seems to be roughly the same speed in both systems.
Note that this is purely educated guessing given the current information - one would need to look at your benchmark code, the compiler flags, and maybe analyze it using the CPU performance counters to verify this.
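If you want to test the division-throughput theory directly, a tiny micro-benchmark run on both machines can help. This is a minimal sketch in Go rather than your C++ program (the loop sizes are arbitrary): it times a division-dominated loop against a multiplication-dominated baseline, and on Ivy Bridge the division loop should close relatively more of the gap.
    package main

    import (
        "fmt"
        "time"
    )

    // timeLoop runs f and reports how long it took.
    func timeLoop(name string, f func() float64) {
        start := time.Now()
        r := f()
        fmt.Printf("%-4s %v (result %g)\n", name, time.Since(start), r)
    }

    func main() {
        const n = 200_000_000

        // Loop dominated by floating-point division: sensitive to divider throughput.
        timeLoop("div", func() float64 {
            s := 0.0
            for i := 1; i <= n; i++ {
                s += 1.0 / float64(i)
            }
            return s
        })

        // Same shape of loop, but with multiplication instead of division,
        // as a baseline that should look similar on both CPUs.
        timeLoop("mul", func() float64 {
            s := 0.0
            for i := 1; i <= n; i++ {
                s += 0.5 * float64(i)
            }
            return s
        })
    }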
Has anyone else noticed terrible performance when scaling up to use all the cores on a cloud instance with somewhat memory-intensive jobs (2.5 GB each in my case)?
When I run jobs locally on my quad-core Xeon chip, using all 4 cores instead of 1 slows each job down by about 25%. From what I understand this is to be expected; a drop in clock rate as more cores become busy is part of the multi-core chip design.
But when I run the jobs on a multi-core virtual instance, I see a slowdown of 2x to 4x in processing time between using 1 core and all cores. I've seen this on GCE, EC2, and Rackspace instances, and I have tested many different instance types, mostly the fastest offered.
So has this behavior been seen by others with jobs about the same size in memory usage?
The jobs I am running are written in Fortran. I did not write them, and I'm not really a Fortran guy, so my knowledge of them is limited. I know they have low I/O needs. They appear to be CPU-bound when I watch top as they run. They run without the need to communicate with each other, i.e., they are embarrassingly parallel. They each take about 2.5 GB of memory.
So my best guess so far is that jobs that use this much memory take a big hit from the virtualization layer's memory management. It could also be that my jobs are competing for an I/O resource, but this seems highly unlikely according to an expert.
My workaround for now is to use GCE, because they have single-core instances that actually run the jobs as fast as my laptop's chip and are priced almost proportionally by core.
You might be running into memory bandwidth constraints, depending on your data access pattern.
The linux perf tool might give some insight into this, though I'll admit that I don't entirely understand your description of the problem. If I understand correctly:
Running one copy of the single-threaded program on your laptop takes X minutes to complete.
Running 4 copies of the single-threaded program on your laptop, each copy takes X * 1.25 minutes to complete.
Running one copy of the single-threaded program on various cloud instances takes X minutes to complete.
Running N copies of the single-threaded program on an N-core virtual cloud instance, each copy takes X * 2-4 minutes to complete.
If so, it sounds like you're either running into kernel contention or contention for, e.g., memory I/O. It would be interesting to see whether various Fortran compiler options might help optimize memory access patterns, for example enabling SSE2 load/store instructions or other optimizations. You might also compare results between the GNU and Intel Fortran compilers.
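One way to test the memory-bandwidth theory without touching the Fortran jobs is to run N concurrent streaming passes over large arrays and watch how the time scales. Below is a minimal sketch in Go, assuming placeholder sizes (the ~256 MB buffers and the worker counts are arbitrary): if memory bandwidth is the limit, going from 1 stream to N concurrent streams makes each pass take far longer, much like the 2x-4x slowdown you describe.
    package main

    import (
        "fmt"
        "sync"
        "time"
    )

    const elems = (256 << 20) / 8 // ~256 MB of float64s per worker (placeholder size)

    // stream touches every element of its own large slice, so the run time is
    // dominated by main-memory traffic rather than arithmetic.
    func stream(data []float64) float64 {
        s := 0.0
        for i := range data {
            data[i] += 1.0
            s += data[i]
        }
        return s
    }

    // measure runs `workers` concurrent copies of stream, each on its own slice,
    // and returns the wall-clock time for all of them to finish.
    func measure(workers int) time.Duration {
        buffers := make([][]float64, workers)
        for i := range buffers {
            buffers[i] = make([]float64, elems)
        }
        start := time.Now()
        var wg sync.WaitGroup
        for i := 0; i < workers; i++ {
            wg.Add(1)
            go func(b []float64) {
                defer wg.Done()
                stream(b)
            }(buffers[i])
        }
        wg.Wait()
        return time.Since(start)
    }

    func main() {
        for _, w := range []int{1, 2, 4, 8} {
            fmt.Printf("%d concurrent streams: %v\n", w, measure(w))
        }
    }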
I know that some machine learning algorithms, like random forest, naturally lend themselves to a parallel implementation. While doing some homework I found these three parallel programming frameworks, so I am interested in knowing the major differences between these three types of parallelism.
In particular, if someone can point me to a study comparing the differences between them, that would be perfect!
Please list the pros and cons of each type of parallelism. Thanks!
MPI is a message-passing paradigm of parallelism. Here, a root process launches a copy of the program on every machine in its MPI_COMM_WORLD. All the processes (ranks) are independent, so the only way for them to communicate is by passing messages over the network. Network bandwidth and throughput are among the most crucial factors for an MPI application's performance. Idea: if there is just one MPI process per machine and that machine has many cores, you can use the OpenMP shared-memory paradigm to solve subsets of your problem on that machine.
CUDA is a SIMT (single instruction, multiple threads) paradigm of parallelism. It uses the modern GPU architecture to provide parallelism. A GPU contains blocks of cores that work on the same instruction in lock-step fashion (similar to the SIMD model). Hence, if all the threads in your program do a lot of the same work, you can use CUDA. But the amount of shared memory and global memory on a GPU is limited, so you should not rely on just one GPU to solve a huge problem.
Hadoop is used for solving large problems on commodity hardware using the MapReduce paradigm, so you do not have to worry about distributing the data or managing corner cases yourself. Hadoop also provides a distributed file system, HDFS, for storing data on the compute nodes.
Hadoop, MPI and CUDA are completely orthogonal to each other. Hence, it may not be fair to compare them.
You can, though, always use CUDA + MPI together to solve a problem on a cluster of GPUs; you still need CPU cores to perform the communication part of the problem.
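The MPI-versus-OpenMP contrast above is essentially message passing versus shared memory. As a rough analogy only (plain Go, not actual MPI or OpenMP), the sketch below sums chunks of data both ways: "ranks" that share nothing and only report results as messages, and workers that write into a shared array.
    package main

    import (
        "fmt"
        "sync"
    )

    // partialSum is the work each worker does on its own slice of the data.
    func partialSum(data []int) int {
        s := 0
        for _, v := range data {
            s += v
        }
        return s
    }

    // messagePassing mimics the MPI style: each "rank" owns its chunk and
    // reports its result only by sending a message; nothing is shared.
    func messagePassing(chunks [][]int) int {
        results := make(chan int, len(chunks))
        for _, c := range chunks {
            go func(chunk []int) { results <- partialSum(chunk) }(c)
        }
        total := 0
        for range chunks {
            total += <-results
        }
        return total
    }

    // sharedMemory mimics the OpenMP style: workers write into a shared
    // result array that all of them can see.
    func sharedMemory(chunks [][]int) int {
        partial := make([]int, len(chunks)) // shared between workers
        var wg sync.WaitGroup
        for i, c := range chunks {
            wg.Add(1)
            go func(i int, chunk []int) {
                defer wg.Done()
                partial[i] = partialSum(chunk)
            }(i, c)
        }
        wg.Wait()
        total := 0
        for _, p := range partial {
            total += p
        }
        return total
    }

    func main() {
        chunks := [][]int{{1, 2, 3}, {4, 5, 6}, {7, 8, 9}}
        fmt.Println(messagePassing(chunks), sharedMemory(chunks)) // both print 45
    }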
I'm working with someone who has some MATLAB code that they want sped up. They are currently trying to convert all of this code into CUDA to get it to run on a GPU. I think it would be faster to use MATLAB's Parallel Computing Toolbox to speed this up and run it on a cluster that has MATLAB's Distributed Computing Toolbox, allowing me to run this across several different worker nodes. Now, as part of the Parallel Computing Toolbox, you can use things like gpuArray. However, I'm confused as to how this would work. Is using things like parfor (parallelization) and gpuArray (GPU programming) compatible? Can I use both? Can something be split across different worker nodes (parallelization) while also making use of whatever GPUs are available on each worker?
They think it's still worth spending the time to convert all of the MATLAB code to CUDA code to run on a machine with multiple GPUs... but I think the right approach would be to use the features already built into MATLAB.
Any help, advice, direction would be really appreciated!
Thanks!
When you use parfor, you are effectively dividing your for loop into tasks, with one task per loop iteration, and splitting up those tasks to be computed in parallel by several workers where each worker can be thought of as a MATLAB session without an interactive GUI. You configure your cluster to run a specified number of workers on each node of the cluster (generally, you would choose to run a number of workers equal to the number of available processor cores on that node).
On the other hand, gpuArray indicates to MATLAB that you want to make a matrix available for processing by the GPU. Under the hood, MATLAB is marshalling the data from main memory to the graphics board's internal memory. Certain MATLAB functions (there's a list of them in the documentation) can operate on gpuArrays, and the computation happens on the GPU.
The key differences between the two techniques are that parfor computations happen on the CPUs of the cluster nodes with direct access to main memory. CPU cores typically have a high clock rate, but there are usually fewer of them in a CPU cluster than there are GPU cores. Individually, GPU cores are slower than a typical CPU core, and their use requires that data be transferred from main memory to video memory and back again, but there are many more of them in a cluster. As far as I know, hybrid approaches are supposed to be possible, in which you have a cluster of PCs, each PC has one or more NVIDIA Tesla boards, and you use both parfor loops and gpuArrays. However, I haven't had occasion to try this yet.
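To picture what parfor is doing, here is a rough analogy in Go rather than MATLAB (the iteration body is a placeholder): a fixed pool of workers pulls independent loop iterations off a queue, and each iteration writes its result into its own slot.
    package main

    import (
        "fmt"
        "sync"
    )

    // iterationBody is a stand-in for the body of one loop iteration.
    func iterationBody(i int) float64 {
        return float64(i) * float64(i)
    }

    func main() {
        const iterations = 1000
        const workers = 4 // analogous to the number of workers on a node

        jobs := make(chan int)                  // loop indices handed out as tasks
        results := make([]float64, iterations)  // one slot per iteration

        var wg sync.WaitGroup
        for w := 0; w < workers; w++ {
            wg.Add(1)
            go func() {
                defer wg.Done()
                for i := range jobs {
                    results[i] = iterationBody(i)
                }
            }()
        }
        for i := 0; i < iterations; i++ {
            jobs <- i
        }
        close(jobs)
        wg.Wait()

        fmt.Println("first few results:", results[:5])
    }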
If you are mainly interested in simulations, GPU processing is the perfect choice. However, if you want to analyse (big) data, go with parallelization. The reason for this is that GPU processing is only faster than CPU processing if you don't have to copy data back and forth. In the case of a simulation, you can generate most of the data on the GPU and only need to copy the result back. If you try to work with bigger data on the GPU, you will very often run into out-of-memory problems.
Parallelization is great if you have big data structures and more than two CPU cores in your computer.
If you write it in CUDA, it is guaranteed to run in parallel at the chip level, versus going with MATLAB's best guess for a non-parallel architecture and your best effort to get it to run in parallel.
Kind of like drinking fresh mountain water run-off versus buying filtered water. Go with the purist solution.