Does TBB perform badly on AMD processors?

I was recently profiling the performance of a library that uses TBB to achieve parallelism.
[I cannot share the library or the measurements here]
After a few tests, I realized that the application runs slower on AMD processors than on Intel processors.
However, I don't have a setup with directly comparable processors (the same clock speed on both).
I was just wondering whether anybody has had the same experience, and whether TBB is optimized to perform significantly better on Intel processors than on comparable AMD processors.
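One way to narrow this down without sharing the library is to time a bare TBB kernel on both machines: if a trivial parallel_for shows the same Intel/AMD gap, the difference lies in the hardware or the TBB scheduler rather than in the library's own code. A minimal sketch (the kernel and sizes are arbitrary choices, not taken from the library in question):

    #include <tbb/parallel_for.h>
    #include <tbb/blocked_range.h>
    #include <chrono>
    #include <cstdio>
    #include <vector>

    int main() {
        const size_t n = 10'000'000;
        std::vector<float> a(n, 1.0f), b(n, 2.0f);

        auto t0 = std::chrono::steady_clock::now();
        for (int rep = 0; rep < 20; ++rep) {
            // Simple memory-bound kernel; TBB splits the range across cores.
            tbb::parallel_for(tbb::blocked_range<size_t>(0, n),
                              [&](const tbb::blocked_range<size_t>& r) {
                                  for (size_t i = r.begin(); i != r.end(); ++i)
                                      a[i] = a[i] * b[i] + 0.5f;
                              });
        }
        auto t1 = std::chrono::steady_clock::now();
        auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0);
        std::printf("20 parallel_for passes: %lld ms\n", (long long)ms.count());
    }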


How to run an OpenMP program on clusters with multiple nodes? [duplicate]

I want to know whether it is possible to run an OpenMP program on multiple hosts. So far I have only heard of programs that can be executed on multiple threads, but all within the same physical computer. Is it possible to execute a program on two (or more) clients? I don't want to use MPI.
Yes, it is possible to run OpenMP programs on a distributed system, but I doubt it is within the reach of every user around. ScaleMP offers vSMP - an expensive commercial hypervisor software that allows one to create a virtual NUMA machine on top of many networked hosts, then run a regular OS (Linux or Windows) inside this VM. It requires a fast network interconnect (e.g. InfiniBand) and dedicated hosts (since it runs as a hypervisor beneath the normal OS). We have an operational vSMP cluster here and it runs unmodified OpenMP applications, but performance is strongly dependent on data hierarchy and access patterns.
NICTA used to develop a similar SSI hypervisor named vNUMA, but its development has also stopped. Besides, their solution was IA64-specific (IA64 is Intel Itanium, not to be confused with Intel 64, which is their current generation of x86 CPUs).
Intel used to develop Cluster OpenMP (ClOMP; not to be confused with the similarly named project to bring OpenMP support to Clang), but it was abandoned due to "general lack of interest among customers and fewer cases than expected where it showed a benefit" (from here). ClOMP was an Intel extension to OpenMP and was built into the Intel compiler suite; e.g. you couldn't use it with GCC (this request to start ClOMP development for GCC went into limbo). If you have access to old versions of Intel compilers (versions 9.1 to 11.1), you would have to obtain a (trial) ClOMP license, which might be next to impossible given that the product is dead and old (trial) licenses have already expired. Then again, starting with version 12.0, Intel compilers no longer support ClOMP.
Other research projects exist (just search for "distributed shared memory"), but only vSMP (the ScaleMP solution) seems mature enough for production HPC environments (and it's priced accordingly). It seems that most efforts now go into the development of partitioned global address space (PGAS) languages (Co-Array Fortran, Unified Parallel C, etc.) instead. I would suggest that you have a look at Berkeley UPC or invest some time in learning MPI, as it is definitely not going away in the years to come.
Previously, there was Cluster OpenMP.
Cluster OpenMP was an implementation of OpenMP that could make use of multiple SMP machines without resorting to MPI. This approach had the advantage of eliminating the need to write explicit messaging code, as well as not mixing programming paradigms. The shared memory in Cluster OpenMP was maintained across all machines through a distributed shared-memory subsystem. Cluster OpenMP is based on the relaxed memory consistency of OpenMP, allowing shared variables to be made consistent only when absolutely necessary. (source)
Performance Considerations for Cluster OpenMP
Some memory operations are much more expensive than others. To achieve good performance with Cluster OpenMP, the number of accesses to unprotected pages must be as high as possible relative to the number of accesses to protected pages. This means that once a page is brought up to date on a given node, a large number of accesses should be made to it before the next synchronization. To accomplish this, a program should have as little synchronization as possible and re-use the data on a given page as much as possible. This translates to avoiding fine-grained synchronization, such as atomic constructs or locks, and having high data locality. (source)
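In code, that advice boils down to replacing per-iteration synchronization with thread-private work that is merged once at the end. A small illustration in plain OpenMP (the same computation written both ways; nothing here is Cluster-OpenMP-specific):

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        std::vector<double> data(n, 1.0);

        // Fine-grained synchronization: every iteration touches shared state,
        // which on a page-based DSM means constant page invalidation traffic.
        double sum1 = 0.0;
        #pragma omp parallel for
        for (int i = 0; i < n; ++i) {
            #pragma omp atomic
            sum1 += data[i];
        }

        // Coarse-grained alternative: each thread accumulates privately and
        // re-uses its local pages; there is a single merge at the end.
        double sum2 = 0.0;
        #pragma omp parallel for reduction(+ : sum2)
        for (int i = 0; i < n; ++i)
            sum2 += data[i];

        std::printf("%f %f\n", sum1, sum2);
    }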
Another option for running OpenMP programs on multiple hosts is the remote offloading plugin in the LLVM OpenMP runtime.
https://openmp.llvm.org/design/Runtimes.html#remote-offloading-plugin
The big issue with running OpenMP programs on distributed memory is data movement. Coincidentally, that is also one of the major issues in programming GPUs. Extending OpenMP to handle GPU programming has given rise to OpenMP directives that describe data transfer. Programming GPUs has also forced programmers to think more carefully about building programs that consider data movement.
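For a concrete picture of those data-transfer directives, here is a hedged sketch of a standard OpenMP target region with explicit map clauses; under the remote offloading plugin the "device" can be another host, and without any offload support the region simply runs on the local CPU:

    #include <cstdio>

    int main() {
        const int n = 1024;
        float a[n], b[n];
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        // The map clauses state exactly what moves to and from the device:
        // b is copied in only, a is copied in and back out.
        #pragma omp target teams distribute parallel for \
                map(to : b[0:n]) map(tofrom : a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] += b[i];

        std::printf("a[0] = %f\n", a[0]); // 3.0
    }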

Most simple architecture available as GCC target

I'm looking for a CPU architecture that is supported by GCC (and is still maintained) and for which it is easiest to implement a software simulator.
It should be something simple, with a flat memory model, a 16-bit or wider address space, and a 16-32 bit ALU; good code density is preferred, since it will be running programs under program-memory limitations.
A few words about the origin of these requirements: I need a virtual CPU for running 'sandboxed' programs on microcontrollers with ~5 KB of RAM and an ARM CPU clocked at ~20 MHz.
Performance is not an issue at all; what I really need is to write C/C++ programs and then run them in a sandbox without the standard library. GCC covers the compilation side; I just need to implement a VCPU for one of its target architectures.
I have studied the ARMv7-M and AVR32 references and found them quite acceptable, but somewhat more powerful than I need. The less and simpler the code I have to write for the VCPU implementation, the sooner I will have what I need, and the fewer bugs there will be.
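For a sense of how little code a simple target demands, the interpreter core reduces to a fetch-decode-execute loop. The sketch below uses an invented toy ISA purely for illustration (it is not ARMv7-M, AVR32, or any real GCC target):

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Toy ISA: fixed 3-byte instructions, flat 16-bit address space.
    enum Op : uint8_t { HALT, LOADI, ADD, JMP };

    struct VCpu {
        uint16_t pc = 0;
        uint16_t reg[4] = {0};
        std::vector<uint8_t> mem;

        void run() {
            for (;;) {
                // Fetch the opcode and its two operand bytes, then decode.
                uint8_t op = mem[pc], a = mem[pc + 1], b = mem[pc + 2];
                pc += 3;
                switch (op) {
                    case HALT:  return;
                    case LOADI: reg[a] = b;                    break; // reg[a] = imm
                    case ADD:   reg[a] += reg[b];              break; // reg[a] += reg[b]
                    case JMP:   pc = (uint16_t)(a | (b << 8)); break; // absolute jump
                    default:    return;                               // invalid opcode
                }
            }
        }
    };

    int main() {
        VCpu cpu;
        cpu.mem = { LOADI, 0, 40, LOADI, 1, 2, ADD, 0, 1, HALT, 0, 0 };
        cpu.run();
        std::printf("r0 = %u\n", cpu.reg[0]); // prints 42
    }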
UPDATE:
Seems like I found what I need. It was already answered here: What is the smallest, simplest CPU that gcc can compile for?
Thank you all.

Combining graphic-card and parallel-computation libraries in C++

I'm working on a diploma project that relies heavily on mathematical calculations and should present some results in 3D. For these purposes I decided to use CUDA or OpenCL for the parallel computation of the mathematical part and, most probably, OpenGL for presenting the results. In addition, the project should be deployable on clusters (running MS Windows); for this purpose my project supervisor recommended MPI.
My question is the following: where is it easier to combine all these components, in MS Visual Studio or in Qt?
The main part is CUDA + OpenCL + OpenGL; it will be the core of the project.
P.S. This question is not meant to start a holy war between Qt and MS Visual Studio.
OpenCL is not limited to GPUs; it can be used for parallel programming on clusters as well. Intel, for example, provides an OpenCL implementation aimed at multicore CPUs and clusters.
So my recommendation is to use OpenCL for both GPU computing and clustering. MPI (Message Passing Interface) is mainly a way to communicate between tasks running on separate cluster nodes. It's not so much of a clustering framework by itself.
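To see that in practice, the standard OpenCL host API enumerates CPU devices exactly like GPU devices. A minimal sketch (assumes an OpenCL ICD/driver is installed; error checking omitted):

    #include <CL/cl.h>
    #include <cstdio>
    #include <vector>

    int main() {
        cl_uint nplat = 0;
        clGetPlatformIDs(0, nullptr, &nplat);
        std::vector<cl_platform_id> plats(nplat);
        clGetPlatformIDs(nplat, plats.data(), nullptr);

        for (cl_platform_id p : plats) {
            char pname[256] = {0};
            clGetPlatformInfo(p, CL_PLATFORM_NAME, sizeof(pname), pname, nullptr);

            // CL_DEVICE_TYPE_ALL returns CPU devices alongside GPUs.
            cl_uint ndev = 0;
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, 0, nullptr, &ndev);
            std::vector<cl_device_id> devs(ndev);
            clGetDeviceIDs(p, CL_DEVICE_TYPE_ALL, ndev, devs.data(), nullptr);

            for (cl_device_id d : devs) {
                cl_device_type type;
                char dname[256] = {0};
                clGetDeviceInfo(d, CL_DEVICE_TYPE, sizeof(type), &type, nullptr);
                clGetDeviceInfo(d, CL_DEVICE_NAME, sizeof(dname), dname, nullptr);
                std::printf("%s: %s (%s)\n", pname, dname,
                            (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "GPU/other");
            }
        }
    }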

OpenCL maturity under Windows

I am considering using OpenCL in a consumer product that is currently under development.
After doing some research, I found that there is generally good support under Mac OS X. Linux support is also relatively good, but my target audience does not use Linux. It remains to check how well it is supported on Windows.
Regarding Windows, I found a question about OpenCL distribution which raises some concerns.
Do any of you have any experience with using OpenCL in consumer-oriented products under Windows? I am more interested in the GPU side of OpenCL, specifically driver support.
Just like CUDA or Stream, OpenCL needs to be supported by the driver. Most CUDA-capable GPUs support OpenCL with a somewhat up-to-date driver (CUDA 1.0 upwards).
In fact, if you compile with, say, CUDA SDK 4.1, your end users will need newer drivers than if you had used OpenCL.
Also, OpenCL is not bound to any GPU architecture. While this might be problematic for specifically designed algorithms, it shouldn't have a very high impact on normal end user programs.
At least with CUDA, you can only compile code optimized for the current known major version. Compiling OpenCL kernels on the end user machine might allow optimizations for newer binary specifications in the future.
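That late compilation looks roughly like the following sketch: the kernel source ships as text and the user's driver compiles it with clBuildProgram, so it can target hardware that did not exist when the application was built (ctx and dev are placeholders for an already-created context and device):

    #include <CL/cl.h>
    #include <cstdio>

    cl_kernel build_kernel(cl_context ctx, cl_device_id dev) {
        const char* src =
            "__kernel void scale(__global float* a, float f) {"
            "    a[get_global_id(0)] *= f;"
            "}";
        cl_int err;
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);

        // Compiled on the end user's machine by their installed driver.
        err = clBuildProgram(prog, 1, &dev, "", nullptr, nullptr);
        if (err != CL_SUCCESS) {
            // On failure, fetch the build log for diagnostics.
            char log[4096] = {0};
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  sizeof(log), log, nullptr);
            std::printf("build failed:\n%s\n", log);
            clReleaseProgram(prog);
            return nullptr;
        }
        cl_kernel k = clCreateKernel(prog, "scale", &err);
        clReleaseProgram(prog); // the kernel keeps the program alive
        return k;
    }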
The crashes that the author of that question reported for Nvidia OpenCL seem to happen a lot when resources are not freed properly. I was seeing similar crashes until I fixed a leak that failed to release created kernels.
I'm not saying it's the only reason why it might crash, but apart from programmer errors it appears fairly stable to me.
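For completeness, the corresponding cleanup looks like this: every clCreate* call must be paired with the matching clRelease* call, and forgetting the kernels (as in the leak described above) is an easy mistake. The parameters are placeholders for objects created elsewhere:

    #include <CL/cl.h>

    void cleanup(cl_kernel kernel, cl_program program,
                 cl_command_queue queue, cl_context context, cl_mem buffer) {
        clReleaseMemObject(buffer);   // buffers first
        clReleaseKernel(kernel);      // the leak described above: kernels
                                      // created but never released
        clReleaseProgram(program);
        clReleaseCommandQueue(queue);
        clReleaseContext(context);    // context last
    }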
AMD and Nvidia both support OpenCL on most (all?) of their GPUs.
Unfortunately, Intel only supports it on the CPU, which is a bit pointless: if you have to insist that the user has a separate GPU for your app, you can just as well insist that they have an Nvidia one and use CUDA. This has limited the uptake of OpenCL.

What Java application is available to stress-test a virtual machine?

I am interested in ways to stress-test as well as benchmark the SANOS operating system kernel.
While I'm not sure whether this is suitable for kernel testing, you may want to have a look at SPECjbb2005:
""SPECjbb2005 (Java Server Benchmark) is
SPEC's benchmark for evaluating the
performance of server side Java. Like
its predecessor, SPECjbb2000,
SPECjbb2005 evaluates the performance
of server side Java by emulating a
three-tier client/server system (with
emphasis on the middle tier). The
benchmark exercises the
implementations of the JVM (Java
Virtual Machine), JIT (Just-In-Time)
compiler, garbage collection, threads
and some aspects of the operating
system. It also measures the
performance of CPUs, caches, memory
hierarchy and the scalability of
shared memory processors (SMPs).
SPECjbb2005 provides a new enhanced
workload, implemented in a more
object-oriented manner to reflect how
real-world applications are designed
and introduces new features such as
XML processing and BigDecimal
computations to make the benchmark a
more realistic reflection of today's
applications.""
