Could a GPU accelerate gcc/g++ compilation?

When I'm building my Gentoo system, my NVIDIA GPU is usually unused. Can I make some use of it?

No, you cannot.
GPUs are typically best at accelerating massively parallel, math-heavy tasks that involve little branching. Compiling software is essentially the exact opposite: it is branch-heavy and does not parallelize well beyond the file level.
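The file-level parallelism is worth exploiting on the CPU, though. On Gentoo that is typically done through MAKEOPTS; a minimal sketch, assuming an 8-core machine and the common cores-plus-one rule of thumb:

    # /etc/portage/make.conf -- let emerge run make with up to
    # 9 parallel compile jobs, one per core plus one.
    MAKEOPTS="-j9"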

Related

Fortran code using auto parallelization vs MPI

I am using a Fortran code to run a large-scale simulation on a supercomputer. I am able to run the code in serial, but I want to improve the turnaround time. I am looking into making it parallel, and I have found that I can use auto-parallelization or MPI. The question I have is: which is more likely to improve the turnaround time?
I was able to use the Intel Fortran compiler with the flags -parallel -par-report to see which DO loops were made parallel. If I run the compiled code on 4 processors, would that actually work, or do I have to do something special?
In addition, do you know of any useful resources for me to learn MPI? I want to be able to use more processors to reduce the simulation time; that is my end goal.
More than likely, MPI is going to be faster than auto-parallelization. However, auto-parallelization would take about 0.5 seconds' worth of work to get a speed-up of, say, 1.2, compared to Y hours (maybe even up to Q weeks) of trial-and-error debugging to get a speed-up of, say, 1.7.
If you're interested in self-learning MPI through a book, Gropp, Lusk, & Skjellum's Using MPI is probably a good start.
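If you want a feel for what MPI code looks like before committing, here is a minimal sketch in C (the Fortran bindings are analogous: MPI_INIT, MPI_COMM_RANK, and so on); compile with mpicc and launch with, e.g., mpirun -np 4:

    /* Minimal MPI program: each process (rank) reports its identity. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's id   */
        MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total process count */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }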
The answer depends a bit on the nature of your hardware and your application/workload. Do you use a multi-node cluster (most typical) or a big shared-memory machine? Assuming you are a cluster user, you will have to use MPI or Fortran coarrays for (more likely) distributed-memory cross-node parallelism AND something for intra-node shared-memory parallelism (SMP).
Shared-memory parallelism can give you a speed-up proportional to the number of cores on a node (up to 32x with Xeons), or even more with coprocessors. Distributed-memory parallelism can give you a speedup proportional to the number of nodes. Both types (or actually all 3 types) of parallelism have to be used these days to get reasonable performance. You may think of it as a hierarchy: 1. MPI or coarrays at the top, 2. something for shared-memory threading in the middle, and 3. vectorization at the innermost level.
Well, from your question, it sounds like you are talking mostly about the SMP multicore threading level. This is where -parallel auto-parallelization operates. Don't expect big magic from auto-par. If you want better, more scalable parallelism, you have to try Fortran OpenMP or MPI for shared memory. I would recommend OpenMP in most cases; it's often easier to program and gives more performance.
But it's up to you, and you really should think bigger: about all 3 levels of parallelism. If you plan to address all 3 levels, then probably the optimal combination (since you are a happy Intel Fortran user) is 1. MPI at the first level, plus 2. OpenMP at the SMP level, plus 3. auto-vectorization guided by the OpenMP 4.0 pragma simd at the third level. I'm not an expert in coarrays, but they might be a good alternative to MPI at level 1.
My answer makes less sense if you don't deal with classic cluster hardware.
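To make that hierarchy concrete, here is a minimal sketch of levels 2 and 3 in C (the directives have direct Fortran equivalents, e.g. !$omp parallel do simd); level 1 would be MPI or coarrays splitting the index range across nodes first:

    /* y = a*x + y: one OpenMP 4.0 directive requests level 2 (threads
     * across a node's cores) and level 3 (SIMD vectorization of the
     * loop body) at once. Compile with, e.g., gcc -O2 -fopenmp. */
    #include <stddef.h>

    void saxpy(size_t n, float a, const float *x, float *y)
    {
        #pragma omp parallel for simd
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }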

How to get best performance of 8 core system using INTEL fortran

Please let me know how to set Intel Fortran compiler options to get the best performance out of an 8-core system, for both IA-32 and x64. I want to execute a Fortran program and take advantage of all the CPU time available on the 8-core system. Right now the program is only using 13% of the CPU time.
You can learn about the autovectorization and guided auto-parallelization features of Intel Fortran in this tutorial: http://software.intel.com/sites/products/documentation/hpc/composerxe/en-us/start/win/tutorial_comp_for_win.pdf
If you are doing linear algebra, solvers, or FFTs, you might get the best results by mapping your problem onto calls into the Intel Math Kernel Library (http://software.intel.com/en-us/articles/intel-mkl/), which is already multithreaded, vectorized, and cache-optimized.
If you are doing media / signal processing, you might map your problem onto calls into the Intel Performance Primitives library: http://software.intel.com/en-us/articles/intel-ipp/
Happy hacking!
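For instance, a matrix multiply that maps onto MKL is a one-call affair. A minimal sketch in C, assuming MKL is installed and linked in, using its standard CBLAS interface:

    #include <mkl.h>

    /* C = A * B for square n-by-n row-major matrices. MKL's dgemm is
     * internally multithreaded and vectorized, so this single call can
     * keep all 8 cores busy. */
    void matmul(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n,
                    1.0, A, n,     /* alpha, A, lda */
                    B, n,          /* B, ldb */
                    0.0, C, n);    /* beta, C, ldc */
    }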
In my specific application, a computational network model with several loops running throughout 20k iterations, each iteration evaluating a number of nested ifs, just enabling /Q2-level optimization in the compiler was sufficient to reduce the computing time drastically, while keeping the CPU load around 15%.
On a similar note, I noticed that raising the optimization setting to the highest level (/Q3) did do what you were asking (running all CPUs at about full load), but the computing time was NOT reduced at all.
Therefore, if one has a small problem, several cases to test, and processing capacity as the only bottleneck, it can be a good idea to open more than one Fortran solution and run those cases simultaneously.

Would it be possible for a JIT compiler to utilize GPU for certain operations behind the scenes?

Feel free to correct me if any part of my understanding is wrong.
My understanding is that GPUs offer a subset of the instructions that a normal CPU provides, but execute them much faster.
I know there are ways to utilize GPU cycles for non-graphical purposes, but it seems like (in theory) a language that's Just-In-Time compiled could detect the presence of a suitable GPU and offload some of the work to it behind the scenes, without code changes.
Is my understanding naive? Is it just that it's really complicated and simply hasn't been done yet?
"My understanding is that GPUs offer a subset of the instructions that a normal CPU provides, but execute them much faster."
It's definitely not that simple. The GPU is tailored mainly to SIMD/vector processing. So even though the theoretical potential of GPUs nowadays is vastly superior to that of CPUs, only programs that can benefit from SIMD instructions can be executed efficiently on the GPU. Also, there is of course a performance penalty when data has to be transferred from the CPU to the GPU to be processed there.
So for a JIT compiler to use the GPU efficiently, it must be able to detect code that can be parallelized to benefit from SIMD instructions, and then has to determine whether the overhead induced by transferring data from the CPU to the GPU will be outweighed by the performance improvements.
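In other words, the offload decision reduces to a cost model. A toy sketch of what such a check might look like; every constant and name here is made up for illustration, not taken from any real JIT:

    #include <stdbool.h>
    #include <stddef.h>

    /* Rough per-element costs in arbitrary time units (illustrative). */
    #define CPU_COST_PER_ELEM   1.0
    #define GPU_COST_PER_ELEM   0.05
    #define PCIE_COST_PER_BYTE  0.01
    #define GPU_LAUNCH_OVERHEAD 5000.0

    /* Offload a data-parallel loop only when the GPU's speed outweighs
     * the fixed launch overhead plus the CPU<->GPU transfer cost. */
    static bool worth_offloading(size_t n_elems, size_t bytes_moved)
    {
        double cpu = CPU_COST_PER_ELEM * (double)n_elems;
        double gpu = GPU_LAUNCH_OVERHEAD
                   + PCIE_COST_PER_BYTE * (double)bytes_moved
                   + GPU_COST_PER_ELEM * (double)n_elems;
        return gpu < cpu;
    }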
It is possible to use a GPU (e.g., a CUDA- or OpenCL-enabled one) to speed up the JIT itself. Both register allocation and instruction scheduling could be implemented efficiently.

CUBLAS or supported libraries, and emphasis for reading for a beginner

I'm trying to harness the power of the GPU (an NVIDIA Quadro NVS 140M) to speed up some matrix computations in my project. I'm reading through some documentation (programming guide, best practices guide, and reference manual), but I'm not sure which section(s) I should focus on. It would be great if I could receive some advice on this.
Also, I'm wondering if there are third-party-maintained SDKs, such as CuBLAS.net, that may simplify the CUBLAS development process, before I settle on the CUBLAS features that would help me achieve my project's goals. Again, thanks in advance for the comments.
Most of the documentation that comes with the CUDA toolkit & SDK downloads is about CUDA generally, not CUBLAS specifically. Start with the CUBLAS_Library_2.3.pdf file if you're just going to use CUBLAS--you won't need to write your own CUDA kernels. If you're already using a CPU BLAS, CUBLAS shouldn't be difficult to pick up. (And if you're not, consider trying an optimized CPU BLAS before CUBLAS, since it will be easier to program.)
If you're coding on .NET, then the easiest way to use CuBLAS is probably via platform-invoke calls into cublas.dll. Be sure to keep straight which arrays are in host (CPU) memory, and which are in device (GPU) memory.
Keep in mind that CUDA & CuBLAS aren't magic bullets. Performance depends on a lot of factors (especially transfers across the PCIe bus), and simply swapping CUBLAS calls for CPU-BLAS calls may not give you speedups. You may have to make more substantial changes to your own code to get performance improvements. Those other guides you mention are very useful for understanding the CUDA architecture and its bottlenecks.
EDIT: I wasn't clear about the boundary between user code and kernel code. CUBLAS is a library of pre-built, optimized CUDA kernels. If you only need BLAS functionality, you do not need to write your own kernels. Instead, just call CUBLAS functions. When performance tuning, you shouldn't need to tweak the CUBLAS kernels, but you may need to change how and when you call them, and how you use memory, so as to minimize the number of transfers across the PCI express bus.
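To make the host/device boundary concrete, here is a minimal sketch of that pattern against the legacy CUBLAS C API of that era (cublas.h); error checking of the cublasStatus returns is omitted for brevity:

    #include <cublas.h>

    /* C = A * B for square n-by-n matrices (BLAS is column-major). */
    void gpu_sgemm(int n, const float *A, const float *B, float *C)
    {
        float *dA, *dB, *dC;
        cublasInit();

        /* Device (GPU) memory -- distinct from the host arrays A, B, C. */
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);

        /* Host -> device transfers across the PCIe bus. */
        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);

        /* C = 1.0*A*B + 0.0*C, computed entirely on the GPU. */
        cublasSgemm('N', 'N', n, n, n, 1.0f, dA, n, dB, n, 0.0f, dC, n);

        /* Device -> host: copy back only when the result is needed. */
        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
        cublasShutdown();
    }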

OpenCL: does it play well with OpenMP, can I connect other languages to it, etc

The 1.0 spec for OpenCL just came out a few days ago (Spec is here) and I've just started to read through it. I want to know if it plays well with other high performance multiprocessing APIs like OpenMP (spec) and I want to know what I should learn. So, here are my basic questions:
If I am already using OpenMP, will that break OpenCL or vice-versa?
Is OpenCL more powerful than OpenMP? Or are they intended to be complementary?
Is there a standard way of connecting an OpenCL program to a standard C99 program (or any other language)? What is it?
Does anyone know if anyone is writing an OpenCL book? I'm reading the spec, but I've found books to be more helpful.
OpenMP and OpenCL are distinct, but can be made to work together. Neither of them should "break" the other.
For the sake of argument, let's assume there's a tradeoff between minimizing changes to an existing codebase and performance or computing power. OMP is "easy" in that you can apply it "magically" to embarrassingly parallel problems with a quick pragma or two.
OpenCL introduces brand new high-level concepts beyond typical OS threading models. Khronos probably doesn't want to say it out loud, but its genesis is in NVIDIA's CUDA. If you want to see how it works today, download the CUDA SDK and start playing. If you don't have any NVIDIA GPUs, don't worry, there's a GPU-emulator software option. OpenCL is a handy abstraction of a GPU that should apply to CPUs, DSPs, "accelerators" (Khronos' nickname for IBM's CellBE and probably Intel's Larrabee).
OpenCL is not supposed to be "written directly in C99". It's referred to as a C99 extension since its syntax is similar/identical to C99 with some new keywords. You cannot call libc (or any other library) from a kernel.
You could use both, but theoretically, OpenCL should be "better" (in that it's portable to more computing devices) if you're willing to port your code. You cannot use OpenMP pragmas in an OpenCL kernel.
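For a feel of those new keywords, here is a minimal sketch of an OpenCL kernel; the host program (ordinary C99) would compile this source at run time with clBuildProgram and launch it over an index space:

    /* __kernel, __global, and get_global_id() are the OpenCL additions;
     * the rest is plain C99. Note there are no libc #includes -- libc
     * cannot be called from a kernel. */
    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *out)
    {
        size_t i = get_global_id(0);  /* this work-item's index */
        out[i] = a[i] + b[i];
    }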
See also:
http://wikipedia.org/wiki/OpenCL
CUDA
LLVM
For the most part OpenMP and OpenCL are independent from each other. They are both ways of giving the developer access to parallelism on their platform.
OpenMP is designed to work well with multiple (identical) processors, where work that is approximately equal can be (nearly) automatically farmed out between them.
OpenCL is a somewhat different beast, in that it really shines when working with special co-processor hardware. It will allow you to offload some of the heavy-duty number crunching to the GPU or some other co-processor, like in the Cell. However, it was also built with the idea that it could be used to harness other main processors, as are now common in multi-core computers. I would consider this feature to be secondary, and if this is all you intend to use OpenCL for, I would not recommend it.
That said, I'd guess it would be somewhat challenging, though definitely not impossible to get OpenMP and OpenCL to work together in the same problem.
The first thing to think about is what work you're giving to OpenCL. This would definitely be a case where you would only want OpenCL to run on the GPU/co-processor, not on the other main processors/cores, since OpenMP is already using those. It wouldn't (shouldn't) cause application errors to run OpenCL and OpenMP on the same main processor, but it will cause undesirable scheduling, where both OpenMP and OpenCL run slower because they spend a good chunk of their time switching back and forth between each other. This would also happen if you ran any other processor-hungry process on the same core at the same time.
The other big thing to think about is how you're going to schedule the tasks that run on the co-processor. It's true that you can feed a lot of work into one of the modern GPUs, but there are lots of things to think about with the pipeline and memory usage. What you wouldn't want is to have 8 different OpenMP threads each trying to send their own work to the co-processor at the same time. I would recommend having only one thread manage all the interactions with the co-processor, so it can make sure to feed it work in an efficient manner (sketched below).
That said, I'm sure there are programs that have multiple types of tasks happening at the same time, where one type of task could always be farmed out to the Co-Processor and another kind of task could be handled by the multi-core main processor. This would be a fine example of a time to mix OpenMP and OpenCL.
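A minimal sketch of that single-manager pattern in C with OpenMP; enqueue_to_gpu and heavy_kernel are illustrative stand-ins, not real API calls:

    #include <math.h>

    /* Stand-ins only: in real code, enqueue_to_gpu would make the
     * OpenCL host calls, and heavy_kernel would be your CPU-side work. */
    static void enqueue_to_gpu(double *buf, int n) { (void)buf; (void)n; }
    static double heavy_kernel(double x) { return x * sqrt(x + 1.0); }

    void process(double *cpu_work, int n_cpu, double *gpu_work, int n_gpu)
    {
        #pragma omp parallel
        {
            /* Exactly one thread owns all co-processor interaction... */
            #pragma omp master
            enqueue_to_gpu(gpu_work, n_gpu);

            /* ...while every thread (the master included, once it
             * returns) works through the CPU-side tasks. */
            #pragma omp for schedule(dynamic, 1024)
            for (int i = 0; i < n_cpu; i++)
                cpu_work[i] = heavy_kernel(cpu_work[i]);
        }
    }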
Good Luck!
OpenCL is supposed to be written directly in C99, AFAIK? There are header files available for it now, anyhow.
By the way, there is work on translating OpenMP to GPGPU using CUDA.
