Learn Nvidia CUDA - image and video processing

I am a C++ programmer who develops image and video algorithms. Should I learn Nvidia CUDA, or is it one of those technologies that will disappear?

CUDA is currently a single-vendor technology from NVIDIA, and therefore doesn't have the multi-vendor support that OpenCL does.
However, it's more mature than OpenCL, has great documentation, and the skills learnt using it will transfer easily to other parallel data-processing toolkits.
As an example of this, read the Data Parallel Algorithms paper by Hillis and Steele and then look at the Nvidia tutorials - there's a clear link between the two, yet the Hillis/Steele paper was written over 20 years before CUDA was introduced.
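If you want a concrete taste of that lineage, here is a minimal C++ sketch of the Hillis/Steele inclusive prefix sum (scan) - the same doubling pattern Nvidia's scan tutorial builds on. On a GPU, each iteration of the inner loop would be one parallel thread; the function name is mine, not the paper's.

```cpp
// Hillis/Steele inclusive scan: each pass adds the element 2^d positions
// back; after log2(n) passes every element holds its inclusive prefix sum.
#include <cstdio>
#include <vector>

std::vector<int> hillis_steele_scan(std::vector<int> a) {
    for (size_t stride = 1; stride < a.size(); stride *= 2) {
        std::vector<int> next = a;              // double-buffer, as the parallel version must
        for (size_t i = stride; i < a.size(); ++i)
            next[i] = a[i] + a[i - stride];     // each i is an independent (parallel) work-item
        a = std::move(next);
    }
    return a;
}

int main() {
    std::vector<int> v{3, 1, 7, 0, 4, 1, 6, 3};
    for (int x : hillis_steele_scan(v)) std::printf("%d ", x);  // prints 3 4 11 11 15 16 22 25
    std::printf("\n");
}
```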
Finally, the FCUDA project is working to allow CUDA code to target non-NVIDIA hardware (FPGAs).

CUDA should stick around for a while, but if you're just starting out, I'd recommend looking at OpenCL or DirectCompute. Both of these run on ATI as well as NVidia hardware, and also work on the vector units (SSE) of CPUs.

I think you should stick with OpenCL instead, which is an open standard supported by ATI, nVidia and others. CUDA might not disappear in the next few years, but either way it is not compatible with non-nVidia GPUs.

OpenCL might take some time to become pervasive, but I found learning CUDA very informative and I don't think CUDA is going to be out of the limelight anytime soon. Besides, CUDA is easy enough that the time it takes to learn is much shorter than CUDA's likely shelf life.

This is the era of high-performance, parallel computing. CUDA and OpenCL are the emerging technologies of GPU computing, which is itself a form of high-performance computing. If you are a passionate programmer willing to set benchmarks in parallel algorithms, you should really go for these technologies. The data-parallel part of your program can execute in a fraction of a second on a many-core GPU architecture when it would usually take much longer on your CPU.

Related

What are the optimum hardware requirements to run MXNet smoothly?

I am using my MacBook Pro. I am trying to run the MXNet Python demo code and the execution time is extremely slow; it takes a long time to execute. Is this normal? Also, I want to run MXNet on a Raspberry Pi 3.
Almost all deep learning frameworks (MXNet included) will run much faster with a CUDA-capable GPU from NVIDIA. GPUs will often speed up the kinds of vector math needed for deep learning by 100x. Apple stopped building machines with NVIDIA GPUs several years ago (2012, IIRC); if you have one of those, make sure you have CUDA working on your Mac. I'm not aware of any way right now to get MXNet to make use of the AMD or Intel GPUs that ship with Apple machines. Also know that even with the fastest GPUs, deep learning jobs will often take hours, days, or even weeks to complete, so patience is definitely part of the game, regardless of what hardware you're using.
That said, GPUs aren't the only way to run deep learning systems. Particularly for making predictions (inference) with pre-trained models, CPUs are often just fine, so this can be useful for a task like semantic image processing.
When training, using smaller datasets and smaller models can make them run faster. Also, to make sure you're getting the most out of your CPU, check that you have installed a good BLAS library like Intel's MKL.
But getting any useful work out of a Raspberry Pi is going to take some careful optimization, even for inference. This is an area of active research; see for example this paper, or look at adding a USB hardware accelerator.

Can I run CUDA or OpenCL on Intel Iris?

I have a MacBook Pro (mid-2014) with Intel Iris graphics, an Intel Core i5 processor, and 16GB of RAM. I am planning to learn some ray-traced 3D, but I am not sure if my laptop can render fast without any Nvidia hardware.
So I would appreciate it if someone could tell me whether I can use CUDA, and if not, teach me in a very easy way how to enable OpenCL in After Effects. I am also looking for a beginner's tutorial on how to create or build with OpenCL.
CUDA works only on Nvidia hardware, though there may be some libraries that convert it to run on CPU cores (not on the integrated GPU).
AMD is working on "hipify"ing old CUDA kernels, translating them into similar but more portable code so they can become more general.
OpenCL works everywhere, as long as both the hardware and the OS support it. AMD, Nvidia, Intel, Xilinx, Altera, Qualcomm, MediaTek, Marvell, Texas Instruments and others support it. Maybe even Raspberry Pi boards will support it in the future.
Documentation for OpenCL on stackoverflow.com is still under development, but there are some other resources:
AMD's tutorial
AMD's parallel programming guide for OpenCL
Nvidia's learning material
Intel's HD Graphics coding tutorial
An overview of hardware, benchmark and parallel programming subjects
A blog
The Scratchapixel ray-tracing tutorial (I read it, then wrote a teraflops GPU version of it)
If it is Iris Graphics 6100:
Your integrated GPU has 48 execution units, each with 8 ALUs that can do add, multiply and many other operations, and its clock frequency can rise to 1 GHz. This means a maximum of 48 * 8 * 2 (1 add + 1 multiply) * 1G = 768 giga floating-point operations per second, but only if each ALU can concurrently do 1 addition and 1 multiplication. 768 GFlops is more than a low-end discrete GPU such as AMD's R7-240. (As of 19.10.2017, AMD's low end is the RX 550 at 1200 GFlops, faster than Intel's Iris Plus 650 at nearly 900 GFlops.) Ray tracing needs to re-access a lot of geometry data, so a device should have its own memory (as Nvidia and AMD cards do) to let the CPU get on with its own work.
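As a generic formula, that peak-throughput estimate is just:

```latex
\text{peak FLOPS} = N_{\mathrm{EU}} \times N_{\mathrm{ALU/EU}} \times \underbrace{2}_{\mathrm{add}+\mathrm{mul}} \times f_{\mathrm{clock}}
                  = 48 \times 8 \times 2 \times 1\,\mathrm{GHz} = 768\ \mathrm{GFLOPS}
```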
How you install OpenCL on a computer varies by OS and hardware, but building software on an OpenCL-installed computer follows the same steps (a minimal host-side sketch follows this list):
Query platforms. The result can be AMD, Intel, Nvidia, duplicates of these (from overlapping installations of wrong drivers), or experimental platforms predating newer OpenCL versions.
Query the devices of a platform (or of all platforms). This gives the individual devices (and their duplicates if there are driver errors or other things to fix).
Create a context (or several) using a platform.
Using a context (so everything inside it gets implicit synchronization):
Build programs from kernel strings. A CPU usually takes less time than a GPU to build a program (and there is a binary-load option to shortcut this).
Build kernels (as objects now) from the programs.
Create buffers from host-side buffers or as OpenCL-managed buffers.
Create a command queue (or several).
Just before computing (or a series of computations):
Select buffers for a kernel as its arguments.
Enqueue buffer write (or map/unmap) operations on the "input" buffers.
Compute:
Enqueue an NDRange kernel (specifying which kernel runs and with how many threads).
Enqueue buffer read (or map/unmap) operations on the "output" buffers.
Don't forget to synchronize with the host using clFinish() if you haven't used a blocking buffer read.
Use your accelerated data.
After OpenCL is no longer needed:
Make sure all command queues are empty / finished doing kernel work.
Release everything in the opposite order of creation.
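Putting those steps together, here is a minimal host-side sketch in C++, assuming an OpenCL 1.x runtime is installed (the "square" kernel is just an example of mine; error checking is omitted for brevity):

```cpp
// Minimal OpenCL host program following the steps above.
#include <CL/cl.h>      // <OpenCL/opencl.h> on macOS
#include <cstdio>
#include <vector>

static const char* src =
    "__kernel void square(__global const float* in, __global float* out) {"
    "    size_t i = get_global_id(0);"
    "    out[i] = in[i] * in[i];"
    "}";

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);                                  // query platforms
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);    // query devices

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr); // create a context
    cl_command_queue q = clCreateCommandQueue(ctx, device, 0, nullptr);               // create a command queue

    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);   // build program from the kernel string
    cl_kernel kern = clCreateKernel(prog, "square", nullptr);      // build the kernel object

    const size_t n = 1024;
    std::vector<float> in(n, 3.0f), out(n);
    cl_mem bufIn  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), nullptr, nullptr);
    cl_mem bufOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

    clSetKernelArg(kern, 0, sizeof(cl_mem), &bufIn);   // select buffers as kernel arguments
    clSetKernelArg(kern, 1, sizeof(cl_mem), &bufOut);

    clEnqueueWriteBuffer(q, bufIn, CL_TRUE, 0, n * sizeof(float), in.data(), 0, nullptr, nullptr);  // write inputs
    clEnqueueNDRangeKernel(q, kern, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);                  // compute
    clEnqueueReadBuffer(q, bufOut, CL_TRUE, 0, n * sizeof(float), out.data(), 0, nullptr, nullptr); // read outputs
    clFinish(q);                                       // synchronize with the host

    std::printf("out[0] = %f\n", out[0]);              // use the accelerated data

    clReleaseMemObject(bufOut); clReleaseMemObject(bufIn);   // release in the opposite order of creation
    clReleaseKernel(kern); clReleaseProgram(prog);
    clReleaseCommandQueue(q); clReleaseContext(ctx);
    return 0;
}
```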
If you need to accelerate open-source software, you can replace a parallelizable hotspot loop with a simple OpenCL kernel, if it doesn't already have acceleration support. For example, you could accelerate the air-pressure and heat-advection part of the Powder Toy sandbox simulator.
Yes, you can, because OpenCL is supported natively by macOS.
From your question it appears you are not seeking advice on programming, which would have been the appropriate subject for Stack Overflow. The first search hit on Google explains how to turn on OpenCL accelerated effects in After Effects (Project Settings dialog -> Video Rendering and Effects), but I have no experience with that myself.

What types of code domains is OpenCL suited to?

I read the OpenCL overview, and it states it is suitable for code that runs on CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math- and image-type operations. I didn't see anything for, say, strings.
This makes me wonder what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires processing arrays of strings?
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this imply that the CPU cores are not OpenCL compliant, or is there no such implication?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.
You can think of OpenCL as a combination of a runtime (for device discovery and queueing) and a C-based programming language. This language has native vector types and built-in functions and operations for doing all sorts of fun stuff with these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it is the responsibility of the implementation to map that to the actual vector ISA of your hardware.
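For instance, a kernel written with OpenCL's native float4 type (a sketch; the name saxpy4 is mine) processes four lanes per work-item, and it is the implementation's job to map that onto SSE, NEON, or the GPU's SIMD lanes:

```c
// OpenCL C kernel using the built-in float4 vector type; the scalar*vector
// multiply and the vector add operate across all four lanes at once.
__kernel void saxpy4(float a,
                     __global const float4* x,
                     __global float4* y) {
    size_t i = get_global_id(0);
    y[i] = a * x[i] + y[i];
}
```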
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you won't have to worry about maintaining two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know, this assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he hasn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.
No links, but I would assume this is because algorithms that use strings may do a lot of dynamic memory allocation and branching, both of which GPGPUs are not well suited for. GPGPUs also have a lot in common with vector processing, so doing units of work on differently sized blocks of memory (which a string algorithm will generally involve, since you usually don't have a homogeneous group of strings) yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.
This makes me wonder what would you run on a CPU via OpenCL?
I prefer to use OpenCL to offload work from the CPU to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for CPU use. Such limitations can be memory size, a memory bottleneck, low clock speed, or the PCIe bus getting in the way.
I say I like using a separate kernel for the CPU because I think all kernels should be tweaked to run on their target hardware. I even like to have an OpenMP backup plan, as most algorithms I use get tested in that manner ahead of time.
I suppose it is best practice to test a GPU kernel on the CPU to make sure it runs as expected. If a user of your software has OpenCL installed but only a CPU (or a low-end GPU), it's nice to be able to execute the same code on the different devices.
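A sketch of that fallback idea: ask the platform for a GPU first and settle for the CPU device when there isn't one, so the same kernel source can run on whatever the user has (error handling trimmed):

```cpp
// Prefer a GPU device, but fall back to a CPU device so the same
// OpenCL kernel source still runs on GPU-less machines.
#include <CL/cl.h>

cl_device_id pick_device(cl_platform_id platform) {
    cl_device_id dev;
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &dev, nullptr) == CL_SUCCESS)
        return dev;                       // a GPU is available, use it
    if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &dev, nullptr) == CL_SUCCESS)
        return dev;                       // otherwise run the kernel on the CPU
    return nullptr;                       // no usable OpenCL device on this platform
}
```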

CUBLAS or supported libraries, and what a beginner should focus on reading

I'm trying to harness the power of the GPU (an nVidia Quadro NVS 140M) to speed up some matrix computations in my project. I'm reading through some documentation (the programming guide, best practices guide, and reference manual), but I'm not sure which sections I should focus on. It would be great if I could get some advice on this.
Also, I'm wondering whether there are third-party-maintained SDKs, such as CuBLAS.net, that might simplify CuBLAS development, before I commit to the features CuBLAS offers to achieve my project's goals. Again, thanks in advance for the comments.
Most of the documentation that comes with the CUDA toolkit & SDK downloads is about CUDA generally, not CuBLAS specifically. Start with the CUBLAS_Library_2.3.pdf file if you're just going to use CuBLAS; you won't need to write your own CUDA kernels. If you're already using a CPU BLAS, CuBLAS shouldn't be difficult to pick up. (And if you're not, consider trying an optimized CPU BLAS before CuBLAS, since it will be easier to program.)
If you're coding on .NET, then the easiest way to use CuBLAS is probably via platform-invoke calls into cublas.dll. Be sure to keep straight which arrays are in host (CPU) memory, and which are in device (GPU) memory.
Keep in mind that CUDA & CuBLAS aren't magic bullets. Performance depends on a lot of factors (especially transfers across the PCIe bus), and simply swapping CUBLAS calls for CPU-BLAS calls may not give you speedups. You may have to make more substantial changes to your own code to get performance improvements. Those other guides you mention are very useful for understanding the CUDA architecture and its bottlenecks.
EDIT: I wasn't clear about the boundary between user code and kernel code. CuBLAS is a library of pre-built, optimized CUDA kernels. If you only need BLAS functionality, you do not need to write your own kernels; instead, just call CuBLAS functions. When performance tuning, you shouldn't need to tweak the CuBLAS kernels themselves, but you may need to change how and when you call them, and how you use memory, so as to minimize the number of transfers across the PCI Express bus.
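To make that boundary concrete, here is a hedged sketch of calling CuBLAS with no kernel code at all. It uses the modern v2 API for illustration; the 2.3-era API described in that PDF is similar but handle-free.

```cpp
// Multiply two column-major matrices on the GPU purely through CuBLAS calls.
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int n = 4;
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n);

    float *dA, *dB, *dC;                               // device (GPU) memory...
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, a.data(), n * n * sizeof(float), cudaMemcpyHostToDevice); // ...filled from host memory
    cudaMemcpy(dB, b.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; every matrix pointer below is device memory
    cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(c.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost); // one transfer back over PCIe
    cublasDestroy(h);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```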

OpenCL: does it play well with OpenMP, can I connect other languages to it, etc

The 1.0 spec for OpenCL came out just a few days ago (the spec is here) and I've just started to read through it. I want to know whether it plays well with other high-performance multiprocessing APIs like OpenMP (spec), and I want to know what I should learn. So, here are my basic questions:
If I am already using OpenMP, will that break OpenCL or vice-versa?
Is OpenCL more powerful than OpenMP? Or are they intended to be complementary?
Is there a standard way of connecting an OpenCL program to a standard C99 program (or any other language)? What is it?
Does anyone know if anyone is writing an OpenCL book? I'm reading the spec, but I've found books to be more helpful.
OpenMP and OpenCL are distinct, but can be made to work together. Neither of them should "break" the other.
For the sake of argument, let's assume there's a tradeoff between minimizing changes to an existing codebase and performance or computing power. OMP is "easy" in that you can apply it "magically" to embarrassingly parallel problems with a quick pragma or two.
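That "quick pragma" really can be a one-liner, as in this sketch (compile with OpenMP enabled, e.g. -fopenmp):

```cpp
// One pragma parallelizes an embarrassingly parallel loop across
// all available (identical) cores.
void scale(float* data, int n, float factor) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)
        data[i] *= factor;   // iterations are independent, so OpenMP can farm them out
}
```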
OpenCL introduces brand-new high-level concepts beyond typical OS threading models. Khronos probably doesn't want to say it out loud, but its genesis is in NVIDIA's CUDA. If you want to see how it works today, download the CUDA SDK and start playing. If you don't have an NVIDIA GPU, don't worry; there's a GPU-emulator software option. OpenCL is a handy abstraction of a GPU that should apply to CPUs, DSPs, and "accelerators" (Khronos' nickname for IBM's Cell BE and probably Intel's Larrabee).
OpenCL is not supposed to be "written directly in C99". It's referred to as a C99 extension since its syntax is similar/identical to C99 with some new keywords. You cannot call libc (or any other library) from a kernel.
You could use both, but theoretically OpenCL should be "better" (in that it's portable to more computing devices) if you're willing to port your code. You cannot use OpenMP pragmas in an OpenCL kernel.
See also:
http://wikipedia.org/wiki/OpenCL
CUDA
LLVM
For the most part, OpenMP and OpenCL are independent of each other. They are both ways of giving the developer access to parallelism on their platform.
OpenMP is designed to work well with multiple (identical) processors, where work that is approximately equal can be (nearly) automatically farmed out between them.
OpenCL is a somewhat different beast, in that it really shines when working with special co-processor hardware. It allows you to offload some of the heavy-duty number crunching to the GPU or some other co-processor, like the Cell. However, it was also built with the idea that it could harness the main processors that are now common in multi-core computers. I would consider this feature secondary, and if this is all you intend to use OpenCL for, I would not recommend using it.
That said, I'd guess it would be somewhat challenging, though definitely not impossible, to get OpenMP and OpenCL to work together on the same problem.
The first thing to think about is what work you're giving to OpenCL. This would definitely be a case where you would only want OpenCL to run on the GPU/co-processor, not on the other main processors/cores, since OpenMP is already using those. It wouldn't (shouldn't) cause application errors to run OpenCL and OpenMP on the same main processor, but it will cause undesirable scheduling, where both OpenMP and OpenCL run slower because they spend a good chunk of their time switching back and forth between each other. This would also happen if you ran any other processor-hungry process on the same cores at the same time.
The other big thing to think about is how you're going to schedule the tasks that do run on the co-processor. It's true that you can feed a lot of work into one of the modern GPUs, but there is a lot to think about with the pipeline and memory usage. What you wouldn't want is eight different OpenMP threads each trying to send their own work to the co-processor at the same time. I would recommend having only one thread manage all the interactions with the co-processor, so it can feed the device work in an efficient manner.
That said, I'm sure there are programs with multiple types of tasks happening at the same time, where one type could always be farmed out to the co-processor and another kind handled by the multi-core main processor. That would be a fine example of a time to mix OpenMP and OpenCL.
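A hedged sketch of that split using OpenMP tasks: one task owns all co-processor traffic while the rest of the team handles CPU work. gpu_submit and cpu_chunk are hypothetical stand-ins for your OpenCL enqueue path and your CPU code path.

```cpp
void gpu_submit(int job);   // hypothetical: enqueue one job on the OpenCL command queue
void cpu_chunk(int job);    // hypothetical: process one job on the CPU

void run(int gpu_jobs, int cpu_jobs) {
    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task                      // exactly one feeder task talks to the co-processor,
        for (int j = 0; j < gpu_jobs; ++j)    // so it gets its work in an orderly, efficient stream
            gpu_submit(j);

        for (int j = 0; j < cpu_jobs; ++j) {
            #pragma omp task                  // CPU jobs spread across the remaining threads
            cpu_chunk(j);
        }
    }
}
```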
Good Luck!
OpenCL is supposed to be written directly in C99, AFAIK? There are header files available for it now, anyhow.
By the way, there is work on mapping OpenMP to GPGPU using CUDA.

Resources