I have a MacBook Pro (mid 2014) with Intel Iris graphics, an Intel Core i5 processor, and 16 GB of RAM. I am planning to learn some ray-traced 3D, but I am not sure whether my laptop can render fast without any NVIDIA hardware.
So I would appreciate it if someone could tell me whether I can use CUDA; if not, could you please explain, in a very easy way, how to enable OpenCL in After Effects? I am also looking for any beginner tutorial on how to write or build OpenCL programs.
CUDA works only on NVIDIA hardware, although there are some libraries that convert it to run on CPU cores (not the iGPU).
AMD is working on "hipifying" old CUDA kernels, translating them into HIP or similar portable code so they become more general.
OpenCL works everywhere, as long as both the hardware and the OS support it. AMD, NVIDIA, Intel, Xilinx, Altera, Qualcomm, MediaTek, Marvell, Texas Instruments and others support it. Maybe even the Raspberry Pi boards will support it in the future.
The OpenCL documentation on stackoverflow.com is still under development, but there are some useful sites:
AMD's tutorial
AMD's parallel programming guide for OpenCL
NVIDIA's learning material
Intel's HD Graphics coding tutorial
An overview of hardware, benchmarking, and parallel programming topics
A blog
The Scratchapixel ray tracing tutorial (I read it and then wrote a teraflops GPU version of it)
If your GPU is the Iris Graphics 6100:
Your integrated GPU has 48 execution units, each with 8 ALUs that can do add, multiply, and many other operations. Its clock frequency can rise to 1 GHz. That gives a maximum of 48 * 8 * 2 (1 add + 1 multiply) * 1 GHz = 768 giga floating-point operations per second, but only if each ALU can concurrently do 1 addition and 1 multiplication. 768 GFlops is more than a low-end discrete GPU such as AMD's R7-240. (As of 19.10.2017, AMD's low end is the RX 550 with 1200 GFlops, faster than Intel's Iris Plus 650, which is roughly 900 GFlops.) Ray tracing needs to re-access a lot of geometry data, so a device should have its own memory (as NVIDIA and AMD cards do) to leave the CPU free to do its own work.
How you install OpenCL on a computer varies with the OS and the hardware, but building software on a machine that already has OpenCL installed follows roughly the same steps (a minimal host-side sketch is shown after this list):
Query the platforms. The result can be AMD, Intel, or NVIDIA, duplicates of these (because of overlapping installations of wrong drivers), or experimental platforms that predate support for a newer OpenCL version.
Query the devices of a platform (or of all platforms). This gives the individual devices (and their duplicates if there are driver errors or other things to fix).
Create a context (or several) using a platform.
Using a context (so everything inside it gets implicit synchronization):
Build programs from kernel strings. A CPU usually takes less time than a GPU to build a program (there is a binary-load option to shortcut this).
Build kernels (now as objects) from the programs.
Create buffers, either from host-side buffers or as OpenCL-managed buffers.
Create a command queue (or multiple)
Just before a computation (or a series of computations):
Select buffers for a kernel as its arguments.
Enqueue buffer write (or map/unmap) operations on the "input" buffers.
Compute:
Enqueue an ND-range kernel (specifying which kernel runs and with how many work items/threads).
Enqueue buffer read (or map/unmap) operations on the "output" buffers.
Don't forget to synchronize with the host using clFinish() if you haven't used a blocking buffer read.
Use your accelerated data.
After OpenCL is no longer needed:
Be sure all command queues are empty / finished doing kernel work.
Release all in the opposite order of creation
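Here is a minimal host-side sketch of those steps in C, assuming a single platform with a GPU device; the kernel name "scale", the buffer size, and the stripped-down error handling are placeholders for illustration, not a full implementation:

    /* Minimal OpenCL host-side sketch of the steps above (hypothetical
       "scale" kernel; real code should check every return value). */
    #include <stdio.h>
    #include <CL/cl.h>              /* on macOS: #include <OpenCL/opencl.h> */

    static const char *src =
        "__kernel void scale(__global float *data, float k) {"
        "    size_t i = get_global_id(0);"
        "    data[i] *= k;"
        "}";

    int main(void)
    {
        float host[1024];
        for (int i = 0; i < 1024; ++i) host[i] = (float)i;

        /* query platform and device */
        cl_platform_id platform; cl_device_id device; cl_int err;
        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

        /* context and command queue */
        cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        /* build a program from the kernel string, then build the kernel object */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", &err);

        /* buffer, kernel arguments, then write -> compute -> read */
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(host), NULL, &err);
        float factor = 2.0f;
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clSetKernelArg(k, 1, sizeof(float), &factor);

        size_t global = 1024;
        clEnqueueWriteBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, sizeof(host), host, 0, NULL, NULL);
        clFinish(q);                /* redundant with blocking reads, shown for completeness */

        printf("host[10] = %f\n", host[10]);   /* expect 20.0 */

        /* release everything in the opposite order of creation */
        clReleaseMemObject(buf);
        clReleaseKernel(k);
        clReleaseProgram(prog);
        clReleaseCommandQueue(q);
        clReleaseContext(ctx);
        return 0;
    }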
If you need to accelerate an open-source program, you can swap a parallelizable hotspot loop for a simple OpenCL kernel, provided it doesn't already have some other acceleration support. For example, you could accelerate the air-pressure and heat-advection part of the Powder Toy sandbox simulator.
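As a rough idea of what such a swap looks like, here is a hypothetical OpenCL kernel for a per-cell grid update; the field names, grid layout, and the simple diffusion rule are made up for illustration and are not the actual Powder Toy code:

    /* Hypothetical kernel replacing a serial per-cell update loop.
       Each work item updates one grid cell; the border is left untouched. */
    __kernel void diffuse_heat(__global const float *heat_in,
                               __global float       *heat_out,
                               const int width,
                               const int height,
                               const float rate)
    {
        int x = get_global_id(0);
        int y = get_global_id(1);
        if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1)
            return;

        int i = y * width + x;
        /* blend the cell with the average of its four neighbours */
        float neighbours = heat_in[i - 1] + heat_in[i + 1]
                         + heat_in[i - width] + heat_in[i + width];
        heat_out[i] = heat_in[i] + rate * (0.25f * neighbours - heat_in[i]);
    }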
Yes, you can, because OpenCL is supported natively by macOS.
From your question it appears you are not seeking advice on programming, which would have been the appropriate subject for Stack Overflow. The first search hit on Google explains how to turn on OpenCL accelerated effects in After Effects (Project Settings dialog -> Video Rendering and Effects), but I have no experience with that myself.
I kind of want to get the Intel Xeon Phi co-processor, since there is a model that seems to be selling for $230. I have two questions: can I fully utilize its capabilities using just GCC along with OpenMP, or will I need the Intel compiler? Also, what is it about this model that makes it so cheap?
http://www.amazon.com/Intel-BC31S1P-Xeon-31S1P-Coprocessor/dp/B00OMCB4JI/ref=sr_1_2?ie=UTF8&qid=1444411560&sr=8-2&keywords=intel+xeon+phi
The 3100 series is a first-generation Xeon Phi (codenamed Knights Corner, abbreviated KNC).
Using GCC for Xeon Phi KNC programming is definitely not a perfect idea. See, for example: Xeon Phi Knights Corner intrinsics with GCC.
So the Intel Compiler is strongly recommended for KNC. And yes, for non-commercial use you can apply for a free Intel Compilers license here: https://software.intel.com/en-us/qualify-for-free-software (this is a fairly new program that was not available in the past).
The quoted KNC price tag is low, but I periodically see KNC on sale for similar prices (so at least it's not an "incomplete" Phi, and it's not a scam, although Gilles' point about passive cooling is valid). I don't know which problems you work on, but you should be aware that KNC is above all suited to highly parallel workloads. There is a good reference for the types of applications that can benefit from Xeon Phi KNC: https://software.intel.com/en-us/articles/intel-xeon-phi-coprocessor-applications-and-solutions-catalog
As I mentioned at the beginning, you are asking about the first-generation Xeon Phi. Many things (including the GCC answer) will likely change with the introduction of the second-generation Xeon Phi (codenamed Knights Landing, KNL), to be publicly released in roughly a year.
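For a flavour of what the Intel compiler adds for KNC, here is a minimal offload-style sketch; the pragma is Intel's offload extension (not understood by GCC), and the array names, sizes, and the trivial loop are hypothetical:

    /* Hypothetical offload sketch for Xeon Phi KNC, to be built with icc.
       GCC does not understand the offload pragma, which is part of why
       the Intel compiler is recommended above. */
    #include <stdio.h>

    #define N 1000000

    static float a[N], b[N];

    int main(void)
    {
        for (int i = 0; i < N; ++i) a[i] = (float)i;

        /* run the loop on the first coprocessor (mic:0); in/out describe
           which arrays are copied to and from the card */
        #pragma offload target(mic:0) in(a) out(b)
        {
            #pragma omp parallel for
            for (int i = 0; i < N; ++i)
                b[i] = a[i] * 2.0f;
        }

        printf("b[10] = %f\n", b[10]);
        return 0;
    }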
GCC permits you to compile code for the Xeon Phi and run it, and I believe it does quite a good job of that. Indeed, AFAIK, GCC is the compiler used for building the Linux environment available on the Xeon Phi. However, to take full advantage of the Xeon Phi's potential performance, I would strongly encourage you to use the Intel compiler. As a matter of fact, and unless I'm greatly mistaken, you can download and install the Intel compiler suite for free for personal use.
Regarding the Xeon Phi card, it comes cheap not really because it lacks anything one would want from a Xeon Phi card, but more because it is a passively cooled card. That means that, unless you tinker together some cooling device with cardboard and fans, you won't be able to slot the card into a standard PC and use it. You'll need a rackable server, which doesn't come cheap and is usually very noisy. So if you've got a server to put the card in, this is a bargain. But if you don't, you'd better think it through.
I've been asked to provide a reverb algorithm for an audio interface device built around a 160 MHz ARM processor. It's a fairly lightweight reverb effect written in C. However, my knowledge is a little lacking when it comes to low-level architecture and performance testing and measurement.
I need to provide at least some estimates on how it will perform on the device's CPU, as they would like to keep it within 3 - 5%. So far I've followed these steps, so please let me know if I'm at least on the right track.
In Xcode, I disassembled the .c file containing all of the reverb processing and counted the assembly instructions in the callback function that processes the audio. At 256 samples per block, I'm looking at around 400,000 assembly instructions.
Is there any way to roughly estimate how this algorithm will perform on a 160 MHz ARM processor? The audio library I'm using for I/O has a measurement of CPU load, and I'm getting between 2 - 3% on my Mac Pro for the callback routine.
Am I going about this the right way? Any suggestions to provide an estimate on this?
Thanks.
You need a lot more information about the processor's particular implementation of the ARM ISA than just the MHz. Factors affecting performance include the use of multi-cycle instructions, superscalar dispatch/retirement capabilities, pipeline interlocks, cache size and policy affecting the hit ratios, memory latencies, etc. Also relevant is how well the compiler you use optimizes for your chosen ARM implementation.
One can easily end up with well over a 10X CPI (cycles-per-instruction) difference in machine code execution between a desktop PC and an embedded RISC CPU, as well as the actual machine code being very different.
It's usually easier to benchmark your code.
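If you can run on the target (or a comparable ARM dev board), a crude but effective approach is to time the callback over many blocks and compare that against the real-time budget. Here is a sketch, where process_reverb_block(), the block size, and the sample rate are hypothetical placeholders for your own code and numbers:

    /* Rough on-target benchmark sketch: time the (placeholder) reverb
       callback over many blocks and relate it to the real-time budget. */
    #include <stdio.h>
    #include <time.h>

    #define BLOCK_SIZE   256
    #define SAMPLE_RATE  48000
    #define NUM_BLOCKS   10000

    /* stand-in for the real reverb callback */
    static void process_reverb_block(float *block, int n)
    {
        for (int i = 0; i < n; ++i)
            block[i] *= 0.5f;
    }

    int main(void)
    {
        static float block[BLOCK_SIZE];
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (int i = 0; i < NUM_BLOCKS; ++i)
            process_reverb_block(block, BLOCK_SIZE);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double elapsed   = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
        double per_block = elapsed / NUM_BLOCKS;
        double budget    = (double)BLOCK_SIZE / SAMPLE_RATE;  /* seconds of audio per block */

        printf("per block: %.6f s, budget: %.6f s, load: %.1f %%\n",
               per_block, budget, 100.0 * per_block / budget);
        return 0;
    }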
I read the OpenCL overview, and it states that it is suitable for code that runs on CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math and image-type operations. I didn't see anything for, say, strings.
This makes me wonder: what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires the processing of arrays of strings?
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this imply that the CPU cores are not OpenCL compliant, or is no such inference intended?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.
You can think of OpenCL as a combination of a runtime (for device discovery and queueing) and a C-based programming language. This programming language has native vector types and built-in functions and operations for doing all sorts of fun stuff to these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it is the responsibility of the implementation to map that to the actual vector ISA of your hardware.
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you wouldn't have to worry about writing two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know. This assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he doesn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.
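To make the vector-type point concrete, here is a small hypothetical OpenCL C kernel built around the float4 type; each work item processes four floats, and it is up to the implementation to map the vector math onto SSE, NEON, or GPU SIMD lanes:

    /* Hypothetical SAXPY-style kernel using the built-in float4 vector type. */
    __kernel void saxpy4(__global const float4 *x,
                         __global float4       *y,
                         const float a)
    {
        size_t i = get_global_id(0);
        y[i] = a * x[i] + y[i];    /* operators act element-wise on float4 */
    }

The same kernel source runs unchanged on a CPU or a GPU device; only the compiled code underneath differs.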
No links, but I would assume this is because string algorithms tend to involve a lot of dynamic memory allocation and branching, neither of which GPGPUs are well suited for. GPGPUs also have a lot in common with vector processors, so working on units of differently sized blocks of memory (which string algorithms generally do, since you usually don't have a homogeneous group of strings) yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.
This makes me wonder: what would you run on a CPU via OpenCL?
I prefer to use OpenCL to offload work from the CPU to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for CPU use. Such limitations can be memory size, a memory bottleneck, a low clock speed, or the PCIe bus getting in the way.
I say I like using a separate kernel for the CPU because I think all kernels should be tweaked to run on their target hardware. I even like to have an OpenMP backup plan, as most algorithms I use get tested out in this manner ahead of time.
I suppose it is best practice to test a GPU kernel on the CPU to make sure it runs as expected. If a user of your software has OpenCL installed but only a CPU (or a low-end GPU), it's nice to be able to execute the same code on the different devices.
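As a small sketch of that fallback, the host code can ask for a GPU device first and drop back to the CPU device if none is available; this assumes a single platform and leaves out most error handling:

    /* GPU-first, CPU-fallback device selection sketch (single platform). */
    #include <stdio.h>
    #include <CL/cl.h>              /* on macOS: #include <OpenCL/opencl.h> */

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;

        if (clGetPlatformIDs(1, &platform, NULL) != CL_SUCCESS) {
            fprintf(stderr, "no OpenCL platform found\n");
            return 1;
        }

        /* try a GPU first, then fall back to the CPU device */
        if (clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL) != CL_SUCCESS &&
            clGetDeviceIDs(platform, CL_DEVICE_TYPE_CPU, 1, &device, NULL) != CL_SUCCESS) {
            fprintf(stderr, "no usable OpenCL device found\n");
            return 1;
        }

        char name[256];
        clGetDeviceInfo(device, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("using device: %s\n", name);
        return 0;
    }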
I have studied a few things about instruction re-ordering by processors and Tomasulo's algorithm.
In an attempt to understand this topic a bit more, I want to know if there is ANY way to see (get a trace of) the actual dynamic reordering done for a given program.
I want to give an input program and see the "out of order instruction execution trace" of my program.
I have access to an IBM-P7 machine and an Intel Core2Duo laptop. Also please tell me if there is an easy alternative.
You have no access to the actual reordering done inside the CPU (there is no publicly known way to enable tracing). But there are some emulators of reordering, and some of them can give you useful hints.
For modern Intel CPUs (Core 2, Nehalem, Sandy Bridge, and Ivy Bridge) there is the "Intel(R) Architecture Code Analyzer" (IACA) from Intel. Its homepage is http://software.intel.com/en-us/articles/intel-architecture-code-analyzer/
This tool lets you see how a linear fragment of code will be split into micro-operations and how they will be scheduled onto execution ports. It has some limitations, and it is only an inexact model of the CPU's micro-op reordering and execution.
There are also some "external" tools for emulating x86/x86_64 CPU internals. I can recommend PTLsim (or the derived MARSSx86):
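Typical IACA usage is to mark the fragment of interest with the macros from iacaMarks.h (shipped with the tool) and then run the iaca binary on the compiled object file; the loop body below is a made-up example, and the exact command-line flags depend on the IACA version:

    /* Sketch of IACA usage: mark the region of interest, compile, analyze. */
    #include "iacaMarks.h"

    float sum_scaled(const float *a, int n, float k)
    {
        float s = 0.0f;
        for (int i = 0; i < n; ++i) {
            IACA_START              /* begin the analyzed fragment */
            s += a[i] * k;
        }
        IACA_END                    /* end the analyzed fragment */
        return s;
    }

    /* Then, roughly:
     *   gcc -c -O2 sum.c -o sum.o
     *   iaca -arch SNB sum.o
     * prints the estimated micro-op breakdown and port assignment
     * for the marked loop body. */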
PTLsim models a modern superscalar out of order x86-64 compatible processor core at a configurable level of detail ranging ... down to RTL level models of all key pipeline structures. In addition, all microcode, the complete cache hierarchy, memory subsystem and supporting hardware devices are modeled with true cycle accuracy.
But PTLsim models a generic "PTL" CPU, not a real AMD or Intel CPU. The good news is that this PTL core is out-of-order and based on ideas from real cores:
The basic microarchitecture of this model is a combination of design features from the Intel Pentium 4, AMD K8 and Intel Core 2, but incorporates some ideas from IBM Power4/Power5 and Alpha EV8.
Also, the thesis at http://es.cs.uni-kl.de/publications/datarsg/Senf11.pdf says that the JavaHASE applet is capable of emulating different simple CPUs and even includes a Tomasulo example.
Unfortunately, unless you work for one of these companies, the answer is no. Intel/AMD processors don't even schedule the (macro) instructions you give them. They first convert those instructions into micro operations and then schedule those. What these micro instructions are and the entire process of instruction reordering is a closely guarded secret, so they don't exactly want you to know what is going on.
I am a C++ programmer who develops image and video algorithms. Should I learn NVIDIA CUDA, or is it one of those technologies that will disappear?
CUDA is currently a single-vendor technology from NVIDIA and therefore doesn't have the multi-vendor support that OpenCL does.
However, it's more mature than OpenCL, has great documentation, and the skills you learn using it will transfer easily to other parallel data-processing toolkits.
As an example of this, read Data Parallel Algorithms by Steele and Hillis and then look at the NVIDIA tutorials; there's a clear link between the two, yet the Steele/Hillis paper was written over 20 years before CUDA was introduced.
Finally, the FCUDA project is working to allow CUDA code to target non-NVIDIA hardware (FPGAs).
CUDA should stick around for a while, but if you're just starting out, I'd recommend looking at OpenCL or DirectCompute. Both of these run on ATI as well as NVIDIA hardware, in addition to working on the vector units (SSE) of CPUs.
I think you should rather stick with OpenCL, which is an open standard supported by ATI, NVIDIA, and more. CUDA might not disappear in the coming years, but it is in any case not compatible with non-NVIDIA GPUs.
OpenCL might take some time to become pervasive, but I found learning CUDA very informative, and I don't think CUDA is going to be out of the limelight anytime soon. Besides, CUDA is easy enough that the time it takes to learn it is much shorter than CUDA's shelf life.
This is the era of high-performance, parallel computing. CUDA and OpenCL are the emerging technologies of GPU computing, which is itself a form of high-performance computing. If you are a passionate programmer and want to push the performance of parallel algorithms, you should really go for these technologies. The data-parallel part of your program will execute in a fraction of a second on a many-core GPU architecture, where it would usually take much longer on your CPU.