Can Halide be used with SOC Platforms? - halide

I'm trying to use Halide with Texas Instruments' TDA2x platform, which is a SoC with a DSP and other vision processors in it.
I basically want to run code generated by Halide on the DSP of the TDA2x.
The TDA2x supports TI's cl6x compiler.
How can I generate code from Halide that compiles with the cl6x compiler?

Sorry, our only DSP backend is Hexagon. We don't have the ability to generate TDA2x code.
Halide compiles directly to machine code, not to C, so the cl6x compiler is not useful here. Halide does have a C-generating backend, but the performance of that code is often an order of magnitude worse than generating machine code directly, and it's not going to know about whatever intrinsics you need to program the TDA2x effectively, so I wouldn't depend on it.

Related

Do all compiled codes have same speed no matter what language they were written in?

Suppose I write a program in both Python and C++ and turn each into an executable. Will both executables run at the same speed, or will it vary? (I guess it shouldn't, because both should now be in machine code form.)
Usually not, even assuming both programs implement the same algorithm. The runtime speed depends a lot on the compiler itself (e.g. tinycc for C, versus GCC or Clang), and even on its version and compilation flags (e.g. -Os vs -O2 with g++). BTW, Python is compiled to bytecode, not to machine code.
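To make the bytecode point concrete, here is a small sketch using the standard `dis` module. It assumes the CPython interpreter; the function `add` is only an illustrative name, and the exact opcode names differ between CPython versions, so no particular listing is shown.

```python
import dis

def add(a, b):
    return a + b

# CPython has already compiled the function body to bytecode;
# the raw bytes are stored on the function's code object.
raw = add.__code__.co_code
print(type(raw).__name__, len(raw), "bytes")

# Decode those bytes into the named instructions the interpreter executes.
opnames = [ins.opname for ins in dis.get_instructions(add)]
print(opnames)
```

The instructions listed are operations for the Python virtual machine, not for the CPU, which is one reason a bundled Python "executable" does not behave like a compiled C++ binary.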
Of course, some software spends most of its CPU time elsewhere (e.g. in a relational database manager such as PostgreSQL). Rewriting it in C++ instead of Python won't gain much performance. And some software is mostly I/O bound (e.g. tar(1) used without compression).
Lastly, some C++ programs generate machine code at runtime (e.g. using AsmJit) with partial evaluation techniques, which may give a huge speedup.
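The partial evaluation idea can be sketched in Python, with the caveat that this generates Python bytecode rather than machine code, so it only illustrates the technique and not the speedup an AsmJit-style generator might give. The helper name `make_power` is hypothetical.

```python
def make_power(exponent):
    """Generate and compile a function specialized for one fixed exponent."""
    if exponent < 1:
        raise ValueError("exponent must be at least 1")
    # The exponent is known now, so the loop is unrolled into the source text.
    body = " * ".join(["x"] * exponent)
    source = f"def power(x):\n    return {body}\n"
    namespace = {}
    exec(compile(source, "<generated>", "exec"), namespace)
    return namespace["power"], source

cube, cube_source = make_power(3)
print(cube_source)  # shows the generated, loop-free function
print(cube(4))      # 64
```

The generated function contains no loop and no exponent parameter: everything that was known ahead of the call has been folded into the code itself.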
On Linux, you could generate some C or C++ code at runtime, compile it as a temporary plugin, then dlopen(3) that temporary plugin and fetch new function pointers using dlsym(3). Adapt the manydl.c example to your needs.
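A rough sketch of that plugin workflow follows. The answer describes doing this from C or C++; this version swaps in Python's `ctypes` as the host, because `ctypes.CDLL` loads a shared object much as dlopen(3) does. It assumes a C compiler reachable as `cc` or `gcc` that accepts `-shared -fPIC`; the file names and the `scale` function are made up for illustration.

```python
import ctypes
import os
import shutil
import subprocess
import tempfile

# C source "generated" at runtime; here it is just a fixed string.
C_SOURCE = "int scale(int x) { return 3 * x; }\n"

def build_and_load(source):
    """Compile source into a temporary shared object and load it.

    Returns None when no C compiler is found or the build fails.
    """
    compiler = shutil.which("cc") or shutil.which("gcc")
    if compiler is None:
        return None
    workdir = tempfile.mkdtemp(prefix="plugin_")
    c_path = os.path.join(workdir, "plugin.c")
    so_path = os.path.join(workdir, "plugin.so")
    with open(c_path, "w") as handle:
        handle.write(source)
    try:
        subprocess.run(
            [compiler, "-shared", "-fPIC", "-o", so_path, c_path],
            check=True,
            capture_output=True,
        )
        # Loading the shared object plays the role of dlopen(3).
        return ctypes.CDLL(so_path)
    except (OSError, subprocess.CalledProcessError):
        return None

plugin = build_and_load(C_SOURCE)
if plugin is not None:
    # Looking up the attribute plays the role of dlsym(3).
    plugin.scale.argtypes = [ctypes.c_int]
    plugin.scale.restype = ctypes.c_int
    print(plugin.scale(7))
else:
    print("no usable C compiler found; skipping")
```

Unlike the bytecode case, the loaded function here is real machine code produced by the C compiler at runtime.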
Also, C++ is a very difficult language to learn. Read a good book about it.
And of course read the Dragon book, since an entire book is needed to answer your question!

How AHIR compilation toolchain (A Hardware Intermediate Representation) is used to compile llvm-IR to VHDL?

I want to make an application written in C/C++ work on an FPGA. The application works on huge arrays of binary bits (bitsets). This could work well on a data-parallel architecture, but GPUs may not help because of the huge data transfers between host and device and the little computation done on the GPU (basically setting and unsetting a bit). So an FPGA sounded more promising for this application.
I came across AHIR (A Hardware Intermediate Representation) and a compilation toolchain that can compile C algorithms [without recursion] to VHDL. And here is a link where netFPGA worked on this before. They used AHIR as the compilation toolchain with llvm-IR, but it's not clear how it's done. I would be grateful if anyone could explain the sequence of steps to take, assuming I already have an llvm-IR source as input to the AHIR compiler.
A basic example of a C program compiled to VHDL using AHIR would help me too.
Thanks!

How to compare OpenCL with native code performance properly?

Intel provides some advice on comparing OpenCL with native code here.
Are there any additional recommendations? I am particularly interested in whether OpenCL is usually compared with straightforward C/C++ code, or whether optimizations of the sequential code are also taken into account. What is the case with intrinsic functions?

How can a compiler be cross platform(hardware)?

I just realized that binary compilers convert source code to the binary of the destination platform. Kind of obvious... but if a compiler works that way, then how can the same compiler be used for different systems like x86, ARM, MIPS, etc.?
Shouldn't it have to "know" the machine language of the hardware platform to be able to build the binary? Does a compiler (like gcc) know the machine language of every single platform that is supported?
How is such a system possible, and how can a compiler be optimized for that many platforms at the same time?
Yes, they have to "know" the machine language for every single platform they support. This is required to generate machine code. However, compilation is a multi-step process. Usually, the first steps of the compilation are common to most architectures.
Taken from Wikipedia:
Structure of a compiler
Compilers bridge source programs in high-level languages with the underlying hardware. A compiler requires determining the correctness of the syntax of programs, generating correct and efficient object code, run-time organization, and formatting output according to assembler and/or linker conventions. A compiler consists of three main parts: the frontend, the middle-end, and the backend.
The front end checks whether the program is correctly written in terms of the programming language syntax and semantics. Here legal and illegal programs are recognized. Errors are reported, if any, in a useful way. Type checking is also performed by collecting type information. The frontend then generates an intermediate representation or IR of the source code for processing by the middle-end.
The middle end is where optimization takes place. Typical transformations for optimization are removal of useless or unreachable code, discovery and propagation of constant values, relocation of computation to a less frequently executed place (e.g., out of a loop), or specialization of computation based on the context. The middle-end generates another IR for the following backend. Most optimization efforts are focused on this part.
The back end is responsible for translating the IR from the middle-end into assembly code. The target instruction(s) are chosen for each IR instruction. Register allocation assigns processor registers for the program variables where possible. The backend utilizes the hardware by figuring out how to keep parallel execution units busy, filling delay slots, and so on. Although most algorithms for optimization are in NP, heuristic techniques are well-developed.
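The three-part structure quoted above can be sketched as a toy compiler for integer expressions. This is only an illustration under strong assumptions: the tuple-based IR, the function names, and both "targets" (a stack machine and a three-address register machine) are invented here and do not correspond to GCC, LLVM, or any real instruction set.

```python
import ast

# Front end: parse source text and lower it to a tiny tuple-based IR.
_OPS = {ast.Add: "add", ast.Mult: "mul"}

def lower(source):
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return ("const", node.value)
        if isinstance(node, ast.Name):
            return ("var", node.id)
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return (_OPS[type(node.op)], walk(node.left), walk(node.right))
        raise SyntaxError("unsupported construct in the toy language")
    return walk(ast.parse(source, mode="eval").body)

# Middle end: target-independent optimization (constant folding).
def fold(ir):
    if ir[0] in ("const", "var"):
        return ir
    op, left, right = ir[0], fold(ir[1]), fold(ir[2])
    if left[0] == "const" and right[0] == "const":
        value = left[1] + right[1] if op == "add" else left[1] * right[1]
        return ("const", value)
    return (op, left, right)

# Back end A: hypothetical stack machine.
def emit_stack(ir):
    if ir[0] == "const":
        return [f"PUSH {ir[1]}"]
    if ir[0] == "var":
        return [f"LOAD {ir[1]}"]
    return emit_stack(ir[1]) + emit_stack(ir[2]) + [ir[0].upper()]

# Back end B: hypothetical register machine with three-address code.
def emit_registers(ir):
    lines = []
    def gen(node):
        if node[0] in ("const", "var"):
            reg = f"r{len(lines)}"  # every emitted line defines one new register
            lines.append(f"{reg} = {node[1]}")
            return reg
        a, b = gen(node[1]), gen(node[2])
        reg = f"r{len(lines)}"
        lines.append(f"{reg} = {node[0]} {a}, {b}")
        return reg
    gen(ir)
    return lines

ir = fold(lower("2 * 3 + x"))
print(ir)                  # ('add', ('const', 6), ('var', 'x'))
print(emit_stack(ir))      # ['PUSH 6', 'LOAD x', 'ADD']
print(emit_registers(ir))  # ['r0 = 6', 'r1 = x', 'r2 = add r0, r1']
```

Only the two `emit_*` functions know anything about a target; the parser and the constant folder are shared, which is the sense in which one compiler can serve many architectures.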
See also this article, which describes the structure of a compiler, and this one, which deals with cross compilers.
The http://llvm.org/ project will answer all of your questions in this regard :)
In a nutshell, cross-HW compilers emit an "intermediate representation" of the code, which is HW agnostic and is then customized via the native toolchain.
Yes, it is possible; it's called a cross compiler. Compilers usually first generate object code, which is not understandable by the current machine but can be migrated to the destination machine with another compiler. Next, the object code is "compiled" again and linked with external libraries of the target machine.
TL;DR: Yes, the compiler knows the target code, but you can compile on other hardware.
I recommend reading the attached links for more information.
Every platform has its own toolchain; a toolchain includes gcc, gdb, ld, nm, etc.
Let's take gcc as a specific example. The GCC source code has many layers, including architecture-dependent and architecture-independent parts. The architecture-dependent part contains procedures to handle architecture-specific things like the stack, function calls, and floating point operations. We need to cross compile the gcc source code for a specific architecture, such as ARM. You can see the steps here for reference: http://www.ailis.de/~k/archives/19-arm-cross-compiling-howto.html#toolchain.
This architecture dependent part is responsible for handling machine language operations.

GPU Programming?

I'm new to the GPU programming world. I've tried reading Wikipedia and Googling, but I still have several questions:
I downloaded some GPU examples for CUDA. There were some .cu files and some CPP files, but all the code was normal C/C++ code, with just some weird functions like cudaMemcpyToSymbol; the rest was pure C code. The question is, is the .cu code compiled with nvcc and then linked with gcc? Or how is it programmed?
If I code something to run on a GPU, will it run on ALL GPUs, or just CUDA ones? Or is there a method to write for CUDA, a method to write for ATI, and a method to write for both?
To answer your second question:
OpenCL is the (only) way to go if you want to write platform independent GPGPU code.
ATI's website actually has a lot of resources for OpenCL if you search a little, and their example projects are very easy to modify into what you need, or just to read in order to understand the code.
The OpenCL spec and reference pages are also a very good source of knowledge:
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/
http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
There are also a lot of talks I would recommend that explain some of the core concepts and how to write fast code (this applies to CUDA too).
To almost answer your first question:
In OpenCL, the code is compiled at runtime to the specific GPU you're using (to guarantee speed).
You probably want to do some background reading on CUDA - it's not something you can just pick up by looking at a few code samples. There are about 3 different CUDA books on Amazon now, and there is a lot of reference material at http://developer.nvidia.com.
To answer your questions:
1. Yes, .cu files are compiled with nvcc to an intermediate form (PTX) - this is subsequently converted to GPU-specific code at run-time.
2. The generated code will run on a subset of NVIDIA GPUs, the size of the subset depending on what CUDA capabilities you use in your code.
Completing the answer given by #nulvinge, I'd say that OpenCL is to GPU programming what OpenGL is to GPU rendering. But it's not the only option for multi-architecture development: you could also use DirectCompute. I wouldn't say it's the best option, only worth it if you want your code running on every DirectX11-compatible GPU, which includes some Intel graphics chips too, right?
But even if you are thinking of doing some GPU programming with OpenCL, do not forget to study the architecture of the platforms you're using. ATI CPUs, GPUs and NVIDIA GPUs have big differences, and your code needs to be tuned for each platform you're using if you want to get the most out of it...
Fortunately both NVIDIA and AMD have Programming Guides to help you :)
In addition to the previous answers: for CUDA you need an NVIDIA card/GPU, unless you have access to a remote one. For that I would recommend this course from Coursera:
Heterogeneous Parallel Programming
It not only gives an introduction to CUDA and OpenCL, the memory model, tiling, handling boundary conditions and performance considerations, but also covers directive-based languages such as OpenACC, a high-level language for expressing parallelism in your code that leaves most of the parallel programming work to the compiler (good to start with). The course also has an online platform where you can use their GPU, which is good for starting GPU programming without worrying about software/hardware setup.
If you want to write portable code that you can execute on different GPU devices and also on CPUs, you need to use OpenCL.
Actually, to configure your kernel you need to write host code in C. The configuration file might be shorter for CUDA kernels than for OpenCL ones.

Resources