CUDA-like workflow for OpenCL

CUDA-like workflow for OpenCL - compilation

The typical example workflow for OpenCL programming seems to be focused on source code within strings, passed to the JIT compiler, then finally enqueued (with a specific kernel name); and the compilation results can be cached - but that's left for you the programmer to take care of.
In CUDA, the code is compiled in a non-JIT way to object files (alongside host-side code, but forget about that for a second), and then one just refers to device-side functions in the context of an enqueue or arguments etc.
Now, I'd like to have the second kind of workflow, but with OpenCL sources. That is, suppose I have some C host-side code my_app.c, and some OpenCL kernel code in a separate file, my_kernel.cl (which for the purpose of discussion is self-contained). I would like to be able to run a magic command on my_kernel.cl, get a my_kernel.whatever, link or faux-link that together with my_app.o, and get a binary. Now, in my_app.c I want to be able to somehow to refer to the kernel, even if it's not an extern symbol, as compiled OpenCL program (or program + kernel name) - and not get compilation errors.
Is this supported somehow? With nVIDIA's ICD or with one of the other ICDs? If not, is at least some of this supported, say, the magic kernel compiler + generation of an extra header or source stub to use in compiling my_app.c?

Look into SYCL, it offers single-source C++ OpenCL. However, not yet available on every platform.
https://www.khronos.org/sycl

There is already ongoing effort that enables CUDA-like workflow in TensorFlow, and it uses SYCL 1.2 - it is actively up-streamed.
Similarly to CUDA, SYCL's approach needs the following steps:
device registration via device factory ( device is called SYCL ) - done here: https://github.com/lukeiwanski/tensorflow/tree/master/tensorflow/core/common_runtime/sycl
operation registration for above device. In order to create / port operation you can either:
re-use Eigen's code since Tensor module has SYCL back-end ( look here: https://github.com/lukeiwanski/tensorflow/blob/opencl/adjustcontrastv2/tensorflow/core/kernels/adjust_contrast_op.cc#L416 - we just partially specialize operation for SYCL device and calling the already implemented functor https://github.com/lukeiwanski/tensorflow/blob/opencl/adjustcontrastv2/tensorflow/core/kernels/adjust_contrast_op.h#L91;
write SYCL code - it has been done for FillPhiloxRandom - see https://github.com/lukeiwanski/tensorflow/blob/master/tensorflow/core/kernels/random_op.cc#L685
SYCL kernel uses modern C++
you can use OpenCL interoperability - thanks to which you can write pure OpenCL C kernel code! - I think this bit is most relevant to you
The workflow is a bit different as you do not have to do an explicit instantiation of the functor templates as CUDA does https://github.com/lukeiwanski/tensorflow/blob/master/tensorflow/core/kernels/adjust_contrast_op_gpu.cu.cc or any .cu.cc file ( in fact you do not have to add any new files - avoids mess with the build system )
As well as this thing: https://github.com/lukeiwanski/tensorflow/issues/89;
TL;DR - CUDA can create "persistent" pointers, OpenCL needs to go through Buffers and Accessors.
Codeplay's SYCL compiler ( ComputeCpp ) at the moment requires OpenCL 1.2 with SPIR extension - these are Intel CPU, Intel GPU ( Beignet work in progress ), AMD GPU ( although older drivers ) - additional platforms are coming!
Setup instructions can be found here: https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
Our effort can be tracked in my fork of TensorFlow: https://github.com/lukeiwanski/tensorflow ( branch dev/eigen_mehdi )
Eigen used is: https://bitbucket.org/mehdi_goli/opencl ( branch default )
We are getting there! Contributions are welcome! :)

Related

How does Go make system calls?

As far as I know, in CPython, open() and read() - the API to read a file is written in C code. The C code probably calls some C library which knows how to make system call.
What about a language such as Go? Isn't Go itself now written in Go? Does Go call C libraries behind the scenes?

The short answer is "it depends".
Go compiles for multiple combinations of H/W and OS, and they all have different approaches to how syscalls are to be made when working with them.
For instance, Solaris does not provide a stable supported set of syscalls, so they go through the systems libc — just as required by the vendor.
Windows does support a rather stable set of syscalls but it is defined as a C API provided by a set of standard DLLs.
The functions exposed by those DLLs are mostly shims which use a single "make a syscall by number" function, but these numbers are not documented and are different between the kernel flavours and releases (perhaps, intentionally).
Linux does provide a stable and documented set of numbered syscalls and hence there Go just calls the kernel directly.
Now keep in mind that for Go to "call the kernel directly" means following the so-called ABI of the H/W and OS combo. For instance, on modern Linux on amd64 making a syscall requires filling a set of CPU registers with certain values, doing some other arrangements and then issuing the SYSENTER CPU instruction.
On Windows, you have to use its native calling convention (which is stdcall, not cdecl).

Yes go is now written in go. But, you don't need C to make syscalls.
An important thing to call out is that syscalls aren't "written in C." You can make syscalls from C on Unix because of <unistd.h>. In particular, how Linux defines this header is a little convoluted, but you can see from this file the general idea. Syscalls are defined with a name and a number. When you call read for example, what really happens behind the scenes is the parameters are setup in the proper registers/memory (linux expects the syscall number in eax) followed by the instruction syscall which fires interrupt 0x80. The OS has already setup the proper interrupt handlers that will receive this interrupt and the OS goes about doing whatever is needed for that syscall. So, you don't need something written in C (or a standard library for that matter) to make syscalls. You just need to understand the call ABI and know the interrupt numbers.
However, as #retgits points out golang's approach is to piggyback off the fact that libc already has all of the logic for handling syscalls. mksyscall.go is a CLI script that parses these libc files to extract the necessary information.
You can actually trace the life of a syscall if you compile a go script like:
package main
import (
"syscall"
)
func main() {
var buf []byte
syscall.Read(9, buf)
}
Run objdump -D on the resulting binary. The go runtime is rather large, so your best bet is to find the main function, see where it calls syscall.Read and then search for the offsets from there: syscall.Read calls syscall.syscall, syscall.syscall calls runtime.libcCall (which switches from the go ABI to C ABI compatibility so that arguments are located where the OS expects--you can see this in runtime, for darwin for example), runtime.libcCall calls runtime.asmcgocall, etc.
For extra fun, run that binary with gdb and continue stepping in until you hit the syscall.

The sys package takes care of the syscalls to the underlying OS. Depending on the OS you're using different packages are used to generate the appropriate calls. Here is a link to the README for Go running on Unix systems: https://github.com/golang/sys/blob/master/unix/README.md the parts on mksyscall.go, which are hand-written Go files which implement system calls that need special handling, and type files, should walk you through how it works.

The Go compiler (which translates the Go code to target CPU code) is written in Go but that is different to the run time support code which is what you are talking about. The standard library is mainly written in Go and probably knows how to directly make system calls with no C code involved. However, there may be a bit of C support code, depending on the target platform.

Runtime system for Stm32F103 Arm, GNAT Ada compiler

Id like to use Ada with Stm32F103 uc, but here is the problem - there is no build-in runtime system within GNAT 2016. There is another cortex-m3 uc by TI RTS included - zfp-lm3s, but seems like it needs some global updates, simple change of memory size/origin doesn't work.
So, there is some questions:
Does some body have RTS for stm32f103?
Is there any good books about low-level staff of cortex-m3 or other arm uc?
PS. Using zfp-lm3s rises this error, when i try to run program via GPS:
Loading section .text, size 0x140 lma 0x0
Load failed

The STM32F series is from STMicroelectronics, not TI, so the stm32f4 might seem to be a better starting point.
In particular, the clock code in bsp/setup_pll.adb should need only minor tweaking; use STM’s STM32CubeMX tool (written in Java) to find the magic numbers to set up the clock properly.
You will also find that the assembler code used in bsp/start*.S needs simplifying/porting to the Cortex-M3 part.
My Cortex GNAT Run Time Systems project includes an Arduino Due version (also Cortex-M3), which has startup code written entirely in Ada. I don’t suppose the rest of the code would help a lot, being based on FreeRTOS - you’d have to be very very careful about memory usage.

I stumbled upon this question while looking for a zfp runtime specific to the stm32l0xx boards. It doesn't look like one exists from what I can see, but I did stumble upon this guide to creating a new runtime from AdaCore, which might help anyone stuck with the same issue:
https://blog.adacore.com/porting-the-ada-runtime-to-a-new-arm-board

Implementation of putc in Versatile ARM LATEST Kernel-4.6

I am Trying to understand How linux printing
"Uncompressing Linux....... done, booting the kernel"
message even before it uncompressed itself in ARM Versatile Boad.
From this File the function decompress_kernel is writing the message through putstr() function which inturn have putc function which writing to hardware register uart.
putc is implemented in this file, putc writes directly to AMBA_UART_DR registers and these registers are different across architectures and also differs across different chips too.
But in the latest kernel-4.6 this was deprecated .
When i checked putc implemetation for ARM Versatile Boad in latest kernel its been deprecated so
how they implemented in latest kernel-4.6 where as rest of machine-specific code still exist?
How kernel is printing the banner in latest kernel?

Versatile board support code was converted to the multi-platform kernel model (ARCH_MULTIPLATFORM). Just like every other board support code of the same kind, now it takes putc() prototype from arch/arm/include/debug/uncompress.h.
Instead, the actual implementation of putc() is a generic assembly function coded into arch/arm/boot/compressed/debug.S.
Being generic, debug.S makes reference to few macros (addruart, waituart, senduart, busyuart) to get information about the actual UART hardware. These macros are defined in an include file selected by CONFIG_DEBUG_LL_INCLUDE (search arch/arm/Kconfig.debug for it). In case of the Versatile board CONFIG_DEBUG_LL_INCLUDE is defined as arch/arm/include/debug/pl01x.S, where in fact you find those macros.

How to pin a interrupt to a CPU in driver

Is it possible to pin a softirq, or any other bottom half to a processor. I have a doubt that this could be done from within a softirq code.
But then inside a driver is it possible to pin a particular IRQ to a
core.

From user mode, you can easily do this by writing to /proc/irq/N/smp_affinity to control which processor(s) an interrupt is directed to. The symbols for the code implementing this are not exported though, so it's difficult to do from the kernel (at least for a loadable module which is how most drivers are structured).
The fact that the implementing function symbols aren't exported is a sign that the kernel developers don't want to encourage this. Presumably that's because it takes control away from the user. And also embeds assumptions about number of processors and so forth into the driver.
So, to answer your question, yes, it's possible, but it's discouraged, and you would need to do one of several "ugly" things to implement it ((a) change kernel exports, (b) link your driver statically into main kernel, or (c) open/write to the proc file from kernel mode).
The usual way to achieve this is by writing a user-mode program (can even be a shell script) that programs core numbers/masks into the appropriate proc file. See Documentation/IRQ-affinity.txt in the kernel source directory for details.

How does bootstrapping work for gcc?

I was looking up the pypy project (Python in Python), and started pondering the issue of what is running the outer layer of python? Surely, I conjectured, it can't be as the old saying goes "turtles all the way down"! Afterall, python is not valid x86 assembly!
Soon I remembered the concept of bootstrapping, and looked up compiler bootstrapping. "Ok", I thought, "so it can be either written in a different language or hand compiled from assembly". In the interest of performance, I'm sure C compilers are just built up from assembly.
This is all well, but the question still remains, how does the computer get that assembly file?!
Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
Can someone explain this to me?

Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
I understand what you're asking... what would happen if we had no C compiler and had to start from scratch?
The answer is you'd have to start from assembly or hardware. That is, you can either build a compiler in software or hardware. If there were no compilers in the whole world, these days you could probably do it faster in assembly; however, back in the day I believe compilers were in fact dedicated pieces of hardware. The wikipedia article is somewhat short and doesn't back me up on that, but never mind.
The next question I guess is what happens today? Well, those compiler writers have been busy writing portable C for years, so the compiler should be able to compile itself. It's worth discussing on a very high level what compilation is. Basically, you take a set of statements and produce assembly from them. That's it. Well, it's actually more complicated than that - you can do all sorts of things with lexers and parsers and I only understand a small subset of it, but essentially, you're looking to map C to assembly.
Under normal operation, the compiler produces assembly code matching your platform, but it doesn't have to. It can produce assembly code for any platform you like, provided it knows how to. So the first step in making C work on your platform is to create a target in an existing compiler, start adding instructions and get basic code working.
Once this is done, in theory, you can now cross compile from one platform to another. The next stages are: building a kernel, bootloader and some basic userland utilities for that platform.
Then, you can have a go at compiling the compiler for that platform (once you've got a working userland and everything you need to run the build process). If that succeeds, you've got basic utilities, a working kernel, userland and a compiler system. You're now well on your way.
Note that in the process of porting the compiler, you probably needed to write an assembler and linker for that platform too. To keep the description simple, I omitted them.
If this is of interest, Linux from Scratch is an interesting read. It doesn't tell you how to create a new target from scratch (which is significantly non trivial) - it assumes you're going to build for an existing known target, but it does show you how you cross compile the essentials and begin building up the system.
Python does not actually assemble to assembly. For a start, the running python program keeps track of counts of references to objects, something that a cpu won't do for you. However, the concept of instruction-based code is at the heart of Python too. Have a play with this:
>>> def hello(x, y, z, q):
... print "Hello, world"
... q()
... return x+y+z
...
>>> import dis
dis.dis(hello)
2 0 LOAD_CONST 1 ('Hello, world')
3 PRINT_ITEM
4 PRINT_NEWLINE
3 5 LOAD_FAST 3 (q)
8 CALL_FUNCTION 0
11 POP_TOP
4 12 LOAD_FAST 0 (x)
15 LOAD_FAST 1 (y)
18 BINARY_ADD
19 LOAD_FAST 2 (z)
22 BINARY_ADD
23 RETURN_VALUE
There you can see how Python thinks of the code you entered. This is python bytecode, i.e. the assembly language of python. It effectively has its own "instruction set" if you like for implementing the language. This is the concept of a virtual machine.
Java has exactly the same kind of idea. I took a class function and ran javap -c class to get this:
invalid.site.ningefingers.main:();
Code:
0: aload_0
1: invokespecial #1; //Method java/lang/Object."<init>":()V
4: return
public static void main(java.lang.String[]);
Code:
0: iconst_0
1: istore_1
2: iconst_0
3: istore_1
4: iload_1
5: aload_0
6: arraylength
7: if_icmpge 57
10: getstatic #2;
13: new #3;
16: dup
17: invokespecial #4;
20: ldc #5;
22: invokevirtual #6;
25: iload_1
26: invokevirtual #7;
//.......
}
I take it you get the idea. These are the assembly languages of the python and java worlds, i.e. how the python interpreter and java compiler think respectively.
Something else that would be worth reading up on is JonesForth. This is both a working forth interpreter and a tutorial and I can't recommend it enough for thinking about "how things get executed" and how you write a simple, lightweight language.

In the interest of performance, I'm sure C compilers are just built up from assembly.
C compilers are, nowadays, (almost?) completely written in C (or higher-level languages - Clang is C++, for instance). Compilers gain little to nothing from including hand-written assembly code. The things that take most time are as slow as they are because they solve very hard problems, where "hard" means "big computational complexity" - rewriting in assembly brings at most a constant speedup, but those don't really matter anymore at that level.
Also, most compilers want high portability, so architecture-specific tricks in the front and middle end are out of question (and in the backends, they' not desirable either, because they may break cross-compilation).
Say I buy a new cpu with nothing on it. During the first operation I wish to install an OS, which runs C. What runs the C compiler? Is there a miniature C compiler in the BIOS?
When you're installing an OS, there's (usually) no C compiler run. The setup CD is full of readily-compiled binaries for that architecture. If there's a C compiler included (as it's the case with many Linux distros), that's an already-compiled exectable too. And those distros that make you build your own kernel etc. also have at least one executable included - the compiler. That is, of course, unless you have to compile your own kernel on an existing installation of anything with a C compiler.
If by "new CPU" you mean a new architecture that isn't backwards-compatible to anything that's yet supported, self-hosting compilers can follow the usual porting procedure: First write a backend for that new target, then compile yourself for it, and suddenly you got a mature compiler with a battle-hardened (compiled a whole compiler) native backend on the new platform.

If you buy a new machine with a pre-installed OS, it doesn't even need to include a compiler anywhere, because all the executable code has been compiled on some other machine, by whoever provides the OS - your machine doesn't need to compile anything itself.
How do you get to this point if you have a completely new CPU architecture? In this case, you would probably start by writing a new code generation back-end for your new CPU architecture (the "target") for an existing C compiler that runs on some other platform (the "host") - a cross-compiler.
Once your cross-compiler (running on the host) works well enough to generate a correct compiler (and necessary libraries, etc.) that will run on the target, then you can compile the compiler with itself on the target platform, and end up with a target-native compiler, which runs on the target and generates code which runs on the target.
It's the same principle with a new language: you have to write code in an existing language that you do have a toolchain for, which will compile your new language into something that you can work with (let's call this the "bootstrap compiler"). Once you get this working well enough, you can write a compiler in your new language (the "real compiler"), and then compile the real compiler with the bootstrap compiler. At this point you're writing the compiler for your new language in the new language itself, and your language is said to be "self-hosting".

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio