Why is IR needed for symbolic execution? - klee

For example, KLEE works on LLVM bitcode.
Can we build symbolic execution directly on C source code?

Each LLVM IR instruction performs only one simple operation, but one C statement can contain multiple operations. For example, a[i] = b[i]; could be split into:
addr = b + i; // getElementPtr instruction
tmp = *addr; // load instruction
addr1 = a + i; // getElementPtr instruction
*addr1 = tmp; // store instruction
So it is much simpler for a symbolic executor to process LLVM IR than source code.


How to have the same routine executed sometimes by the CPU and sometimes by the GPU with OpenACC?

I'm dealing with a routine which I want the first time to be executed by the CPU and every other time by the GPU. This routine contains the loop:
for (k = kb; k <= ke; k++){
for (j = jb; j <= je; j++){
for (i = ib; i <= ie; i++){
I tried adding #pragma acc loop collapse(3) to the loop and #pragma acc routine(routine) vector just before the calls where I want the GPU to execute the routine. -Minfo=accel doesn't report any message, and with Nsight Systems I see that the routine is always executed by the CPU, so this approach doesn't work.
Why is the compiler ignoring both #pragma directives?
To follow on to Thomas' answer, here's an example of using the "if" clause:
% cat test.c
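#include <stdlib.h>
#include <stdio.h>
void compute(int * Arr, int size, int use_gpu) {
#pragma acc parallel loop copyout(Arr[:size]) if(use_gpu)
    for (int i=0; i < size; ++i) {
        Arr[i] = i;
    }
}
int main() {
    int *Arr;
    int size;
    int use_gpu;
    size = 1024; // assumed value, consistent with grid [8] x block [128] in the profile
    Arr = (int*) malloc(sizeof(int)*size);
    // Run on the host
    use_gpu = 0;
    compute(Arr, size, use_gpu);
    // Run on the GPU
    use_gpu = 1;
    compute(Arr, size, use_gpu);
    free(Arr);
    return 0;
}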
% nvc -acc -Minfo=accel test.c
4, Generating copyout(Arr[:size]) [if not already present]
Generating NVIDIA GPU code
7, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
% setenv NV_ACC_TIME 1
% a.out
Accelerator Kernel Timing data
compute NVIDIA devicenum=0
time(us): 48
4: compute region reached 1 time
4: kernel launched 1 time
grid: [8] block: [128]
device time(us): total=5 max=5 min=5 avg=5
elapsed time(us): total=331 max=331 min=331 avg=331
4: data region reached 2 times
9: data copyout transfers: 1
device time(us): total=43 max=43 min=43 avg=43
I'm using nvc and set the compiler's runtime profiler (NV_ACC_TIME=1) to show that the kernel is launched only once.
You need to enable OpenACC processing: -acc (with NVHPC tools) or -fopenacc (with GCC), for example, and then you need to use an OpenACC compute construct (parallel, kernels) to actually launch parallel GPU execution (plus host/device memory management, as necessary). For example, you could call your routine from that compute construct, and the routine would annotate the loop nest with OpenACC loop directives, as you've mentioned, to actually make use of the GPU parallelism.
Then, to answer your actual question: the OpenACC compute constructs support an if clause, which specifies whether the region executes on the current device (the "GPU") or on the local host thread (the "CPU").

Fastest way to swap alternate bytes on ARM Cortex M4 using gcc

I need to swap alternate bytes in a buffer as quickly as possible in an embedded system using an ARM Cortex M4 processor. I use gcc. The amount of data is variable but the max is a little over 2K. It doesn't matter if a few extra bytes are converted because I can use an over-sized buffer.
I know that the ARM has the REV16 instruction, which I can use to swap alternate bytes in a 32-bit word. What I don't know is:
(1) Is there a way of getting at this instruction in gcc without resorting to assembler? The __builtin_bswap16 intrinsic appears to operate on 16-bit words only. Converting 4 bytes at a time will surely be faster than converting 2 bytes.
(2) Does the Cortex M4 have a reorder buffer and/or do register renaming? If not, what do I need to do to minimise pipeline stalls when I convert the dwords of the buffer in a partially-unrolled loop?
For example, is this code efficient, where REV16 is appropriately defined to resolve (1):
uint32_t *buf = ... ;
size_t n = ... ; // (number of bytes to convert + 15)/16
for (size_t i = 0; i < n; ++i)
{
    uint32_t a = buf[0];
    uint32_t b = buf[1];
    uint32_t c = buf[2];
    uint32_t d = buf[3];
    REV16(a, a);
    REV16(b, b);
    REV16(c, c);
    REV16(d, d);
    buf[0] = a;
    buf[1] = b;
    buf[2] = c;
    buf[3] = d;
    buf += 4;
}
You can't use the __builtin_bswap16 function for the reason you stated: it works on 16-bit words, so it will zero the other halfword. I guess the reason for this is to keep the intrinsic behaving the same way on processors that don't have an instruction like ARM's REV16.
The function
uint32_t swap(uint32_t in)
{
    in = __builtin_bswap32(in);
    in = (in >> 16) | (in << 16);
    return in;
}
compiles to (ARM GCC 5.4.1 -O3 -std=c++11 -march=armv7-m -mtune=cortex-m4 -mthumb)
rev r0, r0
ror r0, r0, #16
bx lr
And you could probably ask the compiler to inline it, which would give you 2 instructions per 32bit word. I can't think of a way to get GCC to generate REV16 with a 32bit operand, without declaring your own function with inline assembly.
As a follow up, and based on artless noise's comment about the non portability of the __builtin_bswap functions, the compiler recognizes
uint32_t swap(uint32_t in)
{
    in = ((in & 0xff000000) >> 24) | ((in & 0x00FF0000) >> 8) | ((in & 0x0000FF00) << 8) | ((in & 0xFF) << 24);
    in = (in >> 16) | (in << 16);
    return in;
}
and creates the same 3 instruction function as above, so that is a more portable way to achieve it. Whether different compilers would produce the same output though...
If inline assembler is allowed, the following function
inline uint32_t Rev16(uint32_t a)
{
    asm ("rev16 %0, %1"
        : "=r" (a)
        : "r" (a));
    return a;
}
gets inlined and acts as a single instruction.

Optimize Cuda Kernel time execution

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program that computes the difference between two pictures, and I compared the execution time of a classic CPU implementation in C with a GPU implementation in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;
case GPU:
    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));
    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));
    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );
    for(int m = 0; m < nb_loops ; m++)
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );
    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));
    printf("Time to generate: %4.4f ms \n", time/nb_loops);
case CPU:
    clock_t begin = clock(), diff;
    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
        }
    }
    diff = clock() - begin;
    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);
And here is my kernel function:
__global__ void diff(unsigned char *data1 ,unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;
    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU time is not as low as it should be. I am a beginner in CUDA, so please be understanding if there are some classic errors.
Thank you for your feedback. I tried removing the 'if' condition from the kernel, but it didn't significantly change the execution time.
However, after I installed the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this message, but it seems true, because the GPU version is only 5 or 6 times faster than the CPU version. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the others. If you have an idea of what I am doing wrong, it would be helpful...
Here are a few things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything already has a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible", but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4 ( unsigned int a, unsigned int b )
which performs a per-byte subtraction with signed saturation, treating each of the four bytes in a and b as a signed value. If you can "live" with a byte-sized result, rather than expecting a larger int variable, that could save you some work, and it would go very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
I don't think you are measuring times correctly: memory copies are a time-consuming step on the GPU, and you should take them into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider setting them with cudaMemcpyToSymbol().
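Note that __syncthreads() only synchronizes the threads within one block. To make sure the kernels have finished before you read the results or stop a host-side timer, synchronize on the host, for example with cudaDeviceSynchronize().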
CUDA works with warps of 32 threads, so the number of threads per block works best as a multiple of 32, but not larger than 512 threads per block unless your hardware supports it. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free the memory you allocated on the GPU and to destroy the timing events you created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, beside the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with arrays due to padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are considered not optimal:
if (row < MAX_H && col < MAX_W)
data_res[v] = (int) data2[v] - (int) data1[v];
Conditional statements inside a kernel can result in warp divergence: when threads of the same warp take different branches, the if and else parts are executed in sequence, not in parallel. Also, as you might have realized, the condition evaluates to false only at the borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
And on a separate note:
The GPU has a throughput-oriented architecture, not a latency-oriented one like the CPU. This means the CPU may be faster than CUDA when processing small amounts of data. Have you tried using large data sets?
The CUDA profiler is a very handy tool that will tell you where your code is not optimal.

Including standalone C code into assembler

I haven't looked in ages how the computer actually starts up, so I started playing around with writing my own loader which would boot into IA-32e mode and initialize all the CPUs with some dummy code to run. I'm fairly far, but I'm getting tired of writing trivial things in assembler.
Here's a toy case of what I would like to achieve. Say I want to write a simple piece of code that prints a C-style string and keeps track of the cursor in some fixed location in memory. A C implementation would be something along the following lines (this code is untested, I wrote it on the fly, so don't comment on bugs, since they're not relevant):
#define VIDEORAM_ADDRESS 0xa0000
#define VGA_GREY_ON_BLACK 0x07
#define CURSOR_X 0x100 /* dummy address */
#define CURSOR_Y 0x101
void printk(const char *s)
{
    volatile char *p;
    int x, y;
    x = *(volatile char*)CURSOR_X;
    y = *(volatile char*)CURSOR_Y;
    while(*s != 0) {
        if(*s == '\n') {
            y = y >= 25 ? 0 : y;
            x = 0;
        } else {
            if(x >= 80) {
                y = y >= 25 ? 0 : y;
                x = 0;
            }
            p = (volatile char*)VIDEORAM_ADDRESS + x + y * VIDEORAM_LINE_LENGTH;
            *p++ = *s++;
        }
    }
    *(volatile char*)CURSOR_X = x;
    *(volatile char*)CURSOR_Y = y;
}
I can compile this with gcc -m32 -O2 -S printk.c, which generates printk.s. My question is essentially how to combine this together with a handwritten assembly file? The end result should of course be nothing else except a single binary blob of machine code and data that is loaded by the BIOS onto 0000:7C00 if, say, I want to include the code into the stage 1 loader loaded from a disk and call it after switching over to protected mode.
Would an alternative be to put an .include directive somewhere in the handwritten assembly file to get the code included? Unfortunately, gcc emits all kinds of directives for the GNU Assembler in the .s file, and I really only want the code for the printk function.
Is there some canonical way of doing this?

How can I get my CPU's branch target buffer(BTB) size?

Knowing the size is useful when LOOPS > BTB_SIZE. For example, changing this routine
int n = 0;
for (int i = 0; i < LOOPS; i++)
    n++;
to
int n = 0;
int loops = LOOPS / 2;
for(int i = 0; i < loops; i+=2)
    n += 2;
can reduce branch misses.
BTB ref: http://www-ee.eng.hawaii.edu/~tep/EE461/Notes/ILP/buffer.html, but it doesn't say how to get the BTB size.
Any modern compiler worth its salt should optimise the code to int n = LOOPS;, and in a more complex example the compiler will also take care of such optimisations; see LLVM's auto-vectorisation, for instance, which handles many kinds of loop unrolling. Rather than trying to optimise your code by hand, find appropriate compiler flags to get the compiler to do all the hard work.
From the BTB's point of view, both versions are the same. In both versions (if compiled unoptimized) there is only one conditional jump (originating from the loop condition i<LOOPS), so there is only one jump target in the code and thus only one BTB entry is used. You can see the resulting assembler code using Matt Godbolt's Compiler Explorer.
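There would be a difference between a loop with one if in its body
for(int i=0;i<n;i++){
    if(...) ...
}
and a loop with two ifs in its body
for(int i=0;i<n;i++){
    if(...) ...
    if(...) ...
}
The first version would need 2 BTB entries (one for the for and one for the if), the second would need 3 BTB entries (one for the for and one for each of the two ifs).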
However, as Matt Godbolt found out, there are 4096 BTB entries, so I would not worry too much about them.
