feature request: an atomicAdd() function included in gwan.h - key-value-store

In the G-WAN KV options, KV_INCR_KEY will use the 1st field as the primary key.
That means a function that increments atomically is already built into the G-WAN core to make this primary index work.
It would be good to expose this function to servlets, i.e. to include it in gwan.h.
That way, ANSI C newbies like me could benefit from it.

There was ample discussion about this on the old G-WAN forum, and people were invited to share their experiences with atomic operations in order to build a rich list of documented functions, platform by platform.
Atomic operations are not portable because they address the CPU directly. This means that the code for x86 (32-bit) and AMD64 (64-bit) is different. Each platform (ARM, Power7, Cell, Motorola, etc.) has its own atomic instruction set.
Such a list has not been published in the gwan.h file so far because basic operations are easy to find (the GCC compiler offers several atomic intrinsics as C extensions), while more sophisticated operations are less obvious (they need asm skills) and people will build them as they need them, for very specific uses in their code.
Software Engineering is always a balance between what can be made available at the lowest possible cost to entry (like the G-WAN KV store, which uses a small number of functions) and how it actually works (which is far less simple to follow).
So, beyond the obvious (incr/decr, set/get), to learn more about atomic operations, use Google, find CPU instruction set manuals, and arm yourself with courage!
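As a rough illustration of the "obvious" operations mentioned above (incr/decr, set/get), here is a minimal sketch built on GCC's `__sync` intrinsics. The helper names (`atomic_incr`, `atomic_decr`, `atomic_get`, `atomic_set`) are made up for this example and are not part of gwan.h; the sketch assumes a GCC-compatible compiler.

```c
/* Hypothetical helper names; only the __sync_* builtins are real GCC intrinsics. */

/* incr / decr: atomically add or subtract 1 and return the new value */
static int atomic_incr(volatile int *p) { return __sync_add_and_fetch(p, 1); }
static int atomic_decr(volatile int *p) { return __sync_sub_and_fetch(p, 1); }

/* get: adding 0 returns the current value through an atomic operation */
static int atomic_get(volatile int *p) { return __sync_add_and_fetch(p, 0); }

/* set: retry a compare-and-swap until our value is stored; returns the old value */
static int atomic_set(volatile int *p, int v)
{
    int old;
    do {
        old = *p;
    } while (!__sync_bool_compare_and_swap(p, old, v));
    return old;
}
```

A servlet could then call, say, `atomic_incr(&counter)` on a shared `volatile int counter` instead of `counter++`. Anything beyond these basics (wider types, platform-specific instructions) would still need the per-CPU research described above.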

Thanks to Gil for the helpful guidance.
Now I can do it by myself.
I changed the code in persistence.c as below.
First, I changed the definition of val in data to volatile.
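//data[0]->val++;
//xbuf_xcat(reply, "Value: %d", data[0]->val);
int new_count = 0, loops = 50000000, time1, time2, time;
time1 = getus(); // start time, in microseconds
for(int i = 0; i < loops; i++){
   new_count = __sync_add_and_fetch(&data[0]->val, 1); // atomic increment
}
time2 = getus(); // end time, in microseconds
time = loops / (time2 - time1); // increments per microsecond (integer division)
time = time * 1000; // increments per millisecond
xbuf_xcat(reply, "Value: %d, time: %d incr_ops/msec", new_count, time);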
I got 52,000 incr_operations/msec with my old E2180 CPU.
So, with the GCC compiler, I can do it by myself.
Thanks again.

Runtime system for Stm32F103 Arm, GNAT Ada compiler

I'd like to use Ada with an STM32F103 MCU, but here is the problem: there is no built-in runtime system (RTS) for it in GNAT 2016. An RTS for another Cortex-M3 MCU, from TI, is included (zfp-lm3s), but it seems to need more substantial updates; simply changing the memory size/origin doesn't work.
So, I have some questions:
Does anybody have an RTS for the STM32F103?
Are there any good books about the low-level details of the Cortex-M3 or other ARM MCUs?
PS. Using zfp-lm3s raises this error when I try to run the program via GPS:
Loading section .text, size 0x140 lma 0x0
Load failed
The STM32F series is from STMicroelectronics, not TI, so the stm32f4 might seem to be a better starting point.
In particular, the clock code in bsp/setup_pll.adb should need only minor tweaking; use STM’s STM32CubeMX tool (written in Java) to find the magic numbers to set up the clock properly.
You will also find that the assembler code used in bsp/start*.S needs simplifying/porting to the Cortex-M3 part.
My Cortex GNAT Run Time Systems project includes an Arduino Due version (also Cortex-M3), which has startup code written entirely in Ada. I don’t suppose the rest of the code would help a lot, being based on FreeRTOS - you’d have to be very very careful about memory usage.
I stumbled upon this question while looking for a zfp runtime specific to the stm32l0xx boards. It doesn't look like one exists from what I can see, but I did stumble upon this guide to creating a new runtime from AdaCore, which might help anyone stuck with the same issue:
https://blog.adacore.com/porting-the-ada-runtime-to-a-new-arm-board

CUDA-like workflow for OpenCL

The typical example workflow for OpenCL programming seems to be focused on source code within strings, passed to the JIT compiler, then finally enqueued (with a specific kernel name); and the compilation results can be cached - but that's left for you the programmer to take care of.
In CUDA, the code is compiled in a non-JIT way to object files (alongside host-side code, but forget about that for a second), and then one just refers to device-side functions in the context of an enqueue or arguments etc.
Now, I'd like to have the second kind of workflow, but with OpenCL sources. That is, suppose I have some C host-side code my_app.c, and some OpenCL kernel code in a separate file, my_kernel.cl (which for the purpose of discussion is self-contained). I would like to be able to run a magic command on my_kernel.cl, get a my_kernel.whatever, link or faux-link that together with my_app.o, and get a binary. Now, in my_app.c I want to be able to somehow refer to the kernel, even if it's not an extern symbol, as a compiled OpenCL program (or program + kernel name), and not get compilation errors.
Is this supported somehow? With NVIDIA's ICD or with one of the other ICDs? If not, is at least some of this supported, say, the magic kernel compiler + generation of an extra header or source stub to use in compiling my_app.c?
Look into SYCL; it offers single-source C++ OpenCL. However, it is not yet available on every platform.
https://www.khronos.org/sycl
There is already an ongoing effort that enables a CUDA-like workflow in TensorFlow; it uses SYCL 1.2 and is being actively upstreamed.
Similarly to CUDA, SYCL's approach needs the following steps:
1. Device registration via the device factory (the device is called SYCL), done here: https://github.com/lukeiwanski/tensorflow/tree/master/tensorflow/core/common_runtime/sycl
2. Operation registration for the above device. In order to create or port an operation you can either:
   - re-use Eigen's code, since the Tensor module has a SYCL back-end (look here: https://github.com/lukeiwanski/tensorflow/blob/opencl/adjustcontrastv2/tensorflow/core/kernels/adjust_contrast_op.cc#L416 ; we just partially specialize the operation for the SYCL device and call the already implemented functor https://github.com/lukeiwanski/tensorflow/blob/opencl/adjustcontrastv2/tensorflow/core/kernels/adjust_contrast_op.h#L91 ), or
   - write SYCL code, as has been done for FillPhiloxRandom; see https://github.com/lukeiwanski/tensorflow/blob/master/tensorflow/core/kernels/random_op.cc#L685
     - a SYCL kernel uses modern C++
     - you can use OpenCL interoperability, thanks to which you can write pure OpenCL C kernel code! I think this bit is most relevant to you
The workflow is a bit different as you do not have to do an explicit instantiation of the functor templates as CUDA does https://github.com/lukeiwanski/tensorflow/blob/master/tensorflow/core/kernels/adjust_contrast_op_gpu.cu.cc or any .cu.cc file ( in fact you do not have to add any new files - avoids mess with the build system )
As well as this thing: https://github.com/lukeiwanski/tensorflow/issues/89;
TL;DR - CUDA can create "persistent" pointers, OpenCL needs to go through Buffers and Accessors.
Codeplay's SYCL compiler ( ComputeCpp ) at the moment requires OpenCL 1.2 with SPIR extension - these are Intel CPU, Intel GPU ( Beignet work in progress ), AMD GPU ( although older drivers ) - additional platforms are coming!
Setup instructions can be found here: https://www.codeplay.com/portal/03-30-17-setting-up-tensorflow-with-opencl-using-sycl
Our effort can be tracked in my fork of TensorFlow: https://github.com/lukeiwanski/tensorflow ( branch dev/eigen_mehdi )
Eigen used is: https://bitbucket.org/mehdi_goli/opencl ( branch default )
We are getting there! Contributions are welcome! :)

Getting system time in VxWorks

Is there any way to get the system time in VxWorks besides tickGet() and tickAnnounce()? I want to measure the time between the task switches of a specified task, but I think the precision of tickGet() is not good enough, because the two tickGet() values at the beginning and the end of the hook function installed with taskSwitchHookAdd() are always the same!
If you are looking to try and time task switches, I would assume you need a timer at least at the microsecond (us) level.
Usually, timers/clocks this fine-grained are only provided by the platform you are running on. If you are working on an embedded system, you can try reading through the manuals for your board support package (if there is one) to see if there are any functions provided to access various timers on the board.
A more low level solution would be to figure out the processor that is running on your system and then write some simple assembly code to poll the processor's internal timebase register (TBR). This might require a bit of research on the processor you are running on, but could be easily done.
If you are running on a PPC based processor, you can use the code below to read the TBR:
loop: mftbu rx #load most significant half from TBU
mftbl ry #load least significant half from TBL
mftbu rz #load from TBU again
cmpw rz,rx #see if 'old' = 'new'
bne loop #repeat if two values read from TBU are unequal
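The read-high, read-low, re-read-high pattern used in the assembly above can also be sketched in C. This is only an illustration: `read_half_fn`, `read_counter64`, and the simulated `fake_*` counter are names invented for this example, and on real hardware the two accessors would wrap `mftbu`/`mftbl` or whatever timer registers your BSP exposes.

```c
#include <stdint.h>

/* Hypothetical accessor type: returns one 32-bit half of a 64-bit counter. */
typedef uint32_t (*read_half_fn)(void);

/* Read high, low, then high again; retry if the high half changed,
   which means the low half wrapped while we were reading. */
static uint64_t read_counter64(read_half_fn read_hi, read_half_fn read_lo)
{
    uint32_t hi, lo, hi2;
    do {
        hi  = read_hi();
        lo  = read_lo();
        hi2 = read_hi();
    } while (hi != hi2);
    return ((uint64_t)hi << 32) | lo;
}

/* Simulated counter for demonstration only: every read of a half advances
   it by one, so a carry from the low half into the high half can occur
   between two reads. */
static uint64_t fake_counter = 0xFFFFFFFEu;
static uint32_t fake_hi(void) { return (uint32_t)(fake_counter++ >> 32); }
static uint32_t fake_lo(void) { return (uint32_t)(fake_counter++); }
```

Calling `read_counter64(fake_hi, fake_lo)` with the simulated counter starting just below a 32-bit wrap forces one retry, which is exactly the case the `cmpw`/`bne` pair guards against.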
On an x86 based processor, you might consider using the RDTSC assembly instruction to read the Time Stamp Counter (TSC). On vxWorks, pentiumALib has some library functions (pentiumTscGet64() and pentiumTscGet32()) that will make reading the TSC easier using C.
source: http://www-inteng.fnal.gov/Integrated_Eng/GoodwinDocs/pdf/Sys%20docs/PowerPC/PowerPC%20Elapsed%20Time.pdf
Good luck!
It depends on what platform you are on, but if it is x86 then you can use:
pentiumTscGet64();

Can I assume sizeof(GUID)==16 at all times?

The definition of GUID in the Windows headers is like this:
typedef struct _GUID {
    unsigned long  Data1;
    unsigned short Data2;
    unsigned short Data3;
    unsigned char  Data4[ 8 ];
} GUID;
However, no packing is defined. Since the alignment of structure members depends on the compiler implementation, one could think this structure could be longer than 16 bytes.
If I can assume it is always 16 bytes, my code using GUIDs is more efficient and simpler.
However, it would be completely unsafe if a compiler added some padding between the members for some reason.
My question: do potential reasons for that exist? Or is the probability that sizeof(GUID) != 16 really 0?
It's not official documentation, but perhaps this article can ease some of your fears. I think there was another one on a similar topic, but I cannot find it now.
What I want to say is that Windows structures do have a packing specifier, but it's a global setting which is somewhere inside the header files. It's a #pragma or something. And it is mandatory, because otherwise programs compiled by different compilers couldn't interact with each other - or even with Windows itself.
It's not zero, it depends on your system. If the alignment is word (4-bytes) based, you'll have padding between the shorts, and the size will be more than 16.
If you want to be sure that it's 16 - manually disable the padding, otherwise use sizeof, and don't assume the value.
If I feel I need to make an assumption like this, I'll put a 'compile time assertion' in the code. That way, the compiler will let me know if and when I'm wrong.
If you have or are willing to use Boost, there's a BOOST_STATIC_ASSERT macro that does this.
For my own purposes, I've cobbled together my own (that works in C or C++ with MSVC, GCC and an embedded compiler or two) that uses techniques similar to those described in this article:
http://www.pixelbeat.org/programming/gcc/static_assert.html
The real trick to getting the compile-time assertion to work cleanly is dealing with the fact that some compilers don't like declarations mixed with code (MSVC in C mode), and that the techniques often generate warnings that you'd rather not have clogging up an otherwise working build. Coming up with techniques that avoid the warnings is sometimes a challenge.
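For illustration, one common pre-C11 form of such a compile-time assertion is the negative-array-size trick, sketched below. This is not my exact macro: `STATIC_ASSERT` and `guid_like` are names made up for this example, and `guid_like` is a stand-in that uses fixed-width types so the sketch also compiles on platforms where unsigned long is not 32 bits.

```c
#include <stdint.h>

/* Hypothetical macro name. When cond is false the array size is -1,
   which is a compile error; when true it is a harmless typedef.
   Being a declaration at file scope, it avoids the "declarations mixed
   with code" problem. */
#define STATIC_ASSERT(cond, name) \
    typedef char static_assert_##name[(cond) ? 1 : -1]

/* Stand-in with the same member layout as the Windows GUID struct. */
typedef struct {
    uint32_t Data1;
    uint16_t Data2;
    uint16_t Data3;
    uint8_t  Data4[8];
} guid_like;

/* The build fails here if a compiler ever pads the struct. */
STATIC_ASSERT(sizeof(guid_like) == 16, guid_is_16_bytes);
```

In real Windows code the check would presumably be written against the actual type, e.g. `STATIC_ASSERT(sizeof(GUID) == 16, guid_is_16_bytes);` after including the Windows headers. With a C11 compiler, `_Static_assert(sizeof(GUID) == 16, "GUID must be 16 bytes");` does the same job without a macro.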
Yes, on any Windows compiler. Otherwise IsEqualGUID would not work: it compares only the first 16 bytes. Similarly, any other WinAPI function that takes a GUID* just checks the first 16 bytes.
Note that you must not assume generic C or C++ rules for windows.h. For instance, a byte is always 8 bits on Windows, even though ISO C allows 9 bits.
Anytime you write code dependent on the size of someone else's structure, warning bells should go off.
Could you give an example of some of the simplified code you want to use?
Most people would just use sizeof(GUID) if the size of the structure was needed.
With that said -- I can't see the size of GUID ever changing.
#include <stdio.h>
#include <rpc.h>
int main () {
    GUID myGUID;
    /* sizeof yields a size_t, so cast it to match the %u format */
    printf("size of GUID is %u\n", (unsigned)sizeof(myGUID));
    return 0;
}
Got 16. This is useful to know if you need to manually allocate on the heap.

How can I get a list of legal ARM opcodes from gcc (or elsewhere)?

I'd like to generate pseudo-random ARM instructions. Via assembler directives, I can tell gcc what mode I'm in, and it will complain if I try a set of opcodes and operands that's not legal in that mode, so it must have some internal listing of what can be done in which mode. Where does that live? Would it be easier to extract that info from LLVM?
Is this question "not even wrong"? Should I try a different approach entirely?
To answer my own question, this is actually really easy to do from arm.md and constraints.md in gcc/config/arm/. I probably spent more time asking this question and answering comments on it than I did figuring this out. It turns out I just need to look for 'TARGET_THUMB1', until I get around to implementing thumb2.
For the ARM family the buck stops at the ARM ARM (ARM Architecture Reference Manual). There is an ARM instruction set section and a Thumb instruction set section. Within both, each instruction tells you which generation it belongs to (ARMvX, where X is some number like 4 (arm7) or 5 (arm9 time frame), etc.). Since the opcode and pseudo code are listed for each instruction, you should be able to figure out which are real instructions and which, if any, are just syntax to save typing for another (push and pop, for example).
With the Cortex-m3 and thumb2 in particular you also need to look at the TRM (Technical Reference Manual) as well. ARM has, I forget the name, a universal syntax they are trying to use that should work on both Thumb and ARM. For example on an ARM you have three register instructions:
add r1,r1,r2
In Thumb there are only two-register operations:
add r1,r2
The desire basically is to meet in the middle or I would say more accurately to encourage ARM assemblers to parse Thumb instructions and encode them with the equivalent ARM instruction without complaining. This may have started with thumb and not thumb2, I have always separated the two syntaxes in my code until recently (and I still generally use ARM syntax for ARM and Thumb for Thumb).
And then yes you have to see what the specific implementation of the assembler tool is, in your case binutils. And it sounds like you have found the binutils/gnu secret decoder ring.
