Dynamically creating a variable in a kernel module - linux-kernel

I am planning to use the kthread_run API in my kernel module.
kthread_run returns a struct task_struct *, which I want to store in a variable that is global across my module.
I want one thread per CPU, and I obtain the number of CPUs using num_online_cpus.
However, when I write the following code:
// outside init_module
int cpus = num_online_cpus();
static struct task_struct *my_tasks[cpus];
static int __init init_module(void)
{
    for (int j = 0; j < cpus; ++j) {
        my_tasks[j] = kthread_run(...);
    }
}
However, I get the following error:
error: variably modified ‘tasks’ at file scope
How can I achieve this?

If your variables are actually one per cpu, you might want to use the per_cpu macros. The gist is, you declare such a variable with:
DEFINE_PER_CPU(struct task_struct, my_tasks);
and then access the variable using
get_cpu_var(my_tasks).foo = bar;
See http://www.makelinux.net/ldd3/chp-8-sect-5 (or percpu.h) for more details.
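For example, a minimal sketch of that approach, assuming the per-CPU variable holds the pointer returned by kthread_run (the thread body my_thread_fn and the name format "my_thread/%d" are made up for illustration; error checking and the matching kthread_stop calls in the exit path are omitted):
#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/sched.h>

/* One task_struct pointer per CPU. */
static DEFINE_PER_CPU(struct task_struct *, my_tasks);

static int my_thread_fn(void *data)
{
    /* Hypothetical thread body: sleep until asked to stop. */
    while (!kthread_should_stop()) {
        set_current_state(TASK_INTERRUPTIBLE);
        schedule();
    }
    return 0;
}

static int __init my_init(void)
{
    int cpu;

    for_each_online_cpu(cpu) {
        /* Start one thread per online CPU and remember its task_struct. */
        per_cpu(my_tasks, cpu) = kthread_run(my_thread_fn, NULL,
                                             "my_thread/%d", cpu);
    }
    return 0;
}
module_init(my_init);

MODULE_LICENSE("GPL");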

Related

Problem allocating memory for a global struct and freeing it

I am using an embedded board with FreeRTOS.
In a task, I define two structs and use pvPortMalloc to allocate memory for them. (One struct is a member of the other.)
I also pass the address of the struct to some functions.
However, I have issues freeing the memory with vPortFree.
The following is my code (test_task.c):
/* Struct definition */
typedef struct __attribute__((packed)) {
    uint8_t  num_parameter;
    uint32_t member1;
    uint8_t  member2;
    uint8_t  *parameter;
} struct_member;

typedef struct __attribute__((packed)) {
    uint16_t num_member;
    uint32_t class;
    struct_member *member;
} struct_master;
I define a global struct and an array below.
uint8_t *arr;
struct_master master;
Function definition:
void decode_func(struct_master *master, uint8_t *arr)
{
    master->member = pvPortMalloc(master->num_member);
    for (int i = 0; i < scr->num_command; ++i) {
        master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter);
        do_something();
    }
}
The task is shown below.
At the end of the task, I would like to free the memory:
void test_task()
{
    decode_func(&master, arr);
    do_operation();

    vPortFree(master.member);
    for (int i = 0; i < master.num_member; ++i)
        vPortFree(master.member[i].parameter);

    hTest_task = NULL;
    vTaskDelete(NULL);
}
Freeing master.member works.
However, when the program tries to free master.member[i].parameter, it behaves as if that memory had already been freed, and the software resets.
Does anyone know why this happens?
At first glance, the allocation for member in decode_func is wrong.
I assume that master->num_member indicates the number of struct_member elements that master should contain.
master->member = pvPortMalloc(master->num_member);
should be corrected to,
master->member = pvPortMalloc(master->num_member * sizeof(struct_member));
The loop in the same function also seems a bit suspicious.
for(int i = 0; i < scr->num_command; ++i){
master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter);
do_something();
}
I'm not sure what scr->num_command indicates, but I reckon the loop should run while i < master->num_member. I assume your loop should be updated as follows:
for(int i = 0; i < master->num_member; ++i){
master->member[i].parameter = pvPortMalloc(master->member[i].num_parameter * sizeof(uint8_t));
do_something();
}
When freeing the memory, free the contained allocations before the containing one: first free all the parameter buffers, then free member. Change that order in test_task as well.
Also make sure that all the resources consumed by test_task are released before calling vTaskDelete(NULL); otherwise there will be a resource leak. vTaskDelete(NULL) simply marks the TCB of that task as ready to be deleted, so that at some later point the idle task can purge the TCB-related resources.
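Putting the two fixes together, a corrected sketch of the allocation and of the tear-down order (reusing the globals and helper functions from the question, and assuming num_member and each num_parameter are filled in before the allocations) could look like this:
void decode_func(struct_master *master, uint8_t *arr)
{
    /* Allocate room for num_member whole structs, not num_member bytes. */
    master->member = pvPortMalloc(master->num_member * sizeof(struct_member));

    for (int i = 0; i < master->num_member; ++i) {
        master->member[i].parameter =
            pvPortMalloc(master->member[i].num_parameter * sizeof(uint8_t));
        do_something();
    }
}

void test_task()
{
    decode_func(&master, arr);
    do_operation();

    /* Free the inner allocations first, then the containing array. */
    for (int i = 0; i < master.num_member; ++i)
        vPortFree(master.member[i].parameter);
    vPortFree(master.member);

    hTest_task = NULL;
    vTaskDelete(NULL);
}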
Generally, when you free an object, the contents of the object are destroyed and you can't access them anymore. So when you want to free nested allocations like this, you need to free the inner allocations first and only free the outer (master) allocation afterwards. In other words:
for (int i = 0; i < master.num_member; ++i)
vPortFree(master.member[i].parameter);
vPortFree(master.member);
free the parameters first and then the containing member array.

How to map data with OpenMP target for use inside a function?

I would like to know how I can map data for later use inside a function.
I wrote some code like the following:
struct s {
    int *a;
    int *b;
    // other members...
};

void func1(struct s *_s) {
    int *a = _s->a;
    int *b = _s->b;
    // do something with _s
    #pragma omp target
    {
        // do something with a and b;
    }
}

int main() {
    struct s *_s;
    // alloc _s, a and b
    int *a = _s->a;
    int *b = _s->b;
    #pragma omp target data map(to: a, b)
    {
        func1(_s);
        // call other functions that use the mapped data on the device...
    }
    // free data
}
The code compiles, but at run time the verbose output spams Kernel execution error at <address>, followed by many Device kernel launch failed! messages and CUDA error is: an illegal memory access was encountered.
Your map directive is probably mapping the values of the pointers a and b to the device rather than the arrays they point to. I think you want to shape them (use array sections) so that the runtime maps the data and not just the pointers. Personally, I would also put a map clause on the target region itself, since that gives the compiler more information to work with; the present check will find the data already on the device from the outer data region and will not perform any further data movement.
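For example, a minimal sketch of that suggestion (the element count N is an assumption, since the question never shows how large the arrays are):
#include <stdlib.h>

struct s {
    int *a;
    int *b;
};

/* Sketch: map the array sections, not just the pointer values. */
void func1(struct s *_s, int N)
{
    int *a = _s->a;
    int *b = _s->b;

    /* Repeating the map on the target region lets the present check find
     * the sections already mapped by the enclosing target data region, so
     * no extra transfer happens here. */
    #pragma omp target map(to: a[0:N], b[0:N])
    {
        for (int i = 0; i < N; ++i) {
            /* do something with a[i] and b[i] on the device */
        }
    }
}

int main(void)
{
    const int N = 1024;
    struct s *_s = malloc(sizeof *_s);
    _s->a = malloc(N * sizeof *_s->a);
    _s->b = malloc(N * sizeof *_s->b);

    int *a = _s->a;
    int *b = _s->b;

    #pragma omp target data map(to: a[0:N], b[0:N])
    {
        func1(_s, N);
    }

    free(_s->a);
    free(_s->b);
    free(_s);
    return 0;
}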

Linux kernel module - accessing memory mapping

I'm running into an odd issue on kernel module load that I suspect has to do with linking and loading. How do I programmatically figure out the address of each section after it is loaded into memory (from inside the module itself)? For example, where .bss, .data, .text and so on end up.
The article https://lwn.net/Articles/90913/ is roughly in the direction I'm looking for.
You can see the start addresses of the sections from userspace like this (root permissions required):
sudo cat /sys/module/<modulename>/sections/.text
I looked at how sysfs retrieves these addresses and found the following:
There is a section-attributes field in struct module:
/* Section attributes */
struct module_sect_attrs *sect_attrs;
This is a group of attribute structs:
struct module_sect_attrs {
    struct attribute_group grp;
    unsigned int nsections;
    struct module_sect_attr attrs[0];
};
where each section attribute is the thing you are looking for:
struct module_sect_attr {
    struct module_attribute mattr;
    char *name;
    unsigned long address;
};
From the module's code, the THIS_MODULE macro is actually a pointer to the struct module object. Its module_init and module_core fields point to the memory regions where all of the module's sections are loaded.
As I understand it, the section layout is not accessible from the module's code (struct load_info is dropped after the module is loaded into memory). But given the module's file you can easily deduce the sections' addresses after load:
module_init:
- init sections with code (.init.text)
- init sections with read-only data
- init sections with writable data
module_core:
- sections with code (.text)
- sections with read-only data
- sections with writable data
If several sections fall into the same category, they are placed in the same order as in the module's file.
From the module's code you can also print the address of any of its symbols and then work out the start of the section containing that symbol.
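For instance, a minimal sketch of that last idea (symbol names here are made up; comparing the printed run-time addresses with the symbol offsets in the .ko file gives you the load address of each containing section):
#include <linux/module.h>
#include <linux/kernel.h>

static int some_data = 42;            /* ends up in .data */

static noinline void marker_fn(void)  /* ends up in .text */
{
}

static int __init where_init(void)
{
    pr_info("marker_fn (.text) is at 0x%lx\n", (unsigned long)marker_fn);
    pr_info("some_data (.data) is at 0x%lx\n", (unsigned long)&some_data);
    marker_fn();
    return 0;
}

static void __exit where_exit(void)
{
}

module_init(where_init);
module_exit(where_exit);
MODULE_LICENSE("GPL");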
While this question is five years old, I thought I would contribute my two cents. I was able to access the module's sections in a somewhat hacky way inspired by Alex Hoppus' answer. I don't advocate doing things this way unless you are writing the kernel module to debug things or to understand the kernel.
Anyway, I copy the following two structs into my module to resolve the incomplete types.
struct module_sect_attr {
    struct module_attribute mattr;
    char *name;
    unsigned long address;
};

struct module_sect_attrs {
    struct attribute_group grp;
    unsigned int nsections;
    struct module_sect_attr attrs[0];
};
Then, in my module initialization function, I do the following to get the section addresses.
unsigned long text = 0;
unsigned int nsections = 0;
unsigned int i;
struct module_sect_attr *sect_attr;

nsections = THIS_MODULE->sect_attrs->nsections;
sect_attr = THIS_MODULE->sect_attrs->attrs;

for (i = 0; i < nsections; i++) {
    if (strcmp((sect_attr + i)->name, ".text") == 0)
        text = (sect_attr + i)->address;
}
Finally, it should be noted that if you are looking for the address of .rodata, .bss, or .data you will need to define constant global variables, uninitialized global variables, or regular global variables, respectively, if you don't want those sections to be omitted.
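For instance, a few hypothetical globals are enough to keep those sections present in the module:
static const int my_rodata_var __used = 1;  /* keeps .rodata populated */
static int my_bss_var __used;               /* keeps .bss populated   */
static int my_data_var __used = 2;          /* keeps .data populated  */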

OpenACC 2.0 routine: data locality

Take the following code, which illustrates the calling of a simple routine on the accelerator, compiled on the device using OpenACC 2.0's routine directive:
#include <iostream>

#pragma acc routine
int function(int *ARRAY, int multiplier) {
    int sum = 0;
    #pragma acc loop reduction(+:sum)
    for (int i = 0; i < 10; ++i) {
        sum += multiplier * ARRAY[i];
    }
    return sum;
}

int main() {
    int *ARRAY = new int[10];
    int multiplier = 5;
    int out;
    for (int i = 0; i < 10; i++) {
        ARRAY[i] = 1;
    }
    #pragma acc enter data create(out) copyin(ARRAY[0:10],multiplier)
    #pragma acc parallel present(out,ARRAY[0:10],multiplier)
    if (function(ARRAY, multiplier) == 50) {
        out = 1;
    } else {
        out = 0;
    }
    #pragma acc exit data copyout(out) delete(ARRAY[0:10],multiplier)
    std::cout << out << std::endl;
}
How does function know to use the device copies of ARRAY[0:10] and multiplier when it is called from within a parallel region? How can we enforce the use of the device copies?
When your routine is called within a device region (the parallel in your code), it is being called by the threads on the device, which means those threads will only have access to arrays on the device. The compiler may actually choose to inline that function, or it may be a device-side function call. That means that you can know that when the function is called from the device it will be receiving device copies of the data because the function is essentially inheriting the present data clause from the parallel region. If you still want to convince yourself that you're running on the device once inside the function, you could call acc_on_device, but that only tells you that you're running on the accelerator, not that you received a device pointer.
If you want to enforce the use of device copies more than that, you could make the routine nohost so that it would technically not be valid to call from the host, but that doesn't really do what you're asking, which is to do a check on the GPU that the array really is a device array.
Keep in mind though that any code inside a parallel region that is not inside a loop will be run gang-redundantly, so the write to out is likely a race condition, unless you happen to be running with one gang or you write to it using an atomic.
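As a concrete illustration of the single-gang option, the parallel region from the question could be written like this (a sketch only; the atomic-write variant would work as well):
#pragma acc parallel present(out,ARRAY[0:10],multiplier) num_gangs(1)
{
    /* With a single gang there is no redundant, racy write to out. */
    if (function(ARRAY, multiplier) == 50) {
        out = 1;
    } else {
        out = 0;
    }
}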
Basically, when you use a data clause, the device creates/copies the data into device memory, and the block of code marked with acc routine is executed on the device. Note that, unlike host multithreading (OpenMP), the host and device do not share memory. So yes, function will use the device copies of ARRAY and multiplier as long as it is inside the data region. Hope this helps! :)
You should give the routine an explicit parallelism level such as gang, worker, or vector; that is more precise.
The routine will use the data in device memory.

How to use arrays in program (global) scope in OpenCL

AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use such a globally scoped constant array. I have code like this in plain C:
#define SIZE 101
int *reciprocal_table;

int reciprocal(int number) {
    return reciprocal_table[number];
}

void kernel(int *output)
{
    for (int i = 0; i < SIZE; i++)
        output[i] = reciprocal(i);
}
I want to port it to OpenCL:
__kernel void kernel(__global int *output) {
    int gid = get_global_id(0);
    output[gid] = reciprocal(gid);
}

int reciprocal(int number) {
    return reciprocal_table[number];
}
What should I do with the global variable reciprocal_table? If I try to add __global or __constant to it, I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from the kernel to reciprocal. Is it possible to initialize a program-scope variable some other way? I know I can write the values directly into the code, but does another way exist?
P.S. I'm using AMD OpenCL.
UPD: The code above is just an example. My real code is much more complex, with a lot of functions, so I want an array in program scope that can be used in all of them.
UPD2: Changed the example code and added the citation from the Programming Guide.
#define SIZE 2
int constant array[SIZE] = {0, 1};

kernel void
foo (global int* input,
     global int* output)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initializer, but then you would not know what is in the array, and since it lives in the constant address space you could not assign any values to it.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now that I fully understand the question:
One thing that worked for me is to define the array in the constant address space and then shadow it by passing a kernel parameter constant int* of the same name, which overrides the array.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not override the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};

int
baz (int i)
{
    return foo[i];
}

kernel void
bar (global int* input,
     global int* output,
     constant int* foo)
{
    const uint id = get_global_id (0);
    output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OpenCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before it breaks.
Another way to achieve this would be to have the host generate .h files at runtime containing the definition of the array (or predefine them) and pass them to the kernel at compile time via a compiler option. This, of course, requires recompiling the clProgram/clKernel for every different LUT.
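A hedged sketch of that approach: have the host write the generated header next to the kernel source and point the OpenCL compiler at it with -I (the names lut.h, reciprocal_table and host_table are made up; error handling omitted):
#include <stdio.h>
#include <CL/cl.h>

/* Sketch: write the lookup table into lut.h and build the program with -I .
 * so the .cl source can simply #include "lut.h". */
static cl_int build_with_header(cl_program program, cl_device_id device,
                                const int *host_table, int size)
{
    FILE *h = fopen("lut.h", "w");
    fprintf(h, "__constant int reciprocal_table[%d] = {", size);
    for (int i = 0; i < size; i++)
        fprintf(h, "%d,", host_table[i]);
    fprintf(h, "};\n");
    fclose(h);

    return clBuildProgram(program, 1, &device, "-I.", NULL, NULL);
}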
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or program-scope array from the host via clEnqueueWriteBuffer or the like. The only way is to write it explicitly in your .cl source file.
So my trick to initialize it from the host is to exploit the fact that you compile your source from the host, which means you can alter your src.cl source text before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).

double func(int idx) {
    return lookup[idx];
}

__kernel void ker1(__global double *in, __global double *out)
{
    // ... do something ...
    double t = func(i);
    // ...
}
Notice that the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of your lookup table in host_values[];
on the host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
Then read the content of your source file src.cl and place it right at buf+count.
You now have a source buffer with an explicitly defined lookup table that you just computed on the host.
Compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, &err);
voilà !
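In case it helps, here is a hedged sketch of the host-side assembly as a whole (buffer sizing and names are assumptions; error handling is mostly omitted):
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Sketch: prepend "#define LOOKUP ..." to the contents of src.cl and hand
 * the combined text to clCreateProgramWithSource. */
static cl_program build_with_lookup(cl_context context,
                                    const double *host_values, int size)
{
    FILE *f = fopen("src.cl", "rb");
    fseek(f, 0, SEEK_END);
    long src_len = ftell(f);
    rewind(f);

    /* Room for the generated #define line plus the original source. */
    char *buf = malloc((size_t)src_len + 32 * (size_t)size + 64);
    int count = sprintf(buf, "#define LOOKUP ");
    for (int i = 0; i < size; i++)
        count += sprintf(buf + count, "%g, ", host_values[i]);
    count += sprintf(buf + count, "\n");

    if (fread(buf + count, 1, (size_t)src_len, f) != (size_t)src_len) {
        /* handle short read */
    }
    fclose(f);

    size_t src_sz = (size_t)count + (size_t)src_len;
    cl_int err;
    cl_program prog = clCreateProgramWithSource(context, 1,
                                                (const char **)&buf,
                                                &src_sz, &err);
    free(buf);
    return prog;
}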
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.
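A minimal sketch of what that looks like on the host (the argument index and names are assumptions; the kernel would then take the table as a constant int* parameter):
#include <CL/cl.h>

/* Sketch: create a device buffer for the table, copy the host data into it,
 * and bind it to a kernel argument. */
static cl_int upload_table(cl_context context, cl_command_queue queue,
                           cl_kernel kernel, const int *host_table,
                           size_t size)
{
    cl_int err;
    cl_mem table = clCreateBuffer(context, CL_MEM_READ_ONLY,
                                  size * sizeof(int), NULL, &err);
    if (err != CL_SUCCESS)
        return err;

    err = clEnqueueWriteBuffer(queue, table, CL_TRUE, 0,
                               size * sizeof(int), host_table,
                               0, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;

    /* Assumed argument index: the table is the kernel's second parameter. */
    return clSetKernelArg(kernel, 1, sizeof(cl_mem), &table);
}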
