I am stuck trying to figure out how to initialize a pointer to a 3d array inside a 2d structure array. I had it working fine when I could declare the structure like this:
#define ROUSSEAU 300
#define OSSO 500
#define MOJO 9000
typedef struct x_primitive
{
short avg[ROUSSEAU][OSSO][MOJO];
} xprimitive;
Unfortunately, the structure is too large to declare as a global, so I have to calloc the memory (192 GB available on the system, Windows 7 64-bit):
typedef struct x_primitive
{
short ***avg;
} xprimitive;
xprimitive **xPmtv, *_xPmtv;
void xallocatePmtvMemory(void)
{
int structureCount = 10;
unsigned __int64 pmtvStructureSize = ROUSSEAU * OSSO * MOJO * sizeof(short);
unsigned __int64 memoryBlockSize = structureCount * pmtvStructureSize;
_xPmtv = (xprimitive *) calloc(structureCount, pmtvStructureSize);
xPmtv = (xprimitive **) calloc(structureCount, sizeof(xprimitive *));
for ( int i = 0; i < structureCount; ++i)
{
unsigned __int64 index = i * pmtvStructureSize;
xPmtv[i] = &_xPmtv[ index ];
// **************** here is the problem ******
xPmtv[i]->avg[ROUSSEAU][OSSO][MOJO] = &_xPmtv[ index + (ROUSSEAU + OSSO + MOJO) ];
}
}
I am trying to assign the "avg" variable to a chunk of memory, and utterly failing.
Pointers and arrays aren't interchangeable in the way you seem to want them to be. I think you can do something much simpler. Make avg a pointer to the whole array:
typedef struct x_primitive
{
short (*avg)[ROUSSEAU][OSSO][MOJO];
} xprimitive;
And then allocate the space for the array at runtime:
xprimitive xPmtv;
xPmtv.avg = calloc(1, ROUSSEAU * OSSO * MOJO * sizeof(short));
Using it is a bit funny looking, though:
(*xPmtv.avg)[1][2][3]
If you have multiple structures, just put the initialization in a loop. A better idea may be to use a flexible array member, which keeps the usage syntax looking a bit more normal. It will cost you a dummy entry in the structure, because a flexible array member is not allowed to be the only member. Then again, why do you have a structure with only one field anyway?
typedef struct x_primitive
{
int dummyEntry;
short avg[][OSSO][MOJO];
} xprimitive;
To allocate one, you'd use:
xprimitive *xPmtv = calloc(1, sizeof(xprimitive) + ROUSSEAU * OSSO * MOJO * sizeof(short));
And access the array something like:
xPmtv->avg[1][2][3]
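For several structures, the pointer-to-array approach combined with a loop might look like the sketch below. The dimensions are shrunk to hypothetical DIM_A, DIM_B and DIM_C so that it runs on any machine, and small_primitive, alloc_primitives and free_primitives are illustrative names that do not appear in the original code.

```c
#include <stdlib.h>

/* Small stand-ins for ROUSSEAU, OSSO and MOJO (assumed values, for illustration only). */
#define DIM_A 3
#define DIM_B 4
#define DIM_C 5

typedef struct {
    short (*avg)[DIM_A][DIM_B][DIM_C];  /* pointer to one whole 3D array */
} small_primitive;

/* Free the first `count` arrays, then the structure array itself. */
void free_primitives(small_primitive *p, size_t count)
{
    if (p == NULL)
        return;
    for (size_t i = 0; i < count; ++i)
        free(p[i].avg);
    free(p);
}

/* Allocate `count` structures, each owning a zeroed 3D array.
   Returns NULL if any allocation fails. */
small_primitive *alloc_primitives(size_t count)
{
    small_primitive *p = calloc(count, sizeof *p);
    if (p == NULL)
        return NULL;
    for (size_t i = 0; i < count; ++i) {
        p[i].avg = calloc(1, sizeof *p[i].avg);  /* size of the whole 3D array */
        if (p[i].avg == NULL) {
            free_primitives(p, i);  /* release what was allocated so far */
            return NULL;
        }
    }
    return p;
}
```

An element is then reached as (*p[i].avg)[a][b][c], and free_primitives(p, count) releases everything again.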
I'm studying SYCL at university and I have a question about performance of a code.
In particular I have this C/C++ code:
And I need to translate it in a SYCL kernel with parallelization and I do this:
#include <sycl/sycl.hpp>
#include <vector>
#include <iostream>
using namespace sycl;
constexpr int size = 131072; // 2^17
int main(int argc, char** argv) {
//Create a vector with size elements and initialize them to 1
std::vector<float> dA(size);
try {
queue gpuQueue{ gpu_selector{} };
buffer<float, 1> bufA(dA.data(), range<1>(dA.size()));
gpuQueue.submit([&](handler& cgh) {
accessor inA{ bufA,cgh };
cgh.parallel_for(range<1>(size),
[=](id<1> i) { inA[i] = inA[i] + 2; }
);
});
gpuQueue.wait_and_throw();
}
catch (std::exception& e) { throw e; }
}
So my question is about the value c. Here I use the literal 2 directly; will that affect performance when I run the code? Do I need to create a variable, or is this way correct and the performance good?
Thanks in advance for the help!
Interesting question. In this case the value 2 will be a literal in the instruction in your SYCL kernel - this is as efficient as it gets, I think! There's the slight complication that you have an implicit cast from int to float. My guess is that you'll probably end up with a float literal 2.0 in your device assembly. Your SYCL device won't have to fetch that 2 from memory or cast at runtime or anything like that, it just lives in the instruction.
Equally, if you had:
constexpr int c = 2;
// the rest of your code
[=](id<1> i) { inA[i] = inA[i] + c; }
// etc
The compiler is almost certainly smart enough to propagate the constant value of c into the kernel code. So, again, the 2.0 literal ends up in the instruction.
I compiled your example with DPC++ and extracted the LLVM IR, and found the following lines:
%5 = load float, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
%add.i = fadd float %5, 2.000000e+00
store float %add.i, float addrspace(4)* %arrayidx.ascast.i.i, align 4, !tbaa !17
This shows a float load & store to/from the same address, with an 'add 2.0' instruction in between. If I modify to use the variable c like I demonstrated, I get the same LLVM IR.
Conclusion: you've already achieved maximum efficiency, and compilers are smart!
GCC supports compound literals, where an anonymous structure can be filled in using a {...} initialiser.
GCC also accepts (with limitations) variable-length structures, provided the last element is the variable-length item.
I would like to use macros to fill out lots of tables where most of the data is fixed at compile time and only a few fields change.
My structures are complicated, so here is a simpler working example to demonstrate how the macro is to be used.
#include <stdio.h>
typedef unsigned short int uint16_t;
typedef unsigned long size_t;
#define CONSTANT -20
// The data we are storing, we don't need to fill all fields every time
typedef struct dt {
uint16_t a;
const int b;
} data_t;
// An incomplete structure definition that matches the general shape
typedef struct ct {
size_t size;
data_t data;
char name[];
} complex_t;
// A typedef to make the code look cleaner
typedef complex_t * complex_t_ptr;
// A macro to generate instances of objects
#define CREATE(X, Y) (complex_t_ptr)&((struct { \
size_t size; \
data_t data; \
char name[sizeof(X)]; \
} ) { \
.size = sizeof(X), \
.data = { .a = Y, .b = CONSTANT }, \
.name = X \
})
// Create a number of structure instances and put pointers to those objects into an array
// Note each object may be a different size.
complex_t_ptr data_table[] = {
CREATE("DATA1", 1),
CREATE("DATA2_LONGER", 2),
CREATE("D3S", 3),
};
static size_t DATA_TABLE_LEN = sizeof(data_table) / sizeof(typeof(0[data_table]));
int main(int argc, char **argv)
{
for(uint16_t idx=0; idx<DATA_TABLE_LEN; idx++)
{
complex_t_ptr p = data_table[idx];
printf("%15s = (%3u, %3d) and is %3lu long\n", p->name, p->data.a, p->data.b, p->size);
}
return 0;
}
$ gcc test_macro.c -o test_macro
$ ./test_macro
DATA1 = ( 1, -20) and is 6 long
DATA2_LONGER = ( 2, -20) and is 13 long
D3S = ( 3, -20) and is 4 long
So far so good...
Now, what if we want to create a more complicated object?
//... skipping the rest as hopefully you have the idea by now
// A more complicated data structure
typedef struct dt2 {
struct {
unsigned char class[10];
unsigned long start_address;
} xtra;
uint16_t a;
const int b;
} data2_t;
// A macro to generate instances of objects
#define CREATE2(X, Y, XTRA) (complex2_t_ptr)&((struct { \
size_t size; \
data2_t data; \
char name[sizeof(X)]; \
} ) { \
.size = sizeof(X), \
.data = { .xtra = XTRA, .a = Y, .b = CONSTANT }, \
.name = X \
})
// Again create the table
complex2_t_ptr bigger_data_table[] = {
CREATE2("DATA1", 1, {"IO_TBL", 0x123456L}),
CREATE2("DATA2_LONGER", 2, {"BASE_TBL", 0xABC123L}),
CREATE2("D3S", 3, {"MAIN_TBL", 0x555666L << 2}),
};
//...
But there is a problem. This does not compile, as the preprocessor gets confused by the commas between the structure members.
The comma in the passed structure members is seen by the macro, which thinks there are extra parameters.
GCC says you can put brackets round terms where you want to keep the commas, like this
MACRO((keep, the, commas))
e.g. In this case, that would be
CREATE_EXTRA("DATA1", 1, ({"IO_TBL", 0x123456L}) )
But that would not work with a structure as we'd get
.xtra = ({"IO_TBL", 0x123456L})
Which is not a valid initialiser.
The other option would be
CREATE_EXTRA("DATA1", 1, {("IO_TBL", 0x123456L)} )
Which results in
.xtra = {("IO_TBL", 0x123456L)}
Which is also not valid
And if we put the braces inside the macro
.xtra = {EXTRA}
...
CREATE_EXTRA("DATA1", 1, ("IO_TBL", 0x123456L) )
We get the same
Obviously some might say "just pass the elements of XTRA one at a time".
Remember this is a simple, very cut-down example; in practice, doing that would lose information and make the code much harder to understand. It would be harder to maintain but easier to read if the structures were just copied out longhand.
So the question is, "how to pass compound literal structures to macros as initialisers without getting tripped up by the commas between fields".
NOTE I am stuck with C11 on GCC4.8.x, so C++ or any more recent GCC is not possible.
So there is a way, though I can't find it mentioned on the GCC pages for macros.
I found what I needed in this article: Comma omission and comma deletion
The following works.
typedef struct _array_data {
size_t size;
char * data;
}array_data_t;
#define ARRAY_DATA(ARRAY...) (char *) \
&(array_data_t) { \
sizeof((char []){ARRAY}), \
(char []){ARRAY} \
}
char * my_array = ARRAY_DATA(1,2,3,4);
size_t sent = send_packet(my_array);
if (sent != ((array_data_t *)my_array)->size) ERROR("Not all data sent");
There are some interesting aspects to this.
1: Unlike the example in the gcc manual, the brackets are omitted round the {ARRAY}. In the document, the example uses (cast)({structure}) rather than (cast){structure}. In fact it looks like the brackets are never needed and just confuse the compiler in some cases (like when you take the address).
2: The use of the cast (char []) rather than (char *) as one would have thought to be correct.
3: It makes sense in hindsight, but the sizeof operand needs the (char []) cast too, as otherwise the compiler could not know the size of the individual elements.
For completeness, the macro in the example above expands to:
char * my_array = (char *)&(array_data_t) {
sizeof((char []){1,2,3,4}),
(char []){1,2,3,4}
};
And my_array is a pointer to a structure that looks like this.
* my_array = {
size_t size = 4,
char data[4] = {1,2,3,4}
}
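Applying the same variadic idea to the CREATE2 macro from the question might look like the sketch below. It uses the standard __VA_ARGS__ form rather than the GNU named form (ARRAY...), and it assumes that complex2_t mirrors complex_t with data2_t as its payload, since the question elides that definition. The brace-enclosed initialiser arrives as the variadic part, so its commas are pasted back unchanged.

```c
#include <stddef.h>
#include <stdint.h>

#define CONSTANT -20

/* The more complicated payload from the question. */
typedef struct dt2 {
    struct {
        unsigned char class[10];
        unsigned long start_address;
    } xtra;
    uint16_t a;
    const int b;
} data2_t;

/* Assumed shape: same layout as complex_t, but carrying data2_t. */
typedef struct ct2 {
    size_t size;
    data2_t data;
    char name[];
} complex2_t;

typedef complex2_t * complex2_t_ptr;

/* Everything after Y is collected into __VA_ARGS__, commas included. */
#define CREATE2(X, Y, ...) (complex2_t_ptr)&((struct { \
    size_t size; \
    data2_t data; \
    char name[sizeof(X)]; \
} ) { \
    .size = sizeof(X), \
    .data = { .xtra = __VA_ARGS__, .a = Y, .b = CONSTANT }, \
    .name = X \
})

complex2_t_ptr bigger_data_table[] = {
    CREATE2("DATA1", 1, {"IO_TBL", 0x123456L}),
    CREATE2("DATA2_LONGER", 2, {"BASE_TBL", 0xABC123L}),
    CREATE2("D3S", 3, {"MAIN_TBL", 0x555666L << 2}),
};
```

The call sites stay exactly as written in the question; only the macro's parameter list and the .xtra line change.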
This is a class which contains image data.
class MyMat
{
public:
int width, height, format;
uint8_t *data;
};
I want to design MyMat with automatic memory management. The image data could be shared among many objects.
Common APIs which I'm going to design:
+) C++ 11
+) Assignment : share data
MyMat a2(w, h, fmt);
.................
a2 = a1;
+) Accessing data should be simple and short.
Can use raw pointer directly.
In general, I want MyMat to behave like OpenCV's cv::Mat.
Could you suggest a proper design?
1) Using std::vector<uint8_t> data
I have to write some code to delete the copy constructor and copy assignment operator, because someone could call them and cause a memory copy.
The compiler must support copy elision and return value optimization.
Always using move assignment and passing by reference is inconvenient:
a2 = std::move(a1)
void test(MyMat &mat)
std::queue<MyMat> lists;
lists.push(std::move(a1))
..............................
2) Use shared_ptr<uint8_t> data
Following this guideline http://www.codingstandard.com/rule/17-3-4-do-not-create-smart-pointers-of-array-type/,
we shouldn't create smart pointers of array type.
3) Use shared_ptr< std::vector<uint8_t> > data
To access data, use (*a1.data)[0]; the syntax is very inconvenient
4) Use raw pointer, uint8_t *data
Write proper constructor and destructor for this class.
To get automatic memory management, use a smart pointer.
shared_ptr<MyMat> mat
std::queue< shared_ptr<MyMat> > lists;
Matrix classes are normally expected to be a value type with deep copying. So, stick with std::vector<uint8_t> and let the user decide whether copy is expensive or not in their specific context.
Instead of raw pointers for arrays prefer std::unique_ptr<T[]> (note the square brackets).
std::array - fixed length in-place buffer (beautified array)
std::vector - variable length buffer
std::shared_ptr - shared ownership data
std::weak_ptr - expiring view on shared data
std::unique_ptr - unique ownership
std::string_view, std::span, std::ref, &, * - reference to data with no assumption of ownership
The simplest design is to have a single owner and an RAII-enforced lifetime, ensuring that everything that needs to be alive at a certain time is alive and needs no other ownership. So generally I'd see if I could live with std::unique_ptr<T> before complicating things further (unless I can fit all my data on the stack, in which case I don't even need a unique_ptr).
On a side note - shared pointers are not free, they need dynamic memory allocation for the shared state (two allocations if done incorrectly :) ), whereas unique pointers are true "zero" overhead RAII.
Matrices should use value semantics, and they should be nearly free to move.
Matrices should support a view type as well.
There are two approaches for a basic Matrix that make sense.
First, a Matrix type that wraps a vector<T> with a stride field. This has an overhead of 3 instead of 2 pointers (or 1 pointer and a size) compared to a hand-rolled one. I don't consider that significant; the ease of debugging a vector<T> etc makes it more than worth that overhead.
In this case you'd want to write a separate MatrixView.
I'd use CRTP to create a common base class for both to implement operator[] and stride fields.
A distinct basic Matrix approach is to make your Matrix immutable. In this case, the Matrix wraps a std::shared_ptr<T const> and a std::shared_ptr<std::mutex> and (local, or stored with the mutex) width, height and stride field.
Copying such a Matrix just duplicates handles.
Modifying such a Matrix causes you to acquire the std::mutex, then check that shared_ptr<T const> has a use_count()==1. If it does, you cast-away const and modify the data referred to in the shared_ptr. If it does not, you duplicate the buffer, create a new mutex, and operate on the new state.
Here is a copy on write matrix buffer:
template<class T>
struct cow_buffer {
std::size_t rows() const { return m_rows; }
std::size_t cols() const { return m_cols; }
cow_buffer( T const* in, std::size_t rows, std::size_t cols, std::size_t stride ) {
copy_in( in, rows, cols, stride );
}
void copy_in( T const* in, std::size_t rows, std::size_t cols, std::size_t stride ) {
// note it isn't *really* const, this matters:
auto new_data = std::make_shared<T[]>( rows*cols );
for (std::size_t i = 0; i < rows; ++i )
std::copy( in+i*stride, in+i*stride+cols, new_data.get()+i*cols );
m_data = new_data;
m_rows = rows;
m_cols = cols;
m_stride = cols;
m_lock = std::make_shared<std::mutex>();
}
template<class F>
decltype(auto) read( F&& f ) const {
return std::forward<F>(f)( m_data.get() );
}
template<class F>
decltype(auto) modify( F&& f ) {
auto lock = std::unique_lock<std::mutex>(*m_lock);
if (m_data.use_count()==1) {
return std::forward<F>(f)( const_cast<T*>(m_data.get()) );
}
auto old_data = m_data;
copy_in( old_data.get(), m_rows, m_cols, m_stride );
return std::forward<F>(f)( const_cast<T*>(m_data.get()) );
}
explicit operator bool() const { return m_data && m_lock; }
private:
std::shared_ptr<T const[]> m_data;
std::shared_ptr<std::mutex> m_lock;
std::size_t m_rows = 0, m_cols = 0, m_stride = 0;
};
something like that.
The mutex is required to ensure synchronization between multiple threads that are sole owners modifying m_data, so that the data from the previous write is synchronized with the current one.
I use the following C structs in my C++11 code (the code comes from liblwgeom of PostGis, but this is not the core of the question). The code is compiled with the following options using g++-4.8:
-std=c++11 -Wall -Wextra -pedantic-errors -pedantic -Werror
and I don't get any errors during compilation (or warnings) (should I get any?)
Question
Is it safe to use LWPOLY (actually pointed to by LWGEOM*) in functions that accept LWGEOM and don't modify the void *data; member? I understand that this is poor man's inheritance, but this is what I need to work with.
Details
POLYGON:
typedef struct
{
uint8_t type; /* POLYGONTYPE */
uint8_t flags;
GBOX *bbox;
int32_t srid;
int nrings; /* how many rings we are currently storing */
int maxrings; /* how many rings we have space for in **rings */
POINTARRAY **rings; /* list of rings (list of points) */
}
LWPOLY; /* "light-weight polygon" */
LWGEOM:
typedef struct
{
uint8_t type;
uint8_t flags;
GBOX *bbox;
int32_t srid;
void *data;
}
LWGEOM;
POINTARRAY:
typedef struct
{
/* Array of POINT 2D, 3D or 4D, possibly missaligned. */
uint8_t *serialized_pointlist;
/* Use FLAGS_* macros to handle */
uint8_t flags;
int npoints; /* how many points we are currently storing */
int maxpoints; /* how many points we have space for in serialized_pointlist */
}
POINTARRAY;
GBOX:
typedef struct
{
uint8_t flags;
double xmin;
double xmax;
double ymin;
double ymax;
double zmin;
double zmax;
double mmin;
double mmax;
} GBOX;
Am I violating the strict aliasing rule when I do something like this?
const LWGEOM* lwgeom;
...
const LWPOLY* lwpoly = reinterpret_cast<const LWPOLY*>(lwgeom);
I know that in PostGis types are specifically designed to be "compatible" however I'd like to know if I am violating the standard by doing so.
Also, I noticed that PostGis is not compiled with strict aliasing disabled by default (at least version 2.1.5).
Solution
My colleague helped me investigate, and it seems the answer is no, it doesn't violate strict aliasing, but only if we access LWGEOM's members that have the same types as LWPOLY's and are laid out contiguously at the beginning of the struct. Here is why (quoting the standard):
3.10.10 says that you can access a member through a pointer to "aggregate or union".
8.5.1 defines aggregates (C structs are aggregates):
An aggregate is an array or a class (Clause 9) with no user-provided constructors (12.1), no private or
protected non-static data members (Clause 11), no base classes (Clause 10), and no virtual functions (10.3).
9.2.19 says that a pointer to the struct is the same as a pointer to the first member for standard layout classes (C structs are standard layout).
Whether this is a safe way to code is a different question.
Yes, it violates the strict aliasing rule. LWGEOM and LWPOLY are unrelated types, and so are int and void*. So, for example, modification to lwgeom->data may not be read through lwpoly->nrings and vice versa.
I validated this with GCC4.9. My code is as follows:
#include <cinttypes>
#include <iostream>
using namespace std;
typedef struct {
uint8_t type; /* POLYGONTYPE */
uint8_t flags;
int32_t srid;
int nrings; /* how many rings we are currently storing */
} LWPOLY; /* "light-weight polygon" */
typedef struct {
uint8_t type;
uint8_t flags;
int32_t srid;
void *data;
} LWGEOM;
void f(LWGEOM* pgeom, LWPOLY* ppoly) {
ppoly->nrings = 7;
pgeom->data = 0;
std::cout << ppoly->nrings << '\n';
}
int main() {
LWGEOM geom = {};
LWGEOM* pgeom = &geom;
LWPOLY* ppoly = (LWPOLY*)pgeom;
f(pgeom, ppoly);
}
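Guess what: the output is 7. The store to pgeom->data overlaps nrings in memory, but the compiler is allowed to assume that the two cannot alias, so it reuses the 7 it just stored.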
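One way to sidestep the aliasing question for the shared leading fields is to copy their bytes out instead of dereferencing through an unrelated struct type. The sketch below is an assumption-laden illustration, not PostGIS code: poly_t and geom_t are cut-down stand-ins for LWPOLY and LWGEOM, and read_srid is a hypothetical helper. It relies on srid sitting at the same offset in both structs, which the static assertion checks at compile time.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Cut-down stand-ins for LWPOLY and LWGEOM (illustrative only). */
typedef struct {
    uint8_t type;
    uint8_t flags;
    int32_t srid;
    int nrings;
} poly_t;

typedef struct {
    uint8_t type;
    uint8_t flags;
    int32_t srid;
    void *data;
} geom_t;

/* The trick only works if the shared field sits at the same offset. */
_Static_assert(offsetof(poly_t, srid) == offsetof(geom_t, srid),
               "srid must be at the same offset in both structs");

/* Hypothetical helper: read srid from any object that starts with the
   shared header. memcpy reads the object representation byte by byte,
   which is permitted regardless of the object's declared type. */
int32_t read_srid(const void *obj)
{
    int32_t srid;
    memcpy(&srid, (const unsigned char *)obj + offsetof(geom_t, srid), sizeof srid);
    return srid;
}
```

Compilers typically turn such a fixed-size memcpy into a single load, so the cost compared with a direct member access is usually negligible.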
AMD OpenCL Programming Guide, Section 6.3 Constant Memory Optimization:
Globally scoped constant arrays. These arrays are initialized,
globally scoped, and in the constant address space (as specified in
section 6.5.3 of the OpenCL specification). If the size of an array is
below 64 kB, it is placed in hardware constant buffers; otherwise, it
uses global memory. An example of this is a lookup table for math
functions.
I want to use this "globally scoped constant array". I have such code in pure C
#define SIZE 101
int *reciprocal_table;
int reciprocal(int number){
return reciprocal_table[number];
}
void kernel(int *output)
{
for(int i=0; i < SIZE; i++)
output[i] = reciprocal(i);
}
I want to port it into OpenCL
__kernel void kernel(__global int *output){
int gid = get_global_id(0);
output[gid] = reciprocal(gid);
}
int reciprocal(int number){
return reciprocal_table[number];
}
What should I do with global variable reciprocal_table? If I try to add __global or __constant to it I get an error:
global variable must be declared in addrSpace constant
I don't want to pass __constant int *reciprocal_table from kernel to reciprocal. Is it possible to initialize the global variable somehow? I know that I can write the values out in the code, but is there another way?
P.S. I'm using AMD OpenCL
UPD The code above is just an example. My real code is much more complex and has a lot of functions, so I want the array at program scope so that all functions can use it.
UPD2 Changed example code and added citation from Programming Guide
#define SIZE 2
int constant array[SIZE] = {0, 1};
kernel void
foo (global int* input,
global int* output)
{
const uint id = get_global_id (0);
output[id] = input[id] + array[id];
}
I can get the above to compile with Intel as well as AMD. It also works without the initialization of the array but then you would not know what's in the array and since it's in the constant address space, you could not assign any values.
Program global variables have to be in the __constant address space, as stated by section 6.5.3 in the standard.
UPDATE Now, that I fully understood the question:
One thing that worked for me is to define the array in the constant address space and then pass a kernel parameter constant int* with the same name, which replaces it.
That produced correct results only on the GPU device. The AMD CPU device and the Intel CPU device did not replace the array's address. It is also probably not compliant with the standard.
Here's how it looks:
#define SIZE 2
int constant foo[SIZE] = {100, 100};
int
baz (int i)
{
return foo[i];
}
kernel void
bar (global int* input,
global int* output,
constant int* foo)
{
const uint id = get_global_id (0);
output[id] = input[id] + baz (id);
}
For input = {2, 3} and foo = {0, 1} this produces {2, 4} on my HD 7850 device (Ubuntu 12.10, Catalyst 9.0.2). But on the CPU I get {102, 103} with either OpenCL implementation (AMD, Intel). So I cannot stress enough how much I personally would NOT do this, because it's only a matter of time before it breaks.
Another way to achieve this would be to generate .h files on the host at runtime containing the definition of the array (or predefine them) and pass them to the kernel at compilation via a compiler option. This, of course, requires recompiling the clProgram/clKernel for every different LUT.
I struggled to get this to work in my own program some time ago.
I did not find any way to initialize a constant or global scope array from the host via clEnqueueWriteBuffer or similar. The only way is to write it out explicitly in your .cl source file.
So my trick to initialize it from the host is to use the fact that you are actually compiling your source from the host, which means you can alter your src.cl file before compiling it.
First my src.cl file reads:
__constant double lookup[SIZE] = { LOOKUP }; // precomputed table (in constant memory).
double func(int idx) {
return(lookup[idx]);
}
__kernel void ker1(__global double *in, __global double *out)
{
... do something ...
double t = func(i);
...
}
Notice that the lookup table is initialized with LOOKUP.
Then, in the host program, before compiling your OpenCL code:
compute the values of the lookup table in host_values[]
on your host, run something like:
char *buf = (char*) malloc( 10000 );
int count = sprintf(buf, "#define LOOKUP "); // actual source generation !
for (int i=0;i<SIZE;i++) count += sprintf(buf+count, "%g, ",host_values[i]);
count += sprintf(buf+count,"\n");
then read the content of your source file src.cl and place it right at buf+count.
you now have a source file with an explicitly defined lookup table that you just computed on the host.
compile your buffer with something like clCreateProgramWithSource(context, 1, (const char **) &buf, &src_sz, &err);
voilà!
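The source-generation step above can be sketched with bounds checking, so that a long table cannot overrun the buffer. build_lookup_define is a hypothetical helper name, and the %.17g format is an assumption chosen so that doubles survive the round trip through text; it also avoids the trailing comma of the original loop.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes "#define LOOKUP v0, v1, ...\n" into out.
   Returns the number of characters written (excluding the terminator),
   or -1 if the buffer is too small or formatting fails. */
int build_lookup_define(const double *values, size_t n, char *out, size_t out_sz)
{
    size_t used;
    int w = snprintf(out, out_sz, "#define LOOKUP ");
    if (w < 0 || (size_t)w >= out_sz)
        return -1;
    used = (size_t)w;

    for (size_t i = 0; i < n; ++i) {
        /* Separate entries with ", " but emit no trailing comma. */
        w = snprintf(out + used, out_sz - used, "%s%.17g",
                     i ? ", " : "", values[i]);
        if (w < 0 || (size_t)w >= out_sz - used)
            return -1;
        used += (size_t)w;
    }

    w = snprintf(out + used, out_sz - used, "\n");
    if (w < 0 || (size_t)w >= out_sz - used)
        return -1;
    used += (size_t)w;
    return (int)used;
}
```

The returned length plays the role of count in the original snippet: the contents of src.cl would be appended at out + length before the buffer is handed to clCreateProgramWithSource.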
It looks like "array" is a look-up table of sorts. You'll need to clCreateBuffer and clEnqueueWriteBuffer so the GPU has a copy of it to use.