I have been testing the Tensor module from Eigen3 for a new project.
Even though the module is not yet finished, it seems to have most of the functionality I need.
But there is one part that I don't quite get. Whenever I have a big Tensor and I want to extract a slice from it, Eigen makes a copy of the data.
Is there a way to not copy the data, but instead point to the original data block in the slice?
For example if I do:
Tensor<float, 3> A(100,1000,1000); A.setZero();
Eigen::array<int, 3> offsets = {0, 0, 0};
Eigen::array<int, 3> extents = {2, 2, 2};
Tensor<float, 3> c = A.slice(offsets, extents);
A(0,0,0) = 1.0;
cerr << c << endl;
But the first element of "c" is still zero, instead of mapping to the modified "A(0,0,0)" data block.
You can use a TensorMap to create a tensor based on the shared memory space of your slice. However, this only works if your slice occupies a contiguous portion of the data array. Otherwise you would need to do some tensor arithmetic to figure out the begin and end 1D indices of the various parts of your slice.
TensorMap<Tensor<float, 3, RowMajor> > row_major(data, ...);
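For example, here is a minimal sketch (untested) for the tensor from the question. Note that the 2x2x2 slice above is not contiguous, so this maps the first two 100x1000 panels instead, which are contiguous in the default column-major layout:
#include <unsupported/Eigen/CXX11/Tensor>
#include <iostream>

int main() {
    Eigen::Tensor<float, 3> A(100, 1000, 1000);
    A.setZero();
    // Map the first two 100x1000 panels of A; no data is copied.
    Eigen::TensorMap<Eigen::Tensor<float, 3>> c(A.data(), 100, 1000, 2);
    A(0, 0, 0) = 1.0;
    std::cerr << c(0, 0, 0) << std::endl; // prints 1, since c aliases A's storage
    return 0;
}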
I load an Eigen matrix A(5,12), and I would like to assign to a new Eigen vector the first 7 values of the first row of matrix A.
Somehow, it doesn't work...
Later I realized that block returns a view of the original data. How do I deep copy the block into an Eigen vector?
Eigen::MatrixXd A(5,12);
Eigen::VectorXd B(12); B = A.row(0);
Eigen::VectorXd C(7); C = B.head(7);
Block methods like block, col, row, head, etc. return views on the original data, but operator = always performs a deep copy, so you can simply write:
VectorXd C = A.row(0).head(7);
This will perform a single deep copy. With the Eigen 3.4 slicing API, you'll also be able to write:
VectorXd C = A(0,seqN(0,7));
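A quick way to check that C really owns its data (a small sketch, not part of the original answer):
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(5, 12);
    // row()/head() are views, but the assignment below deep copies into C.
    Eigen::VectorXd C = A.row(0).head(7);
    A(0, 0) = 42.0;                 // modify A after the copy...
    std::cout << C(0) << std::endl; // ...C(0) still holds the old value
    return 0;
}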
I have a flattened (1D) U32-encoded image array which has r, g, and b 8-bit channel values encoded into the first 24 bits of each U32. I would like to expand this array into an array of U8s that each store a separate r, g, or b value (0-255). The issue is that I need this to happen really fast (hundreds of times per second on an old computer) and the method I created is slow.
I am a novice at LabVIEW, so I am not exactly sure what a faster way to do this is.
I have successfully accomplished this by creating a U8 array, iterating through each index of the U32 image array, and assigning the corresponding 3 RGB values to the appropriate indices in the U8 array using a shift register. I attempted to use the In Place Element structure (which would presumably not require copying data between loop iterations like the shift register does), but I could not figure out how to make it work inside the loop, and when I tried to return the last array from the loop, only the last element was modified.
Here is the first, working method I described above:
In C/C++, it would be pretty simple (something like this):
uint8_t* convert_img(uint32_t img[640*480]){
    uint8_t *img_u8 = new uint8_t[640*480*3];
    for (int i=0; i<640*480; ++i){
        img_u8[i*3]     = img[i] & 0xff;         // r
        img_u8[i*3 + 1] = (img[i] >> 8) & 0xff;  // g
        img_u8[i*3 + 2] = (img[i] >> 16) & 0xff; // b
    }
    return img_u8;
}
The working LabVIEW example above only runs at 20 Hz! I think this is super slow for such a simple operation. Does anyone with more experience have a suggestion for how I can make this happen quickly with LabVIEW code?
I would do it like this:
U32 to U8
The steps are:
Flatten To String - endian chooses which order the bytes are in
Unflatten From String - into a 1D U8 array
Decimate 1D Array - creates 4 1D arrays
Reshape Array - turns into 640x480 arrays
Should be plenty fast enough.
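For comparison with the question's C code, here is a rough C++ sketch of what the Flatten + Decimate combination does with the bytes (illustrative only, not LabVIEW; it assumes little-endian data so that byte 0 of each pixel is the r value):
#include <cstdint>
#include <vector>

// Reinterpret the U32 pixels as raw bytes, then pull out every 4th byte per channel.
std::vector<std::vector<uint8_t>> decimate(const std::vector<uint32_t>& img) {
    const uint8_t* bytes = reinterpret_cast<const uint8_t*>(img.data());
    std::vector<std::vector<uint8_t>> channels(4); // r, g, b, unused
    for (auto& ch : channels)
        ch.reserve(img.size());
    for (std::size_t i = 0; i < img.size() * 4; ++i)
        channels[i % 4].push_back(bytes[i]);
    return channels;
}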
I expect that the fastest option would be using the Split Numbers primitive to break the U32s into U8s, but you would need to actually test it.
Also note that testing performance is not as easy as you might think, although if you're looking at the overall rate, you're probably fine with the basic testing.
I have a data array (double *) in memory which looks like:
[x0,y0,z0,junk,x1,y1,z1,junk,...]
I would like to map it to an Eigen vector and virtually remove the junk values by doing something like:
Eigen::Map<
Eigen::Matrix<double, Eigen::Dynamic, 1, Eigen::ColMajor>,
Eigen::Unaligned,
Eigen::OuterStride<4>
>
But it does not work because the OuterStride seems to be restricted to 2D matrices.
Is there a trick to do what I want?
Many thanks!
With the head of Eigen, you can map it as a 2D matrix and then view it as a 1D vector:
auto m1 = Matrix<double,3,Dynamic>::Map(ptr, 3, n, OuterStride<4>());
auto v = m1.reshaped(); // new in future Eigen 3.4
But be aware that accesses to such a v involve costly integer division/modulo operations.
If you want a solution compatible with Eigen 3.3, you can do something like this:
VectorXd convert(double const* ptr, Index n)
{
    VectorXd res(n*3);
    Matrix3Xd::Map(res.data(), 3, n) = Matrix4Xd::Map(ptr, 4, n).topRows<3>();
    return res;
}
But this of course would copy the data, which you probably intended to avoid.
Alternatively, you should think about whether it is possible to access your data as a 3xN array/matrix instead of a flat vector (really depends on what you are actually doing).
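If the data must stay in its [x,y,z,junk] layout, a strided 2D map also gives a non-copying 3xN view (a small sketch with illustrative data, in the spirit of the first snippet above):
#include <Eigen/Dense>
#include <iostream>

int main() {
    // Two points plus junk, laid out as in the question: x0,y0,z0,junk,x1,y1,z1,junk
    double data[] = {1, 2, 3, -1, 4, 5, 6, -1};
    // 3x2 column-major view; the outer stride of 4 skips the junk entry of each block.
    Eigen::Map<Eigen::Matrix<double, 3, Eigen::Dynamic>, 0, Eigen::OuterStride<4>> xyz(data, 3, 2);
    std::cout << xyz.col(1).transpose() << std::endl; // prints "4 5 6"; no copy made
    return 0;
}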
What is the correct way to initialize a tensor in the ARM compute library? I have not found any documentation on what is the correct way to do it.
The tensor I have contains floats (F32). I can write data directly by accessing the underlying data through the buffer() interface, which returns a pointer to uint8_t. However, I am not sure how to figure out the data layout because it does not appear to be contiguous, i.e. if I write 4 floats to a 4x1 tensor,
Tensor x{};
x.allocator()->init(TensorInfo(4, 1, Format::F32));
float xdata[] = {1, 2, 3, 4};
FILE *fd = fmemopen(x.buffer(), 4 * sizeof(float), "wb");
fwrite(xdata, sizeof(float), 4, fd);
fclose(fd);
x.print(std::cout);
This prints out,
1 2 3 1.17549e-38
The first 3 elements of 'x' are initialized, but the last one is not. If I change the fwrite line to,
fwrite(xdata, sizeof(float), 6, fd);
then the output is
1 2 3 4
So it may be that there are more bytes being allocated than necessary for 4 floats, or this could be some misleading coincidence. Either way, this is not the right way to initialize the values of the tensor.
Any help would be greatly appreciated.
From the ARM Compute Library documentation (v18.08), it looks like the right way to initialize in your case would be the "import_memory" function. See the example here: https://github.com/ARM-software/ComputeLibrary/blob/master/tests/validation/NEON/UNIT/TensorAllocator.cpp
I think you have to allocate the tensor as well.
More precisely:
Tensor x{};
x.allocator()->init(TensorInfo(4, 1, Format::F32)); // Set the metadata
x.allocator()->allocate(); // Now the memory has been allocated
float xdata[] = {1, 2, 3, 4};
memcpy(x.buffer(), xdata, 4 * sizeof(float)); // copy the 4 floats into the tensor's buffer
x.print(std::cout);
This code is not tested, but it should give you a fairly good idea!
I'm currently creating a game using Go. I'm measuring the FPS, and I'm noticing about a 7 fps loss when using a for loop to append to a slice like so:
vertexInfo := Opengl.OpenGLVertexInfo{}
for i := 0; i < 4; i = i + 1 {
    vertexInfo.Translations = append(vertexInfo.Translations, float32(s.x), float32(s.y), 0)
    vertexInfo.Rotations = append(vertexInfo.Rotations, 0, 0, 1, s.rot)
    vertexInfo.Scales = append(vertexInfo.Scales, s.xS, s.yS, 0)
    vertexInfo.Colors = append(vertexInfo.Colors, s.r, s.g, s.b, s.a)
}
I'm doing this for every sprite, every draw. The question is: why do I get such a huge performance hit from just looping four times and appending the same thing to these slices? Is there a more efficient way to do this? It is not like I'm adding an exorbitant amount of data. Each slice contains about 16 elements, as shown above (4 x 4).
When I simply put all 16 elements in one []float32{1..16}, the fps improves by about 4.
Update: I benchmarked each append and it seems that each one costs about 1 fps to perform... That seems like a lot considering this data is pretty static... I only need 4 iterations...
Update: Added github repo https://github.com/Triangle345/GT
The builtin append() needs to create a new backing array if the capacity of the destination slice is less than what the length of the slice would be after the append. This also requires copying the current elements from the destination to the newly allocated array, so there is a lot of overhead.
The slices you append to are most likely empty (nil) slices, since you used an empty composite literal to create your Opengl.OpenGLVertexInfo value. Even though append() plans ahead and allocates a bigger array than what is needed to append the specified elements, chances are that in your case multiple reallocations will be needed to complete the 4 iterations.
You may avoid reallocations if you create and initialize vertexInfo like this:
vertexInfo := Opengl.OpenGLVertexInfo{
    Translations: []float32{float32(s.x), float32(s.y), 0, float32(s.x), float32(s.y), 0, float32(s.x), float32(s.y), 0, float32(s.x), float32(s.y), 0},
    Rotations:    []float64{0, 0, 1, s.rot, 0, 0, 1, s.rot, 0, 0, 1, s.rot, 0, 0, 1, s.rot},
    Scales:       []float64{s.xS, s.yS, 0, s.xS, s.yS, 0, s.xS, s.yS, 0, s.xS, s.yS, 0},
    Colors:       []float64{s.r, s.g, s.b, s.a, s.r, s.g, s.b, s.a, s.r, s.g, s.b, s.a, s.r, s.g, s.b, s.a},
}
Also note that this composite literal takes care of not having to reallocate the arrays behind the slices. But if in other places of your code (which we don't see) you append further elements to these slices, they may still cause reallocations. If that is the case, you should create the slices with a bigger capacity covering "future" appends (e.g. make([]float64, 16, 32)).
An empty slice is empty. To append, it must allocate memory. And then you do more appends, which have to allocate even more memory.
To speed it up, use a fixed-size array, use make to create a slice with the correct length, or initialize the slice with the items when you declare it.