Google Project Tango: Why does my point cloud get displayed only along the x, y, and z axes? - point-clouds

I am trying to edit the point cloud (stored in a FloatBuffer) in order to keep recorded points on the screen. However, when I display the points, they all lie on the x, y, or z axis. I am using Google's example Point Cloud program, and all I'm doing right now is copying the buffer so I can edit it, since the original buffer is read-only. I haven't changed anything else, because I need to get my copy working first. Here is my code for copying the buffer (adapted from an example that transfers bytes from one ByteBuffer to another):
private FloatBuffer cloneBuffer(FloatBuffer original) {
    final ByteBuffer byteClone = (original.isDirect()) ?
            // multiplied by 4 and added 3 so the capacity would be the
            // same when converted to a FloatBuffer
            ByteBuffer.allocateDirect(original.capacity() * 4 + 3) :
            ByteBuffer.allocate(original.capacity() * 4 + 3);
    final FloatBuffer clone = byteClone.asFloatBuffer();
    final FloatBuffer readOnlyCopy = original.asReadOnlyBuffer();

    readOnlyCopy.rewind();
    clone.put(readOnlyCopy);
    clone.flip();
    clone.position(original.position());
    return clone;
}

Looking at:
https://developers.google.com/project-tango/apis/known-issues
In the "Depth" section, notice that:
The IJ buffer of the XYZij struct is under development and not yet populated via the API.
Occasionally, or when under high CPU load, the depth flash may appear in the color image, or no depth points are returned. Let the device cool down and/or reboot.
Your problem may stem from the first point mentioned.

Related

OpenCL crash when calling finish()

I am writing an OpenCL app on macOS using C++, and it crashes in certain cases depending on the work size.
The program crashes due to a SIGABRT.
Is there any way to get more information about the error?
Why is SIGABRT being raised? Can I catch it?
EDIT:
I realize that this program is a doozy, but I will try to explain it in case anyone would like to take a stab at it.
Through debugging I discovered that the cause of the SIGABRT was one of the kernels timing out.
The program is a tile-based 3D renderer. It is an OpenCL implementation of this algorithm: https://github.com/ssloy/tinyrenderer
The screen is divided into 8x8 tiles. One of the kernels (the tiler) computes which polygons overlap each tile, storing the results in a data structure called tilePolys. A subsequent kernel (the rasterizer), which runs one work item per tile, iterates over the list of polys occupying the tile and rasterizes them.
The tiler writes to an integer buffer which is a list of lists of polygon indices. Each list is of a fixed size (polysPerTile + 1 for the count) where the first element is the count and the subsequent polysPerTile elements are indices of polygons in the tile. There is one such list per tile.
For some reason in certain cases the tiler writes a very large poly count (13172746) to one of the tile's lists in tilePolys. This causes the rasterizer to loop for a long time and time out.
The strange thing is that the index to which the large count is written is never accessed by the tiler.
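To make the layout concrete, the sizing and indexing of tilePolys amount to the following (a simplified host-side C++ sketch for illustration only, not my actual host code; the struct and names are made up):

#include <cstddef>

// One fixed-size list per tile: slot 0 of each list is the running count,
// the next polysPerTile slots hold polygon indices.
struct TilePolyLayout {
    int tilesX, tilesY, polysPerTile;

    // Total number of ints the tilePolys buffer must hold.
    size_t bufferSize() const {
        return static_cast<size_t>(tilesX) * tilesY * (polysPerTile + 1);
    }
    // Index of the counter slot for tile (x, y); matches counterInd in the kernel below.
    size_t counterIndex(int x, int y) const {
        return (static_cast<size_t>(y) * tilesX + x) * (polysPerTile + 1);
    }
    // Index of the i-th polygon slot (0-based) for tile (x, y).
    size_t polyIndex(int x, int y, int i) const {
        return counterIndex(x, y) + 1 + i;
    }
};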
The code for the tiler kernel is below:
// this kernel is executed once per polygon
// it computes which tiles are occupied by the polygon and adds the index of the polygon to the list for that tile
kernel void tiler(
    // number of polygons
    ulong nTris,
    // width of screen
    int width,
    // height of screen
    int height,
    // number of tiles in x direction
    int tilesX,
    // number of tiles in y direction
    int tilesY,
    // number of pixels per tile (tiles are square)
    int tileSize,
    // size of the polygon list for each tile
    int polysPerTile,
    // 4x4 matrix representing the viewport
    global const float4* viewport,
    // vertex positions
    global const float* vertices,
    // indices of vertices
    global const int* indices,
    // array of array-lists of polygons per tile
    // structure of list is an int representing the number of polygons covering that tile,
    // followed by [polysPerTile] integers representing the indices of the polygons in that tile
    // there are [tilesX*tilesY] such arraylists
    volatile global int* tilePolys)
{
    size_t faceInd = get_global_id(0);

    // compute vertex position in viewport space
    float3 vs[3];
    for(int i = 0; i < 3; i++) {
        // indices are vertex/uv/normal
        int vertInd = indices[faceInd*9 + i*3];
        float4 vertHomo = (float4)(vertices[vertInd*4], vertices[vertInd*4+1], vertices[vertInd*4+2], vertices[vertInd*4+3]);
        vertHomo = vec4_mul_mat4(vertHomo, viewport);
        vs[i] = vertHomo.xyz / vertHomo.w;
    }

    float2 bboxmin = (float2)(INFINITY, INFINITY);
    float2 bboxmax = (float2)(-INFINITY, -INFINITY);
    // size of screen
    float2 clampCoords = (float2)(width-1, height-1);

    // compute bounding box of triangle in screen space
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j < 2; j++) {
            bboxmin[j] = max(0.f, min(bboxmin[j], vs[i][j]));
            bboxmax[j] = min(clampCoords[j], max(bboxmax[j], vs[i][j]));
        }
    }

    // transform bounding box to tile space
    int2 tilebboxmin = (int2)(bboxmin[0] / tileSize, bboxmin[1] / tileSize);
    int2 tilebboxmax = (int2)(bboxmax[0] / tileSize, bboxmax[1] / tileSize);

    // loop over all tiles in bounding box
    for(int x = tilebboxmin[0]; x <= tilebboxmax[0]; x++) {
        for(int y = tilebboxmin[1]; y <= tilebboxmax[1]; y++) {
            // get index of tile
            int tileInd = y * tilesX + x;
            // get start index of polygon list for this tile
            int counterInd = tileInd * (polysPerTile + 1);
            // get current number of polygons in list
            int numPolys = atomic_inc(&tilePolys[counterInd]);
            // if list is full, skip tile
            if(numPolys >= polysPerTile) {
                // decrement the count because we will not add to the list
                atomic_dec(&tilePolys[counterInd]);
            } else {
                // otherwise add the poly to the list
                // the index is the offset + numPolys + 1 as tilePolys[counterInd] holds the poly count
                int ind = counterInd + numPolys + 1;
                tilePolys[ind] = (int)(faceInd);
            }
        }
    }
}
My theories are that either:
I have incorrectly implemented the atomic functions for reading and incrementing the count
I am using an incorrect number format causing garbage to be written into tilePolys
One of my other kernels is inadvertently writing into the tilePolys buffer
I do not think it is the last one though because if instead of writing faceInd to tilePolys, I write a constant value, the large poly count disappears.
tilePolys[counterInd+numPolys+1] = (int)(faceInd); // this is the problem line
tilePolys[counterInd+numPolys+1] = (int)(5); // this fixes the issue
It looks like your kernel is crashing on the GPU itself. You can't really get any extra diagnostics about that directly, at least not on macOS. You'll need to start narrowing down the problem. Some suggestions:
As the crash is currently happening in clFinish(), you don't know which asynchronous command is causing it. Try switching all your enqueue calls to blocking mode. This should cause it to crash in the call that's actually going wrong.
Check return/error codes on all OpenCL API calls. Sometimes, ignoring an error from an earlier call can cause problems in a later call which relies on earlier results. For example, if creating a buffer fails, passing the result of that buffer creation as a kernel argument will cause problems when trying to run the kernel. (A sketch of these first two suggestions follows below the list.)
The most likely reason for the crash is that your OpenCL kernel is accessing memory out of bounds or is otherwise misusing pointers. Re-check any array index calculations.
Check if the problem occurs with smaller work batches. Scale up from one workgroup (or work item if not using groups) and see if it only occurs beyond a certain work size. This may give you a clue about buffer sizes and array indices that might be causing the crash.
Systematically comment out parts of your kernel. If the crash goes away if you comment out a specific piece of code, there's a good chance the problem is in that code.
If you've narrowed the problem down to a small area of code but can't work out where it's coming from, start recording diagnostic output to check that variables have the values you're expecting.
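As an illustration of the first two suggestions, a minimal sketch might look like this (buffer, kernel, and size names are placeholders, not from your project):

#include <OpenCL/opencl.h>   // macOS; use <CL/cl.h> on other platforms
#include <cstdio>
#include <cstdlib>

// Fail loudly on any non-CL_SUCCESS return code.
#define CL_CHECK(expr)                                                  \
    do {                                                                \
        cl_int _err = (expr);                                           \
        if (_err != CL_SUCCESS) {                                       \
            std::fprintf(stderr, "%s failed with %d (%s:%d)\n",         \
                         #expr, _err, __FILE__, __LINE__);              \
            std::exit(EXIT_FAILURE);                                    \
        }                                                               \
    } while (0)

void runTilerBlocking(cl_command_queue queue, cl_kernel tiler,
                      cl_mem tilePolysBuf, int* tilePolysHost,
                      size_t numTris, size_t tilePolysBytes) {
    size_t globalSize = numTris;
    CL_CHECK(clEnqueueNDRangeKernel(queue, tiler, 1, nullptr,
                                    &globalSize, nullptr, 0, nullptr, nullptr));
    // Blocking read (CL_TRUE), so a failure surfaces here rather than later in clFinish().
    CL_CHECK(clEnqueueReadBuffer(queue, tilePolysBuf, CL_TRUE, 0,
                                 tilePolysBytes, tilePolysHost, 0, nullptr, nullptr));
}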
Without seeing any code, I can't give you any more specific advice than that.
Note that OpenCL is deprecated on macOS, so if you're specifically targeting that platform and don't need to support Linux, Windows, etc. I recommend learning Metal Compute instead. Apple has made it clear that this is the GPU programming platform they want to support, and the tooling for it is already much better than their OpenCL tooling ever was.
I suspect Apple will eventually stop implementing OpenCL support when they release a Mac with a new type of GPU, so even if you're targeting the Mac as well as other platforms, you will probably need to switch to Metal on the Mac somewhere down the line anyway. As of macOS 10.14, the minimum system requirements of the OS already include a Metal-capable GPU, so you only need OpenCL as a fallback if you wish to support all Mac models able to run 10.13 or an even older OS version.

Few objects moving straight one by one in Unity

I'm trying to create snake-like movement, but I can't work out an algorithm to move each body part along behind the one in front of it.
I want an automatically moving snake consisting of separate blocks (spheres). The snake should move along a path. I generate the path with a Bezier spline and have already implemented movement of the first (head) part along it. The point for the head is obtained from the spline via the following API:
class BezierSpline
{
    Vector3 GetPoint(float progress) // 0 to 1
}
Then I have a SnakeMovement script:
public class SnakeMovement : MonoBehaviour
{
    public BezierSpline Path;
    public List<Transform> Parts;
    public float minDistance = 0.25f;
    public float speed = 1;
    //.....

    void Update()
    {
        Vector3 position = Path.GetPoint(progress);
        Parts.First().localPosition = position;
        Parts.First().LookAt(position + Path.GetDirection(progress));

        for (int i = 1; i < Parts.Count; i++)
        {
            Transform curBody = Parts[i];
            Transform prevBody = Parts[i - 1];

            float dist = Vector3.Distance(prevBody.position, curBody.position);
            Vector3 newP = prevBody.position;
            newP.y = Parts[0].position.y;

            float t = Time.deltaTime * dist / minDistance * curspeed;
            curBody.position = Vector3.Slerp(curBody.position, newP, t);
            curBody.rotation = Quaternion.Slerp(curBody.rotation, prevBody.rotation, t);
        }
        //....
    }
}
For now, if I stop the head's movement, the parts don't preserve their distance and keep moving toward the head position. Another problem with the above algorithm is that the parts don't exactly follow the head's path; they can "cut" corners while turning.
The main idea is for the user/AI to control only the head (the first body part), while each following part exactly repeats the head's path and preserves the distance to its neighbours.
For snake-like motion you are likely to get lots of strange behaviours if you treat the spheres as separate objects. While I can imagine it's possible to get it to work, I don't think it's the best approach.
The first solution that comes to mind is to create a List onto which you push, at index 0, the position of the snake's head on every frame.
The list grows, and all the other segments wait their turn, lagging x frames behind, so on each update segment y takes the position stored at list[x*y].
If the Count() of the list is greater than number_of_segments*lag, you RemoveAt(Count()-1).
This can be optimized, as changing the list is somewhat costly (a ring buffer would be better suited, and a Queue could also work, but for starters I find Lists much easier to follow, and you can always optimize later). It may behave a bit awkwardly if your framerate varies a lot, but it should be very stable in general (as in: no unpredictable motion, we only re-use the same values over and over).
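A minimal sketch of that idea (written here in C++ with a plain Vec3 for illustration; in Unity you would use Vector3 and a List or Queue of positions):

#include <algorithm>
#include <deque>
#include <vector>

struct Vec3 { float x, y, z; };

struct SnakeTrail {
    std::deque<Vec3> history;   // front = newest head position
    int lag = 10;               // frames of delay between neighbouring segments

    void update(const Vec3& headPos, std::vector<Vec3>& segments) {
        history.push_front(headPos);

        // Keep only as much history as the last segment needs.
        size_t needed = segments.size() * static_cast<size_t>(lag) + 1;
        while (history.size() > needed)
            history.pop_back();

        // Segment 0 is the head itself; segment y trails it by lag*y frames.
        for (size_t y = 0; y < segments.size(); ++y) {
            size_t idx = std::min(history.size() - 1, y * static_cast<size_t>(lag));
            segments[y] = history[idx];
        }
    }
};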
Second method:
You mentioned using a Bezier spline to generate the path. Beziers are parametrized by a float t, so you have something like SplineAt(t).
If you take your bezier_path_length and distance_between_segments, then segment n should have the position
SplineAt(t - n*distance_between_segments/bezier_path_length)

DirectX 11 compute shader for ray/mesh intersect

I recently converted a DirectX 9 application that was using D3DXIntersect to find ray/mesh intersections to DirectX 11. Since D3DXIntersect is not available in DX11, I wrote my own code to find the intersection, which just loops over all the triangles in the mesh and tests them, keeping track of the closest hit to the origin. This is done on the CPU side and works fine for picking via the GUI, but I have another part of the application that creates a new mesh from an existing one based on several different viewpoints, and I need to check line of sight for every triangle in the mesh many times. This gets pretty slow.
Does it make sense to use a DX11 compute shader to do this (i.e. would there be a significant speedup from doing it on the CPU)? I searched the internet but could not find an existing example.
Assuming the answer is yes, here is the approach I am thinking of:
Launch a thread for every triangle in my mesh
Each thread computes the distance to a hit on that triangle, or returns max float on a miss. Store one value per thread in a buffer.
Then do a reduction and return the minimum (non-negative) value.
I wish I had access to something like CUDA Thrust in DirectX, because I think coding up that reduction is going to be a pain. That's why I'm asking, so I don't do a bunch of work for nothing!
This is totally doable; here is some HLSL code that lets you do it (and also handles the case where you hit two triangles at the same distance).
I assume that you know how to create resources (structured buffers) and bind them to the compute pipeline.
I'll also assume that your geometry is indexed.
The first step is to collect the triangles that pass the test. Instead of using a "hit" flag, we will use an append buffer and only push elements that pass the test.
First create two structured buffers (positions and triangle indices) and copy your model data into them.
Then create a structured buffer with an appendable unordered access view.
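As a rough C++/D3D11 illustration of that setup (the rayHit layout is taken from the shader below; everything else is a placeholder and not the only way to do it):

#include <d3d11.h>

struct RayHit { UINT triangleID; float distanceToTriangle; };

HRESULT CreateAppendHitBuffer(ID3D11Device* device, UINT maxHits,
                              ID3D11Buffer** outBuffer,
                              ID3D11UnorderedAccessView** outUav)
{
    D3D11_BUFFER_DESC desc = {};
    desc.ByteWidth = sizeof(RayHit) * maxHits;          // worst case: every triangle is hit
    desc.Usage = D3D11_USAGE_DEFAULT;
    desc.BindFlags = D3D11_BIND_UNORDERED_ACCESS | D3D11_BIND_SHADER_RESOURCE;
    desc.MiscFlags = D3D11_RESOURCE_MISC_BUFFER_STRUCTURED;
    desc.StructureByteStride = sizeof(RayHit);
    HRESULT hr = device->CreateBuffer(&desc, nullptr, outBuffer);
    if (FAILED(hr)) return hr;

    D3D11_UNORDERED_ACCESS_VIEW_DESC uavDesc = {};
    uavDesc.Format = DXGI_FORMAT_UNKNOWN;                // required for structured buffers
    uavDesc.ViewDimension = D3D11_UAV_DIMENSION_BUFFER;
    uavDesc.Buffer.FirstElement = 0;
    uavDesc.Buffer.NumElements = maxHits;
    uavDesc.Buffer.Flags = D3D11_BUFFER_UAV_FLAG_APPEND; // makes it an AppendStructuredBuffer
    return device->CreateUnorderedAccessView(*outBuffer, &uavDesc, outUav);
}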
To perform Hit detection, you can use the following Compute code:
struct rayHit
{
    uint triangleID;
    float distanceToTriangle;
};

cbuffer cbRaySettings : register(b0)
{
    float3 rayFrom;
    float3 rayDir;
    uint TriangleCount;
};

StructuredBuffer<float3> positionBuffer : register(t0);
StructuredBuffer<uint3> indexBuffer : register(t1);
AppendStructuredBuffer<rayHit> appendRayHitBuffer : register(u0);

void TestTriangle(float3 p1, float3 p2, float3 p3, out bool hit, out float d)
{
    //Perform ray/triangle intersection
    hit = false;
    d = 0.0f;
}

[numthreads(64,1,1)]
void CS_RayAppend(uint3 tid : SV_DispatchThreadID)
{
    if (tid.x >= TriangleCount)
        return;

    uint3 indices = indexBuffer[tid.x];
    float3 p1 = positionBuffer[indices.x];
    float3 p2 = positionBuffer[indices.y];
    float3 p3 = positionBuffer[indices.z];

    bool hit;
    float d;
    TestTriangle(p1, p2, p3, hit, d);

    if (hit)
    {
        rayHit hitData;
        hitData.triangleID = tid.x;
        hitData.distanceToTriangle = d;
        appendRayHitBuffer.Append(hitData);
    }
}
Please note that you need to give appendRayHitBuffer a sufficient size (the worst-case scenario is Triangle Count elements, i.e. every triangle is hit by the ray).
Once this is done, the beginning of the buffer contains the hit data, and the unordered access view's counter holds the number of triangles that passed the test.
Then you need to create an argument buffer and a small byte address buffer (of size 16, since I don't think the runtime will allow 12).
You also need a small structured buffer (one element is enough), which will be used to store the minimum distance.
Use CopyStructureCount to copy the unordered access view's counter into those buffers (please note that the second and third elements of the argument buffer both need to be set to 1, as they will be used as dispatch arguments).
Clear the small structured buffer to the maximum uint value, and use the argument buffer with DispatchIndirect.
I assume that you will not have many hits, so for the next part numthreads is set to (1,1,1); if you want to use larger groups, you will need to run another compute shader to build the argument buffer.
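As a rough C++ illustration of that host-side plumbing (resource names are placeholders; the argument buffer is assumed to have been created with D3D11_RESOURCE_MISC_DRAWINDIRECT_ARGS and its second and third elements preset to 1):

#include <d3d11.h>

void DispatchMinPass(ID3D11DeviceContext* ctx,
                     ID3D11UnorderedAccessView* hitUav,   // append UAV from the first pass
                     ID3D11Buffer* argBuffer,             // 16-byte indirect-argument buffer
                     ID3D11Buffer* hitCountBuffer,        // small byte address buffer
                     ID3D11UnorderedAccessView* minUav)   // UAV of the one-element min buffer
{
    // Copy the append counter (number of hits) into the first dispatch argument
    // and into the byte address buffer the shader reads as rayHitCount.
    ctx->CopyStructureCount(argBuffer, 0, hitUav);
    ctx->CopyStructureCount(hitCountBuffer, 0, hitUav);

    // Clear the min buffer to the maximum uint so InterlockedMin can only lower it.
    const UINT clearValue[4] = { 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu, 0xFFFFFFFFu };
    ctx->ClearUnorderedAccessViewUint(minUav, clearValue);

    // CS_CalcMin and its resources are assumed to be bound already.
    ctx->DispatchIndirect(argBuffer, 0);
}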
Then to find minimum distance:
StructuredBuffer<rayHit> rayHitbuffer : register(t0);
ByteAddressBuffer rayHitCount : register(t1);
RWStructuredBuffer<uint> rwMinBuffer : register(u0);

[numthreads(1,1,1)]
void CS_CalcMin(uint3 tid : SV_DispatchThreadID)
{
    uint count = rayHitCount.Load(0);
    if (tid.x >= count)
        return;

    rayHit hit = rayHitbuffer[tid.x];
    uint dummy;
    InterlockedMin(rwMinBuffer[0], asuint(hit.distanceToTriangle), dummy);
}
Since we expect that hit distance will be greater than zero, we can use asuint and InterlockedMin in that scenario. Also since we use DispatchIndirect, this part is now only applied to the elements that previously passed the test.
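The reason this works: for non-negative IEEE-754 floats, the raw bit pattern interpreted as an unsigned integer preserves ordering, so an unsigned InterlockedMin on asuint(distance) ends up holding the bits of the smallest distance. A tiny stand-alone C++ check of that property:

#include <cassert>
#include <cstdint>
#include <cstring>

// Reinterpret a float's bits as an unsigned 32-bit integer (same as HLSL asuint).
static uint32_t bits(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    return u;
}

int main() {
    // Ordering of the bit patterns matches ordering of the (non-negative) floats.
    assert(bits(0.0f) < bits(0.5f));
    assert(bits(0.5f) < bits(1.0f));
    assert(bits(1.0f) < bits(1e6f));
    return 0;
}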
Now your single-element buffer contains the minimum distance, but not the index (or indices).
For the last part, we finally need to extract the index of the triangle at the minimum hit distance.
You again need a new structured buffer with an unordered access (append) view to store the minimum index.
Use the same dispatch arguments as before (indirect), and perform the following:
ByteAddressBuffer rayHitCount : register(t1);
StructuredBuffer<uint> MinDistanceBuffer : register(t2);
AppendStructuredBuffer<uint> appendMinHitIndexBuffer : register(u0);

[numthreads(1,1,1)]
void CS_AppendMin(uint3 tid : SV_DispatchThreadID)
{
    uint count = rayHitCount.Load(0);
    if (tid.x >= count)
        return;

    rayHit hit = rayHitbuffer[tid.x];
    uint minDist = MinDistanceBuffer[0];
    uint d = asuint(hit.distanceToTriangle);

    if (d == minDist)
    {
        appendMinHitIndexBuffer.Append(hit.triangleID);
    }
}
Now appendMinHitIndexBuffer contains the index of the closest triangle (or several indices, if more than one triangle sits at that distance); you can copy it back using a staging resource and Map the resource for reading.
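As a rough illustration of that read-back (C++; names are placeholders, and in practice you would also copy the append counter with CopyStructureCount to know how many indices were written):

#include <d3d11.h>

HRESULT ReadBackIndex(ID3D11Device* device, ID3D11DeviceContext* ctx,
                      ID3D11Buffer* minHitIndexBuffer, UINT* outIndex)
{
    // Describe a CPU-readable staging copy of the result buffer.
    D3D11_BUFFER_DESC desc = {};
    minHitIndexBuffer->GetDesc(&desc);
    desc.Usage = D3D11_USAGE_STAGING;
    desc.BindFlags = 0;
    desc.MiscFlags = 0;
    desc.CPUAccessFlags = D3D11_CPU_ACCESS_READ;

    ID3D11Buffer* staging = nullptr;
    HRESULT hr = device->CreateBuffer(&desc, nullptr, &staging);
    if (FAILED(hr)) return hr;

    ctx->CopyResource(staging, minHitIndexBuffer);

    D3D11_MAPPED_SUBRESOURCE mapped = {};
    hr = ctx->Map(staging, 0, D3D11_MAP_READ, 0, &mapped);
    if (SUCCEEDED(hr)) {
        *outIndex = *static_cast<const UINT*>(mapped.pData); // first appended index
        ctx->Unmap(staging, 0);
    }
    staging->Release();
    return hr;
}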
Actually, it makes a lot of sense. Here is also a whitepaper with some useful shader snippets: http://www.graphicon.ru/html/2012/conference/EN2%20-%20Graphics/gc2012Shumskiy.pdf . You can also use DirectCompute/CUDA/OpenCL with DirectX, but if I may give you a hint, do it in DirectCompute, because I think it is the least hassle to set up and get running.

Size of data field in TangoImageBuffer

I am trying to memcpy the TangoImageBuffer data field which, if the image is YUV, should by my reckoning be (buffer->width * buffer->height * 3 * sizeof(uint8_t)) bytes (sizeof just for kicks, I know it is 1), but this makes it segfault. If I copy only height*width bytes it works, and with height*width*2 I also seem to get valid data; I just don't know what size this field should actually be.
My (relevant) code:
void onImageCallback(void *context, TangoCameraId id, const TangoImageBuffer *buffer)
{
    memcpy(img_struct->image_buffer->getBuffer(), buffer->data,
           buffer->width * buffer->height * 3 * sizeof(uint8_t));
}
Here image_buffer is a Java ByteBuffer wrapper class which I am using from C++. Internally it allocates memory by calling new with the specified size (in this case the same size I am trying to memcpy), and holds a jobject reference created with env->NewGlobalRef(env->NewDirectByteBuffer(buffer, this->bufferSize));, where this->bufferSize equals (buffer->width * buffer->height * 3 * sizeof(uint8_t)).
I am pretty sure it is allocating the right amount of memory, as I have also used it for memcpying the xyzij buffer in another function (with the corresponding size difference, since those are floats) and it works just fine (and I have also tried over-allocating), so I know the problem is not the destination being too small.
In case this information adds to the question: I am using the regular color camera, so height should be 1280 and width 720 if I recall correctly.
Edit: Manually searching for the maximum amount of data I can copy without a segfault, it seems to top out at 1384448 bytes (that works, 1384449 segfaults), which is roughly 1.5 times the size in pixels of the image, adding to my confusion.
The pixel format is documented as YV12, a planar YUV 4:2:0 format. It has full resolution for Y with 2x2 subsampling for U and V, so U and V each have 1/4 as many samples as Y. The total number of samples is therefore width*height*(1 + 1/4 + 1/4) = 1.5*width*height.
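In other words, the buffer holds width*height*3/2 bytes, not width*height*3. A small sketch of the size calculation (plain C++, not the exact Tango callback; the helper names are made up):

#include <cstdint>
#include <cstring>
#include <cstddef>

// For a YV12 / YUV 4:2:0 image: width*height luma bytes plus two quarter-size
// chroma planes, i.e. width*height*3/2 bytes in total.
static size_t yv12BufferSize(uint32_t width, uint32_t height) {
    return static_cast<size_t>(width) * height * 3 / 2;
}

// Hypothetical copy helper; dst must have been allocated with yv12BufferSize().
static void copyYv12(uint8_t* dst, const uint8_t* src, uint32_t width, uint32_t height) {
    std::memcpy(dst, src, yv12BufferSize(width, height));
}

For a 1280x720 color image that works out to 1,382,400 bytes, consistent with the roughly 1.5x limit observed in the question.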

Folding a selection of points on a 3D cube

I am trying to find an effective algorithm for the following 3D Cube Selection problem:
Imagine a 2D array of Points (let's make it a square of size x size) and call it a side.
For ease of calculation, let's declare max as size-1.
Create a Cube of six sides, keeping 0,0 at the lower left and max,max at the top right.
Use z to track which side a single cube is located on, y as up, and x as right.
public class Point3D {
    public int x, y, z;

    public Point3D() {}

    public Point3D(int X, int Y, int Z) {
        x = X;
        y = Y;
        z = Z;
    }
}
Point3D[,,] CreateCube(int size)
{
    Point3D[,,] Cube = new Point3D[6, size, size];
    for(int z = 0; z < 6; z++)
    {
        for(int y = 0; y < size; y++)
        {
            for(int x = 0; x < size; x++)
            {
                Cube[z,y,x] = new Point3D(x,y,z);
            }
        }
    }
    return Cube;
}
Now to select a random single point, we can just use three random numbers such that:
Point3D point = new Point3D(
    Random(0,size), // 0 and max
    Random(0,size), // 0 and max
    Random(0,6));   // 0 and 5
To select a plus shape, we could check whether a given direction still fits inside the current side.
Otherwise we find the cube located on the adjacent side touching the center point.
Using 4 functions with something like:
private T GetUpFrom<T>(T[,,] dataSet, Point3D point) where T : class {
    if(point.y < max)
        return dataSet[point.z, point.y + 1, point.x];
    else {
        switch(point.z) {
            case 0: return dataSet[1, point.x, max];       // x+
            case 1: return dataSet[5, max, max - point.x]; // y+
            case 2: return dataSet[1, 0, point.x];         // z+
            case 3: return dataSet[1, max - point.x, 0];   // x-
            case 4: return dataSet[2, max, point.x];       // y-
            case 5: return dataSet[1, max, max - point.x]; // z-
        }
    }
    return null;
}
Now I would like to find a way to select arbitrary shapes (like predefined random blobs) at a random point.
But would settle for adjusting it to either a Square or jagged Circle.
The actual surface area would be warped and folded onto itself on corners, which is fine and does not need compensating ( imagine putting a sticker on the corner on a cube, if the corner matches the center of the sticker one fourth of the sticker would need to be removed for it to stick and fold on the corner). Again this is the desired effect.
No duplicate selections are allowed, so cubes that would be selected twice need to be filtered somehow (or the selection calculated in such a way that duplicates do not occur). This could be as simple as using a HashSet, or a List with a helper function that checks whether an entry is unique (which is fine, as selections will always be far below 1000 cubes).
The delegate for this function in the class containing the Sides of the Cube looks like:
delegate T[] SelectShape(Point3D point, int size);
Currently I'm thinking of checking each side of the Cube to see which part of the selection is located on that side.
Calculating which part of the selection is on the same side of the selected Point3D, would be trivial as we don't need to translate the positions, just the boundary.
Next would be 5 translations, followed by checking the other 5 sides to see if part of the selected area is on that side.
I'm getting rusty in solving problems like this, so was wondering if anyone has a better solution for this problem.
@arghbleargh requested a further explanation:
We will use a Cube of 6 sides and use a size of 16. Each side is 16x16 points.
Stored as a three-dimensional array, I used z for the side, then y, x, so the array is initialized with new Point3D[z, y, x]. It would work almost identically for jagged arrays ([z][y][x]), which are serializable by default (so that would be nice too), but those require separate initialization of each subarray.
Let's select a square of size 5x5, centered around a selected point.
To find such a 5x5 square, subtract and add 2 on the axes in question: x-2 to x+2 and y-2 to y+2.
Randomly selecting a side, the point we select is z = 0 (the x+ side of the Cube), y = 6, x = 6.
Both 6-2 and 6+2 are well within the limits of the 16 x 16 array of the side and easy to select.
Shifting the selection point to x=0 and y=6 however would prove a little more challenging.
As x - 2 would require a lookup on the side to the left of the side we selected.
Luckily we selected side 0, or x+, because as long as we are not on the top or bottom side and not going to the top or bottom side of the cube, all axes are x+ = right, y+ = up.
So to get the coordinates on the side to the left would only require a subtraction of max (size - 1) - x. Remember size = 16, max = 15, x = 0-2 = -2, max - x = 13.
The subsection on this side would thus be x = 13 to 15, y = 4 to 8.
Adding this to the part we could select on the original side would give the entire selection.
Shifting the selection to 0,6 would prove more complicated, as now we cannot rely on all the axes aligning easily; some rotation might be required. There are only 4 possible translations, so it is still manageable.
Shifting to 0,0 is where the problems really start to appear, as now both left and down need to wrap around to other sides, and furthermore even the subdivided parts can have an area that falls outside.
The only salve on this wound is that we do not care about the overlapping parts of the selection.
So we can either skip them when possible or filter them from the results later.
Now that we move from a 'normal axis' side to the bottom one, we would need to rotate and match the correct coordinates so that the points wrap around the edge correctly.
As the axes of each side are folded into a cube, some axes might need to flip or rotate to select the right points.
The question remains whether there are better solutions for selecting all points on a cube that are inside an area. Perhaps I could give each side a translation matrix and test coordinates in world space?
Found a pretty good solution that requires little effort to implement.
Create storage for a Hollow Cube with a size of n + 2, where n is the size of the cube contained in the data. This satisfies the requirement that the sides touch but do not overlap or share any points.
This will simplify calculations and translations by creating a lookup array that uses Cartesian coordinates.
A single translation function takes the coordinates of a selected point and returns its 'world position'.
Using that function, we can store each point in the Cartesian lookup array.
When selecting a point, we can again use the same function (or use stored data), then subtract (to get AA, the min position) and add (to get BB, the max position).
Then we can just look up each entry between the AA.xyz and BB.xyz coordinates.
Each null entry should be skipped.
Optimize if required by using a type of array that returns null if z is not 0 or size-1 and thus does not need to store null references for the hollow interior of the cube.
Now that the cube can select 3D boxes, the other shapes are trivial: given a 3D point, define a 3D shape and test each part of the shape against the lookup array; if the entry is not null, add it to the selection.
Each point is only selected once, as we only check each position once.
There is a little calculation overhead from testing against the empty inside and outside of the cube, but array access is so fast that this solution is fine for my current project.
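For what it's worth, a compact sketch of that lookup idea (written in C++ here rather than C#; the face orientations in worldOf are illustrative assumptions, not a specific folding convention):

#include <algorithm>
#include <optional>
#include <vector>

struct FacePoint { int face, y, x; };  // a point on one of the 6 sides
struct WorldPos  { int x, y, z; };     // a cell in the (n+2)^3 shell

// Map a face-local point to its cell in the hollow (n+2)^3 cube.
// Faces: 0 = x+, 1 = y+, 2 = z+, 3 = x-, 4 = y-, 5 = z-.
WorldPos worldOf(const FacePoint& p, int n) {
    int u = p.y + 1, v = p.x + 1;      // shift into the 1..n interior
    switch (p.face) {
        case 0:  return { n + 1, u, v };
        case 1:  return { u, n + 1, v };
        case 2:  return { u, v, n + 1 };
        case 3:  return { 0, u, v };
        case 4:  return { u, 0, v };
        default: return { u, v, 0 };
    }
}

struct HollowCube {
    int n, s;                                    // face size and shell size (n + 2)
    std::vector<std::optional<FacePoint>> cells; // flat (n+2)^3 lookup, empty inside

    explicit HollowCube(int size) : n(size), s(size + 2), cells(s * s * s) {
        for (int f = 0; f < 6; ++f)
            for (int y = 0; y < n; ++y)
                for (int x = 0; x < n; ++x) {
                    WorldPos w = worldOf({ f, y, x }, n);
                    cells[(w.z * s + w.y) * s + w.x] = FacePoint{ f, y, x };
                }
    }

    // Select every stored point inside the axis-aligned box [centre - r, centre + r].
    std::vector<FacePoint> selectBox(const FacePoint& centre, int r) const {
        WorldPos c = worldOf(centre, n);
        std::vector<FacePoint> out;
        for (int z = std::max(0, c.z - r); z <= std::min(s - 1, c.z + r); ++z)
            for (int y = std::max(0, c.y - r); y <= std::min(s - 1, c.y + r); ++y)
                for (int x = std::max(0, c.x - r); x <= std::min(s - 1, c.x + r); ++x)
                    if (const auto& cell = cells[(z * s + y) * s + x])
                        out.push_back(*cell);    // each cell is visited once, so no duplicates
        return out;
    }
};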
