Kinect: Converting from RGB Coordinates to Depth Coordinates - image

I am using the Windows Kinect SDK to obtain depth and RGB images from the sensor.
The depth and RGB images do not align, so I would like to convert coordinates in the RGB image to coordinates in the depth image. I want to apply an image mask, obtained from some processing on the RGB image, to the depth image.
There is already a method for converting depth coordinates to the color space coordinates:
NuiImageGetColorPixelCoordinatesFromDepthPixel
Unfortunately, the reverse does not exist. There is only an arcane call in INuiCoordinateMapper:
HRESULT MapColorFrameToDepthFrame(
NUI_IMAGE_RESOLUTION eColorResolution,
NUI_IMAGE_RESOLUTION eDepthResolution,
DWORD cDepthPixels,
NUI_DEPTH_IMAGE_PIXEL *pDepthPixels,
DWORD cDepthPoints,
NUI_DEPTH_IMAGE_POINT *pDepthPoints
)
How this method works is not very well documented. Has anyone used it before?
I'm on the verge of performing a manual calibration myself to calculate a transformation matrix, so I would be very happy for a solution.

Thanks to commenter horristic, I got a link to MSDN with some useful information (thanks also to T. Chen over at MSDN for helping out with the API). Extracted from T. Chen's post, here's the code that will perform the mapping from RGB to depth coordinate space:
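// Output buffer, one depth-space point per color pixel.
// (Assumed allocation; the original post does not show how depthPoints is declared.)
NUI_DEPTH_IMAGE_POINT* depthPoints = new NUI_DEPTH_IMAGE_POINT[640 * 480];
// Obtain the coordinate mapper from the initialized sensor.
INuiCoordinateMapper* pMapper;
mNuiSensor->NuiGetCoordinateMapper(&pMapper);
// For every color pixel, compute the matching depth-image coordinate.
pMapper->MapColorFrameToDepthFrame(
    NUI_IMAGE_TYPE_COLOR,
    NUI_IMAGE_RESOLUTION_640x480,               // color resolution
    NUI_IMAGE_RESOLUTION_640x480,               // depth resolution
    640 * 480,                                  // number of depth pixels
    (NUI_DEPTH_IMAGE_PIXEL*)LockedRect.pBits,   // locked depth frame data
    640 * 480,                                  // number of output points
    depthPoints);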
Note: the sensor needs to be initialized and a depth frame locked for this to work.
The transformed coordinates can, e.g., be queried as follows:
/// transform RGB coordinate point to a depth coordinate point
cv::Point TransformRGBtoDepthCoords(cv::Point rgb_coords, NUI_DEPTH_IMAGE_POINT * depthPoints)
{
long index = rgb_coords.y * 640 + rgb_coords.x;
NUI_DEPTH_IMAGE_POINT depthPointAtIndex = depthPoints[index];
return cv::Point(depthPointAtIndex.x, depthPointAtIndex.y);
}
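Since the goal in the question is to carry a mask from the RGB image over to the depth image, the per-pixel lookup above can be applied to a whole mask at once. The following is a minimal NumPy sketch, not Kinect SDK code: depth_x and depth_y are hypothetical arrays assumed to hold the x and y fields of depthPoints, reshaped to the height and width of the color image, and rgb_mask_to_depth_mask is a made-up helper name.

```python
import numpy as np

def rgb_mask_to_depth_mask(rgb_mask, depth_x, depth_y, depth_shape):
    """Transfer a boolean mask from color space to depth space.

    rgb_mask:          (H, W) boolean mask defined on the color image.
    depth_x, depth_y:  (H, W) integer arrays; entry [row, col] is the depth-image
                       coordinate that the color pixel (col, row) maps to
                       (assumed to come from the mapper output).
    depth_shape:       (height, width) of the depth image.
    """
    depth_mask = np.zeros(depth_shape, dtype=bool)
    rows, cols = np.nonzero(rgb_mask)          # color pixels inside the mask
    dx = depth_x[rows, cols]
    dy = depth_y[rows, cols]
    # Drop mapped points that fall outside the depth image.
    valid = (dx >= 0) & (dx < depth_shape[1]) & (dy >= 0) & (dy < depth_shape[0])
    depth_mask[dy[valid], dx[valid]] = True
    return depth_mask

# Tiny 2x2 example: the masked color pixel (0, 0) maps to depth pixel x=1, y=0.
rgb_mask = np.array([[True, False], [False, False]])
depth_x = np.array([[1, 1], [0, 1]])
depth_y = np.array([[0, 0], [1, 1]])
depth_mask = rgb_mask_to_depth_mask(rgb_mask, depth_x, depth_y, (2, 2))
```

Several color pixels can map to the same depth pixel, so the resulting depth mask may have fewer set pixels than the RGB mask.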

As far as I can tell, MapColorFrameToDepthFrame effectively runs the co-ordinate system conversion on every pixel of your RGB image, storing the depth image coordinates resulting from the conversion and the resultant depth value in the output NUI_DEPTH_IMAGE_POINT array. The definition of that structure is here: http://msdn.microsoft.com/en-us/library/nuiimagecamera.nui_depth_image_point.aspx
This may be overkill for your needs, however, and I have no idea how fast that method is. Xbox Kinect developers have a very fast implementation of that function that runs on the GPU at frame rate; Windows developers might not be quite so lucky!

Related

Resolving iphone 7 photo pixel data in matlab?

I just imported an image taken with my iPhone 7 into MATLAB. It turns out that the image array is 3-D instead of 2-D.
boxImage1 = imread('IMG_5175.jpg');
boxImage1 480x640x3 921600 uint8
Can anyone explain why the image has three dimensions instead of just two? I am trying to run object detection tools on a set of images to extract relevant objects.
Thanks,
As pointed out in the comments, the third dimension corresponds to the R, G and B channels. Have a look at the MATLAB documentation:
If the file contains a truecolor image, then A is an m-by-n-by-3 array.
Converting it to grayscale, using rgb2gray, is often a good idea, but it may depend on your application:
I = rgb2gray(boxImage1); % 480x640 matrix
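For readers working outside MATLAB, the same idea can be sketched with NumPy. This is an illustration under assumptions: the array below is a synthetic stand-in for the photo, to_gray is a hypothetical helper, and the weights are the luma coefficients that the MATLAB documentation lists for rgb2gray.

```python
import numpy as np

# Stand-in for the photo: 480 rows, 640 columns, 3 channels (R, G, B).
rgb = np.zeros((480, 640, 3), dtype=np.uint8)
rgb[..., 0] = 255  # pure red test image

def to_gray(rgb_image):
    """Collapse the channel axis with a weighted sum of R, G and B."""
    weights = np.array([0.2989, 0.5870, 0.1140])
    gray = rgb_image.astype(np.float64) @ weights   # (H, W, 3) @ (3,) -> (H, W)
    return np.round(gray).astype(np.uint8)

gray = to_gray(rgb)
print(rgb.shape, gray.shape)  # (480, 640, 3) (480, 640)
```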

Converting to 8-bit image causes white spots where black was. Why is this?

Img is a NumPy array with dtype=float64. When I run this code:
Img2 = np.array(Img, np.uint8)
the background of my images turns white. How can I avoid this and still get an 8-bit image?
Edit:
Sure, I can give more info. The single image is compiled from a stack of 400 images. They are each coming from an .avi video file, and each image is converted into a NumPy array like this:
gray_img = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
A more complicated operation is performed on this whole stack, but does not involve creating new images. It's simply performing calculations on each 1D array to yield a single pixel.
The interpolation is most likely linear (the default when plotting images with matplotlib). The images were saved as .PNGs.
You probably see overflow. If you cast 257 to np.uint8, you will get 1. According to a google search, avi files contain images with a color depth of 15 - 24 bit. When you cast this depth to np.uint8, you will see white regions getting darkened and (if a normalization takes place somewhere) also dark regions getting white (-5 -> 251). For the regions that become bright, you could check whether you have negative pixel values in the original image Img.
The Docs say that sometimes you have to do some scaling to get a proper cast, and to rather use higher depth whenever possible to avoid artefacts.
The solution seems to be either working at a higher depth, i.e. casting to np.uint16 or np.uint32, or scaling the pixel values before reducing the depth, i.e. with Img being a NumPy array:
# scale so that the maximum becomes 255 (assumes Img has no negative values)
Img2 = Img * (255 / Img.max())
# cast the scaled array (not the original Img) to 8bit
Img2 = np.array(Img2, np.uint8)
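The wrap-around described above, and a scaling that also copes with negative values, can be sketched as follows. The helper name to_uint8 is hypothetical, and min-max normalization is one possible choice of scaling rather than the only correct one.

```python
import numpy as np

# Integer values outside 0..255 wrap around modulo 256 when cast to uint8.
wrapped = np.array([257, -5]).astype(np.uint8)
print(wrapped.tolist())  # [1, 251]

def to_uint8(img):
    """Map the full value range of img linearly onto 0..255 before casting."""
    img = np.asarray(img, dtype=np.float64)
    lo, hi = img.min(), img.max()
    if hi == lo:
        # Constant image: avoid dividing by zero.
        return np.zeros(img.shape, dtype=np.uint8)
    scaled = (img - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

safe = to_uint8(np.array([-5.0, 0.0, 250.0]))
print(safe.tolist())  # [0, 5, 255]
```

Because the minimum is mapped to 0 and the maximum to 255, a dark background stays dark as long as it holds the smallest values in the image.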

Transforming raw pixel using rescale slope and rescale intercept in DIcom

I used the solution from the post "Window width and center calculation of Dicom Image" to transform the raw pixels. It works well for most images, but I ran into a problem with some of them. Those images have a pixel value of 24, a rescale slope of 1.0 and a rescale intercept of -1024.
When I apply the solution mentioned above, I get a negative new pixel value (-1000).
I can't find an entry for this new pixel value in the lookup table created from the window level and window width, because the lookup table only has positive values (0 to 65536). Please help me solve this problem.
You are probably dealing with CT images. The RescaleIntercept tag for CTs is usually set to -1024. The value of -1000 that you obtain makes perfect sense: it corresponds to air in Hounsfield units (as Anders said). If you want to visualize the image, you have to apply a transfer function that maps the HU scale to, for instance, RGB.
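As a rough sketch of the two steps, the rescale can be applied first and a window then mapped straight onto display values, which avoids indexing a lookup table with negative numbers. The function names are hypothetical, and the window here is a simplified linear ramp between center - width/2 and center + width/2, not the exact formula from the DICOM standard.

```python
import numpy as np

def to_hounsfield(raw, slope, intercept):
    """Apply the rescale: output = raw * slope + intercept."""
    return np.asarray(raw, dtype=np.float64) * slope + intercept

def apply_window(hu, center, width):
    """Map [center - width/2, center + width/2] linearly onto 0..255.

    Values below the window become 0, values above become 255.
    Assumes width > 0.
    """
    lo = center - width / 2.0
    hi = center + width / 2.0
    scaled = (np.clip(hu, lo, hi) - lo) / (hi - lo) * 255.0
    return np.round(scaled).astype(np.uint8)

# Raw value 24 with slope 1.0 and intercept -1024 gives -1000 (air).
hu = to_hounsfield([24, 1024, 1064], slope=1.0, intercept=-1024)
print(hu.tolist())  # [-1000.0, 0.0, 40.0]

# Example window (center 40, width 400); the values are illustrative only.
display = apply_window(hu, center=40, width=400)
```

With this window, the -1000 value simply falls below the lower edge and is displayed as black.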

3D-Anaglyph creation algorithm, using depth map image: where to find?

I'm looking for a generic algorithm to calculate a red/cyan anaglyph starting from the original image and its b/w depth map (example: http://www.swell3d.com/2008/07/turn-2d-painting-into-3d-anagl.html)
That algorithm is used, for example, in Photoshop, but I can't find a readable explanation that would let me reproduce it.
Thanks
After some research I found what I was looking for.
First, I read some Photoshop/Gimp tutorials that describe how to make anaglyphs from two inputs: an image and its grayscale depth map. The core of the process is the use of the "Displace Tool" with the depth map as a displacement map.
One of the several youtube tutorials: http://www.youtube.com/watch?v=gfYMe_vYhu4
So, I studied Gimp's Displace Tool by looking at its documentation, http://docs.gimp.org/en/plug-in-displace.html, and directly at the source code of the tool (the method is very similar to the one proposed by Asgeir).
This lets us produce two stereo images from the input by looking at the depth map. The red and cyan colors of each image are calculated using the matrices on this page: http://3dtv.at/Knowhow/AnaglyphComparison_en.aspx (the "Optimized" matrices are the best ones).
Then, summing the two images into one produces the final anaglyph. Thanks everybody.
There are two algorithms involved. The first uses the original image and the depth map to produce a left and a right image. The second combines these images into a red-cyan anaglyph.
There are a couple ways to accomplish the first part. One is to take the original image and texture map it onto a fine mesh that lies flat in the XY plane. Then you tweak the Z values of each vertex in the mesh according to the corresponding value in the depth map. You've basically created a textured bas relief. You then use a 3D rendering algorithm to render the image from two vantage points that are offset horizontally by a small amount (essentially from the vantage point of a person's left and right eyes as they would view the bas relief).
There is probably a way to directly shift the pixels left and right which is a good fast approximation to what I described above.
Once you have the left and right images, you pass one through a cyan filter and one through a red filter. If you have RGB sources, that's as simple as taking the red channel from one image and combining it with the green and blue channels from the other image.
Anaglyphs work best with muted colors. If you have strong primaries, it won't look as good. You can use an algorithm to reduce the color saturation of the original image before you begin.
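The channel combination described in this answer can be sketched in a few lines of NumPy. It assumes both views are arrays of the same shape with channels in R, G, B order; combine_anaglyph is a hypothetical helper name, and taking red from the left view is the usual convention for red/cyan glasses rather than something stated in the answer.

```python
import numpy as np

def combine_anaglyph(left, right):
    """Red channel from the left view, green and blue from the right view.

    left, right: (H, W, 3) arrays in RGB channel order (an assumption).
    """
    anaglyph = right.copy()
    anaglyph[..., 0] = left[..., 0]
    return anaglyph

# Two flat-colored 2x2 test views.
left = np.zeros((2, 2, 3), dtype=np.uint8)
left[...] = (10, 20, 30)
right = np.zeros((2, 2, 3), dtype=np.uint8)
right[...] = (40, 50, 60)
anaglyph = combine_anaglyph(left, right)
print(anaglyph[0, 0].tolist())  # [10, 50, 60]
```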
From the description in the link you provided I would assume that it is something like
for each pixel (x, y) in depthmap
    x_offset = round((depthmap[x][y] / 255.0f) * MAX_PIXEL_OFFSET * DIRECTION)
    output[x + x_offset][y] = color_buffer[x][y]
blend output with color_buffer
Where MAX_PIXEL_OFFSET is the maximum shift in pixels and DIRECTION is -1 for one color and 1 for the other. This is assuming that the depthbuffer is one byte per pixel, range [0..255] and that 0 in the depthbuffer represents maximum distance.
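A runnable version of that pseudocode might look like the sketch below. It keeps the stated assumptions (one byte per depth pixel, 0 meaning maximum distance) and adds two of its own: target pixels that fall outside the image are dropped, and the final blend step is replaced by starting from a copy of the original so that holes keep a color. The name shift_by_depth is hypothetical.

```python
import numpy as np

def shift_by_depth(color, depth, max_offset, direction):
    """Shift each pixel horizontally in proportion to its depth value.

    color:      (H, W) or (H, W, 3) image array.
    depth:      (H, W) uint8 depth map, 0 = farthest, 255 = nearest.
    max_offset: maximum shift in pixels.
    direction:  -1 for one eye, +1 for the other.
    """
    height, width = depth.shape
    out = color.copy()          # holes keep the original pixel (stand-in for blending)
    xs = np.arange(width)
    for y in range(height):
        offset = np.rint(depth[y] / 255.0 * max_offset * direction).astype(int)
        target = xs + offset
        valid = (target >= 0) & (target < width)   # drop pixels shifted off the image
        # If two source pixels land on the same target, no depth ordering is applied.
        out[y, target[valid]] = color[y, xs[valid]]
    return out

row = np.array([[10, 20, 30]], dtype=np.uint8)
near = np.full((1, 3), 255, dtype=np.uint8)
print(shift_by_depth(row, near, 1, 1).tolist())   # [[10, 10, 20]]
print(shift_by_depth(row, near, 1, -1).tolist())  # [[20, 30, 30]]
```

Calling it once with direction -1 and once with direction 1 gives the two views that are then combined into the anaglyph.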

How WebGL works?

I'm looking for a deep understanding of how WebGL works. I want to gain knowledge at a level that most people care less about, because the knowledge isn't necessarily useful to the average WebGL programmer. For instance, what role does each part (browser, graphics driver, etc.) of the total rendering system play in getting an image on the screen?
Does each browser have to create a JavaScript/HTML engine/environment in order to run WebGL in the browser? Why is Chrome ahead of everyone else in terms of being WebGL compatible?
So, what are some good resources to get started? The Khronos specification is kind of lacking (from what I saw browsing it for a few minutes) for what I want. I mostly want to know how this is accomplished/implemented in browsers and what else needs to change on your system to make it possible.
Hopefully this little write-up is helpful to you. It overviews a big chunk of what I've learned about WebGL and 3D in general. BTW, if I've gotten anything wrong, somebody please correct me -- because I'm still learning, too!
Architecture
The browser is just that, a Web browser. All it does is expose the WebGL API (via JavaScript), which the programmer does everything else with.
As near as I can tell, the WebGL API is essentially just a set of (browser-supplied) JavaScript functions which wrap around the OpenGL ES specification. So if you know OpenGL ES, you can adopt WebGL pretty quickly. Don't confuse this with pure OpenGL, though. The "ES" is important.
The WebGL spec was intentionally left very low-level, leaving a lot to
be re-implemented from one application to the next. It is up to the
community to write frameworks for automation, and up to the developer
to choose which framework to use (if any). It's not entirely difficult
to roll your own, but it does mean a lot of overhead spent on
reinventing the wheel. (FWIW, I've been working on my own WebGL
framework called Jax for a while now.)
The graphics driver supplies the implementation of OpenGL ES that actually runs your code. At this point, it's running on the machine hardware, below even the C code. While this is what makes WebGL possible in the first place, it's also a double edged sword because bugs in the OpenGL ES driver (which I've noted quite a number of already) will show up in your Web application, and you won't necessarily know it unless you can count on your user base to file coherent bug reports including OS, video hardware and driver versions. Here's what the debug process for such issues ends up looking like.
On Windows, there's an extra layer which exists between the WebGL API and the hardware: ANGLE, or "Almost Native Graphics Layer Engine". Because the OpenGL ES drivers on Windows generally suck, ANGLE receives those calls and translates them into DirectX 9 calls instead.
Drawing in 3D
Now that you know how the pieces come together, let's look at a lower level explanation of how everything comes together to produce a 3D image.
JavaScript
First, the JavaScript code gets a 3D context from an HTML5 canvas element. Then it registers a set of shaders, which are written in GLSL ([Open] GL Shading Language) and essentially resemble C code.
The rest of the process is very modular. You need to get vertex data and any other information you intend to use (such as vertex colors, texture coordinates, and so forth) down to the graphics pipeline using uniforms and attributes which are defined in the shader, but the exact layout and naming of this information is very much up to the developer.
JavaScript sets up the initial data structures and sends them to the WebGL API, which sends them to either ANGLE or OpenGL ES, which ultimately sends it off to the graphics hardware.
Vertex Shaders
Once the information is available to the shader, the shader must transform the information in 2 phases to produce 3D objects. The first phase is the vertex shader, which sets up the mesh coordinates. (This stage runs entirely on the video card, below all of the APIs discussed above.) Most usually, the process performed on the vertex shader looks something like this:
gl_Position = PROJECTION_MATRIX * VIEW_MATRIX * MODEL_MATRIX * VERTEX_POSITION
where VERTEX_POSITION is a 4D vector (x, y, z, and w which is usually set to 1); VIEW_MATRIX is a 4x4 matrix representing the camera's view into the world; MODEL_MATRIX is a 4x4 matrix which transforms object-space coordinates (that is, coords local to the object before rotation or translation have been applied) into world-space coordinates; and PROJECTION_MATRIX which represents the camera's lens.
Most often, the VIEW_MATRIX and MODEL_MATRIX are precomputed and
called MODELVIEW_MATRIX. Occasionally, all 3 are precomputed into
MODELVIEW_PROJECTION_MATRIX or just MVP. These are generally meant
as optimizations, though I'd like to find time to do some benchmarks. It's
possible that precomputing is actually slower in JavaScript if it's
done every frame, because JavaScript itself isn't all that fast. In
this case, the hardware acceleration afforded by doing the math on the
GPU might well be faster than doing it on the CPU in JavaScript. We can
of course hope that future JS implementations will resolve this potential
gotcha by simply being faster.
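To make the matrix chain concrete, here is a small NumPy sketch that applies the three matrices one after another and compares the result with a precomputed MVP matrix. The helper functions and all numeric values are assumptions chosen for illustration; the projection follows the common OpenGL-style perspective matrix with column vectors.

```python
import numpy as np

def translation(tx, ty, tz):
    """4x4 matrix that translates by (tx, ty, tz)."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def perspective(fov_y_deg, aspect, z_near, z_far):
    """OpenGL-style perspective projection matrix (column-vector convention)."""
    f = 1.0 / np.tan(np.radians(fov_y_deg) / 2.0)
    m = np.zeros((4, 4))
    m[0, 0] = f / aspect
    m[1, 1] = f
    m[2, 2] = (z_far + z_near) / (z_near - z_far)
    m[2, 3] = 2.0 * z_far * z_near / (z_near - z_far)
    m[3, 2] = -1.0
    return m

model = translation(1.0, 0.0, 0.0)        # object placed at x = 1 in the world
view = translation(0.0, 0.0, -5.0)        # camera 5 units back from the origin
projection = perspective(60.0, 800 / 600, 0.1, 100.0)
vertex = np.array([0.0, 0.0, 0.0, 1.0])   # object-space position, w = 1

# Same order as in the shader: PROJECTION * VIEW * MODEL * VERTEX.
step_by_step = projection @ (view @ (model @ vertex))

# Precomputing the combined matrix once gives the same clip coordinates.
mvp = projection @ view @ model
precomputed = mvp @ vertex
```

In this example the W component of the result is 5, the distance of the vertex in front of the camera, which is what the later division by W relies on.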
Clip Coordinates
When all of these have been applied, the gl_Position variable will have a set of XYZ coordinates and a W component. These are called clip coordinates. For a visible point, X, Y and Z each lie within [-W, W], which is the range [-1, 1] when W is 1.
It's worth noting that clip coordinates are the only thing the vertex shader really
needs to produce. You can completely skip the matrix transformations
performed above, as long as you produce a clip coordinate result. (I have even
experimented with swapping out matrices for quaternions; it worked
just fine but I scrapped the project because I didn't get the
performance improvements I'd hoped for.)
After you supply clip coordinates to gl_Position, WebGL divides the result by gl_Position.w, producing what's called normalized device coordinates.
From there, projecting a pixel onto the screen is a simple matter of multiplying by 1/2 the screen dimensions and then adding 1/2 the screen dimensions.[1] Here are some examples of clip coordinates translated into 2D coordinates on an 800x600 display:
clip = [0, 0]
x = (0 * 800/2) + 800/2 = 400
y = (0 * 600/2) + 600/2 = 300
clip = [0.5, 0.5]
x = (0.5 * 800/2) + 800/2 = 200 + 400 = 600
y = (0.5 * 600/2) + 600/2 = 150 + 300 = 450
clip = [-0.5, -0.25]
x = (-0.5 * 800/2) + 800/2 = -200 + 400 = 200
y = (-0.25 * 600/2) + 600/2 = -75 + 300 = 225
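The division by W and the mapping to pixels can be combined into one small function. This sketch assumes the viewport covers the whole canvas starting at (0, 0), which is a simplification of what gl.viewport allows; clip_to_pixel is a hypothetical name.

```python
def clip_to_pixel(clip, width, height):
    """Convert clip coordinates (x, y, z, w) to 2D pixel coordinates."""
    x, y, z, w = clip
    # Perspective divide: clip coordinates -> normalized device coordinates.
    ndc_x = x / w
    ndc_y = y / w
    # Scale by half the screen size, then offset by half the screen size.
    pixel_x = ndc_x * width / 2 + width / 2
    pixel_y = ndc_y * height / 2 + height / 2
    return pixel_x, pixel_y

print(clip_to_pixel((0.0, 0.0, 0.0, 1.0), 800, 600))  # (400.0, 300.0)
print(clip_to_pixel((2.0, 1.0, 0.0, 4.0), 800, 600))  # (600.0, 375.0)
```

The second call shows the effect of W: the clip coordinates (2, 1) with W = 4 become (0.5, 0.25) after the divide.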
Pixel Shaders
Once it's been determined where a pixel should be drawn, the pixel is handed off to the pixel shader, which chooses the actual color the pixel will be. This can be done in a myriad of ways, ranging from simply hard-coding a specific color to texture lookups to more advanced normal and parallax mapping (which are essentially ways of "cheating" texture lookups to produce different effects).
Depth and the Depth Buffer
Now, so far we've ignored the Z component of the clip coordinates. Here's how that works out. When we multiplied by the projection matrix, the third clip component resulted in some number. If that number is greater than 1.0 or less than -1.0, then the number is beyond the view range of the projection matrix, corresponding to the matrix zFar and zNear values, respectively.
So if it's not in the range [-1, 1] then it's clipped entirely. If it is in that range, then the Z value is scaled to 0 to 1[2] and is compared to the depth buffer[3]. The depth buffer has the same dimensions as the screen, so that if a projection of 800x600 is used, the depth buffer is 800 pixels wide and 600 pixels high. We already have the pixel's X and Y coordinates, so they are plugged into the depth buffer to get the currently stored Z value. If the stored Z value is greater than the new Z value, then the new Z value is closer than whatever was previously drawn, and replaces it[4]. At this point it's safe to light up the pixel in question (or in the case of WebGL, draw the pixel to the canvas), and store the Z value as the new depth value.
If the new Z value is greater than the stored depth value, then it is deemed to be "behind" whatever has already been drawn, and the pixel is discarded.
[1]The actual conversion uses the gl.viewport settings to convert from normalized device coordinates to pixels.
[2]It's actually scaled to the gl.depthRange settings. They default to 0 and 1.
[3]Assuming you have a depth buffer and you've turned on depth testing with gl.enable(gl.DEPTH_TEST).
[4]You can set how Z values are compared with gl.depthFunc
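The depth test described in this answer can be modeled on the CPU in a few lines. This is a simplified sketch, not how a GPU is implemented: it assumes the default depth range of 0 to 1, a "less than" comparison, and a buffer cleared to the farthest value, and draw_fragment is a hypothetical name.

```python
import numpy as np

def draw_fragment(depth_buffer, color_buffer, x, y, z_ndc, color):
    """Write a pixel only if it is closer than what is already stored there."""
    if z_ndc < -1.0 or z_ndc > 1.0:
        return False                    # outside the zNear..zFar range
    z = (z_ndc + 1.0) / 2.0             # scale [-1, 1] to [0, 1]
    if z < depth_buffer[y, x]:          # closer than the stored value
        depth_buffer[y, x] = z
        color_buffer[y, x] = color
        return True
    return False                        # behind existing geometry: discard

# 800x600 buffers; depth cleared to 1.0 (farthest), color cleared to black.
depth_buffer = np.ones((600, 800))
color_buffer = np.zeros((600, 800, 3), dtype=np.uint8)

drew_red = draw_fragment(depth_buffer, color_buffer, 10, 20, 0.5, (255, 0, 0))
drew_green = draw_fragment(depth_buffer, color_buffer, 10, 20, 0.8, (0, 255, 0))
drew_blue = draw_fragment(depth_buffer, color_buffer, 10, 20, -0.2, (0, 0, 255))
print(drew_red, drew_green, drew_blue)  # True False True
```

The green fragment is rejected because it lies behind the red one, while the blue fragment is closer and overwrites it.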
I would read these articles
http://webglfundamentals.org/webgl/lessons/webgl-how-it-works.html
Assuming those articles are helpful, the rest of the picture is that WebGL runs in a browser. It renders to a canvas tag. You can think of a canvas tag like an img tag, except you use the WebGL API to generate an image instead of downloading one.
Like other HTML5 tags, the canvas tag can be styled with CSS and can sit under or over other parts of the page. It is composited (blended) with other parts of the page, and it can be transformed, rotated and scaled by CSS along with other parts of the page. That's a big difference from OpenGL or OpenGL ES.

Resources