Future prospects for improvement of depth data on Project Tango tablet - google-project-tango

I am interested in using the Project Tango tablet for 3D reconstruction using arbitrary point features. In the current SDK version, we seem to have access to the following data.
A 1280 x 720 RGB image.
A point cloud with 0-~10,000 points, depending on the environment. This seems to average between 3,000 and 6,000 in most environments.
What I really want is to be able to identify a 3D point for key points within an image. Therefore, it makes sense to project depth into the image plane. I have done this, and I get something like this:
The problem with this process is that the depth points are sparse compared to the RGB pixels. So I took it a step further and performed interpolation between the depth points. First, I did Delaunay triangulation, and once I got a good triangulation, I interpolated between the 3 points on each facet and got a decent, fairly uniform depth image. Here are the zones where the interpolated depth is valid, superimposed on the RGB image.
Now, given the camera model, it's possible to project depth back into Cartesian coordinates at any point on the depth image (since the depth image was made such that each pixel corresponds to a point on the original RGB image, and we have the camera parameters of the RGB camera). However, if you look at the triangulation image and compare it to the original RGB image, you can see that depth is valid for all of the uninteresting points in the image: blank, featureless planes mostly. This isn't just true for this single set of images; it's a trend I'm seeing for the sensor. If a person stands in front of the sensor, for example, there are very few depth points within their silhouette.
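A minimal sketch of the pinhole relations this relies on (C++; lens distortion is ignored, and fx, fy, cx, cy stand in for the RGB camera intrinsics):

#include <opencv2/opencv.hpp>

// Forward: a 3D point in the RGB camera frame lands at pixel (u, v).
cv::Point2d projectToImage(const cv::Point3d& p,
                           double fx, double fy, double cx, double cy)
{
    return cv::Point2d(fx * p.x / p.z + cx, fy * p.y / p.z + cy);
}

// Backward: a pixel (u, v) with an (interpolated) depth Z is lifted back to 3D.
cv::Point3d backProject(double u, double v, double Z,
                        double fx, double fy, double cx, double cy)
{
    return cv::Point3d((u - cx) * Z / fx, (v - cy) * Z / fy, Z);
}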
As a result of this characteristic of the sensor, if I perform visual feature extraction on the image, most of the areas with corners or interesting textures fall in areas without associated depth information. Just an example: I detected 1000 SIFT keypoints from an RGB image from an Xtion sensor, and 960 of those had valid depth values. If I do the same thing with this system, I get around 80 keypoints with valid depth. At the moment, this level of performance is unacceptable for my purposes.
I can guess at the underlying reasons for this: it seems like some sort of plane extraction algorithm is being used to get depth points, whereas Primesense/DepthSense sensors are using something more sophisticated.
So anyway, my main question here is: can we expect any improvement in the depth data at a later point in time, through improved RGB-IR image processing algorithms? Or is this an inherent limit of the current sensor?

I am from the Project Tango team at Google. I am sorry you are experiencing trouble with depth on the device. Just so that we are sure your device is in good working condition, could you please test the depth performance against a flat wall? Instructions are below:
https://developers.google.com/project-tango/hardware/depth-test
Even with a device in good working condition, the depth library is known to return sparse depth points on scenes with low IR reflectance objects, small sized objects, high dynamic range scenes, surfaces at certain angles and objects at distances larger than ~4m. While some of these are inherent limitations in the depth solution, we are working with the depth solution provider to bring improvements wherever possible.
Attached is an image of a typical conference room scene and the corresponding point cloud. As you can see, no depth points are returned from the laptop screen (low reflectance), the tabletop objects such as post-its and the pencil holder (small object sizes), large portions of the table (surfaces at an angle), or the room corner at the far right (distance > 4 m).
But as you move the device around, you will start getting depth returns from these areas. Accumulating depth points is a must to get denser point clouds.
Please also keep us posted on your findings at project-tango-hardware-support#google.com

In my very basic initial experiments, you are correct about the depth information returned from the visual field; however, the return of surface points is anything but constant. I find that as I move the device I get major shifts in where depth information is returned, i.e. there's a lot of transitory opacity in the image with respect to depth data, probably due to the characteristics of the surfaces.
So while no single return frame is enough, the real question seems to be the construction of a larger model (point clouds to start with, possibly voxel spaces as one scales up) that brings successive scans into a common model. It's reminiscent of synthetic aperture algorithms in spirit, but the letters in the equations are from a whole different set of laws.
In short, I think a more interesting approach is to synthesize a more complete model by successive accumulation of point cloud data. Now, for this to work, the device team has to have their dead reckoning on the money at whatever scale this is done. This also addresses an issue that no sensor improvement can fix: even a perfect visual sensor does nothing to help you relate the sides of an object to its front, or even place them in the same neighborhood.
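A minimal sketch of that accumulation step (C++ for illustration; it assumes each frame's pose is available as a rotation R and translation t from the device frame to a common world frame):

#include <opencv2/opencv.hpp>
#include <vector>

// Append one frame's point cloud to the growing model: p_world = R * p_device + t
void accumulate(std::vector<cv::Point3d>& model,
                const std::vector<cv::Point3d>& cloud,
                const cv::Matx33d& R, const cv::Vec3d& t)
{
    for (const cv::Point3d& p : cloud) {
        cv::Vec3d w = R * cv::Vec3d(p.x, p.y, p.z) + t;
        model.push_back(cv::Point3d(w[0], w[1], w[2]));
    }
}

How consistent the accumulated cloud stays then depends entirely on how good that dead reckoning is over the scanning interval.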

Related

Algorithm or tool for finding edges of a different area when comparing two images

I am working on a community project whose goal is to reduce speeding violations. In order to recognize cars' license plates I am using OpenALPR. The problem is that it is sensitive to the camera position, that is, the angle: OpenALPR has trouble detecting the LP when the angle is greater than 20 degrees (yes, I've read the recommendations about good camera positioning, but IRL they sometimes cannot be satisfied).
I found that the problem is that the LP area is not detected. However, manually cropping the image to contain just the car, without any other modification of the pixels (like filtering), fixes the problem and OpenALPR is able to detect the LP area.
I am looking for a solution that can do the cropping automatically: either an algorithm or a tool that can compare two images, "base" and "target", and return the coordinates (top left, bottom right) of the changed area in the target image.
An alternative solution would be a different configuration file for OpenALPR. I have been experimenting with this for the last few hours, but with no success.
Base image will look like:
Target image will look like:
(these are just two frames from a video)
(original image size is much bigger, i.e. 3840x2160)
Are there algorithm(s) or tools that can help me with automating this task?
The basic method is differencing, i.e. taking the absolute difference of the RGB component values pixel by pixel. Where differences are large, there is a detection.
But this can work poorly (and it does with the given images) because the two pictures may be slightly unaligned, and wind can move the vegetation.
So I recommend that you:
reduce the image resolution by a significant factor (say 8);
blur the reduced images;
compute the absolute differences;
keep the largest differences among the components;
binarize with a threshold;
finally use connected components labelling to find the most significant blob and eliminate the residual interference.
Make sure to refresh the background image (when you are sure there is no car) to avoid the effect of daily drift (there are always slow changes). It may also be useful to normalize the image intensity to counteract changes in ambient lighting (passing clouds, for instance).
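A minimal OpenCV (C++) sketch of that pipeline; the reduction factor, blur size and threshold are placeholders to tune for your footage:

#include <opencv2/opencv.hpp>

// Returns the bounding box (in full-resolution coordinates) of the largest changed area.
cv::Rect changedArea(const cv::Mat& base, const cv::Mat& target)
{
    cv::Mat a, b;
    cv::resize(base,   a, cv::Size(), 1.0 / 8, 1.0 / 8);        // reduce resolution
    cv::resize(target, b, cv::Size(), 1.0 / 8, 1.0 / 8);
    cv::GaussianBlur(a, a, cv::Size(5, 5), 0);                   // blur
    cv::GaussianBlur(b, b, cv::Size(5, 5), 0);

    cv::Mat diff, channels[3], maxDiff, mask;
    cv::absdiff(a, b, diff);                                     // absolute differences
    cv::split(diff, channels);                                   // largest difference among components
    cv::max(channels[0], channels[1], maxDiff);
    cv::max(maxDiff, channels[2], maxDiff);
    cv::threshold(maxDiff, mask, 40, 255, cv::THRESH_BINARY);    // binarize

    cv::Mat labels, stats, centroids;                            // connected components labelling
    int n = cv::connectedComponentsWithStats(mask, labels, stats, centroids);
    int best = 0, bestArea = 0;
    for (int i = 1; i < n; ++i) {
        int area = stats.at<int>(i, cv::CC_STAT_AREA);
        if (area > bestArea) { bestArea = area; best = i; }
    }
    if (best == 0) return cv::Rect();                            // no significant change
    return cv::Rect(stats.at<int>(best, cv::CC_STAT_LEFT)  * 8,
                    stats.at<int>(best, cv::CC_STAT_TOP)   * 8,
                    stats.at<int>(best, cv::CC_STAT_WIDTH) * 8,
                    stats.at<int>(best, cv::CC_STAT_HEIGHT) * 8);
}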

Shading mask algorithm for radiation calculations

I am working on software (Ruby - SketchUp) to calculate the radiation (sun, sky and surrounding buildings) within an urban development at pedestrian level. The final goal is to be able to create a contour map that shows the level of total radiation. By total radiation I mean shortwave (light) and longwave (heat). (To give you an idea: http://www.iaacblog.com/maa2011-2012-digitaltools/files/2012/01/Insolation-Analysis-All-Year.jpg)
I know there are several existing software packages that do this, but I need to write my own, as this calculation is only part of a more complex workflow.
The (obvious) pseudo code is the following:
Select and mesh surface for analysis
From each point of the mesh
    Cast n (see below) rays in the upper hemisphere (precalculated)
        For each ray check whether it is in shade
            If in shade => Extract properties from intersected surface
            If not in shade => Flag it
        loop
    loop
loop
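To make the loop structure concrete, here is a rough sketch in C++ (not Ruby), with the scene intersection test passed in as a callback, since the actual ray test would come from the SketchUp API:

#include <cmath>
#include <functional>
#include <vector>

struct Vec3 { double x, y, z; };
struct Sample { Vec3 dir; bool inShade; };

// shadeTest(origin, dir) returns true when the ray hits scene geometry.
std::vector<Sample> shadingMask(const Vec3& p, int nAzimuth, int nTilt,
                                const std::function<bool(const Vec3&, const Vec3&)>& shadeTest)
{
    const double PI = 3.14159265358979323846;
    std::vector<Sample> mask;
    for (int i = 0; i < nAzimuth; ++i) {
        double az = (i + 0.5) * 2.0 * PI / nAzimuth;
        for (int j = 0; j < nTilt; ++j) {
            double tilt = (j + 0.5) * 0.5 * PI / nTilt;    // altitude above the horizon
            Vec3 dir = { std::cos(tilt) * std::sin(az),
                         std::cos(tilt) * std::cos(az),
                         std::sin(tilt) };
            mask.push_back({ dir, shadeTest(p, dir) });    // when in shade, the intersected
        }                                                  // surface properties would be stored too
    }
    return mask;
}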
The approach above is brute force, but it is the only one I can think of. The calculation time increases with the fourth power of the accuracy (Dx, Dy, Dazimuth, Dtilt). I know that software like Radiance uses a Monte Carlo approach to reduce the number of rays.
As you can imagine, the accuracy of the calculation for a specific point of the mesh is strongly dependent on the accuracy of the skydome subdivision. Similarly, the accuracy on the surface depends on the coarseness of the mesh.
I was thinking of a different approach using adaptive refinement based on the results of the calculations. The refinement could work for both the analyzed surface and the skydome. If the results at two adjacent points differ by more than a threshold value, then a refinement is performed. This is usually done in fluid simulation, but I could not find anything about light simulation.
I also wonder whether there are algorithms, from computer graphics for example, that would allow me to minimize the number of calculations. For example: check the maximum height of the surroundings so as to exclude certain parts of the skydome for certain points.
I don't need extreme accuracy as I am not doing rendering. My priority is speed at this moment.
Any suggestion on the approach?
Thanks
n rays
At the moment I subdivide the sky by constant azimuth and tilt steps; this causes irregular solid angles. There are other subdivisions (e.g. Tregenza) that maintain a constant solid angle.
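For what it's worth, the irregularity is easy to quantify: a sky patch spanning an azimuth step Daz between altitudes alt1 and alt2 has solid angle Daz * (sin(alt2) - sin(alt1)), so with constant steps the patches near the horizon come out several times larger than those near the zenith, which is exactly the bias that equal-solid-angle schemes such as Tregenza's avoid.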
EDIT: Response to the great questions from Spektre
Time frame. I run one simulation for each hour of the year. The weather data is extracted from an epw weather file. It contains, for each hour, solar altitude and azimuth, direct radiation, diffuse radiation, and cloudiness (for atmospheric longwave diffuse). My algorithm calculates the shadow mask separately, then uses this shadow mask to calculate the radiation on the surface (and on a typical pedestrian) for each hour of the year. It is in this second step that I add the actual radiation. In the first step I just gather information on the geometry and properties of the various surfaces.
Sun paths. No, I don't. See point 1.
Include reflection from buildings? Not at the moment, but I plan to include it as an overall diffuse shortwave reflection based on sky view factor. I consider only shortwave reflection from the ground now.
Include heat dissipation from buildings? Absolutely yes. That is the reason why I wrote this code myself. Here in Dubai this is key as building surfaces gets very, very hot.
Surface albedo? Yes, I do. In SketchUp I have associated a dictionary with every surface, and in this dictionary I include all the surface properties: temperature, emissivity, etc. At the moment the temperatures are fixed (ambient temperature if not assigned), but I plan, in the future, to combine this with the results from a dynamic building thermal simulation that already calculates all the surface temperatures.
Map resolution. The resolution is chosen by the user and the mesh generated by the algorithm. In terms of scale, I use this for masterplans. The scale goes from 100mx100m up to 2000mx2000m. I usually tend to use a minimum resolution of 2m. The limit is the memory and the simulation time. I also have the option to refine specific areas with a much finer mesh: for example areas where there are restaurants or other amenities.
Framerate. I do not need to make an animation. Results are exported in a VTK file and visualized in Paraview and animated there just to show off during presentations :-)
Heat and light. Yes. Shortwave and longwave are handled separately. See point 4. The geolocalization is only used to select the correct weather file. I do not calculate all the radiation components. The weather files I need have measured data. They are not great, but good enough for now.
https://www.lucidchart.com/documents/view/5ca88b92-9a21-40a8-aa3a-0ff7a5968142/0
visible light
For a relatively flat, global base ground light map I would use projected shadow texture techniques instead of ray-tracing angular integration. It is way faster, with almost the same result. This will not work on non-flat ground (many bigger bumps which cast bigger shadows and also make the active light absorption area anisotropic). Urban areas are usually flat enough (inclination does not matter), so the technique is as follows:
camera and viewport
The ground map is the target screen, so set the viewpoint underground, looking upwards towards the Sun direction. The resolution is at least your map resolution, and there is no perspective projection.
rendering light map 1st pass
First clear the map with the full radiation (direct + diffuse) (light blue), then render the buildings/objects with diffuse radiation only (shadow). This will make the base map, without reflections or soft shadows, in the Magenta rendering target.
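A small sketch of the underlying shadow projection for a flat ground plane (C++; it assumes azimuth is measured clockwise from north with x = east and y = north, and the Sun above the horizon): each building vertex at height z is slid along the Sun direction down to z = 0, and the resulting polygons are what get rasterized into the light map.

#include <cmath>

struct P3 { double x, y, z; };

// Ground shadow of a vertex for a Sun at (sunAzimuth, sunAltitude), both in radians.
P3 groundShadow(const P3& v, double sunAzimuth, double sunAltitude)
{
    double d = v.z / std::tan(sunAltitude);       // horizontal shadow length of this vertex
    return { v.x - d * std::sin(sunAzimuth),      // shadow falls away from the Sun
             v.y - d * std::cos(sunAzimuth),
             0.0 };
}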
rendering light map 2nd pass
Now you need to add reflections from the building faces (walls). For that I would take every outdoor face of a building that is facing the Sun or heated enough, compute its reflection points onto the light map, and render the reflection directly to the map.
In this part you can add ray tracing for vertices only to make it more precise, and also to include multiple reflections (but in that case do not forget to add scattering).
project target screen to destination radiation map
Just project the Magenta rendering target image onto the ground plane (green). It is only a simple linear affine transform ...
post processing
You can add soft shadows by blurring/smoothing the light map. To make it more precise you can add info to each pixel about whether it is shadow or wall. Actual walls are just pixels that are at 0 m height above ground, so you can use the Z-buffer values directly for this. The level of blurring depends on the scattering properties of the air, and of course pixels at 0 m ground height are not blurred at all.
IR
This can be done in a similar way, but temperature behaves a bit differently, so I would make several layers of the scene at a few altitudes above ground, forming a volume render, and then post-process the energy transfers between pixels and layers. Also do not forget to add the cooling effect of green plants and water vaporisation.
I do not have enough experience in this field to make more suggestions; I am more used to temperature maps for very high temperature variances in specific conditions and materials, not outdoor conditions.
PS. I forgot: albedo for IR and visible light is very different for many materials, especially aluminium and some wall paints.

how to improve keypoints detection and matching

I have been working on a personal project in image processing and robotics where, instead of the robot detecting colors and picking out objects as usual, it tries to detect the holes (resembling different polygons) on a board. For a better understanding of the setup, here is an image:
As you can see, I have to detect these holes, find out their shapes and then use the robot to fit the object into the holes. I am using a Kinect depth camera to get the depth image. The picture is shown below:
I was at a loss for how to detect the holes with the camera. Initially I used masking to remove the background portion and some of the foreground portion based on the depth measurement, but this did not work out because, at different orientations of the camera, the holes would merge with the board... something like inRanging (it fully becomes white). Then I came across the adaptiveThreshold function:
adaptiveThreshold(depth1, depth3, 255, ADAPTIVE_THRESH_GAUSSIAN_C, THRESH_BINARY, 7, -1.0); // blockSize = 7, constant C = -1
With noise removal using erode, dilate and Gaussian blur, it detected the holes in a better manner, as shown in the picture below. Then I used the cvCanny edge detector to get the edges, but so far it has not been good, as shown in the picture below. After this I tried out various feature detectors, from SIFT, SURF, ORB and goodFeaturesToTrack, and found that ORB gave the best times and the best features detected. After this I tried to get the relative camera pose of a query image by finding its keypoints and matching those keypoints, passing the good matches to the findHomography function. The results are shown in the diagram below:
In the end I want to get the relative camera pose between the two images and move the robot to that position using the rotation and translation vectors obtained from the solvePnP function.
So is there any other method by which I could improve the quality of the detected holes for keypoint detection and matching?
I had also tried contour detection and approxPolyDP but the approximated shapes are not really good:
I have tried tweaking the input parameters for the threshold and Canny functions, but this is the best I can get.
Also, is my approach to getting the camera pose correct?
UPDATE: No matter what I tried, I could not get good, repeatable features to map. Then I read online that a depth image is low in resolution and is only used for things like masking and getting distances. So it hit me that the features were not proper because of the low-resolution image with its messy edges. So I thought of detecting features on the RGB image and using the depth image only to get the distances of those features. The quality of the features I got was literally off the charts. It even detected the screws on the board! Here are the keypoints detected using goodFeaturesToTrack keypoint detection.
I met another hurdle while getting the distances: the distances of the points were not coming out properly. I searched for possible causes, and it occurred to me after quite a while that there is an offset between the RGB and depth images because of the offset between the cameras. You can see this from the first two images. I then searched the net for how to compensate for this offset but could not find a working solution.
If any one of you could help me compensate for the offset, it would be great!
UPDATE: I could not make good use of the goodFeaturesToTrack function. The function gives the corners as Point2f. To compute descriptors we need KeyPoints, and converting Point2f to KeyPoint with the code snippet below leads to the loss of scale and rotation invariance.
for (size_t i = 0; i < corners1.size(); i++)
{
    // each corner gets a fixed size of 1 and no orientation, which is why
    // the descriptors lose scale and rotation invariance
    keypoints_1.push_back(KeyPoint(corners1[i], 1.f));
}
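One way around this, as a sketch (assuming OpenCV 3's ORB::create API), is to skip the conversion entirely and let ORB detect its own keypoints on the RGB image, so each keypoint keeps a real size and orientation for the descriptor stage:

#include <opencv2/opencv.hpp>

cv::Mat orbFeatures(const cv::Mat& rgb, std::vector<cv::KeyPoint>& keypoints)
{
    cv::Ptr<cv::ORB> orb = cv::ORB::create(1000);   // up to 1000 keypoints
    cv::Mat descriptors;
    orb->detectAndCompute(rgb, cv::noArray(), keypoints, descriptors);
    return descriptors;
}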
The hideous result from the feature matching is shown below.
I have to start on different feature matching methods now. I'll post further updates. It would be really helpful if anyone could help in removing the offset problem.
Compensating the difference between image output and the world coordinates:
You should use the good old camera calibration approach for calibrating the camera response and possibly generating a correction matrix for the camera output (in order to convert it into real-world scales).
It's not that complicated once you have printed out a checkerboard template and captured various shots. (For this application you don't need to worry about rotation invariance. Just calibrate the world view against the image array.)
You can find more information here: http://www.vision.caltech.edu/bouguetj/calib_doc/htmls/own_calib.html
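As a hedged sketch of how that calibration is usually applied to a Kinect-style pair: once you have intrinsics for both the depth and RGB cameras plus the rotation R and translation t between them (cv::stereoCalibrate can estimate these from checkerboard shots seen by both sensors), every depth pixel can be lifted to 3D, moved into the RGB camera frame, and re-projected, giving a depth map aligned with the RGB image. The function below is illustrative, not from any particular library:

#include <opencv2/opencv.hpp>

// depth: CV_16U in millimetres; t must be in the same units (millimetres here).
cv::Mat registerDepthToRgb(const cv::Mat& depth,
                           const cv::Matx33d& Kd, const cv::Matx33d& Krgb,
                           const cv::Matx33d& R, const cv::Vec3d& t,
                           cv::Size rgbSize)
{
    cv::Mat out(rgbSize, CV_16U, cv::Scalar(0));
    for (int v = 0; v < depth.rows; ++v) {
        for (int u = 0; u < depth.cols; ++u) {
            unsigned short z = depth.at<unsigned short>(v, u);
            if (z == 0) continue;                               // no measurement
            double Z = z;
            cv::Vec3d P((u - Kd(0, 2)) * Z / Kd(0, 0),          // lift (u, v, Z) to 3D
                        (v - Kd(1, 2)) * Z / Kd(1, 1), Z);
            cv::Vec3d Q = R * P + t;                            // move into the RGB frame
            int ur = cvRound(Krgb(0, 0) * Q[0] / Q[2] + Krgb(0, 2));
            int vr = cvRound(Krgb(1, 1) * Q[1] / Q[2] + Krgb(1, 2));
            if (ur >= 0 && ur < rgbSize.width && vr >= 0 && vr < rgbSize.height)
                out.at<unsigned short>(vr, ur) = z;             // nearest-pixel write
        }
    }
    return out;
}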
--
Now, since I can't seem to comment on the question, I'd like to ask whether your specific application requires the machine to "find out" the shape of the hole on the fly. If there is a finite number of hole shapes, you may model them mathematically and look for the pixels that support the predefined models in the B/W edge image.
For example, x^2 + y^2 - r^2 = 0 for a circle with radius r, where x and y are the pixel coordinates.
That being said, I believe more clarification is needed regarding the requirements of the application (shape detection).
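For the circular case specifically, one ready-made way to "look for pixels that support the model" is the circular Hough transform; a sketch with OpenCV 3 constant names and placeholder parameters:

#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Vec3f> findCircles(const cv::Mat& gray)
{
    std::vector<cv::Vec3f> circles;             // each entry: centre x, centre y, radius
    cv::HoughCircles(gray, circles, cv::HOUGH_GRADIENT,
                     1,        // accumulator resolution (same as the image)
                     20,       // minimum distance between detected centres
                     100,      // upper Canny threshold used internally
                     30,       // accumulator vote threshold
                     5, 50);   // min / max radius in pixels
    return circles;
}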
If you're going to detect specific shapes such as the ones in your provided image, then you're better off using a classifier. Delve into Haar classifiers, or better still, look into Bag of Words.
Using BoW, you'll need to train on a dataset consisting of positive and negative samples. Positive samples will contain N unique samples of each shape you want to detect. It's better if N is > 10, best if > 100, with samples that are highly variant and unique, for good, robust classifier training.
Negative samples would (obviously) contain stuff that does not represent your shapes in any way. They're just for checking the accuracy of the classifier.
Also, once you have your classifier trained, you could distribute your classifier data (say, if you use an SVM).
Here are some links to get you started with Bag of Words:
https://gilscvblog.wordpress.com/2013/08/23/bag-of-words-models-for-visual-categorization/
Sample code:
http://answers.opencv.org/question/43237/pyopencv_from-and-pyopencv_to-for-keypoint-class/

How can you distribute the color intensity of two images using its gradients?

I am working on an automatic image stitching algorithm using MATLAB. So far, I have downloaded source code much like what I had in mind, and I'm currently studying how the code works.
The problem is that when stitching two or more images together, their color intensities will most probably differ from each other, so the stitched seams will be visible to the eye... So, right now, I'm trying to find out how to redistribute their color intensity using the images' gradients so that the whole stitched image has a uniform color intensity.
I hope someone can help me out there and if so, thank you very much...
If the images overlap by a significant amount, and the stitching algorithm does a very good job of registering the overlap region, a very simple solution would be to blend the pixel values from the two images together in the overlap region, using a weighted average with weights going from 0-1 depending on the distance from the edge of the overlap region.
blendedPixel = (imageApixel * weightA) + (imageBpixel * weightB)
where weightA approaches 1 as we get closer to the imageA side of the overlap region, weightB approaches 1 as we get closer to the imageB side of the overlap region, and the sum of weightA and weightB is always 1.
The above solution is not particularly principled, and does depend on the stitching algorithm doing a very good job of image registration in the overlap region.
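For illustration (C++/OpenCV rather than MATLAB, and assuming imageA and imageB have already been warped onto panorama-sized canvases with `overlap` as their common rectangle and the ramp running horizontally across it), the blend amounts to:

#include <opencv2/opencv.hpp>

void blendOverlap(cv::Mat& panorama, const cv::Mat& imageA, const cv::Mat& imageB,
                  const cv::Rect& overlap)
{
    for (int y = overlap.y; y < overlap.y + overlap.height; ++y) {
        for (int x = overlap.x; x < overlap.x + overlap.width; ++x) {
            double weightB = double(x - overlap.x) / overlap.width;  // 0 at the A side, 1 at the B side
            double weightA = 1.0 - weightB;
            cv::Vec3b pa = imageA.at<cv::Vec3b>(y, x);
            cv::Vec3b pb = imageB.at<cv::Vec3b>(y, x);
            for (int c = 0; c < 3; ++c)
                panorama.at<cv::Vec3b>(y, x)[c] =
                    cv::saturate_cast<uchar>(weightA * pa[c] + weightB * pb[c]);
        }
    }
}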
Another, more principled solution to the problem would be to remove the source of the intensity difference, attempting to homogenize the response of the pixels across the image plane.
The form of this solution will depend on the source of the intensity difference, which will depend on the optics and the scene lighting conditions.
For example, when dealing with photographs of outdoor scenes taken at the same time from the same location, the dominant effect will likely be "vignetting", which can be due to a variety of different causes, including differences between the various paths taken by the light through the camera optics.
As another example, when dealing with photographs taken through a microscope of a sample illuminated at an oblique angle, the dominant effect will likely be due to the difference in illumination between those parts of the image closest to the light and those far away.
Vignetting generally manifests itself as a radially symmetric function centred around the projection of the optical axis of the lens onto the image plane. To correct for vignetting, you should try to fit a suitable radially symmetric function.
Lighting changes can take different functional forms, but fitting a straightforward linear approximation is sufficient in many cases.
Depending upon the scene, and the number and variability of the images you have available, you may need to take calibration images to fit these functions properly.
The above approaches make assumptions about the functional forms of the sources of the intensity differences, but not about the scene or its statistics.
Yet another approach might be to make some assumptions about the scene, for example that all significant information is represented by spatial frequencies above some threshold. You can then remove all low-spatial-frequency intensity components. This will "flatten" the image, removing much of the low-frequency vignetting and lighting issues.
This approach might be applicable to microscopy images, satellite images, or images of other scenes where most of the interest lies in the detail rather than in the drama of the composition.
There are a number of papers that tackle this problem, many at a level of technical sophistication rather beyond the above discussion. For example, see D. Goldman, "Vignette and Exposure Calibration and Compensation", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 12, pp. 2276-2288.

Automatic tracking algorithm

I'm trying to write a simple tracking routine to track some points on a movie.
Essentially I have a series of 100-frames-long movies, showing some bright spots on dark background.
I have ~100-150 spots per frame, and they move over the course of the movie. I would like to track them, so I'm looking for an efficient (but hopefully not overkill to implement) routine to do that.
A few more details:
the spots are a few (e.g. 5x5) pixels in size
the movements are not big. A spot generally does not move more than 5-10 pixels from its original position. The movements are generally smooth.
the "shape" of these spots is generally fixed; they don't grow or shrink, BUT they become less bright as the movie progresses.
the spots don't move in a particular direction. They can move right and then left and then right again
the user will select a region around each spot and then this region will be tracked, so I do not need to automatically find the points.
As the videos are b/w, I thought I should rely on brightness. For instance, I thought I could move the region around and calculate the correlation of the region's area in the previous frame with that at the various positions in the next frame. I understand that this is a quite naïve solution, but do you think it may work? Does anyone know specific algorithms that do this? It doesn't need to be super fast; as long as it is accurate, I'm happy.
Thank you
nico
Sounds like a job for Blob detection to me.
I would suggest Pearson's product-moment correlation. Having a model (which could be any template image), you can measure the correlation of the template with any section of the frame.
The result is a coefficient that indicates how well the sample correlates with the template. It is especially applicable to 2D cases.
It has the advantage of being independent of the samples' absolute values, since the result depends on the covariance relative to the means of the samples.
Once you detect a high correlation, you can track the successive frames in the neighborhood of the original position and select the best correlation factor.
However, the size and the rotation of the template matter, though as I understand it this is not an issue in your case. You can customize the detection with any shape, since the template image can represent any configuration.
Here is a single-pass algorithm implementation that I've used and that works correctly.
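For a sense of how little code the correlation search needs in OpenCV (C++), here is a hedged sketch: TM_CCOEFF_NORMED in cv::matchTemplate is essentially the Pearson correlation between the template and each candidate window, so per-spot tracking reduces to searching a small neighborhood around the previous position. The names and the 10-pixel search radius are placeholders, and border handling is omitted.

#include <opencv2/opencv.hpp>

// lastPos is the spot's top-left corner in the previous frame;
// search is how far the spot may move between frames.
cv::Point trackSpot(const cv::Mat& frame, const cv::Mat& templ,
                    cv::Point lastPos, int search = 10)
{
    cv::Rect roi(lastPos.x - search, lastPos.y - search,
                 templ.cols + 2 * search, templ.rows + 2 * search);
    roi &= cv::Rect(0, 0, frame.cols, frame.rows);       // clamp to the image

    cv::Mat result;
    cv::matchTemplate(frame(roi), templ, result, cv::TM_CCOEFF_NORMED);

    double maxVal; cv::Point maxLoc;
    cv::minMaxLoc(result, nullptr, &maxVal, nullptr, &maxLoc);
    return roi.tl() + maxLoc;                             // new top-left corner
}

Because the score is normalized, the gradual dimming of the spots should not hurt the match.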
This has got to be a well-researched topic, and I suspect there won't be any 100% accurate solution.
Some links which might be of use:
Learning patterns of activity using real-time tracking. A paper by two guys from MIT.
Kalman Filter. Especially the Computer Vision part.
Motion Tracker. A student project, which also has code and sample videos I believe.
Of course, this might be overkill for you, but hope it helps giving you other leads.
Simple is good. I'd start by doing something like:
1) over a small rectangle that surrounds a spot:
2) apply a weighted average of all the pixel coordinates in the area
3) call the averaged X and Y values the object's position
4) while scanning these pixels, do something to approximate the bounding box size
5) repeat next frame with a slightly enlarged bounding box so you don't clip a spot that moves
The weight for the average should go to zero for pixels below some threshold. Number 4 can be as simple as tracking the min/max position of anything brighter than the same threshold.
This will of course have issues with spots that overlap or cross paths. But for some reason I keep thinking you're tracking stars with some unknown camera motion, in which case this should be fine.
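A rough C++ sketch of steps 2) to 4) above (the threshold value and the ROI handling are placeholders): a brightness-weighted centroid plus a thresholded bounding box inside the current search rectangle.

#include <opencv2/opencv.hpp>
#include <algorithm>

void centroidAndBox(const cv::Mat& gray, const cv::Rect& roi, int threshold,
                    cv::Point2d& center, cv::Rect& box)
{
    double sumW = 0, sumX = 0, sumY = 0;
    int minX = roi.br().x, minY = roi.br().y, maxX = roi.x, maxY = roi.y;
    for (int y = roi.y; y < roi.y + roi.height; ++y) {
        for (int x = roi.x; x < roi.x + roi.width; ++x) {
            int v = gray.at<uchar>(y, x);
            if (v < threshold) continue;                 // weight goes to zero below the threshold
            sumW += v; sumX += double(v) * x; sumY += double(v) * y;
            minX = std::min(minX, x); maxX = std::max(maxX, x);
            minY = std::min(minY, y); maxY = std::max(maxY, y);
        }
    }
    if (sumW > 0) {
        center = cv::Point2d(sumX / sumW, sumY / sumW);  // the object's position
        box = cv::Rect(cv::Point(minX, minY), cv::Point(maxX + 1, maxY + 1));
    }
}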
I'm afraid that blob tracking is not simple, not if you want to do it well.
Start with blob detection as genpfault says.
Now you have spots on every frame and you need to link them up. If the blobs are moving independently, you can use some sort of correspondence algorithm to link them up. See for instance http://server.cs.ucf.edu/~vision/papers/01359751.pdf.
Now you may have collisions. You can use a mixture of Gaussians to try to separate them, give up and let the tracks cross, or use any other before-and-after information to resolve the collisions (e.g. if A and B collide and A is brighter before and will be brighter after, you can keep track of A; if A and B move along predictable trajectories, you can use that also).
Or you can collaborate with a lab that does this sort of stuff all the time.
