Multi-level object recognition using computer vision - algorithm

Are there any possible methodologies or algorithms to detect and identify multiple layers of objects (one above the other) using computer vision and image processing?
For example take a look at this image:
If I were to place a camera on top, would I be able to detect and identify the different layers of fruit?
Also, if not possible with computer vision, which other techniques can I possibly make use of?

Doing this purely with a camera would be very difficult since you don't have any way to know how far away the fruits are. The camera could be looking at a 2D picture of fruit and interpret it the same as a basket.
You could try complementing the camera with a range finding device, ultrasonic or radar to know how far away the fruit is in 3D space.
Another option, if you know what fruit is being observed and can approximate its actual size (the real-world diameter of the fruit), is to use similar triangles to approximate the distance to each fruit and so locate it in 3D space. This question seems to have a detailed answer on finding the distance to a known object.
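The similar-triangles idea can be sketched as follows. This is only a minimal illustration under a pinhole-camera assumption: the function name, the focal length expressed in pixels, and the example numbers are all assumptions, not values from the question.

```python
def distance_from_known_size(real_diameter_m, pixel_diameter, focal_length_px):
    """Estimate the distance to an object of known size (pinhole model).

    Similar triangles give:
        pixel_diameter / focal_length_px == real_diameter_m / distance
    """
    if pixel_diameter <= 0:
        raise ValueError("pixel_diameter must be positive")
    return focal_length_px * real_diameter_m / pixel_diameter


# Hypothetical numbers: an apple assumed to be 0.08 m wide that spans 40 px
# in an image taken with an assumed focal length of 800 px.
apple_distance_m = distance_from_known_size(0.08, 40, 800)
```

The focal length in pixels would normally come from a camera calibration, and the estimate is only as good as the assumed fruit size.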


Making 3D representation of an object with a webcam

Is it possible to make a 3D representation of an object by capturing many different angles using a webcam? If it is, how is it possible and how is the image-processing done?
My plan is to make a 3D representation of a person using a webcam; then, from the 3D representation, I will be able to tell the person's vital statistics.
As Bart said (but did not post as an actual answer) this is entirely possible.
The research topic you are interested in is often called multi view stereo or something similar.
The basic idea revolves around using point correspondences between two (or more) images and then trying to find the best matching camera positions. When the positions are found you can use stereo algorithms to back-project the image points into a 3D coordinate system and form a point cloud.
From that point cloud you can then further process it to get the measurements you are looking for.
If you are completely new to the subject you have some fascinating reading to look forward to!
Bart proposed Multiple view geometry by Hartley and Zisserman, which is a very nice book indeed.
As Bart and Kigurai pointed out, this process has been studied under the title of "stereo" or "multi-view stereo" techniques. To be able to get a 3D model from a set of pictures, you need to do the following:
a) You need to know the "internal" parameters of a camera. These include the focal length of the camera and the principal point of the image; you also need to account for radial distortion in the image.
b) You also need to know the position and orientation of each camera with respect to each other or a "world" co-ordinate system. This is called the "pose" of the camera.
There are algorithms to perform (a) and (b) which are described in Hartley and Zisserman's "Multiple View Geometry" book. Alternatively, you can use Noah Snavely's "Bundler" software to also do the same thing in a very robust manner.
Once you have the camera parameters, you essentially know how a 3D point (X,Y,Z) in the world maps to an image co-ordinate (u,v) on the photo. You also know how to map an image co-ordinate to the world. You can create a dense point cloud by searching for a match for each pixel on one photo in a photo taken from a different view-point. This requires a two-dimensional search. You can simplify this procedure by making the search 1-dimensional. This is called "rectification". You essentially take two photos and transform them so that their rows correspond to the same line in the world (simplified statement). Now you only have to search along image rows.
An algorithm for this can be also found in Hartley and Zisserman.
Finally, you need to do the matching based on some measure. There is a lot of literature out there on "stereo matching". Another term used is "disparity estimation". This is basically searching for the match of pixel (u,v) on one photo to its match (u',v) in the same row of the other, rectified photo (taking u as the column and v as the row). Once you have the match, the difference between them (the disparity) can be used to map back to a 3D point.
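As a rough sketch of that last step: for a rectified pair, depth follows from the disparity as Z = f * B / d. The function names and the example values below (focal length in pixels, baseline in metres, principal point) are assumptions chosen for illustration, and lens distortion is ignored.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth of a point seen in a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a point in front of the cameras")
    return focal_length_px * baseline_m / disparity_px


def back_project(u, v, depth_m, focal_length_px, cx, cy):
    """Map pixel (u, v) with known depth to a 3D point in the camera frame.

    (cx, cy) is the principal point in pixels.
    """
    x = (u - cx) * depth_m / focal_length_px
    y = (v - cy) * depth_m / focal_length_px
    return (x, y, depth_m)


# Hypothetical setup: f = 700 px, cameras 0.1 m apart, disparity of 20 px.
z = depth_from_disparity(20, 700, 0.1)
point = back_project(390, 240, z, 700, 320, 240)
```

Repeating this for every matched pixel gives the dense point cloud described above.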
You can use Yasutaka Furukawa's "CMVS" or "PMVS2" software to do this. Or, if you want to experiment by yourself, OpenCV is an open-source computer vision toolbox that handles many of the sub-tasks required for this.
This can be done with two webcams in the same way your eyes work. It is called stereoscopic vision.
Have a look at this:
An affordable alternative to get 3D data would be the Kinect camera system.
Maybe not the answer you are hoping for, but Microsoft's Kinect does exactly that. There are some open source drivers out there that allow you to connect it to your Windows/Linux box.

Looking for ways for a robot to locate itself in the house

I am hacking a vacuum cleaner robot to control it with a microcontroller (Arduino). I want to make it more efficient when cleaning a room. For now, it just goes straight and turns when it hits something.
But I have trouble finding the best algorithm or method to use to know its position in the room. I am looking for an idea that stays cheap (less than $100) and is not too complex (one that doesn't require a PhD thesis in computer vision). I can add some discrete markers in the room if necessary.
Right now, my robot has:
One webcam
Three proximity sensors (around 1 meter range)
Compass (not used for now)
Its speed can vary if the battery is full or nearly empty
A netbook Eee PC is embedded on the robot
Do you have any ideas for doing this? Does any standard method exist for this kind of problem?
The problem of figuring out a robot's position in its environment is called localization. Computer science researchers have been trying to solve this problem for many years, with limited success. One problem is that you need reasonably good sensory input to figure out where you are, and sensory input from webcams (i.e. computer vision) is far from a solved problem.
If that didn't scare you off: one of the approaches to localization that I find easiest to understand is particle filtering. The idea goes something like this:
You keep track of a bunch of particles, each of which represents one possible location in the environment.
Each particle also has an associated probability that tells you how confident you are that the particle really represents your true location in the environment.
When you start off, all of these particles might be distributed uniformly throughout your environment and be given equal probabilities. Here the robot is gray and the particles are green.
When your robot moves, you move each particle. You might also degrade each particle's probability to represent the uncertainty in how the motors actually move the robot.
When your robot observes something (e.g. a landmark seen with the webcam, a wifi signal, etc.) you can increase the probability of particles that agree with that observation.
You might also want to periodically replace the lowest-probability particles with new particles based on observations.
To decide where the robot actually is, you can either use the particle with the highest probability, the highest-probability cluster, the weighted average of all particles, etc.
If you search around a bit, you'll find plenty of examples: e.g. a video of a robot using particle filtering to determine its location in a small room.
Particle filtering is nice because it's pretty easy to understand. That makes implementing and tweaking it a little less difficult. There are other similar techniques (like Kalman filters) that are arguably more theoretically sound but can be harder to get your head around.
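To make the steps concrete, here is a minimal one-dimensional sketch. It is not the implementation from any particular robot: the corridor, the single range sensor pointing at a wall, the noise levels, and all names are assumptions. It also uses weighted resampling, a common variant of the "replace low-probability particles" step.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Probability density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def particle_filter_step(particles, move, measured_range, wall_pos,
                         motion_noise=0.05, sensor_noise=0.1, rng=random):
    # 1. Motion update: move every particle, with noise for motor uncertainty.
    moved = [p + move + rng.gauss(0.0, motion_noise) for p in particles]
    # 2. Measurement update: weight particles by agreement with the reading.
    weights = [gaussian(measured_range, wall_pos - p, sensor_noise) for p in moved]
    if sum(weights) == 0.0:
        weights = [1.0] * len(moved)  # nothing agrees: fall back to uniform
    # 3. Resample: likely particles are duplicated, unlikely ones die out.
    return rng.choices(moved, weights=weights, k=len(moved))


def estimate(particles):
    """Position estimate: the mean of the (equally weighted) particles."""
    return sum(particles) / len(particles)


def run_demo(seed=0):
    rng = random.Random(seed)
    wall_pos, true_pos = 5.0, 1.0
    # Start with no idea where the robot is: spread particles uniformly.
    particles = [rng.uniform(0.0, wall_pos) for _ in range(500)]
    for _ in range(10):
        true_pos += 0.2  # the robot drives 0.2 m towards the wall
        reading = (wall_pos - true_pos) + rng.gauss(0.0, 0.1)
        particles = particle_filter_step(particles, 0.2, reading, wall_pos, rng=rng)
    return true_pos, estimate(particles)
```

Calling `run_demo()` returns the simulated true position and the filter's estimate. A real robot would use 2D poses (x, y, heading) and its actual sensor model in place of the single range reading.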
A QR Code poster in each room would not only make an interesting Modern art piece, but would be relatively easy to spot with the camera!
If you can place some markers in the room, using the camera could be an option. If 2 known markers have an angular displacement (left to right) then the camera and the markers lie on a circle whose radius is related to the measured angle between the markers. I don't recall the formula right off, but the arc segment (on that circle) between the markers will be twice the angle you see. If you have the markers at known height and the camera is at a fixed angle of inclination, you can compute the distance to the markers. Either of these methods alone can nail down your position given enough markers. Using both will help do it with fewer markers.
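For reference, both relations can be written down in a few lines. This is a sketch under stated assumptions: the function names are hypothetical, angles are in radians, and the elevation angle is taken to be the angle above the horizontal at which the camera sees the marker.

```python
import math


def circle_radius(marker_separation_m, seen_angle_rad):
    """Radius of the circle through both markers and the camera.

    Inscribed angle theorem: the markers subtend twice the seen angle at the
    circle's centre, so separation = 2 * R * sin(seen_angle).
    """
    return marker_separation_m / (2.0 * math.sin(seen_angle_rad))


def ground_distance(marker_height_m, camera_height_m, elevation_rad):
    """Horizontal distance to a marker mounted at a known height."""
    return (marker_height_m - camera_height_m) / math.tan(elevation_rad)


# Hypothetical example: markers 2 m apart seen 90 degrees apart lie on a
# circle of radius 1 m (they form a diameter of that circle).
r = circle_radius(2.0, math.pi / 2)
```

One angle measurement only constrains the camera to a circle; a second pair of markers or a distance measurement is needed to pick a point on it.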
Unfortunately, those methods are imperfect due to measurement errors. You get around this by using a Kalman estimator to incorporate multiple noisy measurements to arrive at a good position estimate - you can then feed in some dead reckoning information (which is also imperfect) to refine it further. This part goes pretty deep into math, but I'd say it's a requirement to do a great job at what you're attempting. You can do OK without it, but if you want an optimal solution (in terms of best position estimate for given input) there is no better way. If you actually want a career in autonomous robotics, this will play large in your future.
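To give a feel for the idea without the full matrix math, here is a scalar (one-dimensional) Kalman filter sketch. The names and the variances in the example are assumptions; a real robot would track at least x, y and heading together.

```python
def kalman_predict(x, p, motion, motion_var):
    """Dead-reckoning step: shift the estimate and grow its variance."""
    return x + motion, p + motion_var


def kalman_update(x, p, z, meas_var):
    """Blend the estimate x (variance p) with measurement z (variance meas_var)."""
    k = p / (p + meas_var)  # Kalman gain: how much to trust the measurement
    return x + k * (z - x), (1.0 - k) * p


# Hypothetical example: a vague prior at 0 m, then a marker-based fix at 2 m
# that is exactly as uncertain as the prior.
x, p = kalman_update(0.0, 1.0, 2.0, 1.0)
x, p = kalman_predict(x, p, 0.2, 0.1)
```

With equal variances the update lands halfway between prior and measurement and halves the variance; the predict step then adds the motion and some uncertainty back.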
Once you can determine your position you can cover the room in any pattern you'd like. Keep using the bump sensor to help construct a map of obstacles and then you'll need to devise a way to scan incorporating the obstacles.
Not sure if you've got the math background yet, but here is the book:
This doesn't replace the accepted answer (which is great, thanks!) but I might recommend getting a Kinect and using that instead of your webcam, either through Microsoft's recently released official drivers or using the hacked drivers if your EeePC doesn't have Windows 7 (presumably it does not).
That way the positioning will be improved by the 3D vision. Observing landmarks will now tell you how far away the landmark is, and not just where in the visual field that landmark is located.
Regardless, the accepted answer doesn't really address how to pick out landmarks in the visual field, and simply assumes that you can. While the Kinect drivers may already have feature detection included (I'm not sure) you can also use OpenCV for detecting features in the image.
One solution would be to use a strategy similar to "flood fill" (wikipedia). To get the controller to accurately perform sweeps, it needs a sense of distance. You can calibrate your bot using the proximity sensors: e.g. run motor for 1 sec = xx change in proximity. With that info, you can move your bot for an exact distance, and continue sweeping the room using flood fill.
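A flood fill over a grid map of the room might look like the sketch below. The grid encoding ('.' for free floor, '#' for an obstacle) and the function name are assumptions; the result is the set of cells the robot can reach, in breadth-first order, not an optimised driving path.

```python
from collections import deque


def coverage_order(grid, start):
    """Breadth-first flood fill over the free cells reachable from start."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        r, c = queue.popleft()
        order.append((r, c))
        # Visit the four neighbours (right, down, left, up).
        for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] == "." and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append((nr, nc))
    return order


# Hypothetical 3x3 room with two obstacles.
room = ["..#",
        ".#.",
        "..."]
cells = coverage_order(room, (0, 0))
```

Each grid cell would correspond to one robot width, which is where the distance calibration described above comes in.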
Assuming you are not looking for a generalised solution, you may actually know the room's shape, size, potential obstacle locations, etc. When the bot exits the factory there is no info about its future operating environment, which kind of forces it to be inefficient from the outset.
If that's your case, you can hardcode that info, and then use basic measurements (i.e. rotary encoders on wheels + compass) to figure out its location in the room/house. No need for wifi triangulation or crazy sensor setups in my opinion. At least for a start.
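A minimal dead-reckoning sketch with encoders and a compass could look like this. The function name, the ticks-per-metre calibration constant, and the heading convention (0 degrees along the +x axis, counter-clockwise positive) are assumptions; a real compass heading would need converting to that convention.

```python
import math


def dead_reckon(x, y, encoder_ticks, ticks_per_metre, heading_deg):
    """Advance the position estimate by one straight-line move."""
    distance = encoder_ticks / ticks_per_metre
    heading = math.radians(heading_deg)
    return (x + distance * math.cos(heading),
            y + distance * math.sin(heading))


# Hypothetical run: 1 m along x, then 0.5 m along y.
pos = dead_reckon(0.0, 0.0, 1000, 1000, 0.0)
pos = dead_reckon(pos[0], pos[1], 500, 1000, 90.0)
```

Errors accumulate with every move (wheel slip, compass disturbance), so in practice the estimate needs occasional correction from walls or markers.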
Ever considered GPS? Every position on earth has unique GPS coordinates, with a resolution of 1 to 3 metres, and with differential GPS you can get down to the sub-10 cm range - more info here:
And Arduino does have lots of options of GPS-modules:
After you have collected the coordinates of all the key points of the house, you can then write the routine for the Arduino to move the robot from point to point (as collected above) - assuming it will do all the obstacle avoidance stuff.
More information can be found here:
And inside the list I found this - specifically for your case: Arduino + GPS + localization:
I was thinking about this problem too. But I don't understand why you can't just triangulate? Have two or three beacons (e.g. IR LEDs of different frequencies) and a rotating IR sensor 'eye' on a servo. You could then get an almost constant fix on your position. I expect the accuracy would be in the low cm range and it would be cheap. You can then map anything you bump into easily.
Maybe you could also use any interruption in the beacon beams to plot objects that are quite far from the robot too.
You said you have a camera? Did you consider looking at the ceiling? There is little chance that two rooms have identical dimensions, so you can identify which room you are in; the position in the room can be computed from the angular distance to the borders of the ceiling, and the direction can probably be extracted from the position of the doors.
This will require some image processing, but a vacuum cleaner that moves slowly in order to clean efficiently will have enough time to compute.
Good luck!
Use Ultra Sonic Sensor HC-SR04 or similar.
As described above, sense the distance from the robot to the walls with the sensors, and identify which part of the room you are in with a QR code.
When you are near a wall, turn 90 degrees, move by the width of your robot, turn 90 degrees again in the same direction (e.g. a second 90 degree left turn), and move your robot forward again. I think it will help :)

Differentiate objects?

I want to identify a ball in a picture. I am thinking of using the Sobel edge detection algorithm; with this I can detect the round objects in the image.
But how do I differentiate between different objects? For example, a football is in one picture and in another picture I have the moon. How do I tell which object has been detected?
When I use my algorithm I get a ball in both cases. Any ideas?
Well, if all the objects you would like to differentiate are round, you could even use a Hough transform for round objects. This is a very good way of detecting round objects.
But your basic problem seems to be classification - sorting the objects on your image into different classes.
For this you don't really need a neural network; you could simply try a nearest-neighbor match. Its functionality is a bit like that of neural networks, since you can give it several reference pictures where you tell the system what can be seen there, and it will optimize itself to the best average values for each attribute you detected. By this you get a dictionary of clusters for the different types of objects.
But for this you'll of course first need something that distinguishes a ball from a moon.
Since they are all real round objects (which appear as circles) it will be useless to compare circularity, circumference, diameter or area (unless your camera is steady and you know the moon will always appear at a different size from a ball in your images).
So basically you need to look inside the objects themselves, and you can try to compare their mean color value or grayscale value or the contrast inside the object (the moon will mostly have mid-gray values whereas a soccer ball consists of black and white parts).
You could also run edge filters on the segmented objects just to determine which is more "edgy" in its texture. But for this there are better methods I guess...
So basically what you need to do first:
Find several attributes that help you distinguish the different round objects (assuming they are already separated)
Implement something to get these values out of a picture of a round object (which is already segmented of course, so it has a background of 0)
Build a supervised learning system and feed it several images of each type together with their class (there are many implementations of that online)
Now you have your system running and can give other objects to it to classify.
For this you need to segment the objects in the image, e.g. with edge filters or a Hough transform
For each of the segmented objects in an image, let it run through your classification system and it should tell you which class (type of object) it belongs to...
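The steps above can be sketched as a tiny nearest-neighbor classifier. This is only an illustration: it assumes each object is already segmented into a flat list of grayscale pixel values (0 to 255), uses mean gray level and contrast as the two attributes, and all names and sample values are made up.

```python
import math


def features(pixels):
    """Mean gray level and contrast (standard deviation) of an object's pixels."""
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return (mean, math.sqrt(variance))


def classify(sample_features, training):
    """Return the label of the closest training example (1-nearest neighbor).

    training is a list of (feature_vector, label) pairs.
    """
    nearest = min(training, key=lambda item: math.dist(item[0], sample_features))
    return nearest[1]


# Hypothetical training data: a ball is black and white, the moon is mid-gray.
training = [
    (features([0, 255, 0, 255]), "ball"),
    (features([120, 130, 125, 128]), "moon"),
]
label = classify(features([10, 240, 20, 250]), training)
```

With more reference images per class, and attributes scaled to comparable ranges, the same structure extends to a k-nearest-neighbor vote.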
Hope that helps... if not, please keep asking...
When you apply an edge detection algorithm you lose information.
Thus the moon and the ball are the same.
The moon has a different color, a different texture, ... you can use this information to differentiate which object has been detected.
That's a question in AI.
If you think about it, the reason you know it's a ball and not a moon, is because you've seen a lot of balls and moons in your life.
So, you need to teach the program what a ball is, and what a moon is. Give it some kind of dictionary or something.
The problem with a dictionary of course would be that to match the object with all the objects in the dictionary would take time.
So the best solution would probably be to use neural networks. I don't know what programming language you're using, but there are neural network implementations for most languages I've encountered.
You'll have to read a bit about it, decide what kind of neural network, and its architecture.
After you have it implemented it gets easy. You just give it a lot of pictures to learn (neural networks get a vector as input, so you can give it the whole picture).
For each picture you give it, you tell it what it is. So you give it like 20 different moon pictures, 20 different ball pictures. After that you tell it to learn (built in function usually).
The neural network will go over the data you gave it, and learn how to differentiate the 2 objects.
Later you can use the network you taught: give it a picture, and it gives you a score for what it thinks it is, like 30% ball, 85% moon.
This has been discussed before. Have a look at this question. More info here and here.

image feature identification

I am looking for a solution to do the following:
( the focus of my question is step 2. )
a picture of a house including the front yard
extract information from the picture like the dimensions and location of the house, trees, sidewalk, and car. Also, the textures and colors of the house, cars, trees, and sidewalk.
use extracted information to generate a model
How can I extract that information?
You could also consult Tatiana Jaworska's research on this. As I understand it, it details at least one new algorithm for feature extraction (targeted at roofs, doors, ...) by colour (RGB). More intriguing, the last publication also uses parameterized objects to be identified in the house images... that might be a really good starting point for what you're trying to do.
link to her publications:
Yes. You can extract this information from a picture.
1. You just identify these objects in a picture using some detection algorithms.
2. Measure these objects' dimensions and generate a model using the extracted information.
Well, actually your desired goal is not so easy to achieve. First of all you'll need a good way to figure out what is what and what is where in your image. And there simply is no easy "algorithm" for detecting houses/cars/whatsoever in an image. There are ways to segment different objects (like cars) from an image, but those don't work generally. Especially for houses this would be hard, since each house looks different and it's hard to find one solid measurement for "this is a house and this is not"...
Am I right in assuming that you are trying to simply photograph a house (with front yard) and build a textured 3D model out of it? This is not going to work, since you need several photos of the house to get the positions of walls/corners and everything in 3D space (there are approaches that try a mesh reconstruction from one image only, but they lack depth information and the results are fairly poor). So if you would like to create 3D models you will need several photos of the house from different angles.
There are several different approaches that use this kind of technique to reconstruct real world objects to triangle-meshes.
Basically they work on this principle:
Try to find points in images from different viewpoints which are the same point on an object. Considering you are photographing a house, these could be salient structures like corners of windows/doors or corners or edges on the walls/roof/...
Knowing where one and the same point of your house is in several different photos, and knowing the position of the camera for each photo, you can reconstruct this point in 3D space.
Doing this for a lot of corresponding points will "empower" you to reconstruct the shape of your house as a 3D model by triangulating the points.
Taking parts of the image as textures and mapping them on the generated model would work as well since you know where what is.
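The reconstruction of a single point from two photos can be sketched with the midpoint method: each photo gives a viewing ray from its camera centre through the matched image point, and the 3D point is taken halfway between the two rays where they pass closest. The function name and the example camera positions are assumptions; the papers' methods are more elaborate than this.

```python
def triangulate_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + s*d1 and o2 + t*d2."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = [a - b for a, b in zip(o1, o2)]
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; the point cannot be triangulated")
    s = (b * e - c * d) / denom  # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom  # parameter of the closest point on ray 2
    p1 = [o + s * k for o, k in zip(o1, d1)]
    p2 = [o + t * k for o, k in zip(o2, d2)]
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))


# Hypothetical cameras 2 m apart, both looking at a window corner 5 m away.
corner = triangulate_midpoint((-1.0, 0.0, 0.0), (1.0, 0.0, 5.0),
                              (1.0, 0.0, 0.0), (-1.0, 0.0, 5.0))
```

With noisy matches the two rays will not intersect exactly, which is why the midpoint (or a least-squares variant) is used.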
You should have a look at these papers:
The second paper even has an example of doing exactly what you try to achieve, namely reconstruct a textured 3D-model of a house photographed from different angles.
The third link is a powerpoint presentation that shows how the reconstruction works and shows the drawbacks there are.
So you should get familiar with these papers to see what problems you are up against... If you then want to try this on your own, have a look at OpenCV. This library provides some methods for feature extraction in images. You can then try to find salient points in each image and try to match them.
Good luck on your project... If you have problems, please keep asking!
I suggest looking at this blog
that shows how to identify certain features in images using a convolutional neural network. This particular blog discusses how to identify human faces in a large set of random images. You can adjust this example to train a neural network using some other images. Note that even in the case of human faces, the identification rate is about 85%; therefore, more complex objects can be even harder to identify.

Image recognition and 3d rendering

How hard would it be to take an image of an object (in this case of a predefined object), and develop an algorithm to cut just that object out of a photo with a background of varying complexity?
Further to this, a photo's object (say a house, car, dog - but always of one type) would need to be transformed into a 3d render. I know there are 3d rendering engines available (at a cost, free, or with some clause), but for this to work the object (subject) would need to be measured in all sorts of ways - e.g. if this is a person, we need to measure height, the curvature of the shoulder, radius of the face, length of each finger, etc.
What would the feasibility of solving this problem be? Does anyone know any good links specializing in this research area? I've seen open source solutions to this problem, which leaves me with the question of the ease of measuring the object while tracing around it to crop it out.
Essentially I want to take a 2d image (typical image: which is easier than a complex photo containing multiple objects, etc.)
But effectively I want to turn that into a 3d image, so wouldn't what I want to do involve building a 3d rendering/modelling engine?
Furthermore, that link I have provided goes into 3ds max, with a few properties set, and a render is made.
It sounds like you want to do several things, all in the domain of computer vision.
Object Recognition (i.e. find the predefined object)
3D Reconstruction (make the 3d model from the image)
Image Segmentation (cut out just the object you are worried about from the background)
I've ranked them in order of easiest to hardest (according to my limited understanding). All together I would say it is a very complicated problem. I would look at the following Wikipedia links for more information:
Computer Vision Overview (Wikipedia)
The Eight Point Algorithm (for 3d reconstruction)
Image Segmentation
You're right, this is an extremely hard set of problems, particularly inferring 3D information from a 2D image. Only a very limited understanding exists of how our visual system extrapolates 3D information from 2D images. One approach to the problem is known as "Shape from Shading", and the linked Google search shows how much (and consequently how little) we know.
This is a very difficult task. The hardest part is not recognising or segmenting the object from the image, but rather inferring the 3-D geometry of the object from the 2-D image. You will have more success if you can use a stereoscopic camera (or a laser scanner, if you have access to one ;).
For the case of 2-D images, try googling for "shape-from-shading". This is a method for inferring 3-D shape from a 2-D image. It does make assumptions about illumination conditions and surface properties (BRDF and geometry) that may fail in many cases, but if you are using it for only a predefined class of objects (e.g. human faces) it can work reasonably well.
Assuming it's possible, that would be extremely difficult, especially with only one image of the object. The rasterizer has to guess at the depth and distances of objects.
What you describe sounds very similar to Microsoft PhotoSynth.
