Detection of coins (and fit ellipses) on an image

I am currently working on a project where I am trying to detect a few coins lying on a flat surface (i.e. a desk). The coins do not overlap and are not hidden by other objects. But there might be other objects visible and the lighting conditions may not be perfect... Basically, consider yourself filming your desk which has some coins on it.
So each coin should be visible as an ellipse. Since I don't know the position of the camera, the shape of the ellipses may vary from a circle (view from the top) to flat ellipses, depending on the angle the coins are filmed from.
My problem is that I am not sure how to extract the coins and finally fit ellipses over them (which I am looking for to do further calculations).
For now, I have just made the first attempt by setting a threshold value in OpenCV, using findContours() to get the contour lines and fitting an ellipse. Unfortunately, the contour lines only rarely give me the shape of the coins (reflections, bad lighting, ...) and this way is also not preferred since I don't want the user to set any threshold.
Another idea was to use a template matching method of an ellipse on that image, but since I don't know the angle of the camera nor the size of the ellipses I don't think this would work well...
So I wanted to ask if anybody could tell me a method that would work in my case.
Is there a fast way to extract the three coins from the image? The calculations should be made in realtime on mobile devices, and the method should not be too sensitive to different or changing lighting or to the color of the background.
Would be great if anybody could give me any tips on which method could work for me.

Here's some C99 source implementing the traditional approach (based on the OpenCV documentation):
#include "cv.h"
#include "highgui.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

// We need this to be high enough to get rid of things that are too small to
// have a definite shape. Otherwise, they will end up as ellipse false positives.
#define MIN_AREA 100.00
// One way to tell if an object is an ellipse is to look at the relationship
// of its area to its dimensions. If its actual occupied area can be estimated
// using the well-known area formula Area = PI*A*B, then it has a good chance of
// being an ellipse.
// This value is the maximum permissible error between actual and estimated area.
#define MAX_TOL 100.00

int main( int argc, char** argv )
{
    IplImage* src;
    // the first command line parameter must be file name of binary (black-n-white) image
    if( argc == 2 && (src=cvLoadImage(argv[1], 0))!= 0)
    {
        IplImage* dst = cvCreateImage( cvGetSize(src), 8, 3 );
        CvMemStorage* storage = cvCreateMemStorage(0);
        CvSeq* contour = 0;
        cvThreshold( src, src, 1, 255, CV_THRESH_BINARY );
        // Invert the image such that white is foreground, black is background.
        // Dilate to get rid of noise.
        cvXorS(src, cvScalar(255, 0, 0, 0), src, NULL);
        cvDilate(src, src, NULL, 2);
        cvFindContours( src, storage, &contour, sizeof(CvContour), CV_RETR_CCOMP, CV_CHAIN_APPROX_SIMPLE, cvPoint(0,0));
        cvZero( dst );
        for( ; contour != 0; contour = contour->h_next )
        {
            double actual_area = fabs(cvContourArea(contour, CV_WHOLE_SEQ, 0));
            if (actual_area < MIN_AREA)
                continue;
            // Assuming the axes of the ellipse are horizontal/vertical.
            CvRect rect = ((CvContour *)contour)->rect;
            int A = rect.width / 2;
            int B = rect.height / 2;
            double estimated_area = M_PI * A * B;
            double error = fabs(actual_area - estimated_area);
            if (error > MAX_TOL)
                continue;
            printf(
                "center x: %d y: %d A: %d B: %d\n",
                rect.x + A,
                rect.y + B,
                A,
                B);
            CvScalar color = CV_RGB( rand() % 255, rand() % 255, rand() % 255 );
            cvDrawContours( dst, contour, color, color, -1, CV_FILLED, 8, cvPoint(0,0));
        }
        cvSaveImage("coins.png", dst, 0);
    }
    return 0;
}
Given the binary image that Carnieri provided, this is the output:
./opencv-contour.out coin-ohtsu.pbm
center x: 291 y: 328 A: 54 B: 42
center x: 286 y: 225 A: 46 B: 32
center x: 471 y: 221 A: 48 B: 33
center x: 140 y: 210 A: 42 B: 28
center x: 419 y: 116 A: 32 B: 19
And this is the output image:
What you could improve on:
Handle different ellipse orientations (currently, I assume the axes are vertical/horizontal). This would not be hard to do using image moments.
Check for object convexity (have a look at cvConvexityDefects)
Your best way of distinguishing coins from other objects is probably going to be by shape. I can't think of any other low-level image features (color is obviously out). So, I can think of two approaches:
Traditional object detection
Your first task is to separate the objects (coins and non-coins) from the background. Otsu's method, as suggested by Carnieri, will work well here. You seem to worry about the image histograms having to be bimodal, but I don't think this will be a problem. As long as there is a significant amount of desk visible, you're guaranteed to have one peak in your histogram. And as long as there are a couple of visually distinguishable objects on the desk, you are guaranteed your second peak.
Dilate your binary image a couple of times to get rid of noise left by thresholding. The coins are relatively big so they should survive this morphological operation.
Group the white pixels into objects using region growing -- just iteratively connect adjacent foreground pixels. At the end of this operation you will have a list of disjoint objects, and you will know which pixels each object occupies.
From this information, you will know the width and the height of the object (from the previous step). So, now you can estimate the size of the ellipse that would surround the object, and then see how well this particular object matches the ellipse. It may be easier just to use width vs height ratio.
Alternatively, you can then use moments to determine the shape of the object in a more precise way.
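To make the moments idea a bit more concrete, here is a rough pure-Python sketch (not the code from this answer). It estimates the centre, semi-axes and orientation of a blob from its second-order central moments, then applies the same "does the area match PI*A*B" test, but without assuming the axes are horizontal/vertical. The function names, the `rasterise_ellipse` test helper and the 3% tolerance are my own assumptions for illustration.

```python
import math

def ellipse_from_pixels(pixels):
    """Estimate (cx, cy, a, b, theta) of a filled blob from its moments.

    pixels: iterable of (x, y) foreground coordinates.
    a >= b are the semi-axes, theta is the major-axis angle in radians.
    """
    pts = list(pixels)
    n = len(pts)
    cx = sum(x for x, _ in pts) / n
    cy = sum(y for _, y in pts) / n
    # Normalised second-order central moments (covariance of the pixel cloud).
    mu20 = sum((x - cx) ** 2 for x, _ in pts) / n
    mu02 = sum((y - cy) ** 2 for _, y in pts) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pts) / n
    # Orientation of the major axis.
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    # Eigenvalues of the covariance matrix.
    common = math.sqrt(4.0 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_major = (mu20 + mu02 + common) / 2.0
    lam_minor = (mu20 + mu02 - common) / 2.0
    # For a uniformly filled ellipse, the variance along a semi-axis r is r^2 / 4.
    a = 2.0 * math.sqrt(lam_major)
    b = 2.0 * math.sqrt(max(lam_minor, 0.0))
    return cx, cy, a, b, theta

def looks_like_ellipse(pixels, rel_tol=0.03):
    """Compare the pixel count with PI*a*b (rel_tol is an assumed tolerance)."""
    pts = list(pixels)
    _, _, a, b, _ = ellipse_from_pixels(pts)
    estimated_area = math.pi * a * b
    return abs(len(pts) - estimated_area) <= rel_tol * estimated_area

def rasterise_ellipse(cx, cy, a, b, theta):
    """Synthetic test data: integer pixels inside a rotated ellipse."""
    c, s = math.cos(theta), math.sin(theta)
    r = int(max(a, b)) + 2
    pts = []
    for y in range(int(cy) - r, int(cy) + r + 1):
        for x in range(int(cx) - r, int(cx) + r + 1):
            u = (x - cx) * c + (y - cy) * s
            v = -(x - cx) * s + (y - cy) * c
            if (u / a) ** 2 + (v / b) ** 2 <= 1.0:
                pts.append((x, y))
    return pts
```

Note that a filled square is only about 4.5% off by this measure (the ratio is 3/PI), so the tolerance has to be fairly tight, or the test combined with a convexity or contour check.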

I don't know what the best method for your problem is. About thresholding specifically, however, you can use Otsu's method, which automatically finds the optimal threshold value based on an analysis of the image histogram. Use OpenCV's threshold method with the parameter ThresholdType equal to THRESH_OTSU.
Be aware, though, that Otsu's method works well only on images with bimodal histograms (for instance, images with bright objects on a dark background).
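For reference, this is roughly what Otsu's method computes. The sketch below is a plain-Python illustration, not OpenCV's implementation: it takes a 256-bin grayscale histogram and returns the threshold that maximises the between-class variance. The function name and the "value <= t goes to the first class" convention are my own choices.

```python
def otsu_threshold(histogram):
    """Return the threshold t that maximises the between-class variance.

    histogram[i] is the number of pixels with gray value i.
    Pixels with value <= t form one class, the rest form the other.
    """
    total = sum(histogram)
    sum_all = sum(i * h for i, h in enumerate(histogram))
    best_t, best_var = 0, -1.0
    w0 = 0        # number of pixels in the first class so far
    sum0 = 0.0    # sum of gray values in the first class so far
    for t, h in enumerate(histogram):
        w0 += h
        sum0 += t * h
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var = var_between
            best_t = t
    return best_t

# A toy bimodal histogram: dark pixels around 50, bright pixels around 200.
hist = [0] * 256
for value, count in ((40, 10), (50, 30), (60, 10), (190, 10), (200, 30), (210, 10)):
    hist[value] = count
t = otsu_threshold(hist)  # lands between the two peaks
```

With a unimodal histogram the maximum is much less pronounced, which is why the result degrades in that case.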
You've probably seen this, but there is also a method for fitting an ellipse around a set of 2D points (for instance, a connected component).
EDIT: Otsu's method applied to a sample image:
Grayscale image:
Result of applying Otsu's method:

If anyone else comes along with this problem in the future as I did, but using C++:
Once you have used findContours to find the contours (as in Misha's answer above), you can easily fit ellipses using fitEllipse, e.g.:
vector<vector<Point> > contours;
findContours(img, contours, CV_RETR_TREE, CV_CHAIN_APPROX_SIMPLE, Point(0,0));
vector<RotatedRect> rotRecs;
for (size_t i = 0; i < contours.size(); i++) {
    // fitEllipse needs at least 5 points per contour
    if (contours[i].size() >= 5)
        rotRecs.push_back(fitEllipse(contours[i]));
}


How to convert a screen coordinate into a translation for a projection matrix?

(More info at end)----->
I am trying to render a small picture-in-picture display over my scene. The PiP is just a smaller texture, but it is intended to reveal secret objects in the scene when it is placed over them.
To do this, I want to render my scene, then render the SAME scene on the smaller texture, but with the exact same positioning as the main scene. The intended result would be something like this:
My problem is... I cannot get the scene on the smaller texture to match up 1:1. I keep trying various kludges, but ultimately I suspect that I need to do something to the projection matrix to pan it over to the location of the frame. I can get it to zoom correctly...just can't get it to pan.
Can anyone suggest what I need to do to my projection matrix to render my scene 1:1 (but panned by x,y) onto a smaller texture?
The data I have:
Resolution of the full-screen framebuffer
Resolution of the smaller texture
XY coordinate where I want to draw the smaller texture as an overlay sprite
The world/view/projection matrices from the original full-screen scene
The viewport from the original full-screen scene
Here is the function I use to produce the 3D camera:
void Make3DCamera(Vector theCameraPos, Vector theLookAt, Vector theUpVector, float theFOV, Point theRez, Matrix& theViewMatrix,Matrix& theProjectionMatrix)
Matrix aCombinedViewMatrix;
Matrix aViewMatrix;
Vector aLookAtVector=theLookAt-theCameraPos;
Vector aSideVector=theUpVector.Cross(aLookAtVector);
aViewMatrix.mData.m[0][0] = -aSideVector.mX;
aViewMatrix.mData.m[1][0] = -aSideVector.mY;
aViewMatrix.mData.m[2][0] = -aSideVector.mZ;
aViewMatrix.mData.m[3][0] = 0;
aViewMatrix.mData.m[0][1] = -theUpVector.mX;
aViewMatrix.mData.m[1][1] = -theUpVector.mY;
aViewMatrix.mData.m[2][1] = -theUpVector.mZ;
aViewMatrix.mData.m[3][1] = 0;
aViewMatrix.mData.m[0][2] = aLookAtVector.mX;
aViewMatrix.mData.m[1][2] = aLookAtVector.mY;
aViewMatrix.mData.m[2][2] = aLookAtVector.mZ;
aViewMatrix.mData.m[3][2] = 0;
aViewMatrix.mData.m[0][3] = 0;
aViewMatrix.mData.m[1][3] = 0;
aViewMatrix.mData.m[2][3] = 0;
aViewMatrix.mData.m[3][3] = 1;
if (gG.mRenderToSprite) aViewMatrix.Scale(1,-1,1);
// Projection Matrix
float aAspect = (float) theRez.mX / (float) theRez.mY;
float aNear = gG.mZRange.mData1;
float aFar = gG.mZRange.mData2;
float aWidth = gMath.Cos(theFOV / 2.0f);
float aHeight = gMath.Cos(theFOV / 2.0f);
if (aAspect > 1.0) aWidth /= aAspect;
else aHeight *= aAspect;
float s = gMath.Sin(theFOV / 2.0f);
float d = 1.0f - aNear / aFar;
Matrix aPerspectiveMatrix;
aPerspectiveMatrix.mData.m[0][0] = aWidth;
aPerspectiveMatrix.mData.m[1][0] = 0;
aPerspectiveMatrix.mData.m[2][0] = gG.m3DOffset.mX/theRez.mX/2;
aPerspectiveMatrix.mData.m[3][0] = 0;
aPerspectiveMatrix.mData.m[0][1] = 0;
aPerspectiveMatrix.mData.m[1][1] = aHeight;
aPerspectiveMatrix.mData.m[2][1] = gG.m3DOffset.mY/theRez.mY/2;
aPerspectiveMatrix.mData.m[3][1] = 0;
aPerspectiveMatrix.mData.m[0][2] = 0;
aPerspectiveMatrix.mData.m[1][2] = 0;
aPerspectiveMatrix.mData.m[2][2] = s / d;
aPerspectiveMatrix.mData.m[3][2] = -(s * aNear / d);
aPerspectiveMatrix.mData.m[0][3] = 0;
aPerspectiveMatrix.mData.m[1][3] = 0;
aPerspectiveMatrix.mData.m[2][3] = s;
aPerspectiveMatrix.mData.m[3][3] = 0;
Edit to add more information:
Just playing and tweaking numbers, I have come to a "close" result. However the "close" result requires a multiplication by some kludge numbers, that I don't understand.
Here's what I'm doing to the perspective matrix to produce my close result:
//Before calling Make3DCamera, adjusting FOV:
aFOV*=smallerTexture.HeightF()/normalRenderSize.HeightF(); // Zoom it
aFOV*=1.02f; // <- WTH is this?
//Then, to pan the camera over to the x/y position I want, I do:
Matrix aPM=GetCurrentProjectionMatrix();
float aX=(screenX-normalRenderSize.WidthF()/2.0f)/2.0f;
float aY=(screenY-normalRenderSize.HeightF()/2.0f)/2.0f;
aX*=1.07f; // <- WTH is this?
aY*=1.07f; // <- WTH is this?
When I do this, my new picture is VERY close... but not exactly perfect-- the small render tends to drift away from "center" the further the "magic window" is from the center. Without the kludge number, the drift away from center with the magic window is very pronounced.
The kludge numbers 1.02f for zoom and 1.07f for pan reduce the inaccuracies and drift to a fraction of a pixel, but those numbers must be a ratio from somewhere, right? They work at ANY RESOLUTION, though: if I have a 1280x800 screen and a 256x256 magic window texture, and I change the screen to 1024x768, it all still works.
Where the heck are these numbers coming from?
If you don't care about sub-optimal performance (i.e., drawing the whole scene twice) and if you don't need the smaller scene in a texture, an easy way to obtain the overlay with pixel perfect precision is:
Set up main scene (model/view/projection matrices, etc.) and draw it as you are now.
Use glScissor to set the rectangle for the overlay. glScissor takes the screen-space x, y, width, and height and discards anything outside that rectangle. It looks like you have those four data items already, so you should be good to go.
Call glEnable(GL_SCISSOR_TEST) to actually turn on the test.
Set the shader variables (if you're using shaders) for drawing the greyscale scene/hidden objects/etc. You still use the same view and projection matrices that you used for the main scene.
Draw the greyscale scene/hidden objects/etc.
Call glDisable(GL_SCISSOR_TEST) so you won't be scissoring at the start of the next frame.
Draw the red overlay border, if desired.
Now, if you actually need the overlay in its own texture for some reason, this probably won't be adequate. It could be made to work with framebuffer objects and/or pixel readback, but this would be less efficient.
Most people completely overcomplicate such issues. There is absolutely no magic to applying transformations after applying the projection matrix.
If you have a projection matrix P (and I'm assuming default OpenGL conventions here where P is constructed in a way that the vector is post-multiplied to the matrix, so for an eye space vector v_eye, we get v_clip = P * v_eye), you can simply pre-multiply some other translate and scale transforms to cut out any region of interest.
Assume you have a viewport of size w_view * h_view pixels, and you want to find a projection matrix which renders only a tile of w_tile * h_tile pixels, beginning at pixel location (x_tile, y_tile) (again, assuming default GL conventions here: the window space origin is bottom left, so y_tile is measured from the bottom). Also note that the _tile coordinates are to be interpreted relative to the viewport. In the typical case, the viewport would start at (0,0) and have the size of your full framebuffer, but this is by no means required nor assumed here.
Since after applying the projection matrix we are in clip space, we need to transform our coordinates from window space pixels to clip space. Note that clip space is a 4D homogeneous space, but we can use any w value we like (except 0) to represent any point (as a point in the 3D space we care about forms a line in the 4D space we work in), so let's just use w=1 for simplicity's sake.
The view volume in clip space is denoted by the [-w,w] range, so in the w=1 hyperplane, it is [-1,1]. Converting our tile into this space yields:
x_clip = 2 * (x_tile / w_view) - 1
y_clip = 2 * (y_tile / h_view) - 1
w_clip = 2 * (w_tile / w_view)
h_clip = 2 * (h_tile / h_view)
We now just need to translate the objects such that the center of the tile is moved to the center of the view volume, which by definition is the origin, and scale the w_clip * h_clip sized region to the full [-1,1] extent in each dimension.
That means:
T = translate(-(x_clip + 0.5*w_clip), -(y_clip + 0.5 *h_clip), 0)
S = scale(2.0/w_clip, 2.0/h_clip, 1.0)
We can now create the modified projection matrix P' as P' = S * T * P, and that's all there is. Rendering with P' instead of P will render exactly the region of your tile to whatever viewport you are using, so for it to be pixel-exact with respect to your original viewport, you must now render with a viewport which is also w_tile * h_tile pixels big.
Note that there is also another approach: The viewport is not clamped against the framebuffer you're rendering to. It is actually valid to provide negative values for x and y. If your framebuffer for rendering your tile into is exactly w_tile * h_tile pixels, you simply could set glViewport(-x_tile, -y_tile, w_view, h_view) and render with the unmodified projection matrix P instead.
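If you want to convince yourself of the P' = S * T * P construction numerically, here is a small pure-Python sketch. It builds a standard OpenGL-style perspective matrix, derives the tile matrix with the position as an offset (minus 1) and the size as a pure scale (no offset), and projects one eye-space point through both. The helper names (`perspective`, `tile_projection`, `to_window`) and the example numbers are assumptions for illustration only; they are not taken from the question's engine.

```python
import math

def matmul(A, B):
    """Multiply two 4x4 matrices given as nested lists (row-major)."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def perspective(fov_y_deg, aspect, near, far):
    """Standard OpenGL-convention perspective matrix (column vectors)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [[f / aspect, 0.0, 0.0, 0.0],
            [0.0, f, 0.0, 0.0],
            [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
            [0.0, 0.0, -1.0, 0.0]]

def tile_projection(P, view_size, tile_origin, tile_size):
    """Return P' = S * T * P for a tile given in viewport pixels."""
    w_view, h_view = view_size
    x_tile, y_tile = tile_origin
    w_tile, h_tile = tile_size
    # Tile position and size expressed in the [-1, 1] clip range (w = 1).
    x_clip = 2.0 * x_tile / w_view - 1.0
    y_clip = 2.0 * y_tile / h_view - 1.0
    w_clip = 2.0 * w_tile / w_view
    h_clip = 2.0 * h_tile / h_view
    # Move the tile centre to the origin, then stretch the tile to [-1, 1].
    tx = -(x_clip + 0.5 * w_clip)
    ty = -(y_clip + 0.5 * h_clip)
    T = [[1.0, 0.0, 0.0, tx], [0.0, 1.0, 0.0, ty],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    S = [[2.0 / w_clip, 0.0, 0.0, 0.0], [0.0, 2.0 / h_clip, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    return matmul(S, matmul(T, P))

def to_window(P, v_eye, viewport_size):
    """Project an eye-space point to window pixels for a viewport at (0, 0)."""
    w, h = viewport_size
    clip = [sum(P[i][k] * v_eye[k] for k in range(4)) for i in range(4)]
    ndc_x, ndc_y = clip[0] / clip[3], clip[1] / clip[3]
    return ((ndc_x + 1.0) * 0.5 * w, (ndc_y + 1.0) * 0.5 * h)

P = perspective(60.0, 1280 / 800, 0.1, 100.0)
v = [0.3, -0.2, -5.0, 1.0]
full = to_window(P, v, (1280, 800))
P_tile = tile_projection(P, (1280, 800), (300, 200), (256, 256))
tile = to_window(P_tile, v, (256, 256))
# tile is the same pixel as full, measured from the tile's lower-left corner.
```

The point lands at the full-screen position minus (x_tile, y_tile), with no extra fudge factors, which is the 1:1 match the question is after.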

How can I scale/interpolate an image with indexed values smoothly?

I am wanting to scale grayscale images (input masks, really) with discrete values up smoothly. The values in these images are indexes that represent arbitrary concepts (e.g. "terrain types"; they are usually indices into a table), rather than values on a continuous scale, so they can't be averaged or blended in any way.
Do there exist algorithms that can do this with a more pleasing result than nearest-neighbour, which results in a very blocky, pixelated result? I am looking for something that will at least produce more rounded, more fluid results. The kind of thing that would be ideal would be a whitepaper, or a library (preferably in Java).
I've researched the subject, but I can't find anything. There is plenty about linear or cubic interpolation, etc., but that won't work for indexed values. The only algorithm I ever see mentioned that does not try to average values is nearest-neighbour. But there must be more?
Using colour here for clarity. I do of course understand that the preferred result here is impossible; I'm not asking for something that reconstitutes destroyed information, just hoping for something that will at least guesstimate something smoother than the first result.
Scan the destination image and for every corresponding source pixel (non-integer coordinates) check if the colors of the four surrounding pixels are the same. If yes, assign that color.
If not, perform as many bilinear interpolations as there are different colors. For this assign the weight 1 for a given color (each in turn) and 0 for the others, and interpolate the weight. Finally, keep the color with the largest weight.
By analytical geometry, one can show that in bilinear interpolation, the iso-weight curves are arcs of hyperbolas. If your magnification is large, you will see them. G1 continuity is not guaranteed. If this is an annoyance, you can work with G1 bicubic interpolation instead.
If this still does not satisfy you, you can try smooth approximating surfaces rather than interpolating ones. But the principle of keeping the color of maximum weight remains.
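Here is a minimal pure-Python sketch of the bilinear "largest weight wins" scheme described above, assuming an integer magnification factor and an image stored as a list of rows of labels. The function name and the pixel-centre alignment are my own choices; when two labels tie, the one encountered first is kept, which is arbitrary.

```python
def upscale_indexed(src, factor):
    """Upscale a 2-D grid of labels; every output value is an original label."""
    h, w = len(src), len(src[0])
    out = []
    for y in range(h * factor):
        # Map the destination pixel centre back to source coordinates.
        sy = min(max((y + 0.5) / factor - 0.5, 0.0), h - 1.0)
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * factor):
            sx = min(max((x + 0.5) / factor - 0.5, 0.0), w - 1.0)
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            # Bilinear weight of each of the four surrounding source pixels,
            # accumulated per label instead of being blended.
            weights = {}
            for label, wgt in ((src[y0][x0], (1 - fy) * (1 - fx)),
                               (src[y0][x1], (1 - fy) * fx),
                               (src[y1][x0], fy * (1 - fx)),
                               (src[y1][x1], fy * fx)):
                weights[label] = weights.get(label, 0.0) + wgt
            # Keep the label with the largest interpolated weight.
            row.append(max(weights, key=weights.get))
        out.append(row)
    return out

big = upscale_indexed([[0, 0], [0, 1]], 4)  # 8x8 grid containing only 0s and 1s
```

Accumulating the weights per label is equivalent to running one bilinear interpolation per color with weight 1 for that color and 0 for the others, and it takes care of the "all four neighbours equal" shortcut automatically.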
If there aren't many distinct colors and you want to use ready-made functions, you can work this out as follows:
split the image in several binary images (white for a chosen color, black for background);
magnify all images (to grayscale) using your favorite method;
now implement yourself a function that assigns every pixel the color that has the largest value among the magnified images.
You can also apply a smoothing filter to the binary images before or after magnification.
For the sake of illustration, here is what you would get with two colors at a time (but this easily generalizes).
Color source image:
Smoothing applied to the binary equivalents:
Maximum weight decision:
One thing you could try is to extract a polygon for the boundary of each uniformly-colored region, then upscale and draw the polygon in the output image. You won’t create neatly rounded edges, but you will avoid the stair-case effect of the nearest neighbor interpolation. Upscaling polygons should avoid gaps between the regions too.
I guess that smoothing the shape for each value individually is a way to avoid undesired mixed values.
To handle the values individually, I started with your nearest-neighbour image and created 3 images { A.bmp, B.bmp, C.bmp } by hand.
(Each image has only one colored region on a black background; e.g. A.bmp is below:)
After smoothing the shape in each image, draw these shapes into one result image buffer, each with a different color.
//I use C++ and OpenCV
#include <opencv2/opencv.hpp>
#include <string>

int main()
{
    const std::string FileNames[3] = { "A.bmp", "B.bmp", "C.bmp" };
    const cv::Scalar ResultShowColor[3] = { cv::Scalar(0,255,255), cv::Scalar(0,255,0), cv::Scalar(0,0,255) };
    cv::Mat Imgs[3];
    const int KernelSize = 15;
    for( int i=0; i<3; ++i )
    {
        Imgs[i] = cv::imread( FileNames[i], cv::IMREAD_GRAYSCALE );
        if( Imgs[i].empty() )return 0;
        // binarize, blur the shape, then binarize again at 50% to get a smoothed outline
        cv::threshold( Imgs[i], Imgs[i], 32, 255, cv::THRESH_BINARY );
        cv::GaussianBlur( Imgs[i], Imgs[i], cv::Size(KernelSize,KernelSize), 0 );
        cv::threshold( Imgs[i], Imgs[i], 255*0.5, 255, cv::THRESH_BINARY );
        cv::imshow( FileNames[i], Imgs[i] );
    }
    // paint each smoothed shape into the result with its own color
    cv::Mat ResultImg = cv::Mat::zeros( Imgs[0].size(), CV_8UC3 );
    for( int i=0; i<3; ++i )
    {
        ResultImg.setTo( ResultShowColor[i], Imgs[i] );
    }
    cv::imshow( "ResultImg", ResultImg );
    if( cv::waitKey() == 's' ){ cv::imwrite( "ResultImg.png", ResultImg ); }
    return 0;
}
This is the result:
Yes, this result is not good enough: gaps exist at the boundaries between the shapes.
Therefore some ingenuity is still required, but I am posting this because it might give you a hint.

Plot elements of specific size

I'm plotting a polygon made of edges and vertices. I'd like to plot these elements at a specific size or proportion: whether the polygon has 10 or 1000 vertices, I'd like the elements to be drawn at the same size. When zooming in and out of the vector image, element size would remain static.
For example, define a canvas of 100inx100in and draw lines .1in thick (and save to a pdf).
Currently, it seems impossible since, e.g., the LineWidth, MarkerSize, and FontSize are relative to the screen instead of the canvas. This means that when you zoom into the figure, the elements keep their size wrt screen. One option is to scale their size according to the zoom level. However, then the large polygon wouldn't necessarily fit the screen.
There are two ways that I see to resolve this, both seem impossible:
Define the size properties wrt the canvas and not the screen.
Go to the proper zoom level, and draw all elements even if they aren't in the figure clip region (save to a pdf).
Questions on the subject asked about specific elements such as lines or markers. The suggested solutions were to draw with alternative functions such as patch() and rectangle().
In that case, I'll forsake matlab's clunky drawing mechanism altogether, export the data, and draw in svg. But it would be a shame since matlab has powerful tools such as different marker shapes or a force graph.
Am I missing something fundamental or is this the worst design I've seen lately?
Matt J. observed that, in fact, when saving a pdf, there's no resolution limit regardless of the figure limitation.
Then, we can do the following:
Draw a small proof-of-concept plot with the right proportion between elements (markers, edges, and fonts). Save the data-unit-to-point ratio (sc0 below). Alternatively, you can use the same constant for all your drawings, considering this matlab's default drawing ratio.
Draw a plot of any complexity with similar proportions.
Scale it to have the same ratio as the saved one.
Save to pdf.
For example:
% draw a vertical polyline with n vertices
n = 5; % polyline size
y = 0:n;
plot( zeros( size(y) ), y, '-o', 'LineWidth', 2, 'MarkerSize', 10 );
axis equal;
% scale
sc0 = 51; % ratio calculated by data_unit_to_point_ratio() from the initial (designed) fig of a polyline of size 5
sc = data_unit_to_point_ratio() / sc0;
scale_fig_objects( sc );
% save
print( 'plot.pdf' );
If you change n=100, the figure would be a proportional mess (a thin line, markers not showing), but the pdf would be fine, having the same segment (vertex to edge) proportion.
Functions used:
% Based on Matt's suggestion
function conversionFactor = data_unit_to_point_ratio()
    set( gcf, 'Units', 'points' );
    DU = diff(xlim); % width of figure in data units
    hfig = gcf;
    P = hfig.Position(3); % width of figure in points
    conversionFactor = P / DU; % conversion factor, data units to points
end

function scale_fig_objects( s )
    hs = findobj;
    for i = 1:length( hs )
        h = hs(i);
        t = h.Type;
        if strcmpi( t, 'line' ) || strcmpi( t, 'GraphPlot' )
            h.LineWidth = h.LineWidth * s;
            h.MarkerSize = h.MarkerSize * s;
        elseif strcmpi( t, 'scatter' )
            h.SizeData = h.SizeData * s^2; % it's a squared factor!
        elseif strcmpi( t, 'text' )
            h.FontSize = h.FontSize * s;
        end
    end
end

What algorithms or approaches apart from Haar cascades could be used for custom objects detection?

I need to do computer vision tasks in order to detect water bottles or soda cans. I will obtain 'frontal' images of bottles, soda cans or any other random objects (one by one) and my algorithm should determine whether it's a bottle, a can or neither of them.
Some details about object detecting scenario:
As mentioned, I will test one single object per image/video frame.
Not all water bottles are the same. There could be variation in the color of the plastic, the lid or the label. Some might have no label or lid.
Same about variation goes for soda cans. No wrinkled soda cans are gonna be tested though.
There could be small size variation between objects.
I could have a green (or any custom color) background.
I will do any needed filters on image.
This will be run on a Raspberry Pi.
Just in case, an example of each:
I've tested OpenCV's face detection algorithms a couple of times and I know they work pretty well, but with this approach I'd need to obtain a special Haar cascade features XML file for detecting each custom object.
So, the distinct alternatives I have in mind are:
Creating a custom Haar Classifier.
Considering shapes.
Considering outlines.
I'd like to get a simple algorithm and I think creating a custom Haar classifier could be even not needed. What would you suggest?
I strongly considered the shape/aspect ratio approach.
However I guess I'm facing some issues as bottles come in distinct sizes or even shapes each. But this made me think or set following considerations:
I'm applying a threshold with THRESH_BINARY method. (Thanks to the answers).
I will use a white background on detection.
Soda cans are all same size.
So, a bounding box for soda cans with high accuracy might distinguish a can.
What I've achieved:
Threshold really helped me, I could notice that on white background tests I would obtain for cans:
And this is what it's obtained for bottles:
So, the dominance of darker areas is noticeable. There are some cases with cans where this might turn into false negatives. And for bottles, light and angle may lead to inconsistent results, but I really think this could be a shorter approach.
So, I'm quite confused now about how I should evaluate that dominance of dark areas. I've read that findContours leads to it, but I'm quite lost on how to make use of such a function. For example, in the case of soda cans, it may find several contours, so I get lost on what to evaluate.
Note: I'm open to testing any other algorithms or libraries besides OpenCV.
I see few basic ideas here:
Check the object's (to be precise, the object's bounding rect) width/height ratio. For a can it's approximately 2-2.5; for a bottle I think it will be >3. It's a very simple idea, so it should be easy to test quickly, and I think it should have quite good accuracy. For in-between values like 2.75 (assuming that the values I gave are correct, which most likely isn't true) you can use a different algorithm.
Check whether your object contains glass/transparent regions; if yes, then it's definitely a bottle. Here you can read more about it.
Use the grabcut algorithm to get an object mask/more precise shape and check whether the shape's width at the top is similar to its width at the bottom; if yes then it's a can, if no then it's a bottle (bottles have a screw cap at the top).
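The first idea is simple enough to sketch without any library. The snippet below is only an illustration under assumptions: it takes a binary mask (list of rows, non-zero = object), computes the bounding rect, and classifies by the ratio of the long side to the short side, read here as height divided by width, which is how the 2-2.5 and >3 figures above appear to be meant. The function names and default thresholds are hypothetical and would need tuning on real images.

```python
def bounding_box(mask):
    """Return (x, y, width, height) of the non-zero pixels, or None if empty."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        return None
    return cols[0], rows[0], cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1

def classify_by_aspect(mask, can_min=2.0, can_max=2.5, bottle_min=3.0):
    """Classify an upright object by height/width (thresholds are guesses)."""
    box = bounding_box(mask)
    if box is None:
        return "none"
    _, _, width, height = box
    ratio = height / width
    if ratio >= bottle_min:
        return "bottle"
    if can_min <= ratio <= can_max:
        return "can"
    if ratio > can_max:
        return "unsure"  # in-between values: fall back to another test
    return "other"
```

The "unsure" band is where one of the other two checks (transparency, or top width vs bottom width) would take over.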
Since you want to recognize can vs bottle rather than pepsi vs coke, shape matching is probably the way to go when compared to Haar and the features2d matchers like SIFT/SURF/ORB
A unique background color will make things easier.
First create a histogram from an image of just the background
int channels[] = {0,1,2}; // use all the channels
int rgb_bins = 32; // quantize to 32 colors per channel
int histSize[] = {rgb_bins, rgb_bins, rgb_bins};
float _range[] = {0,256}; // the upper bound is exclusive
const float* ranges[] = {_range, _range, _range};
cv::SparseMat bghist;
cv::calcHist(&bg_image, 1, channels, cv::noArray(),bghist, 3, histSize, ranges );
Then use calcBackProject to create a mask of bg and not bg
cv::MatND temp_ND;
cv::calcBackProject( &bottle_image, 1, channels, bghist, temp_ND, ranges );
cv::Mat bottle_mask, bottle_backproj;
if( feeling_lazy ){
cv::normalize(temp_ND, bottle_backproj, 0, 255, cv::NORM_MINMAX, CV_8U);
//a small blur here could work nicely
threshold( bottle_backproj, bottle_mask, 0, 255, THRESH_OTSU );
bottle_mask = cv::Scalar(255) - bottle_mask; //invert the mask
} else {
//finding just the right value here might be better than the above method
int magic_threshold = 64;
temp_ND.convertTo( bottle_backproj, CV_8U, 255.);
//I expect temp_ND to be CV_32F ranging from 0-1, but I might be wrong.
threshold( bottle_backproj, bottle_mask, magic_threshold, 255, THRESH_BINARY_INV );
}
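In case the histogram/back-projection step looks like magic, here is a tiny pure-Python sketch of the same idea (it is not what calcHist/calcBackProject do internally in detail, just the principle): count how often each quantized background color occurs, then mark every pixel whose color was rare or absent in the background as foreground. The function names, the peak normalization and the 0.25 cut-off are assumptions for illustration.

```python
def quantise(pixel, bins=32):
    """Map an (r, g, b) pixel with 0-255 channels to a histogram bin index."""
    step = 256 // bins
    return tuple(c // step for c in pixel)

def background_histogram(bg_pixels, bins=32):
    """Build a normalised colour histogram from background-only pixels."""
    hist = {}
    for p in bg_pixels:
        key = quantise(p, bins)
        hist[key] = hist.get(key, 0) + 1
    peak = max(hist.values())
    # Scale so that the most common background colour scores 1.0.
    return {k: v / peak for k, v in hist.items()}

def foreground_mask(image, hist, bins=32, threshold=0.25):
    """Back-project the histogram; 1 marks pixels unlikely to be background."""
    return [[0 if hist.get(quantise(p, bins), 0.0) >= threshold else 1
             for p in row]
            for row in image]

# Green-ish background samples and a 2x2 image with one red pixel.
bg = [(20, 200, 30)] * 50 + [(22, 198, 33)] * 30
hist = background_histogram(bg)
image = [[(20, 200, 30), (200, 10, 10)],
         [(22, 198, 33), (20, 200, 30)]]
mask = foreground_mask(image, hist)
```

Fewer bins make the mask more tolerant of lighting changes at the cost of confusing objects whose color is close to the background.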
Then either:
Compare bottle_mask or bottle_backproj to a few sample bottle masks/backprojections using matchTemplate with a threshold on confidence to decide if it's a match.
matchTemplate(bottle_mask, bottle_template, result, CV_TM_CCORR_NORMED);
double confidence; minMaxLoc( result, NULL, &confidence);
Or use matchShapes, though I've never gotten this to work properly.
double confidence = matchShapes(bottle_mask, bottle_template, CV_CONTOURS_MATCH_I3, 0); // last argument is required but unused
Or use linemod which is difficult to set up but works great for images like this where the shape isn't very complex. Aside from the linked file, I haven't found any working samples of this method so here's what I did.
First create/train the detector with some sample images
//some magic numbers
std::vector<int> T_at_level;
//add some padding so linemod doesn't scream at you
const int T = 32;
int width = bottle_mask.cols;
if( width % T != 0)
width += T - width % T;
int height = bottle_mask.rows;
if( height % T != 0)
height += T - height % T;
//in this case template_backproj is created specifically from a sample bottle_backproj
cv::Rect padded_roi( (width - template_backproj.cols)/2, (height - template_backproj.rows)/2, template_backproj.cols, template_backproj.rows);
cv::Mat padded_backproj = cv::Mat::zeros( height, width, template_backproj.type()); //rows first, then cols
template_backproj.copyTo( padded_backproj( padded_roi ) ); //plain assignment to the ROI would not copy the pixels
cv::Mat padded_mask = cv::Mat::zeros( height, width, template_mask.type());
template_mask.copyTo( padded_mask( padded_roi ) );
//you might need to erode padded_mask by a few pixels.
//initialize detector
std::vector< cv::Ptr<cv::linemod::Modality> > modalities;
modalities.push_back( cv::makePtr<cv::linemod::ColorGradient>() ); //for those that don't have a kinect
cv::Ptr<cv::linemod::Detector> new_detector = cv::makePtr<cv::linemod::Detector>(modalities, T_at_level);
//add sample images to the detector
std::vector<cv::Mat> template_images;
template_images.push_back( padded_backproj);
cv::Rect ignore_me;
const std::string class_id = "bottle";
int template_id = new_detector->addTemplate(template_images, class_id, padded_mask, &ignore_me);
Then do some matching
std::vector<cv::Mat> sources_vec;
sources_vec.push_back( padded_backproj );
//padded_backproj doesn't need to be the same size as the trained template images, but it does need to be padded the same way.
float matching_threshold = 0.8; //a higher number makes the algorithm faster
std::vector<cv::linemod::Match> matches;
std::vector<cv::String> class_ids;
new_detector->match(sources_vec, matching_threshold, matches,class_ids);
float confidence = matches.size() > 0? matches[0].similarity : 0;
As cyriel suggests, the aspect ratio (width/height) might be one useful measure. Here is some OpenCV Python code that finds contours (hopefully including the outline of the bottle or can) and gives you aspect ratio and some other measurements:
import cv2

# src image should have already had some contrast enhancement (such as
# cv2.threshold) and edge finding (such as cv2.Canny)
contours, hierarchy = cv2.findContours(src, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    num_points = len(contour)
    if num_points < 5:
        # The contour has too few points to fit an ellipse. Skip it.
        continue
    # We could use area to help determine the type of object.
    # Small contours are probably false detections (not really a whole object).
    area = cv2.contourArea(contour)
    bounding_ellipse = cv2.fitEllipse(contour)
    center, radii, angle_degrees = bounding_ellipse
    # Let's define an ellipse's normal orientation to be landscape (width > height).
    # We must ensure that the ellipse's measurements match this orientation.
    if radii[0] < radii[1]:
        radii = (radii[1], radii[0])
        angle_degrees -= 90.0
    # We could use the angle to help determine the type of object.
    # A bottle or can's angle is probably approximately a multiple of 90 degrees,
    # assuming that it is at rest and not falling.
    # Calculate the aspect ratio (long axis / short axis, so always >= 1).
    # For example, 2.0 means the object is 2 times as long as it is wide.
    # A bottle is probably more elongated than a can.
    aspect_ratio = radii[0] / radii[1]
For checking transparency, you can compare the picture to a known background using histogram analysis or background subtraction.
The contour's moments can be used to determine its centroid (center of gravity):
moments = cv2.moments(contour)
m00 = moments['m00']
m01 = moments['m01']
m10 = moments['m10']
centroid = (m10 / m00, m01 / m00)
You could compare this to the center. If the object is bigger ("heavier") on one end, the centroid will be closer to that end than the center is.
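To make the centroid-versus-center comparison concrete, here is a minimal sketch in plain Python (no OpenCV). It assumes the contour is a simple closed polygon given as a list of (x, y) points, and it uses the bounding-box center as a stand-in for the fitted ellipse's center. The helper names `polygon_centroid`, `bounding_box_center` and `centroid_offset` are made up for this illustration.

```python
def polygon_centroid(points):
    """Area-weighted centroid of a closed polygon [(x, y), ...] (shoelace formula)."""
    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate contour (zero area)")
    # Centroid = sum / (6 * area), and area = twice_area / 2.
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)


def bounding_box_center(points):
    """Center of the axis-aligned bounding box (stand-in for the ellipse center)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs) + max(xs)) / 2.0, (min(ys) + max(ys)) / 2.0


def centroid_offset(points):
    """Vector from the bounding-box center to the centroid."""
    cx, cy = polygon_centroid(points)
    bx, by = bounding_box_center(points)
    return cx - bx, cy - by


# A bottle-like silhouette in image coordinates (y grows downward):
# a narrow neck on top of a wide body.
bottle = [(1.5, 0), (2.5, 0), (2.5, 4), (4, 4), (4, 10), (0, 10), (0, 4), (1.5, 4)]
dx, dy = centroid_offset(bottle)
print(dx, dy)
```

For this silhouette the offset points downward (positive dy), toward the wide body, which is the "heavier end" effect described above.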
So, my main approach for detection was:
Bottles are transparent and cans are opaque
In general, the algorithm consisted of:
Take a grayscale picture.
Apply a binary threshold.
Select a convenient ROI from it.
Obtain its color mean and even the standard deviation.
Implementation was basically reduced to this function (where CAN and BOTTLE were previously defined):
int detector(int x, int y, int width, int height, int thresholdValue, CvCapture* capture) {
    Mat img;
    Rect r;
    vector<Mat> channels;
    r = Rect(x,y,width,height);
    if ( !capture ) {
        fprintf( stderr, "ERROR: capture is NULL \n" );
        return -1;
    }
    img = Mat(cvQueryFrame( capture ));
    threshold(img, img, 127, 255, THRESH_BINARY);
    // ROI
    Mat roiImage = img(r);
    split(roiImage, channels);
    Scalar m = mean(channels[0]);
    float media = m[0];
    printf("Media: %f\n", media);
    if (media < thresholdValue) {
        return CAN;
    }
    else {
        return BOTTLE;
    }
}
As can be seen, a THRESH_BINARY threshold was applied, and a plain white background was used. However, the main and critical issue I faced with this whole approach and algorithm was luminosity changes in the environment, even minor ones.
Sometimes I noticed that THRESH_BINARY_INV might help more, but I wonder whether I could use certain threshold parameters, or whether applying other filters might eliminate environment lighting as an issue.
I really appreciate the aspect ratio calculation approach from the bounding box or from finding contours, but I found this one straightforward and simple when conditions were adjusted.
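The threshold, ROI, mean and standard deviation steps listed above can be sketched without OpenCV as follows. This is only an illustration on a grayscale image stored as a list of rows. The labels `CAN` and `BOTTLE`, the helper names and the cutoff value are assumptions for the example, not the constants from the original program.

```python
import math

CAN, BOTTLE = 0, 1  # hypothetical labels for this sketch


def binary_threshold(gray, thresh=127, max_value=255):
    """Same rule as THRESH_BINARY: max_value where pixel > thresh, else 0."""
    return [[max_value if px > thresh else 0 for px in row] for row in gray]


def roi_stats(image, x, y, width, height):
    """Mean and (population) standard deviation of the pixels inside the ROI."""
    values = [px for row in image[y:y + height] for px in row[x:x + width]]
    if not values:
        raise ValueError("empty ROI")
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, math.sqrt(variance)


def classify(gray, roi, mean_cutoff):
    """Threshold the image, then label the ROI by its mean brightness."""
    x, y, w, h = roi
    mean, std = roi_stats(binary_threshold(gray), x, y, w, h)
    label = CAN if mean < mean_cutoff else BOTTLE
    return label, mean, std


# Opaque object in front of a white background: the ROI stays dark.
dark = [[30] * 4 for _ in range(4)]
# Transparent object: the white background shines through.
bright = [[200] * 4 for _ in range(4)]
print(classify(dark, (1, 1, 2, 2), 128))
print(classify(bright, (1, 1, 2, 2), 128))
```

The standard deviation is returned alongside the mean so that a mixed ROI (part object, part background) can be told apart from a uniform one. How to use it is left open here, as it is in the description above.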
I'd use deep learning, based on Transfer learning.
The idea is this: given a highly complex, well-trained neural network that was trained on a similar classification task (typically over a large public dataset, like ImageNet), you can freeze the majority of its weights and only train the last layers. There are lots of tutorials out there. You don't need to have a background in deep learning.
There is a tutorial which is almost out of the box with tensorflow here and here there is another based on keras.

Transform point position in trapezoid to rectangle position

I am trying to find out how I can transform a coordinate Pxy within the green trapezoid below into the equivalent coordinate on the real ground plane.
I have the exact measures of the room, meaning I can exactly say how long A,B,C and D are in that room shown below.
Also, I know how long A, B, C and D are in that green trapezoid (coordinate-wise).
I have already been reading about homography and matrix transformation, but can't really wrap my head around it. Any input steering me into the right direction would be appreciated.
Here is code that computes the perspective transformation matrix using the OpenCV library (it shows how to transform your trapezoid into a rectangle and how to find the transformation matrix for further calculations):
//example from book
// Learning OpenCV: Computer Vision with the OpenCV Library
// by Gary Bradski and Adrian Kaehler
// Published by O'Reilly Media, October 3, 2008
#include <cv.h>
#include <highgui.h>
#include <assert.h>
#include <stdlib.h>
#include <stdio.h>
int main(int argc, char* argv[])
{
    IplImage *src=0, *dst=0;
    // absolute or relative path to image should be in argv[1]
    char* filename = argc == 2 ? argv[1] : "Image0.jpg";
    // get the picture
    src = cvLoadImage(filename,1);
    printf("[i] image: %s\n", filename);
    assert( src != 0 );
    // points (corners of )
    CvPoint2D32f srcQuad[4], dstQuad[4];
    // transformation matrix
    CvMat* warp_matrix = cvCreateMat(3,3,CV_32FC1);
    // clone image
    dst = cvCloneImage(src);
    // define all the points
    //here the coordinates of corners of your trapezoid
    srcQuad[0].x = ??; //src Top left
    srcQuad[0].y = ??;
    srcQuad[1].x = ??; //src Top right
    srcQuad[1].y = ??;
    srcQuad[2].x = ??; //src Bottom left
    srcQuad[2].y = ??;
    srcQuad[3].x = ??; //src Bot right
    srcQuad[3].y = ??;
    //- - - - - - - - - - - - - -//
    //coordinates of rectangle in src image
    dstQuad[0].x = 0; //dst Top left
    dstQuad[0].y = 0;
    dstQuad[1].x = src->width-1; //dst Top right
    dstQuad[1].y = 0;
    dstQuad[2].x = 0; //dst Bottom left
    dstQuad[2].y = src->height-1;
    dstQuad[3].x = src->width-1; //dst Bot right
    dstQuad[3].y = src->height-1;
    // get transformation matrix that you can use to calculate
    //coordinates of point Pxy
    cvGetPerspectiveTransform(srcQuad, dstQuad, warp_matrix);
    // perspective transformation
    cvWarpPerspective(src, dst, warp_matrix, CV_INTER_LINEAR + CV_WARP_FILL_OUTLIERS, cvScalarAll(0));
    cvNamedWindow( "cvWarpPerspective", 1 );
    cvShowImage( "cvWarpPerspective", dst );
    // wait for a key press so the window stays visible
    cvWaitKey(0);
    return 0;
}
Hope this helps!
If I understand your question correctly, you are looking for the transform matrix that expresses the position and orientation (aka the "pose") of your camera in relation to the world. If you have this matrix (let's call it M), you can map any point from your camera coordinate frame to the world coordinate frame and vice versa. In your case you'll want to transform a rectangle onto the plane (0, 1, 0)^T + 0 in world coordinates.
There are several ways to derive this pose matrix. First of all you'll need to know another matrix, K, which describes the internal camera parameters used to convert positions in the camera coordinate frame to actual pixel positions. This involves a standard pinhole projection as well as radial distortion and a few other things.
To determine both K and M you have to calibrate your camera. This is usually done by taking a calibration pattern (e.g. a chessboard pattern) for which the positions of the chessboard fields are known. Then you can establish so-called point correspondences between the known positions on the pattern and the observed pixel positions. Once you have enough of these point pairs you can solve for a matrix H = KM. This is the homography matrix you've already mentioned. Once you have that, you can reconstruct K and M.
So much for the theory. For the practical part I would suggest having a look at the OpenCV documentation (e.g. you could start here: OpenCV Camera calibration and here: OpenCV Pose estimation).
I hope this will point you in the right directions ;)
Just for the sake of completeness: I ended up looking at the thread suggested by @mmgp and implemented a solution that is equivalent to the one presented by Christopher R. Wren:
Perspective Transform Estimation
This turned out to work really well for my case, although there was some distortion from the camera.
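For readers who want to see the idea without OpenCV, here is a minimal plain-Python sketch of the standard four-point estimate: fix the last matrix entry to 1, build an 8x8 linear system from the four corner correspondences, solve it, and map a point with a homogeneous divide. It may differ in detail from the formulation in the note above. The function names and the example coordinates (a trapezoid in pixels, a 5 x 4 room) are made up for illustration.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (are three points collinear?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def find_homography(src, dst):
    """3x3 matrix mapping four src points (x, y) onto four dst points (u, v)."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def apply_homography(h, point):
    """Map one point through h, including the homogeneous divide."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v


# Corner order: top left, top right, bottom left, bottom right.
trapezoid = [(100, 100), (300, 100), (0, 300), (400, 300)]  # pixels (example values)
room = [(0, 0), (5, 0), (0, 4), (5, 4)]                      # ground plane (example values)
H = find_homography(trapezoid, room)
print(apply_homography(H, (200, 200)))  # a point Pxy inside the trapezoid
```

Four correspondences determine the matrix exactly as long as no three of the points are collinear. With more than four points a least-squares fit would be needed, which this sketch does not cover.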
