I 'm trying to find an efficient way of acceptable complexity to
detect an object in an image so I can isolate it from its surroundings
segment that object to its sub-parts and label them so I can then fetch them at will
It's been 3 weeks since I entered the image processing world and I've read about so many algorithms (sift, snakes, more snakes, fourier-related, etc.), and heuristics that I don't know where to start and which one is "best" for what I'm trying to achieve. Having in mind that the image dataset in interest is a pretty large one, I don't even know if I should use some algorithm implemented in OpenCV or if I should implement one my own.
Summarize:
Which methodology should I focus on? Why?
Should I use OpenCV for that kind of stuff or is there some other 'better' alternative?
Thank you in advance.
EDIT -- More info regarding the datasets
Each dataset consists of 80K images of products sharing the same
concept e.g. t-shirts, watches, shoes
size
orientation (90% of them)
background (95% of them)
All pictures in each datasets look almost identical apart from the product itself, apparently. To make things a little more clear, let's consider only the 'watch dataset':
All the pictures in the set look almost exactly like this:
(again, apart form the watch itself). I want to extract the strap and the dial. The thing is that there are lots of different watch styles and therefore shapes. From what I've read so far, I think I need a template algorithm that allows bending and stretching so as to be able to match straps and dials of different styles.
Instead of creating three distinct templates (upper part of strap, lower part of strap, dial), it would be reasonable to create only one and segment it into 3 parts. That way, I would be confident enough that each part was detected with respect to each other as intended to e.g. the dial would not be detected below the lower part of the strap.
From all the algorithms/methodologies I've encountered, active shape|appearance model seem to be the most promising ones. Unfortunately, I haven't managed to find a descent implementation and I'm not confident enough that that's the best approach so as to go ahead and write one myself.
If anyone could point out what I should be really looking for (algorithm/heuristic/library/etc.), I would be more than grateful. If again you think my description was a bit vague, feel free to ask for a more detailed one.
From what you've said, here are a few things that pop up at first glance:
Simplest thing to do it binarize the image and do Connected Components using OpenCV or CvBlob library. For simple images with non-complex background this usually yeilds objects
HOwever, looking at your sample image, texture-based segmentation techniques may work better - the watch dial, the straps and the background are wisely variant in texture/roughness, and this could be an ideal way to separate them.
The roughness of a portion can be easily found by the Eigen transform (explained a bit on SO, check the link to the research paper provided there), then the Mean Shift filter can be applied on the output of the Eigen transform. This will give regions clearly separated according to texture. Both the pyramidal Mean Shift and finding eigenvalues by SVD are implemented in OpenCV, so unless you can optimize your own code its better (and easier) to use inbuilt functions (if present) as far as speed and efficiency is concerned.
I think I would turn the problem around. Instead of hunting for the dial, I would use a set of robust features from the watch to 'stitch' the target image onto a template. The first watch has a set of squares in the dial that are white, the second watch has a number of white circles. I would per type of watch:
Segment out the squares or circles in the dial. Segmentation steps can be tricky as they are usually both scale and light dependent
Estimate the centers or corners of the above found feature areas. These are the new feature points.
Use the Hungarian algorithm to match features between the template watch and the target watch. Alternatively, one can take the surroundings of each feature point in the original image and match these using cross correlation
Use matching features between the template and the target to estimate scaling, rotation and translation
Stitch the image
As the image is now in a known form, one can extract the regions simply via pre set coordinates
Related
I have an idea for an app that takes a printed page with four squares in each corner and allows you to measure objects on the paper given at least two squares are visible. I want to be able to have a user take a picture from less than perfect angles and still have the objects be measured accurately.
I'm unable to figure out exactly how to find information on this subject due to my lack of knowledge in the area. I've been able to find examples of opencv code that does some interesting transforms and the like but I've yet to figure out what I'm asking in simpler terms.
Does anyone know of papers or mathematical concepts I can lookup to get further into this project?
I'm not quite sure how or who to ask other than people on this forum, sorry for the somewhat vague question.
What you describe is very reminiscent of augmented reality marker tracking. Maybe you can start by searching these words on a search engine of your choice.
A single marker, if done correctly, can be used to identify it without confusing it with other markers AND to determine how the surface is placed in 3D space in front of the camera.
But that's all very difficult and advanced stuff, I'd greatly advise to NOT try and implement something like this, it would take years of research... The only way you have is to use a ready-made open source library that outputs the data you need for your app.
It may even not exist. In that case you'll have to buy one. Given the niché of your problem that would be perfectly plausible.
Here I give you only the programming aspect and if you want you can find out about the mathematical aspect from those examples. Most of the functions you need can be done using OpenCV. Here are some examples in python:
To detect the printed paper, you can use cv2.findContours function. The most outer contour is possibly the paper, but you need to test on actual images. https://docs.opencv.org/3.1.0/d4/d73/tutorial_py_contours_begin.html
In case of sloping (not in perfect angle), you can find the angle by cv2.minAreaRect which return the angle of the contour you found above. https://docs.opencv.org/3.1.0/dd/d49/tutorial_py_contour_features.html (part 7b).
If you want to rotate the paper, use cv2.warpAffine. https://docs.opencv.org/3.0-beta/doc/py_tutorials/py_imgproc/py_geometric_transformations/py_geometric_transformations.html
To detect the object in the paper, there are some methods. The easiest way is using the contours above. If the objects are in certain colors, you can detect it by using color filter. https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_imgproc/py_colorspaces/py_colorspaces.html
I am trying to count the number of hairs transplanted in the following image. So practically, I have to count the number of spots I can find in the center of image.
(I've uploaded the inverted image of a bald scalp on which new hairs have been transplanted because the original image is bloody and absolutely disgusting! To see the original non-inverted image click here. To see the larger version of the inverted image just click on it). Is there any known image processing algorithm to detect these spots? I've found out that the Circle Hough Transform algorithm can be used to find circles in an image, I'm not sure if it's the best algorithm that can be applied to find the small spots in the following image though.
P.S. According to one of the answers, I tried to extract the spots using ImageJ, but the outcome was not satisfactory enough:
I opened the original non-inverted image (Warning! it's bloody and disgusting to see!).
Splited the channels (Image > Color > Split Channels). And selected the blue channel to continue with.
Applied Closing filter (Plugins > Fast Morphology > Morphological Filters) with these values: Operation: Closing, Element: Square, Radius: 2px
Applied White Top Hat filter (Plugins > Fast Morphology > Morphological Filters) with these values: Operation: White Top Hat, Element: Square, Radius: 17px
However I don't know what to do exactly after this step to count the transplanted spots as accurately as possible. I tried to use (Process > Find Maxima), but the result does not seem accurate enough to me (with these settings: Noise tolerance: 10, Output: Single Points, Excluding Edge Maxima, Light Background):
As you can see, some white spots have been ignored and some white areas which are not actually hair transplant spots, have been marked.
What set of filters do you advise to accurately find the spots? Using ImageJ seems a good option since it provides most of the filters we need. Feel free however, to advise what to do using other tools, libraries (like OpenCV), etc. Any help would be highly appreciated!
I do think you are trying to solve the problem in a bit wrong way. It might sound groundless, so I'd better show my results first.
Below I have a crop of you image on the left and discovered transplants on the right. Green color is used to highlight areas with more than one transplant.
The overall approach is very basic (will describe it later), but still it provides close to be accurate results. Please note, it was a first try, so there is a lot of room for enhancements.
Anyway, let's get back to the initial statement saying you approach is wrong. There are several major issues:
the quality of your image is awful
you say you want to find spots, but actually you are looking for hair transplant objects
you completely ignores the fact average head is far from being flat
it does look like you think filters will add some important details to your initial image
you expect algorithms to do magic for you
Let's review all these items one by one.
1. Image quality
It might be very obvious statement, but before the actual processing you need to make sure you have best possible initial data. You might spend weeks trying to find a way to process photos you have without any significant achievements. Here are some problematic areas:
I bet it is hard for you to "read" those crops, despite the fact you have the most advanced object recognition algorithms in your brain.
Also, your time is expensive and you still need best possible accuracy and stability. So, for any reasonable price try to get: proper contrast, sharp edges, better colors and color separation.
2. Better understanding of the objects to be identified
Generally speaking, you have a 3D objects to be identified. So you can analyze shadows in order to improve accuracy. BTW, it is almost like a Mars surface analysis :)
3. The form of the head should not be ignored
Because of the form of the head you have distortions. Again, in order to get proper accuracy those distortions should be corrected before the actual analysis. Basically, you need to flatten analyzed area.
3D model source
4. Filters might not help
Filters do not add information, but they can easily remove some important details. You've mentioned Hough transform, so here is interesting question: Find lines in shape
I will use this question as an example. Basically, you need to extract a geometry from a given picture. Lines in shape looks a bit complex, so you might decide to use skeletonization
All of a sadden, you have more complex geometry to deal with and virtually no chances to understand what actually was on the original picture.
5. Sorry, no magic here
Please be aware of the following:
You must try to get better data in order to achieve better accuracy and stability. The model itself is also very important.
Results explained
As I said, my approach is very simple: image was posterized and then I used very basic algorithm to identify areas with a specific color.
Posterization can be done in a more clever way, areas detection can be improved, etc. For this PoC I just have a simple rule to highlight areas with more than one implant. Having areas identified a bit more advanced analysis can be performed.
Anyway, better image quality will let you use even simple method and get proper results.
Finally
How did the clinic manage to get Yondu as client? :)
Update (tools and techniques)
Posterization - GIMP (default settings,min colors)
Transplant identification and visualization - Java program, no libraries or other dependencies
Having areas identified it is easy to find average size, then compare to other areas and mark significantly bigger areas as multiple transplants.
Basically, everything is done "by hand". Horizontal and vertical scan, intersections give areas. Vertical lines are sorted and used to restore the actual shape. Solution is homegrown, code is a bit ugly, so do not want to share it, sorry.
The idea is pretty obvious and well explained (at least I think so). Here is an additional example with different scan step used:
Yet another update
A small piece of code, developed to verify a very basic idea, evolved a bit, so now it can handle 4K video segmentation in real-time. The idea is the same: horizontal and vertical scans, areas defined by intersected lines, etc. Still no external libraries, just a lot of fun and a bit more optimized code.
Additional examples can be found on YouTube: RobotsCanSee
or follow the progress in Telegram: RobotsCanSee
I've just tested this solution using ImageJ, and it gave good preliminary result:
On the original image, for each channel
Small (radius 1 or 2) closing in order to get rid of the hairs (black part in the middle of the white one)
White top-hat of radius 5 in order to detect the white part around each black hair.
Small closing/opening in order to clean a little bit the image (you can also use a median filter)
Ultimate erode in order to count the number of white blob remaining. You can also certainly use a LoG (Laplacian of Gaussian) or a distance map.
[EDIT]
You don't detect all the white spots using the maxima function, because after the closing, some zones are flat, so the maxima is not a point, but a zone. At this point, I think that an ultimate opening or an ultimate eroded would give you the center or each white spot. But I am not sure that there is a function/pluggin doing it in ImageJ. You can take a look to Mamba or SMIL.
A H-maxima (after white top-hat) may also clean a little bit more your results and improve the contrast between the white spots.
As Renat mentioned, you should not expect algorithms to do magic for you, however I'm hopeful to come up with a reasonable estimate of the number of spots. Here, I'm going to give you some hints and resources, check them out and call me back if you need more information.
First, I'm kind of hopeful to morphological operations, but I think a perfect pre-processing step may push the accuracy yielded by them dramatically. I want you put my finger on the pre-processing step. Thus I'm going ti work with this image:
That's the idea:
Collect and concentrate the mass around the spot locations. What do I mean my concentrating the masses? Let's open the book from the other side: As you see, the provided image contains some salient spots surrounded by some noisy gray-level dots.
By dots, I mean the pixels that are not part of a spot, but their gray-value are larger than zero (pure black) - which are available around the spots. It is clear that if you clear these noisy dots, you surely will come up with a good estimate of spots using other processing tools such as morphological operations.
Now, how to make the image more sharp? What if we could make the dots to move forward to their nearest spots? This is what I mean by concentrating the masses over the spots. Doing so, only the prominent spots will be present in the image and hence we have made a significant step toward counting the prominent spots.
How to do the concentrating thing? Well, the idea that I just explained is available in this paper, which its code is luckily available. See the section 2.2. The main idea is to use a random walker to walk on the image for ever. The formulations is stated such that the walker will visit the prominent spots far more times and that can lead to identifying the prominent spots. The algorithm is modeled Markov chain and The equilibrium hitting times of the ergodic Markov chain holds the key for identifying the most salient spots.
What I described above is just a hint and you should read that short paper to get the detailed version of the idea. Let me know if you need more info or resources.
That is a pleasure to think on such interesting problems. Hope it helps.
You could do the following:
Threshold the image using cv::threshold
Find connected components using cv::findcontour
Reject the connected components of size larger than a certain size as you seem to be concerned about small circular regions only.
Count all the valid connected components.
Hopefully, you have a descent approximation of the actual number of spots.
To be statistically more accurate, you could repeat 1-4 for a range of thresholds and take the average.
This is what you get after applying unsharpen radius 22, amount 5, threshold 2 to your image.
This increases the contrast between the dots and the surrounding areas. I used the ballpark assumption that the dots are somewhere between 18 and 25 pixels in diameter.
Now you can take the local maxima of white as a "dot" and fill it in with a black circle until the circular neighborhood of the dot (a circle of radius 10-12) erases the dot. This should let you "pick off" the dots joined to each other in clusters more than 2. Then look for local maxima again. Rinse and repeat.
The actual "dot" areas are in stark contrast to the surrounding areas, so this should let you pick them off as well as you would by eyeballing it.
I have a web cam that takes a picture every N seconds. This gives me a collection of images of the same scene over time. I want to process that collection of images as they are created to identify events like someone entering into the frame, or something else large happening. I will be comparing images that are adjacent in time and fixed in space - the same scene at different moments of time.
I want a reasonably sophisticated approach. For example, naive approaches fail for outdoor applications. If you count the number of pixels that change, for example, or the percentage of the picture that has a different color or grayscale value, that will give false positive reports every time the sun goes behind a cloud or the wind shakes a tree.
I want to be able to positively detect a truck parking in the scene, for example, while ignoring lighting changes from sun/cloud transitions, etc.
I've done a number of searches, and found a few survey papers (Radke et al, for example) but nothing that actually gives algorithms that I can put into a program I can write.
Use color spectroanalisys, without luminance: when the Sun goes down for a while, you will get similar result, colors does not change (too much).
Don't go for big changes, but quick changes. If the luminance of the image changes -10% during 10 min, it means the usual evening effect. But when the change is -5%, 0, +5% within seconds, its a quick change.
Don't forget to adjust the reference values.
Split the image to smaller regions. Then, when all the regions change same way, you know, it's a global change, like an eclypse or what, but if only one region's parameters are changing, then something happens there.
Use masks to create smart regions. If you're watching a street, filter out the sky, the trees (blown by wind), etc. You may set up different trigger values for different regions. The regions should overlap.
A special case of the region is the line. A line (a narrow region) contains less and more homogeneous pixels than a flat area. Mark, say, a green fence, it's easy to detect wheter someone crosses it, it makes bigger change in the line than in a flat area.
If you can, change the IRL world. Repaint the fence to a strange color to create a color spectrum, which can be identified easier. Paint tags to the floor and wall, which can be OCRed by the program, so you can detect wheter something hides it.
I believe you are looking for Template Matching
Also i would suggest you to look on to Open CV
We had to contend with many of these issues in our interactive installations. It's tough to not get false positives without being able to control some of your environment (sounds like you will have some degree of control). In the end we looked at combining some techniques and we created an open piece of software named OpenTSPS (Open Toolkit for Sensing People in Spaces - http://www.opentsps.com). You can look at the C++ source in github (https://github.com/labatrockwell/openTSPS/).
We use ‘progressive background relearn’ to adjust to the changing background over time. Progressive relearning is particularly useful in variable lighting conditions – e.g. if lighting in a space changes from day to night. This in combination with blob detection works pretty well and the only way we have found to improve is to use 3D cameras like the kinect which cast out IR and measure it.
There are other algorithms that might be relevant, like SURF (http://achuwilson.wordpress.com/2011/08/05/object-detection-using-surf-in-opencv-part-1/ and http://en.wikipedia.org/wiki/SURF) but I don't think it will help in your situation unless you know exactly the type of thing you are looking for in the image.
Sounds like a fun project. Best of luck.
The problem you are trying to solve is very interesting indeed!
I think that you would need to attack it in parts:
As you already pointed out, a sudden change in illumination can be problematic. This is an indicator that you probably need to achieve some sort of illumination-invariant representation of the images you are trying to analyze.
There are plenty of techniques lying around, one I have found very useful for illumination invariance (applied to face recognition) is DoG filtering (Difference of Gaussians)
The idea is that you first convert the image to gray-scale. Then you generate two blurred versions of this image by applying a gaussian filter, one a little bit more blurry than the first one. (you could use a 1.0 sigma and a 2.0 sigma in a gaussian filter respectively) Then you subtract from the less-blury image, the pixel intensities of the more-blurry image. This operation enhances edges and produces a similar image regardless of strong illumination intensity variations. These steps can be very easily performed using OpenCV (as others have stated). This technique has been applied and documented here.
This paper adds an extra step involving contrast equalization, In my experience this is only needed if you want to obtain "visible" images from the DoG operation (pixel values tend to be very low after the DoG filter and are veiwed as black rectangles onscreen), and performing a histogram equalization is an acceptable substitution if you want to be able to see the effect of the DoG filter.
Once you have illumination-invariant images you could focus on the detection part. If your problem can afford having a static camera that can be trained for a certain amount of time, then you could use a strategy similar to alarm motion detectors. Most of them work with an average thermal image - basically they record the average temperature of the "pixels" of a room view, and trigger an alarm when the heat signature varies greatly from one "frame" to the next. Here you wouldn't be working with temperatures, but with average, light-normalized pixel values. This would allow you to build up with time which areas of the image tend to have movement (e.g. the leaves of a tree in a windy environment), and which areas are fairly stable in the image. Then you could trigger an alarm when a large number of pixles already flagged as stable have a strong variation from one frame to the next one.
If you can't afford training your camera view, then I would suggest you take a look at the TLD tracker of Zdenek Kalal. His research is focused on object tracking with a single frame as training. You could probably use the semistatic view of the camera (with no foreign objects present) as a starting point for the tracker and flag a detection when the TLD tracker (a grid of points where local motion flow is estimated using the Lucas-Kanade algorithm) fails to track a large amount of gridpoints from one frame to the next. This scenario would probably allow even a panning camera to work as the algorithm is very resilient to motion disturbances.
Hope this pointers are of some help. Good Luck and enjoy the journey! =D
Use one of the standard measures like Mean Squared Error, for eg. to find out the difference between two consecutive images. If the MSE is beyond a certain threshold, you know that there is some motion.
Also read about Motion Estimation.
if you know that the image will remain reletivly static I would reccomend:
1) look into neural networks. you can use them to learn what defines someone within the image or what is a non-something in the image.
2) look into motion detection algorithms, they are used all over the place.
3) is you camera capable of thermal imaging? if so it may be worthwile to look for hotspots in the images. There may be existing algorithms to turn your webcam into a thermal imager.
Sometimes two image files may be different on a file level, but a human would consider them perceptively identical. Given that, now suppose you have a huge database of images, and you wish to know if a human would think some image X is present in the database or not. If all images had a perceptive hash / fingerprint, then one could hash image X and it would be a simple matter to see if it is in the database or not.
I know there is research around this issue, and some algorithms exist, but is there any tool, like a UNIX command line tool or a library I could use to compute such a hash without implementing some algorithm from scratch?
edit: relevant code from findimagedupes, using ImageMagick
try $image->Sample("160x160!");
try $image->Modulate(saturation=>-100);
try $image->Blur(radius=>3,sigma=>99);
try $image->Normalize();
try $image->Equalize();
try $image->Sample("16x16");
try $image->Threshold();
try $image->Set(magick=>'mono');
($blob) = $image->ImageToBlob();
edit: Warning! ImageMagick $image object seems to contain information about the creation time of an image file that was read in. This means that the blob you get will be different even for the same image, if it was retrieved at a different time. To make sure the fingerprint stays the same, use $image->getImageSignature() as the last step.
findimagedupes is pretty good. You can run "findimagedupes -v fingerprint images" to let it print "perceptive hash", for example.
Cross-correlation or phase correlation will tell you if the images are the same, even with noise, degradation, and horizontal or vertical offsets. Using the FFT-based methods will make it much faster than the algorithm described in the question.
The usual algorithm doesn't work for images that are not the same scale or rotation, though. You could pre-rotate or pre-scale them, but that's really processor intensive. Apparently you can also do the correlation in a log-polar space and it will be invariant to rotation, translation, and scale, but I don't know the details well enough to explain that.
MATLAB example: Registering an Image Using Normalized Cross-Correlation
Wikipedia calls this "phase correlation" and also describes making it scale- and rotation-invariant:
The method can be extended to determine rotation and scaling differences between two images by first converting the images to log-polar coordinates. Due to properties of the Fourier transform, the rotation and scaling parameters can be determined in a manner invariant to translation.
Colour histogram is good for the same image that has been resized, resampled etc.
If you want to match different people's photos of the same landmark it's trickier - look at haar classifiers. Opencv is a great free library for image processing.
I don't know the algorithm behind it, but Microsoft Live Image Search just added this capability. Picasa also has the ability to identify faces in images, and groups faces that look similar. Most of the time, it's the same person.
Some machine learning technology like a support vector machine, neural network, naive Bayes classifier or Bayesian network would be best at this type of problem. I've written one each of the first three to classify handwritten digits, which is essentially image pattern recognition.
resize the image to a 1x1 pixle... if they are exact, there is a small probability they are the same picture...
now resize it to a 2x2 pixle image, if all 4 pixles are exact, there is a larger probability they are exact...
then 3x3, if all 9 pixles are exact... good chance etc.
then 4x4, if all 16 pixles are exact,... better chance.
etc...
doing it this way, you can make efficiency improvments... if the 1x1 pixel grid is off by a lot, why bother checking 2x2 grid? etc.
If you have lots of images, a color histogram could be used to get rough closeness of images before doing a full image comparison of each image against each other one (i.e. O(n^2)).
There is DPEG, "The" Duplicate Media Manager, but its code is not open. It's a very old tool - I remember using it in 2003.
You could use diff to see if they are REALLY different.. I guess it will remove lots of useless comparison. Then, for the algorithm, I would use a probabilistic approach.. what are the chances that they look the same.. I'd based that on the amount of rgb in each pixel. You could also find some other metrics such as luminosity and stuff like that.
I am building a web application using .NET 3.5 (ASP.NET, SQL Server, C#, WCF, WF, etc) and I have run into a major design dilemma. This is a uni project btw, but it is 100% up to me what I develop.
I need to design a system whereby I can take an image and automatically crop a certain object within it, without user input. So for example, cut out the car in a picture of a road. I've given this a lot of thought, and I can't see any feasible method. I guess this thread is to discuss the issues and feasibility of achieving this goal. Eventually, I would get the dimensions of a car (or whatever it may be), and then pass this into a 3d modelling app (custom) as parameters, to render a 3d model. This last step is a lot more feasible. It's the cropping issue which is an issue. I have thought of all sorts of ideas, like getting the colour of the car and then the outline around that colour. So if the car (example) is yellow, when there is a yellow pixel in the image, trace around it. But this would fail if there are two yellow cars in a photo.
Ideally, I would like the system to be completely automated. But I guess I can't have everything my way. Also, my skills are in what I mentioned above (.NET 3.5, SQL Server, AJAX, web design) as opposed to C++ but I would be open to any solution just to see the feasibility.
I also found this patent: US Patent 7034848 - System and method for automatically cropping graphical images
Thanks
This is one of the problems that needed to be solved to finish the DARPA Grand Challenge. Google video has a great presentation by the project lead from the winning team, where he talks about how they went about their solution, and how some of the other teams approached it. The relevant portion starts around 19:30 of the video, but it's a great talk, and the whole thing is worth a watch. Hopefully it gives you a good starting point for solving your problem.
What you are talking about is an open research problem, or even several research problems. One way to tackle this, is by image segmentation. If you can safely assume that there is one object of interest in the image, you can try a figure-ground segmentation algorithm. There are many such algorithms, and none of them are perfect. They usually output a segmentation mask: a binary image where the figure is white and the background is black. You would then find the bounding box of the figure, and use it to crop. The thing to remember is that none of the existing segmentation algorithm will give you what you want 100% of the time.
Alternatively, if you know ahead of time what specific type of object you need to crop (car, person, motorcycle), then you can try an object detection algorithm. Once again, there are many, and none of them are perfect either. On the other hand, some of them may work better than segmentation if your object of interest is on very cluttered background.
To summarize, if you wish to pursue this, you would have to read a fair number of computer vision papers, and try a fair number of different algorithms. You will also increase your chances of success if you constrain your problem domain as much as possible: for example restrict yourself to a small number of object categories, assume there is only one object of interest in an image, or restrict yourself to a certain type of scenes (nature, sea, etc.). Also keep in mind, that even the accuracy of state-of-the-art approaches to solving this type of problems has a lot of room for improvement.
And by the way, the choice of language or platform for this project is by far the least difficult part.
A method often used for face detection in images is through the use of a Haar classifier cascade. A classifier cascade can be trained to detect any objects, not just faces, but the ability of the classifier is highly dependent on the quality of the training data.
This paper by Viola and Jones explains how it works and how it can be optimised.
Although it is C++ you might want to take a look at the image processing libraries provided by the OpenCV project which include code to both train and use Haar cascades. You will need a set of car and non-car images to train a system!
Some of the best attempts I've see of this is using a large database of images to help understand the image you have. These days you have flickr, which is not only a giant corpus of images, but it's also tagged with meta-information about what the image is.
Some projects that do this are documented here:
http://blogs.zdnet.com/emergingtech/?p=629
Start with analyzing the images yourself. That way you can formulate the criteria on which to match the car. And you get to define what you cannot match.
If all cars have the same background, for example, it need not be that complex. But your example states a car on a street. There may be parked cars. Should they be recognized?
If you have access to MatLab, you could test your pattern recognition filters with specialized software like PRTools.
Wwhen I was studying (a long time ago:) I used Khoros Cantata and found that an edge filter can simplify the image greatly.
But again, first define the conditions on the input. If you don't do that you will not succeed because pattern recognition is really hard (think about how long it took to crack captcha's)
I did say photo, so this could be a black car with a black background. I did think of specifying the colour of the object, and then when that colour is found, trace around it (high level explanation). But, with a black object in a black background (no constrast in other words), it would be a very difficult task.
Better still, I've come across several sites with 3d models of cars. I could always use this, stick it into a 3d model, and render it.
A 3D model would be easier to work with, a real world photo much harder. It does suck :(
If I'm reading this right... This is where AI shines.
I think the "simplest" solution would be to use a neural-network based image recognition algorithm. Unless you know that the car will look the exact same in each picture, then that's pretty much the only way.
If it IS the exact same, then you can just search for the pixel pattern, and get the bounding rectangle, and just set the image border to the inner boundary of the rectangle.
I think that you will never get good results without a real user telling the program what to do. Think of it this way: how should your program decide when there is more than 1 interesting object present (for example: 2 cars)? what if the object you want is actually the mountain in the background? what if nothing of interest is inside the picture, thus nothing to select as the object to crop out? etc, etc...
With that said, if you can make assumptions like: only 1 object will be present, then you can have a go with using image recognition algorithms.
Now that I think of it. I recently got a lecture about artificial intelligence in robots and in robotic research techniques. Their research went on about language interaction, evolution, and language recognition. But in order to do that they also needed some simple image recognition algorithms to process the perceived environment. One of the tricks they used was to make a 3D plot of the image where x and y where the normal x and y axis and the z axis was the brightness of that particular point, then they used the same technique for red-green values, and blue-yellow. And lo and behold they had something (relatively) easy they could use to pick out the objects from the perceived environment.
(I'm terribly sorry, but I can't find a link to the nice charts they had that showed how it all worked).
Anyway, the point is that they were not interested (that much) in image recognition so they created something that worked good enough and used something less advanced and thus less time consuming, so it is possible to create something simple for this complex task.
Also any good image editing program has some kind of magic wand that will select, with the right amount of tweaking, the object of interest you point it on, maybe it's worth your time to look into that as well.
So, it basically will mean that you:
have to make some assumptions, otherwise it will fail terribly
will probably best be served with techniques from AI, and more specifically image recognition
can take a look at paint.NET and their algorithm for their magic wand
try to use the fact that a good photo will have the object of interest somewhere in the middle of the image
.. but i'm not saying that this is the solution for your problem, maybe something simpler can be used.
Oh, and I will continue to look for those links, they hold some really valuable information about this topic, but I can't promise anything.