What is mAP in object detection?

I have been reading through this blog in order to find out what mAP is. In the subheading on AP, they give the example of 5 apple images and work out the average precision. As far as I understand, a false positive is when the object is localised and classified but IoU < 0.5 (in the blog), and a false negative is when the model fails to identify an object at all. So what about objects which are misclassified? Don't they belong to the false positives?
Also, what does the table in the blog really represent? Is the 'correct?' column for one particular example or for all 5 examples together? Could you briefly explain, in your own terms or just following the blog, what is going on?

What is mAP in object detection?
mAP is just mean average precision which is the mean of APs from all the object classes. For example, if you had 5 object classes each of them would have an average precision (AP) and mAP will be the sum of those APs divided by 5.
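As a quick illustration, here is a minimal Python sketch of that averaging step (the per-class AP values are made-up numbers, not from the blog):

    # Hypothetical per-class average precisions for a 5-class detector.
    average_precisions = {
        "apple": 0.72,
        "orange": 0.65,
        "banana": 0.81,
        "pear": 0.58,
        "grape": 0.69,
    }

    # mAP is simply the mean of the per-class APs.
    mAP = sum(average_precisions.values()) / len(average_precisions)
    print(f"mAP = {mAP:.3f}")  # (0.72 + 0.65 + 0.81 + 0.58 + 0.69) / 5 = 0.69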
false positive is when the object is localized and classified but IOU < 0.5
In object detection, we can have multiple classes of objects. The background is also a class, but it is implicit. So, for example, if we had 3 classes of objects (e.g. apple, orange, banana), the network considers it as 4 classes (apple, orange, banana, background). It is just that, in the results, the program doesn't draw bounding boxes around regions classified as background.
False positive means that the object detection model has reported a part of the image as an object of a specific class (e.g. apple), but there is no apple in that part of the image. There is either another fruit, like an orange (misclassification), or no fruit at all (background). Both cases are the same in the eyes of the network, and we consider them false positives. So the network is treating that part of the image as a positive sample for a specific class by mistake. The IoU can have any value in this case (it does not matter). Misclassified objects are also counted as false positives because they are reported as positive (for a specific class) but are in fact negative (they belong to another class or to the background).
False Negative means the model has predicted a part of the image as background when it is actually an object. In other words, the network has failed to detect an object and has reported it as background by mistake.
what does the table in the blog really represent?
The IoU (Intersection over Union) referred to in the blog, which is used to decide whether a detection is correct, is calculated by dividing the area of the intersection between the detected box and the ground truth box (the box drawn by a human as the correct box) by the area of their union.
So if the IoU is more than 0.5, the network is considered to have predicted the apple's position correctly. In the table, 'correct?' is evaluated for each detected apple separately, and the precision is the number of correct predictions divided by the total number of predictions.
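If it helps, here is a rough Python sketch of the IoU computation and the 0.5 check (the box coordinates are illustrative, and boxes are assumed to be axis-aligned (x1, y1, x2, y2) corners):

    def iou(box_a, box_b):
        """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
        x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # One detection vs. its ground-truth box (made-up coordinates).
    pred, gt = (15, 15, 65, 65), (20, 20, 70, 70)
    print(iou(pred, gt) > 0.5)  # True -> counts as "correct?" in the blog's table

    # Precision is then: number of correct predictions / total number of predictions.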

Related

Vertex AI Object Tracking with only one label

I want to train an object tracking model in Vertex AI for one type of object. The "Train New Model" button says "To train a model, you must have at least two labels and each label included in training must have at least 15 videos assigned to it." I do not find any explanation of this requirement in the documentation. Does anyone know why I must have two labels?
The minimum condition you have mentioned is required for Vertex AI to know what object to look for. The model learns the patterns it needs for tracking from the bounding boxes and labels you set for the object, and in general, having more labelled videos will produce a better training outcome. For more details please see the article here.
Also, I believe more than one label is needed so the model can identify an object by comparing it against a reference from the second label. This comes in handy when you evaluate and test your model, as you can tune the score threshold and prediction outcome for a more precise model.

How to detect similar objects in this picture?

I want to find patterns in an image. By "finding patterns" I mean "detecting similar objects", so these patterns shouldn't be high-frequency info like noise.
For example, on this image I'd like to get pattern "window" with ROI/ellipse of each object:
I've read advice to use autocorrelation, FFT, or DCT for this problem. As far as I've understood, autocorrelation and FFT are alternatives, not complementary.
First, I don't know whether it is even possible to get such high-level info in the frequency domain.
As I have an FFT implemented, I tried to use it. This is the spectrogram:
Could you suggest how to further analyze this spectrogram to detect the "window" objects with their spatial locations?
Do I need to find the brightest points/lines on the spectrogram?
Should the FFT be done on image chunks instead of the whole image?
If it's not possible to find such objects with this approach, what would you advise?
Thanks in advance.
P.S. Sorry for large image size.
Beware, this is not my cup of tea, so read with extreme prejudice. IIRC, SIFT/SURF + RANSAC methods are usually used for such tasks.
Identify key points of interest in the image (SIFT/SURF)
This will get you a list of 2D locations in your image with specific features (which you can handle as an integer hash code). I think SIFT (Scale-Invariant Feature Transform) is ideal for this. These methods work similarly to human vision (they identify a specific change in some feature and "ignore" the rest of the image). So instead of matching all the pixels of the image, we cross-match only a few of them.
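As a small, hedged example of this first step, this is roughly how you would extract SIFT key points with OpenCV (assuming opencv-python >= 4.4, where SIFT lives in the main module; the file name is hypothetical):

    import cv2

    img = cv2.imread("windows.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical input image
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)

    # Each key point has a 2D location (kp.pt) and a 128-dimensional descriptor
    # that can be compared against other key points to find similar features.
    print(len(keypoints), descriptors.shape)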
sort by occurrence
Each of the found SIFT points has a feature list. If we build a histogram of these features (count how many similar or identical feature points there are), we can group points with the same occurrence. The idea is that if we have n placings of an object in the image, each of its key points should be duplicated n times in the final image.
So if we have many points that each occur about n times, it hints that there are n similar objects in the image. From these we select just those key points for the next step.
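One possible way to sketch this occurrence counting is to cluster the descriptors and keep the clusters that repeat several times (the clustering method, cluster count and threshold below are arbitrary illustrative choices, not part of the original answer):

    import numpy as np
    from sklearn.cluster import KMeans

    def repeated_feature_groups(descriptors, n_clusters=50, min_count=3):
        """Group SIFT descriptors and return indices of features that repeat."""
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(descriptors)
        groups = []
        for label in range(n_clusters):
            idx = np.where(labels == label)[0]
            if len(idx) >= min_count:   # feature occurs ~n times -> hints at n object copies
                groups.append(idx)      # indices of the matching key points
        return groups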
find object placings
Each object can have a different scale, position and orientation. Let's assume they have the same aspect ratio. The corresponding key points in each object should then have the same relative properties between objects (like the relative angle between key points, normalized distance, etc.).
So the task is to regroup our key points per object so that all the objects have the same key points and the same relative properties.
This can be done by brute force (testing all the combinations and checking the properties), by RANSAC, or by any other method.
Usually we select a first key point (no matter which) and find 2 others that form the same angle and relative distance ratio in all of the objects,
so the angle is the same and |p1-p0| / |p2-p0| is also the same or close. While grouping, realize that key points within one object are usually closer to each other, so we can augment our search by the distance from the first selected key point to decide which object a key point probably belongs to (if we try those first, there is a high probability that we find our combination quickly). All the other points pi can be added similarly, one by one (using p0, p1, pi); see the sketch at the end of this answer.
So I would start with the closest 2 key points (this can sometimes be fooled by overlapping or touching mirrored objects, as a key point from a neighboring object can sometimes be closer than one from the object itself).
After such regrouping, just check whether all the found objects have the same properties (aspect ratio). To visualize them you can find the OBB (Oriented Bounding Box) of the key points (which can also be used for the check).
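A tiny sketch of the relative-property check mentioned above (comparing the angle at p0 and the ratio |p1-p0| / |p2-p0| between candidate objects; the tolerances are illustrative):

    import math

    def triple_signature(p0, p1, p2):
        """Angle at p0 between (p0->p1) and (p0->p2), plus the distance ratio."""
        v1 = (p1[0] - p0[0], p1[1] - p0[1])
        v2 = (p2[0] - p0[0], p2[1] - p0[1])
        d1, d2 = math.hypot(*v1), math.hypot(*v2)
        angle = math.atan2(v1[0] * v2[1] - v1[1] * v2[0],   # cross product
                           v1[0] * v2[0] + v1[1] * v2[1])   # dot product
        ratio = d1 / d2 if d2 > 0 else float("inf")
        return angle, ratio

    def similar(sig_a, sig_b, angle_tol=0.1, ratio_tol=0.1):
        """Two key-point triples likely belong to matching objects."""
        return abs(sig_a[0] - sig_b[0]) < angle_tol and abs(sig_a[1] - sig_b[1]) < ratio_tol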

How to align "tracks" or modular objects in Unity?

I'm developing a simple game where the user can place different but modular objects (for instance: tracks, roads, etc.).
My question is: how do I match and place different objects when one is placed near another?
My first approach is to create a hidden child object (a box) for each modular object and put it on the border where it is possible to place another object (see my image example), so I can use those coordinates (x, y, z) to align the other object.
But I don't know if this is the best approach.
Thanks
Summary:
1. Define what a "snapping point" is
2. Define your threshold
3. Update the new game object's position
Little Explanation
1.
So I suppose that you need a way to define which parts of the object are the "snapping points".
Because they can be clear in some examples, like a cube, where every vertex could be a snapping point, but it's hard to define that for every vertex of amorphous objects.
A simple solution could be the one exposed by #PierreBaret, which consists of defining on your transform component which points are the "snapping points".
The other one is the one you propose: creating empty game objects that act as snapping point locations on the game object.
2. Once you have those snapping points, when you drop your new gameObject you need to define a threshold, since you don't want every object to always snap to the nearest game object.
3. So you define a minimum distance between snapping points; if your snapping point is under that threshold, you will need to update its position to adjust to the snapped point (see the sketch below).
Visual Representation:
Note: the threshold distance is showing just ONE of the 4 current threshold checks on the 4 vertices of the square; this dark blue circle should be replicated 3 more times, one for each green snapping point of the red square.
Of course this method seems expensive; you can make some improvements, like setting a first threshold between game objects and only checking the snapping threshold distance if the gameObject is inside that first threshold.
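To make steps 1-3 concrete, here is a language-agnostic sketch of the snapping logic (written in Python for brevity rather than as a Unity C# script; the threshold value and function names are illustrative assumptions):

    import math

    SNAP_THRESHOLD = 0.25  # world units, tune for your game

    def try_snap(dropped_points, dropped_position, placed_points):
        """dropped_points / placed_points: world-space (x, y, z) snapping points."""
        best = None  # (distance, own snapping point, target snapping point)
        for own in dropped_points:
            for target in placed_points:
                d = math.dist(own, target)
                if d < SNAP_THRESHOLD and (best is None or d < best[0]):
                    best = (d, own, target)
        if best is None:
            return dropped_position               # nothing within the threshold
        _, own, target = best
        offset = tuple(t - o for t, o in zip(target, own))
        # Move the whole dropped object by the offset that makes the two points coincide.
        return tuple(p + off for p, off in zip(dropped_position, offset))

In Unity itself, the same logic can run over the Transform positions of the empty snapping-point children, with Vector3.Distance for the threshold check.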
Hope it helps!
Approach for arbitrary objects/models and deformable models.
[A] A physical approach would consider all the surfaces of the 2 objects, and you might need to check that the objects don't overlap, using dot products between surfaces. That's a bit more computationally expensive, but nothing nasty. If there is no matching involved here, you'll be able to add matching features (see [B]). However, this is the only way to work with non-predefined models or deformable models.
Approaches for matching simple and complex models
[B] Snapping points are a good thing, but they are not sufficient alone. I think you need to make an object have:
a sparse representation (e.g., a complex oriented sphere reduced to a cube),
placed key snapping points,
tagged by polarity or color, and possibly orientation (that gives oriented snapping points); e.g., in the case of rails, you'll want rails to snap {+} with {+} and to forbid {+} with {-}. In the case of a more complex object, or when you have several orientations (e.g., 2 faces of a surface, but only one is a candidate for matching a pair of objects), you'll need more than 2 polarities: 3 different ones per matching candidate surface or feature, hence the colors (or any enumeration). You need 3 different colors to make sure there is a unique 3D space configuration. You create something that is called an enantiomer in chemistry.
You can also use point pair features that describe the relative position and orientation of two oriented points, when an oriented surface is not appropriate.
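As a minimal sketch of the polarity idea (the tag names and rule table are illustrative; following the rails example above, {+} snaps with {+} and {+} with {-} is forbidden):

    # Each snapping point carries a polarity/color tag; only allowed pairs may snap.
    ALLOWED_PAIRS = {("+", "+")}   # extend with other compatible tag pairs as needed

    def can_snap(tag_a, tag_b):
        """True if two snapping points are allowed to connect."""
        return (tag_a, tag_b) in ALLOWED_PAIRS or (tag_b, tag_a) in ALLOWED_PAIRS

    print(can_snap("+", "+"))  # True
    print(can_snap("+", "-"))  # False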
References
Some are computer vision papers or book extracts, but they expose algorithms and concepts to achieve what I developed in my answer.
Model Globally, Match Locally: Efficient and Robust 3D Object Recognition, Drost et al.
3D Models and Matching

Algorithm to simulate color blindness?

There are many tools online that take images and simulate what that image might look like to someone with color blindness. However, I can't find any descriptions of these algorithms.
Is there a standard algorithm used to simulate color blindness? I'm aware that there are many types of color blindness (see the Wikipedia page on the subject for more details), but I'm primarily interested in algorithms for simulating dichromacy.
I had the same frustration and wrote an article comparing open-source color blindness simulations. In short, there are four main algorithms:
Coblis and the "HCIRN Color Blind Simulation function". You'll find this one in many places, and a Javascript implementation by MaPePeR. The full HCIRN simulation function was not properly evaluated but is reasonable in practice. However the "ColorMatrix" approximation by colorjack is very inaccurate and should be totally avoided (the author himself said that). Unfortunately it's still widespread as it was easy to copy/paste.
"Computerized simulation of color appearance for dichromats" by Brettel, Viénot, and Mollon (1997). A very solid reference. Works for all kinds of dichromacies. I wrote a public domain C implementation in libDaltonLens.
"Digital video colourmaps for checking the legibility of displays by dichromats" by Viénot, Brettel and Mollon (1999). A solid reference too, simplifies the 1997 paper for protanopia and deuteranopia (2 of the 3 kinds of color blindness). Also in libDaltonLens.
"A Physiologically-based Model for Simulation of Color Vision Deficiency" by Machado et al. (2009). Precomputed matrices are available on their website, which makes it easy to implement yourself. You just need to add the conversion from sRGB to linearRGB.
Looks like your answer is in the Wikipedia entry you linked.
For example:
Protanopia (1% of males): Lacking the long-wavelength sensitive retinal cones, those with this condition are unable to distinguish between colors in the green–yellow–red section of the spectrum. They have a neutral point at a greenish wavelength around 492 nm – that is, they cannot discriminate light of this wavelength from white.
So you need to de-saturate any colors in the green-yellow-red spectrum to white.
Image color saturation
The other 2 types of dichromacy can be handled similarly.
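For what it's worth, a crude sketch of that de-saturation idea (this is only an approximation of the Wikipedia description, not a colorimetrically accurate simulation; the OpenCV hue band below is an assumption):

    import cv2
    import numpy as np

    def desaturate_gyr(image_bgr, strength=1.0):
        """Pull pixels whose hue falls in the green-yellow-red band toward gray."""
        hsv = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
        hue = hsv[..., 0]                        # OpenCV hue range is 0..179
        gyr = (hue <= 85) | (hue >= 170)         # rough green-yellow-red band
        hsv[..., 1] = np.where(gyr, hsv[..., 1] * (1.0 - strength), hsv[..., 1])
        return cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)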
First we have to understand how the eye works:
A regular/healthy eye has 3 types of cones and 1 type of rods that have activation functions over the visible spectrum of light.
Their activations then pass through some function to produce the signal that goes to your brain. Roughly speaking, the function takes 4 channels as input and produces 3 channels as output (namely lightness, yellow-blue and red-green).
A colorblind person would have one of those two things be different (AFAIK usually/always the first), so for example the person would be missing one type of cone, or the cone's activation function would be different.
The best thing to do would be:
Convert all pixels from RGB space to a combination of frequencies (with intensities). To do this, first calculate the activations of each of the three cones (of a healthy person), then find a "natural" solution for a set of frequencies (+ intensities) that would result in the same activations. Of course, one solution is just the original three RGB frequencies with their intensities, but it is unlikely that the original image actually had those. A natural solution would be, for example, a normal distribution around some frequency (or even just one frequency).
Then, (again for each pixel) calculate the activations of a colorblind person's cones to your combination of frequencies.
Finally, find an RGB value such that a healthy person would have the same activations as the ones the colorblind person has.
Note that, if the way these activations are combined is also different for the relevant type of colorblindness, you might want to carry that out as well in the above steps. (So instead of matching activations, you are matching the result of the function over the activations).

Differentiate objects?

I want to identify a ball in a picture. I am thinking of using the Sobel edge detection algorithm; with it I can detect the round objects in the image.
But how do I differentiate between different objects? For example, there is a football in one picture and in another picture I have the moon. How do I tell which object has been detected?
When I use my algorithm I get a ball in both cases. Any ideas?
Well, if all the objects you would like to differentiate are round, you could even use a Hough transform for circles. This is a very good way of distinguishing round objects.
But your basic problem seems to be classification - sorting the objects on your image into different classes.
For this you don't really need a neural network; you could simply try a nearest-neighbor match. Its functionality is a bit like a neural network's, since you can give it several reference pictures, tell the system what can be seen in them, and it will optimize itself to the best average values for each attribute you detect. From this you get a dictionary of clusters for the different types of objects.
But for this you'll of course first need something that distinguishes a ball from a moon.
Since they are all genuinely round objects (which appear as circles), it is useless to compare circularity, circumference, diameter or area (size only helps if your camera is steady and you know the moon will always have the same size in your images, unlike a ball).
So basically you need to look inside the objects themselves: you can compare their mean color or grayscale value, or the contrast inside the object (the moon will mostly have mid-gray values, whereas a soccer ball consists of black and white parts).
You could also run edge filters on the segmented objects just to determine which is more "edgy" in its texture. But for this there are better methods I guess...
So basically what you need to do first:
Find several attributes that help you distinguish the different round objects (assuming they are already separated)
Implement something to get these values out of a picture of a round object (which is already segmented of course, so it has a background of 0)
Build a system that you feed several images and their classes, so that you have a supervised learning system, and give it several images of each type (there are many implementations of this online)
Now you have your system running and can give other objects to it to classify.
For this you need to segment the objects in the image, e.g. with edge filters or a Hough transform
For each segmented object in an image, run it through your classification system and it should tell you which class (type of object) it belongs to; a rough sketch follows below...
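A rough sketch of that pipeline with two simple attributes (mean gray value and contrast) and a nearest-neighbor match; the feature choice and helper names are illustrative assumptions:

    import numpy as np

    def features(gray_image, mask):
        """gray_image: 2D grayscale array; mask: boolean array of object pixels."""
        pixels = gray_image[mask]
        return np.array([pixels.mean(), pixels.std()])   # mean value + contrast

    def nearest_neighbor(sample, labeled_examples):
        """labeled_examples: list of (feature_vector, class_name) pairs."""
        best_label, best_dist = None, float("inf")
        for vec, label in labeled_examples:
            d = np.linalg.norm(sample - vec)
            if d < best_dist:
                best_label, best_dist = label, d
        return best_label

    # Usage idea: build labeled_examples from segmented reference images of balls
    # and moons, then call nearest_neighbor(features(img, mask), labeled_examples).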
Hope that helps... if not, please keep asking...
When you apply an edge detection algorithm you lose information.
Thus the moon and the ball are the same.
The moon has a different color, a different texture, ... you can use this information to differentiate which object has been detected.
That's a question in AI.
If you think about it, the reason you know it's a ball and not a moon, is because you've seen a lot of balls and moons in your life.
So, you need to teach the program what a ball is, and what a moon is. Give it some kind of dictionary or something.
The problem with a dictionary, of course, is that matching the object against all the objects in the dictionary would take time.
So the best solution would probably be to use neural networks. I don't know what programming language you're using, but there are neural network implementations for most languages I've encountered.
You'll have to read a bit about it and decide what kind of neural network to use, and its architecture.
After you have it implemented it gets easy. You just give it a lot of pictures to learn (neural networks get a vector as input, so you can give it the whole picture).
For each picture you give it, you tell it what it is. So you give it like 20 different moon pictures, 20 different ball pictures. After that you tell it to learn (built in function usually).
The neural network will go over the data you gave it, and learn how to differentiate the 2 objects.
Later you can use the network you trained: give it a picture, and it will give you a score for what it thinks it is, like 30% ball, 85% moon.
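For instance, here is a minimal sketch with scikit-learn's MLPClassifier (one of many possible implementations; the image size and the random stand-in data are illustrative assumptions):

    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # X: each row is a flattened grayscale picture, e.g. 64x64 -> 4096 values.
    # y: the label you tell the network for each picture.
    X = np.random.rand(40, 64 * 64)        # stand-in for 20 ball + 20 moon images
    y = ["ball"] * 20 + ["moon"] * 20

    net = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
    net.fit(X, y)                          # the "learn" step

    # Later: give it a new picture and read off the class probabilities.
    print(net.classes_, net.predict_proba(X[:1]))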
This has been discussed before. Have a look at this question. More info here and here.
