Detecting Custom Object From Any Image Google Cloud Vertex AI - google-cloud-vertex-ai

I want to detect whether or not an image has a specific (custom) object in it or not. I tried to go through the documentation of google cloud vertex ai, but I am confused. I am not an AI or ML engineer.
They provide the following services for image
Classification (Single Label)
Classification (Multi Label)
Image Object Detection
Image segmentation
Almost All of these features require at least two labels. At least 10 images must be assigned to each label for the features to work.
Now, suppose I have 10 cat images. One of my label name is cat. And then I will have to create another label named non_cat. right? There can be infinite possibilities of an image not having a cat. Does that mean, I upload 10 cat photos and 10 random junk photos in non_cat label??
Currently I have chosen image object detection. It detects multiple attributes of that custom object with confidence score. Should I use these score to identify the custom object in my backend application? Am I going into the right direction?

As per your explanation in comments you're right going with Object Detection model in this case.
Refer the google documentation on how to prepare the data for object detection model.
As per the documentation, the dataset can have minimum 1 label and can go maximum upto 1000 labels for an AutoML or custom-trained model.
Yes. Afer checking the accuracy of your model, you can utilize the confidence score to identify the object in your application.

Related

Vertex AI Object Tracking with only one label

I want to train an object tracking model in Vertex AI for one type of object. The "Train New Model" button says "To train a model, you must have at least two labels and each label included in training must have at least 15 videos assigned to it." I do not find any explanation of this requirement in the documentation. Does anyone know why I must have two labels?
The minimum condition you have mentioned to train a model is required for Vertex AI to know what object to look for, The model will learn to identify the patterns for tracking by setting bounding boxes and label for the object. Generally by having more videos with label will produce a better outcome for the training. To see more details please visit the article here.
Also I believe having more than 1 label is needed for the model to identify an object by having a reference comparison from the 2nd label. This can be handy when you are in the part of evaluating and testing your model as you can tune your score threshold and prediction outcome for a more precise model.

Incremental learning with Google AutoML Vision classification

I want to use Google AutoML vision API for image classification, but with an incremental learning setup - more specifically I should be able to incrementally provide new training data with possibly brand new (and previously unknown) class labels. For example, lets say I train the network today for three labels: A, B and C. Now, after a week, I want to add some new data labeled with a brand new class D. And then after another week, I want to add even newer data labeled with a brand new class E. At this point, the model should be able to classify an input image into any of those five classes, with each incremental addition to the model causing very little accuracy drop.
Is that possible with google AutoML vision API?
Currently you could keep importing new data into existing AutoML dataset and each week train a new model. There is import API and train API.
The assumption of causing very little accuracy drop may be unrealistic. There may valid cases when adding new label will make the accuracy go down. E.g. add labels that are hard to distinguish from previous labels or adding labels without performing data cleanup (adding label and not applying it to existing images in which objects with this label are visible).

Face matching after detection

I am working on a software that will match a captured image (face) with 3/4 images (faces) of same person. Now there are 2 possibilities
1- That the captured image (face) is of the same person whose 3/4 images (faces) are already stored in database
2. Captured image is of a different person
Now i want to get the results of the above 2 scenarios, i.e matched in case 1 and not matched in case 2. I used 40 Gabor filters so that i can get good results. Moreover i get the results in an array (histogram). But It don't seems to work fine and environment conditions like light also have influence on the matching process. Can any one suggest me a good and efficient technique for this thing to achieve.
Well, This is basically face identification problem.
You can use LBP(Local Binary Pattern) to extract features from images.LBP is very robust and llumination invariance method.
You can try following steps-
Training:-
Extract face region (using OpenCV HaarCascade)
Re-size all the extracted face regions to equal size
Divide resized face into sub regions(Ex: 8*9)
Extract LBP features from each region and concatenate them, , because localization of feature is very important
Train SVM by this concatenated feature, with different label to each different person's image
Testing:-
Take a face image and Follow step 1 to 4
Predict using SVM(about which person's image this is)

Template matching - Image subtraction

I have a project where I am required to subtract an empty template image from an incoming user filled image. The document type is a normal Bank cheque.
The aim is to extract the handwritten fields from it by subtracting one image from the empty template image.
The issue what i am facing is in aligning these two images, as there is scaling, translation, rotation etc
Any ideas on how to align the template image with the incoming image?
UPDATE 1:
I am posting an example image from the wikipedia page but in the monochrome format as my image is in monochrome format.
When working with Image processing for industrial projects we have in most of the cases a fiducial. A fiducial is like a mark - can be a hole, an cross mark - that never changes, is always in the same positions.
Generally two fiducials are enough to correct misaligning problems like rotation, translation and also scale. For instance If you know the distance between the two, you can always check it to make sure the scale factor is right, or correct it based on the difference of the current distance against the right distance.
In your case, what I would ask you is: Does the template and the incoming image share any visual sign that are invariant and can easily be segmented?
If you have the answer for that question, all the rest will be more simple - the difference itself is a quite straightforward algorithm.
The basic answer is write a function that takes two images and a 2D transform and tells you how aligned they are once you apply the transform to the target image. The function needs to be continuous based on the transform and have a local minima (0) where the images are aligned perfectly. This is called a cost function.
Then use any optimization algorithm over the function and inputs -- you are trying to optimize the transform (translation, scale, rotation). Examples are hill climbing, genetic, simulated annealing, etc.
There are products that do this -- usually they are called Forms Recognition, Forms Registration, Forms Processing, etc. Some are SDKs, but there are also applications that can do it without programming.
Disclaimer: I work at Atalasoft, where we sell a Forms Processing add-on to our .NET imaging SDK.

How does the image recognition work in Google Shopper?

I am amazed at how well (and fast) this software works. I hovered my phone's camera over a small area of a book cover in dim light and it only took a couple of seconds for Google Shopper to identify it. It's almost magical. Does anyone know how it works?
I have no idea how Google Shopper actually works. But it could work like this:
Take your image and convert to edges (using an edge filter, preserving color information).
Find points where edges intersect and make a list of them (including colors and perhaps angles of intersecting edges).
Convert to a rotation-independent metric by selecting pairs of high-contrast points and measuring distance between them. Now the book cover is represented as a bunch of numbers: (edgecolor1a,edgecolor1b,edgecolor2a,edgecolor2b,distance).
Pick pairs of the most notable distance values and ratio the distances.
Send this data as a query string to Google, where it finds the most similar vector (possibly with direct nearest-neighbor computation, or perhaps with an appropriately trained classifier--probably a support vector machine).
Google Shopper could also send the entire picture, at which point Google could use considerably more powerful processors to crunch on the image processing data, which means it could use more sophisticated preprocessing (I've chosen the steps above to be so easy as to be doable on smartphones).
Anyway, the general steps are very likely to be (1) extract scale and rotation-invariant features, (2) match that feature vector to a library of pre-computed features.
In any case, the Pattern Recognition/Machine Learning methods often are based on:
Extract features from the image that can be described as numbers. For instance, using edges (as Rex Kerr explained before), color, texture, etc. A set of numbers that describes or represents an image is called "feature vector" or sometimes "descriptor". After extracting the "feature vector" of an image it is possible to compare images using a distance or (dis)similarity function.
Extract text from the image. There are several method to do it, often based on OCR (optical character recognition)
Perform a search on a database using the features and the text in order to find the closest related product.
It is also likely that the image is also cuted into subimages, since the algorithm often finds a specific logo on the image.
In my opinion, the image features are send to different pattern classifiers (algorithms that are able to predict a "class" using as input a feature vector), in order to recognize logos and, afterwards, the product itself.
Using this approach, it can be: local, remote or mixed. If local, all processing is carried out on the device, and just the "feature vector" and "text" are sent to a server where the database is. If remote, the whole image goes to the server. If mixed (I think this is the most probable one), partially executed locally and partially at the server.
Another interesting software is the Google Googles, that uses CBIR (content-based image retrieval) in order to search for other images that are related to the picture taken by the smartphone. It is related to the problem that is addressed by Shopper.
Pattern Recognition.

Resources