I was wondering if it is possible to mask the input in order to get attention like in GNNs. Specifically, if there was a way to send an attention mask in 2D we could tell a token to attend just to its neighbours and not the other tokens. Is this behaviour possible for example using a pretrained language model from Huggingface?
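For illustration, this is the kind of neighbour-only mask I have in mind; it is only a sketch, and it assumes (on my part) that a HuggingFace encoder like BERT will accept a per-pair attention mask of shape (batch, seq_len, seq_len). The window size and model name are just placeholders.

```python
# Rough sketch: band-diagonal "adjacency" mask so each token attends only to
# itself and its immediate neighbours, similar to edges in a graph.
# Assumption: the model accepts a (batch, seq_len, seq_len) attention mask.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("the quick brown fox jumps over the lazy dog", return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

idx = torch.arange(seq_len)
neighbour_mask = ((idx[None, :] - idx[:, None]).abs() <= 1).long()  # (seq_len, seq_len)
neighbour_mask = neighbour_mask.unsqueeze(0)                        # (1, seq_len, seq_len)

outputs = model(input_ids=inputs["input_ids"], attention_mask=neighbour_mask)
print(outputs.last_hidden_state.shape)
```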
I want to detect whether or not an image has a specific (custom) object in it or not. I tried to go through the documentation of google cloud vertex ai, but I am confused. I am not an AI or ML engineer.
They provide the following services for image
Classification (Single Label)
Classification (Multi Label)
Image Object Detection
Image segmentation
Almost all of these features require at least two labels, and at least 10 images must be assigned to each label for the features to work.
Now, suppose I have 10 cat images. One of my labels is named cat, and then I would have to create another label named non_cat, right? There are infinite possibilities for an image not containing a cat. Does that mean I upload 10 cat photos and 10 random junk photos under the non_cat label?
Currently I have chosen image object detection. It detects multiple attributes of that custom object with confidence scores. Should I use these scores to identify the custom object in my backend application? Am I going in the right direction?
As per your explanation in the comments, you're right to go with the Object Detection model in this case.
Refer to the Google documentation on how to prepare the data for an object detection model.
As per the documentation, the dataset can have a minimum of 1 label and a maximum of 1,000 labels for an AutoML or custom-trained model.
Yes. After checking the accuracy of your model, you can utilize the confidence score to identify the object in your application.
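As a rough illustration of using those scores in a backend, here is a hypothetical sketch; the response field names (displayNames, confidences, bboxes) follow the AutoML object-detection prediction format as I understand it, and the threshold value is an assumption you should tune against your model's evaluation metrics.

```python
# Hypothetical sketch: decide whether an image contains the custom object by
# filtering Vertex AI object-detection predictions on their confidence scores.
CONFIDENCE_THRESHOLD = 0.7  # assumption: tune this against your evaluation

def contains_object(prediction: dict, label: str = "cat") -> bool:
    """Return True if any detected box for `label` clears the threshold."""
    for name, score in zip(prediction.get("displayNames", []),
                           prediction.get("confidences", [])):
        if name == label and score >= CONFIDENCE_THRESHOLD:
            return True
    return False

# Example response shape (values invented for illustration):
sample = {"displayNames": ["cat", "cat"],
          "confidences": [0.91, 0.42],
          "bboxes": [[0.1, 0.6, 0.2, 0.7], [0.5, 0.9, 0.4, 0.8]]}
print(contains_object(sample))  # True
```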
I want to train an object tracking model in Vertex AI for one type of object. The "Train New Model" button says "To train a model, you must have at least two labels and each label included in training must have at least 15 videos assigned to it." I do not find any explanation of this requirement in the documentation. Does anyone know why I must have two labels?
The minimum condition you have mentioned for training a model is required for Vertex AI to know what object to look for. The model learns to identify patterns for tracking from the bounding boxes and labels you set for the object. Generally, having more labeled videos will produce a better training outcome. For more details, please see the article here.
Also, I believe more than one label is needed so the model can identify an object by comparing it against a reference from the second label. This comes in handy when you are evaluating and testing your model, since you can tune the score threshold and prediction outcome for a more precise model.
I am working on video classification. Let's say I sub-sample the video temporally, and for each sub-sample I engineer some features from it. Let's say I can represent these features with a two-dimensional matrix.
The values of this matrix are time-dependent, so for each video I have a set of time-dependent matrices.
So I need to use an RNN to model these time-dependent matrix values and represent the video. This representation should classify the video into classes. In other words, the RNN should be able to classify the video into classes based on these time-dependent matrix values.
I need to know: is this possible with RNNs? Would it be good practice? If so, what guidelines can anyone provide? What is a good library to use? What would be good tutorials? Thanks.
You flatten the frames of the video and use each one as an element of the sequence.
You will have to put a convnet under the RNN, most likely, to get reasonable results.
That is, you feed each frame to a convnet, then flatten the activation map and feed it to the RNN cell.
https://stackoverflow.com/a/36992625/447599
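A minimal sketch of that convnet-then-RNN pipeline, assuming PyTorch; the backbone choice, layer sizes, clip length, and class count are placeholders, not a definitive architecture.

```python
# Sketch: run a convnet over each frame, feed the resulting feature sequence
# to an LSTM, and classify the video from the last hidden state.
import torch
import torch.nn as nn
from torchvision import models

class VideoClassifier(nn.Module):
    def __init__(self, num_classes: int, hidden_size: int = 256):
        super().__init__()
        backbone = models.resnet18(weights=None)
        backbone.fc = nn.Identity()          # keep the 512-d frame feature
        self.cnn = backbone
        self.rnn = nn.LSTM(input_size=512, hidden_size=hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clips):                # clips: (batch, time, 3, H, W)
        b, t, c, h, w = clips.shape
        feats = self.cnn(clips.view(b * t, c, h, w))   # convnet per frame
        feats = feats.view(b, t, -1)                   # back to a sequence
        _, (h_n, _) = self.rnn(feats)                  # last hidden state
        return self.head(h_n[-1])                      # class logits

model = VideoClassifier(num_classes=5)
dummy = torch.randn(2, 8, 3, 112, 112)       # 2 clips of 8 frames each
print(model(dummy).shape)                     # torch.Size([2, 5])
```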
I'm trying to recognize traffic signs. The shape information is very important, and I'd like to combine color and shape information to make a rough classification of traffic signs. The question is: how can I determine the shape (triangle, circle, eight-sided form, etc.) of a traffic sign? Can anyone give some advice? (I know there is the Ramer-Douglas-Peucker algorithm, which might handle the problem.)
I would proceed with the following steps:
Find a way to crop your sign with as little background as possible
Apply a threshold to convert the sign into a simple shape
Create a topological skeleton and apply your algorithm, or a variation of it, to classify the skeleton.
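A rough OpenCV sketch of steps 2 and 3, except that it uses contour approximation (cv2.approxPolyDP, which implements the Ramer-Douglas-Peucker idea mentioned in the question) instead of a topological skeleton; the filename, tolerance, and vertex-count rules are placeholders to adapt.

```python
# Sketch: threshold the cropped sign, approximate its outer contour with
# Ramer-Douglas-Peucker, and classify the shape by the number of vertices.
import cv2

img = cv2.imread("cropped_sign.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
contour = max(contours, key=cv2.contourArea)          # assume the sign outline

epsilon = 0.02 * cv2.arcLength(contour, closed=True)  # RDP tolerance (placeholder)
approx = cv2.approxPolyDP(contour, epsilon, closed=True)

corners = len(approx)
if corners == 3:
    shape = "triangle"
elif corners == 4:
    shape = "rectangle/diamond"
elif corners == 8:
    shape = "octagon"
else:
    shape = "circle (many vertices)"
print(shape)
```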
I want to use bag of words for content-based image retrieval.
I'm confused as to how to apply bag-of-words to content based image retrieval.
To clarify:
I've trained my program using SURF features and extract the BoW descriptors. I feed this to a support vector machine as training data. Then, given a query image, the support vector machine can predict which class a given image belongs to.
In other words, given a query image it can find a class. For example, given a query image of a car, the program would return 'car'. How would one find similar images?
Would I, given the class, return images from the training set? Or would the program - given a query image - also return a subset of a test-set on which the SVM predicts the same class?
The title only mentions BoW, but in your text you also use SVMs.
I think the core idea of CBIR is to find the most similar image according to some distance measure. You can do this with BoW features; the SVM is not necessary.
The main purpose of using an additional classification step is to speed up the process: once you have obtained a class label for your query image, you only need to search that subgroup of your images for the best match. And of course, if the SVM is better at distinguishing certain classes than your distance measure, it might help reduce errors.
So the standard workflow would be:
obtain the class
return the best match from the training samples of this class
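A minimal sketch of that two-step workflow, assuming the BoW histograms for the training images are already computed; the array names, vocabulary size, and distance measure are invented for illustration.

```python
# Sketch: classify the query histogram with an SVM, then rank only that
# class's training images by Euclidean distance between BoW histograms.
import numpy as np
from sklearn.svm import SVC

# Placeholder data: (n_images, vocab_size) BoW histograms and class ids.
train_hists = np.random.rand(100, 200)
train_labels = np.random.randint(0, 5, size=100)

svm = SVC()
svm.fit(train_hists, train_labels)

def retrieve(query_hist, k=5):
    """Return indices of the k most similar training images in the predicted class."""
    cls = svm.predict(query_hist[None, :])[0]          # step 1: obtain the class
    candidates = np.where(train_labels == cls)[0]      # step 2: restrict the search
    dists = np.linalg.norm(train_hists[candidates] - query_hist, axis=1)
    return candidates[np.argsort(dists)[:k]]

print(retrieve(np.random.rand(200)))
```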