I have a dataset of pairs (input_text, embedding_of_input_text), where embedding_of_input_text is an embedding of dimension 512 produced by another model (DistilBERT) when given input_text as input.
I would like to fine-tune BERT on this dataset so that it learns to produce similar embeddings (i.e. a kind of mimicking).
Furthermore, by default BERT returns embeddings of dimension 768, while here embedding_of_input_text has dimension 512.
What is the correct way to do that within the HuggingFace library?
You can run the dataset through BERT's tokenizer and model, and add a neural network layer on top to project the output to embeddings of dimension 512.
However, what is the meaning of this operation?
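A minimal sketch of one possible setup with the HuggingFace transformers library, assuming the first ([CLS]) token's output is used as the sentence embedding and a linear layer projects 768 to 512; the projection layer, model name and training loop here are illustrative assumptions, not something the question prescribes:

import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")
projection = nn.Linear(768, 512)  # maps BERT's 768-dim output to the 512-dim target space

optimizer = torch.optim.AdamW(list(bert.parameters()) + list(projection.parameters()), lr=2e-5)
loss_fn = nn.MSELoss()

def training_step(input_texts, target_embeddings):
    # target_embeddings: float tensor of shape [batch, 512] taken from the dataset
    batch = tokenizer(input_texts, padding=True, truncation=True, return_tensors="pt")
    outputs = bert(**batch)
    cls_embedding = outputs.last_hidden_state[:, 0]  # [batch, 768], output at the [CLS] position
    predicted = projection(cls_embedding)            # [batch, 512]
    loss = loss_fn(predicted, target_embeddings)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

Whether MSE on the raw vectors or a cosine-based loss is the better objective depends on how the target embeddings will be used.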
I want to train a model on my custom dataset using the YOLOv4 algorithm.
The difference between my data and the COCO dataset is that my object labels look as follows (example lines are shown after this question):
label annotation: <object-class> <x_center> <y_center> <width> <height> <d>
We always use pre-trained files for object detection, but given that my data has a different label format, do I need to change the pre-trained files?
How should I train on this dataset?
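For illustration, here is a standard YOLO label line next to the extended format described above (the numeric values are made up):

standard YOLO line: 0 0.48 0.53 0.12 0.20
custom line above:  0 0.48 0.53 0.12 0.20 3.5

The only structural difference is the extra <d> value per object.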
I am trying to build a Keras model to implement the approach explained in this paper.
Context of my implementation:
I have two different kinds of data representing the same set of classes (labels) that need to be classified. The first kind is image data, and the second kind is EEG data (a time-series sequence).
I know that to classify image data we can use CNN models like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Activation, Flatten, Dense, Dropout, BatchNormalization

model = Sequential()
# example input shape; adjust to the actual image dimensions
model.add(Conv2D(filters=256, kernel_size=(11, 11), strides=(1, 1), padding='valid', input_shape=(224, 224, 3)))
model.add(Activation('relu'))
# flatten the convolutional feature maps before the dense layers
model.add(Flatten())
model.add(Dense(1000))
model.add(Activation('relu'))
model.add(Dropout(0.4))
# Batch Normalisation
model.add(BatchNormalization())
# Output Layer (40 classes)
model.add(Dense(40))
model.add(Activation('softmax'))
And to classify sequence data we can use LSTM models like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Flatten, Dense

model = Sequential()
# example input shape (time steps, EEG channels); adjust to the actual data
model.add(LSTM(units=50, return_sequences=True, input_shape=(128, 14)))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(32, activation='relu'))
model.add(Dense(128, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(40, activation='softmax'))
But the approach of the paper above shows that image features can be mapped to EEG feature vectors through regression, like this:
The first approach is to train a CNN to map images to corresponding
EEG feature vectors. Typically, the first layers of CNN attempt to
learn the general (global) features of the images, which are common
between many tasks, thus we initialize the weights of these layers
using pre-trained models, and then learn the weights of the last
layers from scratch in an end-to-end setting. In particular, we used
the pre-trained AlexNet CNN, and modified it by replacing the
softmax classification layer with a regression layer (containing as
many neurons as the dimensionality of the EEG feature vectors),
using Euclidean loss as the objective function.
The second approach consists of extracting image features using
pre-trained CNN models and then employ regression methods to map
image features to EEG feature vectors. We used our fine-tuned
AlexNet as feature extractors by
reading the output of the last fully connected layer, and then
applied several regression methods (namely, k-NN regression, ridge
regression, random forest regression) to obtain the predicted
feature vectors
I am not able to work out how to code the above two approaches. I have never used a regressor for feature mapping followed by classification. Any leads on this are much appreciated.
In my understanding the training data consists of (eeg_signal, image, class_label) triplets.
1. Train the LSTM model with input=eeg_signal, output=class_label. Loss is cross-entropy.
2. Peel off the last layer of the LSTM model. Let's say the pre-last layer's output is a vector of size 20. Let's call it eeg_representation.
3. Run this truncated model on all your eeg_signal inputs and save the outputs. You will get a tensor of shape [batch, 20].
4. Take that AlexNet mentioned in the paper (or any other image classifier) and peel off its last layer. Let's say the pre-last layer's output is a vector of size 30. Let's call it image_representation.
5. Stitch a linear layer onto the end of that truncated image model. This layer converts image_representation to eeg_representation, so it has a 20 x 30 weight matrix.
6. Train the stitched model on (image, eeg_representation) pairs. Loss is the Euclidean distance.
7. And now the fun part: stitch together the model trained in step 6 and the peeled-off part of the model trained in step 1. If you input an image, you will get class predictions. A rough code sketch of these steps follows below.
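A rough Keras sketch of steps 2-7, assuming lstm_model is the trained classifier from step 1 and eeg_signals / images are the training arrays; MobileNetV2 stands in for AlexNet (Keras has no built-in AlexNet), and the layer indices are illustrative:

import tensorflow as tf
from tensorflow.keras import Model, layers

# steps 2-3: truncate the trained LSTM classifier and compute eeg_representation
eeg_encoder = Model(inputs=lstm_model.input, outputs=lstm_model.layers[-2].output)
eeg_representation = eeg_encoder.predict(eeg_signals)        # one vector per EEG recording

# step 4: pretrained image classifier with its classification head removed
base = tf.keras.applications.MobileNetV2(weights='imagenet', include_top=False, pooling='avg')

# step 5: stitch on a linear layer that maps image features into the EEG representation space
eeg_dim = eeg_representation.shape[1]
image_to_eeg = Model(inputs=base.input, outputs=layers.Dense(eeg_dim, activation=None)(base.output))

# step 6: train on (image, eeg_representation) pairs with a Euclidean-style (MSE) loss
image_to_eeg.compile(optimizer='adam', loss='mse')
image_to_eeg.fit(images, eeg_representation, epochs=10)

# step 7: chain the image-to-EEG model with the peeled-off classification layer from step 1
eeg_classifier_head = lstm_model.layers[-1]                  # the softmax layer of the LSTM model
predictions = eeg_classifier_head(image_to_eeg(images))      # class probabilities from images alone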
This sounds like not a big deal (because we do image classification all the time), but if this really works, it means that this is a prediction of what is "running through our brains" :)
Thank you for bringing up this question and linking the paper.
I feel I just repeated what's in your question and in the paper.
It would be beneficial to have some toy dataset to be able to provide code examples.
Here's a Tensorflow tutorial on how to "peel off" the last layer of a pretrained image classification model.
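For the paper's second approach (pretrained CNN features plus a separate regressor), a rough scikit-learn sketch could look like this; the feature extractor and the choice of ridge regression are assumptions, and images / eeg_representation are the same arrays as in the sketch above:

import numpy as np
from sklearn.linear_model import Ridge
from tensorflow.keras.applications import MobileNetV2

# pretrained CNN used as a fixed feature extractor (the paper uses a fine-tuned AlexNet)
extractor = MobileNetV2(weights='imagenet', include_top=False, pooling='avg')
image_features = extractor.predict(images)           # one feature vector per image

# regression from image features to EEG feature vectors (k-NN or random forest also work)
regressor = Ridge(alpha=1.0)
regressor.fit(image_features, eeg_representation)
predicted_eeg_features = regressor.predict(image_features)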
In keras/examples/image_ocr, the CTC loss is calculated with a TextImageGenerator, which requires a monogram file and a bigram file.
Is it possible to feed only images and their ground-truth values to calculate the loss and predict the text?
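For reference, the CTC loss itself only needs the network's per-timestep predictions and the ground-truth label sequences plus their lengths; the monogram/bigram files in that example are only used to generate the training text. A minimal sketch with Keras's built-in CTC cost (all shapes and values below are dummy placeholders):

import numpy as np
import tensorflow as tf
from tensorflow.keras import backend as K

batch, timesteps, num_chars, max_label_len = 2, 50, 28, 10

# softmax outputs of the recognition network (random here, just to show the shapes)
y_pred = tf.nn.softmax(tf.random.normal([batch, timesteps, num_chars]))
# integer-encoded ground-truth character sequences (last class index is reserved for the CTC blank)
y_true = np.random.randint(0, num_chars - 1, size=(batch, max_label_len))
input_length = np.full((batch, 1), timesteps)       # valid timesteps per sample
label_length = np.full((batch, 1), max_label_len)   # length of each ground-truth label

# the CTC cost needs only predictions, labels and their lengths, no language-model files
loss = K.ctc_batch_cost(y_true, y_pred, input_length, label_length)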
I have used Tensorflow-for-poets to build an image classification model. However, I now want to use the trained model in an object detection model. Can I just import the .pb files directly or do I have to retrain the model?
I am getting this error when I try it:
KeyError: "The name 'image_tensor:0' refers to a Tensor which does not exist. The operation, 'image_tensor', does not exist in the graph."
You cannot directly use the .pb model produced by image classification to perform object detection. You will have to obtain an object detection model, train it, and then use it to detect. There are pretrained object detection models in the TensorFlow object detection model zoo.
Detailed answer below:
Image classification and object detection are two different but very closely related tasks. In fact, Ross Girshick asked a similar question in the famous R-CNN paper:
To what extent do the CNN classification results on ImageNet generalize to object detection results on the PASCAL VOC Challenge?
This question basically means that an image classification model can be used to help object detection, but some more steps are needed. So you cannot just directly use a classification network to do an object detection task. (The error you got is a different issue: you could find the correct tensor name and fix it, but it still does not make sense to use a classification network for object detection that way.)
There is a naive solution for combining the two: you could slide windows of various sizes over the image and classify each crop; this can perform object detection.
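A minimal sketch of that sliding-window idea, assuming classify_crop is any image classifier that returns class probabilities (the function name, stride and threshold are made up for illustration):

import numpy as np

def sliding_window_detect(image, classify_crop, window_sizes=(64, 128, 256), stride=32, threshold=0.9):
    """Run a classifier over crops of the image and keep confident hits as detections."""
    detections = []
    h, w = image.shape[:2]
    for win in window_sizes:
        for y in range(0, h - win + 1, stride):
            for x in range(0, w - win + 1, stride):
                probs = classify_crop(image[y:y + win, x:x + win])  # class probabilities for this crop
                cls = int(np.argmax(probs))
                if probs[cls] >= threshold:
                    detections.append((x, y, win, win, cls, float(probs[cls])))
    # in practice you would also apply non-maximum suppression to merge overlapping boxes
    return detections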
Another solution is an integrated one. To give an example, Faster R-CNN is an object detection network which used VGG as the feature extractor in the original paper. Here you can see that VGG is an image classification network, pretrained on an image classification task, reused as the detector's backbone.
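For illustration, this is roughly how a pretrained classification network is reused as a feature extractor in Keras; the detection head (e.g. a region proposal network) that would consume these features is not shown:

import numpy as np
from tensorflow.keras.applications import VGG16

# include_top=False drops VGG's classification head, keeping only the convolutional backbone
backbone = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))

images = np.random.rand(1, 224, 224, 3).astype('float32')  # dummy batch, just to show the shapes
feature_maps = backbone.predict(images)                     # shape (1, 7, 7, 512)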
I want to use bag of words for content-based image retrieval.
I'm confused as to how to apply bag-of-words to content-based image retrieval.
To clarify:
I've trained my program using SURF features and extracted the BoW descriptors. I feed these to a support vector machine as training data. Then, given a query image, the support vector machine can predict which class the image belongs to.
In other words, given a query image it can find a class. For example, given a query image of a car, the program would return 'car'. How would one find similar images?
Would I, given the class, return images from the training set? Or would the program - given a query image - also return a subset of a test-set on which the SVM predicts the same class?
The title only mentions BoW, but in your text you also use SVMs.
I think the core idea of CBIR is to find the most similar image according to some distance measure. You can do this with BoW features; the SVM is not necessary.
The main purpose of the additional classification is to speed up the process: after you have obtained a class label for your query image, you only need to search this subgroup of your images for the best match. And of course, if the SVM is better at distinguishing certain classes than your distance measure, it might also help to reduce errors.
So the standard workflow would be:
1. obtain the class
2. return the best match from the training samples of this class (a code sketch follows below)
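A minimal sketch of that workflow, assuming bow_descriptors (one BoW histogram per training image), train_labels and a trained svm already exist; the function and variable names are illustrative:

import numpy as np

def retrieve_similar(query_bow, bow_descriptors, train_labels, svm, top_k=5):
    """Predict the class of the query, then rank training images of that class by distance."""
    predicted_class = svm.predict(query_bow.reshape(1, -1))[0]
    # restrict the search to training images of the predicted class
    candidate_idx = np.where(train_labels == predicted_class)[0]
    # Euclidean distance between BoW histograms (chi-squared distance is also common)
    distances = np.linalg.norm(bow_descriptors[candidate_idx] - query_bow, axis=1)
    ranked = candidate_idx[np.argsort(distances)]
    return ranked[:top_k]  # indices of the most similar training images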