Detectron2 - How to use RepeatFactorTrainingSampler in class imbalance problem - imbalanced-data

I'm facing a class imbalance problem in a two-class classification task.
Class 1 has about 25% as many observations as class 2 (class 1 = 100 observations, class 2 = 400).
So I need class 1 to be sampled about 4x as often as it is now.
With the current setting:
cfg.DATALOADER.SAMPLER_TRAIN = "RepeatFactorTrainingSampler"
cfg.DATALOADER.REPEAT_THRESHOLD = 2
Does cfg.DATALOADER.REPEAT_THRESHOLD apply to both classes? How can I increase the sampling for class 1 only?
Thank you.
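For context, here is a minimal sketch of how the threshold acts. It assumes the sampler follows the LVIS repeat-factor rule r(c) = max(1, sqrt(t / f(c))), where f(c) is the fraction of training images containing class c and t is REPEAT_THRESHOLD. It also assumes each image contains only one of the two classes, which the question does not state. The helper name `repeat_factor` is hypothetical and is not part of Detectron2.

```python
import math


def repeat_factor(category_freq, repeat_thresh):
    """Category-level repeat factor under the assumed rule max(1, sqrt(t / f))."""
    return max(1.0, math.sqrt(repeat_thresh / category_freq))


# Image-level class frequencies from the question (assumes one class per image).
freqs = {"class_1": 100 / 500, "class_2": 400 / 500}

for thresh in (0.2, 0.8, 2.0):
    factors = {name: round(repeat_factor(f, thresh), 2) for name, f in freqs.items()}
    print(f"REPEAT_THRESHOLD={thresh}: {factors}")
```

Under these assumptions the threshold is shared by all classes, but only classes whose frequency is below it get repeated. A threshold of 0.8 repeats class 1 images about 2x and leaves class 2 at 1x, while a value of 2 (as in the question) exceeds every possible frequency, so both classes would be oversampled.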

Related

How to add weights in BERT loss function

I have an unbalanced dataset of size N with the following classes:
class 1 - size 0.554*N
class 2 - size 0.271*N
class 3 - size 0.185*N
I'm trying to solve an NER task by fine-tuning the BERT model "dslim/bert-large-NER", but during training my eval F1 score doesn't rise above 0.53.
How can I add weights to the BERT loss function to overcome the low F1 score?
I tried fine-tuning other NER models from Hugging Face, but they didn't help.
I use Trainer from Transformers to train the model.
As mentioned before, the BERT loss function is defined by the model. If you want to modify the loss function, you need to subclass the BERT class and override forward. For example (pseudocode; the elided parts are the unchanged code of the parent class):
    class Bert_modified(BertForTokenClassification):
        def forward(*some input parameters*):
            # *important class code*
            loss_fct = CrossEntropyLoss(weight=class_weights_tensor)
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
            # *important class code*
The BERT classes are defined in the Transformers library source.
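The answer does not say how to choose the weights. One common heuristic is inverse class frequency; a minimal sketch using the proportions from the question follows. The name `class_weights` is hypothetical, and the list would still need to be converted to a tensor (for example with torch.tensor) and ordered to match the model's label ids before being passed as `weight`.

```python
# Class proportions as stated in the question (class 1, class 2, class 3).
freqs = [0.554, 0.271, 0.185]

# Inverse-frequency weights: rarer classes get larger weights.
inverse = [1.0 / f for f in freqs]

# Normalize so the weights average to 1, keeping the loss scale comparable.
mean_inverse = sum(inverse) / len(inverse)
class_weights = [w / mean_inverse for w in inverse]

print([round(w, 3) for w in class_weights])
```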

Create values of new data frame variable based on other column values

I have a question about data set preparation. In a survey, the same people were asked about a number of different variables at two points of measurement. This resulted in a dataset in long format, i.e. information from each participant is stored in two rows. Each row represents the data of this person at the respective time of measurement (see example). Individuals have individual participation codes. The same participation code thus indicates that the data is from the same person.
code  time  risk_perception
DB6M  1     6
DB6M  2     4
TH4D  1     2
TH4D  2     3
Now I would like to create a new variable "risk_perception.complete", which shows me whether the information for each participant is complete. It could be that a person has given no information at either measurement time, or at only one of the two, so that values are missing (NAs). In the new variable I would like to check and code this information for each person. If the person has one or more NAs, a 0 should be coded. If the person has no NAs, a 1 should be coded (see example).
code  time  risk_perception  risk_perception.complete
DB6M  1     6                1
DB6M  2     4                1
TH4D  1     2                1
TH4D  2     3                1
SU6H  1     NA               0
SU6H  2     3                0
VG9S  1     NA               0
VG9S  2     NA               0
Can anyone tell me the best way to program this?
Here is a reproducible example:
data <- data.frame(
  code = c("AH6M", "AH6M", "BD7M", "BD7M", "SH9L", "SH9L"),
  time = c(1, 2, 1, 2, 1, 2),
  risk = c(6, 7, NA, 3, NA, NA)
)
Thank you in advance and best regards!
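The grouping logic described above can be sketched as follows. This is written in Python rather than R, with None standing in for NA, and it reuses the values from the reproducible example; the key name "risk.complete" is chosen here for illustration.

```python
# Rows of the long-format data from the reproducible example (None = NA).
rows = [
    {"code": "AH6M", "time": 1, "risk": 6},
    {"code": "AH6M", "time": 2, "risk": 7},
    {"code": "BD7M", "time": 1, "risk": None},
    {"code": "BD7M", "time": 2, "risk": 3},
    {"code": "SH9L", "time": 1, "risk": None},
    {"code": "SH9L", "time": 2, "risk": None},
]

# Collect every participant code that has at least one missing value.
incomplete_codes = {row["code"] for row in rows if row["risk"] is None}

# Code 0 for all rows of an incomplete participant, 1 otherwise.
for row in rows:
    row["risk.complete"] = 0 if row["code"] in incomplete_codes else 1

for row in rows:
    print(row)
```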

Even an image in data set used to train is giving opposite values when making prediction

I am new to ML and TensorFlow. I am trying to build a CNN to distinguish good images from corrupted images, similar to the rock-paper-scissors tutorial in TensorFlow, except with only two categories.
The Colab Notebook
Model Architecture
train_generator = training_datagen.flow_from_directory(
    TRAINING_DIR,
    target_size=(150, 150),
    class_mode='categorical'
)
validation_generator = validation_datagen.flow_from_directory(
    VALIDATION_DIR,
    target_size=(150, 150),
    class_mode='categorical'
)
model = tf.keras.models.Sequential([
    # Note the input shape is the desired size of the image 150x150 with 3 bytes color
    # This is the first convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(150, 150, 3)),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The second convolution
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The third convolution
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # The fourth convolution
    tf.keras.layers.Conv2D(128, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    # Flatten the results to feed into a DNN
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dropout(0.5),
    # 512 neuron hidden layer
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dense(2, activation='softmax')
])
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
history = model.fit_generator(train_generator, epochs=25, validation_data=validation_generator, verbose=1)
model.save("rps.h5")
The only changes I made were changing the input shape from (150,150,1) to (150,150,3) and changing the last layer's output from 3 neurons to 2. Training consistently gave me an accuracy above 90 on a data set of 600 images in each class. But when I make a prediction using the code from the tutorial, it gives me badly wrong values, even for images that are in the data set.
PREDICTION
Original code in TensorFlow tutorial
for fn in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150, 3))  # changed target_size from (150, 150) to (150, 150, 3)
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
I changed target_size from (150, 150) to (150, 150, 3) because I believed this was needed for my 3-channel input images.
Result
It gives very wrong values, [0,1] [0,1], even for images that are in the data set.
But when I changed the code to this:
for fn in onlyfiles:
    path = fn
    img = image.load_img(path, target_size=(150, 150, 3))
    x = image.img_to_array(img)
    x = np.expand_dims(x, axis=0)
    x /= 255.  # rescale pixel values to [0, 1]
    images = np.vstack([x])
    classes = model.predict(images, batch_size=10)
    print(fn)
    print(classes)
In this case the values look like this:
[[9.9999774e-01 2.2242968e-06]]
[[9.9999785e-01 2.1864464e-06]]
[[9.9999785e-01 2.1641024e-06]]
There are one or two errors, but it is mostly correct.
So my question is: even though the last activation is softmax, why are the outputs now decimal values? Is there a logical mistake in the way I am making predictions? I tried binary as well, but couldn't find much difference.
Please note -
If you change the output classes from 2 to 3, you are asking the model to categorise into 3 classes. This would contradict your problem statement, which separates good images from corrupted ones, i.e. 2 output classes (a binary problem). I think it can be reversed from 3 to 2, if I have understood the question correctly.
Second, the output you are getting is perfectly correct: neural network models output probabilities instead of absolute class values like 0 or 1. Each probability tells you how likely it is that the image belongs to, say, class 0 or class 1.
Also, as mentioned by @BBloggsbott, you just have to use np.argmax on the output array, which returns the index of the class with the highest probability.
Hope this helps.
Thanks.
Softmax returns probability distributions for the vector it gets as input. So, the fact that you are getting decimal values is not a problem. If you want to find the exact class each image belongs to, try using the argmax function on the predictions.
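A minimal sketch of the argmax step, using the softmax outputs quoted in the question. It is written in plain Python so it runs without extra packages; with NumPy the equivalent call would be np.argmax(classes, axis=1). The helper name `argmax` is defined here for illustration.

```python
# Softmax outputs copied from the question, one row per image.
predictions = [
    [9.9999774e-01, 2.2242968e-06],
    [9.9999785e-01, 2.1864464e-06],
    [9.9999785e-01, 2.1641024e-06],
]


def argmax(row):
    """Return the index of the largest value in a list of probabilities."""
    return max(range(len(row)), key=lambda i: row[i])


# Index of the most probable class for each image.
predicted_classes = [argmax(row) for row in predictions]
print(predicted_classes)  # [0, 0, 0]
```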

binary:logistic like parameter in LightGBM

I want my predictions as probabilities between 0 and 1. I already did that in XGBoost, and I want to try LightGBM too, but it is outputting hard predictions (integers only). In XGBoost I could do that by setting the 'objective' parameter to binary:logistic, but LightGBM doesn't seem to have a parameter like that. It only has binary, and that gives 0 or 1 as output.
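To get a value between 0 and 1 in LightGBM, you can leave the "objective" parameter at its default value, which is regression. Note that a regression objective fits the 0/1 labels as numeric targets, so its outputs are continuous but not strictly bounded to [0, 1].
'objective' = 'binary' (returns class label 0 or 1)
'objective' = 'regression' (returns a continuous score, typically between 0 and 1)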
You can do it by setting objective: "multiclass" with num_class: 2 as parameters. The results might not be identical to those of a direct binary classification model, yet I can assure you that there will be no performance loss.
Bonus: as the loss metric, you can use "multi_error" or "multi_logloss" or, interestingly, a combination of both, like:
metric: "multi_error", "multi_logloss"
You can use predict(raw_score=True), which returns the raw scores before the sigmoid transformation rather than class labels.
If you are using the sklearn API, you can keep the objective "binary" and just use predict_proba() instead of predict().
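If raw scores are used, as with predict(raw_score=True), they still have to be mapped into the range 0 to 1. A minimal sketch of that mapping follows. It assumes a binary objective whose raw score is the log-odds, and LightGBM's `sigmoid` scaling parameter left at 1.0. The scores are made-up values for illustration, not model output.

```python
import math


def sigmoid(score):
    """Map a raw log-odds score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical raw scores, standing in for predict(raw_score=True) output.
raw_scores = [-2.0, 0.0, 3.5]

probabilities = [sigmoid(s) for s in raw_scores]
print([round(p, 4) for p in probabilities])
```

With the sklearn API this step is not needed, since predict_proba() already returns probabilities.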

How to choose an random object depending on its attribute

I have an array filled with objects; each object has an attribute named amount. I want to subtract a given number from randomly chosen objects, with the choice weighted by amount.
The following example explains it better:
Dim subtractBy As Integer = 5 'means I want to subtract a total of 5
Dim Generator As System.Random = New System.Random()
While (subtractBy > 0)
    Dim randomItem = array2(Generator.Next(0, array2.Count))
    If (randomItem.amount > 0) Then
        randomItem.changeAmountBy(-1)
        subtractBy = subtractBy - 1
    End If
End While
The problem with that example is that every object has the same chance of being chosen for subtraction. I want every object's chance to grow linearly with its amount attribute, so an object with amount=6 has a 6x higher chance of being selected than an object with amount=1, and so on.
(Although the example is in VB, I also appreciate general non-code answers.)
Thank you
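One general approach is weighted random selection: draw each object with probability proportional to its current amount, then decrement it. A minimal sketch follows, written in Python rather than VB and assuming the amounts are non-negative integers. The function name `subtract_weighted` is hypothetical, and plain integers stand in for the objects.

```python
import random


def subtract_weighted(amounts, subtract_by, rng=random):
    """Remove subtract_by units one at a time, picking items with
    probability proportional to their current amount."""
    amounts = list(amounts)
    if subtract_by > sum(amounts):
        raise ValueError("cannot subtract more than the total amount")
    for _ in range(subtract_by):
        # Items with amount 0 have weight 0 and are never selected.
        index = rng.choices(range(len(amounts)), weights=amounts, k=1)[0]
        amounts[index] -= 1
    return amounts


print(subtract_weighted([6, 1, 0, 3], 5))
```

Because the weights are recomputed from the current amounts on every draw, an object with amount 6 is six times as likely to be picked as one with amount 1, and no retry loop is needed for empty objects.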
