Getting output probabilities from logits in T0 - probability

I am trying to do prompting using T0. I want the probability scores of the query outputs. For example, for the given prompt: Is this review positive or negative? Review: this is the best cast iron skillet you will ever buy, I wanted the following values: P(positive|prompt)
Is there a way I can condition the language model on "positive" such that I get the probability score for positive for multiple such reviews?
I am trying to do something like this:
`
inputs = tokenizer.encode(test_prompt, return_tensors="pt")
decoder_input_ids = tokenizer.encode("positive",return_tensors="pt")
outputs = model(inputs,decoder_input_ids=decoder_input_ids)
pred_logits = outputs.logits
probs=nn.functional.softmax(pred_logits,dim=-1)
`
The output dimension of probs is [1,2,32128]. I am confused and wanted to know if this is the right way to get the scores. Also, is this the right way to condition on "positive" ? by passing the positive encoding in decoder_input_ids?

Related

For loop for a regression model with increasing number of predictors

Ho can I create a loop to fit models with increasing number of predictors. The first iteration should
use one predictor, then two, and so on until all predictors are included. I have to compute the RMSE
on both the training and test data for this model, and store these values in a list/array.
predictors = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors',
'waterfront','view','condition','grade','sqft_above',
'sqft_basement','yr_built','yr_renovated','zipcode','lat',
'long','sqft_living15','sqft_lot15']
models = []
formula = 'price ~ bedrooms'
for p in predictors[0:19]:
formula = formula + p
print(formula)
model_linear_kc_5 = smf.ols(formula=formula, data=df_train_kc)
models.append(model_linear_kc_5.fit())
My code so far but I know this isn't right and am stuck how to do it.
I have to put print(formula) inside loop and then adjust the formula = … line until it does what I want it to.
I would really appreciate help in this regard. Thank you.

How do I add noise/variability to a dataset in Python, given the CV?

Given a dataset of blood results, say cholesterol level, and knowing that the instrument that produced those results is subject to a known degree of variability, how would I add that variability back into the dataset? i.e. I want to assume the result in the original dataset is the true/mean value, and then produce new results that are subject to the known variability of the instrument.
In Excel you use =NORM.INV(RAND(), mean, std_dev), where RAND() provides a random value between 0 and 1, "mean" will be the original value and I have the CV so I can calculate the SD. NORM.INV then provides the inverse of the cumulative normal distribution function.
I've done the following to create a new column with my new values, but would like to know if it is valid (i.e., will each row have a different random number between 0 and 1 as the probability? and is this formula equivalent to NORM.INV?
df8000['HDL_1'] = norm.ppf(random(), loc = df8000['HDL_0'], scale = TAE_df.loc[0,'HDL'])
Thanks in advance!

EasyPredictModelWrapper giving wrong prediction

public BinomialModelPrediction predictBinomial(RowData data) throws PredictException {
double[] preds = this.preamble(ModelCategory.Binomial, data);
BinomialModelPrediction p = new BinomialModelPrediction();
double d = preds[0];
p.labelIndex = (int)d;
String[] domainValues = this.m.getDomainValues(this.m.getResponseIdx());
p.label = domainValues[p.labelIndex];
p.classProbabilities = new double[this.m.getNumResponseClasses()];
System.arraycopy(preds, 1, p.classProbabilities, 0, p.classProbabilities.length);
if(this.m.calibrateClassProbabilities(preds)) {
p.calibratedClassProbabilities = new double[this.m.getNumResponseClasses()];
System.arraycopy(preds, 1, p.calibratedClassProbabilities, 0, p.calibratedClassProbabilities.length);
}
return p;
}
Eg: classProbabilities =[0.82333,0,276666]
labelIndex = 1
label = true
domainValues = [false,true]
what does this labelIndex signifies and does the class probabilities
order is same as the domain value order ,If order is same then it means that here probability of false is 0.82333 and probability of true is 0.27666 but why is this labelIndex showing as 1 and label as true.
Please help me to figure out this issue.
Like Tom commented, the prediction is not "wrong". You can infer from this that the threshold H2O has chosen is less than 0.27666. You probably have imbalanced training data, otherwise H2O would have not picked a low threshold for classifying a predicted value of 0.27666 as a 1. Does your training set include fewer examples of the positive class than the negative class?
If you don't like that threshold for whatever reason, then you can manually create your own. Just make sure you know how to properly evaluate the effect of using different thresholds on the performance of your model, otherwise I'd recommend just using the default threshold.
The name, "classProbabilities" is a misnomer. These are not actual probabilities, they are predicted values, though people often use the terms interchangeably. Binary classification algorithms produce "predicted values" that look like probabilities when they're between 0 and 1, but unless a calibration process is performed, they are not going to represent the probabilities. Calibration is not necessarily a straight-forward process and there are many techniques. Here's some more info about calibration methods for imbalanced data. In H2O, you can perform calibration using Platt scaling using the calibrate_model option. But this is probably not really necessary to what you're trying to do.
The proper way to use the raw output from a binary classification model is to only look at the predicted value for the positive class (you can simply ignore the predicted value for the negative class). Then you choose a threshold which suits your needs, or you can use the default threshold in H2O, which is chosen to maximize the F1 score. Some other software will use a hardcoded threshold of 0.5, but that will be a terrible choice if you don't have an even number of positive and negative examples in your training data. If you have only a few positive examples in your training data, then the best threshold will be something much lower than 0.5.

Assignment problems with simple random number generation in Modelica

I am relatively new to Modelica (Dymola-environment) and I am getting very desperate/upset that I cannot solve such a simple problem as a random number generation in Modelica and I hope that you can help me out.
The simple function random produces a random number between 0 and 1 with an input seed seedIn[3] and produces the output seed seedOut[3] for the next time step or event. The call
(z,seedOut) = random(seedIn);
works perfectly fine.
The problem is that I cannot find a way in Modelica to compute this assignment over time by using the seedOut[3] as the next seedIn[3], which is very frustrating.
My simple program looks like this:
*model Randomgenerator
Real z;
Integer seedIn[3]( start={1,23,131},fixed=true), seedOut[3];
equation
(z,seedOut) = random(seedIn);
algorithm
seedIn := seedOut;
end Randomgenerator;*
I have tried nearly all possibilities with algorithm assignments, initial conditions and equations but none of them works. I just simply want to use seedOut in the next time step. One problem seems to be that when entering into the algorithm section, neither the initial conditions nor the values from the equation section are used.
Using the 'sample' and 'reinit' functions the code below will calculate a new random number at the frequency specified in 'sample'. Note the way of defining the "start value" of seedIn.
model Randomgenerator
Real seedIn[3] = {1,23,131};
Real z;
Real[3] seedOut;
equation
(z,seedOut) = random(seedIn);
when sample(1,1) then
reinit(seedIn,pre(seedOut));
end when;
end Randomgenerator;
The 'pre' function allows the use of the previous value of the variable. If this was not used, the output 'z' would have returned a constant value. Two things regarding the 'reinint' function, it requires use of 'when' and requires 'Real' variables/expressions hence seedIn and seedOut are now defined as 'Real'.
The simple "random" generator I used was:
function random
input Real[3] seedIn;
output Real z;
output Real[3] seedOut;
algorithm
seedOut[1] :=seedIn[1] + 1;
seedOut[2] :=seedIn[2] + 5;
seedOut[3] :=seedIn[3] + 10;
z :=(0.1*seedIn[1] + 0.2*seedIn[2] + 0.3*seedIn[3])/(0.5*sum(seedIn));
end random;
Surely there are other ways depending on the application to perform this operation. At least this will give you something to start with. Hope it helps.

Python Birthday paradox math not working

it run corectly but it should have around 500 matches but it only has around 50 and I dont know why!
This is a probelm for my comsci class that I am having isues with
we had to make a function that checks a list for duplication I got that part but then we had to apply it to the birthday paradox( more info here http://en.wikipedia.org/wiki/Birthday_problem) thats where I am runing into problem because my teacher said that the total number of times should be around 500 or 50% but for me its only going around 50-70 times or 5%
duplicateNumber=0
import random
def has_duplicates(listToCheck):
for i in listToCheck:
x=listToCheck.index(i)
del listToCheck[x]
if i in listToCheck:
return True
else:
return False
listA=[1,2,3,4]
listB=[1,2,3,1]
#print has_duplicates(listA)
#print has_duplicates(listB)
for i in range(0,1000):
birthdayList=[]
for i in range(0,23):
birthday=random.randint(1,365)
birthdayList.append(birthday)
x= has_duplicates(birthdayList)
if x==True:
duplicateNumber+=1
else:
pass
print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with atleast one match. The approximate probibilatiy is", round(((duplicateNumber/1000)*100),3),"%"
This code gave me a result in line with what you were expecting:
import random
duplicateNumber=0
def has_duplicates(listToCheck):
number_set = set(listToCheck)
if len(number_set) is not len(listToCheck):
return True
else:
return False
for i in range(0,1000):
birthdayList=[]
for j in range(0,23):
birthday=random.randint(1,365)
birthdayList.append(birthday)
x = has_duplicates(birthdayList)
if x==True:
duplicateNumber+=1
print "after 1000 simulations with 23 students there were", duplicateNumber,"simulations with atleast one match. The approximate probibilatiy is", round(((duplicateNumber/1000.0)*100),3),"%"
The first change I made was tidying up the indices you were using in those nested for loops. You'll see I changed the second one to j, as they were previously bot i.
The big one, though, was to the has_duplicates function. The basic principle here is that creating a set out of the incoming list gets the unique values in the list. By comparing the number of items in the number_set to the number in listToCheck we can judge whether there are any duplicates or not.
Here is what you are looking for. As this is not standard practice (to just throw code at a new user), I apologize if this offends any other users. However, I believe showing the OP a correct way to write a program should be could all do us a favor if said user keeps the lack of documentation further on in his career.
Thus, please take a careful look at the code, and fill in the blanks. Look up the python doumentation (as dry as it is), and try to understand the things that you don't get right away. Even if you understand something just by the name, it would still be wise to see what is actually happening when some built-in method is being used.
Last, but not least, take a look at this code, and take a look at your code. Note the differences, and keep trying to write your code from scratch (without looking at mine), and if it messes up, see where you went wrong, and start over. This sort of practice is key if you wish to succeed later on in programming!
def same_birthdays():
import random
'''
This is a program that does ________. It is really important
that we tell readers of this code what it does, so that the
reader doesn't have to piece all of the puzzles together,
while the key is right there, in the mind of the programmer.
'''
count = 0
#Count is going to store the number of times that we have the same birthdays
timesToRun = 1000 #timesToRun should probably be in a parameter
#timesToRun is clearly defined in its name as well. Further elaboration
#on its purpose is not necessary.
for i in range(0,timesToRun):
birthdayList = []
for j in range(0,23):
random_birthday = random.randint(1,365)
birthdayList.append(random_birthday)
birthdayList = sorted(birthdayList) #sorting for easier matching
#If we really want to, we could provide a check in the above nester
#for loop to check right away if there is a duplicate.
#But again, we are here
for j in range(0, len(birthdayList)-1):
if (birthdayList[j] == birthdayList[j+1]):
count+=1
break #leaving this nested for-loop
return count
If you wish to find the percent, then get rid of the above return statement and add:
return (count/timesToRun)
Here's a solution that doesn't use set(). It also takes a different approach with the array so that each index represents a day of the year. I also removed the hasDuplicate() function.
import random
sim_total=0
birthdayList=[]
#initialize an array of 0's representing each calendar day
for i in range(365):
birthdayList.append(0)
for i in range(0,1000):
first_dup=True
for n in range(365):
birthdayList[n]=0
for b in range(0, 23):
r = random.randint(0,364)
birthdayList[r]+=1
if (birthdayList[r] > 1) and (first_dup==True):
sim_total+=1
first_dup=False
avg = float(sim_total) / 1000 * 100
print "after 1000 simulations with 23 students there were", sim_total,"simulations with atleast one duplicate. The approximate problibility is", round(avg,3),"%"

Resources