What evaluation metric to use for the LightGBM ranker function?

I'm using LGBMRanker from LightGBM but I'm not sure what evaluation metric I should be using. Here is my code:
from sklearn.model_selection import GridSearchCV
import lightgbm as lgb

gbm = lgb.LGBMRanker()
gridParams = {
    'learning_rate': [0.005, 0.01, 0.02],
    'max_depth': [5, 6, 7],
    'n_estimators': [100, 200],
    'num_leaves': [20, 30, 50]
}
lgb_grid = GridSearchCV(estimator=gbm, param_grid=gridParams, scoring='??', cv=3, verbose=2, n_jobs=-1)
What's appropriate here? I don't have any groups; should I specify something?

DCG and NDCG (normalized discounted cumulative gain) are good evaluation metrics for ranking algorithms; they measure the quality of the ranked results. You can read about them here: https://machinelearningmedium.com/2017/07/24/discounted-cumulative-gain/
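As a hedged sketch of what that looks like in code (the relevance labels and scores below are invented example data, not from the question), scikit-learn's ndcg_score can evaluate a ranker's output per query group:

import numpy as np
from sklearn.metrics import ndcg_score

# true relevance labels and model scores for the items of one query group (made-up numbers)
relevance = np.asarray([[3, 2, 3, 0, 1]])
scores = np.asarray([[0.9, 0.7, 0.3, 0.2, 0.8]])

print(ndcg_score(relevance, scores))       # NDCG over all positions
print(ndcg_score(relevance, scores, k=3))  # NDCG@3, truncated to the top 3 positions

Note that LGBMRanker's fit expects a group argument describing how many rows belong to each query, so plugging it straight into GridSearchCV with a string scorer is not straightforward; a custom scorer built with make_scorer around ndcg_score is one possible route.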

Related

Fix tokenization to tensors with padding Huggingface

I'm trying to tokenize my dataset with the following preprocessing function. I've already downloaded the tokenizer with AutoTokenizer from the Spanish BERT version.
`
max_input_length = 280
max_target_length = 280
source_lang = "es"
target_lang = "en"
prefix = "translate spanish_to_women to spanish_to_men: "

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = [ex for ex in examples["hombres_tweet"]]
    model_inputs = tokz(inputs,
                        padding=True,
                        truncation=True,
                        max_length=max_input_length,
                        return_tensors='pt'
                        )
    # Set up the tokenizer for targets
    with tokz.as_target_tokenizer():
        labels = tokz(targets,
                      padding=True,
                      truncation=True,
                      max_length=max_target_length,
                      return_tensors='pt'
                      )
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs
`
And I get the following error when trying to pass my dataset object through the function.
I've already tried dropping the columns that contain strings. I've also seen that when I do not set return_tensors it does tokenize my dataset, but later on I run into the same problem when trying to train my BERT model. Does anyone know what might be going on? *inserts crying face*
I've also tried tokenizing without return_tensors and then calling set_format, but it returns an empty dataset object *inserts another crying face*.
My Dataset looks like the following
And an example of the inputs
So I just do:
tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)
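The error itself isn't reproduced above, so this is only a hedged sketch rather than a confirmed fix, reusing tokz, prefix, and the lengths from the question: a common variant of this preprocessing pads every example to a fixed length and leaves tensor conversion to the dataset/collator instead of returning PyTorch tensors inside map().

def preprocess_function(examples):
    inputs = [prefix + ex for ex in examples["mujeres_tweet"]]
    targets = list(examples["hombres_tweet"])

    # pad to a fixed length so every example has the same shape
    model_inputs = tokz(inputs, padding="max_length", truncation=True,
                        max_length=max_input_length)
    with tokz.as_target_tokenizer():
        labels = tokz(targets, padding="max_length", truncation=True,
                      max_length=max_target_length)

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

With no return_tensors, map() stores plain lists; calling tokenized_datasets.set_format("torch") afterwards, or using a collator such as DataCollatorForSeq2Seq, then yields tensors of uniform shape.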

Get statistical properties of a list of values stored in JSON with Spark

I have my data stored in a JSON format using the following structure:
{"generationId":1,"values":[-36.0431,-35.913,...,36.0951]}
I want to get the distribution of the spacings (differences between consecutive numbers) within values, averaged over the files (generationIds).
The first lines in my Zeppelin notebook are:
import org.apache.spark.sql.SparkSession
val warehouseLocation = "/user/hive/warehouse"
val spark = SparkSession.builder()
  .appName("test")
  .config("spark.sql.warehouse.dir", warehouseLocation)
  .enableHiveSupport()
  .getOrCreate()
val jsonData = spark.read.json("/user/hive/warehouse/results/*.json")
jsonData.createOrReplaceTempView("results")
I just now realized, however, that this was not a good idea. The data from the above JSON now looks like this:
val gen_1 = spark.sql("SELECT * FROM results WHERE generationId = 1")
gen_1.show()
+------------+--------------------+
|generationId| values|
+------------+--------------------+
| 1|[-36.0431, -35.91...|
+------------+--------------------+
All the values are in the same field.
Do you have any idea how to approach this issue in a different way? It does not necessarily have to be Hive; any Spark-related solution is fine.
The number of values can be ~10000, possibly more later. I would like to plot this distribution together with an already known function (simulation vs. theory).
This recursive function, which is not terribly elegant and certainly not battle-tested, can calculate the differences (assuming an even-sized collection):
def differences(l: Seq[Double]): Seq[Double] = {
  if (l.size < 2) {
    Seq.empty[Double]
  } else {
    // absolute difference of the first two elements, then recurse on the tail
    val values = l.take(2)
    Seq(Math.abs(values.head - values(1))) ++ differences(l.tail)
  }
}
Given such a function, you could apply it in Spark like this:
import spark.implicits._  // provides the encoder needed by map on a DataFrame
jsonData.map(r => (r.getLong(0), differences(r.getSeq[Double](1))))
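As a hedged alternative sketch in PySpark (Python rather than the Scala above, and assuming the same file layout), the array can be exploded with positions and a window function used to take consecutive differences, which avoids the recursion entirely:

from pyspark.sql import SparkSession, functions as F, Window

spark = SparkSession.builder.appName("spacing").getOrCreate()
jsonData = spark.read.json("/user/hive/warehouse/results/*.json")

# one row per (generationId, position, value)
exploded = jsonData.select("generationId", F.posexplode("values").alias("pos", "value"))

# difference between each value and its predecessor within the same generation
w = Window.partitionBy("generationId").orderBy("pos")
diffs = (exploded
         .withColumn("diff", F.abs(F.col("value") - F.lag("value").over(w)))
         .where(F.col("diff").isNotNull()))

diffs.groupBy("generationId").agg(F.avg("diff").alias("mean_spacing")).show()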

h2o.auc(perf, xval = TRUE) - what does this call return?

My code is as follows:
gbm.fit.hex = h2o.gbm(x = xcols, y = 1865, training_frame = tr.hex, distribution = "bernoulli",
                      model_id = "gbm.model", key = "gbm.model.key", ntrees = gbm.trees,
                      max_depth = gbm.depth, min_rows = gbm.min.rows, learn_rate = gbm.learn.rate,
                      nbins = 20, balance_classes = gbm.balance, nfolds = gbm.folds)
perf <- h2o.performance(gbm.fit.hex, tr.hex)
a = h2o.auc(perf, xval = TRUE)
What does the h2o.auc call return? Does it return the AUC on the training dataset or on the cross-validation results?
It retrieves the cross-validated AUC.
Since you set the nfolds argument to something non-zero, the h2o.gbm function also performs k-fold cross-validation in addition to training a GBM model on the full training set. In your command, you did not specify a validation set, so the AUC values you can retrieve are training AUC, h2o.auc(perf, train = TRUE), and cross-validated AUC (as above).
If you want to evaluate performance on a separate validation (or test) set, you can pass that frame using the validation_frame argument and retrieve the validation AUC using h2o.auc(perf, valid = TRUE).
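A hedged sketch of the same retrieval pattern using the H2O Python API (the question uses R; the file path and column names below are hypothetical):

import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()
train = h2o.import_file("train.csv")  # hypothetical training file

model = H2OGradientBoostingEstimator(distribution="bernoulli", ntrees=50, nfolds=5)
model.train(x=["x1", "x2", "x3"], y="label", training_frame=train)  # hypothetical columns

print(model.auc(train=True))  # AUC on the training data
print(model.auc(xval=True))   # cross-validated AUC from the nfolds holdout predictions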

groupby.sum() sparse matrix in pandas or scipy: looking for performance

I have the following dataset df:
import numpy
import pandas

# randint's upper bound is exclusive: categories 0-400, ids 0-10000, team 0 or 1
cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids, cat, team], axis=1)
df.columns = ['ids', 'cat', 'team']
Note that there are only ~400 distinct categories in the cat column. Consequently, I want to prepare the dataset for machine-learning classification, i.e., create one column for each distinct category value from 0 to 400 and, for each row, write 1 if the id has the corresponding category and 0 otherwise. My goal is then to group by ids and sum the 1s for every category column, as follows:
df2 = pandas.get_dummies(df['cat'], sparse=True)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
My problem is that the groupby.sum() is very, very slow, far too slow (more than 30 minutes), so I need a different strategy for my calculation. Here is a second attempt:
from sklearn import preprocessing
import numpy

text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team', 'ids'], axis=1).values).astype(int)
But then X is a sparse SciPy matrix. Here I have two choices: either I find a way to groupby.sum() efficiently on this sparse matrix, or I convert it to a dense NumPy array with .toarray(), as follows:
X = X.toarray()
df2 = pandas.DataFrame(X)
df2['ids'] = df['ids']
df3 = df2.groupby('ids').sum()
The problem now is that a lot of memory is consumed by the .toarray(), and the groupby.sum() surely takes a lot of memory too.
So my question is: is there a smart way to solve my problem using a SPARSE matrix while keeping the groupby.sum() EFFICIENT in time?
EDIT: in fact this is a job for pivot_table(), so once your df is created:
df_final = df.pivot_table(index='ids', columns='cat', values='team', aggfunc='count')
df_final.fillna(0, inplace=True)
For the record (though less useful), following my comments on the question:
import numpy
import pandas
from sklearn import preprocessing

cat = pandas.Series(numpy.random.randint(0, 401, 1000000))
ids = pandas.Series(numpy.random.randint(0, 10001, 1000000))
team = pandas.Series(numpy.random.randint(0, 2, 1000000))
df = pandas.concat([ids, cat, team], axis=1)
df.columns = ['ids', 'cat', 'team']
df.sort_values('ids', inplace=True)

text_encoder = preprocessing.OneHotEncoder(dtype=int)
X = text_encoder.fit_transform(df.drop(['team', 'ids'], axis=1).values)

# sum the one-hot rows id by id, relying on the sort above
se_size = df.groupby('ids').size()
ls_rows = []
row_ind = 0
for name, nb_lines in se_size.items():
    ls_rows.append(X[row_ind:row_ind + nb_lines, :].sum(0).tolist()[0])
    row_ind += nb_lines
df_final = pandas.DataFrame(ls_rows,
                            index=se_size.index,
                            columns=text_encoder.categories_[0])
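A hedged sketch of a third option that stays sparse end to end (it assumes df and the one-hot matrix X built as in the attempts above): build a sparse group-indicator matrix and let one sparse matrix product do the groupby-sum.

import numpy as np
import pandas
from scipy import sparse

# map each row to its group number (one group per distinct id)
id_codes, id_index = pandas.factorize(df['ids'])
n_groups, n_rows = len(id_index), len(id_codes)

# G[g, r] = 1 iff row r belongs to group g
G = sparse.csr_matrix((np.ones(n_rows), (id_codes, np.arange(n_rows))),
                      shape=(n_groups, n_rows))

# (n_groups x n_categories) sums, still a sparse matrix; columns follow X's one-hot order
sums = G @ X
df_final = pandas.DataFrame.sparse.from_spmatrix(sums, index=id_index)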

Summarizing Bayesian rating formula

Based on this URL I found for Bayesian Rating, which explains the rating model very well, I wanted to summarize the formula to make it easier for anyone implementing it in an SQL statement. Would it be correct to summarize the formula like this?
avg_num_votes = Sum(votes)/Count(votes) * Count(votes)
avg_rating = sum(votes)/count(votes)
this_num_votes = count(votes)
this_rating = Positive_votes - Negative_votes
Gath
It would look more like this:
avg_num_votes = Count(votes)/Count(items with at least 1 vote)
avg_rating = Sum(votes)/Count(items with at least 1 vote)
this_num_votes = Count(votes for this item)
this_rating = Sum(votes for this item)/Count(votes for this item)
If you are using a simple +/- system, Sum(votes) = Count(positive votes) (i.e., treat + as 1 and - as 0).
See also: Bayesian average.
Should the avg_rating not be:
Sum(votes)/Count(votes)?
Yves
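For completeness, a small hedged sketch of how these pieces usually combine in the weighted (Bayesian) average from the linked article; the exact formula is not restated above, so treat its form as an assumption:

def bayesian_rating(avg_num_votes, avg_rating, this_num_votes, this_rating):
    # items with few votes are pulled toward the site-wide average rating
    return ((avg_num_votes * avg_rating + this_num_votes * this_rating)
            / (avg_num_votes + this_num_votes))

# example: an item with 3 votes averaging 5.0, on a site averaging 4.0 with ~10 votes per item
print(bayesian_rating(avg_num_votes=10, avg_rating=4.0, this_num_votes=3, this_rating=5.0))
# -> approximately 4.23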
