How to pass multiple labels into Huggingface transformers Trainer

The Hugging Face transformers Trainer says "your model can accept multiple label arguments (use the label_names in your TrainingArguments to indicate their name to the Trainer)". Does anyone know the correct format for passing multiple labels into the Trainer? It isn't clear from the documentation. I would like to avoid putting all of the labels into one label column, since some examples have multiple labels at the same time, so a single column would need an entry for every combination of labels (each row has a 1 for each label that applies to the comment, like one-hot encoding but with possibly multiple 1's in a row).
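For concreteness, here is a minimal sketch of the kind of wiring I have in mind (the label column names and the model/dataset placeholders are made up; I don't know whether this is the intended format):

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    # Tell the Trainer which dataset columns are label arguments; these names
    # are hypothetical and would have to match my model's forward() signature.
    label_names=["labels_a", "labels_b"],
)

trainer = Trainer(
    model=model,                  # placeholder: a model whose forward() accepts labels_a and labels_b
    args=training_args,
    train_dataset=train_dataset,  # placeholder: each example carries both label columns
)
trainer.train()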

Related

Writing a formula in a cell in Google Sheets that averages the results from a column derived from expected values in multiple columns

I'm an average user of Google Sheets and I've tried writing/looking up the formula I'm going for, but I haven't had any luck yet.
I have a spreadsheet with multiple columns of values, and I need to display in a single cell the average of a certain set of values, selected according to criteria in multiple other columns.
The flow of information would look something along the lines of:
if value in Column D=L
then
if value in Column J<$1.20
then
Find Avg of all Values in Column N
I'd need the formula to narrow its field of data each time, so the final result is the average of all the values in Column N whose rows have a value in Column J < $1.20 and a value in Column D = L.
I feel like a dummy over here because I just can't narrow down how I should write this flow and get it to work right without adding multiple extra hidden columns. Can anyone help on this one?
I've tried writing the formula multiple different ways but haven't kept it written down to pass on.
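For what it's worth, this kind of nested condition usually collapses into a single AVERAGEIFS; a sketch, with the column ranges assumed from the description (adjust if the values in Column J are stored as text rather than numbers):

=AVERAGEIFS(N:N, D:D, "L", J:J, "<1.2")

This averages every value in Column N whose row has D equal to L and J below $1.20, with no helper columns.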

(Google Sheets) How to remove certain dropdown options after a certain number of cells with said option is met?

I'm currently working on a Google Sheets file to organize the members of my class. I am currently assigning committees and I want them to choose their committee in Google Sheets. However, I want to apply a certain limit per committee.
What I want to happen is: if a certain choice has been chosen, e.g., 5 times, I would like that choice to disappear from the options, and reappear again if a student ever changes their choice. However, I do not know how to do this in terms of a formula or through data validation.
I would really appreciate your help. Thank you!
Here's a toy example you may be able to adapt to your needs:
Create a list of options a,b,c,d,e in A1:E1 of Sheet1
Create a list of the limits for each option in A2:E2 (for instance 2,1,3,5,3)
Create a list of people Person1,Person2,Person3 in G2:G4
Apply data validation to H2:H4:
Use criteria 'drop down (from a range)'
Set the data range to =Sheet1!$A3:$E3 (only lock columns, not rows)
In A3 enter the following formula:
=lambda(people,choices,list,limits,
makearray(counta(people),counta(list),lambda(r,c,
if(index(choices,r)<>index(list,,c),if(countif(choices,index(list,,c))<index(limits,,c),index(list,,c),),index(list,,c)))))(
$G$2:$G$4,$H$2:$H$4,$A$1:$E$1,$A$2:$E$2)
We are using MAKEARRAY to create a 2D array with the list of options on each line, omitting an option from a given line if that person hasn't already selected it AND the preset limit on the number of selections for that option has already been reached. Obviously in a 'real' example you would place the data range for validation in a separate sheet and probably hide and protect that sheet as well. You could also potentially use an array literal of strings rather than a cell range as the list of options in order to make the validation list formula completely self-contained; a variant along those lines is sketched below.
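As a sketch of that last point (using the toy option names from above), the options range can be swapped for an inline array literal so the formula no longer depends on row 1:
=lambda(people,choices,list,limits,
makearray(counta(people),counta(list),lambda(r,c,
if(index(choices,r)<>index(list,,c),if(countif(choices,index(list,,c))<index(limits,,c),index(list,,c),),index(list,,c)))))(
$G$2:$G$4,$H$2:$H$4,{"a","b","c","d","e"},$A$2:$E$2)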

How to use the field cardinality repeating in Render-CSV BW step?

I am building a generic CSV output module with a variable number of columns. The DataFormat in BW (5.14) lets you define a repeating item and thus offers a list of items that I could use to map data to in the RenderCSV step.
But when I run this with data for more than one column (and loops), only one column is generated.
Is the feature broken, or am I using it wrongly?
As an alternative I defined "enough" optional columns in the data format and mapped each field separately - not a really generic solution.
It looks like in BW 5, when using Data Format and Parse Data to parse text, repeating elements aren't supported.
Please see https://support.tibco.com/s/article/Tibco-KnowledgeArticle-Article-27133
The workaround is to use the Data Format resource, Parse Data and Mapper activities together. First use Data Format and Parse Data to parse the text into XML where every element represents one line of the text. Then use a Mapper activity and the tib:tokenize-allow-empty XSLT function to tokenize every line and get sub-elements for each field in the lines. The linked article also has an attached workaround implementation.

Google Sheets data validation different for each row

For my current project, I'm making a sheet that lets me keep track of my D&D characters. I use data validation to remind me what all the options are for various stats, with the information being kept in a separate "RefTables" sheet. Creating a data validation for selecting a character class is very easy, since there are only 14 classes total. What I'm having trouble with is the 'subclass' column. After you choose the character class, you get to choose your specialization, or 'subclass'. This differs depending on the character class you chose.
Right now I can do the proper data validation for each cell individually. In my ref tables sheet, I have a section where it will grab the character class value and put all the 'subclass' options into a row. I can then use data validation in that specific cell to grab the subclass row. This works, but is tedious to do for every single cell.
The formula I would love to put in the range section is
=INDIRECT(CONCATENATE("RefTables!Q",ROW(),":AJ",ROW()))
which appends the row number with the appropriate columns so each row automatically gets its own subclass row (EX: RefTables!Q3:AJ3, RefTables!Q21:AJ21, etc.). I've seen solutions for Excel, but I'm using Google Sheets so I can share this document more easily with friends.
tldr; How to use data validation in Google Sheets that is slightly different for each row
Unfortunately, this is possible to achieve only the manual way, setting it up for every single cell/row. Google Sheets' data validation does not support injecting a list of values via a formula.

How do h2o models determine what columns to use for predictions (position, name, etc.)?

I'm using the H2O Python API to train some models and am a bit confused about how to correctly use some parts of the API: specifically, which columns should be ignored in a training dataset, how models look for the actual predictor features in a data set when using the model's predict() method, and how weight columns should be handled (when the actual prediction datasets don't really have weights).
The details of the code here (I think) are not majorly important, but the basic training logic looks something like
drf_dx = h2o.h2o.H2ORandomForestEstimator(
    # denoting update version name by epoch timestamp
    model_id='drf_dx_v'+str(version)+'t'+str(int(time.time())),
    response_column='dx_outcome',
    ignored_columns=[
        'ucl_id', 'patient_id', 'account_id', 'tar_id', 'charge_line', 'ML_data_begin',
        'procedure_outcome', 'provider_outcome',
        'weight'
    ],
    weights_column='weight',
    ntrees=64,
    nbins=32,
    balance_classes=True,
    binomial_double_trees=True)
.
.
.
drf_dx.train(x=X_train, y=Y_train,
             training_frame=train_u, validation_frame=val_u,
             max_runtime_secs=max_train_time_hrs*60*60)
(note the ignored columns) and the prediction logic just looks like
preds = model.predict(X)
where X is some (H2O) dataframe with more (or fewer) columns than the X_train used to train the model (it includes some columns for post-processing exploration in a Jupyter notebook). E.g. the X_train columns may look like
<columns to ignore (as seen in the code)> <columns to use as features for training> <outcome label>
and the X columns may look like
<columns to ignore (as seen in the code)> <EVEN MORE COLUMNS TO IGNORE> <columns to use as features for training>
My question is: is this going to confuse the model when making predictions? I.e. does the model find the columns to use as features by column name (in which case, I don't think the different dataframe width would be a problem), or does it go by column position (in which case adding more data columns to each sample would shift the positions and become a problem), or something else? What happens given that these extra columns were not listed in the ignored_columns arg in the model constructor?
** Slight aside: should the weights_column name be in the ignored_columns list or not? The example in the docs (http://docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/algo-params/weights_column.html#weights-column) seems to use it as a predictor feature, and also seems to recommend it:
For scoring, all computed metrics will take the observation weights into account (for Gains/Lift, AUC, confusion matrices, logloss, etc.), so it’s important to also provide the weights column for validation or test sets if you want to up/down-weight certain observations (ideally consistently between training and testing).
but these weight values are not something that comes with the data used in actual predictions.
I've summarized your question into a few distinct parts, so the answers will be in a Q/A type fashion.
1) When I use my_model.predict(X), how does H2O-3 know which columns to predict with?
H2O-3 will use the columns that you passed as predictors when you built your model (i.e. whatever you passed to the x argument, or else all the columns in your training_frame that were not ignored via ignored_columns, passed as the target to the y argument, or dropped because they have a constant value). My recommendation would be to use the x argument to specify your predictors and ignore the ignored_columns parameter. If X, the new dataframe you are predicting on, includes columns that were not used when you were building the model, those columns will be ignored - so it goes by column names, not column positions. A sketch is below.
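As a rough sketch of that recommendation (the feature names here are hypothetical, and train_u / X stand in for the frames from your question):

import h2o
from h2o.estimators import H2ORandomForestEstimator

h2o.init()

predictors = ['age', 'bmi', 'lab_result']   # hypothetical feature column names
drf = H2ORandomForestEstimator(ntrees=64)
drf.train(x=predictors, y='dx_outcome', training_frame=train_u)

# X may carry extra exploration columns; scoring matches on column names,
# so anything not in `predictors` is simply ignored.
preds = drf.predict(X)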
2) Should the weights column name be in the ignored column list?
No, if you pass the weights column to the ignored column list, that column will not be considered in any fashion during the model building phase. In fact, if you test this out, you should get a null pointer error or something similar.
3) Why is the "weights" column specified as a predictor and as the weights_column in the code example from the docs?
This is a great question! I've created two Jira tickets: one to update the documentation to clear up the confusion, and another to potentially add a user warning.
The short answer is: if you pass the same column to the predictors argument x and the weights_column argument, that column will only be used as a weight - it will not be used as a feature.
4) Does the user guide recommend using the weights as a feature and as a weight?
No, in the paragraph you are pointing to, the recommendation is to ensure that the column you pass as your weights_column exists in your training frame and validation frame - not that it should also be included as a feature.
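Putting points 2)-4) together, a small sketch (again reusing the frame and column names from your question): keep the weight column in the frame, name it via weights_column, and leave it out of both x and ignored_columns:

from h2o.estimators import H2ORandomForestEstimator

# 'weight' is neither a predictor nor ignored; it is only the weights column
predictors = [c for c in train_u.columns if c not in ('dx_outcome', 'weight')]
drf = H2ORandomForestEstimator(weights_column='weight', ntrees=64)
drf.train(x=predictors, y='dx_outcome',
          training_frame=train_u, validation_frame=val_u)  # val_u also carries the weight column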
