Huggingface load_dataset() method: how to assign the "features" argument?

I'm trying to load a custom dataset to use for fine-tuning a Huggingface model. My data is a CSV file with two columns: one is 'sequence', which is a string; the other is 'label', which is also a string, with 8 classes. I want to load my dataset and assign the type of the 'sequence' column to 'string' and the type of the 'label' column to 'ClassLabel'.
My code is this:
from datasets import Features
from datasets import load_dataset
ft = Features({'sequence': 'str', 'label': 'ClassLabel'})
mydataset = load_dataset("csv", data_files="mydata.csv", features=ft)
Running this code, I got the following error:
TypeError Traceback (most recent call last)
<ipython-input-59-45fedff522e8> in <module>()
7
8 mydataset = load_dataset("csv", data_files="mydata.csv",
----> 9 features= ft)
10
8 frames
/usr/local/lib/python3.7/dist-packages/datasets/features/features.py in get_nested_type(schema)
794
795 # Other objects are callable which returns their data type (ClassLabel, Array2D, Translation, Arrow datatype creation methods)
--> 796 return schema()
TypeError: 'str' object is not callable
Could someone help, please?

You should define your Features similarly to the following:
from datasets import Features, Value, ClassLabel
from datasets import load_dataset
class_names = ['class_label_1', 'class_label_2']
ft = Features({'sequence': Value('string'), 'label': ClassLabel(names=class_names)})
mydataset = load_dataset("csv", data_files="mydata.csv", features=ft)
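For what it's worth, you can sanity-check the resulting schema after loading; with a ClassLabel feature the label column is stored as integer class indices. A minimal sketch, assuming the CSV loads into the default 'train' split:
# Inspect the schema; 'label' should appear as ClassLabel(names=[...]).
print(mydataset["train"].features)
# ClassLabel stores labels as integer indices; int2str/str2int convert
# between the index and the class name.
label_feature = mydataset["train"].features["label"]
print(label_feature.int2str(0))                # e.g. 'class_label_1'
print(label_feature.str2int("class_label_2"))  # e.g. 1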

Related

Why can't we convert flat columns of awkward1 arrays `to_parquet`?

A follow-up from this question: Best way to save a dict of awkward1 arrays?
To save multiple columns of nested awkward1 arrays (with varying lengths):
import awkward1 as ak
import numpy as np

dog = ak.from_iter([[1, 2], [5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
Unfortunately, this doesn't seem to work for flat lists:
import awkward1 as ak
import numpy as np

dog = ak.from_iter([1, 2, 5])
cat = ak.from_iter([4])
pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")
This raises the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-31-7f3a7fefb261> in <module>
3 cat = ak.from_iter([3])
4 pets = ak.zip({"dog": dog[np.newaxis], "cat": cat[np.newaxis]}, depth_limit=1)
----> 5 ak.to_parquet(pets, "pets.parquet")
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in to_parquet(array, where, explode_records, list_to32, string_to32, bytestring_to32, **options)
2983 layout = to_layout(array, allow_record=False, allow_other=False)
2984 iterator = batch_iterator(layout)
-> 2985 first = next(iterator)
2986
2987 if "schema" not in options:
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/awkward/operations/convert.py in batch_iterator(layout)
2978 )
2979 yield pyarrow.RecordBatch.from_arrays(
-> 2980 pa_arrays, schema=pyarrow.schema(pa_fields)
2981 )
2982
~/Programs/anaconda3/envs/tree/lib/python3.7/site-packages/pyarrow/table.pxi in pyarrow.lib.RecordBatch.from_arrays()
TypeError: object of type 'pyarrow.lib.Tensor' has no len()
What is the reason for encountering this error?
What you found is a bug, and now it is fixed: https://github.com/scikit-hep/awkward-1.0/pull/799
What's happening here is that pyarrow can't write pyarrow.lib.Tensor (regular-length lists, such as the one you created with np.newaxis) to Parquet files. Parquet files don't have a concept of "regular-length list," so that makes sense. But rather than converting it, pyarrow hits an unhandled case, in which it fails to find the length of that pyarrow.lib.Tensor. (It's a little odd that pyarrow.lib.Tensor doesn't have a __len__ method, but that's another thing.)
Anyway, with version 1.2.0 of Awkward Array, we'll simply convert regular-length lists into (in principle) variable-length lists when writing to Parquet, since the format doesn't have that type. According to the schedule, version 1.2.0 will be released tomorrow. (This bug-fix is likely the last prerelease.)
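If you are stuck on a version before 1.2.0, one possible workaround (a sketch, not from the original answer) is to build the outer dimension as a variable-length list with from_iter instead of a regular-length one with np.newaxis, so pyarrow can map it to a Parquet list type:
import awkward1 as ak

# Adding the outer nesting inside from_iter yields variable-length lists,
# which to_parquet can handle, instead of the regular-length lists produced
# by indexing with np.newaxis.
dog = ak.from_iter([[1, 2, 5]])
cat = ak.from_iter([[4]])
pets = ak.zip({"dog": dog, "cat": cat}, depth_limit=1)
ak.to_parquet(pets, "pets.parquet")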

KeyError when using non-default models in Huggingface transformers pipeline

I have no problems using the default model in the sentiment analysis pipeline.
from transformers import pipeline

# Allocate a pipeline for sentiment-analysis
nlp = pipeline('sentiment-analysis')
nlp('I am a black man.')
>>>[{'label': 'NEGATIVE', 'score': 0.5723695158958435}]
But when I try to customise the pipeline a little by adding a specific model, it throws a KeyError.
from transformers import AutoTokenizer, AutoModelWithLMHead

nlp = pipeline('sentiment-analysis',
               tokenizer=AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
               model=AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
nlp('I am a black man.')
>>>---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-55-af7e46d6c6c9> in <module>
3 tokenizer = AutoTokenizer.from_pretrained("DeepPavlov/bert-base-cased-conversational"),
4 model = AutoModelWithLMHead.from_pretrained("DeepPavlov/bert-base-cased-conversational"))
----> 5 nlp('I am a black man.')
6
7
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in __call__(self, *args, **kwargs)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
~/opt/anaconda3/lib/python3.7/site-packages/transformers/pipelines.py in <listcomp>(.0)
721 outputs = super().__call__(*args, **kwargs)
722 scores = np.exp(outputs) / np.exp(outputs).sum(-1, keepdims=True)
--> 723 return [{"label": self.model.config.id2label[item.argmax()], "score": item.max().item()} for item in scores]
724
725
KeyError: 58129
I am facing the same problem. I am working with an XLM-R model fine-tuned on the SQuAD v2 data set ("a-ware/xlmroberta-squadv2"). In my case, the KeyError is 16.
Looking for help on the issue, I found this information (link); I hope you find it helpful.
Answer (from the link)
The pipeline throws an exception when the model predicts a token that is not part of the document (e.g. final special token [SEP])
My problem:
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
from transformers import pipeline

nlp = pipeline('question-answering',
               model=XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2'),
               tokenizer=XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2'))
nlp(question="Who was Jim Henson?", context="Jim Henson was a nice puppet")
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
<ipython-input-15-b5a8ece5e525> in <module>()
1 context = "Jim Henson was a nice puppet"
2 # --------------- CON INTERROGACIONES
----> 3 nlp(question = "Who was Jim Henson?", context =context)
1 frames
/usr/local/lib/python3.6/dist-packages/transformers/pipelines.py in <listcomp>(.0)
1745 ),
1746 }
-> 1747 for s, e, score in zip(starts, ends, scores)
1748 ]
1749
KeyError: 16
Solution 1: Adding punctuation at the end of the context
In order to avoid the bug of trying to extract the final token (which may be a special one such as [SEP]), I added an element (in this case a punctuation mark) at the end of the context:
nlp(question = "Who was Jim Henson?", context ="Jim Henson was a nice puppet.")
[OUT]
{'answer': 'nice puppet.', 'end': 28, 'score': 0.5742837190628052, 'start': 17}
Solution 2: Do not use pipeline()
Used directly, outside the pipeline, the model retrieves the correct token indices itself.
from transformers import XLMRobertaTokenizer, XLMRobertaForQuestionAnswering
import torch

tokenizer = XLMRobertaTokenizer.from_pretrained('a-ware/xlmroberta-squadv2')
model = XLMRobertaForQuestionAnswering.from_pretrained('a-ware/xlmroberta-squadv2')

question, text = "Who was Jim Henson?", "Jim Henson was a nice puppet"
encoding = tokenizer(question, text, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']

# The model returns start/end logits over the input tokens.
start_scores, end_scores = model(input_ids, attention_mask=attention_mask, output_attentions=False)[:2]

# Take the span between the highest-scoring start and end positions, then
# round-trip through token ids to get a cleanly decoded answer string.
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores): torch.argmax(end_scores) + 1])
answer = tokenizer.convert_tokens_to_ids(answer.split())
answer = tokenizer.decode(answer)
Update
Looking at your case in more detail, I found that the default model for the Conversational task in the pipeline is distilbert-base-cased (source code).
The first solution I posted is not a good solution after all. Trying other questions, I got the same error. However, the model itself works fine outside the pipeline (as I showed in Solution 2). Thus, I believe that not all models can be used in the pipeline. If anyone has more information about it, please help us out. Thanks.
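For completeness, one thing worth checking (my own note, not from the linked information) is whether the model handed to pipeline() actually has a head for the task. In the original sentiment-analysis example the model was loaded with AutoModelWithLMHead, so config.id2label would likely have no entry for the predicted index, which would produce exactly this kind of KeyError. A hedged sketch with a model fine-tuned for sequence classification (the checkpoint name is only an example):
from transformers import pipeline, AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"  # example checkpoint
nlp = pipeline('sentiment-analysis',
               model=AutoModelForSequenceClassification.from_pretrained(model_name),
               tokenizer=AutoTokenizer.from_pretrained(model_name))
print(nlp('I am a black man.'))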

How to calculate shap values for ADABoost model?

I am running three different models (Random Forest, Gradient Boosting, AdaBoost) and a model ensemble based on these three models.
I managed to use SHAP for GB and RF, but not for AdaBoost, which fails with the following error:
Exception Traceback (most recent call last)
in engine
----> 1 explainer = shap.TreeExplainer(model,data = explain_data.head(1000), model_output= 'probability')
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, model_output, feature_perturbation, **deprecated_options)
110 self.feature_perturbation = feature_perturbation
111 self.expected_value = None
--> 112 self.model = TreeEnsemble(model, self.data, self.data_missing)
113
114 if feature_perturbation not in feature_perturbation_codes:
/home/cdsw/.local/lib/python3.6/site-packages/shap/explainers/tree.py in __init__(self, model, data, data_missing)
752 self.tree_output = "probability"
753 else:
--> 754 raise Exception("Model type not yet supported by TreeExplainer: " + str(type(model)))
755
756 # build a dense numpy version of all the tree objects
Exception: Model type not yet supported by TreeExplainer: <class 'sklearn.ensemble._weight_boosting.AdaBoostClassifier'>
I found this issue on GitHub, which states:
TreeExplainer creates a TreeEnsemble object from whatever model type we are trying to explain, and then works with that downstream. So all you would need to do is add another if statement in the TreeEnsemble constructor, similar to the one for gradient boosting.
But I really don't know how to implement it since I'm quite new to this.
I had the same problem, and what I did was modify the file from the GitHub issue you mention.
In my case I use Windows, so the file is in C:\Users\my_user\AppData\Local\Continuum\anaconda3\Lib\site-packages\shap\explainers, but you can also double-click on the error message and the file will be opened.
The next step is to add another elif, as the answer in the GitHub issue says. In my case I did it from line 404, as follows:
1) Modify the source code.
...
    self.objective = objective_name_map.get(model.criterion, None)
    self.tree_output = "probability"
elif str(type(model)).endswith("sklearn.ensemble.weight_boosting.AdaBoostClassifier'>"):  # From this line I have modified the code
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criterion, for example gini.
    self.tree_output = "probability"  # This is the last line I added
elif str(type(model)).endswith("sklearn.ensemble.forest.ExtraTreesClassifier'>"):  # TODO: add unit test for this case
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
...
Note that for the other models, the shap code needs the 'criterion' attribute, which the AdaBoost classifier doesn't expose directly. In this case the attribute is obtained from the "weak" classifiers AdaBoost was trained with, which is why I added model.base_estimator_.criterion.
Finally you have to import the library again, train your model and get the shap values. I leave an example:
2) Import again the library and try:
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier
import shap
# import some data to play with
iris = datasets.load_iris()
X = iris.data
y = iris.target
ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)
shap_values = shap.TreeExplainer(ADABoost_model).shap_values(X)
shap.summary_plot(shap_values, X, plot_type="bar")
3) Get your new results: the call above generates a SHAP summary bar plot for the AdaBoost model.
It seems that the shap package has been updated and still does not support AdaBoostClassifier. Based on the previous answer, I've modified it to work with the current shap/explainers/tree.py file, at lines 598-610:
### Added AdaBoostClassifier based on the outdated StackOverflow response and GitHub issue here
### https://stackoverflow.com/questions/60433389/how-to-calculate-shap-values-for-adaboost-model/61108156#61108156
### https://github.com/slundberg/shap/issues/335
elif safe_isinstance(model, ["sklearn.ensemble.AdaBoostClassifier", "sklearn.ensemble._weight_boosting.AdaBoostClassifier"]):
    assert hasattr(model, "estimators_"), "Model has no `estimators_`! Have you called `model.fit`?"
    self.internal_dtype = model.estimators_[0].tree_.value.dtype.type
    self.input_dtype = np.float32
    scaling = 1.0 / len(model.estimators_)  # output is average of trees
    self.trees = [Tree(e.tree_, normalize=True, scaling=scaling) for e in model.estimators_]
    self.objective = objective_name_map.get(model.base_estimator_.criterion, None)  # This line is done to get the decision criterion, for example gini.
    self.tree_output = "probability"  # This is the last line added
I am also working on tests so this can be added to the package :)
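If you would rather not patch the installed package, a model-agnostic fallback (my suggestion, not from this thread) is KernelExplainer, which only needs a prediction function and a background sample, at the cost of being much slower than TreeExplainer. A minimal sketch:
import shap
from sklearn import datasets
from sklearn.ensemble import AdaBoostClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

ADABoost_model = AdaBoostClassifier()
ADABoost_model.fit(X, y)

# KernelExplainer works with any model exposing predict_proba; subsample the
# background data and the rows to explain, since it is far slower than
# TreeExplainer.
background = shap.sample(X, 50)
explainer = shap.KernelExplainer(ADABoost_model.predict_proba, background)
shap_values = explainer.shap_values(X[:20])
shap.summary_plot(shap_values, X[:20], plot_type="bar")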

pandas series sort_index() not working with kind='mergesort'

I needed a stable index sort for DataFrames when I hit this problem:
in cases where a DataFrame becomes a Series (when only a single column matches the selection), the kind argument raises an error. See the example:
import pandas as pd
df_a = pd.Series(range(10))
df_b = pd.Series(range(100, 110))
df = pd.concat([df_a, df_b])
df.sort_index(kind='mergesort')
with the following error:
----> 6 df.sort_index(kind='mergesort')
TypeError: sort_index() got an unexpected keyword argument 'kind'
If it stays a DataFrame (more than one column is selected), mergesort works OK.
EDIT:
When selecting a single column from a DataFrame for example:
import pandas as pd
import numpy as np
df_a = pd.DataFrame(np.array(range(25)).reshape(5,5))
df_b = pd.DataFrame(np.array(range(100, 125)).reshape(5,5))
df = pd.concat([df_a, df_b])
the following returns an error:
df[0].sort_index(kind='mergesort')
...since the selection is cast to a pandas Series, and, as pointed out, the pandas.Series.sort_index documentation contains a bug.
However,
df[[0]].sort_index(kind='mergesort')
works alright, since its type continues to be a DataFrame.
pandas.Series.sort_index() has no kind parameter.
Here is the definition of this function for pandas 0.18.1 (file: ./pandas/core/series.py):
# line 1729
@Appender(generic._shared_docs['sort_index'] % _shared_doc_kwargs)
def sort_index(self, axis=0, level=None, ascending=True, inplace=False,
               sort_remaining=True):
    axis = self._get_axis_number(axis)
    index = self.index

    if level is not None:
        new_index, indexer = index.sortlevel(level, ascending=ascending,
                                             sort_remaining=sort_remaining)
    elif isinstance(index, MultiIndex):
        from pandas.core.groupby import _lexsort_indexer
        indexer = _lexsort_indexer(index.labels, orders=ascending)
        indexer = com._ensure_platform_int(indexer)
        new_index = index.take(indexer)
    else:
        new_index, indexer = index.sort_values(return_indexer=True,
                                               ascending=ascending)

    new_values = self._values.take(indexer)
    result = self._constructor(new_values, index=new_index)

    if inplace:
        self._update_inplace(result)
    else:
        return result.__finalize__(self)
In file ./pandas/core/generic.py, line 39:
_shared_doc_kwargs = dict(axes='keywords for axes', klass='NDFrame',
                          axes_single_arg='int or labels for object',
                          args_transpose='axes to permute (int or label for'
                                         ' object)')
So most probably it's a bug in the pandas documentation...
Your df is a Series, not a DataFrame.
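A simple workaround in that pandas version (a sketch based on the observation above, not part of either answer) is to promote the Series to a one-column DataFrame, sort with a stable mergesort, and take the column back out:
import pandas as pd

s = pd.concat([pd.Series(range(10)), pd.Series(range(100, 110))])

# Series.sort_index() in pandas 0.18.x has no `kind` argument, but
# DataFrame.sort_index() does, so wrap, sort stably, then unwrap.
stable_sorted = s.to_frame(name="value").sort_index(kind="mergesort")["value"]
print(stable_sorted.head())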

cx_Oracle equivalent of datetime64[ns]

Given the following kind of data in a column of a Pandas data frame (the Pandas data type is datetime64[ns]):
Contract_EffDt
1974-02-01
I want to import this (along with some string and int columns) into Oracle like this:
import cx_Oracle as cx
.
.
.
cur.setinputsizes(25,25,255,255,25,25,25,255,255,25,25,cx.DATE,int)
cur=con.cursor()
cur.bindarraysize = 1000
cur.executemany('''insert into Table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,\
EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,:12,:13)''', rows)
con.commit()
cur.close()
con.close()
How can I get cx_Oracle to recognize my datetime column for what it is and insert it accordingly? As you can see, I've tried cx.DATE but this doesn't work.
Thanks in advance!
Update with more information:
import cx_Oracle as cx
import pandas as pd
pwd=pwd
namec='table'
con = cx.Connection("table/"+pwd+"#server")
dfc=pd.read_csv(r'\\...\CPSC_Contract_Info_2016_03.csv',
encoding='latin-1')
#taken from https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Downloads/2016/March/CPSC-Enrollment-2016-03.zip
#^contract info file^
rows=dfc.values.tolist()
dfc['Contract_EffDt'] = pd.to_datetime(dfc['Contract_EffDt'])
dfc['Contract_EffDt'] = pd.to_datetime(dfc['Contract_EffDt'],format='%YYYY-%mm-%dd')
dfc['Plan_ID']=dfc['Plan_ID'].astype(str)
dfc.dtypes
Contract_ID object
Plan_ID object
Org_Type object
Plan_Type object
Offers_Part_D object
SNP_Plan object
EGHP object
Org_Name object
Org_Mkt_Name object
Plan_Name object
Parent_Org object
Contract_EffDt datetime64[ns]
YYYYMM int64
dtype: object
cur.setinputsizes(25,25,255,255,25,25,25,255,255,25,25,cx.DATETIME,int)
cur=con.cursor()
cur.bindarraysize = 1000
cur.executemany('''insert into table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,\
EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,to_date(:12,'yyyy-mm-dd'),:13)''', rows)
con.commit()
cur.close()
con.close()
Output:
TypeError Traceback (most recent call last)
<ipython-input-152-6696ee05e0f2> in <module>()
4 cur=con.cursor()
5 cur.bindarraysize = 1000
----> 6 cur.executemany('''insert into table(Contract_ID,Plan_ID,Org_Type,Plan_Type,Offers_Part_D,SNP_Plan,EGHP,Org_Name,Org_Mkt_Name,Plan_Name,Parent_Org,Contract_EffDt,YYYYMM) values (:1,:2,:3,:4,:5,:6,:7,:8,:9,:10,:11,to_date(:12,'yyyy-mm-dd'),:13)''', rows)
7 con.commit()
8 cur.close()
TypeError: expecting numeric data
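The thread ends without an accepted fix, but for what it's worth, one common approach (a hedged sketch, not from the original post) is to run the to_datetime conversion before building rows and pass cx_Oracle plain Python datetime objects, which it binds to DATE columns directly with no to_date() wrapper in the SQL. The connection string, table, and column names below are placeholders:
import cx_Oracle as cx
import pandas as pd

con = cx.connect("user/password@server")  # placeholder credentials
cur = con.cursor()

df = pd.DataFrame({
    "contract_id": ["H0022"],
    "contract_effdt": pd.to_datetime(["1974-02-01"]),
})

# Convert pandas Timestamps to plain datetime objects so cx_Oracle binds them
# as DATE values; build the row list only after all dtype conversions are done.
rows = [(cid, eff.to_pydatetime())
        for cid, eff in zip(df["contract_id"], df["contract_effdt"])]

cur.executemany(
    "insert into contracts (contract_id, contract_effdt) values (:1, :2)",
    rows)
con.commit()
cur.close()
con.close()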
