pytorch-lightning example throws assertion error - pytorch-lightning

I'm using the following code out of the box from this url: https://lightning-transformers.readthedocs.io/en/latest/tasks/nlp/question_answering.html
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.question_answering import (
    QuestionAnsweringTransformer,
    SquadDataModule,
)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
model = QuestionAnsweringTransformer(pretrained_model_name_or_path="bert-base-uncased")
dm = SquadDataModule(
    batch_size=1,
    dataset_config_name="plain_text",
    max_length=384,
    version_2_with_negative=False,
    null_score_diff_threshold=0.0,
    doc_stride=128,
    n_best_size=20,
    max_answer_length=30,
    tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
which throws this error
AssertionError Traceback (most recent call last)
<ipython-input-2-0b608c02a52e> in <module>
14 trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
15
---> 16 trainer.fit(model, dm)
16 frames
/usr/local/lib/python3.8/dist-packages/lightning_transformers/task/nlp/question_answering/datasets/squad/processing.py in postprocess_qa_predictions(examples, features, predictions, version_2_with_negative, n_best_size, max_answer_length, null_score_diff_threshold, output_dir, prefix)
245 all_start_logits, all_end_logits, example_ids = predictions
246
--> 247 assert len(predictions[0]) == len(features), f"Got {len(predictions[0])} predictions and {len(features)} features."
248
249 # Build a map example to its corresponding features.
AssertionError: Got 2 predictions and 10784 features.
I was simply trying to get a single example from the documentation to run in Google Colab before investigating whether the library would meet my use case. Seeing an error when I use the example as is makes further investigation disheartening. Nothing comes up when I google "AssertionError: Got 2 predictions and 10784 features."

Related

Pytorch is not working with DistributedDataParallel for multi gpu training

I am trying to train my model on multiple GPUs. I imported the following libraries and added code for it:
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed import init_process_group, destroy_process_group
Initialization
def ddp_setup(rank: int, world_size: int):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "12355"
    os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL"
    init_process_group(backend="gloo", rank=0, world_size=1)
my model
model = CMGCNnet(config,
                 que_vocabulary=glovevocabulary,
                 glove=glove,
                 device=device)
model = model.to(0)
if -1 not in args.gpu_ids and len(args.gpu_ids) > 1:
    model = DDP(model, device_ids=[0,1])
It throws the following error:
config_yml : model/config_fvqa_gruc.yml
cpu_workers : 0
save_dirpath : exp_test_gruc
overfit : False
validate : True
gpu_ids : [0, 1]
dataset : fvqa
Loading FVQATrainDataset…
True
done splitting
Loading FVQATestDataset…
Loading glove…
Building Model…
Traceback (most recent call last):
  File "trainfvqa_gruc.py", line 512, in <module>
    train()
  File "trainfvqa_gruc.py", line 145, in train
    ddp_setup(0,1)
  File "trainfvqa_gruc.py", line 42, in ddp_setup
    init_process_group(backend="gloo", rank=0, world_size=1)
  File "/home/seecs/miniconda/envs/mucko-edit/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 360, in init_process_group
    timeout=timeout)
RuntimeError: [enforce fail at /opt/conda/conda-bld/pytorch_1544202130060/work/third_party/gloo/gloo/transport/tcp/device.cc:128] rp != nullptr. Unable to find address for: 127.0.0.1localhost.localdomainlocalhost
I tried printing the issue with os.environ["TORCH_DISTRIBUTED_DEBUG"]="DETAIL"
it outputs:
Loading FVQATrainDataset...
True
done splitting
Loading FVQATestDataset...
Loading glove...
Building Model...
Segmentation fault
With the NCCL backend, training starts but gets stuck and does not go further than this:
Training for epoch 0:
0%| | 0/2039 [00:00<?, ?it/s]
I found this solution, but where should I add these lines?
Set GLOO_SOCKET_IFNAME, for example: export GLOO_SOCKET_IFNAME=eth0
mentioned in
https://discuss.pytorch.org/t/runtime-error-using-distributed-with-gloo/16579/3
Can someone help me with this issue? I am hoping to get an answer.
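One place such a setting can go is the top of ddp_setup, before init_process_group is called. The sketch below is only an illustration under assumptions: the interface name eth0 is a placeholder (the real name has to come from the machine, for example from ip addr), configure_gloo_env is a hypothetical helper, and rank and world_size are passed through instead of being hard-coded. torch is imported inside ddp_setup so that the environment helper can be tried on its own.

```python
import os


def configure_gloo_env(ifname: str,
                       master_addr: str = "127.0.0.1",
                       master_port: str = "12355") -> None:
    """Set the environment variables used for the rendezvous.

    `ifname` is an assumption: it must name a network interface that
    actually exists on this machine (e.g. "eth0" or "lo").
    """
    os.environ["GLOO_SOCKET_IFNAME"] = ifname
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = master_port


def ddp_setup(rank: int, world_size: int, ifname: str = "eth0") -> None:
    """Configure the environment, then create the process group."""
    # Imported here so the helper above can be used without torch installed.
    from torch.distributed import init_process_group

    configure_gloo_env(ifname)
    # Use the arguments instead of the hard-coded rank=0, world_size=1.
    init_process_group(backend="gloo", rank=rank, world_size=world_size)
```

The same effect can be had from the shell by running export GLOO_SOCKET_IFNAME=eth0 before launching the training script, as the linked thread suggests.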

How to fix ModuleNotFoundError: No module named 'binance'?

From a Jupyter Notebook I ran pip install binance. Running from binance.client import Client then gives the error above. I have renamed my binance.py file, as suggested in similar questions, but I am still getting the error. I have also checked that I am not installing for one version of Python while running my code with another, as mentioned in another question. Trying pip uninstall gives "WARNING: Skipping binance as it is not installed.".
How can I get the python-binance package to work?
Edit: Following Wayne's comment I tried %conda install -c conda-forge python-binance and encounter a new error when trying to import: No module named 'importlib.readers'
Edit 2: conda list and pip list both run without errors.
My traceback:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[2], line 1
----> 1 from binance.client import Client
File ~\anaconda3\envs\py3\lib\site-packages\binance\__init__.py:9
1 """An unofficial Python wrapper for the Binance exchange API v3
2
3 .. moduleauthor:: Sam McHardy
4
5 """
7 __version__ = '1.0.16'
----> 9 from binance.client import Client, AsyncClient # noqa
10 from binance.depthcache import DepthCacheManager, OptionsDepthCacheManager, ThreadedDepthCacheManager # noqa
11 from binance.streams import BinanceSocketManager, ThreadedWebsocketManager # noqa
File ~\anaconda3\envs\py3\lib\site-packages\binance\client.py:7
5 import hashlib
6 import hmac
----> 7 import requests
8 import time
9 from operator import itemgetter
File ~\anaconda3\envs\py3\lib\site-packages\requests\__init__.py:147
144 import logging
145 from logging import NullHandler
--> 147 from . import packages, utils
148 from .__version__ import (
149 __author__,
150 __author_email__,
(...)
158 __version__,
159 )
160 from .api import delete, get, head, options, patch, post, put, request
File ~\anaconda3\envs\py3\lib\site-packages\requests\utils.py:58
54 from .structures import CaseInsensitiveDict
56 NETRC_FILES = (".netrc", "_netrc")
---> 58 DEFAULT_CA_BUNDLE_PATH = certs.where()
60 DEFAULT_PORTS = {"http": 80, "https": 443}
62 # Ensure that ', ' is used to preserve previous delimiter behavior.
File ~\anaconda3\envs\py3\lib\site-packages\certifi\core.py:71, in where()
58 global _CACERT_PATH
59 if _CACERT_PATH is None:
60 # This is slightly janky, the importlib.resources API wants you
61 # to manage the cleanup of this file, so it doesn't actually
(...)
69 # it will do the cleanup whenever it gets garbage collected, so
70 # we will also store that at the global level as well.
---> 71 _CACERT_CTX = get_path("certifi", "cacert.pem")
72 _CACERT_PATH = str(_CACERT_CTX.__enter__())
74 return _CACERT_PATH
File ~\anaconda3\envs\py3\lib\importlib\resources.py:119, in path(package, resource)
112 else:
113 return BytesIO(data)
116 def open_text(package: Package,
117 resource: Resource,
118 encoding: str = 'utf-8',
--> 119 errors: str = 'strict') -> TextIO:
120 """Return a file-like object opened for text reading of the resource."""
121 resource = _normalize_path(resource)
File ~\anaconda3\envs\py3\lib\importlib\_common.py:52, in get_resource_reader(package)
ModuleNotFoundError: No module named 'importlib.readers'
As suggested in the question comments, my problem was inconsistency of installed packages due to using pip instead of conda. Uninstalling and reinstalling Anaconda fixed the module not found error.
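For readers who want to check for this kind of mismatch before reinstalling, here is a minimal diagnostic sketch. It is not the fix described above; it only shows which interpreter the notebook kernel runs and builds an install command bound to that interpreter. The helper names is_importable and pip_install_command are hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name: str) -> bool:
    """Return True if the running interpreter can locate the module."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package: str) -> list:
    """Build a pip command tied to the interpreter that runs this code."""
    return [sys.executable, "-m", "pip", "install", package]


print("Kernel interpreter:", sys.executable)
print("binance importable:", is_importable("binance"))
print("Install with:", " ".join(pip_install_command("python-binance")))
```

If the printed interpreter is not the one inside the expected conda environment, packages installed from a terminal may be landing in a different environment than the one the notebook uses.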

Why does Explainable AI not find implementations of the model?

In this Notebook, we use the Explainable AI SDK from Google to load a model right after saving it. This fails with a message that the model is missing.
But note that the info message says that the model was saved, and checking working/model shows that the model is there.
However, working/model/assets is empty.
Why do we get this error message? How can we avoid it?
model_path = "working/model"
model.save(model_path)
builder = SavedModelMetadataBuilder(model_path)
builder.set_numeric_metadata(
    "numpy_inputs",
    input_baselines=[X_train.median().tolist()],  # attributions relative to the median of the target
    index_feature_mapping=X_train.columns.tolist(),  # the names of each feature
)
builder.save_metadata(model_path)
explainer = explainable_ai_sdk.load_model_from_local_path(
    model_path=model_path,
    config=configs.SampledShapleyConfig(path_count=20),
)
INFO:tensorflow:Assets written to: working/model/assets
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
/tmp/ipykernel_26061/1928503840.py in <module>
18 explainer = explainable_ai_sdk.load_model_from_local_path(
19 model_path=model_path,
---> 20 config=configs.SampledShapleyConfig(path_count=20),
21 )
22
/opt/conda/lib/python3.7/site-packages/explainable_ai_sdk/model/model_factory.py in load_model_from_local_path(model_path, config)
128 """
129 if _LOCAL_MODEL_KEY not in _MODEL_REGISTRY:
--> 130 raise NotImplementedError('There are no implementations of local model.')
131 return _MODEL_REGISTRY[_LOCAL_MODEL_KEY](model_path, config)
132
NotImplementedError: There are no implementations of local model.

Loading multiple CSV files (silos) to compose Tensorflow Federated dataset

I am working on pre-processed data that was already siloed into separate CSV files to represent separate local data for federated learning.
To correctly implement federated learning with these multiple CSVs in TensorFlow Federated, I am trying to reproduce the same approach with a toy example on the iris dataset. However, when trying to use tff.simulation.datasets.TestClientData, I get the error:
TypeError: can't pickle _thread.RLock objects
The current code is as follows, first, load the three iris dataset CSV files (50 samples on each) into a dictionary from the filenames iris1.csv, iris2.csv, and iris3.csv:
silos = {}
for silo in silos_files:
    silo_name = silo.replace(".csv", "")
    silos[silo_name] = pd.read_csv(silos_path + silo)
    silos[silo_name]["variety"].replace({"Setosa" : 0, "Versicolor" : 1, "Virginica" : 2}, inplace=True)
Creating a new dict with tensors:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = tf.data.Dataset.from_tensor_slices((silo.drop(columns=["variety"]).values, silo["variety"].values))
Finally, trying to convert the TensorFlow Dataset into a TensorFlow Federated dataset:
tff_dataset = tff.simulation.datasets.TestClientData(
    silos_tf
)
That raises the error:
TypeError Traceback (most recent call last)
<ipython-input-58-a4b5686509ce> in <module>()
1 tff_dataset = tff.simulation.datasets.TestClientData(
----> 2 silos_tf
3 )
/usr/local/lib/python3.7/dist-packages/tensorflow_federated/python/simulation/datasets/from_tensor_slices_client_data.py in __init__(self, tensor_slices_dict)
59 """
60 py_typecheck.check_type(tensor_slices_dict, dict)
---> 61 tensor_slices_dict = copy.deepcopy(tensor_slices_dict)
62 structures = list(tensor_slices_dict.values())
63 example_structure = structures[0]
...
/usr/lib/python3.7/copy.py in deepcopy(x, memo, _nil)
167 reductor = getattr(x, "__reduce_ex__", None)
168 if reductor:
--> 169 rv = reductor(4)
170 else:
171 reductor = getattr(x, "__reduce__", None)
TypeError: can't pickle _thread.RLock objects
I also tried to use Python dictionary instead of OrderedDict but the error is the same. For this experiment, I am using Google Colab with this notebook as reference running with TensorFlow 2.8.0 and TensorFlow Federated version 0.20.0. I also used these previous questions as references:
Is there a reasonable way to create tff clients datat sets?
'tensorflow_federated.python.simulation' has no attribute 'FromTensorSlicesClientData' when using tff-nightly
I am not sure whether this approach generalizes beyond the toy example. I would be thankful for any suggestion on how to bring already siloed data into TFF tests.
I searched public code on GitHub that uses the class tff.simulation.datasets.TestClientData and found the following implementation (source here):
def to_ClientData(clientsData: np.ndarray, clientsDataLabels: np.ndarray,
                  ds_info, is_train=True) -> tff.simulation.datasets.TestClientData:
    """Transform dataset to be fed to fedjax
    :param clientsData: dataset for each client
    :param clientsDataLabels:
    :param ds_info: dataset information
    :param is_train: True if processing train split
    :return: dataset for each client cast into TestClientData
    """
    num_clients = ds_info['num_clients']
    client_data = collections.OrderedDict()
    for i in range(num_clients if is_train else 1):
        client_data[str(i)] = collections.OrderedDict(
            x=clientsData[i],
            y=clientsDataLabels[i])
    return tff.simulation.datasets.TestClientData(client_data)
From this snippet I understood that the tff.simulation.datasets.TestClientData class requires as its argument an OrderedDict composed of NumPy arrays instead of a dict of tensors (as in my previous implementation), so I changed the code to the following:
silos_tf = collections.OrderedDict()
for key, silo in silos.items():
    silos_tf[key] = collections.OrderedDict(
        x=silo.drop(columns=["variety"]).values,
        y=silo["variety"].values)
Followed by:
tff_dataset = tff.simulation.datasets.TestClientData(
    silos_tf
)
That runs correctly, with the following output:
>>> tff_dataset.client_ids
['iris3', 'iris1', 'iris2']
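To make the expected structure concrete, here is a dependency-free sketch of the nested mapping: each client ID maps to an OrderedDict with keys x and y. It uses the standard csv module instead of pandas purely so it runs anywhere; the column names in the sample and the helper silo_to_client are assumptions for illustration. In the working code above the values are NumPy arrays taken from the DataFrames, and the result is what gets passed to tff.simulation.datasets.TestClientData.

```python
import collections
import csv
import io

# Label encoding taken from the question.
LABELS = {"Setosa": 0, "Versicolor": 1, "Virginica": 2}


def silo_to_client(csv_file, label_column="variety"):
    """Read one silo and return OrderedDict(x=features, y=labels)."""
    xs, ys = [], []
    for row in csv.DictReader(csv_file):
        # Remove the label from the row; what remains are the features.
        ys.append(LABELS[row.pop(label_column)])
        xs.append([float(value) for value in row.values()])
    return collections.OrderedDict(x=xs, y=ys)


# Two sample rows standing in for iris1.csv (column names are assumed).
sample = io.StringIO(
    "sepal.length,sepal.width,petal.length,petal.width,variety\n"
    "5.1,3.5,1.4,0.2,Setosa\n"
    "7.0,3.2,4.7,1.4,Versicolor\n"
)
clients = collections.OrderedDict(iris1=silo_to_client(sample))
print(clients["iris1"]["y"])
```

With real files, the same function could be called once per silo inside a loop over the CSV paths, with the file name (minus extension) as the client ID.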

Huggingface AutoTokenizer can't load from local path

I'm trying to run the language model fine-tuning script (run_language_modeling.py) from the Huggingface examples with my own tokenizer (I just added several tokens, see the comments). I have a problem loading the tokenizer. I think the problem is with AutoTokenizer.from_pretrained('local/path/to/directory').
Code:
from transformers import *
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# special_tokens = ['<HASHTAG>', '<URL>', '<AT_USER>', '<EMOTICON-HAPPY>', '<EMOTICON-SAD>']
# tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained('../twitter/twittertokenizer/')
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
Error Message:
OSError Traceback (most recent call last)
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
248 resume_download=resume_download,
--> 249 local_files_only=local_files_only,
250 )
/z/huggingface_venv/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
265 # File, but it doesn't exist.
--> 266 raise EnvironmentError("file {} not found".format(url_or_filename))
267 else:
OSError: file ../twitter/twittertokenizer/config.json not found
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-32-662067cb1297> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
/z/huggingface_venv/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
190 config = kwargs.pop("config", None)
191 if not isinstance(config, PretrainedConfig):
--> 192 config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
193
194 if "bert-base-japanese" in pretrained_model_name_or_path:
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
192 """
193 config_dict, _ = PretrainedConfig.get_config_dict(
--> 194 pretrained_model_name_or_path, pretrained_config_archive_map=ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, **kwargs
195 )
196
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
270 )
271 )
--> 272 raise EnvironmentError(msg)
273
274 except json.JSONDecodeError:
OSError: Can't load '../twitter/twittertokenizer/'. Make sure that:
- '../twitter/twittertokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or '../twitter/twittertokenizer/' is the correct path to a directory containing a 'config.json' file
If I change AutoTokenizer to BertTokenizer, the code above works. I can also run the script without any problem if I load by shortcut name instead of by path. But the script run_language_modeling.py uses AutoTokenizer, so I'm looking for a way to get it running.
Any idea? Thanks!
The problem is that nothing in the path you pass indicates which tokenizer class should be instantiated.
For reference, see the rules defined in the Huggingface docs. Specifically, since you are using BERT:
contains bert: BertTokenizer (Bert model)
Otherwise, you have to specify the exact type yourself, as you mentioned.
AutoTokenizer.from_pretrained fails if the specified path does not contain the model configuration files, which are required solely for the tokenizer class instantiation.
In the context of run_language_modeling.py the usage of AutoTokenizer is buggy (or at least leaky).
There is no point in specifying the (optional) tokenizer_name parameter if it is identical to the model name or path. Therefore, to my understanding, it is supposed to support exactly the case of a modified tokenizer. I also found this issue very confusing.
The best workaround that I have found is to add config.json to the tokenizer directory with only the "missing" configuration:
{
"model_type": "bert"
}
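A small sketch of that workaround in code, assuming the tokenizer has already been saved to the directory with save_pretrained. The helper name write_minimal_config is hypothetical, and the temporary directory only stands in for a real path such as ../twitter/twittertokenizer/.

```python
import json
import tempfile
from pathlib import Path


def write_minimal_config(tokenizer_dir, model_type="bert"):
    """Write a config.json that only names the model type."""
    config_path = Path(tokenizer_dir) / "config.json"
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps({"model_type": model_type}, indent=2))
    return config_path


# Demonstration in a throwaway directory instead of the real tokenizer path.
with tempfile.TemporaryDirectory() as tmp:
    config_path = write_minimal_config(Path(tmp) / "twittertokenizer")
    written = json.loads(config_path.read_text())

print(written)
```

After the file exists next to the saved vocabulary, AutoTokenizer.from_pretrained on that directory has the model type it was missing in the traceback above.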
When loading a modified tokenizer or a pretrained tokenizer, you can also pass the model configuration explicitly:
tokenizer = AutoTokenizer.from_pretrained(path_to_json_file_of_tokenizer, config=AutoConfig.from_pretrained('path to the folder that contains the config file of the model'))
