Create expectation suite without CLI - validation

I am starting to use Great Expectations for a project and am trying to create an expectation suite programmatically. I have a GCS datasource (consisting of 2 CSV files) defined in great_expectations.yml as follows:
datasources:
  GCS_Data:
    class_name: Datasource
    data_connectors:
      default_inferred_data_connector_name:
        class_name: InferredAssetFilesystemDataConnector
        default_regex:
          group_names:
            - data_asset_name
          pattern: (.*)
        base_directory: gs://mybucket/GCS_datasource
        module_name: great_expectations.datasource.data_connector
      default_runtime_data_connector_name:
        class_name: RuntimeDataConnector
        module_name: great_expectations.datasource.data_connector
        assets:
          my_runtime_asset_name:
            class_name: Asset
            module_name: great_expectations.datasource.data_connector.asset
            batch_identifiers:
              - runtime_batch_identifier_name
    execution_engine:
      class_name: PandasExecutionEngine
      module_name: great_expectations.execution_engine
    module_name: great_expectations.datasource
config_variables_file_path: uncommitted/config_variables.yml
When I try to create the expectation suite I run:
import great_expectations as ge
from great_expectations.core.batch import BatchRequest
from great_expectations.checkpoint import SimpleCheckpoint  # needed?
from great_expectations.exceptions import DataContextError

context = ge.data_context.DataContext()

# Note that if you modify this batch request, you may save the new version as a .json file
# to pass in later via the --batch-request option
batch_request = {
    "datasource_name": "GCS_Data",
    "data_connector_name": "default_inferred_data_connector_name",
    "data_asset_name": "yellow_tripdata_sample_2019-01.csv",
    "limit": 1000,
}

suite = context.create_expectation_suite(expectation_suite_name='my_second_expectation_suite')
validator = context.get_validator(
    batch_request=BatchRequest(**batch_request),
    expectation_suite_name='my_second_expectation_suite',
)
But the 'get_validator' step throws the following error:
---------------------------------------------------------------------------
InvalidBatchRequestError Traceback (most recent call last)
/tmp/ipykernel_27667/3237782419.py in <module>
35 validator = context.get_validator(
36 batch_request=BatchRequest(**batch_request),
---> 37 expectation_suite_name='my_second_expectation_suite')
38
39 validator.expect_column_max_to_be_between(column = 'passenger_count', min = 4, max = 10)
/opt/conda/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py in get_validator(self, datasource_name, data_connector_name, data_asset_name, batch, batch_list, batch_request, batch_request_list, batch_data, data_connector_query, batch_identifiers, limit, index, custom_filter_function, sampling_method, sampling_kwargs, splitter_method, splitter_kwargs, runtime_parameters, query, path, batch_filter_parameters, expectation_suite_ge_cloud_id, batch_spec_passthrough, expectation_suite_name, expectation_suite, create_expectation_suite_with_name, include_rendered_content, **kwargs)
1393 expectation_suite=expectation_suite, # type: ignore[arg-type]
1394 batch_list=batch_list,
-> 1395 include_rendered_content=include_rendered_content,
1396 )
1397
/opt/conda/lib/python3.7/site-packages/great_expectations/data_context/data_context/abstract_data_context.py in get_validator_using_batch_list(self, expectation_suite, batch_list, include_rendered_content, **kwargs)
1418 raise ge_exceptions.InvalidBatchRequestError(
1419 """Validator could not be created because BatchRequest returned an empty batch_list.
-> 1420 Please check your parameters and try again."""
1421 )
1422
InvalidBatchRequestError: Validator could not be created because BatchRequest returned an empty batch_list.
Please check your parameters and try again.
I don't really understand this, because my batch_request object is not empty. Does anybody have an idea of what might be happening?
Thanks in advance.
I have also tried to follow the steps from here: https://legacy.docs.greatexpectations.io/en/stable/guides/how_to_guides/creating_and_editing_expectations/how_to_create_a_new_expectation_suite_without_the_cli.html
But in the step:
batch = context.get_batch(batch_kwargs, suite)
I also get this error:
AttributeError: 'Datasource' object has no attribute 'get_batch'
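One hedged way to narrow this down is to ask the context which data assets each data connector can actually see. If the inferred connector reports no assets, the empty batch_list comes from the connector configuration rather than from the BatchRequest: InferredAssetFilesystemDataConnector walks a local directory, so a gs:// base_directory may simply match nothing (Great Expectations provides a separate InferredAssetGCSDataConnector for GCS buckets). Likewise, the legacy guide's context.get_batch(batch_kwargs, suite) targets the older V2 API, while a class_name: Datasource block configures a V3 datasource, which would explain the missing get_batch attribute. A diagnostic sketch:

import great_expectations as ge

context = ge.data_context.DataContext()

# List every data asset each configured data connector can see. An empty
# list under default_inferred_data_connector_name would explain why the
# BatchRequest returns an empty batch_list.
print(context.get_available_data_asset_names())
# e.g. {'GCS_Data': {'default_inferred_data_connector_name': [...],
#                    'default_runtime_data_connector_name': [...]}}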

Related

pytorch-lightning example throws assertion error

I'm using the following code out of the box from this url: https://lightning-transformers.readthedocs.io/en/latest/tasks/nlp/question_answering.html
import pytorch_lightning as pl
from transformers import AutoTokenizer
from lightning_transformers.task.nlp.question_answering import (
    QuestionAnsweringTransformer,
    SquadDataModule,
)

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path="bert-base-uncased")
model = QuestionAnsweringTransformer(pretrained_model_name_or_path="bert-base-uncased")
dm = SquadDataModule(
    batch_size=1,
    dataset_config_name="plain_text",
    max_length=384,
    version_2_with_negative=False,
    null_score_diff_threshold=0.0,
    doc_stride=128,
    n_best_size=20,
    max_answer_length=30,
    tokenizer=tokenizer,
)
trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
trainer.fit(model, dm)
which throws this error
AssertionError Traceback (most recent call last)
<ipython-input-2-0b608c02a52e> in <module>
14 trainer = pl.Trainer(accelerator="auto", devices="auto", max_epochs=1)
15
---> 16 trainer.fit(model, dm)
16 frames
/usr/local/lib/python3.8/dist-packages/lightning_transformers/task/nlp/question_answering/datasets/squad/processing.py in postprocess_qa_predictions(examples, features, predictions, version_2_with_negative, n_best_size, max_answer_length, null_score_diff_threshold, output_dir, prefix)
245 all_start_logits, all_end_logits, example_ids = predictions
246
--> 247 assert len(predictions[0]) == len(features), f"Got {len(predictions[0])} predictions and {len(features)} features."
248
249 # Build a map example to its corresponding features.
AssertionError: Got 2 predictions and 10784 features.
I was simply trying to get a single example from the documentation to run within Google Colab before investigating further whether this would meet my use case, but the example fails as is, which makes it disheartening to investigate further. Nothing comes up when I google "AssertionError: Got 2 predictions and 10784 features."

Why does Explainable AI not find implementations of the model?

In this notebook, we use the Explainable AI SDK from Google to load a model right after saving it. This fails with a message that the model is missing. But note:
the info message says that the model was saved;
checking working/model shows that the model is there;
however, working/model/assets is empty.
Why do we get this error message? How can we avoid it?
model_path = "working/model"
model.save(model_path)

builder = SavedModelMetadataBuilder(model_path)
builder.set_numeric_metadata(
    "numpy_inputs",
    input_baselines=[X_train.median().tolist()],  # attributions relative to the median of the target
    index_feature_mapping=X_train.columns.tolist(),  # the names of each feature
)
builder.save_metadata(model_path)

explainer = explainable_ai_sdk.load_model_from_local_path(
    model_path=model_path,
    config=configs.SampledShapleyConfig(path_count=20),
)
INFO:tensorflow:Assets written to: working/model/assets
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
/tmp/ipykernel_26061/1928503840.py in <module>
18 explainer = explainable_ai_sdk.load_model_from_local_path(
19 model_path=model_path,
---> 20 config=configs.SampledShapleyConfig(path_count=20),
21 )
22
/opt/conda/lib/python3.7/site-packages/explainable_ai_sdk/model/model_factory.py in load_model_from_local_path(model_path, config)
128 """
129 if _LOCAL_MODEL_KEY not in _MODEL_REGISTRY:
--> 130 raise NotImplementedError('There are no implementations of local model.')
131 return _MODEL_REGISTRY[_LOCAL_MODEL_KEY](model_path, config)
132
NotImplementedError: There are no implementations of local model.
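Going by the model_factory source shown in the traceback, the exception fires because _MODEL_REGISTRY contains no entry for local models, i.e. no local-model implementation was registered when the SDK was imported. A hedged diagnostic that pokes at the same private internals the traceback exposes (exploratory only, not a supported API):

# _MODEL_REGISTRY is a private detail of explainable_ai_sdk, visible in the
# traceback above; an empty registry confirms nothing was registered at import.
from explainable_ai_sdk.model import model_factory

print(model_factory._MODEL_REGISTRY)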

trainer.train() in Kaggle: StdinNotImplementedError: getpass was called, but this frontend does not support input requests

When saving a version in Kaggle, I get StdinNotImplementedError: getpass was called, but this frontend does not support input requests whenever I use the transformers.Trainer class. The general code I use:
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(params)
trainer = Trainer(params)
trainer.train()
And the specific cell I am running now:
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

early_stopping = EarlyStoppingCallback()
training_args = TrainingArguments(
    output_dir=OUT_FINETUNED_MODEL_PATH,
    num_train_epochs=20,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    warmup_steps=0,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=100,
    evaluation_strategy="steps",
    eval_steps=100,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    callbacks=[early_stopping],
)
trainer.train()
When trainer.train() is called, I get the error below, which I do not get if I train with native PyTorch. I understand that the error arises because I am being asked to input a password, but no password is requested when using native PyTorch code, nor when running the same code with trainer.train() on Google Colab.
Any solution would be OK, such as:
Avoiding being asked for the password.
Enabling input requests when saving a notebook on Kaggle. After that, if I understood correctly, I would need to go to https://wandb.ai/authorize (after having created an account) and copy the generated key to the console. However, I do not understand why wandb should be necessary, since I have never explicitly used it so far.
wandb: You can find your API key in your browser here: https://wandb.ai/authorize
Traceback (most recent call last):
File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 741, in init
wi.setup(kwargs)
File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py", line 155, in setup
wandb_login._login(anonymous=anonymous, force=force, _disable_warning=True)
File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_login.py", line 210, in _login
wlogin.prompt_api_key()
File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_login.py", line 144, in prompt_api_key
no_create=self._settings.force,
File "/opt/conda/lib/python3.7/site-packages/wandb/sdk/lib/apikey.py", line 135, in prompt_api_key
key = input_callback(api_ask).strip()
File "/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py", line 825, in getpass
"getpass was called, but this frontend does not support input requests."
IPython.core.error.StdinNotImplementedError: getpass was called, but this frontend does not support input requests.
wandb: ERROR Abnormal program exit
---------------------------------------------------------------------------
StdinNotImplementedError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
740 wi = _WandbInit()
--> 741 wi.setup(kwargs)
742 except_exit = wi.settings._except_exit
/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py in setup(self, kwargs)
154 if not settings._offline and not settings._noop:
--> 155 wandb_login._login(anonymous=anonymous, force=force, _disable_warning=True)
156
/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_login.py in _login(anonymous, key, relogin, host, force, _backend, _silent, _disable_warning)
209 if not key:
--> 210 wlogin.prompt_api_key()
211
/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_login.py in prompt_api_key(self)
143 no_offline=self._settings.force,
--> 144 no_create=self._settings.force,
145 )
/opt/conda/lib/python3.7/site-packages/wandb/sdk/lib/apikey.py in prompt_api_key(settings, api, input_callback, browser_callback, no_offline, no_create, local)
134 )
--> 135 key = input_callback(api_ask).strip()
136 write_key(settings, key, api=api)
/opt/conda/lib/python3.7/site-packages/ipykernel/kernelbase.py in getpass(self, prompt, stream)
824 raise StdinNotImplementedError(
--> 825 "getpass was called, but this frontend does not support input requests."
826 )
StdinNotImplementedError: getpass was called, but this frontend does not support input requests.
The above exception was the direct cause of the following exception:
Exception Traceback (most recent call last)
<ipython-input-82-4d1046ab80b8> in <module>
42 )
43
---> 44 trainer.train()
/opt/conda/lib/python3.7/site-packages/transformers/trainer.py in train(self, resume_from_checkpoint, trial, **kwargs)
1067 model.zero_grad()
1068
-> 1069 self.control = self.callback_handler.on_train_begin(self.args, self.state, self.control)
1070
1071 # Skip the first epochs_trained epochs to get the random state of the dataloader at the right point.
/opt/conda/lib/python3.7/site-packages/transformers/trainer_callback.py in on_train_begin(self, args, state, control)
338 def on_train_begin(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
339 control.should_training_stop = False
--> 340 return self.call_event("on_train_begin", args, state, control)
341
342 def on_train_end(self, args: TrainingArguments, state: TrainerState, control: TrainerControl):
/opt/conda/lib/python3.7/site-packages/transformers/trainer_callback.py in call_event(self, event, args, state, control, **kwargs)
386 train_dataloader=self.train_dataloader,
387 eval_dataloader=self.eval_dataloader,
--> 388 **kwargs,
389 )
390 # A Callback can skip the return of `control` if it doesn't change it.
/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in on_train_begin(self, args, state, control, model, **kwargs)
627 self._wandb.finish()
628 if not self._initialized:
--> 629 self.setup(args, state, model, **kwargs)
630
631 def on_train_end(self, args, state, control, model=None, tokenizer=None, **kwargs):
/opt/conda/lib/python3.7/site-packages/transformers/integrations.py in setup(self, args, state, model, **kwargs)
604 project=os.getenv("WANDB_PROJECT", "huggingface"),
605 name=run_name,
--> 606 **init_args,
607 )
608 # add config parameters (run may have been created manually)
/opt/conda/lib/python3.7/site-packages/wandb/sdk/wandb_init.py in init(job_type, dir, config, project, entity, reinit, tags, group, name, notes, magic, config_exclude_keys, config_include_keys, anonymous, mode, allow_val_change, resume, force, tensorboard, sync_tensorboard, monitor_gym, save_code, id, settings)
779 if except_exit:
780 os._exit(-1)
--> 781 six.raise_from(Exception("problem"), error_seen)
782 return run
/opt/conda/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
Exception: problem
You may want to try adding report_to="tensorboard", or any other reasonable string, to your TrainingArguments:
https://huggingface.co/transformers/main_classes/trainer.html#transformers.TrainingArguments
If you have multiple loggers that you want to use, set report_to="all" (the default value).
Alternatively, try os.environ["WANDB_DISABLED"] = "true" so that wandb is always disabled.
See: https://huggingface.co/transformers/main_classes/trainer.html#transformers.TFTrainer.setup_wandb
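A minimal sketch combining both suggestions (the environment variable comes straight from the answer; whether report_to accepts a bare string or expects a list depends on your transformers version, so treat the argument form as an assumption):

import os

# Must be set before the Trainer is constructed, so the wandb callback
# is never attached in the first place.
os.environ["WANDB_DISABLED"] = "true"

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    report_to="none",  # or "tensorboard" / "all"; older versions expect a list
)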

Register Azure ML Model from DatabricksStep

I'm computing a model while executing a DatabricksStep in an Azure ML Pipeline, saving it to my Blob Storage as a .pkl file, and uploading it to the current Azure ML Run using Run.upload_file(). All this works without any problems.
But as soon as I try to register the model to the Azure ML Workspace using Run.register_model(), the script throws the following error:
UserErrorException: UserErrorException:
Message:
Operation returned an invalid status code 'Forbidden'. The possible reason could be:
1. You are not authorized to access this resource, or directory listing denied.
2. you may not login your azure service, or use other subscription, you can check your
default account by running azure cli commend:
'az account list -o table'.
3. You have multiple objects/login session opened, please close all session and try again.
InnerException None
ErrorResponse
{
  "error": {
    "code": "UserError",
    "message": "\nOperation returned an invalid status code 'Forbidden'. The possible reason could be:\n1. You are not authorized to access this resource, or directory listing denied.\n2. you may not login your azure service, or use other subscription, you can check your\ndefault account by running azure cli commend:\n'az account list -o table'.\n3. You have multiple objects/login session opened, please close all session and try again.\n "
  }
}
with the following call stack
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/models_client.py in register_model(self, name, tags, properties, description, url, mime_type, framework, framework_version, unpack, experiment_name, run_id, datasets, sample_input_data, sample_output_data, resource_requirements)
70 return self.
71 _execute_with_workspace_arguments(self._client.ml_models.register, model,
---> 72 custom_headers=ModelsClient.get_modelmanagement_custom_headers())
73
74 #error_with_model_id_handling
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/workspace_client.py in _execute_with_workspace_arguments(self, func, *args, **kwargs)
65
66 def _execute_with_workspace_arguments(self, func, *args, **kwargs):
---> 67 return self._execute_with_arguments(func, copy.deepcopy(self._workspace_arguments), *args, **kwargs)
68
69 def get_or_create_experiment(self, experiment_name, is_async=False):
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/clientbase.py in _execute_with_arguments(self, func, args_list, *args, **kwargs)
536 return self._call_paginated_api(func, *args_list, **kwargs)
537 else:
--> 538 return self._call_api(func, *args_list, **kwargs)
539 except ErrorResponseException as e:
540 raise ServiceException(e)
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/clientbase.py in _call_api(self, func, *args, **kwargs)
234 return AsyncTask(future, _ident=ident, _parent_logger=self._logger)
235 else:
--> 236 return self._execute_with_base_arguments(func, *args, **kwargs)
237
238 def _call_paginated_api(self, func, *args, **kwargs):
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/clientbase.py in _execute_with_base_arguments(self, func, *args, **kwargs)
323 total_retry = 0 if self.retries < 0 else self.retries
324 return ClientBase._execute_func_internal(
--> 325 back_off, total_retry, self._logger, func, _noop_reset, *args, **kwargs)
326
327 #classmethod
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/clientbase.py in _execute_func_internal(cls, back_off, total_retry, logger, func, reset_func, *args, **kwargs)
343 return func(*args, **kwargs)
344 except Exception as error:
--> 345 left_retry = cls._handle_retry(back_off, left_retry, total_retry, error, logger, func)
346
347 reset_func(*args, **kwargs) # reset_func is expected to undo any side effects from a failed func call.
/databricks/python/lib/python3.7/site-packages/azureml/_restclient/clientbase.py in _handle_retry(cls, back_off, left_retry, total_retry, error, logger, func)
384 3. You have multiple objects/login session opened, please close all session and try again.
385 """
--> 386 raise_from(UserErrorException(error_msg), error)
387
388 elif error.response.status_code == 429:
/databricks/python/lib/python3.7/site-packages/six.py in raise_from(value, from_value)
Did anybody experience the same error and know what its cause is and how to solve it?
Best,
Jonas
UPDATE:
import joblib
import sklearn.linear_model
from azureml.core import Run

model = sklearn.linear_model.LinearRegression()
model_path = "<path to 'model.pkl' in my blob storage>"
joblib.dump(model, model_path)

aml_run = Run.get_context()  # azureml.core.Run, not azureml.core.get_context
aml_run.upload_file(name="model.pkl", path_or_stream=model_path)
# Until this point, everything works fine

aml_run.register_model(model_name="model.pkl")
# This throws the posted "Forbidden" error
To configure the workspace to authenticate to the subscription, please follow the steps in the notebooks:
Persist the model (joblib.dump) to a custom folder other than outputs.
Manually run upload_file to upload the model to the AML workspace. Give the destination the same name as your model file.
Then run run.register_model.
or
Let the AML background process automatically upload content under ./outputs to the AML workspace. Once the upload is complete, call run.register_model, which takes the content from the AML workspace.
The documentation for the DatabricksStep class and the sample notebook https://aka.ms/pl-databricks should both be helpful.
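A hedged sketch of the ./outputs flow described above (model_name is a placeholder; the explicit upload_file mirrors the first variant and avoids racing the background upload):

import joblib
from azureml.core import Run

run = Run.get_context()

# Anything written under ./outputs is uploaded to the run by AML automatically.
joblib.dump(model, "outputs/model.pkl")

# Upload explicitly so the artifact is definitely present in the workspace...
run.upload_file(name="outputs/model.pkl", path_or_stream="outputs/model.pkl")

# ...then register from the run's artifact path, not the local filesystem path.
run.register_model(model_name="my-model", model_path="outputs/model.pkl")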
UserErrorException: UserErrorException: Message: Operation returned an invalid status code 'Forbidden'.
This error might be due to the Azure Databricks compute being unable to authenticate to the Azure Machine Learning workspace.
I had been facing a similar error, and this is the Microsoft-preferred way of solving the issue:
1. Create an Azure Key Vault.
2. Create a Service Principal (App registration) inside Azure Active Directory.
3. Add this Service Principal with Contributor/Owner access in AML and ADB.
4. Create an Azure Databricks Scope and link it with the key vault created in Step 1.
5. Save the Client ID, Directory ID and Client Secret in the Key Vault.
6. Use ServicePrincipalAuthentication to validate the credentials.
For Step 6, use the Databricks secret scope to get the values. This resource will walk you through that step: Secret Management in Azure Databricks
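For illustration, a hedged sketch of Step 6 as it would look inside a Databricks notebook (the scope and key names are placeholders; dbutils is only available on Databricks):

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Fetch the service principal credentials saved in the Key Vault through the
# Databricks secret scope created in Step 4 (names here are placeholders).
sp_auth = ServicePrincipalAuthentication(
    tenant_id=dbutils.secrets.get(scope="my-scope", key="tenant-id"),
    service_principal_id=dbutils.secrets.get(scope="my-scope", key="client-id"),
    service_principal_password=dbutils.secrets.get(scope="my-scope", key="client-secret"),
)

# Authenticate to the AML workspace with the service principal instead of the
# interactive/CLI credentials that are unavailable on Databricks compute.
ws = Workspace.get(
    name="<workspace-name>",
    subscription_id="<subscription-id>",
    resource_group="<resource-group>",
    auth=sp_auth,
)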
Some references that will be helpful:
A worked-out example provided by Microsoft.
Microsoft Documentation on ServicePrincipalAuthentication

Huggingface AutoTokenizer can't load from local path

I'm trying to run the language model finetuning script (run_language_modeling.py) from the huggingface examples with my own tokenizer (I just added in several tokens; see the comments). I have a problem loading the tokenizer. I think the problem is with AutoTokenizer.from_pretrained('local/path/to/directory').
Code:
from transformers import *
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
# special_tokens = ['<HASHTAG>', '<URL>', '<AT_USER>', '<EMOTICON-HAPPY>', '<EMOTICON-SAD>']
# tokenizer.add_tokens(special_tokens)
tokenizer.save_pretrained('../twitter/twittertokenizer/')
tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
Error Message:
OSError Traceback (most recent call last)
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
248 resume_download=resume_download,
--> 249 local_files_only=local_files_only,
250 )
/z/huggingface_venv/lib/python3.7/site-packages/transformers/file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, local_files_only)
265 # File, but it doesn't exist.
--> 266 raise EnvironmentError("file {} not found".format(url_or_filename))
267 else:
OSError: file ../twitter/twittertokenizer/config.json not found
During handling of the above exception, another exception occurred:
OSError Traceback (most recent call last)
<ipython-input-32-662067cb1297> in <module>
----> 1 tmp = AutoTokenizer.from_pretrained('../twitter/twittertokenizer/')
/z/huggingface_venv/lib/python3.7/site-packages/transformers/tokenization_auto.py in from_pretrained(cls, pretrained_model_name_or_path, *inputs, **kwargs)
190 config = kwargs.pop("config", None)
191 if not isinstance(config, PretrainedConfig):
--> 192 config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
193
194 if "bert-base-japanese" in pretrained_model_name_or_path:
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_auto.py in from_pretrained(cls, pretrained_model_name_or_path, **kwargs)
192 """
193 config_dict, _ = PretrainedConfig.get_config_dict(
--> 194 pretrained_model_name_or_path, pretrained_config_archive_map=ALL_PRETRAINED_CONFIG_ARCHIVE_MAP, **kwargs
195 )
196
/z/huggingface_venv/lib/python3.7/site-packages/transformers/configuration_utils.py in get_config_dict(cls, pretrained_model_name_or_path, pretrained_config_archive_map, **kwargs)
270 )
271 )
--> 272 raise EnvironmentError(msg)
273
274 except json.JSONDecodeError:
OSError: Can't load '../twitter/twittertokenizer/'. Make sure that:
- '../twitter/twittertokenizer/' is a correct model identifier listed on 'https://huggingface.co/models'
- or '../twitter/twittertokenizer/' is the correct path to a directory containing a 'config.json' file
If I change AutoTokenizer to BertTokenizer, the code above works. I can also run the script without any problem if I load by shortcut name instead of path. But the script run_language_modeling.py uses AutoTokenizer, and I'm looking for a way to get it running.
Any idea? Thanks!
The problem is that the path you are using contains nothing that would indicate the correct tokenizer class to instantiate.
For reference, see the rules defined in the Huggingface docs. Specifically, since you are using BERT:
contains bert: BertTokenizer (Bert model)
Otherwise, you have to specify the exact type yourself, as you mentioned.
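For instance, loading with the concrete class, which the question already confirms works:

from transformers import BertTokenizer

# Naming the tokenizer class explicitly skips the AutoConfig lookup that fails.
tmp = BertTokenizer.from_pretrained('../twitter/twittertokenizer/')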
AutoTokenizer.from_pretrained fails if the specified path does not contain the model configuration files, which are required solely to instantiate the tokenizer class.
In the context of run_language_modeling.py, the usage of AutoTokenizer is buggy (or at least leaky). There is no point in specifying the (optional) tokenizer_name parameter if it's identical to the model name or path; therefore, to my understanding, it is supposed to support exactly the case of a modified tokenizer. I also found this issue very confusing.
The best workaround that I have found is to add a config.json to the tokenizer directory with only the "missing" configuration:
{
  "model_type": "bert"
}
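An end-to-end sketch of that workaround, reusing the directory from the question:

import json
from transformers import AutoTokenizer

tokenizer_dir = '../twitter/twittertokenizer/'

# Write the minimal config.json so AutoConfig can infer the model type,
# which in turn lets AutoTokenizer resolve to BertTokenizer.
with open(tokenizer_dir + 'config.json', 'w') as f:
    json.dump({'model_type': 'bert'}, f)

tmp = AutoTokenizer.from_pretrained(tokenizer_dir)  # no longer raises OSError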
When loading a modified or pretrained tokenizer, you should load it as follows:
tokenizer = AutoTokenizer.from_pretrained(
    path_to_json_file_of_tokenizer,
    config=AutoConfig.from_pretrained('path to the folder that contains the config file of the model'),
)
