Bert Tokenizer punctuation for named entity recognition task - huggingface-transformers

I'm working on a named entity recognition task where I need to identify person names, books, etc.
I am using the Hugging Face Transformers package and BERT with PyTorch. Generally it works very well; however, my issue is that for some first names a dot "." is part of the name and should not be separated from it. For example, for the person name "Paul Adam", the first name in the training data is shortened to one letter combined with a dot: "P. Adam". The tokenizer splits it as ["P", ".", "Adam"], which later hurts the trained NER model's performance, because the training data contains "P." and not just "P". The model can recognize full names but fails on the shortened ones. I used the spaCy tokenizer before and did not face this issue. Here are more details:
from transformers import BertTokenizer, BertConfig, AutoTokenizer, AutoConfig, BertModel
path_pretrained_model='/model/bert/'
tokenizer = BertTokenizer.from_pretrained(path_pretrained_model)
print(tokenizer.tokenize("P. Adam is a scientist."))
Output:
['p', '.', 'adam', 'is', 'a', 'scientist', '.']
The desired output would be:
['p.', 'adam', 'is', 'a', 'scientist', '.']

Not sure whether this might be a viable solution for you, but here's a possible hack.
from transformers import BertTokenizer, BertConfig, AutoTokenizer, AutoConfig, BertModel
import string
tokenizer = BertTokenizer.from_pretrained(
    'bert-base-uncased',
    do_basic_tokenize=True,
    never_split=[f"{letter}." for letter in string.ascii_lowercase],
)
print(tokenizer.tokenize("P. Adam is a scientist."))
# ['p.', 'adam', 'is', 'a', 'scientist', '.']
Indeed, from the documentation
never_split (Iterable, optional) — Collection of tokens which will never be split during tokenization. Only has an effect when do_basic_tokenize=True
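If you are on a cased checkpoint, the same hack presumably needs the uppercase initials in never_split as well, since no lowercasing happens there. A minimal sketch, assuming bert-base-cased:
import string
from transformers import BertTokenizer

# include both 'a.' ... 'z.' and 'A.' ... 'Z.' so cased initials are kept intact
initials = [f"{letter}." for letter in string.ascii_letters]
tokenizer = BertTokenizer.from_pretrained(
    "bert-base-cased",
    do_basic_tokenize=True,
    never_split=initials,
)
print(tokenizer.tokenize("P. Adam is a scientist."))
# expected: ['P.', 'Adam', 'is', 'a', 'scientist', '.']
# Note: tokens kept by never_split bypass WordPiece, so "P." may still map to [UNK]
# at id-conversion time unless it is also added with tokenizer.add_tokens and the
# model embeddings are resized.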

Related

Struggling to understand user dictionary format in Elasticsearch Kuromoji tokenizer

I wanted to use the Elasticsearch Kuromoji plugin for Japanese. However, I'm struggling to understand the user_dictionary file format used by the tokenizer. It's explained in the Elastic docs https://www.elastic.co/guide/en/elasticsearch/plugins/current/analysis-kuromoji-tokenizer.html as a CSV of the following form:
The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:
<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>
So there is not much in the documentation about it.
The sample entry shown in the docs looks like this:
東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
So, breaking it down, the first element is the dictionary text:
東京スカイツリー - Tokyo Sky Tree
東京 スカイツリー - the tokens, again "Tokyo Sky Tree". I assume the space here denotes a token boundary, but I'm wondering why only "Tokyo" is a separate token while "sky tree" is not split into "sky" and "tree"?
トウキョウ スカイツリー - then we have the reading forms, again "Tokyo" and "sky tree". Why is it split that way? Can I specify more than one reading form of the text in this column (if there are any, of course)?
And the last element is the part of speech, which is the bit I don't understand. カスタム名詞 means "custom noun". I assume I can define a part of speech such as verb, noun, etc., but what are the rules? Should it follow some format for the part-of-speech name? I saw examples where it's specified as "noun" - 名詞, but in this example it's a custom noun.
Does anyone have ideas or materials, especially around the part-of-speech field, such as which values are available? Additionally, what impact does this field have on the overall tokenizer capabilities?
Thanks
Did you try to define "Tokyo sky tree" like this?
"東京スカイツリー,東京スカイツリー,トウキョウスカイツリー,カスタム名詞"
"東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
However, I then encounter another problem: Found duplicate term [東京スカイツリー] in user dictionary at line [1]
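Not from this thread, but here is a minimal sketch of wiring a single rule into a test index and checking the resulting tokens with the _analyze API. It assumes a recent Elasticsearch with the analysis-kuromoji plugin installed, the elasticsearch-py 8.x client, and made-up index/tokenizer/analyzer names; note that each surface form may appear only once in the dictionary, which seems to be what the duplicate term error is complaining about.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# create an index whose analyzer uses the kuromoji tokenizer with one inline user rule
es.indices.create(
    index="kuromoji-test",
    settings={
        "analysis": {
            "tokenizer": {
                "my_kuromoji": {
                    "type": "kuromoji_tokenizer",
                    # inline equivalent of a user_dictionary file; one CSV rule per entry
                    "user_dictionary_rules": [
                        "東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞"
                    ],
                }
            },
            "analyzer": {
                "my_analyzer": {"type": "custom", "tokenizer": "my_kuromoji"}
            },
        }
    },
)

# inspect how the custom rule tokenizes a sample sentence
resp = es.indices.analyze(
    index="kuromoji-test", analyzer="my_analyzer", text="東京スカイツリーに行く"
)
print([t["token"] for t in resp["tokens"]])
# expected roughly: ['東京', 'スカイツリー', 'に', '行く']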

How to initialize BertForSequenceClassification for different input rather than [CLS] token?

BertForSequenceClassification uses the [CLS] token's representation to feed a linear classifier. I want to leverage another token (say [X]) in the input sequence rather than [CLS]. What's the most straightforward way to implement that in Transformers?
You can define the special tokens when creating the tokenizer.
This is an example of how to modify a special token of a pretrained tokenizer:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", cls_token="[X]")
Please check the BertTokenizer documentation to see which other special tokens you can modify (unk_token, sep_token, pad_token, cls_token, mask_token, etc.).
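Not part of the answer above, but a hedged follow-up sketch: since "[X]" is not in the original BERT vocabulary, you will likely also want to register it as an added special token and resize the model's embedding matrix so it gets its own id instead of mapping to [UNK]:
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased", cls_token="[X]")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased")

# add "[X]" to the vocabulary if it isn't there yet, then grow the embedding matrix
tokenizer.add_special_tokens({"cls_token": "[X]"})
model.resize_token_embeddings(len(tokenizer))

enc = tokenizer("this is an example", return_tensors="pt")
print(tokenizer.convert_ids_to_tokens(enc["input_ids"][0]))
# expected something like: ['[X]', 'this', 'is', 'an', 'example', '[SEP]']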

How to replace matched text with a word specified by rules files?

I am currently working on a Stanford CoreNLP program that replaces matched text with a specified word using a list of given rules. I checked the TokensRegex expressions documentation, and I know there are functions that can be used in the Action field:
Replace(List<CoreMap>, tokensregex, replacement)
Match(String, regex, replacement)
to do that. However, it is not clear to me how to use these functions in my rules files, and I couldn't find any examples on GitHub or other web pages.
Here is an example of a replacement:
Input text: John Smith is a member of the NLP lab.
Matched pattern: "John Smith" is replaced with "Student A" in the text.
Resulting text: Student A is a member of the NLP lab.
Could anyone help me? I am new to Stanford CoreNLP and have a lot to learn.

Defining TokensRegex for Stanford NLP RegexNER

I'm trying to create a regex pattern to tag universities as schools in input text. For example, University of Wisconsin or Universidad Anahuac should get tagged as SCHOOL.
I have this as my pattern
( /University|Universidad/ /of?/ [ {ner:LOCATION}|{ner:ORGANIZATION} ]+ ) SCHOOL
I can't seem to get the syntax correct. Any help would be appreciated.

Searching within text fields in CloudKit

How are people searching within a string field (i.e., for a substring) using CloudKit?
For making predicates for use with CloudKit, from what I gather, you can only do BEGINSWITH and TOKENMATCHES, to search a single text field (by prefix) or all fields (exact token match) respectively. CONTAINS only works on collections, despite these examples. I can't determine a way to find, for example, "roses" in the string "Red roses are pretty".
I was thinking of making a tokenized version of certain string fields; for example the following fields on a hypothetical record:
description: 'Red roses are pretty'
descriptionTokenized: ['Red', 'roses', 'are', 'pretty']
Testing this out makes CONTAINS somewhat useful when searching for distinct, space-separated substrings, but it's still not as good as SQL LIKE would be.
