How to make a feature matrix without the encode_plus method in tokenizers

I am working on a low-resource language and need to make a classifier.
I used the tokenizers library to train the following tokenizers: WLV, BPE, UNI, WPC. I have saved the result of each into a json file.
I load each of the tokenizers using the Tokenizer.from_file function.
from tokenizers import Tokenizer
tokenizer_WLV = Tokenizer.from_file('tokenizer_WLV.json')
I can see it is loaded properly. However, only the encode method exists,
so if I do tokenizer_WLV.encode(s1), I get an output like
Encoding(num_tokens=7, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])
and I can see each token along with the id as following.
out_wlv = tokenizer_WLV.encode(s1)
I can use the encode_batch
def tokenize_sentences(sentences, tokenizer, max_seq_len = 128):
    tokenizer.enable_padding(pad_id=3, pad_token="[PAD]", direction='right')
    tokenized_sentences = tokenizer.encode_batch(sentences)
    return tokenized_sentences
which results in something like
[Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing]),
Encoding(num_tokens=40, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing])]
I need to build a feature matrix of size m x n, where m is the number of observations and n is the number of unique tokens. encode_plus does this automatically, so I am curious what the most efficient way of constructing this feature matrix is.

encode_plus is a method of the Hugging Face transformers tokenizers (it is already deprecated and should therefore be ignored).
The alternative that the Hugging Face tokenizers and the Hugging Face transformers tokenizers provide is __call__.
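Independently of which method produces the encodings, the m x n matrix asked about can be assembled from the token ids alone. The following is a minimal sketch, assuming a plain count (bag-of-tokens) matrix is what is wanted; count_matrix is a hypothetical helper, and the id lists stand in for the ids attribute of each Encoding returned by encode_batch.

```python
def count_matrix(id_lists, vocab_size, pad_id=None):
    """Build an m x n matrix of token counts from m lists of token ids."""
    matrix = [[0] * vocab_size for _ in id_lists]
    for row, ids in zip(matrix, id_lists):
        for token_id in ids:
            if token_id != pad_id:  # padding should not count as a feature
                row[token_id] += 1
    return matrix

# With the objects from the question, the inputs would come from something like:
#   encodings = tokenizer_WLV.encode_batch(sentences)
#   X = count_matrix([e.ids for e in encodings],
#                    tokenizer_WLV.get_vocab_size(), pad_id=3)
# Here two short, already padded id lists are used instead (made-up ids).
id_lists = [[4, 5, 5, 3, 3], [6, 4, 3, 3, 3]]
X = count_matrix(id_lists, vocab_size=8, pad_id=3)
print(X)
```

For a large vocabulary a dense list of lists wastes memory, so a sparse representation would probably be the better choice in practice.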


Adding new tokens to BERT/RoBERTa while retaining tokenization of adjacent tokens

I'm trying to add some new tokens to BERT and RoBERTa tokenizers so that I can fine-tune the models on a new word. The idea is to fine-tune the models on a limited set of sentences with the new word, and then see what it predicts about the word in other, different contexts, to examine the state of the model's knowledge of certain properties of language.
In order to do this, I'd like to add the new tokens and essentially treat them like new ordinary words (that the model just hasn't happened to encounter yet). They should behave exactly like normal words once added, with the exception that their embedding matrices will be randomly initialized and then be learned during fine-tuning.
However, I'm running into some issues doing this. In particular, the tokens surrounding the newly added tokens do not behave as expected when initializing the tokenizer with do_basic_tokenize=False in the case of BERT (in the case of RoBERTa, changing this setting doesn't seem to affect the output in the examples here). The problem can be observed in the following example. In the case of BERT, the period following the newly added token is not tokenized as a subword (i.e., it is tokenized as . instead of as the expected ##.). In the case of RoBERTa, the word following the newly added token is treated as though it does not have a preceding space (i.e., it is tokenized as a instead of as Ġa).
from transformers import BertTokenizer, RobertaTokenizer
new_word = 'mynewword'
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize = False)
bert.tokenize('mynewword') # does not exist yet
# ['my', '##ne', '##w', '##word']
bert.tokenize('testing.')
# ['testing', '##.']
bert.add_tokens(new_word)
bert.tokenize('mynewword') # now it does
# ['mynewword']
bert.tokenize('mynewword.')
# ['mynewword', '.']
roberta = RobertaTokenizer.from_pretrained('roberta-base', do_basic_tokenize = False)
roberta.tokenize('mynewword') # does not exist yet
# ['my', 'new', 'word']
roberta.tokenize('A testing a')
# ['A', 'Ġtesting', 'Ġa']
roberta.add_tokens(new_word)
roberta.tokenize('mynewword') # now it does
# ['mynewword']
roberta.tokenize('A mynewword a')
# ['A', 'mynewword', 'a']
Is there a way for me to add the new tokens while getting the behavior of the surrounding tokens to match what it would be if there were not an added token there? I feel like it's important because the model could end up learning that (for instance), the new token can occur before ., while most others can only occur before ##. That seems like it would affect how it generalizes. In addition, I could turn on basic tokenization to solve the BERT problem here, but that wouldn't really reflect the full state of the model's knowledge, since it collapses the distinction between different tokens. And that doesn't help with the RoBERTa problem, which is still there regardless.
In addition, I'd ideally be able to add the RoBERTa token as Ġmynewword, but I'm assuming that as long as it never occurs as the first word in a sentence, that shouldn't matter.
After continuing to try and figure this out, I seem to have found something that might work. It's not necessarily generalizable, but one can load a tokenizer from a vocabulary file (+ a merges file for RoBERTa). If you manually edit those files to add the new tokens in the right way, everything seems to work as expected. Here's an example for BERT:
import os
from transformers import BertTokenizer
bert = BertTokenizer.from_pretrained('bert-base-uncased', do_basic_tokenize=False)
bert.tokenize('testing.') # ['testing', '##.']
bert.tokenize('mynewword') # ['my', '##ne', '##w', '##word']
bert_vocab = bert.get_vocab() # get the pretrained tokenizer's vocabulary
bert_vocab.update({'mynewword' : len(bert_vocab)}) # add the new word to the end
with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    # a BERT vocab file holds one token per line, in id order
    tmp_vocab_file.write('\n'.join(sorted(bert_vocab, key=bert_vocab.get)))
new_bert = BertTokenizer(name_or_path = 'bert-base-uncased', vocab_file = 'vocab.tmp', do_basic_tokenize=False)
new_bert.model_max_length = 512 # to match this setting on the pretrained one
new_bert.tokenize('mynewword') # ['mynewword']
new_bert.tokenize('mynewword.') # ['mynewword', '##.']
os.remove('vocab.tmp') # cleanup
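A toy version of WordPiece's greedy longest-match-first lookup may help show why editing the vocabulary file gives the expected ##. here. This is only a sketch: wordpiece is a hypothetical stand-in, not the transformers implementation, and the six-entry vocabulary is made up to mirror the outputs above.

```python
def wordpiece(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first split of one whitespace-delimited word."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = '##' + candidate  # non-initial pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:  # no piece matches at this position
            return [unk]
        tokens.append(piece)
        start = end
    return tokens

vocab = {'testing', '##.', 'my', '##ne', '##w', '##word'}
before = wordpiece('mynewword.', vocab)
after = wordpiece('mynewword.', vocab | {'mynewword'})
print(before)
print(after)
```

Because the new word is just another vocabulary entry, the period is still matched as a continuation piece. Tokens registered through add_tokens are instead split out before this step runs, which is consistent with the period being treated as the start of a new word in the question.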
RoBERTa is much harder since we also have to add the pairs to merges.txt. I have a way of doing this that works for the new tokens, but unfortunately it can affect tokenization of words that are subparts of the new tokens, so it's not perfect. If one is using this to add made-up words (as in my use case), you can just choose strings that are unlikely to cause problems (unlike the example here of 'mynewword'), but in other cases it is likely to cause problems. (While it's not a perfect solution, hopefully it might get others to see a better one.)
import re
import json
import requests
from transformers import RobertaTokenizer
roberta = RobertaTokenizer.from_pretrained('roberta-base')
roberta.tokenize('testing a') # ['testing', 'Ġa']
roberta.tokenize('mynewword') # ['my', 'new', 'word']
# update the vocabulary with the new token and the 'Ġ' version
roberta_vocab = roberta.get_vocab()
roberta_vocab.update({'mynewword' : len(roberta_vocab)})
roberta_vocab.update({chr(288) + 'mynewword' : len(roberta_vocab)}) # chr(288) = 'Ġ'
with open('vocab.tmp', 'w', encoding = 'utf-8') as tmp_vocab_file:
    json.dump(roberta_vocab, tmp_vocab_file, ensure_ascii=False)
# get and modify the merges file so that the new token will always be tokenized as a single word
url = 'https://huggingface.co/roberta-base/resolve/main/merges.txt'
roberta_merges = requests.get(url).content.decode().split('\n')
# this is a helper function to loop through a list of new tokens and get the byte-pair encodings
# such that the new token will be treated as a single unit always
def get_roberta_merges_for_new_tokens(new_tokens):
    merges = [gen_roberta_pairs(new_token) for new_token in new_tokens]
    merges = [pair for token in merges for pair in token]
    return merges
def gen_roberta_pairs(new_token, highest = True):
    # highest is used to determine whether we are dealing with the Ġ version or not.
    # we add those pairs at the end, which is only if highest = True
    # this is the hard part...
    chrs = [c for c in new_token] # list of characters in the new token, which we will recursively iterate through to find the BPEs
    # the simplest case: add one pair
    if len(chrs) == 2:
        if not highest:
            return tuple([chrs[0], chrs[1]])
        return [' '.join([chrs[0], chrs[1]])]
    # add the tokenization of the first letter plus the other two letters as an already merged pair
    if len(chrs) == 3:
        if not highest:
            return tuple([chrs[0], ''.join(chrs[1:])])
        return gen_roberta_pairs(chrs[1:]) + [' '.join([chrs[0], ''.join(chrs[1:])])]
    if len(chrs) % 2 == 0:
        pairs = gen_roberta_pairs(''.join(chrs[:-2]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += tuple([''.join(chrs[:-2]), ''.join(chrs[-2:])])
        if not highest:
            return pairs
    else:
        # for new tokens with odd numbers of characters, we need to add the final two tokens before the
        # third-to-last token
        pairs = gen_roberta_pairs(''.join(chrs[:-3]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-2:]), highest = False)
        pairs += gen_roberta_pairs(''.join(chrs[-3:]), highest = False)
        pairs += tuple([''.join(chrs[:-3]), ''.join(chrs[-3:])])
        if not highest:
            return pairs
    pairs = tuple(zip(pairs[::2], pairs[1::2]))
    pairs = [' '.join(pair) for pair in pairs]
    # pairs with the preceding special token
    g_pairs = []
    for pair in pairs:
        if re.search(r'^' + ''.join(pair.split(' ')), new_token):
            g_pairs.append(chr(288) + pair)
    pairs = g_pairs + pairs
    pairs = [chr(288) + ' ' + new_token[0]] + pairs
    pairs = list(dict.fromkeys(pairs)) # remove any duplicates
    return pairs
# first line of this file is a comment; add the new pairs after it
roberta_merges = roberta_merges[:1] + get_roberta_merges_for_new_tokens(['mynewword']) + roberta_merges[1:]
roberta_merges = list(dict.fromkeys(roberta_merges))
with open('merges.tmp', 'w', encoding = 'utf-8') as tmp_merges_file:
    # merges.txt holds one merge pair per line
    tmp_merges_file.write('\n'.join(roberta_merges))
new_roberta = RobertaTokenizer(name_or_path='roberta-base', vocab_file='vocab.tmp', merges_file='merges.tmp')
# for some reason, we have to re-add the <mask> token to roberta if we are using it, since
# loading the tokenizer from a file will cause it to be tokenized as separate parts
# the weight matrix is identical, and once re-added, a fill-mask pipeline still identifies
# the mask token correctly (not shown here)
new_roberta.add_tokens(new_roberta.mask_token, special_tokens=True)
new_roberta.model_max_length = 512
new_roberta.tokenize('mynewword') # ['mynewword']
new_roberta.tokenize('mynewword a') # ['mynewword', 'Ġa']
new_roberta.tokenize(' mynewword') # ['Ġmynewword']
# however, this does not guarantee that tokenization of other words will not be affected
roberta.tokenize('mynew') # ['my', 'new']
new_roberta.tokenize('mynew') # ['myne', 'w']
import os
os.remove('vocab.tmp') # cleanup
os.remove('merges.tmp')
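The side effect on 'mynew' follows from how BPE applies merges in priority order. Below is a toy sketch of that mechanism; bpe is a hypothetical helper and both merge lists are made up for illustration, they are not taken from roberta-base.

```python
def bpe(word, merges):
    """Apply ranked merges (earlier in the list = higher priority) to a word."""
    ranks = {pair: i for i, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float('inf')))
        if best not in ranks:  # no applicable merge is left
            break
        merged, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Toy "pretrained" merges, then the same list with merges for a new token
# inserted at the top (highest priority), as the approach above does.
base = [('m', 'y'), ('n', 'e'), ('ne', 'w')]
patched = list(dict.fromkeys([('m', 'y'), ('n', 'e'), ('my', 'ne')] + base))
print(bpe('mynew', base))
print(bpe('mynew', patched))
```

In the patched list the inserted ('my', 'ne') merge outranks ('ne', 'w'), so the shorter word is split differently even though it was never the target of the change.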
If you want to add new tokens to fine-tune a Roberta-based model, consider training your tokenizer on your corpus. Take a look at the HuggingFace How To Train for a complete roadmap of how to do that.
I did that myself to fine-tune the XLM-Roberta-base on my health-related corpus.
Here's the snippet:
from tokenizers import ByteLevelBPETokenizer
from glob import glob
import os
CORPUS_TRAIN = 'corpus_train.shc'
TOKENIZER_DIR = 'your_tokenizer_dir'
paths = list(
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer(lowercase=False)
# Customize training
tokenizer.train(files=paths, vocab_size=32000, min_frequency=3, special_tokens=[
# Save files to disk
os.makedirs(TOKENIZER_DIR, exist_ok=True)
tokenizer.save_model(TOKENIZER_DIR)
The 32k parameter was arbitrarily chosen. It took 10min on my corpus, then I was able to train my model.
Inside the TOKENIZER_DIR you will see the vocab.json and merges.txt.
If you are using a custom script for training, you can load the tokenizer like this: tokenizer = RobertaTokenizerFast.from_pretrained(TOKENIZER_DIR, max_len=512).

Is there a way to infer topic distributions on unseen document from gensim LDA pre-trained model using matrix multiplication?

Is there a way to get the topic distribution of an unseen document using a pretrained LDA model without using the LDA_Model[unseenDoc] syntax? I am trying to implement my LDA model into a web application, and if there was a way to use matrix multiplication to get a similar result then I could use the model in javascript.
For example, I tried the following:
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            # the stemming/lemmatizing step is not shown in the post; the raw token is kept here
            processedList.append(token)
    return processedList
lda_model = models.LdaModel.load(r'LDAModel\GoldModel') #Load pretrained LDA model
dictionary = Dictionary.load(r"ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
In the pretrained model there are 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the index of the topic and the second element is the probability of the topic. In the "Matrix Multiplication" version, the position in the array is the topic index and the value is the probability. Clearly the two don't match up.
For example, the lda_model[unseenDoc] shows that topic 0 has a 0.07 probability, but the matrix multiplication method says that topic has a 0.028 probability. Am I missing a step here?
You can review the full source code used by LDAModel's get_document_topics() method in your installation, or online at:
(It also makes use of the inference() method in the same file.)
It's doing a lot more scaling/normalization/clipping than your code, which is likely the cause of the discrepancy. But you should be able to examine, line by line, where your process and gensim's differ, to get the steps to match up.
It also shouldn't be hard to use the gensim code's steps as guidance for creating parallel Javascript code that, given the right parts of the model's state, can reproduce its results.
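As a rough guide to what such a port would involve, here is a simplified sketch of the kind of iterative update that, as far as I understand it, inference() performs for a single document. It is not gensim's code: infer_topics and digamma are hypothetical helpers, the start value is fixed rather than random, and exp_elog_beta and alpha are assumed to stand in for the model's expElogbeta and alpha arrays (note that this is not the same matrix as get_topics()). The toy numbers are made up.

```python
import math

def digamma(x):
    """Digamma for x > 0, via the recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def infer_topics(bow, exp_elog_beta, alpha, iterations=50, threshold=1e-3):
    """bow: [(term_id, count)]; exp_elog_beta: topics x terms; alpha: per-topic prior."""
    k = len(alpha)
    ids = [term_id for term_id, _ in bow]
    counts = [count for _, count in bow]
    gamma = [1.0] * k  # assumption: constant start, to keep the sketch deterministic
    for _ in range(iterations):
        total = digamma(sum(gamma))
        exp_elog_theta = [math.exp(digamma(g) - total) for g in gamma]
        # normaliser per word: how much all topics together explain it
        phinorm = [sum(exp_elog_theta[t] * exp_elog_beta[t][w] for t in range(k)) + 1e-100
                   for w in ids]
        new_gamma = [alpha[t] + exp_elog_theta[t] *
                     sum(c / p * exp_elog_beta[t][w] for c, p, w in zip(counts, phinorm, ids))
                     for t in range(k)]
        change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / k
        gamma = new_gamma
        if change < threshold:
            break
    norm = sum(gamma)
    return [g / norm for g in gamma]  # normalised, so the values sum to 1

toy_beta = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]  # 2 topics x 3 terms
dist = infer_topics([(0, 5)], toy_beta, alpha=[0.5, 0.5])
print(dist)
```

The loop only needs exponentials, logarithms and sums, so the same structure should carry over to JavaScript once the two arrays are exported from the trained model. A single dot product, by contrast, skips both the iteration and the final normalisation.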

Why does a Gensim Doc2vec object return empty doctags?

My question is how I should interpret my situation?
I trained a Doc2Vec model following this tutorial https://blog.griddynamics.com/customer2vec-representation-learning-and-automl-for-customer-analytics-and-personalization/.
For some reason, doc_model.docvecs.doctags returns {}. But doc_model.docvecs.vectors_docs seems to return a proper value.
Why doesn't the Doc2Vec object return any doctags, even though vectors_docs is populated?
Thank you for any comments and answers in advance.
This is the code I used to train a Doc2Vec model.
from gensim.models.doc2vec import LabeledSentence, TaggedDocument, Doc2Vec
from tqdm import tqdm
import timeit
import gensim
embeddings_dim = 200 # dimensionality of user representation
filename = f'models/customer2vec.{embeddings_dim}d.model'
class TaggedDocumentIterator(object):
    def __init__(self, df):
        self.df = df
    def __iter__(self):
        for row in self.df.itertuples():
            yield TaggedDocument(words=dict(row._asdict())['all_orders'].split(), tags=[dict(row._asdict())['user_id']])
it = TaggedDocumentIterator(combined_orders_by_user_id)
doc_model = gensim.models.Doc2Vec(vector_size=embeddings_dim,
                                  epochs=20) # use fixed learning rate
train_corpus = list(it)
doc_model.build_vocab(train_corpus)
for epoch in tqdm(range(10)):
    doc_model.alpha -= 0.005 # decrease the learning rate
    doc_model.min_alpha = doc_model.alpha # fix the learning rate, no decay
    doc_model.train(train_corpus, total_examples=doc_model.corpus_count, epochs=doc_model.iter)
    print('Iteration:', epoch)
doc_model.save(filename)
print(f'Model saved to [{filename}]')
doc_model = Doc2Vec.load(filename)
print(f'Model loaded from [{filename}]')
doc_model.docvecs.vectors_docs returns
If all of the tags you supply are plain Python ints, those ints are used as the direct-indexes into the vectors-array.
This saves the overhead of maintaining a mapping from arbitrary tags to indexes.
But, it may also cause an over-allocation of the vectors array, to be large enough for the largest int tag you provided, even if other lower ints are never used. (That is: if you provided a single document, with a tags=[1000000], it will allocate an array sufficient for tags 0 to 1000000, even if most of those never appear in your training data.)
If you want model.docvecs.doctags to collect a list of all your tags, use string tags rather than plain ints.
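The rule described above can be modelled in a few lines. This is only an illustration of the behaviour as stated here, not gensim's code; docvec_rows is a hypothetical helper and it ignores the case of mixed int and string tags.

```python
def docvec_rows(tags):
    """Return (rows allocated, tag-to-index mapping) under the rule described above."""
    if all(isinstance(t, int) for t in tags):
        # plain ints index the array directly, so no mapping is kept
        return max(tags) + 1, {}
    mapping = {}
    for t in tags:
        mapping.setdefault(t, len(mapping))  # first-seen order gives the row index
    return len(mapping), mapping

print(docvec_rows([1000000]))
print(docvec_rows(['u1', 'u2', 'u1']))
```

With int tags the mapping stays empty, which matches the empty doctags in the question, while the vectors array is still allocated and filled.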
Separately: don't call train() multiple times in your own loop, or manage the alpha learning-rate in your own code, unless you have an overwhelmingly good reason to do so. It's inefficient & error-prone. (Your code, for example, is actually performing 200 training-epochs, and if you were to increase the loop count without carefully adjusting your alpha increment, you could wind up with nonsensical negative alpha values, a very common error in code following this bad practice.) Call .train() once with your desired number of epochs. Set the alpha and min_alpha at reasonable starting and nearly-zero values (probably just the defaults unless you're sure your change is helping) and then leave them alone.
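The arithmetic behind that warning can be checked without gensim. The sketch below assumes the model starts from gensim's default alpha of 0.025, since the constructor call in the question does not show an alpha argument; the other numbers are taken from the question's loop.

```python
START_ALPHA = 0.025      # assumption: gensim's default starting alpha
STEP = 0.005             # decrement applied on every pass of the loop
LOOPS = 10               # range(10) in the question
EPOCHS_PER_CALL = 20     # epochs=20 in the constructor, reused by each train() call

# alpha in effect during each pass (rounded to hide floating-point noise)
alphas = [round(START_ALPHA - STEP * (i + 1), 6) for i in range(LOOPS)]
total_epochs = LOOPS * EPOCHS_PER_CALL
first_negative = next(i for i, a in enumerate(alphas) if a < 0)

print('total epochs:', total_epochs)
print('first pass with a negative alpha (0-based):', first_negative)
```

Under that assumption the learning rate reaches zero on the fifth pass and is negative from the sixth onward, so the later passes work against the earlier ones.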

Ruby storing data for queries

I have a string
This is telephone outbound call data where each new line represents a new phone call.
(Call From, Call To, Duration, Line Type)
I want to save this data in a way that allows me to query a specific number and get a string output of the number, its type, its total minutes used, and all the calls that it made (outbound calls). I just want to do this in a single ruby file.
Thus typing in this
4813243948, Type 2, 3.9 Minutes total
1234433948, 1.3
2435677524, 1.3
5245654367, 1.3
I am wondering if I should try to store values in arrays, or create a custom class and make each number an object of a class then append the calls to each number.. not sure how to do the class method. Having a different array for each number seems like it would get cluttered as there are thousands of numbers and millions of calls. Of course, the provided input string is a very small portion of the real source.
I have a string
This looks like a CSV. If you slap some headers on top, you can parse it into an array of hashes.
str = "4813243948,1234433948,1.3,Type2
1234433948,4813243948,1.3,Type1"
require 'csv'
calls = CSV.parse(str, headers: %w[from to length type], header_converters: :symbol).map(&:to_h)
# => [{:from=>"4813243948", :to=>"1234433948", :length=>"1.3", :type=>"Type2"},
# {:from=>"1234433948", :to=>"4813243948", :length=>"1.3", :type=>"Type1"}]
This is essentially the same as your original string, only it trades some memory for ease of access. You can now "query" this dataset like this:
calls.select{ |c| c[:from] == '4813243948' }
And then aggregate for presentation however you wish.
Naturally, searching through this array takes linear time, so if you have millions of calls you might want to organize them in a more efficient search structure (like a B-Tree) or move the whole dataset to a real database.
If you only want to make queries for the number the call originated from, you could store the data in a hash where the keys are the "call from" numbers and the value is an array, or another hash, containing the rest of the data. For example:
{ '4813243948' => [{ call_to: '1234433948', duration: 1.3, line_type: 'Type2' }, ...], ... }
If the dataset is very large, or you need more complex queries, it might be better to store it in a database and just query it directly.
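The grouping idea is language-neutral; here is a minimal sketch of it, written in Python only as notation. In Ruby, a Hash with a default block such as Hash.new { |h, k| h[k] = [] } plays the role of defaultdict. The helper names and the sample rows are made up, and the caller's line type is assumed to be the same on all of its rows.

```python
from collections import defaultdict

def index_calls(rows):
    """Group (from, to, minutes, line_type) rows by the originating number."""
    by_caller = defaultdict(list)
    for call_from, call_to, minutes, line_type in rows:
        by_caller[call_from].append((call_to, minutes, line_type))
    return by_caller

def report(by_caller, number):
    """Format one number's summary line followed by one line per outbound call."""
    calls = by_caller.get(number, [])
    if not calls:
        return f"{number}: no outbound calls"
    total = sum(minutes for _, minutes, _ in calls)
    lines = [f"{number}, {calls[0][2]}, {total:.1f} Minutes total"]
    lines += [f"{call_to}, {minutes}" for call_to, minutes, _ in calls]
    return "\n".join(lines)

rows = [
    ("4813243948", "1234433948", 1.3, "Type 2"),
    ("4813243948", "2435677524", 1.3, "Type 2"),
    ("1234433948", "4813243948", 1.3, "Type 1"),
    ("4813243948", "5245654367", 1.3, "Type 2"),
]
calls_by_number = index_calls(rows)
print(report(calls_by_number, "4813243948"))
```

Building the index is a single pass over the data, after which each query is a hash lookup instead of a scan over every call.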

How do you index select over a matrix in Rust

I have a matrix that looks like this:
pub struct Matrix<T> {
    pub grid: Vec<T>,
}
/// constructor
impl<T> Matrix<T> {
    pub fn new(data: Vec<T>) -> Matrix<T> {
        Matrix { grid: data }
    }
}
I need to be able to do something like this, something close to the iloc of Python's pandas if you know what that is (Pandas Cheat sheet):
// m equals a matrix of type Vec<String>
// the matrix is 11 rows by 4 columns. I don't want the first row
// (because of headers), nor the first or last column.
// args {cols: Vec<?>||tuple(i64,str,i64), rows: Vec<?>||tuple(i64,str,i64)}
let data = m.extract(vec![1,2], (-1,':',))
I have looked at those other libraries and they are great for what they do but I am working on something that these libraries do not accomplish.
The arguments for the extraction are columns, rows, which I would like to take either a tuple or a Vec, but if that is not possible I will settle for a vector. As stated, this syntax is derived from Python's pandas library (see the iloc function in the cheat sheet linked above). So vec![1,2] would return only columns 1 and 2, with the index starting at 0. For the tuple (-1,':',), I would want to return all rows except the first. If instead it was (,':',-1), I would get all the rows except the last. (1,':',6) would return rows 1-6 including all in between. (,':',) would return all rows. Both (0,':',0) and vec![0] give only the first row/column, except that the vector type cannot specify a range.
As for the data structure I want to return, it is a Matrix like the ones used in the other matrix libraries: a matrix type with rows, columns and data. I am reading a csv file into a generic Matrix<T>, which I should probably turn into just a String matrix since I doubt I will ever change that. I need to be able to convert the data in the String matrix into f64 and then have a new matrix type hold that to do matrix math on. I still need to keep the String matrix as a reference to create my own logic to communicate positive or negative feedback (true/false) on fields that contain things like country names by a match or with yes/no in columns.
How would I iterate over this matrix which will contain a Vec<Vec<T>> to retrieve a specific matrix from it? I have headers, and I want all the data so I need to skip the index of Matrix.grid[0] but none else.
Do I need to write an impl Iterator, and if so, how would this be done? I have seen an example with a different format, but do I need lifetimes to accomplish this? Is there anything I need to add to the struct or with an impl to get the capability to copy, get a mut ref to all values, or get a ref to all values?
I am rather new to Rust and am very eager to learn more. I have read the whole Rust book, taken a course online, but still I find myself lost when dealing with a Matrix.
