I have a special, non-language use case using a fixed vocabulary—i.e., a relatively small set of generated tokens that represent the entire vocabulary of our "language." I’d like to be able to use this with any of the different models and I’m wondering what would be the best approach? It’s just a vocab.txt file of short strings, which I don’t think will work with any of the BPE tokenizers. Am I correct in that assumption? Also, is there a way to “force” a vocabulary onto any of the tokenizers?
To clarify, our "language" uses prefixes to identify certain types of tokens, which have certain functions in the overall syntax. We want to be able to mask by type during inference, both on input and as part of the selection process, for example by limiting top-k or top-p sampling to a given type. With a fixed/hand-tuned vocabulary we can be very specific about which ids, or how many ids, we need; i.e., we know which tokens are used by each type, so we can mask/filter accordingly. However, with BPE tokenization a given type may be tokenized with any number of tokens, making this process much less straightforward.
The motivation is just to make life easier by fitting into the Huggingface universe a little better, so we can experiment with off-the-shelf models more fluently. We already have this working using the standard BertTokenizer with both GPT2 and RoBERTa, but it would be nice to be able to experiment with different Huggingface models "out of the box," so to speak (using Trainers, Pipelines, etc.). With the BertTokenizer we just load our vocab.txt and we're done, so I wondered whether there would be some way of doing this with the other tokenizers (really, the BPE ones are the only issue, at this point).
It seems to me that being able to specify a vocab for any tokenizer would be more straightforward than getting our tokenizer working with other models. Though perhaps a better approach would be to look at streamlining that process? I suppose I could fork and modify AutoTokenizer...?
Any help much appreciated.
As far as I understand, the solution below might help you, since you can use this tokenizer the same way you would use the pre-trained ones.
I do not fully understand all the inner workings of the tokenizer, so I may well be off with this solution, but hopefully it can help someone.
The main idea is to subclass PreTrainedTokenizer. That way you only need to override a few key methods such as _tokenize, _convert_token_to_id, and so on, which is more straightforward than implementing a whole new tokenizer.
import json
from pathlib import Path
from typing import Dict, Optional, Tuple

from transformers import PreTrainedTokenizer


class FixedVocabTokenizer(PreTrainedTokenizer):
    def __init__(self, vocab: Dict[str, int], max_len: Optional[int] = None):
        # Set the lookup tables before calling the base constructor: newer
        # transformers releases may query the vocabulary during initialization.
        self.__token_ids = vocab
        self.__id_tokens: Dict[int, str] = {value: key for key, value in vocab.items()}
        super().__init__(max_len=max_len)

    def _tokenize(self, text: str, **kwargs):
        # The vocabulary is fixed, so tokens are simply space-separated strings.
        return text.split(' ')

    def _convert_token_to_id(self, token: str) -> int:
        # Tokens outside the vocabulary map to the id of the unk token.
        return self.__token_ids[token] if token in self.__token_ids else self.unk_token_id

    def _convert_id_to_token(self, index: int) -> str:
        return self.__id_tokens[index] if index in self.__id_tokens else self.unk_token

    def get_vocab(self) -> Dict[str, int]:
        return self.__token_ids.copy()

    def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
        if filename_prefix is None:
            filename_prefix = ''
        vocab_path = Path(save_directory, filename_prefix + 'vocab.json')
        with open(vocab_path, 'w') as f:
            json.dump(self.__token_ids, f)
        return (str(vocab_path),)

    @property
    def vocab_size(self) -> int:
        return len(self.__token_ids)
if __name__ == '__main__':
    # your custom, fixed vocabulary
    custom_vocab = {
        '[UNK]': 0,
        'word0': 1,
        'word1': 2,
        'word2': 3,
        'word3': 4,
        'word4': 5,
        'word5': 6,
        '[CLS]': 7,
        '[SEP]': 8,
        '[PAD]': 9
    }
    model_max_len = 8
    tokenizer = FixedVocabTokenizer(custom_vocab, max_len=model_max_len)
    # tell your tokenizer about your special tokens
    tokenizer.add_special_tokens({
        'unk_token': '[UNK]',
        'pad_token': '[PAD]',
        'cls_token': '[CLS]',
        'sep_token': '[SEP]'
    })
    res = tokenizer(
        [
            'word1 word2 word word1 word3',
            'word2 word0 word0 word3 word5 word4 word2 word1 word0'
        ],
        padding=True,
        truncation=True
    )
    # the result should look something like this
    # res -> BatchEncoding(
    #     data: {
    #         'input_ids': [[2, 3, 0, 2, 4, 9, 9, 9], [3, 1, 1, 4, 6, 5, 3, 2]],
    #         'attention_mask': [[1, 1, 1, 1, 1, 0, 0, 0], [1, 1, 1, 1, 1, 1, 1, 1]],
    #         ...
    #     },
    #     ...
    # )
This is the solution I could come up with. However, I could not figure out whether something similar is possible with PreTrainedTokenizerFast, so one more caveat is that this method only gives you a slow tokenizer.
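With a fixed vocabulary, the type-based masking described in the question reduces to a lookup from token type to ids. The sketch below is only an illustration under assumptions: the prefix convention (`note_`, `dur_`), the vocabulary, and the helper names `ids_by_type` and `mask_logits` are all hypothetical, and plain Python lists stand in for the model's logits.

```python
import math
from typing import Dict, List

# Hypothetical vocabulary: the text before the first '_' marks the token type.
vocab: Dict[str, int] = {
    "[UNK]": 0, "[PAD]": 1,
    "note_C": 2, "note_D": 3, "note_E": 4,
    "dur_4": 5, "dur_8": 6,
}


def ids_by_type(vocab: Dict[str, int]) -> Dict[str, List[int]]:
    """Group token ids by the prefix that precedes the first underscore."""
    groups: Dict[str, List[int]] = {}
    for token, idx in vocab.items():
        if "_" in token:
            groups.setdefault(token.split("_", 1)[0], []).append(idx)
    return groups


def mask_logits(logits: List[float], allowed: List[int]) -> List[float]:
    """Set every logit outside `allowed` to -inf so it can never be selected."""
    keep = set(allowed)
    return [x if i in keep else -math.inf for i, x in enumerate(logits)]


type_ids = ids_by_type(vocab)
logits = [0.1, 0.2, 1.5, 0.3, 2.0, 3.0, 0.4]  # one score per vocabulary id
masked = mask_logits(logits, type_ids["dur"])  # only duration tokens survive
best = max(range(len(masked)), key=masked.__getitem__)
```

The masked scores can then go through whatever top-k or top-p step is already in use; because each type maps to a known set of ids, the filter is a single pass over the logits.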
I'm trying to build a simple Bayesian network, where rain and sprinkler are the parents of wetgrass, but rain and sprinkler each have three states (fuzzy-logic type, rather than the usual two boolean states), and wetgrass has two states (true/false). I can't find anywhere in the pymc3 docs what syntax to use to describe the CPTs for this. I'm trying the following, based on 2-state examples, but it's not generalizing to three states the way I thought it would. Can anyone show the correct way to do this? (And also for the more general case where wetgrass has three states too.)
rain = mc.Categorical('rain', p = np.array([0.5, 0. ,0.5]))
sprinker = mc.Categorical('sprinkler', p=np.array([0.33,0.33,0.34]))
wetgrass = mc.Categorical('wetgrass',
    mc.math.switch(rain,
        mc.math.switch(sprinker, 10, 1, -4),
        mc.math.switch(sprinker, -20, 1, 3),
        mc.math.switch(sprinker, -5, 1, -0.5)))
[gives error at wetgrass definition:
Wrong number of inputs for Switch.make_node (got 4((, , , )), expected 3)
]
As I understand it, switch is a theano function similar to (c ? a : b) in a C program, which only makes a two-way choice. It may be possible to set up the CPT using a whole load of binary switches like this, but I really want to just give a 3D matrix CPT as the input, as in BNT and other Bayes net libraries. Is this currently possible?
You can code a three-way switch using two nested switches:
# tt is theano.tensor; tt.eq builds the symbolic comparison (== on tensors does not)
tt.switch(tt.eq(sprinker, 0),
          10,
          tt.switch(tt.eq(sprinker, 1), 1, -4))
But in general it is probably better to index into a table:
table = tt.constant(np.array([[...], [...]]))
value = table[rain, sprinker]
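To make the table idea concrete, here is a minimal plain-Python sketch of a conditional probability table for three-state parents and a two-state child. All probabilities are made-up placeholders, and `cpt` and `wetgrass_probs` are hypothetical names; nested lists are used so the example runs without theano.

```python
from typing import List

# Hypothetical CPT: cpt[rain][sprinkler] = [P(wetgrass=False), P(wetgrass=True)].
# Outer index: rain state (0, 1, 2); middle index: sprinkler state (0, 1, 2).
cpt: List[List[List[float]]] = [
    [[0.99, 0.01], [0.60, 0.40], [0.20, 0.80]],  # rain state 0
    [[0.50, 0.50], [0.30, 0.70], [0.10, 0.90]],  # rain state 1
    [[0.10, 0.90], [0.05, 0.95], [0.01, 0.99]],  # rain state 2
]


def wetgrass_probs(rain: int, sprinkler: int) -> List[float]:
    """Look up the wetgrass distribution for one combination of parent states."""
    return cpt[rain][sprinkler]


def rows_are_normalized(table: List[List[List[float]]]) -> bool:
    """Check that every innermost row is a valid probability distribution."""
    return all(abs(sum(row) - 1.0) < 1e-9 for plane in table for row in plane)


probs = wetgrass_probs(rain=2, sprinkler=0)
```

With a tensor constant holding the same numbers, the lookup would presumably be the `table[rain, sprinker]` indexing shown above, with the resulting vector used as the `p` argument of the child Categorical. Giving wetgrass a third state only means making each innermost row three entries long.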
Say I have a sorted Array, such as this:
myArray = [1, 2, 3, 4, 5, 6]
Suppose I call Enumerable#partition on it:
p myArray.partition(&:odd?)
Must the output always be the following?
[[1, 3, 5], [2, 4, 6]]
The documentation doesn't state this; this is what it says:
partition { |obj| block } → [ true_array, false_array ]
partition → an_enumerator
Returns two arrays, the first containing the elements of enum for which the block evaluates to true, the second containing the rest.
If no block is given, an enumerator is returned instead.
But it seems logical to assume partition works this way.
Through testing Matz's interpreter, it appears to be the case that the output works like this, and it makes full sense for it to be like this. However, can I count on partition working this way regardless of the Ruby version or interpreter?
Note: I tagged this as implementation-agnostic because I couldn't find any other tag that describes my concern. Feel free to change the tag to something better if you know of one.
No, you can't rely on the order. The reason is parallelism.
A traditional serial implementation of partition would loop through each element of the array, evaluating the block one element at a time, in order. As each call to odd returns, the element is immediately pushed onto the appropriate true or false array.
Now imagine an implementation which takes advantage of multiple CPU cores. It still iterates through the array in order, but each call to odd can return out of order. odd(myArray[2]) might return before odd(myArray[0]) resulting in [[3, 1, 5], [2, 4, 6]].
List processing idioms such as partition which run a list through a function (most of Enumerable) benefit greatly from parallel processing, and most computers these days have multiple cores. I wouldn't be surprised if a future Ruby implementation took advantage of this. The writers of the API documentation for Enumerable likely carefully omitted any mention of process ordering to leave this optimization possibility open.
The documentation makes no explicit mention of this, but judging from the official code, it does retain ordering:
static VALUE
partition_i(RB_BLOCK_CALL_FUNC_ARGLIST(i, arys))
{
    struct MEMO *memo = MEMO_CAST(arys);
    VALUE ary;
    ENUM_WANT_SVALUE();

    if (RTEST(enum_yield(argc, i))) {
        ary = memo->v1;
    }
    else {
        ary = memo->v2;
    }
    rb_ary_push(ary, i);
    return Qnil;
}
This code gets called from the public interface.
Essentially, the ordering in which your enumerable emits objects gets retained with the above logic.
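The same logic can be sketched outside of C. The following is a Python analogue of the loop above, not Ruby's implementation: each element is tested once, in iteration order, and appended to one of two lists, which is why the relative order within each list matches the input. The function name `partition` and its signature are choices made for this illustration.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

T = TypeVar("T")


def partition(items: Iterable[T], pred: Callable[[T], bool]) -> Tuple[List[T], List[T]]:
    """Split items into (matching, rest), preserving the input order in both."""
    true_items: List[T] = []
    false_items: List[T] = []
    for item in items:
        # Mirrors the C code: pick the target array, then push the element.
        target = true_items if pred(item) else false_items
        target.append(item)
    return true_items, false_items


result = partition([1, 2, 3, 4, 5, 6], lambda n: n % 2 == 1)
```

Because the append happens inside the same sequential loop that evaluates the predicate, there is no point at which a later element could overtake an earlier one.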
I want to make Mathematica insensitive to the case of a function's first letter. For example, it should accept both "Plot" and "plot" as the plotting function.
I agree with george's sentiment: "You don't want to do that." It is common practice to start user Symbols with lowercase letters which both identifies them and prevents collisions with built-ins. Nevertheless you can do this in several ways. One is just to create aliases as george also suggested, e.g.
plot = Plot;
sin = Sin;
plot[sin[x], {x, 0, 6}]
This has the advantage of working even in packages because it does not rely on the Front End. However, because these are not true aliases it will fail in some cases, e.g.:
evaluate = Evaluate;
Hold[evaluate[2 + 2]]
Hold[evaluate[2 + 2]]
Whereas the "real" function behaves like this:
Hold[Evaluate[2 + 2]]
Hold[4]
To get complete equivalence, though only in the Front End, you can use $PreRead. (Example.) You will need to build a list of rules that replace the string form of each lowercase Symbol with the uppercase string. I shall do that only for all Symbols in the System` context.
With[{rules = Thread[ToLowerCase[#] -> #] & @ Names["System`*"]},
  $PreRead = # /. rules &
];
Now both of these examples work:
plot[sin[x], {x, 0, 6}]
hold[evaluate[2 + 2], 3 + 4]
The latter producing:
Hold[4, 3 + 4]
This is not a direct answer to your question and I strongly advise you against redefining Mathematica functions just for the sake of the letter-case.
Nevertheless, have you seen that there is an option Match case in command completion when you go to Edit -> Preferences -> Interface?
If you turn this off, then you can type plot in the notebook and you get the correct Plot as suggestion from the autocompletion. You only have to hit enter and the correct command is inserted.
I have been trying to write a simple Ruby program to parse a simple PDF file and extract the text I am interested in. I found that pdf-reader is quite a good gem for PDF parsing. I have read through the examples that come with the gem and some tutorials about it.
I have tried the callback method and was able to get all the text from my PDF file, but I did not understand the concept behind the arguments of some of the callbacks.
For example, suppose my PDF has a simple table with 3 columns and 2 rows: the header row values are (Name, Address, Age) and the first row values are (Arun, Hoskote, 22). When I run the following Ruby script:
receiver = PDF::Reader::RegisterReceiver.new
reader = PDF::Reader.new("Arun.pdf")
reader.pages.each do |page|
  page.walk(receiver)
  receiver.callbacks.each do |cb|
    puts cb.inspect
  end
end
It prints a series of callbacks, among which the interesting show_text_with_positioning callbacks look like the following:
{:name=>:show_text_with_positioning, :args=>[["N", 5, "am", -4, "e"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ad", 6, "d", 3, "ress"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Age"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["Ar", 4, "u", 3, "n"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["H", 3, "o", -5, "sk", 9, "o", -5, "te"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
{:name=>:show_text_with_positioning, :args=>[["22"]]}
{:name=>:show_text_with_positioning, :args=>[[" "]]}
From the above callbacks, what does args represent with respect to the PDF file? If I want to extract only the name value, 'Arun' (anything can appear here), or the age value, i.e. '22' (any value can appear here), how can I do that in a Ruby program? Is there any pdf-parser API or Ruby API to get only the value(s) of interest from a PDF file?
How can I write a Ruby program to access the particular callback that gives me the text I want?
If you particularly only want the text, you can do something like this (but probably using a different stream as the destination for the text):
receiver = PDF::Reader::TextReceiver.new($stdout)
PDF::Reader.file("Arun.pdf", receiver)
Once you have the text, you could use regular expressions or whatever to get the specific value you want out of it.
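On the question of what args means: each show_text_with_positioning callback corresponds to the PDF TJ operator, whose array mixes string fragments with numbers, and the numbers are spacing (kerning) adjustments between the fragments rather than text. Joining only the strings recovers each word. The sketch below is in Python purely for illustration and is not part of pdf-reader; the helper name `text_from_tj` is hypothetical, and pairing headers with values assumes the three-column layout described in the question.

```python
from typing import Dict, List, Union

TJArg = Union[str, int, float]


def text_from_tj(args: List[TJArg]) -> str:
    """Join the string fragments of one TJ array, dropping the spacing numbers."""
    return "".join(part for part in args if isinstance(part, str))


# The args arrays printed in the question, in the order they appeared.
callbacks: List[List[TJArg]] = [
    ["N", 5, "am", -4, "e"], [" "],
    ["Ad", 6, "d", 3, "ress"], [" "],
    ["Age"], [" "],
    ["Ar", 4, "u", 3, "n"], [" "],
    ["H", 3, "o", -5, "sk", 9, "o", -5, "te"], [" "],
    ["22"], [" "],
]

# Rebuild each cell and discard the whitespace-only ones.
cells = [text_from_tj(args) for args in callbacks]
words = [cell for cell in cells if cell.strip()]

# Assumption: the first three words are the header row, the next three the data row.
record: Dict[str, str] = dict(zip(words[:3], words[3:6]))
```

The same filter-and-join step translates directly to Ruby on the `:args` arrays from RegisterReceiver; real documents may order or split their text differently, so the header/value pairing has to be adapted to the layout at hand.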
How can I mock an array's sort to expect a lambda expression?
This is a trivial example of my problem:
# initializing the data
l = lambda { |a,b| a <=> b }
array = [ 1, 2, 3, 4, 5 ]
sorted_array = [ 2, 3, 8, 9, 1]
# I expect that sort will be called using the lambda as a parameter
array.expects(:sort).with( l ).returns( sorted_array )
# perform the sort using the lambda expression
temp = array.sort{|a,b| l.call(a,b) }
Now, at first I expected that this would work; however, I got the following error:
- expected exactly once, not yet invoked: [ 1, 2, 3, 4, 5 ].sort(#<Proc:0xb665eb48>)
I realize that this will not work because l is not passed as a parameter to sort. However, is there another way to do what this code is trying to accomplish?
NOTE: I have figured out how to solve my issue without figuring out how to do the above. I will leave this open just in case someone else has a similar problem.
Cheers,
Joseph
Mocking methods with blocks can be quite confusing. One of the keys is to be clear about what behaviour you want to test. I can't tell from your sample code exactly what it is that you want to test. However, you might find the documentation for Mocha::Expectation#yields (or even Mocha::Expectation#multiple_yields) useful.