Why can't I use encode_plus properly? - huggingface-transformers

Here is my code:
from transformers import AutoTokenizer
segment_a = "food anecdotes service ambience or price?"
label_names = ['food', 'anecdotes', 'service', 'ambience', 'price']
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
sentences = ["I am sad! I can't fix this bug!"]
for x in sentences:
    # Encode the fixed first segment together with each sentence as a pair.
    x = tokenizer.encode_plus(segment_a, x)
    input_id = x['input_ids']
    print(tokenizer.convert_ids_to_tokens(input_id))
The output is:
['[CLS]', 'food', 'an', '##ec', '##dote', '##s', 'service', 'am', '##bie', '##nce', 'or', 'price', '?', '[SEP]', 'i', 'am', 'sad', '!', 'i', 'can', "'", 't', 'fix', 'this', 'bug', '!', '[SEP]']
I don't understand why the first segment is broken into subwords: ['food', 'an', '##ec', '##dote', '##s', 'service', 'am', '##bie', '##nce', 'or', 'price', '?'].
I expected each word to stay whole, like ['food', 'anecdotes', 'service', 'ambience', 'price'].
Does anybody know why this happens? I would appreciate any help!
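A likely explanation, offered as an assumption rather than a confirmed answer: bert-base-uncased uses WordPiece tokenization, which splits any word that is missing from its fixed vocabulary into known pieces, marking continuation pieces with ##. The sketch below is a toy greedy longest-match splitter, written in Ruby like the other examples on this page. TOY_VOCAB and wordpiece are hypothetical names with a made-up vocabulary, not part of the transformers library; the sketch only illustrates why 'anecdotes' can come out as several pieces while 'food' stays whole.

```ruby
require 'set'

# Hypothetical toy vocabulary; the real BERT vocabulary is far larger.
TOY_VOCAB = Set.new(%w[food service price or an ##ec ##dote ##s am ##bie ##nce])

# Greedy longest-match-first split of a single word into vocabulary pieces.
def wordpiece(word, vocab = TOY_VOCAB)
  pieces = []
  start = 0
  while start < word.length
    stop = word.length
    piece = nil
    # Try the longest remaining substring first, then shrink it from the right.
    while stop > start
      candidate = word[start...stop]
      candidate = '##' + candidate if start > 0 # continuation pieces carry a ## prefix
      if vocab.include?(candidate)
        piece = candidate
        break
      end
      stop -= 1
    end
    return ['[UNK]'] if piece.nil? # no known piece fits at this position
    pieces << piece
    start = stop
  end
  pieces
end

p wordpiece('food')      # => ["food"]
p wordpiece('anecdotes') # => ["an", "##ec", "##dote", "##s"]
p wordpiece('ambience')  # => ["am", "##bie", "##nce"]
```

If this is what is happening, the split is expected tokenizer behaviour rather than a misuse of encode_plus.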

Related

Can the performance of this type-selection be improved?

Assume I get some data like { :type => 'X', :some_other_key => 'foo' } at runtime and, depending on some conditions, I want to initialize the corresponding class for it. This is how we currently do it:
TYPE_CLASSES = [
  TypeA,
  TypeB,
  TypeC,
  # ...
  TypeUnknown
]
TYPE_CLASSES.detect {|type| type.responsible_for?(data)}.new
We iterate over a list of classes and ask each one if it is responsible for the given data and initialize the first one found.
The order of the TYPE_CLASSES is important and some responsible_for? methods do not only check the type but also other keys inside of data. So some specialized class checking for type == 'B' && some_other_key == 'foo' has to come before a generalized class checking only for type == 'B'.
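The ordering constraint can be illustrated with a minimal runnable sketch. The classes TypeBFoo, TypeB and TypeUnknown and the helper build_for are hypothetical stand-ins for the real types, which the question does not show.

```ruby
# Hypothetical classes; names and conditions are illustrative only.
class TypeBFoo
  def self.responsible_for?(data)
    data[:type] == 'B' && data[:some_other_key] == 'foo'
  end
end

class TypeB
  def self.responsible_for?(data)
    data[:type] == 'B'
  end
end

class TypeUnknown
  def self.responsible_for?(_data)
    true # catch-all, so it has to stay last
  end
end

# Specialized classes are listed before the general ones.
TYPE_CLASSES = [TypeBFoo, TypeB, TypeUnknown].freeze

# Instantiates the first class that claims responsibility for the data.
def build_for(data)
  TYPE_CLASSES.detect { |type| type.responsible_for?(data) }.new
end

p build_for(type: 'B', some_other_key: 'foo').class # => TypeBFoo
p build_for(type: 'B', some_other_key: 'bar').class # => TypeB
p build_for(type: 'X').class                        # => TypeUnknown
```

Swapping TypeBFoo and TypeB in the list would make the first call return a TypeB instance, which is why the order matters.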
This works fine and is easily extensible, but the TYPE_CLASSES list is already quite long. In the worst case, finding the right type means iterating to the last element and calling responsible_for? on every class along the way.
Is there any way to improve the performance and avoid iterating over each element while still preserving the order of the checks?
If matching the data set to classes is as complex as you describe, it might make sense to use decision tree building algorithms (example).
You can use the AI4R library to do that in Ruby.
You probably don't need to build that tree dynamically, so you can use the library to generate an optimized detection strategy for you. An example from the documentation:
DATA_LABELS = [ 'city', 'age_range', 'gender', 'marketing_target' ]
DATA_SET = [
  ['New York', '<30', 'M', 'Y'],
  ['Chicago', '<30', 'M', 'Y'],
  ['Chicago', '<30', 'F', 'Y'],
  ['New York', '<30', 'M', 'Y'],
  ['New York', '<30', 'M', 'Y'],
  ['Chicago', '[30-50)', 'M', 'Y'],
  ['New York', '[30-50)', 'F', 'N'],
  ['Chicago', '[30-50)', 'F', 'Y'],
  ['New York', '[30-50)', 'F', 'N'],
  ['Chicago', '[50-80]', 'M', 'N'],
  ['New York', '[50-80]', 'F', 'N'],
  ['New York', '[50-80]', 'M', 'N'],
  ['Chicago', '[50-80]', 'M', 'N'],
  ['New York', '[50-80]', 'F', 'N'],
  ['Chicago', '>80', 'F', 'Y']
]
id3 = ID3.new(DATA_SET, DATA_LABELS)
id3.get_rules
# => if age_range=='<30' then marketing_target='Y'
#    elsif age_range=='[30-50)' and city=='Chicago' then marketing_target='Y'
#    elsif age_range=='[30-50)' and city=='New York' then marketing_target='N'
#    elsif age_range=='[50-80]' then marketing_target='N'
#    elsif age_range=='>80' then marketing_target='Y'
#    else raise 'There was not enough information during training to do a proper induction for this data element' end
(So you will basically be able to take that last output and insert it into your code.)
You need to choose enough already-classified records to make DATA_SET and DATA_LABELS, and you also need to convert your hashes into arrays (which isn't that difficult: your hashes' keys are the DATA_LABELS, and your hashes' values are the rows of the DATA_SET array).
When you add a new TYPE_CLASS, just rerun the 'teaching' and update your detection code.
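The hash-to-array conversion described above could look roughly like the following sketch. LABELS, RECORDS and to_rows are hypothetical names, and the sample records are invented for illustration; only the conversion step is shown, not the AI4R call itself.

```ruby
# Hypothetical label order; the last column is the class to predict.
LABELS = %i[type some_other_key type_class].freeze

# Hypothetical, already-classified records.
RECORDS = [
  { type: 'B', some_other_key: 'foo', type_class: 'TypeBFoo' },
  { type: 'B', some_other_key: 'bar', type_class: 'TypeB' },
  { type: 'X', some_other_key: 'bar', type_class: 'TypeUnknown' }
].freeze

# Turns an array of hashes into rows whose columns follow the label order.
def to_rows(records, labels)
  records.map { |record| record.values_at(*labels) }
end

data_labels = LABELS.map(&:to_s)
data_set = to_rows(RECORDS, LABELS)

p data_labels    # => ["type", "some_other_key", "type_class"]
p data_set.first # => ["B", "foo", "TypeBFoo"]
```

Using values_at with an explicit label list keeps the column order fixed even if individual hashes were built with their keys in a different order.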

Refactor multiple gsub statements into 1

I'm trying to refactor this into one line so that all vowels in a string are capitalized. I tried using a hash, but that failed. I'm still too new to Ruby to know of any alternatives, despite my best efforts to look it up. Something like.... str.gsub!(/aeiou/
def LetterChanges(str)
  str.gsub!(/a/, "A") if str.include? "a"
  str.gsub!(/e/, "E") if str.include? "e"
  str.gsub!(/i/, "I") if str.include? "i"
  str.gsub!(/o/, "O") if str.include? "o"
  str.gsub!(/u/, "U") if str.include? "u"
  puts str
end
The best way is
str.tr('aeiou', 'AEIOU')
String#tr
Returns a copy of str with the characters in from_str replaced by the corresponding characters in to_str. If to_str is shorter than from_str, it is padded with its last character in order to maintain the correspondence.
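As a quick check of the two points in that description (a copy is returned, and a short to_str is padded with its last character), here is a small sketch. The helper name capitalize_vowels is hypothetical.

```ruby
# Returns a new string with lowercase vowels upcased; the argument is not modified.
def capitalize_vowels(str)
  str.tr('aeiou', 'AEIOU')
end

original = 'this is a test'
p capitalize_vowels(original) # => "thIs Is A tEst"
p original                    # => "this is a test"

# A one-character to_str is padded with its last character, so every vowel maps to '*'.
p original.tr('aeiou', '*')   # => "th*s *s * t*st"
```

Because tr returns a copy, use tr! if the original string is meant to be changed in place, as the gsub! calls in the question do.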
You can use gsub's second parameter, which is a replacement hash:
str.gsub!(/[aeiou]/, 'a' => 'A', 'e' => 'E', 'i' => 'I', 'o' => 'O', 'u' => 'U')
or, alternatively, pass a block:
str.gsub!(/[aeiou]/, &:upcase)
Both will return:
'this is a test'.gsub!(/[aeiou]/, 'a' => 'A', 'e' => 'E', 'i' => 'I', 'o' => 'O', 'u' => 'U')
# => "thIs Is A tEst"
'this is a test'.gsub!(/[aeiou]/, &:upcase)
# => "thIs Is A tEst"
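One caveat with the bang version: gsub! returns nil when nothing was substituted, so chaining on its result can fail for strings without vowels. A small sketch of the difference (the helper name upcase_vowels is hypothetical):

```ruby
# Non-destructive variant: always returns a String, even when nothing matches.
def upcase_vowels(str)
  str.gsub(/[aeiou]/, &:upcase)
end

s = 'rhythm' # contains none of a, e, i, o, u
p s.gsub!(/[aeiou]/, &:upcase) # => nil
p upcase_vowels(s)             # => "rhythm"
p upcase_vowels('this is a test') # => "thIs Is A tEst"
```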

.join an array output [closed]

I'm building a basic encryptor whose output is an array rather than a string. I'm guessing I need to use the .join method, but for the life of me I can't work out where to put it without getting an error.
class Encryptor
  def cipher
    {'a' => 'n', 'b' => 'o', 'c' => 'p', 'd' => 'q',
     'e' => 'r', 'f' => 's', 'g' => 't', 'h' => 'u',
     'i' => 'v', 'j' => 'w', 'k' => 'x', 'l' => 'y',
     'm' => 'z', 'n' => 'a', 'o' => 'b', 'p' => 'c',
     'q' => 'd', 'r' => 'e', 's' => 'f', 't' => 'g',
     'u' => 'h', 'v' => 'i', 'w' => 'j', 'x' => 'k',
     'y' => 'l', 'z' => 'm'}
  end
  def encrypt_letter(letter)
    lowercase_letter = letter.downcase
  end
  def encrypt(string)
    letters = string.split("")
    letters.collect do |letter|
      encrypted_letter = encrypt_letter(letter)
    end
  end
end
You could tighten up your encrypt_letter method by remembering that the last value evaluated in the method is also the return value.
def encrypt_letter(letter)
  cipher[letter.downcase]
end
Encryptor.new.encrypt_letter('h') #=> "u"
Also, the collect method returns an array of all the values returned by the block (the last value evaluated by the block), so there's no need to assign anything to a variable within the block. Since you have the array from collect (which is just all the encrypted letters), call join on that; because that is the final expression evaluated in the method, it is also the return value.
def encrypt(string)
  letters = string.split("")
  letters.collect {|letter| encrypt_letter(letter) }.join
end
Encryptor.new.encrypt("Hello") #=> "uryyb"
Technically, you could even just remove the letters variable and do it all in one line but I personally think it is a little more readable this way.
IMHO:
You could probably make all of the methods class methods since you aren't storing any instance variables and there doesn't seem to be any reason to keep it around outside of just encrypting a string.
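A minimal sketch of that class-method variant follows. It assumes the same ROT13 table as the question, built programmatically here only to keep the example short; the constant name CIPHER is an assumption.

```ruby
class Encryptor
  # Assumed equivalent of the question's table: each letter maps to the one 13 places on.
  CIPHER = ('a'..'z').zip(('a'..'z').to_a.rotate(13)).to_h.freeze

  # Class methods: no instance is needed because no state is stored.
  def self.encrypt_letter(letter)
    CIPHER[letter.downcase]
  end

  def self.encrypt(string)
    string.chars.map { |letter| encrypt_letter(letter) }.join
  end
end

p Encryptor.encrypt('Hello') # => "uryyb"
```

Callers then write Encryptor.encrypt(...) instead of Encryptor.new.encrypt(...).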
class Encryptor
  def cipher
    {'a' => 'n', 'b' => 'o', 'c' => 'p', 'd' => 'q',
     'e' => 'r', 'f' => 's', 'g' => 't', 'h' => 'u',
     'i' => 'v', 'j' => 'w', 'k' => 'x', 'l' => 'y',
     'm' => 'z', 'n' => 'a', 'o' => 'b', 'p' => 'c',
     'q' => 'd', 'r' => 'e', 's' => 'f', 't' => 'g',
     'u' => 'h', 'v' => 'i', 'w' => 'j', 'x' => 'k',
     'y' => 'l', 'z' => 'm'}
  end
  def encrypt_letter(letter)
    lowercase_letter = cipher[letter.downcase] # each letter passed in is encrypted here
  end
  def encrypt(string)
    letters = string.split("")
    encrypted_letter = [] # define an array to store each encrypted char
    letters.collect do |letter|
      encrypted_letter << encrypt_letter(letter) # accumulate encrypted chars in the array
    end
    encrypted_letter.join # now use join to form a string and return it
  end
end
Encryptor.new.encrypt("something") #=> "fbzrguvat"

Local Variables at Intermediate Steps

Hi, I was wondering if someone could explain why the map call in the code below is written the way it is. Specifically, why do we need to write
results = letters.map do |letter| encrypted_letter = encrypt_letter(letter)
instead of just doing
results = letters.map do |letter| encrypt_letter(letter)
class Encryptor
  def cipher
    {"a" => "n", "b" => "o", 'c' => 'p', 'd' => 'q',
     'e' => 'r', 'f' => 's', 'g' => 't', 'h' => 'u',
     'i' => 'v', 'j' => 'w', 'k' => 'x', 'l' => 'y',
     'm' => 'z', 'n' => 'a', 'o' => 'b', 'p' => 'c',
     'q' => 'd', 'r' => 'e', 's' => 'f', 't' => 'g',
     'u' => 'h', 'v' => 'i', 'w' => 'j', 'x' => 'k',
     'y' => 'l', 'z' => 'm'}
  end
  def encrypt_letter(letter)
    lowercase_letter = letter.downcase
    cipher[lowercase_letter]
  end
  def encrypt(string)
    letters = string.split("")
    results = letters.map do |letter|
      encrypted_letter = encrypt_letter(letter)
    end
    results.join
  end
  def decrypt_letter(letter)
    lowercase_letter = letter.downcase
    cipher.key(lowercase_letter)
  end
  def decrypt(string)
    letters = string.split("")
    results = letters.map do |letter|
      decrypted_letter = decrypt_letter(letter)
    end
    results.join
  end
end
No reason; the variable is immediately discarded.
I'd argue it's misleading and uncommunicative on top of it.
Most of the code seems a bit verbose, for example:
def encrypt(string)
  letters = string.split("")
  results = letters.map do |letter|
    encrypted_letter = encrypt_letter(letter)
  end
  results.join
end
IMO this would be more Ruby-esque as something closer to:
def encrypt(str)
  str.chars.collect { |c| encrypt_letter(c) }.join
end
It could be tighter than that, or written in other ways, although some of it is a matter of preference. For example, each_with_object could be used with the shovel operator, but that's less "functional".
(I prefer collect over map when collecting; a preference I find more communicative, if longer.)
Spreading functionality over more lines doesn't make things readable, but it depends on context. People new to Ruby or method chaining might be confused by the (IMO more canonical) one-liner.
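For comparison, here is a sketch of the each_with_object alternative mentioned above next to the collect version. The constant ROT13 and both method names are hypothetical; the table is assumed to match the question's cipher.

```ruby
# Assumed equivalent of the question's cipher table.
ROT13 = ('a'..'z').zip(('a'..'z').to_a.rotate(13)).to_h.freeze

def encrypt_letter(letter)
  ROT13[letter.downcase]
end

# collect version: build an array of encrypted letters, then join it.
def encrypt_with_collect(str)
  str.chars.collect { |c| encrypt_letter(c) }.join
end

# each_with_object version: append to a mutable string with the shovel operator.
def encrypt_with_accumulator(str)
  # to_s turns nil (a character with no mapping) into an empty string before appending.
  str.chars.each_with_object(+'') { |c, out| out << encrypt_letter(c).to_s }
end

p encrypt_with_collect('hello')     # => "uryyb"
p encrypt_with_accumulator('hello') # => "uryyb"
```

Both return the same string; the second mutates an accumulator, which is the less "functional" style referred to above.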
As others say, there is no reason for it. The code was obviously written by a beginner. In addition to Dave Newton's point, it is a bad habit to define a constant hash inside a method such as cipher. Each time that method is called, a new hash is created, and this happens for every letter. That is a huge waste of resources.
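The allocation difference can be observed directly. In this sketch, WithMethod and WithConstant are hypothetical classes with a truncated table, used only to compare object identity.

```ruby
class WithMethod
  def cipher
    { 'a' => 'n', 'b' => 'o' } # truncated table; a new Hash is built on every call
  end
end

class WithConstant
  CIPHER = { 'a' => 'n', 'b' => 'o' }.freeze # built once, when the class body runs

  def cipher
    CIPHER
  end
end

a = WithMethod.new
p a.cipher.equal?(a.cipher) # => false, two distinct Hash objects

b = WithConstant.new
p b.cipher.equal?(b.cipher) # => true, the same object every time
```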
Using the hash, you can simply do this:
h = {"a" => "n", "b" => "o", 'c' => 'p', 'd' => 'q',
     'e' => 'r', 'f' => 's', 'g' => 't', 'h' => 'u',
     'i' => 'v', 'j' => 'w', 'k' => 'x', 'l' => 'y',
     'm' => 'z', 'n' => 'a', 'o' => 'b', 'p' => 'c',
     'q' => 'd', 'r' => 'e', 's' => 'f', 't' => 'g',
     'u' => 'h', 'v' => 'i', 'w' => 'j', 'x' => 'k',
     'y' => 'l', 'z' => 'm'}
# A lambda used as default_proc must accept two arguments (the hash and the key).
h.default_proc = ->(_hash, key) { key } # unmapped characters map to themselves
"hello world".gsub(/./, h)
# => "uryyb jbeyq"
But I would rather go with this:
from = "abcdefghijklmnopqrstuvwxyz"
to = "nopqrstuvwxyzabcdefghijklm"
"hello world".tr(from, to)
# => "uryyb jbeyq"
There is no functional reason for it. Sometimes programmers feel more comfortable having an explicit variable destination for their results. Maybe this is one of those cases. Same with the decrypted_letter case.

Replacing accented characters in Ruby 1.9.3, without Rails

I would like to use Ruby 1.9.3 to replace accented UTF-8 characters with their ASCII equivalents. For example,
Acsády --> Acsady
The traditional way to do this is with the Iconv library, which is part of Ruby's standard library. You can do something like this:
require 'iconv'
str = 'Acsády'
Iconv.iconv('ascii//TRANSLIT', 'utf8', str)
Which will yield
Acsa'dy
One then has to delete the apostrophes. While this method still works in Ruby 1.9.3, I get a warning saying that Iconv is deprecated and that String#encode should be used instead. However, String#encode does not offer exactly the same functionality. Undefined characters raise an exception by default, but you can handle them either by setting :undef => :replace (which replaces undefined characters with the default '?' character) or by setting the :fallback option to a hash that maps undefined characters in the source encoding to the target encoding. I am wondering whether standard :fallback hashes are available in the standard library or through some gem, so that I don't have to write my own hash to handle all possible accent marks.
@raina77ow:
Thanks for the response. That's exactly what I was looking for. However, after looking at the thread you linked to I realized that a better solution may be to simply match unaccented characters to their accented equivalents, in the way that databases use a character set collation. Does Ruby have anything equivalent to collations?
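The String#encode options described in the question behave as in the following sketch. The helper names and the one-entry fallback hash are illustrative assumptions; a real table would need an entry for every accented character of interest.

```ruby
# Default behaviour: characters with no ASCII equivalent raise an exception.
def strict_ascii(str)
  str.encode('ASCII')
rescue Encoding::UndefinedConversionError => e
  "raised #{e.class}"
end

# :undef => :replace substitutes '?' for each undefined character.
def replaced_ascii(str)
  str.encode('ASCII', undef: :replace)
end

# :fallback maps each undefined character through a hash (illustrative, one entry only).
def fallback_ascii(str)
  str.encode('ASCII', fallback: { 'á' => 'a' })
end

p strict_ascii('Acsády')   # => "raised Encoding::UndefinedConversionError"
p replaced_ascii('Acsády') # => "Acs?dy"
p fallback_ascii('Acsády') # => "Acsady"
```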
I use this:
def convert_to_ascii(s)
  undefined = ''
  fallback = { 'À' => 'A', 'Á' => 'A', 'Â' => 'A', 'Ã' => 'A', 'Ä' => 'A',
               'Å' => 'A', 'Æ' => 'AE', 'Ç' => 'C', 'È' => 'E', 'É' => 'E',
               'Ê' => 'E', 'Ë' => 'E', 'Ì' => 'I', 'Í' => 'I', 'Î' => 'I',
               'Ï' => 'I', 'Ñ' => 'N', 'Ò' => 'O', 'Ó' => 'O', 'Ô' => 'O',
               'Õ' => 'O', 'Ö' => 'O', 'Ø' => 'O', 'Ù' => 'U', 'Ú' => 'U',
               'Û' => 'U', 'Ü' => 'U', 'Ý' => 'Y', 'à' => 'a', 'á' => 'a',
               'â' => 'a', 'ã' => 'a', 'ä' => 'a', 'å' => 'a', 'æ' => 'ae',
               'ç' => 'c', 'è' => 'e', 'é' => 'e', 'ê' => 'e', 'ë' => 'e',
               'ì' => 'i', 'í' => 'i', 'î' => 'i', 'ï' => 'i', 'ñ' => 'n',
               'ò' => 'o', 'ó' => 'o', 'ô' => 'o', 'õ' => 'o', 'ö' => 'o',
               'ø' => 'o', 'ù' => 'u', 'ú' => 'u', 'û' => 'u', 'ü' => 'u',
               'ý' => 'y', 'ÿ' => 'y' }
  s.encode('ASCII',
           fallback: lambda { |c| fallback.key?(c) ? fallback[c] : undefined })
end
You can check for other symbols you might want to provide fallback for here
I suppose what you look for is similar to this question. If it is, you can use the ports of Text::Unidecode written for Ruby - like this gem (or this fork of it, looks like it's ready to be used in 1.9), for example.
The following code will work for a pretty wide variety of European languages, including Greek, which is hard to get right and is not handled by the previous answers.
# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
return s.unicode_normalize(:nfc).tr("ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿΆΈΊΌΐάέήίΰϊϋόύώỏἀἁἂἃἄἅἆἈἉἊἌἍἎἐἑἒἓἔἕἘἙἜἝἠἡἢἣἤἥἦἧἨἩἫἬἭἮἯἰἱἲἳἴἵἶἷἸἹἼἽἾὀὁὂὃὄὅὈὉὊὋὌὍὐὑὓὔὕὖὗὙὝὠὡὢὣὤὥὦὧὨὩὫὬὭὮὯὰὲὴὶὸὺὼᾐᾑᾓᾔᾕᾖᾗᾠᾤᾦᾧᾰᾱᾳᾴᾶᾷᾸᾹῂῃῄῆῇῐῑῒῖῗῘῙῠῡῢῥῦῨῩῬῳῴῶῷῸ","AAAAAAÆCEEEEIIIINOOOOOOUUUUYaaaaaaæceeeeiiiinoooooouuuuyyΑΕΙΟιαεηιυιυουωoαααααααΑΑΑΑΑΑεεεεεεΕΕΕΕηηηηηηηηΗΗΗΗΗΗΗιιιιιιιιΙΙΙΙΙοοοοοοΟΟΟΟΟΟυυυυυυυΥΥωωωωωωωωΩΩΩΩΩΩΩαεηιουωηηηηηηηωωωωααααααΑΑηηηηηιιιιιΙΙυυυρυΥΥΡωωωωΟ")
end
It was generated by the following long, slow program, which shells out to the Linux command-line utility "unicode". If you come across characters that are missing from this list, add them to the longer program, re-run it, and you'll get code output that handles those characters. For example, I think the list is missing some characters that occur in Czech, such as a c with a wedge on it, as well as Latin-language vowels with macrons. If these new characters carry accents that aren't on the list below, the program will not strip them until you add the names of the new accents to names_of_accents.
$stderr.print %q{
This program generates ruby code to strip accents from characters in Latin and Greek scripts.
Progress will be printed to stderr, the final result to stdout.
}
all_characters = %q{
ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÑÒÓÔÕÖØÙÚÛÜÝàáâãäåæçèéêëìíîïñòóôõöøùúûüýÿ
ΆΈΊΌΐάέήίϊόύώỏἀἁἃἄἅἈἐἑἒἔἕἘἙἜἡἢἣἤἥἦἨἩἫἬἮἰἱἲἴἵἶἸὀὁὂὃὄὅὊὍὐὑὓὔὕὖὗὝὡὢὣὤὥὧὨὩὰὲὴὶὸὺὼᾐᾗᾳᾴᾶῂῆῇῖῥῦῳῶῷῸᾤᾷἂἷ
ὌᾖὉἧἷἂῃἌὬὉἷὉἷῃὦἌἠἳᾔἉᾦἠἳᾔὠᾓὫἝὈἭἼϋὯῴἆῒῄΰῢἆὙὮᾧὮᾕὋἍἹῬἽᾕἓἯἾᾠἎῗἾῗἯἊὭἍᾑᾰῐῠᾱῑῡᾸῘῨᾹῙῩ
}.gsub(/\s/,'')
# The first line is a list of accented Latin characters. The second and third lines are polytonic Greek.
# The Greek on this list includes every character occurring in the Project Gutenberg editions of Homer, except for some that seem to be
# mistakes (smooth rho, phi and theta in symbol font). Duplications and characters out of order in this list have no effect at run time.
# Also includes vowels with macron and vrachy, which occur in Project Perseus texts sometimes.
# The following code shells out to the linux command-line utility called "unicode," which is installed as the debian package
# of the same name.
# Documentation: https://github.com/garabik/unicode/blob/master/README
names_of_accents = %q{
acute grave circ and rough smooth ypogegrammeni diar with macron vrachy tilde ring above diaeresis cedilla stroke
tonos dialytika hook perispomeni dasia varia psili oxia
}.split(/\s+/).select { |x| x.length>0}.sort.uniq
# The longer "circumflex" will first be shortened to "circ" in later code.
def char_to_name(c)
  return `unicode --string "#{c}" --format "{name}"`.downcase
end
def name_to_char(name)
  list = `unicode "#{name}" --format "{pchar}" --max 0` # returns a string of possibilities, not just exact matches
  # Usually, but not always, the unaccented character is the first on the list.
  list.chars.each { |c|
    if char_to_name(c)==name then return c end
  }
  raise "Unable to convert name #{name} to a character, list=#{list}."
end
regex = "( (#{names_of_accents.join("|")}))+"
from = ''
to = ''
all_characters.chars.sort.uniq.each { |c|
  name = char_to_name(c).gsub(/circumflex/,'circ')
  name.gsub!(/#{regex}/,'')
  without_accent = name_to_char(name)
  from = from+c.unicode_normalize(:nfc)
  to = to+without_accent.unicode_normalize(:nfc)
  $stderr.print c
}
$stderr.print "\n"
print %Q{
# Code generated by code at https://stackoverflow.com/a/68338690/1142217
# See notes there on how to add characters to the list.
def remove_accents(s)
return s.unicode_normalize(:nfc).tr("#{from}","#{to}")
end
}
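As a point of comparison with the generated tr table, here is a different technique, not the one this answer uses: decompose the string with unicode_normalize(:nfd) and delete the combining marks. This is only a sketch; the method name strip_combining_marks is hypothetical, and it assumes a Ruby version that provides String#unicode_normalize, as the generated code above already does. Characters that have no canonical decomposition, such as Æ, pass through unchanged, so it does not cover everything the table does.

```ruby
# Decompose into base characters plus combining marks, drop the marks, then recompose.
def strip_combining_marks(s)
  s.unicode_normalize(:nfd).gsub(/\p{Mn}/, '').unicode_normalize(:nfc)
end

p strip_combining_marks('Acsády') # => "Acsady"
p strip_combining_marks('ά')      # => "α"
p strip_combining_marks('Æ')      # => "Æ"
```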
