How do I enforce a token not to be split by huggingface tokenizer? - huggingface-transformers

I have a string such as "xxx yyy zzz" and I am using the BERT tokenizer from Huggingface:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
mylistoftoken = tokenizer.tokenize("xxx yyy zzz")
However, I want to ensure that certain words (for example "abcd") are never split into subwords (such as "ab" and "##cd").
Is there a way for me to enforce that without post-processing the array of tokens and putting them back together?

There might be better solutions depending on your use case, but based on the information you provided, you are looking for add_tokens:
from transformers import BertTokenizer
t = BertTokenizer.from_pretrained("bert-base-uncased")
print(t.tokenize("xxx yyy zzz abcd"))
t.add_tokens(["yyy", "abcd"])
print(t.tokenize("xxx yyy zzz abcd"))
Output:
['xx', '##x', 'y', '##y', '##y', 'z', '##zz', 'abc', '##d']
['xx', '##x', 'yyy', 'z', '##zz', 'abcd']

Related

How to select words that are made up of the same letter using regex?

I have a dictionary text file that contains some words that I don't want.
Example:
aa
aaa
aaaa
bb
b
bbb
etc
I want to use a regular expression to select these words and remove them. However,
what I have seems to be getting too long and there must be a more efficient approach.
Here is my code so far:
/^a{1,6}$|^b{1,6}$|^c{1,6}$|^d{1,6}$|^e{1,6}$|^f{1,6}$|^g{1,6}$|^[i]{2,3}$/
It seems that I have to do this for every letter. How could I do this more succinctly?
It's a lot easier to collapse the word down to unique letters and remove all of those with just one letter in them:
words = "aa aaa aaaa bb b bbb etc aab abcabc"
words.split(/\s+/).select do |word|
  word.chars.uniq.length > 1
end
# => ["etc", "aab", "abcabc"]
This splits your string into words, then selects only those words that contain more than one distinct character (.chars.uniq).
^([a-z])\1?\1?\1?\1?\1?$
Match any single letter, followed by 5 optional backreferences to the initial letter.
This might work too:
^([a-z])\1{,5}$
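A minimal Ruby sketch of the backreference idea, assuming lowercase words and no cap on word length. The constant and method names are illustrative and do not come from the answers above:

```ruby
# Matches a word made of a single lowercase letter repeated one or more times.
# \A and \z anchor the whole string; \1 refers back to the captured letter.
SAME_LETTER = /\A([a-z])\1*\z/

# Hypothetical helper: keeps only the words that mix different letters.
def drop_same_letter_words(words)
  words.reject { |word| word.match?(SAME_LETTER) }
end

p drop_same_letter_words(%w[aa aaa aaaa bb b bbb etc aab abcabc])
# => ["etc", "aab", "abcabc"]
```

Filtering an array of words this way avoids having to clean up leftover blank lines or spaces afterwards.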
Try this:
\b([a-zA-Z])\1*\b
If you also want to match repeated digits or underscores (in addition to letters), use this instead:
\b([\w])\1*\b
Update:
To exclude I from being removed:
(?i)ii+|\b((?i)[a-hj-z])\1*\b
(?i) is added above to make letters not case sensitive.
Demo:
https://regex101.com/r/gFUWE8/7
You can try with this regex:
\b([a-z])\1{0,}\b
and replace each match with an empty string.
Ruby code sample:
re = /\b([a-z])\1{0,}\b/m
str = 'aa aaa aaaa bb b bbb abc aa a pqaaa '
result = str.gsub(re,'')
puts result
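Replacing matches with an empty string leaves the surrounding spaces behind. A small sketch, assuming whitespace-separated words, that also normalizes the spacing (the constant and method names are hypothetical):

```ruby
# A whole word consisting of one lowercase letter repeated any number of times.
REPEATED_LETTER_WORD = /\b([a-z])\1*\b/

# Hypothetical helper: removes such words, then collapses leftover whitespace.
def remove_repeated_letter_words(str)
  str.gsub(REPEATED_LETTER_WORD, '').split.join(' ')
end

puts remove_repeated_letter_words('aa aaa aaaa bb b bbb abc aa a pqaaa ')
# prints: abc pqaaa
```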

How to match undefined number of arguments or how to match known keywords in a regular expression

Some questions about regex, simple for you but not for me.
a) I want to match a string using a regular expression.
keyword term1,term2,term3,.....termN
The number of terms is undefined. I know how to begin but after I am lost ;-)
\(\w+)(\s+) but after ?\i
b) A little bit more complicated:
capitale france paris,england london,germany berlin, ...
I want to separate the couples ai bi in order to analyse them.
c) how to check if one among several keywords are present or not ?
direction LEFT,RIGHT,UP,DOWN
This isn't a good task for a regular expression as you want to use it. In addition, you're asking several questions that have to be addressed in several steps; determining duplicates isn't part of a regex's skill set.
A regex assumes a repeating pattern, and if you try to parse an entire line with an indeterminate number of elements at once, the pattern becomes very complex.
I'd recommend you use a simple split(',') to break the line on commas:
'keyword term1,term2,term3,.....termN'.split(',')
# => ["keyword term1", "term2", "term3", ".....termN"]
'capitale france paris,england london,germany berlin, ...'.split(',')
# => ["capitale france paris", "england london", "germany berlin", " ..."]
Once you have the line split, if you want to break apart complex entries on white-space, use a bare split:
'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
# ["england", "london"],
# ["germany", "berlin"],
# ["..."]]
This will all fall apart if there are embedded commas in a field. The data you're working with looks like CSV (comma-separated values), and that spec allows for them. If you're working with true CSV data, then use the CSV library that comes with Ruby. It will save your sanity and keep you from reinventing the wheel.
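As a rough illustration of why the CSV library helps, here is a sketch using CSV.parse_line. The sample line with a quoted, comma-containing field is invented for this example, and the helper name is hypothetical:

```ruby
require 'csv'

# Hypothetical helper: parses one comma-separated line, honoring quoted fields.
def split_record(line)
  CSV.parse_line(line)
end

p split_record('capitale france paris,"england, uk london",germany berlin')
# => ["capitale france paris", "england, uk london", "germany berlin"]
```

A plain split(',') would have cut the quoted field in two.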
To count keywords you can do something like:
entries = 'capitale france paris,england london,germany berlin, ...'.split(',').map(&:split)
# => [["capitale", "france", "paris"],
# ["england", "london"],
# ["germany", "berlin"],
# ["..."]]
keywords = Hash.new { |h, k| h[k] = 0 }
entries.each do |entry|
  entry.each do |e|
    keywords[e] += 1 if e[/\b(?:france|england|germany)\b/i]
  end
end
keywords # => {"france"=>1, "england"=>1, "germany"=>1}
There are other ways to do this using various methods in Enumerable and Array, but this demonstrates the technique. I used a pattern to locate the keyword hits because it's fast and can find the keyword within a string. You could do a lookup using index or find or any? but they'll slow your code as the list of keywords grows.
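One possible Enumerable-based variant is sketched below. It assumes keywords are whole, whitespace-delimited words (so, unlike the pattern above, it will not find a keyword embedded in a longer string) and it needs Ruby 2.7 or later for tally. The method name and the Set lookup are choices made for this sketch:

```ruby
require 'set'

# Hypothetical helper: counts how often each known keyword occurs in a line.
def count_keywords(line, keywords)
  lookup = Set.new(keywords.map(&:downcase)) # hash-based membership test
  line.split(',')              # break the line on commas
      .flat_map(&:split)       # break each entry on whitespace
      .map(&:downcase)         # compare case-insensitively
      .select { |word| lookup.include?(word) }
      .tally                   # word => number of occurrences
end

counts = count_keywords('capitale france paris,england london,germany berlin, ...',
                        %w[france england germany])
# counts == { "france" => 1, "england" => 1, "germany" => 1 }
```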

Scan String with Ruby Regular Expression

I am attempting to scan the following string with the following regular expression:
text = %q{akdce ALASKA DISTRICT COURT CM/ECFalmdce
ALABAMA MIDDLE DISTRICT COURTalndce
}
p courts = text.scan(/(ECF\w+)|(COURT\w+)/)
Ideally, what I want to do is scan the text and pull the text 'ECFalmdce' and 'COURTalndce'
With the regex I am using, I am trying to say I want a string that starts with either COURT or ECF followed by a random string of characters.
The array being returned is:
[["ECFalmdce", nil], [nil, "COURTalndce"]]
What is the deal with the nil's, does anyone have a more efficient way of writing the regex, and does anyone have a link to further documentation on match groups?
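Your regex has two separate capture groups, one for ECF and one for COURT, so scan returns a two-element array for every match, with nil in the slot of the group that did not participate. You can create non-capturing groups with ?: instead: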
text.scan(/(?:ECF|COURT)\w+/)
# => ["ECFalmdce", "COURTalndce"]
Edit
About non-capturing groups: you can use them to group part of a pattern with parentheses without capturing it.
They are written as (?:pattern).
You can find more information on regular expressions at http://www.regular-expressions.info/refadv.html
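To make the difference concrete, the following sketch runs both patterns against the text from the question (the constant name is an assumption of this example). It also shows flatten.compact as a fallback when the capture groups have to stay:

```ruby
SAMPLE = "akdce ALASKA DISTRICT COURT CM/ECFalmdce\nALABAMA MIDDLE DISTRICT COURTalndce\n"

# Two capture groups: scan yields one [group1, group2] pair per match,
# with nil for the group that did not take part in that match.
with_groups = SAMPLE.scan(/(ECF\w+)|(COURT\w+)/)
# with_groups == [["ECFalmdce", nil], [nil, "COURTalndce"]]

# No capture groups: scan yields the matched strings directly.
without_groups = SAMPLE.scan(/(?:ECF|COURT)\w+/)
# without_groups == ["ECFalmdce", "COURTalndce"]

# If the groups must stay, the nils can be dropped afterwards.
cleaned = with_groups.flatten.compact
# cleaned == ["ECFalmdce", "COURTalndce"]
```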

How to get GL, G0 from "GL=>G0" using Ruby Regular Expression

I have following string:
"xxxxx GL=>G0 yyyyyy "
I want to extract GL and G0 using ruby regular expression.
Thanks.
Well, this is rather vague. Do you want to pull out key/value pairs when separated by => ?
The following regexp may suit your needs:
matches = /.*(\w{2})=>(\w{2}).*/.match("xxxxxx GL=>G0 yyyyy ")
puts matches[1] # GL
puts matches[2] # G0
This assumes that your key/values are 2 characters long separated by a => sign. It does not permit spaces between the characters and the => sign. Let me know if this is what you need. Otherwise, provide a more detailed description of what strings you may need to parse.
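If the two-character restriction is too strict, a looser variant might look like the sketch below. It assumes keys and values are runs of word characters of any length with no spaces around =>, and the constant and method names are hypothetical:

```ruby
# Named groups make the two sides of "=>" easier to read back out.
PAIR = /(?<key>\w+)=>(?<value>\w+)/

# Hypothetical helper: returns [key, value] for the first pair, or nil.
def extract_pair(line)
  m = line.match(PAIR)
  m && [m[:key], m[:value]]
end

p extract_pair("xxxxx GL=>G0 yyyyyy ") # => ["GL", "G0"]
p extract_pair("no arrow here")        # => nil
```

For lines that hold several pairs, line.scan(/(\w+)=>(\w+)/) returns one [key, value] array per pair.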

stripping street numbers from street addresses

Using Ruby (newb) and Regex, I'm trying to parse the street number from the street address. I'm not having trouble with the easy ones, but I need some help on:
'6223 1/2 S FIGUEROA ST' ==> 'S FIGUEROA ST'
Thanks for the help!!
UPDATE(s):
'6223 1/2 2ND ST' ==> '2ND ST'
and from @pesto
'221B Baker Street' ==> 'Baker Street'
This will strip anything at the front of the string until it hits a letter:
street_name = address.gsub(/^[^a-zA-Z]*/, '')
If it's possible to have something like "221B Baker Street", then you have to use something more complex. This should work:
street_name = address.gsub(/^((\d[a-zA-Z])|[^a-zA-Z])*/, '')
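Both patterns above work character by character, so they also eat into a street name that starts with a digit: for '6223 1/2 2ND ST' the first leaves 'ND ST' and the second leaves 'D ST'. A token-based sketch that handles the three examples from the question is shown below. It assumes the house number is a run of digits with an optional single-letter suffix, optionally followed by a fraction; the constant name is made up for this example:

```ruby
# Assumed shape of the leading number: digits, an optional single-letter
# suffix ("221B"), then an optional fraction such as "1/2".
HOUSE_NUMBER = /\A\d+[A-Za-z]?\s+(?:\d+\/\d+\s+)?/

# Strips only that leading number and leaves the rest of the address alone.
def street_name(address)
  address.sub(HOUSE_NUMBER, '')
end

puts street_name('6223 1/2 S FIGUEROA ST') # prints: S FIGUEROA ST
puts street_name('6223 1/2 2ND ST')        # prints: 2ND ST
puts street_name('221B Baker Street')      # prints: Baker Street
```

Addresses that do not follow this shape (PO boxes, rural routes and so on) are returned unchanged or only partly stripped, so this is not a general address parser.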
Group matching:
.*\d\s(.*)
If you need to also take into account apartment numbers:
.*\d.*?\s(.*)
Which would take care of 123A Street Name
That should strip the numbers at the front (and the space) so long as there are no other numbers in the string. Just capture the first group (.*)
There's another stackoverflow set of answers:
Parse usable Street Address, City, State, Zip from a string
I think the Google/Yahoo decoder approach is best, but it depends on how often you do this and how many addresses you're talking about; otherwise the selected answer would probably be the best.
Can street names be numbers as well? E.g.
1234 45TH ST
or even
1234 45 ST
You could deal with the first case above, but the second is difficult.
I would split the address on spaces, skip any leading components that do not contain a letter and then join the remainder. I do not know Ruby, but here is a Perl example which also highlights the problem with my approach:
#!/usr/bin/perl
use strict;
use warnings;
my @addrs = (
    '6223 1/2 S FIGUEROA ST',
    '1234 45TH ST',
    '1234 45 ST',
);
for my $addr ( @addrs ) {
    my @parts = split / /, $addr;
    while ( @parts ) {
        my $part = shift @parts;
        if ( $part =~ /[A-Z]/ ) {
            print join(' ', $part, @parts), "\n";
            last;
        }
    }
}
C:\Temp> skip
S FIGUEROA ST
45TH ST
ST
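For readers working in Ruby, a rough equivalent of the Perl script could look like this. The method name is hypothetical, and split(' ') splits on any run of whitespace rather than on single spaces, which makes no difference for these inputs:

```ruby
# Drops leading components that contain no uppercase letter,
# then joins whatever remains back into a string.
def skip_leading_number_parts(address)
  address.split(' ').drop_while { |part| part !~ /[A-Z]/ }.join(' ')
end

['6223 1/2 S FIGUEROA ST', '1234 45TH ST', '1234 45 ST'].each do |addr|
  puts skip_leading_number_parts(addr)
end
# prints:
# S FIGUEROA ST
# 45TH ST
# ST
```

It shares the weakness shown in the output above: a purely numeric street name such as 45 is dropped along with the house number.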
Ouch! Parsing an address by itself can be extremely nasty unless you're working with standardized addresses. The reason for this is that the "primary number", which is often called the house number, can be at various locations within the string, for example:
RR 2 Box 15 (RR can also be Rural Route, HC, HCR, etc.)
PO Box 17
12B-7A
NW95E235
etc.
It's not a trivial undertaking. Depending upon the needs of your application, your best bet to get accurate information is to use an address verification web service. There are a handful of providers that offer this capability.
In the interest of full disclosure, I'm the founder of SmartyStreets. We have an address verification web service API that will validate and standardize your address to make sure it's real and allow you to get the primary/house number portion. You're more than welcome to contact me personally with questions.
/[^\d]+$/ will also match the same thing, except without using a capture group.
For future reference a great tool to help with regex is http://www.rubular.com/
