Why does the huggingface T5 tokenizer ignore some of the whitespaces? - huggingface-transformers

I am using the T5 model and tokenizer for a downstream task. I want to add certain whitespace tokens to the tokenizer, like the line ending (\n) and the tab (\t). Adding these tokens works, but somehow the tokenizer always collapses repeated occurrences. So, it tokenizes the sequence "\n\n" as a single line ending, and the sequence "\n\n\n\n" is tokenized as two line endings, and so on. See below to reproduce.
from transformers import T5Tokenizer
tokenizer = T5Tokenizer.from_pretrained("t5-large")
tokenizer.add_tokens(["\n"])
tokenizer.encode("\n") # returns [32100, 1] as expected
tokenizer.encode("\n\n") # returns [32100, 1] but expected would be [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # returns [32100, 32100, 1] but expected would be [32100, 32100, 32100, 32100, 1]
What is the reasoning behind this behaviour? Is it a bug or something related to how the tokenizer works? I noticed that this only happens for added whitespace tokens but not for other characters.
Is there a way to prevent the tokenizer from ignoring the repeated whitespaces?

The behaviour is explained by how the tokenize method in T5Tokenizer strips tokens by default. What one can do is add the token '\n' as a special token to the tokenizer. Because special tokens are never split, it works as expected.
It is a bit hacky but seems to work.
from tokenizers import AddedToken
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
print(tokenizer.special_tokens_map)
Then it tokenizes the '\n' without skipping any occurrences. Note that using AddedToken is important, because somehow the following does NOT work:
tokenizer.add_special_tokens({"additional_special_tokens": ["\n"]})
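For completeness, here is the working variant from above end to end. A minimal sketch, assuming the added token gets id 32100 as in the question:
from transformers import T5Tokenizer
from tokenizers import AddedToken
tokenizer = T5Tokenizer.from_pretrained("t5-large")
# Register '\n' as an additional special token so it is never split or normalized away.
tokenizer.add_special_tokens({"additional_special_tokens": [AddedToken("\n")]})
tokenizer.encode("\n\n")     # expected: [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # expected: [32100, 32100, 32100, 32100, 1]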
Edit
After spending more time on it, I actually found a way to add it as a normal token without using special tokens. The main reason for the issue is the normalization process that happens behind the scenes even before tokenization. When you add a new token, you can specify whether it should be normalized or not. By setting normalized to False, you prevent the tokenizer from stripping consecutive occurrences of the added token.
from tokenizers import AddedToken
tokenizer.add_tokens(AddedToken("\n", normalized=False))
You can find more information at this link: https://huggingface.co/course/en/chapter6/4?fw=pt
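As a quick sanity check of this route (again assuming the new token ends up with id 32100, as in the question):
from transformers import T5Tokenizer
from tokenizers import AddedToken
tokenizer = T5Tokenizer.from_pretrained("t5-large")
# normalized=False keeps the normalizer from touching the added token before tokenization.
tokenizer.add_tokens(AddedToken("\n", normalized=False))
tokenizer.encode("\n")       # expected: [32100, 1]
tokenizer.encode("\n\n")     # expected: [32100, 32100, 1]
tokenizer.encode("\n\n\n\n") # expected: [32100, 32100, 32100, 32100, 1]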

Related

Elasticsearch standard tokenizer behaviour and word boundaries

I am not sure why the standard tokenizer (used by the default standard analyzer) behaves like this in this scenario:
- If I use the word system.exe it generates the token system.exe. I understand . is not a word breaker.
- If I use the word system32.exe it generates the tokens system and exe. I don't understand this; why does it break the word when it finds a number followed by a .?
- If I use the word system32tm.exe it generates the token system32tm.exe. As in the first example, it works as expected, not breaking the word into different tokens.
I have read http://unicode.org/reports/tr29/#Word_Boundaries but I still don't understand why a number + dot (.) is a word boundary.
As mentioned in the question, the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The rule at http://unicode.org/reports/tr29/#Word_Boundaries is not to break if you have letter + dot + letter; see WB6 in the above spec. So tm.exe is preserved and system32.exe is split.
The spec says that it always splits, except for the listed exceptions. Exceptions WB6 and WB7 say that it never splits on letter, then punctuation, then letter. Rules WB11 and WB12 say that it never splits on number, then punctuation, then number. However, there is no such rule for number, then punctuation, then letter, so the default rule applies and system32.exe gets split.
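If you want to see these rules in action, Elasticsearch's _analyze API shows exactly which tokens the standard tokenizer produces. A small sketch using Python's requests; the host and port are assumptions for a local cluster:
import requests

for text in ["system.exe", "system32.exe", "system32tm.exe"]:
    resp = requests.post(
        "http://localhost:9200/_analyze",              # assumed local cluster
        json={"tokenizer": "standard", "text": text},  # exercise the standard tokenizer directly
    )
    tokens = [t["token"] for t in resp.json()["tokens"]]
    print(text, "->", tokens)
# Per the rules above, system.exe and system32tm.exe come back as single tokens,
# while system32.exe is split at the dot.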

Getting an exact match to the string `#deprecated` in Kibana/ELK

I'm using Kibana to find all logs containing an exact match of the string #deprecated.
For a reason I don't understand, it matches strings with the word "deprecated" without the # sign.
I tried to use escaping for # according to the Lucene documentation, i.e. message:"\\#deprecated", but the results did not change.
How can I write a query that matches the #deprecated text exactly, and only that?
Why is this happening?
Your problem isn't an issue with query syntax, which is what escaping is for; it's with analysis. Your analyzer removes punctuation because it's parsing the message as full text. It removes #, in much the same way that it will remove periods and commas.
So, after analysis (assuming standard analysis) of something like "Class is #deprecated", the generated token stream will have the following tokens: "class", "deprecated" ("is" is a stop word). The indexed forms of "#deprecated" and "deprecated" are identical, so it is impossible to have a query that can differentiate between them as it is currently indexed.
To fix this you would have to change your analyzer. WhitespaceAnalyzer may be a good choice, and should fix this issue. However, be careful you aren't doing more harm than good. If you use WhitespaceAnalyzer, you are going to have to contend with other punctuation as well, and a search for "sentence" would not find "match at the end of this sentence.", because of the period. So, if you are searching full text, this will certainly cause far more problems than it solves.
If you want to know the full rules of standard analysis, by the way, it's an implementation of UAX #29 word boundaries.
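If you do go the whitespace route, this is roughly what the mapping change could look like, sent with Python's requests. A sketch only: index name, field name, host, and port are placeholders, it assumes a recent Elasticsearch version, and existing documents would need to be reindexed:
import requests

# Create an index whose "message" field uses the whitespace analyzer,
# so punctuation such as '#' is kept as part of the tokens.
requests.put(
    "http://localhost:9200/my-logs",  # placeholder index on a local cluster
    json={
        "mappings": {
            "properties": {
                "message": {"type": "text", "analyzer": "whitespace"}
            }
        }
    },
)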

How to search for # in Azure Search

Hi, I have a string field which has an nGram analyzer.
Our query goes like this:
$count=true&queryType=full&searchFields=name&searchMode=any&$skip=0&$top=50&search=/(.*)Site#12(.*)/
The text we are searching for contains Site#123.
The above query works with all other alphanumeric characters except #. Any idea how I could make this work?
If you are using the standard tokenizer, the '#' character is removed from indexed documents as it's considered a separator. For indexing, you can either use a different tokenizer, such as the whitespace tokenizer, or replace the '#' character with another character such as '_' using the mapping character filter (underscore '_' is not considered a separator). You can test the analyzer behavior using the Analyze API: https://learn.microsoft.com/rest/api/searchservice/test-analyzer.
It’s important to know that the query terms of regex queries are not analyzed. This means that the ‘#’ character won’t be removed by the analyzer from the regex expression. You can learn more about query processing in Azure Search here: How full text search works in Azure Search
Your string is being tokenized by spaces and punctuation like #. If you want to search for # and other punctuation characters, you could consider tokenizing only by whitespace, or perhaps not applying any tokenization at all and treating the whole string as a single token.
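To see this for yourself, the Analyze API mentioned above can be called with the text in question. A sketch with Python's requests; the service name, index name, api-version, and admin key are placeholders:
import requests

url = ("https://<your-service>.search.windows.net"
       "/indexes/<your-index>/analyze?api-version=2020-06-30")
body = {"text": "Site#123", "analyzer": "standard.lucene"}  # the default analyzer

resp = requests.post(url, json=body, headers={"api-key": "<admin-key>"})
print([t["token"] for t in resp.json()["tokens"]])
# With the standard analyzer, '#' acts as a separator, so you should see
# something like ['site', '123'] rather than a single 'site#123' token.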

Regex for Git commit message

I'm trying to come up with a regex for enforcing Git commit messages to match a certain format. I've been banging my head against the keyboard modifying the semi-working version I have, but I just can't get it to work exactly as I want. Here's what I have now:
/^([a-z]{2,4}-[\d]{2,5}[, \n]{1,2})+\n{1}^[\w\n\s\*\-\.\:\'\,]+/i
Here's the text I'm trying to enforce:
AB-1432, ABC-435, ABCD-42
Here is the multiline description, following a blank
line after the Jira issue IDs
- Maybe bullet points, with either dashes
* Or asterisks
Currently, it matches that, but it will also match if there's no blank line after the issue IDs, and if there are multiple blank lines after.
Is there any way to enforce that, or will I just have to live with it?
It's also pretty ugly; I'm sure there's a more succinct way to write it out.
Thanks.
Your regex allows for \n as one of the possible characters after the required newline, so that's why it matches when there are multiple.
Here's a cleaned up regex:
/^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+/i
Notes:
This requires at least one [-\w\s*.:',] character before the next newline.
I changed the issue IDs to have one possible comma, space, and newline, in that order (up to one of each). Can you use lookaheads? If so, I added (?=[, \n]) to make sure the issue ID is followed by at least one of those characters.
Also notice that many of the characters don't need to be escaped in a character class.
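If it helps, here is a quick way to sanity-check the cleaned-up pattern in Python. Note that the mid-pattern ^ anchors need MULTILINE mode; your Git hook's regex flavor may differ:
import re

pattern = re.compile(
    r"^([a-z]{2,4}-\d{2,5}(?=[, \n]),? ?\n?)+^\n([-\w\s*.:',]+\n)+",
    re.IGNORECASE | re.MULTILINE,
)

good = ("AB-1432, ABC-435, ABCD-42\n"
        "\n"
        "Here is the multiline description, following a blank\n"
        "line after the Jira issue IDs\n"
        "- Maybe bullet points, with either dashes\n"
        "* Or asterisks\n")
bad = "AB-1432, ABC-435, ABCD-42\nNo blank line before the description\n"

print(bool(pattern.match(good)))  # True: IDs, a blank line, then the description
print(bool(pattern.match(bad)))   # False: the blank line is missing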

ruby regex hangs

I wrote a ruby script to process a large number of documents and use the following regex to extract URIs from a document's string representation:
#Taken from: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
URI_REGEX = /
( # Capture 1: entire matched URL
(?:
[a-z][\w-]+: # URL protocol and colon
(?:
\/{1,3} # 1-3 slashes
| # or
[a-z0-9%] # Single letter or digit or '%'
)
| # or
www\d{0,3}[.] # "www.", "www1.", "www2." … "www999."
| # or
[a-z0-9.\-]+[.][a-z]{2,4}\/ # looks like domain name followed by a slash
)
(?: # One or more:
[^\s()<>]+ # Run of non-space, non-()<>
| # or
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
)+
(?: # End with:
\(([^\s()<>]+|(\([^\s()<>]+\)))*\) # balanced parens, up to 2 levels
| # or
[^\s`!()\[\]{};:'".,<>?«»“”‘’] # not a space or one of these punct chars
)
)/xi
It works pretty well for 99.9 percent of all documents but always hangs my script when it encounters the following token in one of the documents: token = "synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
I am using the standard ruby regexp operator: token =~ URI_REGEX and I don't get any exception or error message.
First I tried to solve the problem by encapsulating the regex evaluation in a Timeout::timeout block, but this degrades performance too much.
Any other ideas on how to solve this problem?
Your problem is catastrophic backtracking. I just loaded your regex and your test string into RegexBuddy, and it gave up after 1,000,000 iterations of the regex engine (and from the looks of it, it would have gone on for many millions more had it not aborted).
The problem arises because some parts of your text can be matched by different parts of your regex (which is horribly complicated and painful to read); it seems that the "One or more:" part of your regex and the "End with:" part struggle over the match (when it's not working), trying out millions of permutations that all fail.
It's difficult to suggest a solution without knowing what the rules for matching a URI are (which I don't). All this balancing of parentheses suggests to me that regexes may not be the right tool for the job. Maybe you could break down the problem. First use a simple regex to find everything that looks remotely like a URI, then validate that in a second step (isn't there a URI parser for Ruby of some sort?).
Another thing you might be able to do is to prevent the regex engine from backtracking by using atomic groups. If you can change some (?:...) groups into (?>...) groups, that would allow the regex to fail faster by disallowing backtracking into those groups. However, that might change the match and make it fail on occasions where backtracking is necessary to achieve a match at all - so that's not always an option.
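As an illustration of the "find candidates cheaply, then validate" suggestion above, sketched in Python for brevity (a Ruby version with URI.parse would look much the same); note the candidate pattern is deliberately much narrower than the original one:
import re
from urllib.parse import urlparse

# Step 1: a cheap pattern with no nested quantifiers, so it cannot backtrack catastrophically.
CANDIDATE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def extract_uris(text):
    results = []
    for match in CANDIDATE.finditer(text):
        candidate = match.group().rstrip(".,;:!?)'\"")  # trim trailing punctuation
        # Step 2: let a real parser decide whether it is a URI.
        parsed = urlparse(candidate if "://" in candidate else "http://" + candidate)
        if parsed.scheme and parsed.netloc:
            results.append((candidate, match.start()))  # keep the position too
    return results

text = "See http://daringfireball.net/2010/07 and synsem:local:cat:(subcat:SubMot,adjuncts:Adjs,subj:Subj),"
print(extract_uris(text))  # the problematic token is simply never considered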
Why reinvent the wheel?
require 'uri'
uri_list = URI.extract("Text containing URIs.")
URI.extract("Text containing URIs.") is the best solution if you only need the URIs.
I finally used pat = URI::Parser.new.make_regexp('http') to get the built-in URI parsing regexp and used it in match = str.match(pat, start_pos) to iteratively parse the input text URI by URI. I am doing this because I also need the URI positions in the text, and the returned match object gives me this information via match.begin(0).
