GPT2 huggingface's transformer preprocessing - huggingface-transformers

What should we do exactly during the preprocessing step for GPT2? Are there any guidelines?
Would this be just fine for the preprocessing step?
1. Remove any \n from sentence
2. Remove extra spaces from sentence
3. Leave everything else that is part of the sentence but not exactly words (e.g. urls, non-english words that may be added in an english sentence, emojis, etc...)
Wouldn't it be better to remove extra punctuation or any non-english character?
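A minimal sketch of the first two steps from the list above, in Python. The helper name `normalize_whitespace` is made up for illustration and is not part of the transformers library. URLs, emojis and non-English words are deliberately left untouched, as in step 3.

```python
import re

def normalize_whitespace(text: str) -> str:
    """Turn newlines into spaces and collapse runs of whitespace (hypothetical helper)."""
    # \s+ matches newlines, tabs and repeated spaces alike
    return re.sub(r"\s+", " ", text).strip()

sample = "Check   this out:\nhttps://example.com 😀  ¡hola!"
print(normalize_whitespace(sample))
```

Whether stripping punctuation or non-English characters on top of this helps is a separate question that this sketch does not answer.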


Better algorithm for shortening English words

I have some unique codes that are generated from strings (e.g. website host names) in various independent components of my application.
These codes are meant to be used by machines only, so I would like to keep them as short as possible.
The below algorithm would be applied to every word in the string. The output words would be concatenated with a dash to generate the unique code.
The current algorithm I have used:
- Leave the word unchanged if its length is less than 6
- Leave the first character as is
- Remove every vowel in the word from the second character onwards
architectural digest eu => archtctrl-dgst-eu
arizona foothills magazine => arzn-fthlls-mgzn
Is there a better way to shorten an English word leaving it as recognisable as possible to a human reader?
The output should be deterministic and produce the same shortened version whenever it is run on the same input.
A good algorithm should also minimise the number of clashes for similarly spelt words.
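The algorithm described in the question can be sketched in Python as follows. The function names `shorten_word` and `make_code` are illustrative only, and the sketch assumes words are separated by whitespace and that only a, e, i, o, u count as vowels.

```python
VOWELS = set("aeiouAEIOU")

def shorten_word(word: str, min_length: int = 6) -> str:
    """Drop vowels after the first character; words shorter than min_length stay as they are."""
    if len(word) < min_length:
        return word
    return word[0] + "".join(ch for ch in word[1:] if ch not in VOWELS)

def make_code(text: str) -> str:
    """Shorten every whitespace-separated word and join the results with dashes."""
    return "-".join(shorten_word(word) for word in text.split())

print(make_code("architectural digest eu"))
print(make_code("arizona foothills magazine"))
```

The output is deterministic, but not collision-free: `shorten_word("leaving")` and `shorten_word("living")` both return `lvng`.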
"I have some unique codes that are generated from strings"
I am afraid that is not true. There are many English words that will reduce to the same 'code word' when stripped of their vowels. For example, 'leaving' and 'living' both reduce to 'lvng'. Granted, this is fairly rare, but it could still cause issues.
How important is it that these 'code words' remain human-readable if, as you say, they are meant to be used by machines only? If it's not that important, I'd suggest looking into simple compression algorithms like Huffman coding or LZW compression. Then, if the user needs to see the original of a code word, just decompress it.
If you must keep it human-readable, I'm not sure that there is much more you can do to shorten it. You could take a look at specific Latin and Greek roots, determine by hand whether you can shorten those any further, and then substitute them automatically.
Alternatively, you could turn to a phonetic approach. Automatically look up the pronunciation of the word, and then see if that is any shorter (or can itself be compressed, taking 'cee' to 'C', or 'kay' to 'K'). This would be much more time- and CPU-intensive, but it's still an option if you really, really need short yet readable codes.
What you're generating sounds like what's called a "slug". There are many libraries to handle this for blogs or site generators that should suit your purposes. Here's a usage example from a Python library called slugify:
from slugify import slugify  # assumption: the python-slugify package

txt = "___This is a test ---"
r = slugify(txt)
assert r == "this-is-a-test"
Slug libraries generally work like this:
- Replace non-ASCII characters with ASCII transliterations via a mapping (e.g. 影師嗎 -> ying-shi-ma)
- Replace accented Latin letters with their ASCII equivalents via a mapping (e.g. C'est déjà l'été. -> c-est-deja-l-ete)
- Remove leading and trailing spaces/punctuation
- Convert remaining spaces and punctuation to dashes, collapsing multiple dashes in a row into a single dash
If you want to make slugs shorter you could remove vowels or, more simply, use a maximum length.
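To make those steps concrete, here is a rough standard-library sketch in Python. It is not how any particular slug library is implemented, and `simple_slug` is a made-up name. It handles accented Latin letters through Unicode decomposition, but it simply drops characters such as 影師嗎 instead of transliterating them, since that needs a mapping table.

```python
import re
import unicodedata
from typing import Optional

def simple_slug(text: str, max_length: Optional[int] = None) -> str:
    """Hypothetical minimal slugifier: ASCII-fold, lowercase, dash-separate, optionally truncate."""
    # Decompose accented letters (é -> e + combining accent), then drop everything non-ASCII
    folded = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Turn every run of non-alphanumeric characters into one dash, then trim dashes at both ends
    slug = re.sub(r"[^a-z0-9]+", "-", folded.lower()).strip("-")
    if max_length is not None:
        # Cut to the maximum length without leaving a trailing dash
        slug = slug[:max_length].rstrip("-")
    return slug

print(simple_slug("___This is a test ---"))
print(simple_slug("C'est déjà l'été."))
```

Passing `max_length` is the "maximum length" shortening mentioned above; it cuts in the middle of a word if necessary.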

Count Number of Sentence Ruby

I have searched everywhere and did not manage to find a way to count the number of sentences in a String using Ruby. Does anyone know how to do it?
Example
string = "The best things in an artist’s work are so much a matter of intuition, that there is much to be said for the point of view that would altogether discourage intellectual inquiry into artistic phenomena on the part of the artist. Intuitions are shy things and apt to disappear if looked into too closely. And there is undoubtedly a danger that too much knowledge and training may supplant the natural intuitive feeling of a student, leaving only a cold knowledge of the means of expression in its place. For the artist, if he has the right stuff in him ... "
The result for this string should be 4.
You can split the text into sentences and count them. Here:
string.scan(/[^\.!?]+[\.!?]/).map(&:strip).count # scan collects every match of the regex; strip removes surrounding whitespace from each match
# => 4
Explaining regex:
[^\.!?]
A caret at the start of a character class [^ ] negates it, which means we are looking for any character that is not in the list: ., ! and ?.
+
is a greedy quantifier that matches the preceding element one or more times (capturing our sentences here and ignoring repetitions like ...).
[\.!?]
matching characters ., ! or ?.
In a nutshell, we capture all characters that are not ., ! or ? up to the first ., ! or ?. Each such match can be treated as a sentence (in a broad sense).
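For comparison, the same pattern can be tried in Python, where `re.findall` plays the role of Ruby's `scan`. This is only a sketch; `count_sentences` is an illustrative name and the sample text is a shortened stand-in for the string in the question.

```python
import re

def count_sentences(text: str) -> int:
    """Count runs of non-terminator characters that end in '.', '!' or '?'."""
    # Inside a character class the dot needs no escaping
    return len(re.findall(r"[^.!?]+[.!?]", text))

sample = "First one. Second one! Third? And a fourth ... "
print(count_sentences(sample))
```

As in the Ruby version, the trailing "..." only contributes one match, because the remaining dots have no non-terminator characters in front of them.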
I think it makes sense to treat a word character followed by ?, ! or . as the delimiter of a sentence:
string.strip.split(/\w[?!.]/).length
#=> 4
So I'm not considering the ... a delimiter when it hangs on its own like that:
"I waited a while ... and then I went home"
But then again, maybe I should...
It also occurs to me that maybe a better delimiter is a punctuation mark followed by some whitespace and a capital letter:
string.split(/[?!.]\s+[A-Z]/).length
#=> 4
Sentences end with full stops, question marks, and exclamation marks. They can also be separated with dashes and other punctuation, but we won’t worry about these rare cases here.
The split is simple. Instead of asking Ruby to split the text on one type of character, you simply ask it to split on any of three types of characters, like so:
txt = "The best things in an artist’s work are so much a matter of intuition, that there is much to be said for the point of view that would altogether discourage intellectual inquiry into artistic phenomena on the part of the artist. Intuitions are shy things and apt to disappear if looked into too closely. And there is undoubtedly a danger that too much knowledge and training may supplant the natural intuitive feeling of a student, leaving only a cold knowledge of the means of expression in its place. For the artist, if he has the right stuff in him ... "
sentence_count = txt.split(/\.|\?|!/).length
puts sentence_count
#=> 7 (each dot of the trailing "..." is its own split point, so this overcounts for this string)
string.squeeze('.!?').count('.!?')
#=> 4
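A rough Python counterpart of the squeeze-and-count one-liner, for readers who want to experiment outside Ruby. Python strings have no `squeeze`, so this sketch uses a backreference to collapse runs of the same terminator; `count_terminators` is an illustrative name.

```python
import re

def count_terminators(text: str) -> int:
    """Collapse runs of one repeated terminator ('...' -> '.'), then count terminators."""
    squeezed = re.sub(r"([.!?])\1+", r"\1", text)
    return sum(squeezed.count(ch) for ch in ".!?")

sample = "First one. Second one! Third? And a fourth ... "
print(count_terminators(sample))
```

Like the Ruby version, this only collapses runs of the same character, so a mixed ending such as "?!" still counts as two.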

Counting words from a mixed-language document

Given a set of lines containing Chinese characters, Latin-alphabet-based words or a mixture of both, I wanted to obtain the word count.
To wit:
this is just an example
这只是个例子
should give 10 words ideally; but of course, without access to a dictionary, 例子 would best be treated as two separate characters. Therefore, a count of 11 words/characters would also be an acceptable result here.
Obviously, wc -w is not going to work. It considers the 6 Chinese characters / 5 words as 1 "word", and returns a total of 6.
How do I proceed? I am open to trying different languages, though bash and python will be the quickest for me right now.
You should split the text on Unicode word boundaries, then count the elements which contain letters or ideographs. If you're working with Python, you could use the uniseg or nltk packages, for example. Another approach is to simply use Unicode-aware regexes but these will only break on simple word boundaries. Also see the question Split unicode string on word boundaries.
Note that you'll need a more complex dictionary-based solution for some languages. UAX #29 states:
For Thai, Lao, Khmer, Myanmar, and other scripts that do not typically use spaces between words, a good implementation should not depend on the default word boundary specification. It should use a more sophisticated mechanism, as is also required for line breaking. Ideographic scripts such as Japanese and Chinese are even more complex. Where Hangul text is written without spaces, the same applies. However, in the absence of a more sophisticated mechanism, the rules specified in this annex supply a well-defined default.
I thought about a quick hack, since most Chinese characters are 3 bytes long in UTF-8:
(pseudocode)
for each character (byte) in the line:
    if the byte begins with a 1 bit:
        add 1 to total chinese chars
    if it is a space:
        add 1 to total "normal" words
    if it is a newline:
        break
Then take total chinese chars / 3 + total words to get the sum for each line. This will give an erroneous count for the case of mixed languages, but should be a good start.
这是test
However, the above sentence will give a total of 2 (1 for each of the Chinese characters.) A space between the two languages would be needed to give the correct count.
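A different way to get the "11 words/characters" style of count is to work on decoded characters rather than bytes. The Python sketch below counts every ideograph as one unit and every run of ASCII letters or digits as one word. It assumes that only the basic CJK Unified Ideographs block (U+4E00 to U+9FFF) matters, and `count_words` is an illustrative name. It does no dictionary-based segmentation.

```python
import re

# One token per CJK ideograph, or one token per run of ASCII letters/digits
TOKEN = re.compile(r"[\u4e00-\u9fff]|[A-Za-z0-9]+")

def count_words(text: str) -> int:
    """Count Latin-alphabet words plus individual Chinese characters."""
    return len(TOKEN.findall(text))

print(count_words("this is just an example\n这只是个例子"))
print(count_words("这是test"))
```

Because tokens are found by pattern rather than by splitting on spaces, a mixed string such as 这是test needs no space between the two scripts.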

matching single letters in a sentence with a regular expression

I want to match single letters in a sentence. So in ...
I want to have my turkey. May I. I 20,000-t bar-b-q
I'd like to match
*I* want to have my turkey. May *I*. *I* 20,000-t bar-b-q
right now I'm using
/\b\w\b/
as my regular expression, but that is matching
*I* want to have my turkey. May *I*. *I* 20,000-*t* bar-*b*-*q*
Any suggestions on how to get past that last mile?
Use a negative lookbehind and a negative lookahead to fail if the previous character is a word character or a hyphen, or if the next character is a word character or a hyphen:
/(?<![\w\-])\w(?![\w\-])/
Example: http://www.rubular.com/r/9upmgfG9u4
Note that as mentioned by rtcherry, this will also match single numbers. To prevent this you may want to change the \w that is outside of the character classes to [a-zA-Z].
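The difference between the two patterns can be checked quickly in Python, whose `re` module supports the same fixed-width lookarounds. This sketch uses the letters-only variant suggested in the note above; the variable names are illustrative.

```python
import re

sentence = "I want to have my turkey. May I. I 20,000-t bar-b-q"

# Original attempt: any single word character between word boundaries
naive = re.findall(r"\b\w\b", sentence)

# Lookaround version: a letter with no word character or hyphen on either side
strict = re.findall(r"(?<![\w\-])[a-zA-Z](?![\w\-])", sentence)

print(naive)
print(strict)
```

The first list still contains the hyphenated t, b and q, while the second contains only the three standalone I's.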
F.J's answer will also include numbers. This is restricted to ASCII characters, but you really need to define what characters can be side by side and still count as a single letter.
/(?<![0-9a-zA-Z\-])[a-zA-Z](?![0-9a-zA-Z\-])/
That will also avoid things like This -> 1a <- is not a single letter. Neither is -> 2 <- that.
As long as we're being picky, non-ASCII letters are easy to include:
/(?<![[:alnum:]-])[[:alpha:]](?![[:alnum:]-])/
This will avoid matching the t in 'Cómo eres tú'
Notice that it's not necessary to escape the - when it is the last character in a character class (which I'm not sure that this technically is).
You are asking far too much of a regular expression. \w matches a word character, which includes upper and lower case alphabetics, the ten digits, and underscore. So it is the same as [0-9A-Z_a-z].
\b matches the (zero-width) boundary where a word character doesn't have another word character next to it, for instance at the beginning or end of a string, or next to some punctuation or white space.
Using negative look-behinds and look-aheads, this amounts to \b\w\b being equivalent to
(?<!\w)\w(?!\w)
i.e. a word character that doesn't have another word character before or after it.
As you have found, that finds t, b and q in 20,000-t bar-b-q. So it's back in your court to define what you really mean by "single letters in a sentence".
It nearly works to say "any letter that isn't preceded or followed by a printable character", which is
/(?<!\S)[A-Za-z](?!\S)/
But that leaves out I in May I. because it has a dot after it.
So, do you mean a single letter that isn't preceded by a printable character, and is followed by whitespace, a dot, or the end of the string (or a comma, a semicolon or a colon for good measure)? Then you want
/(?<!\S)[A-Za-z](?=(?:[\s.,;:]|\z))/
which finds exactly three I characters in your string.
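A small Python check of that last pattern. Python's `re` module spells the end-of-string anchor `\Z` rather than Ruby's `\z`, so the sketch swaps that in; `find_single_letters` is an illustrative name.

```python
import re

# Letter not preceded by a non-space character, followed by whitespace,
# one of . , ; : or the end of the string
SINGLE_LETTER = re.compile(r"(?<!\S)[A-Za-z](?=[\s.,;:]|\Z)")

def find_single_letters(text: str) -> list:
    """Return the standalone letters found in text."""
    return SINGLE_LETTER.findall(text)

print(find_single_letters("I want to have my turkey. May I. I 20,000-t bar-b-q"))
```

A letter at the very end of the input is also accepted through the `\Z` branch, while hyphenated fragments such as the x in "x-ray" are rejected.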
I hope that helps.

Password validation

I need to validate a password with the following requirements:
1. Be at least seven characters long
2. Contain at least one letter (a-z or A-Z)
3. Contain at least one number (0-9)
4. Contain at least one symbol (#, $, %, etc.)
Can anyone give me the correct expression?
/.{7,}/
/[a-zA-Z]/
/[0-9]/
/[-!##$%^...]/
For a single regex, the most straightforward way to check all of the requirements would be with lookaheads:
/(?=.*[a-zA-Z])(?=.*\d)(?=.*[^a-zA-Z0-9\s]).{7,}/
Breaking it down:
.{7,} - at least seven characters
(?=.*[a-zA-Z]) - a letter must occur somewhere after the start of the string
(?=.*\d) - same as the previous item, but for a digit
(?=.*[^a-zA-Z0-9\s]) - same again, but for something that is not a letter, digit, or whitespace
However, you might choose to simply utilize multiple separate regex matches to keep things even more readable - chances are you aren't validating a ton of passwords at once, so performance isn't really a huge requirement.
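Both styles can be sketched side by side in Python. The function names are illustrative, and "symbol" is taken to mean anything that is not a letter, digit, or whitespace, as in the single-regex answer.

```python
import re

# Single pattern: three lookaheads plus a minimum length of seven
SINGLE = re.compile(r"(?=.*[a-zA-Z])(?=.*\d)(?=.*[^a-zA-Z0-9\s]).{7,}")

def is_valid_single(password: str) -> bool:
    """Check all four requirements with one regex."""
    return SINGLE.search(password) is not None

def is_valid_separate(password: str) -> bool:
    """Check each requirement on its own, which is easier to read and to extend."""
    return (
        len(password) >= 7
        and re.search(r"[a-zA-Z]", password) is not None
        and re.search(r"[0-9]", password) is not None
        and re.search(r"[^a-zA-Z0-9\s]", password) is not None
    )

for candidate in ["abc123!", "abcdefg1", "ab1!"]:
    print(candidate, is_valid_single(candidate), is_valid_separate(candidate))
```

The separate checks also make it straightforward to tell the user which requirement failed, which the single regex cannot do.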
