Elasticsearch standard tokenizer behaviour and word boundaries

I am not sure why the standard tokenizer (used by the default standard analyzer) behaves like this in this scenario:
- If I use the word system.exe it generates the token system.exe. I understand . is not a word breaker.
- If I use the word system32.exe it generates the tokens system32 and exe. I don't understand this: why does it break the word when it finds a number followed by a . ?
- If I use the word system32tm.exe it generates the token system32tm.exe. As in the first example, it works as expected, not breaking the word into different tokens.
I have read http://unicode.org/reports/tr29/#Word_Boundaries but I still don't understand why a number followed by a dot (.) is a word boundary.

As mentioned in the question, the standard tokenizer provides grammar-based tokenization based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29.
The rule at http://unicode.org/reports/tr29/#Word_Boundaries is not to break on letter + dot + letter; see WB6 and WB7 in that spec. So tm.exe is preserved, while system32.exe is split at the dot.
The spec says that it always splits, except for the listed exceptions. Rules WB6 and WB7 say that it never splits on letter, then certain punctuation (such as the dot), then letter. Rules WB11 and WB12 say that it never splits on number, then such punctuation, then number. However, there is no such rule for number, then punctuation, then letter, so the default rule applies and system32.exe gets split.
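The effect of these rules on the three example words can be imitated with a small Ruby sketch. It is only a toy model, not Lucene's tokenizer: it assumes the dot is the only joining punctuation, and the names TOY_TOKEN and toy_word_tokens are made up for illustration.

```ruby
# Toy model of WB6/WB7 and WB11/WB12 for the dot only (an assumption made
# for illustration; the real UAX #29 rules cover many more characters).
# A dot stays inside a token only when it sits between two letters or
# between two digits; anywhere else it ends the token.
TOY_TOKEN = /
  [[:alnum:]]+                               # a run of letters and digits
  (?:
    (?: (?<=[[:alpha:]]) \. (?=[[:alpha:]])  # letter . letter  (WB6, WB7)
      | (?<=[[:digit:]]) \. (?=[[:digit:]])  # digit . digit    (WB11, WB12)
    )
    [[:alnum:]]+
  )*
/x

# Hypothetical helper: returns the tokens the toy model produces.
def toy_word_tokens(text)
  text.scan(TOY_TOKEN)
end

p toy_word_tokens("system.exe")      # ["system.exe"]
p toy_word_tokens("system32.exe")    # ["system32", "exe"]
p toy_word_tokens("system32tm.exe")  # ["system32tm.exe"]
```

In system32.exe the dot has a digit on its left and a letter on its right, so neither exception applies and the token ends there.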

Related

custom token filter for elasticsearch

I want to implement a custom token filter like this:
- single words are accepted if they match a specific (regex) pattern
- adjacent words are concatenated if one ends in a letter and the other one begins with a digit (or vice versa)
This seems to map to:
step 1 - shingle - adjacent words joined together with a space
step 2 - if token matches pattern /pat1/, keep ... if token matches /pata patb/, replace the whitespace
step 3 - remove everything else.
Is there a way to achieve that? I have seen https://stackoverflow.com/questions/35742426/how-to-filter-tokens-based-on-a-regex-in-elasticsearch but don't feel like converting a complex pattern into one with lookahead.
The idea is to factor out potential order numbers from user input.
The data is assumed to be normalised, so an order number could be a regular isbn 978<10_more_digits> or something like "ME4713P". Users might input "ME 4713P" or 978-<10_digits_and_some_dashes> instead
Order numbers can be described as "contain both letters and digits, optional dashes" or "contain letters, a dash, more letters" or "contain digits, a dash, more digits"
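To make the intended behaviour concrete, here is a rough Ruby sketch of the logic outside Elasticsearch. It is not a token filter and uses no Elasticsearch API; ORDER_NUMBER is only one possible reading of the three descriptions above, and order_number_candidates is a hypothetical helper name. Dash normalisation is left out.

```ruby
# One reading of the three descriptions in the question (an assumption):
#   letters and digits mixed, dashes optional
#   | letters, a dash, more letters
#   | digits, a dash, more digits
ORDER_NUMBER = /\A(?:(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d\-]+|[A-Za-z]+-[A-Za-z]+|\d+-\d+)\z/

# Hypothetical helper, not an Elasticsearch API.
def order_number_candidates(input)
  words = input.split

  # Single words are kept when they look like an order number on their own.
  singles = words.select { |word| word.match?(ORDER_NUMBER) }

  # Adjacent words are glued together when a letter meets a digit at the seam.
  pairs = words.each_cons(2).map do |left, right|
    letter_then_digit = left.match?(/[A-Za-z]\z/) && right.match?(/\A\d/)
    digit_then_letter = left.match?(/\d\z/) && right.match?(/\A[A-Za-z]/)
    left + right if letter_then_digit || digit_then_letter
  end.compact

  singles + pairs
end

p order_number_candidates("please send ME 4713P today")  # ["4713P", "ME4713P"]
p order_number_candidates("ME4713P")                      # ["ME4713P"]
```

Like a shingle-based pipeline, this emits both the single word and the joined pair, so the lookup against the normalised data decides which candidate is real.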
BTW: sorry to use different email this time...

Discard contractions from string

I have a special use case where I want to discard all the contractions from a string and select only the words that consist solely of letters, without any special characters.
For example:
string = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
string.scan(/\b[A-z][a-z]+\b/)
#=> ["Achilles", "Ada", "Stackoverflow", "James", "ll", "ve"]
Note: It's not discarding the whole word I'll and I've
Can someone please help me discard the whole word when it contains a contraction?
Try this Regex:
(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)
Explanation:
(?:(?<=\s)|(?<=^)) - finds the position immediately preceded by either start of the line or by a white-space
[a-zA-Z]+ - matches 1+ occurrences of a letter
(?=\s|$) - The substring matched above must be followed by either a whitespace or end of the line
Update:
To make sure that not all the letters are in upper case, use the following regex:
(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)
The only thing added here is (?=\S*[a-z]), which means that the word must contain at least one lowercase letter.
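As a quick check, both patterns from this answer can be run against the sample string from the question with String#scan. The constant and method names below are made up for this sketch.

```ruby
# Sample string taken from the question.
SAMPLE = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"

# First pattern: whole whitespace-delimited words made only of ASCII letters.
LETTERS_ONLY = /(?:(?<=\s)|(?<=^))[a-zA-Z]+(?=\s|$)/

# Updated pattern: the word must also contain at least one lowercase letter.
WITH_LOWERCASE = /(?:(?<=\s)|(?<=^))(?=\S*[a-z])[a-zA-Z]+(?=\s|$)/

# Hypothetical helpers wrapping String#scan.
def letters_only_words(text)
  text.scan(LETTERS_ONLY)
end

def mixed_case_words(text)
  text.scan(WITH_LOWERCASE)
end

p letters_only_words(SAMPLE)
# ["ASAP", "ASCII", "Achilles", "Ada", "Stackoverflow", "James"]
p mixed_case_words(SAMPLE)
# ["Achilles", "Ada", "Stackoverflow", "James"]
```

The contractions drop out in both cases because the letters before the apostrophe are not followed by whitespace, and the letters after it are not preceded by whitespace.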
I know that there's an accepted answer already, but I'd like to give my own shot:
(?<=\s|^)\w+[a-z]\w*
This regex is shorter and more efficient (157 steps against 315 from the accepted answer).
The explanation is rather simple:
(?<=\s|^) - This is a positive lookbehind. It means that we want strings preceded by a whitespace character or the start of the line.
\w+[a-z]\w* - This matches a run of word characters (letters, digits and underscore) containing at least one lowercase letter after the first character, thus discarding words that are entirely uppercase. Along with the positive lookbehind, the regex ends up discarding the contractions in the example, because in I'd, I'll, I'm and I've the apostrophe comes directly after a single uppercase letter. A word such as don't would still produce the partial match don.
NOTE: this regex won't take into account one-letter words. If you want to accomplish that, then you should use \w*[a-z]\w* instead, with a little efficiency cost.
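A short Ruby sketch of how this shorter pattern behaves, first on the question's sample string and then on a made-up input that shows what \w lets through. The names SHORT_PATTERN and words_with_lowercase are illustrative only.

```ruby
# Pattern from this answer: word characters preceded by whitespace or the
# start of a line, with a lowercase letter somewhere after the first character.
SHORT_PATTERN = /(?<=\s|^)\w+[a-z]\w*/

# Hypothetical helper wrapping String#scan.
def words_with_lowercase(text)
  text.scan(SHORT_PATTERN)
end

sample = "~ ASAP ASCII Achilles Ada Stackoverflow James I'd I'll I'm I've"
p words_with_lowercase(sample)
# ["Achilles", "Ada", "Stackoverflow", "James"]

# Made-up input: \w also accepts digits and underscores, and the pattern has
# no trailing boundary, so a partial match before an apostrophe is possible.
p words_with_lowercase("R2D2 r2_d2 don't")
# ["r2_d2", "don"]
```

If digits, underscores or partial matches are a problem for the data at hand, the accepted answer's [a-zA-Z]+ with a trailing (?=\s|$) is the stricter choice.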

Regex incorrectly matching punctuation (including spaces)

I am trying to check if a string contains at least one lowercase letter, uppercase letter, and a number, but not punctuation (including spaces).
For example
4aBc8Fk3 should match
4aBc 8.;3 should not match
I tried the following, but it matches spaces:
^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).{6,}[^[:punct:]]$
Any ideas how to not match strings containing punctuation including spaces?
As far as I understand, the regular expression you have there does the following (I'm not familiar with the Ruby variety and am still quite new to regex myself, so this will give you an idea but may not be 100% correct):
Go to the beginning of the string
Ensure the string matches any number of any characters followed by a lowercase letter, e.g. --a
Ensure the string matches any number of any characters followed by an uppercase letter, e.g. --aA
Ensure the string matches any number of any characters followed by a number, e.g. --aA0
If that is all true, make sure the beginning of the string is followed by at least 6 characters of any kind (which includes spaces and punctuation), e.g. --aA0-
Ensure that is followed by a single non-punctuation character (this is the part I'm least sure about, as I haven't used character classes before; [^[:punct:]] is the POSIX bracket form, and a space is not punctuation, so it passes this check too), e.g. --aA0-c
Ensure that is followed directly by the end of the string
Now, the lookaheads would also allow a different order of occurrences, e.g. 0---Aa, as long as the string contains any characters followed by what they are looking for.
What you probably want is ^[a-zA-Z0-9]{6,}$, i.e. at least six characters, with the characters being letters and numbers (though that would also allow aaaaaa, for example).
Maybe try ^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$ to make sure each group is present, and to get alpha-numerical characters (at least six of them) only.
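The difference between the pattern from the question and the suggested one can be checked with a few lines of Ruby. The constant and method names are made up for this sketch. In Ruby, ^ and $ anchor at line boundaries, so for validating a whole string \A and \z are the stricter anchors; the patterns below are kept as written in the thread.

```ruby
# Pattern from the question: .{6,} accepts any characters, and only the very
# last character is checked against [^[:punct:]].
ORIGINAL_PATTERN = /^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9]).{6,}[^[:punct:]]$/

# Pattern suggested above: the lookaheads require one character of each kind,
# and the body allows only ASCII letters and digits, at least six of them.
ALNUM_PATTERN = /^(?=.*[a-z])(?=.*[A-Z])(?=.*[0-9])[a-zA-Z0-9]{6,}$/

# Hypothetical helper for the suggested pattern.
def alnum_with_all_kinds?(candidate)
  candidate.match?(ALNUM_PATTERN)
end

p ORIGINAL_PATTERN.match?("4aBc 8.;3")  # true  (the unwanted match)
p alnum_with_all_kinds?("4aBc8Fk3")      # true
p alnum_with_all_kinds?("4aBc 8.;3")     # false
p alnum_with_all_kinds?("aaaaaa")        # false (no uppercase, no digit)
```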
I always use a tool such as http://www.regexpal.com/ to slowly build up my regex and to see where I go wrong, deconstructing a "bad" regex until I get to a "good" one, then slowly adding to it again.
Hope that helps. :)
P.S.: I'm still a bit unclear how many characters you want to match in total, i.e. if the string is fixed length or not...?

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I match UTF-8 characters as well?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers: more and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further. For instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
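The three variants can be compared in a short Ruby sketch. The constant names and the Slovenian sample heading are made up for illustration.

```ruby
# Pattern from the question: ASCII capitals only.
ASCII_HEADING = /^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$/

# Same structure with the Unicode-aware POSIX bracket class.
UPPER_HEADING = /^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$/

# Simplified variant: uppercase letters and whitespace anywhere on the line.
LOOSE_HEADING = /^[[:upper:]\s]+$/

slovenian = "ČEŠNJE IN ŽITO"  # made-up sample heading

p ASCII_HEADING.match?(slovenian)          # false
p UPPER_HEADING.match?(slovenian)          # true
p UPPER_HEADING.match?("I AM A HEADING")   # false (one-letter words)
p LOOSE_HEADING.match?("I AM A HEADING")   # true
p LOOSE_HEADING.match?("Some text which is always non capitalized.")  # false
```

Note that the loose variant also accepts a line consisting only of whitespace, so blank-looking lines may need to be filtered out separately.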
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b
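A minimal Ruby check of what \p{Lu} adds over [A-Z], using a made-up sample string and hypothetical helper names:

```ruby
# \p{Lu} matches any Unicode uppercase letter; [A-Z] matches ASCII capitals only.
def unicode_capitals(text)
  text.scan(/\p{Lu}/)
end

def ascii_capitals(text)
  text.scan(/[A-Z]/)
end

sample = "ČŠŽ ABC abc"  # made-up sample
p unicode_capitals(sample)  # ["Č", "Š", "Ž", "A", "B", "C"]
p ascii_capitals(sample)    # ["A", "B", "C"]
```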

Need RegExpr matching group of 8 containing, 2+ uppercase, 2+ lowercase and 1+ number

I am trying to create a spam filter using Regular Expressions that matches the following situation.
There is a group of exactly 8 alphanumeric characters to be matched.
It must contain 2 or more uppercase letters;
AND it must contain 2 or more lowercase letters;
AND it must contain 1 or more numbers.
So far, all I have been able to come up with is this:
(?i)[A-Za-z0-9]{8}
My code does match a mixed case group of 8, but does not force upper or lower case or specify how many times each type must occur. So, I couple it with other must-haves that are always present in the messages in question.
Here is a sample of the pattern I am trying to detect:
WbNDSk9e
This is part of a spam URL. Other groups I have seen follow the same pattern of at least 2 each UC and LC letters and 1 or more numbers and always have exactly 8 characters. I've seen no other characters or variations yet.
To my knowledge, the only switch I am able to use is to turn on Case Sensitivity, with (?i). Some of the other switches I have seen in some replies do not work in the program I use. Am I asking too much from a single line RegExpr rule?
I currently use RegEx Match to test my rules and my anti-spam program uses the same engine.
^(?=.*?[A-Z].*?[A-Z])(?=.*?[a-z].*?[a-z])(?=.*?\d).{8}$
Broken down:
(?=.*?[A-Z].*?[A-Z]) forces at least 2 upper-case letters.
(?=.*?[a-z].*?[a-z]) forces at least 2 lower-case letters.
(?=.*?\d) forces at least 1 digit.
The ^ ... $ caret and dollar force that it matches the whole string.
You don't want the (?i) flag because it will make it case-insensitive.
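The pattern can be exercised in Ruby as a rough check; the spam program in the question may use a different engine, so this is only an illustration. SPAM_TOKEN is the regex from this answer. STRICT_SPAM_TOKEN is a variant added here as an assumption: it replaces .{8} with [A-Za-z0-9]{8} so that only alphanumeric characters are accepted, as the question describes.

```ruby
# Regex from the answer: lookaheads count the character kinds, .{8} fixes the length.
SPAM_TOKEN = /^(?=.*?[A-Z].*?[A-Z])(?=.*?[a-z].*?[a-z])(?=.*?\d).{8}$/

# Assumed variant: same lookaheads, but the body is limited to ASCII alphanumerics.
STRICT_SPAM_TOKEN = /^(?=.*?[A-Z].*?[A-Z])(?=.*?[a-z].*?[a-z])(?=.*?\d)[A-Za-z0-9]{8}$/

p SPAM_TOKEN.match?("WbNDSk9e")         # true  (sample from the question)
p SPAM_TOKEN.match?("wbndsk9e")         # false (no uppercase letters)
p SPAM_TOKEN.match?("WbNDSkxe")         # false (no digit)
p SPAM_TOKEN.match?("WbNDSk9ef")        # false (nine characters)
p SPAM_TOKEN.match?("Wb-NDk9e")         # true  (. also accepts the dash)
p STRICT_SPAM_TOKEN.match?("Wb-NDk9e")  # false
```

Because of the ^ and $ anchors, both patterns test a whole line; to find the group inside a longer URL the anchors and lookaheads would need to be adapted.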
