Regexp Greek chars by number - ruby

I deal with strings that contain Greek and English (Latin) text. I'd like to use a regex to catch all the Greek words that contain 4 or more characters on them.
Using regexp manual I figure out that I can use \p{Greek} to grab all Greek words and \w{4,} in order to grab 4+ character words. However, these two don't work together, from various tests I made.
Is there any way to do what I want using 1 regexp expression? Strings are UTF-8 and come out of tweets.
Regards

Are you using the UTF-8 pattern modifier?
/\p{Greek}{4,}/u

Related

How to strip vowels/nekudot/diacritics from hebrew utf-8 in ruby?

Hebrew prints vowels as nekudot/points around the letters.
with vowels: " יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר"
without vowels: "יתגב כארי לעמוד בבקר לעבודת בוראו שיהא הוא מעורר השחר"
I need a way to strip these vowels from a string. As in convert the string, with vowels, to the string, without vowels. Any suggestions?
-p.s.
I have tried "hebrew.gsub(/[^א-ת]/, '')"
but this has two problems: a: this will remove all punctuation, english, etc. I only want to remove the vowels. b: some letters get removed as well. (my understanding is limited, but it seems some letters/vowel combinations become "multibyte" in utf-8 and will not match "א-ת".
I found this:https://gist.github.com/yakovsh/345a71d841871cc3d375 online, but the ruby suggestion only works with rails, (assuming it works at all).
However, perhaps that page can be helpful in finding a solution.
Please help, thanks in advance.
The vowels are all between U+0591 and U+05C7, so you can just do
hebrew.gsub(/[\u0591-\u05c7]/,"")
e.g.
" יִתְגַּבֵּר כַּאֲרִי לַעֲמֹד בַּבֹּקֶר לַעֲבוֹדַת בּוֹרְאוֹ שֶׁיְּהֵא הוּא מְעוֹרֵר הַשַּׁחַר".gsub(/[\u0591-\u05c7]/,"")
# => " יתגבר כארי לעמד בבקר לעבודת בוראו שיהא הוא מעורר השחר"
However, that only works if the vowels are all separate characters in the string - or to say the same thing in Unicode-speak, if the text is in Normalization Form D. You can ensure that is the case by calling String#unicode_normalize on it first:
hebrew.unicode_normalize(:nfd).gsub(/[\u0591-\u05c7]/,"")
This step is necessary because Unicode includes several individual characters that combine a letter with nekuddot in a single code point, for round-trip compatibility with older character sets that did not support combining diacritics. Those characters mean you can't tell just by looking whether the string "בּ" consists of the two-code-point sequence U+05D1 HEBREW LETTER BET followed by U+05BC HEBREW POINT DAGESH OR MAPIQ, or just the single character U+FB31 HEBREW LETTER BET WITH DAGESH. Putting the string into Normalization Form D replaces the latter with the former, and also splits up any other "precomposed" characters into their component parts.

Regex that considers apostrophes as word characters? [duplicate]

So I want to split a string in java on any non-alphanumeric characters.
Currently I have been doing it like this
words= Str.split("\\W+");
However I want to keep apostrophes("'") in there. Is there any regular expression to preserve apostrophes but kick the rest of the junk? Thanks.
words = Str.split("[^\\w']+");
Just add it to the character class. \W is equivalent to [^\w], which you can then add ' to.
Do note, however, that \w also actually includes underscores. If you want to split on underscores as well, you should be using [^a-zA-Z0-9'] instead.
For basic English characters, use
words = Str.split("[^a-zA-Z0-9']+");
If you want to include English words with special characters (such as fiancé) or for languages that use non-English characters, go with
words = Str.split("[^\\p{L}0-9']+");

Regexp with letters and hexadecimal

I need regexp for "EWD-eb-AEW-97-QOW" like strings.
The general pattern is:
3 uppercase letters, hex, 3 uppercase letters, hex, 3 uppercase letters.
I use:
/[A-Z]{3}-\h-[A-Z]{3}-\h-[A-Z]{3}/
but it doesn't work. Can anyone help with it and explain why it doesn't work?
\h doesn't match 2 digit hex numbers, use this regex:
/[A-Z]{3}-[A-F0-9]{2}-[A-Z]{3}-[A-F0-9]{2}-[A-Z]{3}/i
RegEx Demo
In addition to anubhava's answer.. You can add {2} occurrences in your answer for achieving the same.
/[A-Z]{3}-\h{2}-[A-Z]{3}-\h{2}-[A-Z]{3}/i
See DEMO
I'd use something like:
/(?:[A-Z]{3}-[a-f0-9]{2}-){2}[A-Z]{3}/
https://www.regex101.com/r/aY5eF6/1
(?:[A-Z]{3}-[a-f0-9]{2}-)
groups the three-letters, '-', two hexadecimal letters, and '-', and then does it twice, followed by three-letters again.
Regarding using Ruby's \h Regexp extension:
I'd be careful using special characters like \h in a mixed-language environment, or one that is supporting old versions of Ruby. We use YAML to contain patterns shared among various languages, and something like this would open up a very hard to track bug. I'd recommend using [a-f0-9] unless you KNOW you'll never run into that problem.

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it matches all headings which does not contain chars as Č, Š, Ž(slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I get utf8?
You are right in that when you define the ASCII range A-Z, the match is made literally only for those characters. This is to do with the history of characters on computers, more and more characters have been added over time, and they are not always structured in an encoding in ways that are easy to use.
You could make a larger character class that matches the slovenian characters you need, by listing them.
But there is a shortcut. Someone else has already added necessary data to the Unicode data so that you can write shorter matches for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further, for instance it would not match the heading "I AM A HEADING" due to the match insisting each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
You can use unicode upper case letter:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})\b
RegEx Demo

What is an example regex in ruby that will match any emojis?

I need to match emojis in a string in Ruby using a regex. I have tried several unicode sequences and none seem to quite do the job. I am also not sure where the start and end range for emojis would be.
This regex matches all 845 emoji, taken from Emoji unicode characters for use on the web:
[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]
I generated this regex directly from the raw list of Unicode emoji. The algorithm is here: https://github.com/franklsf95/ruby-emoji-regex.
Example usage:
regex = /[\u{203C}\u{2049}\u{20E3}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F1E7}-\u{1F1EC}\u{1F1EE}-\u{1F1F0}\u{1F1F3}\u{1F1F5}\u{1F1F7}-\u{1F1FA}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F320}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}]/
str = "I am a string with emoji 😍😍😱😱👿👿🐔🌚 and other Unicode characters 比如中文."
str.gsub regex, ''
# "I am a string with emoji and other Unicode characters 比如中文."
Other Unicode characters, such as Asian characters, are preserved.
EDIT: I udpated the regex to exclude ASCII numbers and symbols. See comments from How do I remove emoji from string for details.
Emojis don't exist in one single range. They are scattered about. This is a collection of codes, and ranges where possible, that will match emojis. Tested in ruby 2.0.0p451:
str = "😣"
str.scan(/[\u{00A9}\u{00AE}\u{203C}\u{2049}\u{2122}\u{2139}\u{2194}-\u{2199}\u{21A9}-\u{21AA}\u{231A}-\u{231B}\u{23E9}-\u{23EC}\u{23F0}\u{23F3}\u{24C2}\u{25AA}-\u{25AB}\u{25B6}\u{25C0}\u{25FB}-\u{25FE}\u{2600}-\u{2601}\u{260E}\u{2611}\u{2614}-\u{2615}\u{261D}\u{263A}\u{2648}-\u{2653}\u{2660}\u{2663}\u{2665}-\u{2666}\u{2668}\u{267B}\u{267F}\u{2693}\u{26A0}-\u{26A1}\u{26AA}-\u{26AB}\u{26BD}-\u{26BE}\u{26C4}-\u{26C5}\u{26CE}\u{26D4}\u{26EA}\u{26F2}-\u{26F3}\u{26F5}\u{26FA}\u{26FD}\u{2702}\u{2705}\u{2708}-\u{270C}\u{270F}\u{2712}\u{2714}\u{2716}\u{2728}\u{2733}-\u{2734}\u{2744}\u{2747}\u{274C}\u{274E}\u{2753}-\u{2755}\u{2757}\u{2764}\u{2795}-\u{2797}\u{27A1}\u{27B0}\u{27BF}\u{2934}-\u{2935}\u{2B05}-\u{2B07}\u{2B1B}-\u{2B1C}\u{2B50}\u{2B55}\u{3030}\u{303D}\u{3297}\u{3299}\u{1F004}\u{1F0CF}\u{1F170}-\u{1F171}\u{1F17E}-\u{1F17F}\u{1F18E}\u{1F191}-\u{1F19A}\u{1F201}-\u{1F202}\u{1F21A}\u{1F22F}\u{1F232}-\u{1F23A}\u{1F250}-\u{1F251}\u{1F300}-\u{1F31F}\u{1F330}-\u{1F335}\u{1F337}-\u{1F37C}\u{1F380}-\u{1F393}\u{1F3A0}-\u{1F3C4}\u{1F3C6}-\u{1F3CA}\u{1F3E0}-\u{1F3F0}\u{1F400}-\u{1F43E}\u{1F440}\u{1F442}-\u{1F4F7}\u{1F4F9}-\u{1F4FC}\u{1F500}-\u{1F507}\u{1F509}-\u{1F53D}\u{1F550}-\u{1F567}\u{1F5FB}-\u{1F640}\u{1F645}-\u{1F64F}\u{1F680}-\u{1F68A}\u{1F68C}-\u{1F6C5}]/)
You can use the emoji_data gem to canonically match emoji in a string via it's .scan method: https://github.com/mroth/emoji_data.rb
(disclaimer: I am the author)
Some of the more recent Emoji need to be constructed by multiple Emoji-related codepoints, for example, using the invisible "Zero-width joiner" (U+200D) codepoint to construct so called Emoji ZWJ sequences. You can use my unicode-emoji gem, which comes with a regex, build from the latest Emoji data by the Unicode consortium.

Resources