Pattern match on any single UTF-8 character

I would like to have a function clause that matches any single UTF-8 character.
I can match on specific characters like this
def foo("a") do
"It's an a"
end
But I cannot determine whether it is possible to do the same for any single UTF-8 character.
My current solution is to split the string to a char list and pattern match on that, but I was curious if I could skip that step.

You can do this with:
def char?(<<c::utf8>>), do: true
def char?(_), do: false
Note that this only matches a binary containing exactly one character. To match on the next character in a longer string, you can do:
def char?(<<c::utf8, _rest::binary>>), do: true

From the Regex docs:
The modifiers available when creating a Regex are: ...
unicode (u) - enables Unicode specific patterns like \p and changes modifiers like \w, \W, \s and friends to also match on Unicode. It expects valid Unicode strings to be given on match
dotall (s) - causes dot to match newlines and also set newline to anycrlf; the new line setting can be overridden by setting (*CR) or (*LF) or (*CRLF) or (*ANY) according to :re documentation
So you might try:
~r/./us
From http://elixir-lang.org/crash-course.html
In Elixir, the word string means a UTF-8 binary and there is a String module that works on such data
So I think you should be good to go.

TL;DR:
for <<char <- "abc">> do
def foo(unquote(<<char>>)), do: "It's an #{unquote(<<char>>)}"
end
Take a look at https://github.com/elixir-lang/elixir/blob/3eb938a0ba7db5c6cc13d390e6242f66fdc9ef00/lib/elixir/unicode/unicode.ex#L48-L52. At compile time you can generate a function clause for each character in a binary ("abc" in my example). This is how Elixir's Unicode support works; read the whole module to understand it better.

Related

Working with Ruby class: Capitalizing a string

I'm trying to get my head around how to work with Classes in Ruby and would really appreciate some insight on this area. Currently, I've got a rather simple task to convert a string with the start of each word capitalized. For example:
Not Jaden-Cased: "How can mirrors be real if our eyes aren't real"
Jaden-Cased: "How Can Mirrors Be Real If Our Eyes Aren't Real"
This is my code currently:
class String
def toJadenCase
split
capitalize
end
end
#=> usual case: split.map(&:capitalize).join(' ')
Output:
Expected: "The Moment That Truth Is Organized It Becomes A Lie.",
instead got: "The moment that truth is organized it becomes a lie."

I suggest you not pollute the core String class with the addition of an instance method. Instead, just add an argument to the method to hold the string. You can do that as follows, by downcasing the string then using gsub with a regular expression.
def to_jaden_case(str)
str.downcase.gsub(/(?<=\A| )[a-z]/) { |c| c.upcase }
end
to_jaden_case "The moMent That trUth is organized, it becomes a lie."
#=> "The Moment That Truth Is Organized, It Becomes A Lie."
Ruby's regex engine performs the following operations.
(?<=\A| ) : use a positive lookbehind to assert that the following match
is immediately preceded by the start of the string or a space
[a-z] : match a lowercase letter
(?<=\A| ) can be replaced with the negative lookbehind (?<![^ ]), which asserts that the match is not preceded by a character other than a space.
Notice that by using String#gsub with a regular expression (unlike the split-process-join dance), extra spaces are preserved.
When spaces are to be matched by a regular expression one often sees whitespaces (\s) matched instead. Here, for example, /(?<=\A|\s)[a-z]/ works fine, but sometimes matching whitespaces leads to problems, mainly because they also match newlines (\n) (as well as spaces, tabs and a few other characters). My advice is to match space characters if spaces are to be matched. If tabs are to be matched as well, use a character class ([ \t]).
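To make the points above about preserved spacing and about \s concrete, here is a small sketch. The method names jaden_space, jaden_whitespace and jaden_split and the SAMPLE string are made up for this illustration; the regular expressions are the ones discussed above.

```ruby
# Hypothetical helper: lookbehind restricted to a literal space (as in the answer).
def jaden_space(str)
  str.downcase.gsub(/(?<=\A| )[a-z]/) { |c| c.upcase }
end

# Hypothetical helper: same idea, but the lookbehind accepts any whitespace (\s).
def jaden_whitespace(str)
  str.downcase.gsub(/(?<=\A|\s)[a-z]/) { |c| c.upcase }
end

# Hypothetical helper: the split/map/join approach, for comparison.
def jaden_split(str)
  str.split.map(&:capitalize).join(' ')
end

# Assumed sample input: a double space and an embedded newline.
SAMPLE = "how  can mirrors\nbe real"

p jaden_space(SAMPLE)      #=> "How  Can Mirrors\nbe Real"
p jaden_whitespace(SAMPLE) #=> "How  Can Mirrors\nBe Real"
p jaden_split(SAMPLE)      #=> "How Can Mirrors Be Real"
```

Both gsub variants keep the double space and the newline; they differ only in whether the word after the newline is capitalized. The split version collapses all whitespace into single spaces.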

Try:
def toJadenCase
self.split.map(&:capitalize).join(' ')
end

regexp match group with the exception of a member of the group

So, there are a number of regular expressions that match a particular group of characters, like the following:
/./ - Any character except a newline.
/./m - Any character (the m modifier enables multiline mode)
/\w/ - A word character ([a-zA-Z0-9_])
/\s/ - Any whitespace character
And in ruby:
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([:blank:], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
So, here is my question: how do I get a regexp to match a group like this, but exempt a character out?
Examples:
match all punctuations apart from the question mark
match all whitespace characters apart from the new line
match all words apart from "go"... etc
Thanks.

You can use character class subtraction.
Rexegg:
The syntax […&&[…]] allows you to use a logical AND on several character classes to ensure that a character is present in them all. Intersecting with a negated character, as in […&&[^…]] allows you to subtract that class from the original class.
Consider this code:
s = "./?!"
res = s.scan(/[[:punct:]&&[^!]]/)
puts res
Output is only ., / and ? since ! is excluded.
Restricting with a lookahead (as sawa has written just now) is also possible, but is not required when you have this subtraction supported. When you need to restrict some longer values (more than 1 character) a lookahead is required.
In many cases, a lookahead must be anchored to a word boundary to return correct results. As an example of using a lookahead to restrict punctuation (single character matching generic pattern):
/(?:(?!!)[[:punct:]])+/
This will match one or more punctuation symbols, except !.
The code puts "./?!".scan(/(?:(?!!)[[:punct:]])+/) will output ./?
Use character class subtraction whenever you need to exclude single characters; it is more efficient than using lookaheads.
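The same subtraction syntax should also cover the first two scenarios from the question. This is a sketch only; the constant names and the sample strings are invented for the illustration.

```ruby
# Punctuation except the question mark (scenario 1).
PUNCT_EXCEPT_QUESTION = /[[:punct:]&&[^?]]/

# Whitespace except the newline (scenario 2).
SPACE_EXCEPT_NEWLINE = /[[:space:]&&[^\n]]/

p ".?!,".scan(PUNCT_EXCEPT_QUESTION)      #=> [".", "!", ","]
p "a b\tc\nd".scan(SPACE_EXCEPT_NEWLINE)  #=> [" ", "\t"]
```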
So, the 3rd scenario regex must look like:
/\b(?!go\b)\w+\b/
        ^^
If you write /(?!\bgo\b)\b\w+\b/, the regex engine will check each position in the input string. If you use a \b at the beginning, only word boundary positions will be checked, and the pattern will yield better performance. Also note that the ^^ \b is very important since it makes the regex engine check for the whole word go. If you remove it, it will only restrict to the words that do not start with go.
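A quick way to see why the \b inside the lookahead matters is to run both variants on the same input. The constant names and the sample text below are assumptions made for this sketch.

```ruby
# Lookahead checks for the whole word "go" (note the \b after go).
EXCLUDE_WHOLE_WORD = /\b(?!go\b)\w+\b/

# Without that \b, every word that merely starts with "go" is rejected.
EXCLUDE_PREFIX = /\b(?!go)\w+\b/

TEXT = "go gone going stay"

p TEXT.scan(EXCLUDE_WHOLE_WORD) #=> ["gone", "going", "stay"]
p TEXT.scan(EXCLUDE_PREFIX)     #=> ["stay"]
```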

Put what you want to exclude inside a negative lookahead in front of the match. For example,
To match all punctuations apart from the question mark,
/(?!\?)[[:punct:]]/
To match all words apart from "go",
/(?!\bgo\b)\b\w+\b/

This is a general approach that is sometimes useful:
a = []
".?!,:;-".scan(/[[:punct:]]/) { |s| a << s unless s == '?' }
a #=> [".", "!", ",", ":", ";", "-"]
The content of the block is limited only by your imagination.

How do I tune this regex to return the matches I want?

So I have a string that looks like this:
#jackie#test.com, #mike#test.com
What I want to do is before any email in this comma separated list, I want to remove the #. The issue I keep running into is that if I try to do a regular \A flag like so /[\A#]+/, it finds all the instances of # in that string...including the middle crucial #.
The same thing happens if I do /[\s#]+/. I can't figure out how to just look at the beginning of each string, where each string is a complete email address.
Edit 1
Note that all I need is the regex, I already have the rest of the stuff I need to do what I want. Specifically, I am achieving everything else like this:
str.gsub(/#/, '').split(',').map(&:strip)
Where str is my string.
All I am looking for is the regex portion for my gsub.

You may use the below negative lookbehind based regex.
str.gsub(/(?<!\S)#/, '').split(',').map(&:strip)
(?<!\S) is a negative lookbehind asserting that the # we are about to match is not preceded by a non-space character. So this matches a # at the start of the string or a # immediately after a whitespace character.
The difference between my answer and hwnd's str.gsub(/\B#/, '') is that mine won't match the # in :#, but hwnd's does. \B matches between two word characters or two non-word characters.
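The difference between the two patterns can be checked with an input that contains :#. The constant names and the extra x:#y entry are made up for this sketch.

```ruby
# Pattern from this answer: # not preceded by a non-space character.
LOOKBEHIND = /(?<!\S)#/

# Pattern from the other answer: # at a position that is not a word boundary.
NON_BOUNDARY = /\B#/

# Assumed sample: the original list plus an entry containing ":#".
LIST = "#jackie#test.com, #mike#test.com, x:#y"

p LIST.gsub(LOOKBEHIND, '')   #=> "jackie#test.com, mike#test.com, x:#y"
p LIST.gsub(NON_BOUNDARY, '') #=> "jackie#test.com, mike#test.com, x:y"
```

Both leave the # in the middle of each address alone, because a word character precedes it; only \B# also removes the # that follows the colon.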

Here is one solution
str = "#jackie#test.com, #mike#test.com"
p str.split(/,[ ]+/).map{ |i| i.gsub(/^#/, '')}
Output
["jackie#test.com", "mike#test.com"]

How to detect if string contains only latin symbols using Ruby 1.9?

I need to detect whether a string contains symbols from a non-Latin alphabet. Numbers and special symbols like -, _, + are fine. I only need to know whether there are any non-Latin symbols. For example:
"123sdjjsf-4KSD".just_latin?
should return true.
"12333ыц4--sdf".just_latin?
should return false.

I think that this should work for you:
# encoding: UTF-8
class String
def just_latin?
!!self.match(/\A[a-zA-Z0-9_\-+ ]*\z/)
end
end
puts "123sdjjsf-4KSD".just_latin?
puts "12333ыц4--sdf".just_latin?
Note that String#ascii_only? is very close to what you want as well.

The following regular expression will match a single letter character that is not Latin:
[\p{L}&&[^a-zA-Z]]
The && syntax intersects two character classes. The first one (\p{L}) matches any Unicode letter. The second one ([^a-zA-Z]) matches any character that is not (^) a Latin one (a-z or A-Z). That is, the whole character class matches any letter that is not a Latin one.
See it working on Rubular.
So if you use this regular expression inside just_latin? and return true if no match is found, it should work just like you want it to.
I tried with the Unicode property \p{Latin} for the second character class before, but that is not entirely reliable, since \p{Latin} includes for instance the Icelandic characters þ, æ, ð.
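As a rough sketch of that idea, the check could look like the following. It is written as a standalone method taking the string as an argument rather than as a String extension, and the constant name is an assumption for the example.

```ruby
# encoding: UTF-8

# Any Unicode letter that is not an ASCII Latin letter.
NON_LATIN_LETTER = /[\p{L}&&[^a-zA-Z]]/

# Hypothetical helper: true when the string contains no non-Latin letter.
def just_latin?(str)
  (str =~ NON_LATIN_LETTER).nil?
end

p just_latin?("123sdjjsf-4KSD") #=> true
p just_latin?("12333ыц4--sdf")  #=> false
p just_latin?("þ")              #=> false
```

Unlike a whitelist of allowed characters, this version accepts any digit or symbol and rejects only letters outside a-z and A-Z.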

There you go, just match those characters and you are done (a-z means characters from a to z): ^[a-zA-Z_\-+]+$

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
@hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.

The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes

This should work:
@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or, if you don't want your hashtag to start with a special character:
@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten

How about this:
@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
This takes care of #alphabets, #2323+2323, #2323-2323 and #2323+65656-67676.
It also removes the # at the beginning.
Or, if you want it in array form:
@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect { |x| x[1..-1] }
This took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works on my computer but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not. The only difference is the () in the second scan call. The parentheses make no difference when I use the match method.
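A likely explanation for the difference: when the pattern contains capture groups, String#scan returns one array of group captures per match instead of the matched text, and a group that did not take part in a match comes back as nil. The sketch below shows this; the constant names and the sample text are invented for the illustration.

```ruby
# Same alternation, without and with a capture group around the first branch.
WITHOUT_GROUP = /#[[:alpha:]]+|#[\d\+-]+\d+/
WITH_GROUP    = /(#[[:alpha:]]+)|#[\d\+-]+\d+/

TAGS = "#ruby and #1+1"

# No groups: scan returns the full text of each match.
p TAGS.scan(WITHOUT_GROUP) #=> ["#ruby", "#1+1"]

# One group: scan returns the captures; the second match leaves the group unset.
p TAGS.scan(WITH_GROUP)    #=> [["#ruby"], [nil]]

# MatchData#to_s always gives the whole match, so the group does not matter there.
p TAGS.match(WITH_GROUP).to_s #=> "#ruby"
```

If the parentheses are only needed for grouping, a non-capturing group (?:...) keeps scan returning the matched text.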
