How can I match Korean characters in a Ruby regular expression? - ruby

I have some basic validations for usernames using regular expressions, something like [\w-_]+, and I want to add support for the Korean alphabet while keeping the rest of the validation the same.
I don't want to allow special characters, such as {}[]!##$%^&*() etc., I just want to replace the \w with something that matches a given alphabet in addition to [a-zA-Z0-9].
Which means username like 안녕 should be valid, but not 안녕[].
I need to do this in Ruby 1.9.

Try this:
[가-힣]+
This matches every character from U+AC00 to U+D7A3 (the precomposed Hangul syllables), which is probably enough for your purposes. (I don't think you'll need old Hangul characters and the like.)

You can test for invalid characters like this:
# encoding: utf-8
def valid_name?(name)
  !name.match(/[^a-zA-Z0-9\p{Hangul}]/)
end
ar = %w(안녕 name 안녕[].)
ar.each { |name| puts "#{name} is #{valid_name?(name) ? "valid" : "invalid"}." }
# 안녕 is valid.
# name is valid.
# 안녕[]. is invalid.

I think you can replace \w with [[:word:]]:
/^[[:word:]\-_]+$/ should work.
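Here is a minimal sketch of the [[:word:]] approach, assuming Ruby 1.9 or later and UTF-8 strings. The method name valid_username? is made up for illustration. It anchors with \A and \z instead of ^ and $, because in Ruby ^ and $ match at every line break inside the string, which lets a multi-line value slip through. Keep in mind that [[:word:]] already includes the underscore and accepts letters from any script, not only Korean.

```ruby
# encoding: utf-8

# Hypothetical helper: true when the name consists only of
# Unicode word characters (letters, digits, underscore) and hyphens.
def valid_username?(name)
  !!(name =~ /\A[[:word:]\-]+\z/)
end

%W(안녕 user_name-1 안녕[] ok\n[]).each do |name|
  puts "#{name.inspect}: #{valid_username?(name)}"
end
```

The last sample contains a line break followed by brackets; a pattern anchored with ^ and $ would accept it because its first line is valid on its own.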

Matching for invalid characters is your best option, because there are far too many valid Korean characters to list: Hangul is technically an alphabet but is encoded as one character per syllable, and there are additionally thousands of Chinese loan characters (Hanja) which should also be valid.
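If Hanja should be accepted too, the blacklist check can be sketched like this, assuming Ruby 1.9 or later. The method name korean_name_ok? and the exact set of allowed characters are assumptions for illustration; \p{Hangul} and \p{Han} are the Unicode script properties for Korean syllables/jamo and for Chinese-derived characters.

```ruby
# encoding: utf-8

# Hypothetical helper: a name is accepted when it is non-empty and
# contains nothing outside ASCII letters, digits, "_", "-",
# Hangul and Han (Hanja) characters.
def korean_name_ok?(name)
  !name.empty? && name !~ /[^a-zA-Z0-9_\-\p{Hangul}\p{Han}]/
end

%w(안녕 韓國 name_1 안녕[]).each do |name|
  puts "#{name} is #{korean_name_ok?(name) ? 'valid' : 'invalid'}."
end
```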

Related

Regex for capital letters not matching accented characters

I am new to ruby and I'm trying to work with regex.
I have a text which looks something like:
HEADING
Some text which is always non capitalized. Headings are always capitalized, followed by a space or nothing more.
YOU CAN HAVE MULTIPLE WORDS IN HEADING
I'm using this regular expression to choose all headings:
^[A-Z]{2,}\s?([A-Z]{2,}\s?)*$
However, it only matches headings that do not contain characters such as Č, Š, Ž (Slovenian characters).
So I'm guessing [A-Z] only matches ASCII characters? How could I match UTF-8 characters?
You are right: when you define the ASCII range A-Z, the match is made literally only for those characters. This has to do with the history of characters on computers. More and more characters have been added over time, and they are not always laid out in an encoding in ways that are easy to use.
You could make a larger character class that matches the Slovenian characters you need, by listing them.
But there is a shortcut. The Unicode character data already records which characters are uppercase, so you can write a shorter match for "all uppercase characters": /[[:upper:]]/. See http://ruby-doc.org//core-2.1.4/Regexp.html for more.
Altering your regular expression with just this adjustment:
^[[:upper:]]{2,}\s?([[:upper:]]{2,}\s?)*$
You may need to adjust it further; for instance, it would not match the heading "I AM A HEADING", because the pattern insists that each word is at least two letters long.
Without seeing all your examples, I would probably simplify the group matching and just allow spaces anywhere:
^[[:upper:]\s]+$
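As a rough sketch of how the simplified pattern could be applied, assuming Ruby 1.9 or later and UTF-8 text: the method name find_headings and the sample text are invented for illustration. The text is checked line by line with \A and \z, because \s also matches line breaks and ^...$ on the whole string could otherwise span several lines.

```ruby
# encoding: utf-8

# Hypothetical helper: returns the lines made up only of
# uppercase letters (any script) and whitespace.
def find_headings(text)
  text.lines.map(&:chomp).select { |line| line =~ /\A[[:upper:]\s]+\z/ }
end

sample = "NASLOV ČŠŽ\nSome text which is never capitalized.\nI AM A HEADING\n"
puts find_headings(sample)
```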
You can use the Unicode uppercase letter property:
\p{Lu}
Your regex:
\b\p{Lu}{2,}(?:\s*\p{Lu}{2,})*\b

How to hide output password with sharp in ruby?

I want to print sharp characters (#) instead of the password in my Ruby code:
puts " found password: #{pass.tr('?','#')}"
I need the output to contain as many '#' characters as there are characters in the password.
What is the right way to do this?
The method .tr is intended to swap specific characters; you cannot do a wild-card match. Even if you extended it to cover many characters, there is a risk that you would miss or forget a special character that is allowed in passwords on your system.
A simple variant of what you have is to use .gsub instead:
pass.gsub(/./,'#')
This uses regular expressions to find groups of characters to swap. The simple Regexp /./ matches any single character. The Ruby core documentation on regular expressions includes a brief introduction, in case you have not used them much before.
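Another option, not part of the answer above, is to skip the regular expression and repeat the '#' character once per character of the password. A minimal sketch, with mask as a made-up method name; one difference worth knowing is that /./ does not match a line break unless the m option is given, whereas this version masks every character.

```ruby
# Hypothetical helper: one '#' per character of the password.
def mask(pass)
  "#" * pass.length
end

puts " found password: #{mask('s3cr?t!')}"
```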

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
@hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
This should work:
@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
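To see what the first of these patterns does with trailing punctuation, here is a small sketch assuming Ruby 1.9 or later and a UTF-8 string. The method name extract_hashtags and the sample sentence are invented for illustration. Because [[:graph:]] also matches '#', two tags written without a space between them would be returned as one.

```ruby
# encoding: utf-8

# Hypothetical helper: returns hashtags without the leading '#'
# and without trailing punctuation (each tag must end in a letter or digit).
def extract_hashtags(str)
  str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
end

p extract_hashtags("Try #1+1, #hashtag! and #héllo.")
```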
How about this:
@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
This takes care of #alphabets or #2323+2323 #2323-2323 #2323+65656-67676
It also removes the # at the beginning.
Or if you want it in array form:
@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
Wow, this took so long, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not on my computer. The only difference is the () in the second scan statement. The parentheses make no difference, as expected, when I use the match method.
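The difference most likely comes from how String#scan treats capture groups: when the pattern contains groups, scan returns the captured groups for each match rather than the whole match, so a match made by the ungrouped alternative shows up as nil. A short sketch, with made-up method names and sample input:

```ruby
# Pattern with a capture group around the first alternative only.
def scan_with_group(str)
  str.scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

# Same pattern with no capture group.
def scan_without_group(str)
  str.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/)
end

# A non-capturing group (?:...) keeps the grouping but not the capture.
def scan_non_capturing(str)
  str.scan(/(?:#[[:alpha:]]+)|#[\d\+-]+\d+/)
end

sample = "#abc #12+34"
p scan_with_group(sample)
p scan_without_group(sample)
p scan_non_capturing(sample)
```

With match, to_s is called on the whole MatchData, which is why the parentheses appear to have no effect there.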

Regex to validate strings having only characters (without special characters but with accented characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can contain only characters (case-insensitive), blank spaces and numbers.
More:
special characters are not allowed (e.g. !"£$%&/()=?^), except - and _;
accented characters are allowed (e.g. à, è, é, ò, ...).
The regex that I know from this question is ^[a-zA-Z\d\s]*$, but it does not handle the allowed special characters or the accented characters.
So, how should I improve the regex?
I wrote the ^(?:[^\W_]|\s)*$ answer in the question you referred to (which actually would have been different if I'd known you wanted to allow _ and -). Not being a Ruby guy myself, I didn't realize that Ruby defaults to not using Unicode for regex matching.
Sorry for my lack of Ruby experience. What you want to do is use the u flag. That switches to Unicode (UTF-8), so accented characters are caught. Here's the pattern you want:
^[\w\s-]*$
And here it is in action at Rubular. This should do the trick, I think.
The u flag works on my original answer as well, though that one isn't meant to allow _ or - characters.
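One caveat if you are on Ruby 1.9 or later: as far as I know, \w there matches only [a-zA-Z0-9_] regardless of the u flag, so accented letters need a POSIX bracket class or a \p{...} property instead. A minimal sketch under that assumption, with valid_text? as a made-up method name:

```ruby
# encoding: utf-8

# Hypothetical helper: allows Unicode letters and digits,
# whitespace, underscore and hyphen; nothing else.
def valid_text?(str)
  !!(str =~ /\A[[:alnum:]\s_\-]*\z/)
end

["àèéò Test_1-2", "hello!", "£5"].each do |s|
  puts "#{s.inspect}: #{valid_text?(s)}"
end
```

In a Rails model the same pattern could be passed to a format validation; \A and \z are used here so that a value containing line breaks is checked as a whole.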
Something like ^[\w\s\-]*$ should validate characters, blank spaces, minus, and underscore.
To validate a string only against disallowed characters, in this case |, <, >, " and &:
^[^|<>\"&]*$

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
What I tried is to match anything that is neither an ASCII character (00 - 7F) nor one of the special characters I want to keep, and replace it with an empty string.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.
Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
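For comparison, a sketch of the same whitelist on Ruby 1.9 or later, where strings carry their encoding and a UTF-8 source file lets the accented letters sit directly in the character class. The method name keep_ascii_and_accents is invented, and the list of accented letters is only an example.

```ruby
# encoding: utf-8

# Hypothetical helper: removes every character that is neither ASCII
# nor one of the explicitly whitelisted accented letters.
def keep_ascii_and_accents(str)
  str.gsub(/[^\x00-\x7FáéíóúÁÉÍÓÚ]/, "")
end

puts keep_ascii_and_accents("subscripción gustaría♥")
```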
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. There is not a word about the encoding of the string. Also, do you want to whitelist characters or blacklist them? Which ones?
But you get the idea: decide what you want, and then use proper ranges as colleagues here have already proposed. Some examples:
If str = "subscripción gustaría♥" is UTF-8,
then you can blacklist all characters above the range (excluding whitespace):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
If the string is in the ISO-8859-1 code page, you can try to match all quirky characters like the "heart" from the beginning of the ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem here is with the regex; it has nothing to do with Ruby. You will probably need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my Mac but works on Linux.
