Sanitize input for Ruby

Hello, I would like to sanitize input in Ruby without it messing up my strings that contain foreign characters.
string1=string.downcase.gsub( /<(.|\n)*?>/, '' ).gsub(" ", "").gsub(",","").gsub("'","").gsub("_", "").gsub(";", "").gsub("-", "").gsub(":","").gsub(".", "").gsub("?", "").gsub("!", "").gsub("^","").gsub("%", "").gsub("$","")
The string needs to be stripped of spaces, apostrophes, and everything but letters (not sure about numbers), in addition to being sanitized. I am not sure whether I forgot something, and something is probably redundant.
My code works OK as long as the string does not contain harmless non-English characters such as accented letters. I'd like it to handle them, but they break my code. My guess is that they are converted to percent-escapes such as %25 and break afterwards. In fact, it breaks even if I don't sanitize at all. How can I tell Ruby to handle non-English characters correctly? Thanks a lot.

Like this:
" ' ; te st".gsub(/\W+/, "") # "test"
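One caveat with the one-liner above: in Ruby 1.9 and later, \w (and therefore \W) only considers ASCII word characters, so /\W+/ also strips accented letters. A minimal sketch that keeps letters from any script, assuming UTF-8 strings and using the Unicode property \p{L}, could look like this. The helper name letters_only is hypothetical, and the tag-stripping pattern is a rough simplification, not a full HTML sanitizer.

```ruby
# Hypothetical helper: strips markup, lowercases, and keeps only letters.
def letters_only(string)
  without_tags = string.gsub(/<[^>]*>/, '') # drop anything that looks like an HTML tag
  lowered = without_tags.downcase           # folding non-ASCII letters needs Ruby 2.4+
  lowered.gsub(/[^\p{L}]/, '')              # remove everything that is not a Unicode letter
end

puts letters_only("<b>Crème</b> Brûlée, s'il vous plaît!")
# prints "crèmebrûléesilvousplaît"
```

If digits should survive as well, \p{N} can be added to the character class.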

Related

Ruby on Rails: how to remove whitespace from a Japanese word?

I am trying to remove whitespace from a Japanese word.
input "かいしゃ(会社)"
output "かいしゃ(会社)"
The space here is consumed by the parentheses. They are not your regular ASCII parentheses, they are of the "full width" flavor.
If you want to replace them with ASCII parentheses, you can do it like this:
compact_input = input.gsub("\uFF08", '(') # and a similar step for the closing parenthesis
This might make your string look odd in Japanese, though (I don't know the language well enough to say).
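Both parentheses can also be handled in one step. A small sketch, assuming the input is a UTF-8 string and that only the two full-width parentheses (U+FF08 and U+FF09) need replacing:

```ruby
input = "かいしゃ(会社)"

# Map U+FF08 and U+FF09 (full-width parentheses) to ASCII "(" and ")".
compact_input = input.tr("\uFF08\uFF09", "()")

puts compact_input
# prints "かいしゃ(会社)"
```

String#tr maps each character in the first argument to the character at the same position in the second, so no regular expression is needed here.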

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
@hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.
The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes
This should work:
@hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
@hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten
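To illustrate how the first pattern behaves, here is a short sketch with a made-up sample string (the sample text is an assumption, not from the question). Because [[:graph:]] and [[:alnum:]] are Unicode-aware on UTF-8 strings, non-ASCII hashtags are matched too, and trailing punctuation is dropped since the match must end in an alphanumeric character.

```ruby
str = "Maths #1+1, coffee #café! and #日本語."

# The capture group excludes the leading "#"; flatten turns [["1+1"], ...] into ["1+1", ...].
hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten

p hashtags
# prints ["1+1", "café", "日本語"]
```

Note that [[:graph:]] also matches "#", so two hashtags written without a space between them would come back as a single match.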
How about this:
@hashtags ||= string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
This takes care of #alphabets as well as #2323+2323, #2323-2323 and #2323+65656-67676.
It also removes the # at the beginning.
Or if you want it in array form:
@hashtags ||= string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect { |x| x[1..-1] }
Wow, this took a long time, and I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works on my computer but scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) does not. The only difference is the () in the second scan statement. The parentheses make no difference, as expected, when I use the match method.
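The difference comes from how String#scan treats capture groups: when the pattern contains groups, scan returns the captures for each match instead of the whole match, so a match made by the ungrouped alternative yields nil. A small sketch with a made-up input string:

```ruby
string = "#abc #12+34"

pattern_without_group = /#[[:alpha:]]+|#[\d\+-]+\d+/
pattern_with_group    = /(#[[:alpha:]]+)|#[\d\+-]+\d+/

# No groups: scan returns each whole match.
p string.scan(pattern_without_group)
# prints ["#abc", "#12+34"]

# One group: scan returns the captures per match; the second match
# comes from the alternative outside the group, so its capture is nil.
p string.scan(pattern_with_group)
# prints [["#abc"], [nil]]

# match(...).to_s always returns the whole first match, groups or not.
p string.match(pattern_with_group).to_s
# prints "#abc"
```

If grouping is needed only for structure, a non-capturing group (?:...) keeps scan returning whole matches.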

How can I match Korean characters in a Ruby regular expression?

I have some basic validations for usernames using regular expressions, something like [\w-_]+, and I want to add support for the Korean alphabet while keeping the validation otherwise the same.
I don't want to allow special characters such as {}[]!@#$%^&*() etc.; I just want to replace the \w with something that matches a given alphabet in addition to [a-zA-Z0-9].
Which means username like 안녕 should be valid, but not 안녕[].
I need to do this in Ruby 1.9.
Try this:
[가-힣]+
This matches every character from U+AC00 to U+D7A3, which is probably enough for your purposes. (I don't think you'll need archaic Hangul characters and the like.)
You can test for invalid characters like this:
# encoding: utf-8
def valid_name?(name)
  !name.match(/[^a-zA-Z0-9\p{Hangul}]/)
end
ar = %w(안녕 name 안녕[].)
ar.each { |name| puts "#{name} is #{valid_name?(name) ? "valid" : "invalid"}." }
# 안녕 is valid.
# name is valid.
# 안녕[]. is invalid.
I think you can replace \w with the POSIX class [:word:].
/^[[:word:]\-_]+$/ should work.
Matching for invalid characters is your best option, because there are way too many valid Korean characters - it's technically an alphabet but computerized as one-character-per-syllable, and additionally there are thousands of Chinese loan characters (Hanja) which should also be valid.
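As a rough sketch of the [[:word:]] approach, assuming Ruby 1.9 or later and UTF-8 input: the method name valid_username? is hypothetical, and \A and \z are used here instead of ^ and $ because in Ruby ^ and $ match at line boundaries, which would let a multi-line value slip through.

```ruby
# Hypothetical validator: accepts Unicode word characters
# (letters, marks, digits, underscore) plus the hyphen.
def valid_username?(name)
  # \A and \z anchor to the start and end of the whole string.
  !!(name =~ /\A[[:word:]\-]+\z/)
end

puts valid_username?("안녕")        # prints true
puts valid_username?("user_name-1") # prints true
puts valid_username?("안녕[]")      # prints false
puts valid_username?("ok\n[]")      # prints false
```

Since [[:word:]] is Unicode-aware, this also accepts Hanja and letters from other scripts, which may or may not be what you want.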

Regex to validate strings having only characters (without special characters but with accented characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can contain only letters (case-insensitive), blank spaces and numbers.
More:
special characters are not allowed (eg: !"£$%&/()=?^) except - and _;
accented characters are allowed (eg: à, è, é, ò, ...);
The regex that I know from this question is ^[a-zA-Z\d\s]*$, but it does not accept the allowed special characters (- and _) or accented characters.
So, how should I improve the regex?
I wrote the ^(?:[^\W_]|\s)*$ answer in the question you referred to (which actually would have been different if I'd known you wanted to allow _ and -). Not being a Ruby guy myself, I didn't realize that Ruby defaults to not using Unicode for regex matching.
Sorry for my lack of Ruby experience. What you want to do is use the u flag. That switches to Unicode (UTF-8), so accented characters are caught. Here's the pattern you want:
^[\w\s-]*$
And here it is in action at Rubular. This should do the trick, I think.
The u flag works on my original answer as well, though that one isn't meant to allow _ or - characters.
Something like ^[\w\s\-]*$ should validate characters, blank spaces, minus, and underscore.
Alternatively, validate the string by rejecting only the disallowed characters, in this case |, <, >, " and &:
^[^|<>\"&]*$
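A note on versions: on Ruby 1.9 and later, \w matches only ASCII word characters, so patterns built on \w reject accented letters there. A minimal sketch for the requirements in the question, assuming Ruby 1.9+ and UTF-8 input, uses the Unicode properties \p{L} (letters) and \p{N} (numbers) instead. The constant and method names are hypothetical.

```ruby
# Letters and numbers from any script, ASCII whitespace, underscore and hyphen.
VALID_TEXT = /\A[\p{L}\p{N}\s_\-]*\z/

# Hypothetical helper returning true or false.
def valid_text?(value)
  !!(value =~ VALID_TEXT)
end

puts valid_text?("àèéò test_1-2") # prints true
puts valid_text?("a!b")           # prints false
puts valid_text?("10 £")          # prints false
```

In a Rails model, such a pattern could be passed to a format validation; \A and \z are used because ^ and $ match per line in Ruby.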

Remove all but some special characters

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
My approach was to match anything that is neither an ASCII character (00 - 7F) nor one of the special characters I want to keep, and replace it with an empty string.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is Ruby 1.8.)
Update: I am trying to make the question a little clearer. The string is UTF-8 encoded, and I am trying to whitelist the ASCII characters plus ó and í and blacklist everything else.
Oniguruma supports all the characters you care about without your having to deal with codepoints. You can just add the Unicode characters to the character class you're whitelisting and append the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil
str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')
The question is a bit vague. It says nothing about the encoding of the string. Also, do you want to whitelist characters or blacklist them? Which ones?
But you get the idea: decide what you want, and then use the proper ranges, as others here have already proposed. Some examples:
If str = "subscripción gustaría♥" is UTF-8, you can blacklist all characters above a given range (excluding whitespace):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
If the string is in the ISO-8859-1 code page, you can try to match odd characters such as the "heart" at the beginning of the ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem here is with the regex and has nothing to do with Ruby. You will probably need to experiment more.
It is not completely clear which characters you want to keep and which you want to delete. The last character of the example string is a Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using Ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')
I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my Mac but works on Linux.
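For newer Rubies, the same whitelist can be sketched as follows. This assumes Ruby 2.2 or later (for String#unicode_normalize) and UTF-8 strings; the constant and method names are hypothetical. Normalizing to NFC first is a guess at one possible cause of platform differences: if an accented letter arrives in decomposed form (base letter plus combining mark), the combining mark is not in the whitelist and would be stripped.

```ruby
# Everything that is neither ASCII nor one of the whitelisted accented letters.
DISALLOWED = /[^[:ascii:]ÁáÉéÍíÑñÓóÚúÜü]/

# Hypothetical helper: compose accents first, then drop disallowed characters.
def keep_whitelisted(str)
  str.unicode_normalize(:nfc).gsub(DISALLOWED, '')
end

puts keep_whitelisted("subscripción gustaría♥")
# prints "subscripción gustaría"

# Decomposed input ("o" followed by U+0301) is composed before filtering.
puts keep_whitelisted("subscripcio\u0301n") == "subscripci\u00F3n"
# prints true
```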
