Remove all but some special characters

Remove all but some special characters - ruby

I am trying to come up with a regex to remove all special characters except some. For example, I have a string:
str = "subscripción gustaría♥"
I want the output to be "subscripción gustaría".
The way I tried to do is, match anything which is not an ascii character (00 - 7F) and not special character I want and replace it with blank.
str.gsub(/(=?[^\x00-\x7F])(=?^\xC3\xB3)(=?^\xC3\xA1)/,'')
This doesn't work. The last special character is not removed.
Can someone help? (This is ruby 1.8)
Update: I am trying to make the question a little more clear. The string is utf-8 encoded. And I am trying to whitelist the ascii characters plus ó and í and blacklist everything else.

Oniguruma has support for all the characters you care about without having to deal with codepoints. You can just add the unicode characters inside the character class you're whitelisting, followed by the 'u' option.
ruby-1.8.7-p248 > str = "subscripción gustaría♥"
=> "subscripci\303\263n gustar\303\255a\342\231\245"
ruby-1.8.7-p248 > puts str.gsub(/[^a-zA-Z\sáéíóúÁÉÍÓÚ]/u,'')
subscripción gustaría
=> nil

str.split('').find_all {|c| (0x00..0x7f).include? c.ord }.join('')

The question is a bit vague. There is not a word about encoding of the string. Also, you want to white-list characters or black list? Which ones?
But you get the idea, decide what you want, and then use proper ranges as colleagues here already proposed. Some examples:
if str = "subscripción gustaría♥" is utf-8
then you can blacklist all char above the range (excl. whitespaces):
str.gsub(/[^\x{0021}-\x{017E}\s]/,'')
if string is in ISO-8859-1 codepage you can try to match all quirky characters like the "heart" from the beginning of ASCII range:
str.gsub(/[\x01-\x1F]/,'')
The problem is here with regex, has nothing to do with Ruby. You probably will need to experiment more.

It is not completely clear which characters you want to keep and which you want to delete. The example string's character is some Unicode character that, in my browser, displays as a heart symbol. But it seems you are dealing with 8-bit ASCII characters (since you are using ruby 1.8 and your regular expressions point that way).
Nonetheless, you should be able to do it in one of two ways; either specify the characters you want to keep or, alternatively, specify the characters you want to delete. For example, the following specifies that all characters 0x00-0x7F and 0xC0-0xF6 should be kept (remove everything that is not in that group):
puts str.gsub(/[^\x00-\x7F\xC0-\xF6]/,'')
This next example specifies that characters 0xA1 and 0xC3 should be deleted.
puts str.gsub(/[\xA1\xC3]/,'')

I ended up doing this: str.gsub(/[^\x00-\x7FÁáÉéÍíÑñÓóÚúÜü]/,''). It doesn't work on my mac but works on linux.

Related

How do I match a UTF-8 encoded hashtag with embedded punctuation characters?

I want to extract #hashtags from a string, also those that have special characters such as #1+1.
Currently I'm using:
#hashtags ||= string.scan(/#\w+/)
But it doesn't work with those special characters. Also, I want it to be UTF-8 compatible.
How do I do this?
EDIT:
If the last character is a special character it should be removed, such as #hashtag, #hashtag. #hashtag! #hashtag? etc...
Also, the hash sign at the beginning should be removed.

The Solution
You probably want something like:
'#hash+tag'.encode('UTF-8').scan /\b(?<=#)[^#[:punct:]]+\b/
=> ["hash+tag"]
Note that the zero-width assertion at the beginning is required to avoid capturing the pound sign as part of the match.
References
String#encode
Ruby's POSIX Character Classes

This should work:
#hashtags = str.scan(/#([[:graph:]]*[[:alnum:]])/).flatten
Or if you don't want your hashtag to start with a special character:
#hashtags = str.scan(/#((?:[[:alnum:]][[:graph:]]*)?[[:alnum:]])/).flatten

How about this:
#hashtags ||=string.match(/(#[[:alpha:]]+)|#[\d\+-]+\d+/).to_s[1..-1]
Takes cares of #alphabets or #2323+2323 #2323-2323 #2323+65656-67676
Also removes # at beginning
Or if you want it in array form:
#hashtags ||=string.scan(/#[[:alpha:]]+|#[\d\+-]+\d+/).collect{|x| x[1..-1]}
Wow, this took so long but I still don't understand why scan(/#[[:alpha:]]+|#[\d\+-]+\d+/) works but not scan(/(#[[:alpha:]]+)|#[\d\+-]+\d+/) in my computer. The difference being the () on the 2nd scan statement. This has no effect as it should be when I use with match method.

How can I match Korean characters in a Ruby regular expression?

I have some basic validations for usernames using regular expressions, something like [\w-_]+, and I want to add support for Korean alphabet, while still keeping the validation the same.
I don't want to allow special characters, such as {}[]!##$%^&*() etc., I just want to replace the \w with something that matches a given alphabet in addition to [a-zA-Z0-9].
Which means username like 안녕 should be valid, but not 안녕[].
I need to do this in Ruby 1.9.

try this:
[가-힣]+
This matches every character from U+AC00 to U+D7A3, which is probably enough for your interest. (I don't think you'll need old hangul characters and stuff)

You can test for invalid characters like this:
#encoding: utf-8
def valid_name?(name)
!name.match(/[^a-zA-Z0-9\p{Hangul}]/)
end
ar = %w(안녕 name 안녕[].)
ar.each{|name| puts "#{name} is #{valid_name?(name) ? "valid" : "invalid"}."}
# 안녕 is valid.
# name is valid.
# 안녕[]. is invalid.

I think you can replace \w by [:word:]
/^[[:word:]\-_]+$/ should work

Matching for invalid characters is your best option, because there are way too many valid Korean characters - it's technically an alphabet but computerized as one-character-per-syllable, and additionally there are thousands of Chinese loan characters (Hanja) which should also be valid.

Regex to validate strings having only characters (without special characters but with accented characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can contain only characters (case insensitive characters), blank spaces and numbers.
More:
special characters are not allowed (eg: !"£$%&/()=?^) except - and _;
accented characters are allowed (eg: à, è, é, ò, ...);
The regex that I know from this question is ^[a-zA-Z\d\s]*$ but this do not validate special characters and accented characters.
So, how I should improve the regex?

I wrote the ^(?:[^\W_]|\s)*$ answer in the question you referred to (which actually would have been different if I'd known you wanted to allow _ and -). Not being a Ruby guy myself, I didn't realize that Ruby defaults to not using Unicode for regex matching.
Sorry for my lack of Ruby experience. What you want to do is use the u flag. That switches to Unicode (UTF-8), so accented characters are caught. Here's the pattern you want:
^[\w\s-]*$
And here it is in action at Rubular. This should do the trick, I think.
The u flag works on my original answer as well, though that one isn't meant to allow _ or - characters.

Something like ^[\w\s\-]*$ should validate characters, blank spaces, minus, and underscore.

Validation string only for not allowed characters. In this case |,<,>," and &.
^[^|<>\"&]*$

How do I escape a Unicode string with Ruby?

I need to encode/convert a Unicode string to its escaped form, with backslashes. Anybody know how?

In Ruby 1.8.x, String#inspect may be what you are looking for, e.g.
>> multi_byte_str = "hello\330\271!"
=> "hello\330\271!"
>> multi_byte_str.inspect
=> "\"hello\\330\\271!\""
>> puts multi_byte_str.inspect
"hello\330\271!"
=> nil
In Ruby 1.9 if you want multi-byte characters to have their component bytes escaped, you might want to say something like:
>> multi_byte_str.bytes.to_a.map(&:chr).join.inspect
=> "\"hello\\xD8\\xB9!\""
In both Ruby 1.8 and 1.9 if you are instead interested in the (escaped) unicode code points, you could do this (though it escapes printable stuff too):
>> multi_byte_str.unpack('U*').map{ |i| "\\u" + i.to_s(16).rjust(4, '0') }.join
=> "\\u0068\\u0065\\u006c\\u006c\\u006f\\u0639\\u0021"

To use a unicode character in Ruby use the "\uXXXX" escape; where XXXX is the UTF-16 codepoint. see http://leejava.wordpress.com/2009/03/11/unicode-escape-in-ruby/

If you have Rails kicking around you can use the JSON encoder for this:
require 'active_support'
x = ActiveSupport::JSON.encode('µ')
# x is now "\u00b5"
The usual non-Rails JSON encoder doesn't "\u"-ify Unicode.

There are two components to your question as I understand it: Finding the numeric value of a character, and expressing such values as escape sequences in Ruby. Further, the former depends on what your starting point is.
Finding the value:
Method 1a: from Ruby with String#dump:
If you already have the character in a Ruby String object (or can easily get it into one), this may be as simple as displaying the string in the repl (depending on certain settings in your Ruby environment). If not, you can call the #dump method on it. For example, with a file called unicode.txt that contains some UTF-8 encoded data in it – say, the currency symbols €£¥$ (plus a trailing newline) – running the following code (executed either in irb or as a script):
s = File.read("unicode.txt", :encoding => "utf-8") # this may be enough, from irb
puts s.dump # this will definitely do it.
... should print out:
"\u20AC\u00A3\u00A5$\n"
Thus you can see that € is U+20AC, £ is U+00A3, and ¥ is U+00A5. ($ is not converted, since it's straight ASCII, though it's technically U+0024. The code below could be modified to give that information, if you actually need it. Or just add leading zeroes to the hex values from an ASCII table – or reference one that already does so.)
(Note: a previous answer suggested using #inspect instead of #dump. That sometimes works, but not always. For example, running ruby -E UTF-8 -e 'puts "\u{1F61E}".inspect' prints an unhappy face for me, rather than an escape sequence. Changing inspect to dump, though, gets me the escape sequence back.)
Method 1b: with Ruby using String#encode and rescue:
Now, if you're trying the above with a larger input file, the above may prove unwieldy – it may be hard to even find escape sequences in files with mostly ASCII text, or it may be hard to identify which sequences go with which characters. In such a case, one might replace the second line above with the following:
encodings = {} # hash to store mappings in
s.split("").each do |c| # loop through each "character"
begin
c.encode("ASCII") # try to encode it to ASCII
rescue Encoding::UndefinedConversionError # but if that fails
encodings[c] = $!.error_char.dump # capture a dump, mapped to the source character
end
end
# And then print out all the captured non-ASCII characters:
encodings.each do |char, dumped|
puts "#{char} encodes to #{dumped}."
end
With the same input as above, this would then print:
€ encodes to "\u20AC".
£ encodes to "\u00A3".
¥ encodes to "\u00A5".
Note that it's possible for this to be a bit misleading. If there are combining characters in the input, the output will print each component separately. For example, for input of 🙋🏾 ў ў, the output would be:
🙋 encodes to "\u{1F64B}".
🏾 encodes to "\u{1F3FE}".
ў encodes to "\u045E".
у encodes to "\u0443". ̆
encodes to "\u0306".
This is because 🙋🏾 is actually encoded as two code points: a base character (🙋 - U+1F64B), with a modifier (🏾, U+1F3FE; see also). Similarly with one of the letters: the first, ў, is a single pre-combined code point (U+045E), while the second, ў – though it looks the same – is formed by combining у (U+0443) with the modifier ̆ (U+0306 - which may or may not render properly, including on this page, since it's not meant to stand alone). So, depending on what you're doing, you may need to watch out for such things (which I leave as an exercise for the reader).
Method 2a: from web-based tools: specific characters:
Alternatively, if you have, say, an e-mail with a character in it, and you want to find the code point value to encode, if you simply do a web search for that character, you'll frequently find a variety of pages that give unicode details for the particular character. For example, if I do a google search for ✓, I get, among other things, a wiktionary entry, a wikipedia page, and a page on fileformat.info, which I find to be a useful site for getting details on specific unicode characters. And each of those pages lists the fact that that check mark is represented by unicode code point U+2713. (Incidentally, searching in that direction works well, too.)
Method 2b: from web-based tools: by name/concept:
Similarly, one can search for unicode symbols to match a particular concept. For example, I searched above for unicode check marks, and even on the Google snippet there was a listing of several code points with corresponding graphics, though I also find this list of several check mark symbols, and even a "list of useful symbols" which has a bunch of things, including various check marks.
This can similarly be done for accented characters, emoticons, etc. Just search for the word "unicode" along with whatever else you're looking for, and you'll tend to get results that include pages that list the code points. Which then brings us to putting that back into ruby:
Representing the value, once you have it:
The Ruby documentation for string literals describes two ways to represent unicode characters as escape sequences:
\unnnn Unicode character, where nnnn is exactly 4 hexadecimal digits ([0-9a-fA-F])
\u{nnnn ...} Unicode character(s), where each nnnn is 1-6 hexadecimal digits ([0-9a-fA-F])
So for code points with a 4-digit representation, e.g. U+2713 from above, you'd enter (within a string literal that's not in single quotes) this as \u2713. And for any unicode character (whether or not it fits in 4 digits), you can use braces ({ and }) around the full hex value for the code point, e.g. \u{1f60d} for 😍. This form can also be used to encode multiple code points in a single escape sequence, separating characters with whitespace. For example, \u{1F64B 1F3FE} would result in the base character 🙋 plus the modifier 🏾, thus ultimately yielding the abstract character 🙋🏾 (as seen above).
This works with shorter code points, too. For example, that currency character string from above (€£¥$) could be represented with \u{20AC A3 A5 24} – requiring only 2 digits for three of the characters.

You can directly use unicode characters if you just add #Encoding: UTF-8 to the top of your file. Then you can freely use ä, ǹ, ú and so on in your source code.

try this gem. It converts Unicode or non-ASCII punctuation and symbols to nearest ASCII punctuation and symbols
https://github.com/qwuen/punctuate
example usage:
"100٪".punctuate
=> "100%"
the gem uses the reference in https://lexsrv3.nlm.nih.gov/LexSysGroup/Projects/lvg/current/docs/designDoc/UDF/unicode/DefaultTables/symbolTable.html for the conversion.

How to remove all non - ASCII characters from a string in Ruby

I seems to be a very simple and much needed method. I need to remove all non ASCII characters from a string. e.g Â© etc. See the following example.
#coding: utf-8
s = " Hello this a mixed string Â© that I made."
puts s.encoding
puts s.encode
output:
UTF-8
Hello this a mixed str
ing ┬⌐ that I made.
When I feed this to Watir, it produces following error:incompatible character encodings: UTF-8 and ASCII-8BIT
So my problem is that I want to get rid of all non ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
: "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)

You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitue all characters which don't thave the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")

Strip out the characters using regex. This example is in C# but the regex should be the same:
How can you strip non-ASCII characters from a string? (in C#)
Translating it into ruby using gsub should not be difficult.

UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio