Regex add special characters - ruby

I have this regex:
var characterReg = /^\s*[a-zA-Z0-9,\s]+\s*$/;
How do I include the letters: Å, Ø, Æ, å, ø, æ ?

Use the Unicode code point values:
\u{1234}{2}
In Perl, use:
\x{1234}{2}
Either form matches the character U+1234 twice in a row.
There is much more information on this here:
http://www.regular-expressions.info/unicode.html
In Ruby, Unicode regex support looks half-baked (this applies to the 1.8 series; Ruby 1.9 and later accept \u{...} and \p{...} inside a regex):
http://www.ruby-forum.com/topic/133538
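For Ruby 1.9 or later, the code point approach could look roughly like the sketch below. It assumes the subject string is UTF-8, and it swaps ^ and $ for \A and \z, because in Ruby ^ and $ match at every line boundary rather than only at the ends of the string. The helper name valid_characters? is made up for illustration.

```ruby
# Assumption: the subject string is UTF-8 encoded.
# \u00C5 \u00D8 \u00C6 are Å Ø Æ; \u00E5 \u00F8 \u00E6 are å ø æ.
CHARACTER_REG = /\A\s*[a-zA-Z0-9,\s\u00C5\u00D8\u00C6\u00E5\u00F8\u00E6]+\s*\z/

# Hypothetical helper, named here only for illustration.
def valid_characters?(text)
  CHARACTER_REG.match?(text)
end

valid_characters?("Bl\u00E5b\u00E6r, \u00D8l")  # Nordic letters are in the class
valid_characters?("caf\u00E9")                  # é is not in the class
```

If any letter from any alphabet should be accepted, \p{L} inside the character class is a broader alternative to listing the six letters.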

Related

Dealing with special character in Nokogiri / Regex

I am getting the text from the body of an HTML doc as below. When I try to regex scan for the term "Exhibit 99", I get no matches, i.e., an empty array. However, in the HTML I do see "Exhibit 99", although inspect element shows it as Exhibit&nbsp;99. How can I get rid of these HTML characters and search for "Exhibit 99" as if it were a regular string?
require 'nokogiri'
require 'open-uri'
url = "https://www.sec.gov/Archives/edgar/data/1467373/000146737316000912/fy16q3plc8-kbody.htm"
doc = Nokogiri::HTML(URI.open(url))
body = doc.css("body").text
body.scan(/exhibit 99/i) # => []

Unicode space characters
You can use:
body.scan(/exhibit\p{Zs}99/i)
From the documentation about the Unicode General Category:
/\p{Z}/ - 'Separator'
/\p{Zs}/ - 'Separator: Space'
This matches a regular space or a non-breaking space, but not a tab or newline. The string should be encoded in UTF-8. See this related question for more information.
Non-word character
A more permissive regex would be:
body.scan(/exhibit\W99/i)
This allows any single character other than a letter, a digit, or an underscore between "exhibit" and "99". It would match a space, a non-breaking space, a tab, a dash, and so on.
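To see the difference between the three patterns without fetching the page, here is a small sketch. The sample string is invented for illustration; it contains a non-breaking space (U+00A0), a regular space, and a tab between the two words.

```ruby
# Hypothetical sample text standing in for the scraped body:
# a non-breaking space (U+00A0), a regular space, and a tab.
body = "Exhibit\u00A099, exhibit 99, Exhibit\t99"

plain     = body.scan(/exhibit 99/i)       # literal ASCII space only
separator = body.scan(/exhibit\p{Zs}99/i)  # any Unicode space separator
non_word  = body.scan(/exhibit\W99/i)      # any single non-word character

puts plain.length      # 1
puts separator.length  # 2
puts non_word.length   # 3
```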

Replace all characters other than english letters and numbers to underscore

I have a string, and I would like to replace all special characters with underscores.
In other words, I want to keep only the 26 English letters (lower and upper case), the digits 0-9, and the "_" character.
Note that the string also contains non-English characters, and they need to be replaced with "_" as well.
What is the most elegant way to do this in Ruby?

It sounds like you want to replace all non-word characters with underscores. Therefore,
result = subject.gsub(/[^\w]/, '_')
But are you okay that this would also replace newlines and other whitespace characters?
If not, change it to
result = subject.gsub(/[^\w\s]/, '_')
Explanation of the regex:
[^\w\s]   # any character except word characters (a-z, A-Z, 0-9, _)
          # and whitespace (\n, \r, \t, \f, and " ")
Note
As @CarySwoveland mentions, [^\w] can also be written with the shorthand \W.
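A short sketch of both variants on an invented sample string. It relies on the fact that Ruby's \w covers only a-z, A-Z, 0-9, and _, so accented letters count as non-word characters and get replaced too.

```ruby
# Hypothetical input: "Héllo wörld! 42_ok", written with escapes so the
# accented letters are single precomposed code points.
subject = "H\u00E9llo w\u00F6rld! 42_ok"

all_replaced    = subject.gsub(/\W/, '_')       # => "H_llo_w_rld__42_ok"
spaces_retained = subject.gsub(/[^\w\s]/, '_')  # => "H_llo w_rld_ 42_ok"

puts all_replaced
puts spaces_retained
```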

Regexp non alphanumerical but not German characters

I would like to remove all non-alphanumeric characters from a string, except spaces, hyphens, and some German characters.
Example
regexp = "mönchengladbach."
regexp.gsub(/[^0-9a-z \-]/i, '')
=> mnchengladbach
I need this:
=> mönchengladbach
It should also not replace other German characters such as:
ä ö ü ß
Thanks!
Edit:
It was just me not testing properly. The IRB did not accept special characters. This works for me:
regexp.gsub(/[^0-9a-z \-äüöß]/i, '')

To remove everything that is not a letter or whitespace, you can use this:
str.gsub(/[^\p{L}\s]+/, '')
This uses a negated character class: [^\p{L}\s] matches anything that is not a letter (in any language) or a whitespace character (space, tab, newline).
\p{L} is the Unicode character class for letters.
You can easily add other characters you want to preserve, such as -:
str.gsub(/[^\p{L}\s-]+/, '')
Example script:
# encoding: UTF-8
str = "mönchengladbach."
# Replace each run of non-letter, non-whitespace characters with '#'
str = str.gsub(/[^\p{L}\s]+/, '#')
puts str # prints "mönchengladbach#"

I think you want:
/[^[:alnum:] -]/
Note that the /i flag is not necessary, and there is no need to escape - when it comes last inside [ ].
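The two answers above behave differently on digits, which a side-by-side sketch makes visible. The sample string is invented, and the input is assumed to be UTF-8 so that [[:alnum:]] and \p{L} are Unicode-aware.

```ruby
# Hypothetical input: "Mönchengladbach-Nord 41061."
city = "M\u00F6nchengladbach-Nord 41061."

# POSIX bracket version: keeps letters, digits, spaces, and hyphens.
alnum_kept = city.gsub(/[^[:alnum:] -]/, '')
# => "Mönchengladbach-Nord 41061"

# \p{L} version: keeps letters, whitespace, and hyphens, but drops digits.
letters_kept = city.gsub(/[^\p{L}\s-]+/, '')
# => "Mönchengladbach-Nord " (note the trailing space)

puts alnum_kept
puts letters_kept
```

Since the question asks to keep alphanumeric characters, the [[:alnum:]] form (or \p{L} combined with \p{N}) is probably the closer fit.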

How to match any quoted strings containing Cyrillic symbols

I need to parse a lot of text files and replace any quoted strings containing Cyrillic symbols. The strings may contain newlines, non-alphabetic characters, and special symbols (for example '$' or an escaped quote).
Can anyone help with regex?
From comments:
for example php code
function hello($word) {
$word2 = "ха-ха!";
echo "Привет, $word $word2\n";
}
hello('Мир');
I need match "ха-ха!", "Привет, $word $word2\n" and 'Мир'

This should work:
str = 'The cat is under the "таблица"'
regex = /"\p{Cyrillic}+.*?\.?"/ui
str.match(regex){|s| do_stuff_with_each_matching s}
# or...
str.gsub!(regex){|s| method_that_translates_russian s}
Check it out live at http://rubular.com/r/0Mwbfinjvp.
http://www.ruby-doc.org/core-1.9.3/Regexp.html
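The regex above only covers double-quoted strings that begin with a Cyrillic character. A possible sketch for the PHP sample from the question follows: it matches every single- or double-quoted literal first (allowing backslash escapes and newlines) and then keeps only those containing Cyrillic. Matching all literals keeps the quote pairing in step, so a closing quote is not mistaken for an opening one. The constant and method names are made up for illustration, and this is a simplification, not a PHP parser (it ignores comments and heredocs).

```ruby
# One double- or single-quoted literal, allowing backslash escapes and
# embedded newlines. A simplification, not a real PHP tokenizer.
QUOTED_LITERAL = /"(?:[^"\\]|\\.)*"|'(?:[^'\\]|\\.)*'/m

# Hypothetical helper: all quoted literals that contain a Cyrillic character.
def cyrillic_literals(source)
  source.scan(QUOTED_LITERAL).select { |literal| literal.match?(/\p{Cyrillic}/) }
end

# Hypothetical helper: replaces each Cyrillic literal with the block's result
# and leaves every other literal untouched.
def replace_cyrillic_literals(source)
  source.gsub(QUOTED_LITERAL) do |literal|
    literal.match?(/\p{Cyrillic}/) ? yield(literal) : literal
  end
end

php = <<~'PHP'
  function hello($word) {
    $word2 = "ха-ха!";
    echo "Привет, $word $word2\n";
  }
  hello('Мир');
PHP

puts cyrillic_literals(php)
```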

".*[^a-zA-Z\d]+.*" matches any quoted character sequence containing at least one non-alphanumeric character.
i.e. it matches "aa$bb" and "a1$b1"
It doesn't match "aabb" or a$b.
Hope that this is what you want (Add required escaping).

how to remove whitespace but not utf-8 character in ruby

I want to prevent users from submitting an empty comment (whitespace, &nbsp;, etc.), so I apply the following:
var.gsub(/^\s+|\s+\z|\s* \s*/, '')
However, a smart user then found a loophole by using the \302 or \240 characters, so I filtered those out too.
Then I ran into a problem when I introduced support for several languages: a phrase like Déjà vu now triggers an error, because part of the à character contains \240. Is there any way to remove the whitespace but leave the Latin characters untouched?

A way around this is to use iconv to discard invalid Unicode byte sequences (such as \240 on its own) before using your regexp to remove the whitespace:
require 'iconv'
var1 = "Déjà vu"
var2 = "\240"
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
valid1 = ic.iconv(var1) # => "D\303\251j\303\240 vu"
valid2 = ic.iconv(var2) # => ""
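Iconv was removed from the standard library in Ruby 2.0, so on a current Ruby the same idea could be sketched with String#scrub (available since Ruby 2.1) and the Unicode-aware [[:space:]] class, which also covers the non-breaking space U+00A0. The helper name blank_comment? is hypothetical, and the input is assumed to be intended as UTF-8.

```ruby
# Hypothetical helper: true when a comment has no visible content.
def blank_comment?(text)
  # Treat the input as UTF-8 and drop byte sequences that are invalid UTF-8,
  # such as a lone \240 (0xA0) byte.
  cleaned = text.dup.force_encoding(Encoding::UTF_8).scrub('')
  # [[:space:]] is Unicode-aware in Ruby, so it also matches U+00A0.
  cleaned.gsub(/[[:space:]]+/, '').empty?
end

blank_comment?(" \u00A0\t\n")           # only whitespace, including a nbsp
blank_comment?("D\u00E9j\u00E0 vu")     # "Déjà vu" keeps its letters
blank_comment?("\xA0".b)                # a lone invalid byte is discarded
```

Because the check works on characters rather than bytes, the \240 byte inside a valid à (\303\240) is never touched.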
