Regex to remove non letters - ruby

I'm trying to remove non-letters from a string. Would this do it:
c = o.replace(o.gsub!(/\W+/, ''))

Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if the o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')

Remove anything that is not a letter:
> " sd 190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"
EDIT: as ikegami points out, this doesn't take into account accented characters, umlauts, and other similar characters. The solution to this problem will depend on what exactly you are referring to as "not a letter". Also, what your input will be.

Keep in mind that ruby considers the underscore _ to be a word character. So if you want to keep underscores as well, this should do it
string.gsub!(/\W+/, '')
Otherwise, you need to do this:
string.gsub!(/[^a-zA-Z]/, '')

That will work most of the cases, except when o initially does not contain any non-letter, in which case gsub! will return nil.
If you just want a replaced string, it can be simpler:
c = o.gsub(/\W+/, '')

Using \W or \w to select or delete only characters won't work. \w means A-Z, a-z, 0-9, and "_":
irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
So, stripping using \W preserves digits and underscores.
If you want to match characters use /[A-Za-z]+/, or the POSIX character class [:alpha:], i.e. /[[:alpha:]]+/, or /\p{ALPHA}/.
The final format is the Unicode property for 'A'..'Z' + 'a'..'z' in ASCII, and gets extended when dealing with Unicode, so if you have multibyte characters you should probably use that.

use Regexp#union to create a big matching object
allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")

I don't see what that o.replace is in there for if you have a string:
string = 't = 4 6 ^'
And you do:
string.gsub!(/\W+/, '')
You get:
t46
If you want to get rid of the number characters too, you can do:
string.gsub!(/\W+|\d+/, '')
And you get:
t

Related

How to replace every 4th character of a string using .gsub in Ruby?

Beginner here, obviously. I need to add a sum and a string together and from the product, I have to replace every 4th character with underscore, the end product should look something like this: 160_bws_np8_1a
I think .gsub is the way, but I can find a way to format the first part in .gsub where I have to specify every 4th character.
total = (1..num).sum
final_output = "#{total.to_s}" + "06bwsmnp851a"
return final_output.gsub(//, "_")
This would work:
s = '12345678901234'
s.gsub(/(...)./, '\1_')
#=> "123_567_901_34"
The regex matches 3 characters (...) that are captured (parentheses) followed by another character (.). Each match is replaced by the first capture (\1) and a literal underscore (_).
s = "12345678901234"
Here are two ways to do that. Both return
"123_567_901_34"
Match every four-character substring and replace the match with the first three characters of the match followed by an underscore
s.gsub(/.{4}/) { |s| s[0,3] << '_' }
Chain the enumerator s.gsub(/./) to Enumerator#with_index and replace every fourth character with an underscore
s.gsub(/./).with_index { |c,i| i%4 == 3 ? '_' : c }
See the form of String#gsub that takes a single argument and no block.

Replace all characters other than english letters and numbers to underscore

I have a string, and I would like to replace all special characters with underscores.
In other words, I just want 26 english letters (lower and upper cases) and 0-9 and the "_" character.
Also note that there are the non-english characters and they need to be replaced with "_" as well.
What is the most elegant way to do this in Ruby?
It sounds like you want to replace all non-word characters with underscores. Therefore,
result = subject.gsub(/[^\w]/, '_')
But are you okay that this would also replace newlines and other whitespace characters?
If not, change it to
result = subject.gsub(/[^\w\s]/, '_')
Explain Regex
[^\w\s] # any character except: word characters (a-
# z, A-Z, 0-9, _), whitespace (\n, \r, \t,
# \f, and " ")
Note
As #CarySwoveland mentions, the [^\w] can also be written with the shorthand \W.

What is the opposite of Regexp.escape?

What is the opposite of Regexp.escape ?
> Regexp.escape('A & B')
=> "A\\ &\\ B"
> # do something, to get the next result: (something like Regexp.unescape(A\\ &\\ B))
=> "A & B"
How can I get the original value?
replaces = Hash.new { |hash,key| key } # simple trick to return key if there is no value in hash
replaces['t'] = "\t"
replaces['n'] = "\n"
replaces['r'] = "\r"
replaces['f'] = "\f"
replaces['v'] = "\v"
rx = Regexp.escape('A & B')
str = rx.gsub(/\\(.)/){ replaces[$1] }
Also make sure to #puts output in irb, because #inspect escapes characters by default.
Basically escaping/quoting looks for meta-characters, and prepends \ character (which has to be escaped for string interpretation in source code). But if we find any control character from list: \t, \n, \r, \f, \v, then quoting outputs \ character followed by this special character translated to ascii.
UPDATE:
My solution had problems with special characters (\n, \t ans so on), I updated it after investigating source code for rb_reg_quote method.
UPDATE 2:
replaces is hash, which converts escaped characters (thats why it is used in block attached to gsub) to unescaped ones. It is indexed by character without escape character (second character in sequence) and searches for unescaped value. The only defined values are control-characters, but there is also default_proc attached (block attached to Hash.new), which returns key if there is no value found in hash. So it works like this:
for "n" it returns "\n", the same for all other escaped control characters, because it is value associated with key
for "(" it returns "(", because there is no value associated with "(" key, hash calls #default_proc, which returns key itself
The only characters escaped by Regexp.escape are meta characters and control characters, so we don't have to worry about alphanumerics.
Take a look at http://ruby-doc.org/core-2.0.0/Hash.html#method-i-default_proc for documentation on #defoult_proc
You can perhaps use something like this?
def unescape(s)
eval %Q{"#{s}"}
end
puts unescape('A\\ &\\ B')
Credits to this question.
codepad demo
If you are okay with a regex solution, you can use this:
res = s.gsub(/\\(?!\\)|(\\)\\/, "\\1")
codepad demo
Try this
>> r = Regexp.escape("A & B (and * c [ e] + )")
# => "A\\ &\\ B\\ \\(and\\ \\*\\ c\\ \\[\\ e\\]\\ \\+\\ \\)"
>> r.gsub("\\(","(").gsub("\\)",")").gsub("\\[","[").gsub("\\]","]").gsub("\\{","{").gsub("\\}","}").gsub("\\.",".").gsub("\\?","?").gsub("\\+","+").gsub("\\*","*").gsub("\\ "," ")
# => "A & B (and * c [ e] + )"
Basically, these (, ), [, ], {, }, ., ?, +, * are the meta characters in regex. And also \ which is used as an escape character.
The chain of gsub() calls replace the escaped patterns with corresponding actual value.
I am sure there is a way to DRY this up.
Update: DRY version as suggested by user2503775
>> r.gsub("\\","")
Update:
following are the special characters in regex
[,],{,},(,),|,-,*,.,\\,?,+,^,$,<space>,#,\t,\f,\v,\n,\r
using a regex replace using \\(?=([\\\*\+\?\|\{\[\(\)\^\$\.\#\ ]))\
should give you the string unescaped, you would only have to replace \r\n sequences with there CrLf counterparts.
"There\ is\ a\ \?\ after\ the\ \(white\)\ car\.\ \r\n\ it\ should\ be\ http://car\.com\?\r\n"
is unescaped to :
"There is a ? after the (white) car. \r\n it should be http://car.com?\r\n"
and removing the \r\n gives you :
There is a ? after the (white) car.
it should be http://car.com?

Regexp non alphanumerical but not German characters

I would like to remove all non alpha numerical characters from a string. Except space, - and some German characters.
Example
regexp = "mönchengladbach."
regexp.gsub(/[^0-9a-z \-]/i, '')
=> mnchengladbach
I need this:
=> mönchengladbach
It should also not replace other German characters such as:
ä ö ü ß
Thanks!
Edit:
It was just me not testing properly. The IRB did not accept special characters. This works for me:
regexp.gsub(/[^0-9a-z \-äüöß]/i, '')
To remove all that is not a letter or a space you can use this:
str.gsub(/[^\p{L}\s]+/, '')
I use here a negated character class, [^\p{L}\s] means all that is not a letter (in all language you want) or a white charater (space, tab, newlines)
\p{L} is an unicode character class for Letters.
You can easily add other characters you want to preserve like -:
str.gsub(/[^\p{L}\s-]+/, '')
example script:
# encoding: UTF-8
str = "mönchengladbach."
str = str.gsub(/[^\p{L}\s]+/, '#')
puts str
I think you want:
/[^[:alnum:] -]/
Note the //i is not necessary and no need to escape - when it's at the end of a []

how to remove leading and trailing non-alphabetic characters in ruby

I want to remove any leading and trailing non-alphabetic character in my string.
for eg. ":----- pt-br:-" , i want "pt-br"
Thanks
result = subject.gsub(/\A[\d_\W]+|[\d_\W]+\Z/, '')
will remove non-letters from the start and end of the string.
\A and \Z anchor the regex at the start/end of the string (^/$ would also match after/before a newline which is probably not what you want - but that might not matter in this case);
[\d_\W]+ matches one or more digits, the underscore or anything else that is not an alphanumeric character, leaving only letters.
| is the alternation operator.
In ruby 1.9.1 :
":----- pt-br:-".partition( /[a-zA-Z](...)[a-zA-Z]/ )[1]
partition searches the pattern in the string and returns the part before it, the match, and the part after it.
result = subject.gsub(/^[^a-zA-Z]+/, '').gsub(/[^a-zA-Z]+$/, '')

Resources