Replacing all but alphabetic characters with spaces in python, in any language - python-2.x

The code
phrase = "".join([c if c.isalpha() else " " for c in phrase])
substitutes all non-alphabetic characters with spaces. It works very well on strings made up of Western-language characters.
But giving it the value:
phrase = u'इसका स्वामित्व और नियंत्रण किया। इसके'
the result is u'इसक स व म त व और न य त रण क य इसक ', while it shouldn't change, since the string is only made of alphabetic characters and spaces.
I think the reason is that some character is a surrogate pair.
Is it a bug with python's isalpha() method?
Or, if not, how can I deal properly with characters represented by surrogate pairs?

Related

Replace all characters other than english letters and numbers to underscore

I have a string, and I would like to replace all special characters with underscores.
In other words, I just want 26 english letters (lower and upper cases) and 0-9 and the "_" character.
Also note that there are the non-english characters and they need to be replaced with "_" as well.
What is the most elegant way to do this in Ruby?
It sounds like you want to replace all non-word characters with underscores. Therefore,
result = subject.gsub(/[^\w]/, '_')
But are you okay that this would also replace newlines and other whitespace characters?
If not, change it to
result = subject.gsub(/[^\w\s]/, '_')
Explain Regex
[^\w\s]   # any character except: word characters (a-z, A-Z, 0-9, _),
          # whitespace (\n, \r, \t, \f, and " ")
Note
As @CarySwoveland mentions, the [^\w] can also be written with the shorthand \W.
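A small sketch of the difference between the two patterns, assuming Ruby 1.9 or later with UTF-8 strings. The helper names and the sample text are made up for illustration. Note that Ruby's \w is ASCII-only, so accented letters are replaced as well, which is what the question asked for.

```ruby
# Hypothetical helpers for illustration; not part of the original answer.

# Replace everything except ASCII word characters (a-z, A-Z, 0-9, _).
def underscore_non_word(subject)
  subject.gsub(/\W/, '_')
end

# Same, but leave whitespace (spaces, tabs, newlines) untouched.
def underscore_non_word_keep_space(subject)
  subject.gsub(/[^\w\s]/, '_')
end

sample = "h\u00e9llo w\u00f6rld, foo-bar 42!\n"  # "héllo wörld, foo-bar 42!" plus newline

puts underscore_non_word(sample)             # h_llo_w_rld__foo_bar_42__
puts underscore_non_word_keep_space(sample)  # h_llo w_rld_ foo_bar 42_
```

The first call also turns the space and the trailing newline into underscores; the second keeps them.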

How to handle Combining Diacritical Marks with UnicodeUtils?

I am trying to insert spaces into a string of IPA characters, e.g. to turn ɔ̃wɔ̃tɨ into ɔ̃ w ɔ̃ t ɨ. Using split/join was my first thought:
s = 'ɔ̃w̃ɔtɨ'
s.split('').join(' ') #=> ̃ ɔ w ̃ ɔ p t ɨ
As I discovered by examining the results, letters with diacritics are in fact encoded as two characters. After some research I found the UnicodeUtils module, and used the each_grapheme method:
UnicodeUtils.each_grapheme(s) {|g| g + ' '} #=> ɔ ̃w ̃ɔ p t ɨ
This worked fine, except for the inverted breve mark. The code changes ̑a into ̑ a. I tried normalization (UnicodeUtils.nfc, UnicodeUtils.nfd), but to no avail. I don't know why the each_grapheme method has a problem with this particular diacritic mark, but I noticed that in gedit, the breve is also treated as a separate character, as opposed to tildes, accents etc. So my question is as follows: is there a straightforward method of normalization, i.e. turning the combination of Latin Small Letter A and Combining Inverted Breve into Latin Small Letter A With Inverted Breve?
I understand your question concerns Ruby, but I suppose the problem is about the same as with Python. A simple solution is to test for the combining diacritical marks explicitly:
# -*- coding: utf-8 -*-
import unicodedata

s = u"ɔ̃w̃ɔtɨ"
liste = []  # one entry per base character plus its combining marks
for char in s:
    if unicodedata.combining(char) and liste:
        # A combining mark attaches to the preceding base character.
        liste[-1] += char
    else:
        liste.append(char)
print(" ".join(liste))
>>>> ɔ̃ w̃ ɔ t ɨ
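For the Ruby side of the question, here is a minimal sketch assuming Ruby 2.5 or later, where String#grapheme_clusters and String#unicode_normalize are built in, so UnicodeUtils is not needed. The method name space_out and the sample strings are illustrative. It also shows one possible reason the inverted breve misbehaved: a combining mark attaches to the character before it, so if the mark sits before the a, as it appears to in the question, there is nothing for normalization to compose. That reading of the question is an assumption.

```ruby
# Insert a space between user-perceived characters (grapheme clusters).
def space_out(str)
  str.grapheme_clusters.join(' ')
end

ipa = "\u0254\u0303w\u0254\u0303t\u0268"  # U+0303 is COMBINING TILDE
puts ipa.chars.length                      # 7 code points
puts space_out(ipa)                        # 5 clusters, space-separated

# Base letter followed by COMBINING INVERTED BREVE (U+0311) composes
# under NFC into the single code point U+0203.
composed = ("a" + "\u0311").unicode_normalize(:nfc)
puts composed == "\u0203"                  # true

# Mark placed before the letter: NFC leaves the string unchanged and
# the mark forms a cluster of its own.
misplaced = "\u0311" + "a"
puts misplaced.unicode_normalize(:nfc) == misplaced  # true
puts misplaced.grapheme_clusters.length              # 2
```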

regex any non-digit with exception

I've got strings like these:
+996999966966AA
-996999966966AA
I am using this code:
"+996999966966AA".gsub!(/\D/, "")
to get rid of every character except digits, but the + sign is also being stripped. How can my code retain the +?
Use:
[^+\d]
to match anything that isn't + or a digit.
You can also use \W, "non-word character", which matches any character that is not a word character (alphanumeric or underscore):
(\W\d+)\w+
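A quick sketch of the negated character class applied to the two sample strings. The helper names are hypothetical, and the second variant, which also keeps the minus sign, is an addition for illustration rather than something the question asked for. It uses gsub rather than gsub!, since gsub! returns nil when nothing is replaced.

```ruby
# Keep digits and any "+" characters; drop everything else.
def keep_plus_and_digits(str)
  str.gsub(/[^+\d]/, '')
end

# Variant that also keeps "-" (escaped so it is not read as a range).
def keep_sign_and_digits(str)
  str.gsub(/[^+\-\d]/, '')
end

puts keep_plus_and_digits("+996999966966AA")  # +996999966966
puts keep_plus_and_digits("-996999966966AA")  # 996999966966
puts keep_sign_and_digits("-996999966966AA")  # -996999966966
```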

Ruby regex remove ^C character from string

There is a file that has control B and control C commands separating fields of text. It looks like:
"TEST\003KEY\002TEST\003KEY"
I tried to create a regex that will match this and remove it. I am not sure why this regex is not working:
"TEST\003KEY\002TEST\003KEY".gsub(/\00[23]/, ',')
Try the following:
"TEST\003KEY\002TEST\003KEY".gsub(/\002|\003/, ',')
Here it is demonstrated in irb on my machine:
$ irb
1.9.3p448 :007 > "TEST\003KEY\002TEST\003KEY".gsub(/\002|\003/, ',')
=> "TEST,KEY,TEST,KEY"
The syntax \002|\003 means "match the character literal \002 or the character literal \003". The expression given in the original question, \00[23], does not do that: it is read as the character literal \00 (a null character) followed by the character class [23], i.e. it only matches two-character sequences.
You can also use the [[:cntrl:]] character class to match all control characters:
$ irb
1.9.3p448 :007 > "TEST\003KEY\002TEST\003KEY\005TEST".gsub(/[[:cntrl:]]/, ',')
=> "TEST,KEY,TEST,KEY,TEST"
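Two more ways to target only those two separators, sketched with made-up helper names: a character class holding both octal escapes, and String#tr, which needs no regex at all. The last line is a caution about [[:cntrl:]]: it also matches tabs and newlines, which may or may not be what you want.

```ruby
RECORD = "TEST\003KEY\002TEST\003KEY"

# Character class with the two octal escapes: one character, either code.
def separators_to_commas(str)
  str.gsub(/[\002\003]/, ',')
end

# String#tr maps each listed character to the replacement character.
def separators_to_commas_tr(str)
  str.tr("\002\003", ',')
end

puts separators_to_commas(RECORD)             # TEST,KEY,TEST,KEY
puts separators_to_commas_tr(RECORD)          # TEST,KEY,TEST,KEY
puts "TEST\tKEY".gsub(/[[:cntrl:]]/, ',')     # TEST,KEY (the tab is a control character too)
```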
Here's the deal. First and foremost, computers cannot store characters; they can only store numbers. So when a computer stores a string, it converts every character to a number. The numbers for all the basic characters are given by an ascii chart (you can search Google for one).
When you tell a computer to print a string, it retrieves the numbers saved for the string and outputs them as characters (using an ascii chart to convert the numbers to characters).
Double quoted strings can contain what are called escape sequences. The most common escape sequence is "\n":
puts "hello\nworld"
--output:--
hello
world
A double quoted string converts the escape sequence "\n" to the ascii code 10:
puts "\n".ord #=>10 (ord() will show you the ascii code for a character)
A double quoted string can also contain escape sequences of the form \ddd, e.g. \002. Escape sequences like that are called octal escape sequences, which means 002 is the octal representation of an ascii code.
In an octal number, the right most digit is the 1's column, and the next digit to the left is the 8's column and the next digit to the left is the 64's column. For instance, this octal number:
\123
is equivalent to 3*1 + 2*8 + 1*64 = 83. It so happens that an "S" has the ascii code 83:
puts "\123" #=>S
Because you also can use octal escape sequences in a double quoted string, that means that instead of using the escape sequence "\n" you could use the octal escape "\012" (2*1 + 1*8 + 0*64 = 10). A double quoted string converts the octal escape sequence "\012" to the ascii code 10, which is the same thing that a double quoted string does to "\n". Here is an example:
puts "hello" + "\012" + "world"
--output:--
hello
world
The final thing to note about octal escape sequences is that you can optionally leave off any leading 0's:
puts "hello" + "\12" + "world"
--output:--
hello
world
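The column arithmetic for octal escapes can be checked in a few lines. This is only a sketch; octal_to_code is a hypothetical helper written out by hand to mirror the explanation, and String#to_i(8) does the same conversion for you.

```ruby
# Convert the digits of an octal escape (e.g. "123") to its numeric code
# by summing digit * 8**position, counting positions from the right.
def octal_to_code(digits)
  digits.chars.reverse.each_with_index.map { |d, i| d.to_i * 8**i }.sum
end

puts octal_to_code("123")               # 83
puts octal_to_code("123") == "\123".ord # true: "\123" is "S"
puts "123".to_i(8)                      # 83, the built-in conversion
puts octal_to_code("012")               # 10
puts "\012" == "\n"                     # true
puts "\12" == "\n"                      # true: the leading 0 is optional
```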
Okay, now examine your string:
"TEST\003KEY\002TEST\003KEY"
You can see that it contains three octal escape sequences. A double quoted string converts the octal escape sequence \003 to the ascii code: 3*1 + 0*8 + 0*64 = 3. If you check an ascii chart, the ascii code 3 represents a character called "end of text". A double quoted string converts the octal escape sequence \002 to the ascii code: 2*1 + 0*8 + 0*64 = 2, which represents a character called 'start of text'. I'm not sure where you are getting the "control B" and "control C" names from (maybe those are the key strokes on your keyboard that are mapped to those characters?).
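On the naming: "control B" and "control C" are the usual keyboard names for these two characters. In ASCII, holding Ctrl with a letter gives that letter's code minus 64 (B is 66, so Ctrl-B is 2). Ruby strings have a \c escape for this notation, so the equivalence is easy to check. The control_code helper below is hypothetical, for illustration only.

```ruby
# Ctrl-<letter> yields the letter's ASCII code minus 64.
def control_code(letter)
  letter.upcase.ord - 64
end

puts control_code('B')   # 2, "start of text"
puts control_code('C')   # 3, "end of text"
puts "\cB" == "\002"     # true
puts "\cC" == "\003"     # true
puts "\x02" == "\002"    # true: hex spelling of the same character
```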
Next, a regex acts like a double quoted string, so inside
/<in here>/
you can use the same escape sequences as in a double quoted string, and the regex will convert the escape sequences to ascii codes.
Now, in light of all the above, examine your regex:
/\00[23]/
As Richard Cook pointed out, your regex gets interpreted as the octal escape sequence \00 followed by the character class [23]. The octal escape sequence \00 gets converted to the ascii code: 0*1 + 0*8 = 0. And if you look at an ascii chart, the number 0 represents a character called 'null'. So your regex is looking for a null character, followed by either a "2" or a "3", which means your regex is looking for a two character string. But a two character string will never match the octal escape sequence "\003" (or "\002"), which represents only one character.
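To see this in action, here is a short check, assuming Ruby 2.4 or later for Regexp#match?. The constant names are made up for the example.

```ruby
BROKEN = /\00[23]/     # null character followed by a literal "2" or "3"
FIXED  = /\002|\003/   # a single character: code 2 or code 3

two_chars = "\x00" + "2"        # null character, then the digit "2"

puts BROKEN.match?(two_chars)   # true: this is what the regex really describes
puts BROKEN.match?("\003")      # false: one character can never match
puts FIXED.match?("\003")       # true
puts FIXED.match?("\002")       # true
```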
The main thing to take away from all this is that when you see a string that contains an octal escape sequence:
"hello\012world"
...that string does not contain the characters \, 0, 1, and 2. A double quoted string converts that sequence of characters into one ascii code, which represents ONE character. You can prove that very easily:
puts "hello".length #=>5
puts "hello\012".length #=>6
There are also many other types of escape sequences that can appear in double quoted strings. You would think they would be listed in the String class docs, but they are not.
s = "TEST\003KEY\002TEST\003KEY"
s.split(/[[:cntrl:]]/) * ","
# => "TEST,KEY,TEST,KEY"

Regex to remove non letters

I'm trying to remove non-letters from a string. Would this do it:
c = o.replace(o.gsub!(/\W+/, ''))
Just gsub! is sufficient:
o.gsub!(/\W+/, '')
Note that gsub! modifies the original o object. Also, if o does not contain any non-word characters, the result will be nil, so using the return value as the modified string is unreliable.
You probably want this instead:
c = o.gsub(/\W+/, '')
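The nil return value is easy to trip over, so here is a short demonstration. The variable and helper names are illustrative; .dup is used only so the example never mutates a string literal.

```ruby
# Non-destructive version: always returns a new string.
def strip_non_word(str)
  str.gsub(/\W+/, '')
end

with_junk = "a-b c!".dup
clean     = "abc".dup

p with_junk.gsub!(/\W+/, '')   # "abc" (with_junk itself is now "abc")
p clean.gsub!(/\W+/, '')       # nil: nothing was replaced
p strip_non_word("abc")        # "abc"
p strip_non_word("a-b c!")     # "abc"
```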
Remove anything that is not a letter:
> " sd 190i.2912390123.aaabbcd".gsub(/[^a-zA-Z]/, '')
"sdiaaabbcd"
EDIT: as ikegami points out, this doesn't take into account accented characters, umlauts, and other similar characters. The solution to this problem will depend on what exactly you are referring to as "not a letter". Also, what your input will be.
Keep in mind that Ruby considers the underscore _ (and the digits 0-9) to be word characters. So if you want to keep underscores and digits as well, this should do it:
string.gsub!(/\W+/, '')
Otherwise, you need to do this:
string.gsub!(/[^a-zA-Z]/, '')
That will work in most cases, except when o initially does not contain any non-word character, in which case gsub! will return nil.
If you just want a replaced string, it can be simpler:
c = o.gsub(/\W+/, '')
Using \W or \w to select or delete only letters won't work. \w means A-Z, a-z, 0-9, and "_":
irb(main):002:0> characters = (' ' .. "\x7e").to_a.join('')
=> " !\"\#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
irb(main):003:0> characters.gsub(/\W+/, '')
=> "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"
So, stripping using \W preserves digits and underscores.
If you want to match letters, use /[A-Za-z]+/, or the POSIX character class [:alpha:], i.e. /[[:alpha:]]+/, or the Unicode property /\p{ALPHA}/.
The last form covers 'A'..'Z' + 'a'..'z' in ASCII and extends to other alphabetic characters when dealing with Unicode, so if you have multibyte characters you should probably use that.
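A side-by-side sketch of those patterns on a string with accented letters, assuming a UTF-8 string in Ruby 1.9 or later. The helper names and the sample are invented for the example.

```ruby
SAMPLE = "Cr\u00e8me br\u00fbl\u00e9e_42!"  # "Crème brûlée_42!"

# \W is ASCII-only: keeps a-z, A-Z, 0-9 and "_", drops accented letters.
def keep_ascii_word(str)
  str.gsub(/\W+/, '')
end

# Explicit ASCII ranges: letters only, accented letters still dropped.
def keep_ascii_letters(str)
  str.gsub(/[^a-zA-Z]/, '')
end

# POSIX bracket class: alphabetic characters, including non-ASCII ones.
def keep_posix_alpha(str)
  str.gsub(/[^[:alpha:]]/, '')
end

# Unicode property: same idea, spelled as a property.
def keep_unicode_alpha(str)
  str.gsub(/[^\p{Alpha}]/, '')
end

puts keep_ascii_word(SAMPLE)     # Crmebrle_42
puts keep_ascii_letters(SAMPLE)  # Crmebrle
puts keep_posix_alpha(SAMPLE)    # Crèmebrûlée
puts keep_unicode_alpha(SAMPLE)  # Crèmebrûlée
```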
Use Regexp.union to create one big matching object:
allowed = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")
cleanstring = dirty_string.chars.select {|c| c =~ allowed}.join("")
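Here is the whitelist approach as a runnable sketch, next to a single gsub with a negated character class that should give the same result for this set of characters. The method names and the sample input are assumptions for illustration.

```ruby
# Regexp.union escapes the plain strings, so ")" "(" and "." are literal.
ALLOWED = Regexp.union(/[a-zA-Z0-9]/, " ", "-", ":", ")", "(", ".")

# Keep only the characters that match the whitelist.
def clean_by_select(dirty)
  dirty.chars.select { |c| c =~ ALLOWED }.join
end

# Single pass: delete every character outside the allowed set.
def clean_by_gsub(dirty)
  dirty.gsub(/[^a-zA-Z0-9 \-:)(.]/, '')
end

dirty_string = "Hi (there): a-b. <x> %1!"
puts clean_by_select(dirty_string)  # Hi (there): a-b. x 1
puts clean_by_gsub(dirty_string)    # Hi (there): a-b. x 1
```

The union version is easier to extend with more literal strings; the character class avoids building an array of single characters.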
I don't see what that o.replace is in there for. If you have a string:
string = 't = 4 6 ^'
And you do:
string.gsub!(/\W+/, '')
You get:
t46
If you want to get rid of the number characters too, you can do:
string.gsub!(/\W+|\d+/, '')
And you get:
t
