Ruby regex remove ^C character from string - ruby

There is a file that has control B and control C commands separating fields of text. It looks like:
"TEST\003KEY\002TEST\003KEY"
I tried to create a regex that will match this and remove it. I am not sure why this regex is not working:
"TEST\003KEY\002TEST\003KEY".gsub(/\00[23]/, ',')

Try the following:
"TEST\003KEY\002TEST\003KEY".gsub(/\002|\003/, ',')
Here it is demonstrated in irb on my machine:
$ irb
1.9.3p448 :007 > "TEST\003KEY\002TEST\003KEY".gsub(/\002|\003/, ',')
=> "TEST,KEY,TEST,KEY"
The syntax \002|\003 means "match the character literal \002 or the character literal \003". The expression given in the original question, \00[23], is a valid regex but does not do what was intended: it is the character literal \00 (a null character) followed by the character class [23], i.e. it only matches two-character sequences such as a null character followed by "2".
You can also use the [[:cntrl:]] character class to match all control characters:
$ irb
1.9.3p448 :007 > "TEST\003KEY\002TEST\003KEY\005TEST".gsub(/[[:cntrl:]]/, ',')
=> "TEST,KEY,TEST,KEY,TEST"
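The difference between the two patterns can be checked directly. This is a minimal sketch; the variable names are illustrative and not from the question, and it also shows a character-class spelling, /[\002\003]/, which is probably what the original regex was aiming for:

```ruby
s = "TEST\003KEY\002TEST\003KEY"

# /\00[23]/ is NUL followed by a literal "2" or "3", so it finds nothing in s.
broken = s.gsub(/\00[23]/, ',')

# It only matches a two-character sequence such as NUL + "2".
nul_then_two = "\x00" + "2"
matches_two_chars = !(nul_then_two =~ /\00[23]/).nil?

# Putting both octal escapes inside one character class gives the intended match.
fixed = s.gsub(/[\002\003]/, ',')

p broken == s
p matches_two_chars
p fixed
```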

Here's the deal. First and foremost, computers cannot store characters--they can only store numbers. So when a computer stores a string, it converts every character to a number. The numbers for all the basic characters are given by an ASCII chart (you can search Google for one).
When you tell a computer to print a string, it retrieves the numbers saved for the string and outputs them as characters (using an ascii chart to convert the numbers to characters).
Double quoted strings can contain what are called escape sequences. The most common escape sequence is "\n":
puts "hello\nworld"
--output:--
hello
world
A double quoted string converts the escape sequence "\n" to the ascii code 10:
puts "\n".ord #=>10 (ord() will show you the ascii code for a character)
A double quoted string can also contain escape sequences of the form \ddd, e.g. \002. Escape sequences like that are called octal escape sequences, which means 002 is the octal representation of an ascii code.
In an octal number, the right most digit is the 1's column, and the next digit to the left is the 8's column and the next digit to the left is the 64's column. For instance, this octal number:
\123
is equivalent to 3*1 + 2*8 + 1*64 = 83. It so happens that an "S" has the ascii code 83:
puts "\123" #=>S
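The arithmetic above can be double-checked in Ruby itself. A small sketch, assuming nothing beyond core String and Integer methods (the variable names are made up for illustration):

```ruby
# Convert the octal digits by hand: 1*64 + 2*8 + 3*1
by_hand = 1 * 64 + 2 * 8 + 3 * 1

# String#to_i with base 8 does the same conversion
parsed = "123".to_i(8)

# Going the other way: code -> character, and character -> octal digits
char = parsed.chr
octal_digits = "S".ord.to_s(8)

puts by_hand
puts parsed
puts char
puts octal_digits
```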
Because you also can use octal escape sequences in a double quoted string, that means that instead of using the escape sequence "\n" you could use the octal escape "\012" (2*1 + 1*8 + 0*64 = 10). A double quoted string converts the octal escape sequence "\012" to the ascii code 10, which is the same thing that a double quoted string does to "\n". Here is an example:
puts "hello" + "\012" + "world"
--output:--
hello
world
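The final thing to note about octal escape sequences is that you can optionally leave off any leading 0's, as long as the character that follows is not itself an octal digit (otherwise it would be read as part of the escape):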
puts "hello" + "\12" + "world"
--output:--
hello
world
Okay, now examine your string:
"TEST\003KEY\002TEST\003KEY"
You can see that it contains three octal escape sequences. A double quoted string converts the octal escape sequence \003 to the ascii code: 3*1 + 0*8 + 0*64 = 3. If you check an ascii chart, the ascii code 3 represents a character called "end of text". A double quoted string converts the octal escape sequence \002 to the ascii code: 2*1 + 0*8 + 0*64 = 2, which represents a character called 'start of text'. I'm not sure where you are getting the "control B" and "control C" names from (maybe those are the key strokes on your keyboard that are mapped to those characters?).
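On the naming: "control B" and "control C" are the traditional keyboard names for these two codes, because holding Ctrl with a letter conventionally sends the letter's code with the upper bits cleared (so "B", code 66, becomes 2). That is general ASCII convention rather than anything stated in the question. A small sketch, where ctrl is a hypothetical helper written only for this illustration:

```ruby
# Hypothetical helper: the code conventionally sent for Ctrl+<letter>.
# Keeping only the low five bits maps "A".."Z" (65..90) onto 1..26.
def ctrl(letter)
  (letter.upcase.ord & 0x1F).chr
end

puts ctrl("B") == "\002"
puts ctrl("C") == "\003"

# Ruby double-quoted strings also have a control-character escape, \cX.
puts "\cB" == "\002"
puts "\cC" == "\003"
```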
Next, a regex acts like a double quoted string, so
/<in here>/
you can use the same escape sequences as in a double quoted string, and the regex will convert the escape sequences to ascii codes.
Now, in light of all the above, examine your regex:
/\00[23]/
As Richard Cook pointed out, your regex gets interpreted as the octal escape sequence \00 followed by the character class [23]. The octal escape sequence \00 gets converted to the ascii code: 0*1 + 0*8 = 0. And if you look at an ascii chart, the number 0 represents a character called 'null'. So your regex is looking for a null character, followed by either a "2" or a "3", which means your regex is looking for a two character string. But a two character string will never match the octal escape sequence "\003" (or "\002"), which represents only one character.
The main thing to take away from all this is that when you see a string that contains an octal escape sequence:
"hello\012world"
...that string does not contain the characters \, 0, 1, and 2. A double quoted string converts that sequence of characters into one ascii code, which represents ONE character. You can prove that very easily:
puts "hello".length #=>5
puts "hello\012".length #=>6
There are also many other types of escape sequences that can appear in double quoted strings. You would think they would be listed in the String class docs, but they are not.
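As a rough illustration of a few of those other forms, the sketch below compares several spellings of the newline character and contrasts double-quoted with single-quoted strings, which leave the backslash sequence as literal characters. The variable names are illustrative only:

```ruby
# Four spellings of the same one-character string (ASCII code 10)
newline_forms = ["\n", "\012", "\x0A", "\u000A"]
all_same = newline_forms.all? { |form| form == "\n" }

# Single quotes do not interpret \012, so all four characters stay in the string
single = 'hello\012'
double = "hello\012"

puts all_same
puts single.length #=>9
puts double.length #=>6
```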

s = "TEST\003KEY\002TEST\003KEY"
s.split(/[[:cntrl:]]/) * ","
# => "TEST,KEY,TEST,KEY"
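Two things may be worth checking before settling on [[:cntrl:]]: it is broader than just \002 and \003 (tab and newline are control characters too), and when no regex is needed, String#tr and String#delete can do the job. A minimal sketch with illustrative variable names:

```ruby
s = "TEST\003KEY\002TEST\003KEY"

# tr replaces each listed character; delete removes them outright.
replaced = s.tr("\002\003", ",")
removed  = s.delete("\002\003")

# [[:cntrl:]] also covers tab and newline, which may or may not be wanted.
line   = "A\tB\003C\n"
broad  = line.gsub(/[[:cntrl:]]/, ",")
narrow = line.gsub(/[\002\003]/, ",")

p replaced
p removed
p broad
p narrow
```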

Related

Go rune literal for high positioned emojis

How do we use an emoji with a rune literal that is beyond I think
code point U+265F?
a1 := '\u2665'
this works
a2 := '\u1F3A8'
this gives the error: invalid character literal (more than one character).
Is there a way to represent higher positioned emojis as rune literals?
https://unicode.org/emoji/charts/full-emoji-list.html
You may use the \U sequence followed by 8 hex digits which is the hexadecimal representation of the Unicode codepoint. This is detailed in Spec: Rune literals:
There are four ways to represent the integer value as a numeric constant: \x followed by exactly two hexadecimal digits; \u followed by exactly four hexadecimal digits; \U followed by exactly eight hexadecimal digits, and a plain backslash \ followed by exactly three octal digits. In each case the value of the literal is the value represented by the digits in the corresponding base.
For example:
a1 := '\u2665'
fmt.Printf("%c\n", a1)
a2 := '\U0001F3A8'
fmt.Printf("%c\n", a2)
Which outputs (try it on the Go Playground):
♥
🎨
Note (response to @torek):
I believe the Go authors chose to require exactly 4 and 8 hex digits because this allows the exact same escape forms to be used inside interpreted string literals. For example, if you want a string that contains 2 runes, one with code point 0x0001F3A8 and the other being the character '4', it could look like this:
s := "\U0001F3A84"
If the spec did not require exactly 8 hex digits, it would be ambiguous whether the last '4' is part of the code point or a separate rune of the string, so you would have to break the string into a concatenation like "\U1F3A8" + "4".
Spec: String literals:
Interpreted string literals are character sequences between double quotes, as in "bar". Within the quotes, any character may appear except newline and unescaped double quote. The text between the quotes forms the value of the literal, with backslash escapes interpreted as they are in rune literals (except that \' is illegal and \" is legal), with the same restrictions. The three-digit octal (\nnn) and two-digit hexadecimal (\xnn) escapes represent individual bytes of the resulting string; all other escapes represent the (possibly multi-byte) UTF-8 encoding of individual characters. Thus inside a string literal \377 and \xFF represent a single byte of value 0xFF=255, while ÿ, \u00FF, \U000000FF and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8 encoding of character U+00FF.

Understanding string escape sequences

I am new to Go, so I have a lot of confusion regarding the concept of bytes.
While going through some Go code, I came across something like
[]byte("\xd2\xfd\x88g\xd5\r-\xfe")
Is it in hexadecimal or bytes format?
What do some of the characters above, like g, r-, e, signify?
And how can I print it in a log?
[]byte("\xd2\xfd\x88g\xd5\r-\xfe") is an interpreted string literal converted to type []byte, a byte slice. Here it is separated into byte values:
[\xd2, \xfd, \x88, g, \xd5, \r, -, \xfe]
or, expressed as hexadecimal bytes,
[d2, fd, 88, 67, d5, 0d, 2d, fe]
One way to log the value,
package main
import "log"
func main() {
	b := []byte("\xd2\xfd\x88g\xd5\r-\xfe")
	log.Printf("%q\n", b)
}
Playground: https://play.golang.org/p/BIh_EuvoxU-
Output:
2009/11/10 23:00:00 "\xd2\xfd\x88g\xd5\r-\xfe"
The Go Programming Language Specification
String literals
A string literal represents a string constant obtained from
concatenating a sequence of characters. There are two forms: raw
string literals and interpreted string literals.
Raw string literals are character sequences between back quotes, as in
`foo`. Within the quotes, any character may appear except back quote.
The value of a raw string literal is the string composed of the
uninterpreted (implicitly UTF-8-encoded) characters between the
quotes; in particular, backslashes have no special meaning and the
string may contain newlines. Carriage return characters ('\r') inside
raw string literals are discarded from the raw string value.
Interpreted string literals are character sequences between double
quotes, as in "bar". Within the quotes, any character may appear
except newline and unescaped double quote. The text between the quotes
forms the value of the literal, with backslash escapes interpreted as
they are in rune literals (except that \' is illegal and \" is legal),
with the same restrictions. The three-digit octal (\nnn) and two-digit
hexadecimal (\xnn) escapes represent individual bytes of the resulting
string; all other escapes represent the (possibly multi-byte) UTF-8
encoding of individual characters. Thus inside a string literal \377
and \xFF represent a single byte of value 0xFF=255, while ÿ, \u00FF,
\U000000FF and \xc3\xbf represent the two bytes 0xc3 0xbf of the UTF-8
encoding of character U+00FF.
After a backslash, certain single-character escapes represent special
values:
\a U+0007 alert or bell
\b U+0008 backspace
\f U+000C form feed
\n U+000A line feed or newline
\r U+000D carriage return
\t U+0009 horizontal tab
\v U+000b vertical tab
\\ U+005c backslash
\' U+0027 single quote (valid escape only within rune literals)
\" U+0022 double quote (valid escape only within string literals)

Replacing all but alphabetic characters with spaces in python, in any language

The code
phrase = "".join([c if c.isalpha() else " " for c in phrase])
substitutes all non-alphabetic characters with spaces. It works very well with strings made up of characters from Western languages.
But giving it the value:
phrase = u'इसका स्वामित्व और नियंत्रण किया। इसके'
the result is u'इसक स व म त व और न य त रण क य इसक ', while it shouldn't change, since the string is only made of alphabetic characters and spaces.
I think the reason is that some character is a surrogate pair.
Is it a bug with python's isalpha() method?
Or, if not, how can I deal properly with characters represented by surrogate pairs?

Replace all characters other than english letters and numbers to underscore

I have a string, and I would like to replace all special characters with underscores.
In other words, I just want 26 english letters (lower and upper cases) and 0-9 and the "_" character.
Also note that there are the non-english characters and they need to be replaced with "_" as well.
What is the most elegant way to do this in Ruby?
It sounds like you want to replace all non-word characters with underscores. Therefore,
result = subject.gsub(/[^\w]/, '_')
But are you okay that this would also replace newlines and other whitespace characters?
If not, change it to
result = subject.gsub(/[^\w\s]/, '_')
Explain Regex
[^\w\s] # any character except: word characters (a-z, A-Z, 0-9, _)
        # and whitespace (\n, \r, \t, \f, and " ")
Note
As @CarySwoveland mentions, [^\w] can also be written with the shorthand \W.
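A quick way to see that non-English letters are replaced as well: in Ruby regexes \w covers only a-z, A-Z, 0-9 and _, so accented letters count as non-word characters. The sample string below is made up for illustration and is written with \u escapes so the sketch does not depend on the file's encoding:

```ruby
# "héllo wörld_1!" spelled with escapes for the two accented letters
subject = "h\u00e9llo w\u00f6rld_1!"

# \W is shorthand for [^\w]
with_shorthand = subject.gsub(/\W/, "_")

# The same set spelled out explicitly
explicit = subject.gsub(/[^A-Za-z0-9_]/, "_")

# String#tr with a negated range avoids the regex entirely
with_tr = subject.tr("^A-Za-z0-9_", "_")

p with_shorthand
p explicit
p with_tr
```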

What is the opposite of Regexp.escape?

What is the opposite of Regexp.escape ?
> Regexp.escape('A & B')
=> "A\\ &\\ B"
> # do something, to get the next result: (something like Regexp.unescape(A\\ &\\ B))
=> "A & B"
How can I get the original value?
replaces = Hash.new { |hash,key| key } # simple trick to return key if there is no value in hash
replaces['t'] = "\t"
replaces['n'] = "\n"
replaces['r'] = "\r"
replaces['f'] = "\f"
replaces['v'] = "\v"
rx = Regexp.escape('A & B')
str = rx.gsub(/\\(.)/){ replaces[$1] }
Also make sure to #puts output in irb, because #inspect escapes characters by default.
Basically, escaping/quoting looks for meta-characters and prepends a \ character to each (which itself has to be escaped when written in a source-code string). But if it finds a control character from the list \t, \n, \r, \f, \v, then quoting outputs a \ character followed by the letter that names that control character (t, n, r, f or v).
UPDATE:
My solution had problems with special characters (\n, \t and so on); I updated it after investigating the source code of the rb_reg_quote method.
UPDATE 2:
replaces is a hash that converts escaped characters to unescaped ones (that's why it is used in the block attached to gsub). It is indexed by the character without the escape character (the second character in the sequence) and looks up the unescaped value. The only defined values are control characters, but there is also a default_proc attached (the block passed to Hash.new), which returns the key if no value is found in the hash. So it works like this:
for "n" it returns "\n", the same for all other escaped control characters, because it is value associated with key
for "(" it returns "(", because there is no value associated with "(" key, hash calls #default_proc, which returns key itself
The only characters escaped by Regexp.escape are meta characters and control characters, so we don't have to worry about alphanumerics.
Take a look at http://ruby-doc.org/core-2.0.0/Hash.html#method-i-default_proc for documentation on #default_proc
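The hash-based approach can be wrapped in a method and checked with a round trip through Regexp.escape. This is only a sketch: regexp_unescape and UNESCAPES are hypothetical names chosen here, and the sample strings are made up:

```ruby
# Letters that Regexp.escape uses for control characters, mapped back to
# the real characters. Any other key is returned unchanged by the default block.
UNESCAPES = Hash.new { |_hash, key| key }
UNESCAPES.update("t" => "\t", "n" => "\n", "r" => "\r", "f" => "\f", "v" => "\v")

def regexp_unescape(escaped)
  # Each backslash plus the character after it is replaced by the hash lookup.
  escaped.gsub(/\\(.)/) { UNESCAPES[$1] }
end

samples = ["A & B", "tab\there", "a.b*c (d)", "line\nbreak", "back\\slash"]
round_trips = samples.map { |s| regexp_unescape(Regexp.escape(s)) == s }

p round_trips
```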
You can perhaps use something like this?
# Caution: eval executes arbitrary Ruby, so only use this on trusted input.
def unescape(s)
  eval %Q{"#{s}"}
end
puts unescape('A\\ &\\ B')
Credits to this question.
If you are okay with a regex solution, you can use this:
res = s.gsub(/\\(?!\\)|(\\)\\/, "\\1")
Try this
>> r = Regexp.escape("A & B (and * c [ e] + )")
# => "A\\ &\\ B\\ \\(and\\ \\*\\ c\\ \\[\\ e\\]\\ \\+\\ \\)"
>> r.gsub("\\(","(").gsub("\\)",")").gsub("\\[","[").gsub("\\]","]").gsub("\\{","{").gsub("\\}","}").gsub("\\.",".").gsub("\\?","?").gsub("\\+","+").gsub("\\*","*").gsub("\\ "," ")
# => "A & B (and * c [ e] + )"
Basically, these (, ), [, ], {, }, ., ?, +, * are the meta characters in regex. And also \ which is used as an escape character.
The chain of gsub() calls replaces each escaped pattern with its corresponding actual value.
I am sure there is a way to DRY this up.
Update: DRY version as suggested by user2503775
>> r.gsub("\\","")
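One caveat with stripping every backslash: it appears to be lossy when the original string contained a backslash or a control character, since Regexp.escape writes those as a doubled backslash and as backslash plus a letter. A small check, with illustrative variable names:

```ruby
# Regexp.escape turns a newline into backslash + "n",
# and a single backslash into two backslashes.
with_newline   = Regexp.escape("a\nb")
with_backslash = Regexp.escape("a\\b")

# Removing all backslashes does not give the original strings back.
stripped_newline   = with_newline.gsub("\\", "")
stripped_backslash = with_backslash.gsub("\\", "")

p stripped_newline
p stripped_backslash
```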
Update:
The following are the special characters in a regex:
[,],{,},(,),|,-,*,.,\\,?,+,^,$,<space>,#,\t,\f,\v,\n,\r
using a regex replace using \\(?=([\\\*\+\?\|\{\[\(\)\^\$\.\#\ ]))\
should give you the unescaped string; you would only have to replace the \r\n sequences with their CrLf counterparts.
"There\ is\ a\ \?\ after\ the\ \(white\)\ car\.\ \r\n\ it\ should\ be\ http://car\.com\?\r\n"
is unescaped to :
"There is a ? after the (white) car. \r\n it should be http://car.com?\r\n"
and removing the \r\n gives you :
There is a ? after the (white) car.
it should be http://car.com?

Resources