ruby 1.9 - what is the easiest inverse of `string.codepoints.to_a`?

In ruby 1.9.3, I can get the codepoints of a string:
> "foo\u00f6".codepoints.to_a
=> [102, 111, 111, 246]
Is there a built-in method to go the other direction, i.e., from an integer array to a string?
I'm aware of:
# not acceptable; only works with UTF-8
[102, 111, 111, 246].pack("U*")
# works, but not very elegant
[102, 111, 111, 246].inject('') {|s, cp| s << cp }
# concise, but I need to unshift that pesky empty string to "prime" the inject call
['', 102, 111, 111, 246].inject(:<<)
UPDATE (response to Niklas' answer)
Interesting discussion.
pack("U*") always returns a UTF-8 string, while the inject version returns a string in the file's source encoding.
#!/usr/bin/env ruby
# encoding: iso-8859-1
p [102, 111, 111, 246].inject('', :<<).encoding
p [102, 111, 111, 246].pack("U*").encoding
# this raises an Encoding::CompatibilityError
[102, 111, 111, 246].pack("U*") =~ /\xf6/
For me, the inject call returns an ISO-8859-1 string, while pack returns a UTF-8. To prevent the error, I could use pack("U*").encode(__ENCODING__) but that makes me do extra work.
UPDATE 2
Apparently String#<< doesn't always append a codepoint correctly; it depends on the string's encoding. So it looks like pack is still the best option.
[225].inject(''.encode('utf-16be'), :<<) # fails miserably
[225].pack("U*").encode('utf-16be') # works
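For reference, a minimal round-trip sketch of the pack-based inverse (assuming UTF-8 is an acceptable result encoding):

```ruby
codepoints = "foo\u00f6".codepoints.to_a   # => [102, 111, 111, 246]
s = codepoints.pack("U*")                  # inverse of String#codepoints
raise unless s == "foo\u00f6"
raise unless s.encoding == Encoding::UTF_8
raise unless s.codepoints.to_a == codepoints
```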

The most obvious adaptation of your own attempt would be
[102, 111, 111, 246].inject('', :<<)
This is, however, not a good solution, as it only works if the initial empty string literal has an encoding capable of holding the entire Unicode character range. The following fails:
#!/usr/bin/env ruby
# encoding: iso-8859-1
p "\u{1234}".codepoints.to_a.inject('', :<<)
So I'd actually recommend
codepoints.pack("U*")
I don't know what you mean by "only works with UTF-8". It creates a Ruby string with UTF-8 encoding, but UTF-8 can hold the whole Unicode character range, so what's the problem? Observe:
irb(main):010:0> s = [0x33333, 0x1ffff].pack("U*")
=> "\u{33333}\u{1FFFF}"
irb(main):011:0> s.encoding
=> #<Encoding:UTF-8>
irb(main):012:0> [0x33333, 0x1ffff].pack("U*") == [0x33333, 0x1ffff].inject('', :<<)
=> true
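If a different target encoding is needed, transcoding after pack is a sketch that sidesteps the String#<< pitfalls discussed above:

```ruby
cps = [102, 111, 111, 246]
utf8 = cps.pack("U*")                    # pack("U*") always yields UTF-8
utf16 = utf8.encode(Encoding::UTF_16BE)  # transcode afterwards
raise unless utf16.encoding == Encoding::UTF_16BE
raise unless utf16.codepoints.to_a == cps
```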

Depending on the values in your array and the value of Encoding.default_internal, you might try:
[102, 111, 111, 246].map(&:chr).inject(:+)
You have to be careful of the encoding. Note the following:
irb(main):001:0> 0.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):002:0> 127.chr.encoding
=> #<Encoding:US-ASCII>
irb(main):003:0> 128.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):004:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):005:0> 256.chr.encoding
RangeError: 256 out of char range
from (irb):5:in `chr'
from (irb):5
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):006:0>
By default, 256.chr fails because Integer#chr returns either a US-ASCII or an ASCII-8BIT string, depending on whether the codepoint is in 0..127 or 128..255.
This should cover your point for 8-bit values. If you have values larger than 255 (presumably Unicode codepoints), then you can do the following:
irb(main):006:0> Encoding.default_internal = "utf-8"
=> "utf-8"
irb(main):007:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):008:0> 256.chr.codepoints
=> [256]
irb(main):009:0>
With Encoding.default_internal set to "utf-8", Unicode values > 255 should work fine (but see below):
irb(main):009:0> 65535.chr.encoding
=> #<Encoding:UTF-8>
irb(main):010:0> 65535.chr.codepoints
=> [65535]
irb(main):011:0> 65536.chr.codepoints
=> [65536]
irb(main):012:0> 65535.chr.bytes
=> [239, 191, 191]
irb(main):013:0> 65536.chr.bytes
=> [240, 144, 128, 128]
irb(main):014:0>
Now it gets interesting -- ASCII-8BIT and UTF-8 don't seem to mix:
irb(main):014:0> (0..127).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:US-ASCII>
irb(main):015:0> (0..128).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):016:0> (0..255).to_a.map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):017:0> ((0..127).to_a + (256..1000000).to_a).map(&:chr).inject(:+).encoding
RangeError: invalid codepoint 0xD800 in UTF-8
from (irb):17:in `chr'
from (irb):17:in `map'
from (irb):17
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):018:0> ((0..127).to_a + (256..0xD7FF).to_a).map(&:chr).inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):019:0> (0..256).to_a.map(&:chr).inject(:+).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):19:in `+'
from (irb):19:in `each'
from (irb):19:in `inject'
from (irb):19
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):020:0>
ASCII-8BIT and UTF-8 can be concatenated, as long as the ASCII-8BIT codepoints are all in 0..127:
irb(main):020:0> 256.chr.encoding
=> #<Encoding:UTF-8>
irb(main):021:0> (0.chr.force_encoding("ASCII-8BIT") + 256.chr).encoding
=> #<Encoding:UTF-8>
irb(main):022:0> 255.chr.encoding
=> #<Encoding:ASCII-8BIT>
irb(main):023:0> (255.chr + 256.chr).encoding
Encoding::CompatibilityError: incompatible character encodings: ASCII-8BIT and UTF-8
from (irb):23
from C:/Ruby200/bin/irb:12:in `<main>'
irb(main):024:0>
This brings us to an ultimate solution to your question:
irb(main):024:0> (0..0xD7FF).to_a.map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):025:0>
So I think the most general answer, assuming you want UTF-8, is:
[102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
Assuming you know your values are in 0..255, then this is easier:
[102, 111, 111, 246].map(&:chr).inject(:+)
giving you:
irb(main):027:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+)
=> "fooö"
irb(main):028:0> [102, 111, 111, 246].map(&:chr).inject(:+)
=> "foo\xF6"
irb(main):029:0> [102, 111, 111, 246].map {|c| c.chr("utf-8")}.inject(:+).encoding
=> #<Encoding:UTF-8>
irb(main):030:0> [102, 111, 111, 246].map(&:chr).inject(:+).encoding
=> #<Encoding:ASCII-8BIT>
irb(main):031:0>
I hope this helps (albeit a bit late, perhaps) -- I found this looking for an answer to the same question, so I researched it myself.
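To summarize the chr-based approach, a small sketch pinning down both variants (assuming Encoding.default_internal is not set):

```ruby
cps = [102, 111, 111, 246]

# Explicit encoding: works for the whole Unicode range
utf8 = cps.map { |c| c.chr(Encoding::UTF_8) }.inject(:+)
raise unless utf8 == "foo\u00f6"
raise unless utf8.encoding == Encoding::UTF_8

# Bare chr: only safe for 0..255, yields ASCII-8BIT when bytes > 127 appear
binary = cps.map(&:chr).inject(:+)
raise unless binary.encoding == Encoding::ASCII_8BIT
raise unless binary.bytes.to_a == cps
```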

Related

Remove "\n\" from string

Is there any way to remove "\r\" from a string?
So far I've managed to remove only "\r" with mystring.gsub(/\r/, "")
How do I remove all 3 characters \r\ ?
In your string, do you have the literal characters "\" and "r", or do you have the escape sequence "\r"?
If you have the string foo\r\fbar, then your string is 8 characters long:
"foo\r\fbar".split(//).map(&:ord)
=> [102, 111, 111, 13, 12, 98, 97, 114]
What you want to remove are the \r and \f characters, or character codes 13 and 12. You can't remove just the leading slash in the \f, because \f is just one character. If this is your case, you can use:
"foo\r\fbar".gsub(/\r\f/, "")
=> "foobar"
However, if you have the literal sequence foo\\r\\fbar:
"foo\\r\\fbar".split(//).map(&:ord)
=> [102, 111, 111, 92, 114, 92, 102, 98, 97, 114]
Then you can remove the \r and the slash before the "f":
"foo\\r\\fbar".gsub(/\\r\\/, "")
=> "foofbar"
If you have the sequence foo\r\\fbar:
"foo\r\\fbar".split(//).map(&:ord)
=> [102, 111, 111, 13, 92, 102, 98, 97, 114]
Then you can likewise remove the \r and the slash before the "f":
"foo\r\\fbar".gsub(/\r\\/, "")
=> "foofbar"
Use one of these, depending on what's actually in the string:
mystring.gsub(/\r\\/, "")
mystring.gsub(/\\r\\/, "")
The first handles a real carriage return followed by a backslash; the second handles the literal characters \r\. I tested both on http://rubular.com/
As you can tell, it's difficult for us to figure out what characters you actually need to remove, because you are contradicting yourself. In the title you say "\n\" and in the question you say "\r\". Either way, here's what I'd do in order to find out exactly what I need to search for.
Starting with the string in question:
mystring = "\n\\"
I'd use the bytes method to have Ruby show me what I should use:
mystring = "\n\\" # => "\n\\"
mystring.bytes.map{ |b| '%02x' % b } # => ["0a", "5c"]
mystring.tr("\x0a\x5c", '') # => ""
mystring.gsub(/\x0a\x5c/, '') # => ""
mystring = "\r\\" # => "\r\\"
mystring.bytes.map{ |b| '%02x' % b } # => ["0d", "5c"]
mystring.tr("\x0d\x5c", '') # => ""
mystring.gsub(/\x0d\x5c/, '') # => ""
Dealing with escaped characters can be a pain in any programming language, but if I look at the exact bytes that make up the character I can't go wrong, as long as I'm dealing with ASCII. If it's another character set, I'll want to use the chars method, and adjust my pattern appropriately:
mystring = "\n\\"
mystring.chars.to_a # => ["\n", "\\"]
mystring.gsub(/\n\\/, '') # => ""
mystring.tr("\n\\", '') # => ""
mystring = "\r\\"
mystring.chars.to_a # => ["\r", "\\"]
mystring.tr("\r\\", '') # => ""
mystring.gsub(/\r\\/, '') # => ""
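As a quick sketch covering both interpretations (assuming the string holds either the real control characters or their escaped spellings):

```ruby
s1 = "foo\r\fbar"        # real carriage return + form feed
raise unless s1.delete("\r\f") == "foobar"

s2 = "foo\\r\\fbar"      # literal backslash-r, backslash, f
raise unless s2.gsub(/\\r\\/, "") == "foofbar"
```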

Ruby 1.9.3 Dir.glob not returning NFC UTF-8 strings, returns NFD instead

When reading file names in Ruby 1.9.3, I'm seeing some odd results. For example, with the following test Ruby script, running in a folder containing a file named 'Testé.txt':
# encoding: UTF-8

def inspect_string(s)
  puts "Source encoding: #{''.encoding}"
  puts "External encoding: #{Encoding.default_external}"
  puts "Name: #{s.inspect}"
  puts "Encoding: #{s.encoding}"
  puts "Chars: #{s.chars.to_a.inspect}"
  puts "Codepoints: #{s.codepoints.to_a.inspect}"
  puts "Bytes: #{s.bytes.to_a.inspect}"
end

def transform_string(s)
  puts "Testing string #{s}"
  puts s.gsub(/é/u, 'TEST')
end

Dir.glob("./*.txt").each do |f|
  puts RUBY_VERSION + RUBY_PLATFORM
  puts "Inline string works as expected"
  s = "./Testé.txt"
  inspect_string s
  puts transform_string s
  puts "File name from Dir.glob does not"
  inspect_string f
  puts transform_string f
end
On Mac OS X Lion, I see the following results:
1.9.3x86_64-darwin11.4.0
Inline string works as expected
Source encoding: UTF-8
External encoding: UTF-8
Name: "./Testé.txt"
Encoding: UTF-8
Chars: [".", "/", "T", "e", "s", "t", "é", ".", "t", "x", "t"]
Codepoints: [46, 47, 84, 101, 115, 116, 233, 46, 116, 120, 116]
Bytes: [46, 47, 84, 101, 115, 116, 195, 169, 46, 116, 120, 116]
Testing string ./Testé.txt
./TestTEST.txt
File name from Dir.glob does not
Source encoding: UTF-8
External encoding: UTF-8
Name: "./Testé.txt"
Encoding: UTF-8
Chars: [".", "/", "T", "e", "s", "t", "e", "́", ".", "t", "x", "t"]
Codepoints: [46, 47, 84, 101, 115, 116, 101, 769, 46, 116, 120, 116]
Bytes: [46, 47, 84, 101, 115, 116, 101, 204, 129, 46, 116, 120, 116]
Testing string ./Testé.txt
./Testé.txt
The expected last line is
./TestTEST.txt
The encodings returned indicate that this is a normal UTF-8 string, and yet any regexp transformations involving Unicode are not applied properly.
An update to this: Ruby 2.2.0 has gained String#unicode_normalize.
f.unicode_normalize!
would convert the NFD-decomposed string returned from OS X's HFS+ filesystem into an NFC-composed string. You can specify :nfd, :nfkc, or :nfkd if you require alternative normalizations.
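A sketch of the normalization (requires Ruby >= 2.2):

```ruby
nfd = "Caf\u0065\u0301"            # "Caf" + "e" + combining acute: NFD form
nfc = nfd.unicode_normalize(:nfc)  # compose into a single é codepoint
raise unless nfc == "Caf\u00e9"
raise unless nfd.length == 5 && nfc.length == 4
```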
Posted in case this is useful for anyone else running into this:
Ruby 1.9 and 2.0 use composed UTF-8 strings if you use UTF-8 encoding, but will not modify strings received from the OS. Mac OS X uses decomposed strings (two codepoints for many common accented characters like é, which are combined for display). So filesystem methods will often return unexpected string forms, which are strictly valid UTF-8, but decomposed.
In order to work around this, you need to compose them by converting from the 'UTF8-MAC' encoding to UTF-8:
f.encode!('UTF-8','UTF8-MAC')
before using them; otherwise you may end up comparing a decomposed string from the filesystem against a composed native Ruby string.
This behaviour affects all file system calls like glob for both files and folders where a file name contains unicode characters.
Apple docs:
http://developer.apple.com/library/mac/#qa/qa1235/_index.html
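A sketch of this pre-2.2 workaround; the assumption here is that the 'UTF8-MAC' converter performs the composition during transcoding:

```ruby
nfd = "Test\u0065\u0301.txt"          # "Teste" + combining acute, as HFS+ returns it
nfc = nfd.encode('UTF-8', 'UTF8-MAC') # transcoding composes the accent
raise unless nfc == "Test\u00e9.txt"
```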

Ruby String.encode still gives "invalid byte sequence in UTF-8"

In IRB, I'm trying the following:
1.9.3p194 :001 > foo = "\xBF".encode("utf-8", :invalid => :replace, :undef => :replace)
=> "\xBF"
1.9.3p194 :002 > foo.match /foo/
ArgumentError: invalid byte sequence in UTF-8
from (irb):2:in `match'
Any ideas what's going wrong?
I'd guess that "\xBF" already thinks it is encoded in UTF-8 so when you call encode, it thinks you're trying to encode a UTF-8 string in UTF-8 and does nothing:
>> s = "\xBF"
=> "\xBF"
>> s.encoding
=> #<Encoding:UTF-8>
\xBF isn't valid UTF-8 so this is, of course, nonsense. But if you use the three argument form of encode:
encode(dst_encoding, src_encoding [, options] ) → str
[...] The second form returns a copy of str transcoded from src_encoding to dst_encoding.
You can force the issue by telling encode to ignore what the string thinks its encoding is and treat it as binary data:
>> foo = s.encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "�"
Where s is the "\xBF" that thinks it is UTF-8 from above.
You could also use force_encoding on s to force it to be binary and then use the two-argument encode:
>> s.encoding
=> #<Encoding:UTF-8>
>> s.force_encoding('binary')
=> "\xBF"
>> s.encoding
=> #<Encoding:ASCII-8BIT>
>> foo = s.encode('utf-8', :invalid => :replace, :undef => :replace)
=> "�"
If you're only working with ascii characters you can use
>> "Hello \xBF World!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "Hello � World!"
But what happens if we use the same approach with valid UTF-8 characters that are invalid in ASCII?
>> "¡Hace \xBF mucho frío!".encode('utf-8', 'binary', :invalid => :replace, :undef => :replace)
=> "��Hace � mucho fr��o!"
Uh oh! We want frío to keep its accent. Here's an option that preserves the valid UTF-8 characters:
>> "¡Hace \xBF mucho frío!".chars.select{|i| i.valid_encoding?}.join
=> "¡Hace mucho frío!"
Also, in Ruby 2.1 there is a new method called String#scrub that solves this problem:
>> "¡Hace \xBF mucho frío!".scrub
=> "¡Hace � mucho frío!"
>> "¡Hace \xBF mucho frío!".scrub('')
=> "¡Hace mucho frío!"
This is fixed if you read the source text file using an explicit encoding:
File.open( 'thefile.txt', 'r:iso8859-1' )
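A self-contained sketch of that fix, using a throwaway file (the filename and contents here are illustrative):

```ruby
require 'tempfile'

result = nil
Tempfile.create(['latin1', '.txt']) do |f|
  f.binmode
  f.write("caf\xE9".b)                 # é as the single ISO-8859-1 byte 0xE9
  f.flush
  # Reading with an explicit encoding tags the string correctly...
  text = File.open(f.path, 'r:iso8859-1', &:read)
  raise unless text.encoding == Encoding::ISO_8859_1
  # ...so transcoding to UTF-8 then works without errors
  result = text.encode('UTF-8')
end
raise unless result == "caf\u00e9"
```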

ruby 1.8.7 - why does .to_yaml convert some Strings to non-readable bytes?

While parsing some webpages with Nokogiri, I ran into some issues cleaning strings and saving them with YAML. This IRB session reproduces the problem:
irb(main):001:0> require 'yaml'
=> true
irb(main):002:0> "1,000 €".to_yaml
=> "--- !binary |\nMSwwMDAg4oKs\n\n"
irb(main):003:0> "1,0000 €".to_yaml
=> "--- \"1,0000 \\xE2\\x82\\xAC\"\n"
irb(main):004:0> "1,00 €".to_yaml
=> "--- !binary |\nMSwwMCDigqw=\n\n"
irb(main):005:0> "1 €".to_yaml
=> "--- !binary |\nMSDigqw=\n\n"
irb(main):006:0> "23 €".to_yaml
=> "--- !binary |\nMjMg4oKs\n\n"
irb(main):007:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
irb(main):008:0> "1200000 €".to_yaml
=> "--- \"1200000 \\xE2\\x82\\xAC\"\n"
irb(main):009:0> "120000 €".to_yaml
=> "--- \"120000 \\xE2\\x82\\xAC\"\n"
irb(main):010:0> "12000 €".to_yaml
=> "--- !binary |\nMTIwMDAg4oKs\n\n"
To sum up, sometimes .to_yaml outputs are readable while other times the output is unreadable. The most intriguing aspect is that the strings are very similar.
How can I avoid those !binary ... outputs?
Whether YAML prefers to dump a string as text or binary is a matter of the ratio between ASCII and non-ASCII characters.
If you want to avoid !binary as much as possible, you should use the ya2yaml gem. It tries hard to dump strings as ASCII + escaped UTF-8.
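On Ruby 1.9+ with the Psych YAML engine, this problem largely disappears for valid UTF-8 strings; a quick check:

```ruby
require 'yaml'

s = "1,000 \u20ac"            # "1,000 €"
y = s.to_yaml
raise if y.include?('!binary')
raise unless YAML.load(y) == s
```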

Using NetBeans, why does Ruby debug not display multibyte strings properly?

The environment is: NetBeans (6.9.1), ruby-debug-base (0.10.4), ruby-debug-ide (0.4.16), Ruby (1.8.7).
While debugging a Ruby script, the debugger cannot display multibyte strings properly and always shows "Binary Data" for them in the variables window view:
require 'rubygems'
require 'active_support'
str = "调试程序"
str = str.mb_chars
puts "length: #{str.length}"
BTW, I tried 0.4.16, 0.4.11 for ruby-debug-ide, but they have the same output.
Can someone tell me how to make it to display the multibyte string properly in the debug variable window view?
Part of the problem is that Ruby 1.8.7 had only the beginnings of multi-byte support. You probably need to set the $KCODE value for your source. See "The $KCODE Variable and jcode Library".
Ruby 1.9.2 has much better support for it, so give it a try if that's an option.
This is from messing around with 1.9.2 and irb:
Greg:~ greg$ irb -f
irb(main):001:0> RUBY_VERSION
=> "1.9.2"
irb(main):002:0> str = "调试程序"
=> "调试程序"
irb(main):003:0> str
=> "调试程序"
irb(main):004:0> str.each_char.to_a
=> ["调", "试", "程", "序"]
irb(main):005:0> str.each_byte.to_a
=> [232, 176, 131, 232, 175, 149, 231, 168, 139, 229, 186, 143]
irb(main):006:0> str.valid_encoding?
=> true
irb(main):007:0> str.codepoints
=> #<Enumerator: "调试程序":codepoints>
irb(main):008:0> str.each_codepoint.to_a
=> [35843, 35797, 31243, 24207]
irb(main):009:0> str.each_codepoint.to_a.map { |i| i.to_s(16) }
=> ["8c03", "8bd5", "7a0b", "5e8f"]
irb(main):010:0> str.encoding
=> #<Encoding:UTF-8>
irb(main):011:0>
And, if I run the following in Textmate while 1.9.2 is set as my default:
# encoding: UTF-8
puts RUBY_VERSION
str = "调试程序"
puts str
which outputs:
# >> 1.9.2
# >> 调试程序
Ruby Debug19 gets mad with the same code so I need to look into what its problem is.