I was trying to understand how ARGF#putc works, and was doing some tests with multibyte character sets.
Here is the sample:
$stdout.putc 63 #<~~~ A
#?=> 63
$stdout.putc 191
#?=> 191
$stdout.putc 181
#?=> 181
$stdout.putc 166
#?=> 166
Now my question is: except for line A, why does every statement print ??
My Ruby version is:
D:\Rubyscript\My ruby learning days>ruby -v
ruby 2.0.0p0 (2013-02-24) [i386-mingw32]
It depends on the default encoding (or, on Windows, the code page) of your console. You can run chcp in cmd.exe to check.
ASCII defines characters (including control characters) only for \x00 to \x7F. Extended single-byte encodings such as ISO-8859-1 additionally map \x80-\xFF to characters. Judging from your post, your console's default code page isn't compatible with ISO-8859-1, so the console doesn't know how to represent those characters from \x80-\xFF.
You need to do some encoding conversion before printing it to your console.
putc 191.chr.force_encoding('ISO-8859-1').encode('UTF-8')
# UTF-8 is the default encoding used in my Linux environment
# you need to replace it with your console's default encoding
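On a Windows console, for example, you could transcode to whatever code page chcp reports. A sketch, assuming code page 850 (substitute your own):
[191, 181, 166].each do |b|
  ch = b.chr.force_encoding('ISO-8859-1') # interpret the raw byte as ISO-8859-1
  $stdout.putc ch.encode('CP850')         # transcode to the console's code page
end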
When I cat a file in bash I get the following:
$ cat /tmp/file
microsoft
When I view the same file in vim I get the following:
^#m^#i^#c^#r^#o^#s^#o^#f^#t^#
How can I identify and remove these "non-printable" characters. What does '^#' mean in vim??
(Just a piece of background information: the file was created by base 64 decoding and cutting from the pssh header of an mpd file for Microsoft Playready)
What you see is Vim's visual representation of unprintable characters. It is explained at :help 'isprint':
Non-printable characters are displayed with two characters:
  0 - 31    "^@" - "^_"
  32 - 126  always single characters
  127       "^?"
  128 - 159 "~@" - "~_"
  160 - 254 "| " - "|~"
  255       "~?"
Therefore, ^@ stands for a null byte = 0x00. These (and other non-printable characters) can come from various sources, but in your case it's an ...
encoding issue
If you look closely at your output in Vim, every second byte is a null byte, with the expected characters in between. This is a clear indication that the file uses a multibyte encoding (UTF-16, big endian, without a byte order mark, to be precise), and that Vim did not properly detect it, instead opening the file as latin1 or so (whereas things worked out properly in the terminal).
To fix this, you can either explicitly specify the encoding:
:edit ++enc=utf-16 /tmp/file
Or tweak the 'fileencodings' option, so that Vim can automatically detect this. However, be aware that ambiguities (as in your case) make this prone to fail:
For an empty file or a file with only ASCII characters most encodings
will work and the first entry of 'fileencodings' will be used (except
"ucs-bom", which requires the BOM to be present).
That's why a byte order mark (BOM) is recommended for 16-bit encodings; but that assumes that you have control over the output encoding.
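If you would rather fix the file itself than Vim's view of it, a minimal Ruby sketch (assuming the file really is big-endian UTF-16 without a BOM, as diagnosed above):
data = File.binread('/tmp/file')
File.write('/tmp/file.utf8', data.force_encoding('UTF-16BE').encode('UTF-8'))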
^@ is Vim's representation of a null byte. The ^ indicates a non-printable control character, with the following ASCII character indicating which control character it is.
^@ == 0 (NUL)
^A == 1
^B == 2
...
^H == 8
^K == 11
...
^Z == 26
^[ == 27
^\ == 28
^] == 29
^^ == 30
^_ == 31
^? == 127
9 and 10 aren't escaped because they are Tab and Line Feed respectively.
32 to 126 are printable ASCII characters (starting with Space).
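The mapping behind this list is plain caret notation: the displayed character is the control code XORed with 64. A quick Ruby sketch reproduces it:
((0..31).to_a + [127]).each do |c|
  puts format('^%s == %d', (c ^ 64).chr, c) # ^@ == 0, ^A == 1, ..., ^? == 127
end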
Matz wrote in his book that in order to use UTF-8, you must add a coding comment on the first line of your script. He gives us an example:
# -*- coding: utf-8 -*- # Specify Unicode UTF-8 characters
# This is a string literal containing a multibyte multiplication character
s = "2x2=4"
# The string contains 6 bytes which encode 5 characters
s.length # => 5: Characters: '2' '×' '2' '=' '4'
s.bytesize # => 6: Bytes (hex): 32 c3 97 32 3d 34
When he invokes bytesize, it returns 6, since the multiplication symbol × is outside the ASCII set and is encoded in UTF-8 as two bytes.
I tried the exercise, and even without specifying the coding comment it recognized the multiplication symbol as two bytes:
'×'.encoding
=> #<Encoding:UTF-8>
'×'.bytes.to_a.map {|dec| dec.to_s(16) }
=> ["c3", "97"]
So it appears UTF-8 is the default encoding. Is this a recent addition in Ruby 2? His examples were from Ruby 1.9.
Yes. UTF-8 has been the default source encoding only since Ruby 2.0.
If you know his examples were written for Ruby 1.9, check the features newly added in the newer versions of Ruby. The list is not that long.
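A quick check (a sketch; on 1.9 the same file would need the coding comment on its first line, or the parser would reject the literal):
puts __ENCODING__ # => UTF-8 on Ruby 2.0+ (on 1.9 this prints US-ASCII, and the literal below won't even parse)
s = "2×2=4"       # no magic comment needed on 2.0+
puts s.bytesize   # => 6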
I have a ruby program that I'm trying to upgrade form ruby 1.8 to ruby 2.0.0-p247.
This works just fine in 1.8.7:
begin
ARGF.each do |line|
# a collection of peculiarities, appended as they appear in data
line.gsub!("\x92", "'")
line.gsub!("\x96", "-")
puts line
end
rescue => e
$stderr << "exception on line #{$.}:\n"
$stderr << "#{e.message}:\n"
$stderr << line
end
But under Ruby 2.0, this results in an exception when it encounters a 0x96 or 0x92 encoded into a data file that otherwise contains what appears to be ASCII:
invalid byte sequence in UTF-8
I have tried all manner of things: double backslashes, using a regex object instead of the string, force_encoding(), etc. and am stumped.
Can anybody fill in the missing puzzle piece for me?
Thanks.
=============== additions: 2013-09-25 ============
Changing \x92 to \u2019 did not fix the problem.
The program does not error until it actually hits a 92 or 96 in the input file, so I'm confused as to how the character pattern in the string is the problem when there are hundreds of thousands of lines of input data that are matched against the patterns without incident.
It's not the regex itself that's throwing the exception. \x92 and \x96 are how you would represent ’ and – in the Windows-1252 encoding, but Ruby expects both the literal and the data it is matched against to be valid UTF-8. You need to get out of the habit of putting raw byte values like \x92 in your string literals. Non-ASCII characters should be specified by Unicode escape sequences (in this case, \u2019 and \u2013).
It's a Unicode world now, stop thinking of text in terms of bytes and think in terms of characters instead.
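Given the update above (the error only appears when a 0x92 or 0x96 byte shows up in the input), the input file itself is likely Windows-1252 rather than UTF-8, so another angle is to transcode it as it is read. A sketch, assuming the data really is Windows-1252:
ARGF.set_encoding('Windows-1252', 'UTF-8') # external encoding, internal encoding
ARGF.each do |line|
  line.gsub!("\u2019", "'") # ’ right single quotation mark (0x92 in Windows-1252)
  line.gsub!("\u2013", "-") # – en dash (0x96 in Windows-1252)
  puts line
end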
I am trying to compile this Ruby code with the --1.9 option:
# encoding: utf-8
module Modd
def cpd
#"_¦+?" mySQL
"ñ,B˜"
end
end
I used the GVim editor, and when I compiled I got the following error:
SyntaxError: f3.rb:6: invalid multibyte char (UTF-8)
After that I used Notepad++, changed the encoding to "Encode in UTF-8", and compiled with this option:
jruby --1.9 f3.rb
then I get:
SyntaxError: f3.rb:1: \273Invalid char `\273' ('╗') in expression
I have seen this happen when the BOM gets mangled during a charset conversion (the UTF-8 BOM in octal is 357 273 277). If you open the file with a hexadecimal editor (:%!xxd in vi), you will more than likely see stray characters at the beginning of the file, before the first #.
If you recreate that file directly in utf-8, or get rid of these spurious characters, this should solve your problem.
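If re-saving from an editor is not an option, a Ruby sketch that strips a stray UTF-8 BOM (bytes EF BB BF) from the top of the file:
data = File.binread('f3.rb')
data = data[3..-1] if data.start_with?("\xEF\xBB\xBF".b)
File.binwrite('f3.rb', data)
When reading, Ruby can also skip a BOM transparently with File.open('f3.rb', 'r:bom|utf-8').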
I have a String that has non-ASCII characters encoded as "\\'fc" (without quotes), where fc is hex for 252 decimal, which corresponds to the German ü umlaut.
I managed to find all occurrences and can replace them, but I have not been able to convert the fc to an ü.
"fc".hex.chr
gives me another representation...but if I do
puts "fc".hex.chr
I get nothing back...
Thanks in advance
PS: I'm working on ruby 1.9 and have
# coding: utf-8
at the top of the file.
fc is not a correct UTF-8 byte sequence for that character; that's its ISO-8859-1 or Windows-1252 encoding. The UTF-8 encoding for ü is the two-byte sequence c3 bc. Further, a lone FC byte is not a valid UTF-8 sequence at all.
Since UTF-8 is assumed in Ruby 1.9, you should be able to get the literal u-umlaut with: "\xc3\xbc"
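For completeness, a sketch that goes from the captured hex digits to the character by treating the byte as ISO-8859-1 and transcoding:
"fc".hex.chr.force_encoding('ISO-8859-1').encode('UTF-8') # => "ü"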
Have you tried
puts "fc".hex.chr(Encoding::UTF_8)
Ruby docs:
int.chr
Encoding
UPDATE:
Jason True is right. fc is invalid UTF-8. I have no idea why my example works!
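(A likely explanation for why it works anyway: Integer#chr with an Encoding argument treats the number as a Unicode codepoint rather than a raw byte, so it returns the valid two-byte UTF-8 string.)
0xfc.chr(Encoding::UTF_8).bytes.map { |b| b.to_s(16) } # => ["c3", "bc"]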