puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I'd prefer this text in UTF-8. However,
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
Why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It has an alias of BINARY, and it is just that: ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and should not be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we could give you a better solution.
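For the original snippet, the bytes coming out of pack are already valid UTF-8, so relabelling them with force_encoding (rather than converting with encode) is enough; a minimal sketch:

```ruby
# "C3A9" is the hex spelling of the UTF-8 bytes 0xC3 0xA9, i.e. "é".
bytes = ["C3A9"].pack('H*')           # => "\xC3\xA9", tagged ASCII-8BIT
text  = bytes.force_encoding('UTF-8') # relabel the bytes; no conversion happens
text                                  # => "é"
```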
If you isolate a character's hex Unicode codepoint into a variable, say code, which is a string of the hexadecimal value minus the percent sign:
utf_char=[code.to_i(16)].pack("U")
Combining these with the rest of the string, you can build up your result.
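A minimal sketch, assuming code holds a character's codepoint in hex (note that pack("U") takes a Unicode codepoint and emits its UTF-8 bytes, so this applies when the escape represents a whole codepoint rather than one byte of a multi-byte sequence):

```ruby
code = "e9"                          # hex codepoint of é (U+00E9); value is illustrative
utf_char = [code.to_i(16)].pack("U") # pack("U") emits the UTF-8 encoding of the codepoint
utf_char                             # => "é", encoding UTF-8
```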
Related
I have a Ruby program that parses a large block of text with a number of regular expressions. The problem I'm having is that whenever the text contains 'special characters' (for example Kuutõbine or Noël), the program throws an Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string). How do I force the proper encoding?
Your Regex is being "compiled" as ASCII-8BIT.
Just add the encoding declaration at the top of the file where the Regex is declared:
# encoding: utf-8
And you're done. Now, when Ruby parses your code, it will assume every literal you use (Regexp, String, etc.) is specified in UTF-8.
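If it is the string (rather than the regexp) that arrives tagged as ASCII-8BIT, for example from a raw network or file read, relabelling its encoding also resolves the mismatch; a small sketch (the sample text is illustrative):

```ruby
# Simulate data that arrived tagged as binary.
text = "Kuutõbine".dup.force_encoding('ASCII-8BIT')

# Matching a UTF-8 regexp against it would raise
# Encoding::CompatibilityError:
#   text =~ /õ/

# Relabel the bytes as UTF-8 (they already are), then the match works:
text.force_encoding('UTF-8')
text =~ /õ/   # => 4
```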
How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
this code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to use CP1252, but I can't find any way to keep all the characters.
I can't simply replace these characters, because later I will use this string to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: it is a plain Ruby script, not a Ruby on Rails application.
UTF-8 covers the entire range of Unicode, but CP1252 includes only a subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252, and that is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
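A minimal sketch of the difference, building the decomposed form explicitly:

```ruby
# "ç" stored in decomposed form: "c" followed by the combining
# cedilla U+0327 (this is what triggered the original error).
decomposed = "c\u0327"
composed   = decomposed.unicode_normalize(:nfc)  # => "ç" (precomposed U+00E7)

composed.encode('cp1252')    # succeeds: U+00E7 exists in CP1252
# decomposed.encode('cp1252')  would raise Encoding::UndefinedConversionError
```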
Consider the following code:
__ENCODING__
# => #<Encoding:UTF-8>
Encoding.default_internal
# => #<Encoding:UTF-8>
Encoding.default_external
# => #<Encoding:UTF-8>
Case 1: HAML throws Encoding::UndefinedConversionError
string = "j\xC3\xBCrgen".force_encoding('ASCII-8BIT')
string.encoding
# => #<Encoding:ASCII-8BIT>
Haml::Engine.new("#{string}").render
## => Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8
ERB.new("<%= string %>").result(binding)
# => "jürgen"
# => Resulting encoding is #<Encoding:UTF-8>
Erubis::Eruby.new("<%= string %>").result(binding)
# => "j\xC3\xBCrgen"
# => resulting encoding is #<Encoding:ASCII-8BIT>
Case 2: HAML doesn't throw an error
string = "Ratatouille".force_encoding('ASCII-8BIT')
string.encoding
# => #<Encoding:ASCII-8BIT>
Haml::Engine.new("#{string}").render
## => "Ratatouille\n"
## => resulting encoding is #<Encoding:UTF-8>
ERB.new("<%= string %>").result(binding)
# => "Ratatouille"
# => resulting encoding is #<Encoding:UTF-8>
Erubis::Eruby.new("<%= string %>").result(binding)
# => "Ratatouille"
# => result encoding is #<Encoding:US-ASCII>
Questions:
Why is HAML failing in case 1 but succeeding in case 2?
I'm asking because I'm facing a similar problem: rendering in HAML blows up the page with an Encoding::CompatibilityError.
The only way I currently know to avoid the error is to force my string's encoding to UTF-8 with .force_encoding('UTF-8'). That sort of avoids the issue, but I have to do it on every page where I use the given string, i.e. "j\xC3\xBCrgen" (which feels lame considering how many pages there are).
Any clue?
Haml is trying to encode the result string to your Encoding.default_internal setting. In the first example the string ("j\xC3\xBCrgen") contains non-ASCII bytes (i.e. bytes with the high bit set), whilst the string in the second example ("Ratatouille") doesn't. Ruby can encode the second string (since UTF-8 is a superset of ASCII), but can't encode the first, and so raises an error.
One way to work round this is to explicitly pass the string's encoding as an option to Haml::Engine:
Haml::Engine.new("#{string}", :encoding => Encoding::ASCII_8BIT).render
This will give you a result string that is also ASCII-8BIT.
In this case the string in question is UTF-8 though, so a better solution might be to look at where the string is coming from in your app and ensure it has the right encoding.
I don’t know enough about ERB and Erubis to say what’s happening there; it looks like ERB is assuming the input is UTF-8 (it has no way to know those bytes should actually be treated as UTF-8), and Erubis is doing the more sensible thing of leaving the encoding as binary – either because it isn’t doing any encoding at all, or it is treating binary-encoded input specially.
From the PickAxe book:
Ruby supports a virtual encoding called ASCII-8BIT. Despite the ASCII in the name, this is really intended to be used on data streams that contain binary data (which is why it has an alias of BINARY). However, you can also use this as an encoding for source files. If you do, Ruby interprets all characters with codes below 128 as regular ASCII and all other characters as valid constituents of variable names. This is basically a neat hack, because it allows you to compile a file written in an encoding you don’t know—the characters with the high-order bit set will be assumed to be printable.
String#force_encoding tells Ruby which encoding to use in order to interpret some binary data. It does not change/convert the actual bytes (that would be String#encode), just changes the encoding associated with these bytes.
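The distinction can be seen directly; a small sketch:

```ruby
s = "j\xC3\xBCrgen".dup.force_encoding('ASCII-8BIT')

# force_encoding relabels the same bytes with a new encoding:
relabelled = s.dup.force_encoding('UTF-8')
relabelled                    # => "jürgen"
relabelled.bytes == s.bytes   # => true, the bytes never changed

# s.encode('UTF-8') would raise Encoding::UndefinedConversionError here,
# because there is no defined conversion for binary bytes above 0x7F.
```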
Why would you try to associate a BINARY encoding with a string containing UTF-8 characters anyway?
Regarding your question about why the second case succeeds the answer is simply that your second string ("Ratatouille") contains only 7-bit ASCII characters.
I'm writing a crawler which uses Hpricot. It downloads a list of strings from some webpage, then I try to write it to the file. Something is wrong with the encoding:
"\xC3" from ASCII-8BIT to UTF-8
I have items which are rendered on a webpage and printed this way:
Développement
str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?
Your string seems to have been encoded the wrong way round:
"Développement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"
It seems your string thinks it is UTF-8, but in reality it is something else, probably ISO-8859-1.
Define (force) the correct encoding first, then convert it to UTF-8.
In your example:
puts "Développement".encode('iso-8859-1').encode('utf-8')
An alternative is:
puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã
If the à makes no sense, then try another encoding.
"ruby 1.9: invalid byte sequence in UTF-8" described another good approach with less code:
file_contents.encode!('UTF-16', 'UTF-8')
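Note that the bare two-argument form raises if the string actually contains invalid UTF-8 bytes; the variant usually quoted adds replacement options so that bad bytes are dropped during the round trip (a sketch, assuming the goal is to scrub invalid sequences):

```ruby
dirty = "abc\xDDdef"   # \xDD is not valid UTF-8
clean = dirty.encode('UTF-16', 'UTF-8', invalid: :replace, replace: '')
             .encode('UTF-8')
clean                  # => "abcdef"
```

On Ruby 2.1 and later, `dirty.scrub('')` achieves the same result directly.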
I have this hash:
a={"topic_id"=>60693, "urlkey"=>"innovacion", "name"=>"Innovaci\xF3n"}
and I am trying to save it to MongoDB using Mongoid, when I get this error:
BSON::InvalidStringEncoding: String not valid UTF-8
I am then trying to gsub it:
a["name"].gsub(/\xF3/,"o")
and I get: SyntaxError: (pry):12: too short escaped multibyte character: /\xF3/
I have added a magic comment at the beginning of my model file: # encoding: UTF-8
Hexadecimal 0xF3 by itself is not valid UTF-8. Byte values greater than 0x7F only occur as part of multi-byte sequences. What makes you think it should be UTF-8?
You can read up on the allowable sequences here: http://en.wikipedia.org/wiki/UTF-8#Description
If you need to force the ruby string to assume an encoding that allows arbitrary byte sequences, you can force it to binary:
str.force_encoding("BINARY")
With a binary encoding, #gsub and other string operations that rely on valid encodings will work on a byte-by-byte basis, instead of a character-by-character basis.
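A minimal sketch of the byte-wise approach, plus (assuming the bytes really are ISO-8859-1, as 0xF3 for "ó" suggests) the transcoding alternative:

```ruby
name = "Innovaci\xF3n".b     # .b returns a binary (ASCII-8BIT) copy

# Byte-wise gsub works once pattern and string are both binary:
name.gsub("\xF3".b, "o")     # => "Innovacion"

# If the data is actually ISO-8859-1 (where 0xF3 is "ó"),
# transcoding it to real UTF-8 is arguably the better fix:
name.force_encoding('ISO-8859-1').encode('UTF-8')  # => "Innovación"
```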