I have a Ruby program that parses a large block of text with a number of regular expressions. The problem I'm having is that any time the text contains 'special characters' (for example Kuutõbine or Noël) the program throws Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string). How do I force the proper encoding?
Your Regex is being "compiled" as ASCII-8BIT.
Just add the encoding declaration at the top of the file where the Regex is declared:
# encoding: utf-8
And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regexp, String, etc.) is specified in UTF-8 encoding.
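As a minimal sketch (the file name and sample text here are hypothetical), a script with the magic comment can match accented characters without the compatibility error, provided the text itself is also read as UTF-8:
# encoding: utf-8

# The magic comment above makes the regexp literal UTF-8; reading the
# file with an explicit encoding keeps the string UTF-8 as well.
text = File.read('names.txt', encoding: 'UTF-8')
puts text.scan(/Kuutõbine|Noël/)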
How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be cp1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this info to read the file from the file system. If I use the replace option, the character that can't be converted is simply dropped:
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: It is a Ruby script, not a Ruby on Rails application.
UTF-8 covers the entire range of Unicode, but CP1252 only includes a small subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
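To illustrate (a small sketch; the two literals are just the combining and precomposed ways of writing ç):
combining = "c\u0327"        # "c" followed by U+0327 COMBINING CEDILLA
precomposed = "\u00E7"       # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
puts combining == precomposed                          # => false
puts combining.unicode_normalize(:nfc) == precomposed  # => true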
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
When I use Nokogiri to parse HTML, the Chinese characters are transformed into escaped sequences like
"å·
峰延æ¶"
How could I decode the escaped characters like "å·
峰延æ¶" back to normal characters?
It looks like your HTML page is encoded as UTF-8 but you are parsing it as ISO-8859-1. You need to ensure you specify the correct encoding when parsing. If you are parsing from a string, Nokogiri should use the same encoding as the string. If you are parsing from an IO object, you can specify the encoding as the third argument to the parse method:
Nokogiri::HTML::Document.parse(io_object, nil, 'UTF-8')
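If a string has already been mis-decoded (UTF-8 bytes read as ISO-8859-1), one common repair, sketched here with a made-up single character, is to encode it back to ISO-8859-1 and then relabel the bytes as UTF-8:
garbled = "å³°"                                        # mis-decoded form of "峰"
fixed = garbled.encode('ISO-8859-1').force_encoding('UTF-8')
puts fixed                                             # => "峰"
puts fixed.valid_encoding?                             # => true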
What should the normal characters be though? This looks like their string representations.
Otherwise you have CGI.unescapeHTML() and CGI.escapeHTML() available in standard ruby (stdlib).
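For example, if the page actually contained numeric character references rather than mojibake, CGI.unescapeHTML would decode them (the entity below is just an illustration):
require 'cgi'
puts CGI.unescapeHTML("&#23792;")   # => "峰" (U+5CF0, decimal 23792)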
I have this hash:
a={"topic_id"=>60693, "urlkey"=>"innovacion", "name"=>"Innovaci\xF3n"}
and I am trying to save it to MongoDB using Mongoid, when I get this error:
BSON::InvalidStringEncoding: String not valid UTF-8
I am then trying to gsub it:
a["name"].gsub(/\xF3/,"o")
and I get: SyntaxError: (pry):12: too short escaped multibyte character: /\xF3/
I have added a magic comment at the beginning of my model file: # encoding: UTF-8
Hexadecimal 0xF3 by itself is not valid UTF-8. Byte values greater than 0x7F only appear as part of multi-byte sequences. What makes you think it should be UTF-8?
You can read up on the allowable sequences here: http://en.wikipedia.org/wiki/UTF-8#Description
If you need to force the ruby string to assume an encoding that allows arbitrary byte sequences, you can force it to binary:
str.force_encoding("BINARY")
With a binary encoding, #gsub and other string operations that rely on valid encodings will work on a byte-by-byte basis, instead of a character-by-character basis.
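A minimal sketch of that approach, using the name value from the question (the \xF3 byte is Latin-1 for ó, so replacing it with a plain o is lossy):
a = {"topic_id"=>60693, "urlkey"=>"innovacion", "name"=>"Innovaci\xF3n"}
name = a["name"].force_encoding("BINARY")   # treat the value as raw bytes
fixed = name.gsub("\xF3", "o")              # a string pattern avoids the regexp escape error
puts fixed                                  # => "Innovacion"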
puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I would prefer this text in UTF-8. However,
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
Why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It has the alias BINARY, and it is just that: ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and not to be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character directives. If you clarify what the overall goal is, maybe we could give you a better solution.
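For instance, if you know the packed bytes really are UTF-8, you can relabel them with force_encoding instead of converting with encode (a small sketch using the bytes from the question):
bytes = ["C3A9"].pack('H*')            # => "\xC3\xA9", tagged ASCII-8BIT
utf8 = bytes.force_encoding('UTF-8')   # relabel the bytes, no conversion happens
puts utf8                              # => "é"
puts utf8.valid_encoding?              # => true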
If you isolate a hex UTF-8 code into a variable, say code, which is a string in hexadecimal format minus the percent sign:
utf_char=[code.to_i(16)].pack("U")
Combine these with the rest of the string, and you can build up your string.
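A hypothetical usage of that line, assuming code holds a single Unicode codepoint in hex (for example "017B" for Ż) rather than individual UTF-8 bytes:
code = "017B"                          # hypothetical codepoint for Ż
utf_char = [code.to_i(16)].pack("U")   # pack("U") emits UTF-8
puts utf_char                          # => "Ż"
puts utf_char.encoding                 # => UTF-8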
It seems to be a very simple and much-needed method: I need to remove all non-ASCII characters (e.g. ©) from a string. See the following example.
#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode
output:
UTF-8
Hello this a mixed string © that I made.
When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT
So my problem is that I want to get rid of all non-ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
: "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitute all characters which don't have the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")
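For example, applied to the string from the question (both calls mutate s in place and return nil if nothing was removed):
s = "Hello this a mixed string © that I made."
s.gsub!(/\P{ASCII}/, '')
puts s   # => "Hello this a mixed string  that I made."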
Strip out the characters using regex. This example is in C# but the regex should be the same:
How can you strip non-ASCII characters from a string? (in C#)
Translating it into ruby using gsub should not be difficult.
UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.
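A quick sketch of that byte-level idea (it assumes the input really is UTF-8, so dropping every byte with the MSB set removes both the lead bytes and their trailers):
s = "Hello this a mixed string © that I made."
ascii_only = s.bytes.to_a.select { |b| b < 0x80 }.pack('C*')
puts ascii_only   # => "Hello this a mixed string  that I made."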