Why \xF3 is not recognized as UTF-8 - ruby

I have this hash:
a={"topic_id"=>60693, "urlkey"=>"innovacion", "name"=>"Innovaci\xF3n"}
and I am trying to save it to MongoDB using Mongoid, when I get this error:
BSON::InvalidStringEncoding: String not valid UTF-8
I am then trying to gsub it:
a["name"].gsub(/\xF3/,"o")
and I get: SyntaxError: (pry):12: too short escaped multibyte character: /\xF3/
I have added a magic comment at the beginning of my model file:# encoding: UTF-8

Hexidecimal 0xF3 by itself is not valid UTF-8. Values greater than 0x7F are all multi-byte characters. What makes you think it should be UTF-8?
You can read up on the allowable sequences here: http://en.wikipedia.org/wiki/UTF-8#Description
If you need to force the ruby string to assume an encoding that allows arbitrary byte sequences, you can force it to binary:
str.force_encoding("BINARY")
With a binary encoding, #gsub and other string operations that rely on valid encodings will work on a byte-by-byte basis, instead of a character-by-character basis.

Related

How to decode/encode ASCII_8BIT in the Scala?

I have an external system written in Ruby which sending a data over the wire encoded with ASCII_8BIT. How should I decode and encode them in the Scala?
I couldn't find a library for decoding and encoding ASCII_8BIT string in scala.
As I understand, correctly, the ASCII_8BIT is something similar to Base64. However, there is more than one Base64 encoding. Which type of encoding should I use to be sure that cover all corner cases?
What is ASCII-8BIT?
ASCII-8BIT is Ruby's binary encoding (the name "BINARY" is accepted as an alias for "ASCII-8BIT" when specifying the name of an encoding). It is used both for binary data and for text whose real encoding you don't know.
Any sequence of bytes is a valid string in the ASCII-8BIT encoding, but unlike other 8bit-encodings, only the bytes in the ASCII range are considered printable characters (and of course only those that are printable in ASCII). The bytes in the 128-255 range are considered special characters that don't have a representation in other encodings. So trying to convert an ASCII-8BIT string to any other encoding will fail (or replace the non-ASCII characters with question marks depending on the options you give to encode) unless it only contains ASCII characters.
What's its equivalent in the Scala/JVM world?
There is no strict equivalent. If you're dealing with binary data, you should be using binary streams that don't have an encoding and aren't treated as containing text.
If you're dealing with text, you'll either need to know (or somehow figure out) its encoding or just arbitrarily pick an 8-bit ASCII-superset encoding. That way non-ASCII characters may come out as the wrong character (if the text was actually encoded with a different encoding), but you won't get any errors because any byte is a valid character. You can then replace the non-ASCII characters with question marks if you want.
What does this have to do with Base64?
Nothing. Base64 is a way to represent binary data as ASCII text. It is not itself a character encoding. Knowing that a string has the character encoding ASCII or ASCII-8BIT or any other encoding, doesn't tell you whether it contains Base64 data or not.
But do note that a Base64 string will consist entirely of ASCII characters (and not just any ASCII characters, but only letters, numbers, +, / and =). So if your string contains any non-ASCII character or any character except the aforementioned, it's not Base64.
Therefore any Base64 string can be represented as ASCII. So if you have an ASCII-8BIT string containing Base64 data in Ruby, you should be able to convert it to ASCII without any problems. If you can't, it's not Base64.

How to address Compatibility Error with ruby

I have a ruby program that parses a large block of text with a number of regular expressions. The problem I'm having is that anytime the text contains 'special characters' (for example Kuutõbine or Noël) the program throws an Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string) How do I force the proper encoding?
Your Regex is being "compiled" as ASCII-8BIT.
Just add the encoding declaration at the top of the file where the Regex is declared:
encoding: utf-8
And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regex, String, etc) is specified in UTF-8 encoding.

Convert UTF-8 to CP1252 ruby 2.2

How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
this code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
Give this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application need to be cp1252, but I can't find any way to keep all the characters.
I can't replace this characters, because later I will use this info to read the file from file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
ps: It is a ruby script not a ruby on rails application
UTF-8 covers the entire range of unicode, but CP1252 only includes a subset of them. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.

Are functions like strip_tags() and trim() UTF-8 aware?

I'm wondering whether functions like strip_tags() and trim() UTF-8 aware?
I found this on the web, but I'm not sure about it:
strip_tags(): multi-byte UTF-8 characters contain no byte sequences that resemble less-than or greater-than symbols.
trim(): multi-byte UTF-8 characters contain no byte sequences that resemble white space.
If it's true, using those functions with an UTF-8 string could lead to a corrupted/illegible string.
Thank you.
I think the descriptions you quoted mean just the opposite. Because utf8 multi-byte characters DON'T contains whitespace, or lt/gt, or any other byte < 0x80, you can safely use those functions on utf8 strings. That's the beauty of utf8!

How to remove all non - ASCII characters from a string in Ruby

I seems to be a very simple and much needed method. I need to remove all non ASCII characters from a string. e.g © etc. See the following example.
#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode
output:
UTF-8
Hello this a mixed str
ing © that I made.
When I feed this to Watir, it produces following error:incompatible character encodings: UTF-8 and ASCII-8BIT
So my problem is that I want to get rid of all non ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
: "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitue all characters which don't thave the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")
Strip out the characters using regex. This example is in C# but the regex should be the same:
How can you strip non-ASCII characters from a string? (in C#)
Translating it into ruby using gsub should not be difficult.
UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.

Resources