Convert UTF-8 to CP1252 in Ruby 2.2

How can I keep all characters when converting from UTF-8 to CP1252 on Ruby 2.2? This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be CP1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this info to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: It is a Ruby script, not a Ruby on Rails application.

UTF-8 covers the entire range of Unicode, but CP1252 only includes a small subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327, is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation to NFC, which converts combining sequences like this into precomposed characters, giving a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
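To see why normalisation helps here, compare the decomposed and precomposed forms of ç directly (a small sketch; the decomposed form is the one your string contained):
decomposed = "c\u0327"                            # 'c' + U+0327 COMBINING CEDILLA
composed   = decomposed.unicode_normalize(:nfc)   # the single codepoint U+00E7 ("ç")
composed.encode('cp1252')                         # works: CP1252 has ç at 0xE7
decomposed.encode('cp1252')                       # raises Encoding::UndefinedConversionError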

Related

Why does Ruby Integer method 'chr' use ASCII-8BIT, not UTF-8 by default?

According to its documentation (https://www.rubydoc.info/stdlib/core/Integer:chr), this method uses ASCII encoding if no argument is provided, and indeed it gives different results when called with and without arguments:
irb(main):002:0> 255.chr
=> "\xFF"
irb(main):003:0> 255.chr 'utf-8'
=> "ÿ"
Why does this happen? Isn't Ruby supposed to use UTF-8 everywhere by default? At least all strings seem to be encoded with UTF-8:
irb(main):005:0> "".encoding
=> #<Encoding:UTF-8>
Why does this happen?
For characters from U+0000 to U+007F (127), the vast majority of single-octet and variable-length character encodings agree on the encoding. In particular, they all agree on being strict supersets of ASCII.
In other words: for characters up to and including U+007F, ASCII, the entire ISO8859 family, the entire DOS codepage family, the entire Windows family, as well as UTF-8 are actually identical. So, for characters between U+0000 and U+007F, ASCII is the logical choice:
0.chr.encoding
#=> #<Encoding:US-ASCII>
127.chr.encoding
#=> #<Encoding:US-ASCII>
However, for anything above 127, more or less no two character encodings agree. In fact, the overwhelming majority of characters above 127 don't even exist in the overwhelming majority of character sets, and thus don't have an encoding in the vast majority of character encodings.
In other words: it is practically impossible to find a single default encoding for characters above 127.
Therefore, the encoding that is chosen by Ruby is Encoding::BINARY, which is basically a pseudo-encoding that means "this isn't actually text, this is unstructured unknown binary data". (For hysterical raisins, this encoding is also aliased to ASCII-8BIT, which I find absolutely horrible, because ASCII is 7 bit, period, and anything using the 8th bit is by definition not ASCII.)
128.chr.encoding
#=> #<Encoding:ASCII-8BIT>
255.chr.encoding
#=> #<Encoding:ASCII-8BIT>
Note also that Integer#chr is limited to a single octet, i.e. to a range from 0 to 255, so multi-octet or variable-length encodings are not really required here.
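You can see that limit directly in IRB (a quick sketch; the one-argument form can go beyond one octet because you name the encoding explicitly):
256.chr                    # no encoding given, so limited to a single octet
# => RangeError: 256 out of char range
256.chr(Encoding::UTF_8)   # with an explicit encoding, codepoints above 255 work
# => "Ā"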
Isn't Ruby supposed to use UTF-8 everywhere by default?
Which encoding are you talking about? Ruby has about half a dozen of them.
For the vast majority of these encodings, your statement is incorrect:
- the locale encoding is the default encoding of the environment
- the filesystem encoding is the encoding that is used for file paths: the value is determined by the file system
- the external encoding of an IO object is the encoding that text being read is assumed to be in and that text being written is transcoded to: the default is the locale encoding
- the internal encoding of an IO object is the encoding that Strings written to the IO object must be in and that Strings read from the IO object are transcoded into: the default is the default internal encoding, whose default value, in turn, is nil, meaning no transcoding occurs
- the script encoding is the encoding a Ruby script is read in, and which String literals in the script inherit: it is set with a magic comment at the beginning of the script, and the default is UTF-8
So, as you can see, there are many different encodings, and many different defaults, and only one of them is UTF-8. And none of those encodings are actually relevant to your question, because 128.chr is neither a String literal nor an IO object. It is a String object that is created by the Integer#chr method using whatever encoding it sees fit.
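You can inspect these defaults yourself (a small sketch; the locale and filesystem values depend on the environment the script runs in):
puts Encoding.find('locale')            # the locale encoding
puts Encoding.find('filesystem')        # the filesystem encoding
puts Encoding.default_external          # default external encoding for IO
puts Encoding.default_internal.inspect  # default internal encoding (nil unless set)
puts __ENCODING__                       # this script's encoding (UTF-8 by default)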

How to address Compatibility Error with ruby

I have a Ruby program that parses a large block of text with a number of regular expressions. The problem I'm having is that any time the text contains 'special characters' (for example Kuutõbine or Noël) the program throws an Encoding::CompatibilityError: incompatible encoding regexp match (UTF-8 regexp with ASCII-8BIT string). How do I force the proper encoding?
Your Regex is being "compiled" as ASCII-8BIT.
Just add the encoding declaration at the top of the file where the Regex is declared:
# encoding: utf-8
And you're done. Now, when Ruby is parsing your code, it will assume every literal you use (Regex, String, etc) is specified in UTF-8 encoding.
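Note that the error message quoted above actually says the string, not the regexp, is ASCII-8BIT. If the magic comment alone doesn't help, a hedged sketch of the other half of the fix is to tag the incoming bytes with their real encoding before matching (input.txt is a hypothetical file name, and this assumes the bytes genuinely are UTF-8):
raw  = File.binread('input.txt')   # binread returns the bytes tagged ASCII-8BIT
text = raw.force_encoding('UTF-8') # no bytes change, only the encoding label
text.scan(/\p{L}+/)                # a UTF-8 regexp now matches without error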

Convert a unicode string to characters in Ruby?

I have the following string:
l\u0092issue
My question is how to convert it to UTF-8 characters?
I have tried that
1.9.3p484 :024 > "l\u0092issue".encode('utf-8')
=> "l\u0092issue"
You seem to have got your encodings into a bit of a mix up. If you haven't already, you should first read Joel Spolsky's article The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), which provides a good introduction to this type of thing. There is a good set of articles on how Ruby handles character encodings at http://graysoftinc.com/character-encodings/understanding-m17n-multilingualization. You could also have a look at the Ruby docs for String and Encoding.
In this specific case, the string l\u0092issue means that the second character is the character with the Unicode codepoint U+0092. This codepoint is PRIVATE USE TWO, which basically means this position isn't used.
However, looking at the Windows CP-1252 encoding, position 0x92 is occupied by the character ’, so if this is the missing character then the string would be l’issue, which looks a lot more likely even though I don't speak French.
What I suspect has happened is your program has received the string l’issue encoded in CP-1252, but has assumed it was encoded in ISO-8859-1 (ISO-8859-1 and CP-1252 are quite closely related) and re-encoded it to UTF-8 leaving you with the string you now have.
The real fix for you is to be careful about the encodings of any strings that enter (and leave) your program, and how you manage them.
To transform your string to l’issue, you can encode it back to ISO-8859-1, then use force_encoding to tell Ruby the real encoding is CP-1252, and then you can re-encode to UTF-8:
2.1.0 :001 > s = "l\u0092issue"
=> "l\u0092issue"
2.1.0 :002 > s = s.encode('iso-8859-1')
=> "l\x92issue"
2.1.0 :003 > s.force_encoding('cp1252')
=> "l\x92issue"
2.1.0 :004 > s.encode('utf-8')
=> "l’issue"
This is only really a demonstration of what is going on though. The real solution is to make sure you’re handling encodings correctly.
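For reference, the same round trip condensed into a single expression:
"l\u0092issue".encode('iso-8859-1').force_encoding('cp1252').encode('utf-8')
# => "l’issue"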
That is encoded as UTF-8 (unless you changed the original string encoding). Ruby is just showing you the escape sequences when you inspect the string (which is what IRB does there). \u0092 is the escape sequence for this character.
Try puts "l\u0092issue" to see the rendered character, if your terminal font supports it.

Convert unicode codepoint to string character in Ruby

I have these values from a Unicode database but I'm not sure how to translate them into human-readable form. What are these even called?
Here they are:
U+2B71F
U+2A52D
U+2A68F
U+2A690
U+2B72F
U+2B4F7
U+2B72B
How can I convert these to their readable symbols?
How about:
# Using pack
puts ["2B71F".hex].pack("U")
# Using chr
puts (0x2B71F).chr(Encoding::UTF_8)
In Ruby 1.9+ you can also do:
puts "\u{2B71F}"
I.e. the \u{} escape sequence can be used to decode Unicode codepoints.
Unicode symbols like U+2B71F are referred to as codepoints.
Unicode defines a unique codepoint for each character in a multitude of world languages, scientific symbols, currencies, etc. This character set is steadily growing.
For example, U+221E is infinity.
Codepoints are written as hexadecimal numbers. There is always exactly one codepoint defined per character.
There are many ways to arrange these numbers in memory. This is known as an encoding, of which the common ones are UTF-8 and UTF-16. The conversions between them are well defined.
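For example, here is the infinity codepoint mentioned above laid out in two common encodings (a quick sketch):
s = "\u{221E}"                                             # U+221E INFINITY
s.bytes.map { |b| format('%02X', b) }                      # UTF-8 bytes: ["E2", "88", "9E"]
s.encode('utf-16be').bytes.map { |b| format('%02X', b) }   # UTF-16BE bytes: ["22", "1E"]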
Here you are most probably looking for converting the Unicode codepoint to a UTF-8 character.
codepoint = "U+2B71F"
You need to extract the hex part coming after U+ and get only 2B71F. This will be the first group capture:
codepoint.to_s =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
And your UTF-8 character will be:
utf_8_character = [$1.hex].pack("U")
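Putting it together for the whole list in the question (a sketch using the same regex and pack call as above):
codepoints = %w[U+2B71F U+2A52D U+2A68F U+2A690 U+2B72F U+2B4F7 U+2B72B]
chars = codepoints.map do |cp|
  cp =~ /U\+([0-9a-fA-F]{4,5}|10[0-9a-fA-F]{4})$/
  [$1.hex].pack("U")
end
puts chars.join(' ')   # prints the characters, font permitting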
References:
Convert Unicode codepoints to UTF-8 characters with Module#const_missing.
Tim Bray on the goodness of unicode.
Joel Spolsky - The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).
Dissecting the Unicode regular expression

How to remove all non - ASCII characters from a string in Ruby

It seems to be a very simple and much-needed method. I need to remove all non-ASCII characters from a string, e.g. © etc. See the following example.
#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode
output:
UTF-8
Hello this a mixed string © that I made.
When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT
So my problem is that I want to get rid of all non-ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
`encode': "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitute all characters which don't have the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")
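Both variants applied to the question's own string, using the non-destructive forms so the original is preserved:
s = " Hello this a mixed string © that I made."
s.gsub(/\P{ASCII}/, '')          # => " Hello this a mixed string  that I made."
s.delete("^\u{0000}-\u{007F}")   # => same result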
Strip out the characters using a regex. This example is in C# but the regex should be the same:
How can you strip non-ASCII characters from a string? (in C#)
Translating it into Ruby using gsub should not be difficult.
UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.
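A hedged sketch of that idea: UTF-8 lead bytes and their trailers all have the high bit set, so dropping every byte at or above 0x80 removes exactly the multi-byte sequences:
def strip_non_ascii_bytes(str)
  # Keep only 7-bit bytes; lead bytes (11xxxxxx) and continuation bytes
  # (10xxxxxx) of multi-byte UTF-8 sequences all have the MSB set.
  str.bytes.select { |b| b < 0x80 }.pack('C*').force_encoding('US-ASCII')
end

strip_non_ascii_bytes("Hello © world")   # => "Hello  world"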
