A quick web search will confirm that US-ASCII is a subset of UTF-8, but what I've not yet found is how to convert &foo; and &#123; to their corresponding native UTF-8 characters.
I know that at least 7-bit US-ASCII is unchanged in UTF-8, but I haven't yet seen a program that filters text and converts &foo; to how it would naturally be expressed in UTF-8.
You can use html_entity_decode($s, ENT_QUOTES | ENT_HTML5, "UTF-8") in PHP (the second argument is the flags bitmask; the encoding is the third) or html.unescape(s) in Python.
https://www.php.net/manual/en/function.html-entity-decode.php
https://docs.python.org/3/library/html.html#html.unescape
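For readers following the Ruby threads below, here is a minimal sketch of the same idea in plain Ruby. It is not the PHP or Python implementation: `decode_entities` and the `NAMED_ENTITIES` table are hypothetical names introduced for illustration, and the table covers only a handful of named entities, so a real program would need the full HTML entity list.

```ruby
# Hypothetical, deliberately tiny table of named entities.
NAMED_ENTITIES = {
  'amp' => '&', 'lt' => '<', 'gt' => '>', 'quot' => '"',
  'auml' => "\u00E4", 'eacute' => "\u00E9"
}.freeze

# Replace numeric (&#123; / &#x7B;) and known named references with UTF-8 text.
# Unknown names such as &foo; are left untouched.
def decode_entities(text)
  text.gsub(/&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);/) do |whole|
    ref = Regexp.last_match(1)
    case ref
    when /\A#[xX](.+)\z/ then [Regexp.last_match(1).to_i(16)].pack('U')
    when /\A#(.+)\z/     then [Regexp.last_match(1).to_i].pack('U')
    else NAMED_ENTITIES.fetch(ref, whole)
    end
  end
end

puts decode_entities('&auml; &#123; &#x7B; &amp; &foo;')
```

Numeric references map directly to code points, which is why they need no table; only the named ones do.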
How to keep all characters when converting from UTF-8 to CP1252 in Ruby 2.2
This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to use CP1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this string to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: It is a Ruby script, not a Ruby on Rails application.
UTF-8 covers the entire range of Unicode, but CP1252 only includes a subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
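A small sketch of the difference, assuming the file name really does contain combining marks (the `DECOMPOSED` constant and the `to_cp1252` helper are names made up for this illustration):

```ruby
# "descrição" written with combining marks: c + U+0327 and a + U+0303.
DECOMPOSED = "descric\u0327a\u0303o"

# Compose first (NFC), then transcode to Windows-1252.
def to_cp1252(text)
  text.unicode_normalize(:nfc).encode('Windows-1252')
end

begin
  DECOMPOSED.encode('Windows-1252')
rescue Encoding::UndefinedConversionError => e
  puts "without normalisation: #{e.message}"
end

encoded = to_cp1252(DECOMPOSED)
puts encoded.bytes.length                         # one byte per composed character

# Going back to UTF-8 yields the composed form, not the original string.
back = encoded.encode('UTF-8')
puts back == DECOMPOSED                           # false
puts back.unicode_normalize(:nfd) == DECOMPOSED   # true only because the original was NFD
```

The last line is the caveat in practice: the original name can only be recovered if you know which normalization form the file system used.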
I have the following string encoded with ISO-8859-15 stored inside of a file:
DEBUG_RECEIVED: ????
The correct UTF-8 string though is:
DEBUG_RECEIVED: 测试手机
Does it make sense to try converting those wrong ???? characters back into 测试手机 (that is, from ISO-8859-15 to UTF-8 again), or is it simply impossible because ISO-8859-15 is not used for Chinese characters and, since it uses 8 bits per character, the 16 bits needed for Chinese characters are lost?
When I try the following:
echo "DEBUG_RECEIVED: ????" | iconv -f iso-8859-15 -t utf-8
I still get DEBUG_RECEIVED: ???? as output.
I am a bit confused about this, please, if you can clarify this detail, it would be great.
Thanks for the attention.
Yes, whatever generated the 8859-15 string had to discard the information necessary to represent Chinese characters.
Lost info is lost – your Chinese characters seem to have been replaced by ?, and there is nothing that can get them back.
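To see why nothing can be recovered, here is a sketch in Ruby of what probably happened. It assumes the file was produced by a conversion that substituted ? for unrepresentable characters; the actual tool is unknown, and `lossy_roundtrip` is a name invented for this example.

```ruby
# Convert to ISO-8859-15, replacing anything it cannot represent with "?",
# then convert back to UTF-8.
def lossy_roundtrip(text)
  latin = text.encode('ISO-8859-15', undef: :replace, replace: '?')
  latin.encode('UTF-8')
end

original = "\u6D4B\u8BD5\u624B\u673A"   # the four Chinese characters from the question
restored = lossy_roundtrip(original)

puts restored         # ????
p restored.bytes      # four copies of 63, the ASCII code of "?"
```

After the first conversion the file contains four literal question marks. Every Chinese string of the same length would produce the same bytes, so there is nothing left to decode.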
I'm trying to do a little challenge where you have to decode an 'Alien Message' located here
What I'm trying to do is force the encoding into ASCII in an attempt to decode the message. Here's what I have so far:
def gather_info
file = './lib/SETI_message.txt'
gather = File.read(file)
packed = [gather].pack('b*')
encoding_forced = packed.encode(Encoding::ASCII)
File.open('packed.txt', 'a+'){ |s| s.puts(encoding_forced) }
end
However I'm getting the following error:
main.rb:5:in `encode': "\xFF" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
from main.rb:5:in `gather_info'
from main.rb:9:in `<main>'
I have no idea what this error means. Can anyone explain what I'm doing wrong and how to go about fixing the encoding?
UPDATE:
I've discovered that the character encoding is IBM437 for the message using the following:
file = './lib/packed.txt'
gather = File.read(file)
puts gather.encoding
The problem with trying to encode the unpacked string to ASCII is that while the unpacked string is 8 bits (256 possible characters), ASCII covers only 7 bits (128 characters). So there is no way ruby can know how to encode (and possibly display) "characters" having their byte value above 127 and that's why you get the conversion error.
Anyway, converting the binary numbers to letters based on the ASCII table seems not the best approach for this type of task (unless the aliens used the ASCII table too :) ). I guess you need to work with the data as with numbers only.
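A short sketch of both points, using a made-up bit string rather than the real message (`bits_to_bytes` is a hypothetical helper, not part of the question's code):

```ruby
# Pack a string of "0"/"1" characters into raw bytes, least significant bit first,
# which is what the 'b*' directive used in the question does.
def bits_to_bytes(bits)
  [bits].pack('b*')
end

packed = bits_to_bytes('11111111' + '10000010')
p packed.bytes                      # [255, 65]

begin
  packed.encode('US-ASCII')
rescue Encoding::UndefinedConversionError => e
  # Byte 255 has no meaning in 7-bit ASCII, so Ruby refuses to convert it.
  puts "cannot convert: #{e.message}"
end

# Treating the data as numbers sidesteps text encodings entirely.
p packed.bytes.select { |byte| byte > 127 }
```

Working on `bytes` (or `unpack('C*')`) keeps the data numeric, so no encoding is ever consulted.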
How can I load a YAML file regardless of its encoding?
My YAML file can be encoded in UTF-8 or ANSI (that's what Notepad++ calls it - I guess it's Windows-1252):
:key1:
:key2: "ä"
utf8.yml is encoded in UTF-8, ansi.yml is encoded in ANSI. I load the files as follows:
# encoding: utf-8
Encoding.default_internal = "utf-8"
utf8_load = YAML::load(File.open('utf8.yml'))
utf8_load_file = YAML::load_file('utf8.yml')
ansi_load = YAML::load(File.open('ansi.yml'))
ansi_load_file = YAML::load_file('ansi.yml')
It seems like Ruby doesn't recognize the encoding correctly:
utf8_load [:key1][:key2].encoding #=> "UTF-8"
utf8_load_file [:key1][:key2].encoding #=> "UTF-8"
ansi_load [:key1][:key2].encoding #=> "UTF-8"
ansi_load_file [:key1][:key2].encoding #=> "UTF-8"
because the bytes aren't the same:
utf8_load [:key1][:key2].bytes #=> [195, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [239, 191, 189]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
If I omit Encoding.default_internal = "utf-8", the bytes are also different:
utf8_load [:key1][:key2].bytes #=> [195, 131, 194, 164]
utf8_load_file [:key1][:key2].bytes #=> [195, 164]
ansi_load [:key1][:key2].bytes #=> [195, 164]
ansi_load_file [:key1][:key2].bytes #=> [239, 191, 189]
What happens actually when I don't set the default_internal to utf-8?
Which encodings do the strings in both examples have?
How can I load a file even if I don't know its encoding?
I believe officially YAML only supports UTF-8 (and maybe UTF-16). There have historically been all sorts of encoding confusions in YAML libraries. I think you are going to run into trouble trying to have YAML in something other than a Unicode encoding.
What happens actually when I don't set the default_internal to utf-8?
Encoding.default_internal controls the encoding your input is converted to when it is read in, at least by those operations that respect it (not everything does). Rails seems to set it to UTF-8, so even if you don't set Encoding.default_internal yourself, it might already be UTF-8.
If Encoding.default_internal is nil, the operations that respect it won't convert anything: they leave the input in the encoding it was believed to originate in.
If you set it to something else, like say "WINDOWS-1252" Ruby would automatically convert your stuff to WINDOWS-1252 when it read it in with File.open, which would possibly confuse YAML::load when you pass the string that's now encoded and tagged as WINDOWS-1252 to it. Generally there's no good reason to do this, so leave Encoding.default_internal alone.
Note: The Ruby docs say:
"You should not set ::default_internal in Ruby code as strings created before changing the value may have a different encoding from strings created after the change. Instead you should use ruby -E to invoke Ruby with the correct default_internal."
See also: http://ruby-doc.org/core-1.9.3/Encoding.html#method-c-default_internal
Which encodings do the strings in both examples have?
I don't really know. One would have to look at the bytes and try to figure out whether they are legal bytes for various plausible encodings and, beyond being legal, whether they mean something likely to be intended.
For example take: "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ". That's a perfectly legal UTF-8 string, but as humans we know it's probably not intended, and is probably garbage, quite likely from the result of an encoding misinterpretation. But a computer has no way to know that, it's perfectly legal UTF-8, and, hey, maybe someone actually did mean to write "ÉGÉìÉRÅ[ÉfÉBÉìÉOÇÕìÔǵÇ≠ǻǢ", after all, I just did, when writing this post!
So you can try to interpret the bytes according to various encodings and see if any of them make sense.
You're really just guessing at this point. Which means...
How can I load a file even if I don't know its encoding?
Generally, you can not. You need to know and keep track of encodings. There's no real way to know what the bytes mean without knowing their encoding.
If you have some legacy data for which you've lost this, you've got to try to figure it out. Manually, or with some code that tries to guess likely encodings based on heuristics. Here's one Ruby gem Charlock Holmes that tries to guess, using the ICU library heuristics (this particular gem only works on MRI).
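A much cruder heuristic than ICU's can be sketched in plain Ruby. This is not how Charlock Holmes works; it simply assumes the data is either UTF-8 or one known legacy encoding, and `guess_utf8` is a name invented here.

```ruby
# Try UTF-8 first; if the bytes are not valid UTF-8, assume the fallback encoding.
# This is a guess: it cannot tell Windows-1252 apart from other single-byte encodings.
def guess_utf8(raw, fallback = 'Windows-1252')
  candidate = raw.b.force_encoding('UTF-8')
  return candidate if candidate.valid_encoding?

  raw.b.force_encoding(fallback).encode('UTF-8')
end

p guess_utf8("\xC3\xA4".b).bytes   # already UTF-8, left as is
p guess_utf8("\xE4".b).bytes       # read as Windows-1252, converted to UTF-8
```

The approach works reasonably well for this pair because non-ASCII Windows-1252 text is rarely valid UTF-8 by accident, but it remains a guess. A few bytes (0x81, for example) are undefined in Windows-1252 and would still raise during the conversion.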
What Ruby says in response to string.encoding is just the encoding the string is tagged with. The string can be tagged with the wrong encoding, the bytes in the string don't actually mean what is intended in the encoding it's tagged with... in which case you'll get garbage.
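The distinction between the tag and the bytes can be shown directly. In this sketch `mistagged_roundtrip` and `undo_mistag` are hypothetical helper names; `force_encoding` changes only the tag, while `encode` rewrites the bytes.

```ruby
# Take UTF-8 bytes, tag them with the wrong encoding, then transcode to UTF-8.
def mistagged_roundtrip(text, wrong_encoding)
  text.b.force_encoding(wrong_encoding).encode('UTF-8')
end

# Reverse the damage, assuming we know which wrong encoding was applied.
def undo_mistag(garbled, wrong_encoding)
  garbled.encode(wrong_encoding).force_encoding('UTF-8')
end

garbled = mistagged_roundtrip("\u00E4", 'Windows-1252')
p garbled.bytes                               # [195, 131, 194, 164]
p undo_mistag(garbled, 'Windows-1252').bytes  # [195, 164]
```

The first result matches the bytes reported in the question for the UTF-8 file loaded without default_internal. That is consistent with the file having been read as Windows-1252 and then converted to UTF-8, although the question does not show the default external encoding, so this is an inference.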
Ruby will do the right thing with your string, instead of creating garbage, only if the string's encoding tag is correct. For most input operations the tag defaults to Encoding.default_external (which usually starts out as UTF-8, or as ASCII-8BIT, which really means the null encoding: binary data not tagged with any text encoding). You can override it by passing an argument to File.open: File.open("something", "r:UTF-8") or, equivalently, File.open("something", "r", :encoding => "UTF-8"). The actual bytes are determined by whatever is in the file. It's up to you to tell Ruby the correct encoding so that it interprets those bytes as the text they were intended to mean.
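When the encoding of a legacy file is known, it can be declared at read time. The sketch below writes a throwaway Windows-1252 file standing in for ansi.yml and loads it. `load_yaml_as` is a hypothetical helper, and the example uses a plain string key rather than the symbol keys from the question, because recent Psych versions restrict symbols in YAML.load by default.

```ruby
require 'yaml'
require 'tmpdir'

# Read the file as source_encoding, convert to UTF-8 on the way in, then parse.
def load_yaml_as(path, source_encoding)
  text = File.read(path, encoding: "#{source_encoding}:UTF-8")
  YAML.load(text)
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'ansi.yml')
  # Write the document as raw Windows-1252 bytes.
  File.binwrite(path, "key: \"\u00E4\"\n".encode('Windows-1252'))

  value = load_yaml_as(path, 'Windows-1252')['key']
  p value.encoding
  p value.bytes
end
```

The "external:internal" form tells Ruby both what the bytes on disk are and what the resulting string should be, so the parser only ever sees UTF-8.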
There were a couple of posts recently on reddit /r/ruby that try to explain how to troubleshoot and work around encoding issues, which you may find helpful:
http://www.justinweiss.com/articles/how-to-get-from-theyre-to-theyre/
http://www.justinweiss.com/articles/3-steps-to-fix-encoding-problems-in-ruby/
Also, this is my favorite article on understanding encoding generally: http://kunststube.net/encoding/
For YAML files in particular, if I were you, I'd just make sure they are all in UTF-8. Life will be much easier and you won't have to worry about it. If you have some legacy ones that have become corrupted, it's going to be a pain to fix them, but that's what you've got to do, unless you can just rewrite them from scratch. Try to fix them to be in valid and correct UTF-8, and from here on out keep all your YAML in UTF-8.
The YAML specification says in "5.1. Character Set":
To ensure readability, YAML streams use only the printable subset of the Unicode character set. The allowed character range explicitly excludes the C0 control block #x0-#x1F (except for TAB #x9, LF #xA, and CR #xD which are allowed), DEL #x7F, the C1 control block #x80-#x9F (except for NEL #x85 which is allowed), the surrogate block #xD800-#xDFFF, #xFFFE, and #xFFFF.
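Note that these ranges are Unicode code points, not raw bytes, so this section restricts which characters may appear rather than which encoding the file uses. Windows-1252 places printable characters such as curly quotes and the euro sign in the byte range 0x80-0x9F, which ISO-8859-1 and Unicode reserve for the C1 controls, so a Windows-1252 file that is read as ISO-8859-1 can end up containing excluded characters. The "ä" in the question is a different case: in Windows-1252 it is the single byte 0xE4, which is not valid UTF-8 on its own, so a parser expecting UTF-8 substitutes U+FFFD (bytes 239, 191, 189), as seen in the question's output.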
On output, a YAML processor must only produce acceptable characters. Any excluded characters must be presented using escape sequences. In addition, any allowed characters known to be non-printable should also be escaped. This isn’t mandatory since a full implementation would require extensive character property tables.
These days, by default, Ruby uses UTF-8, however YAML isn't limited to that. The spec goes on to say in "5.2. Character Encodings":
On input, a YAML processor must support the UTF-8 and UTF-16 character encodings. For JSON compatibility, the UTF-32 encodings must also be supported.
If a character stream begins with a byte order mark, the character encoding will be taken to be as indicated by the byte order mark. Otherwise, the stream must begin with an ASCII character. This allows the encoding to be deduced by the pattern of null (#x00) characters.
So UTF-8, UTF-16 and UTF-32 are supported, but Ruby will assume UTF-8. If a BOM is present, some editors will show or report it, and a hex view always will. I haven't tried loading a UTF-16 or UTF-32 file to see what Ruby's YAML does, so that's left as an experiment.
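Checking for a BOM yourself is straightforward. The following is a sketch of the BOM half of the spec's rule only (it does not implement the null-byte pattern deduction), and `BOMS` and `bom_encoding` are names made up for the example:

```ruby
# Byte order marks, longest first so UTF-32LE is not mistaken for UTF-16LE.
BOMS = {
  "\x00\x00\xFE\xFF".b => 'UTF-32BE',
  "\xFF\xFE\x00\x00".b => 'UTF-32LE',
  "\xEF\xBB\xBF".b     => 'UTF-8',
  "\xFE\xFF".b         => 'UTF-16BE',
  "\xFF\xFE".b         => 'UTF-16LE'
}.freeze

# Return the encoding name indicated by a leading BOM, or nil if there is none.
def bom_encoding(bytes)
  data = bytes.b
  BOMS.each { |bom, name| return name if data.start_with?(bom) }
  nil
end

p bom_encoding("\xEF\xBB\xBFkey: 1")
p bom_encoding("\xFF\xFEk\x00e\x00y\x00")
p bom_encoding("key: 1")
```

For files that may or may not carry a UTF-8 BOM, Ruby can also strip it on read with the mode string "r:BOM|UTF-8".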