Error while forcing encoding in text file - ruby

I'm trying to do a little challenge where you have to decode an 'Alien Message' located here.
What I'm trying to do is force the encoding into ASCII in an attempt to decode the message. Here's what I have so far:
def gather_info
  file = './lib/SETI_message.txt'
  gather = File.read(file)
  packed = [gather].pack('b*')
  encoding_forced = packed.encode(Encoding::ASCII)
  File.open('packed.txt', 'a+') { |s| s.puts(encoding_forced) }
end
However I'm getting the following error:
main.rb:5:in `encode': "\xFF" to UTF-8 in conversion from ASCII-8BIT to UTF-8 to US-ASCII (Encoding::UndefinedConversionError)
from main.rb:5:in `gather_info'
from main.rb:9:in `<main>'
I have no idea what this error means. Can anyone explain what I'm doing wrong, and how to go about fixing the encoding?
UPDATE:
I've discovered that the character encoding of the message is IBM437, using the following:
file = './lib/packed.txt'
gather = File.read(file)
puts gather.encoding

The problem with trying to encode the unpacked string to ASCII is that while the unpacked string uses full 8-bit bytes (256 possible values), ASCII covers only 7 bits (128 characters). So there is no way Ruby can know how to encode (and possibly display) "characters" whose byte value is above 127, and that's why you get the conversion error.
Anyway, converting the binary numbers to letters based on the ASCII table does not seem like the best approach for this type of task (unless the aliens used the ASCII table too :) ). I guess you need to work with the data as numbers only.
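A minimal sketch of that limit, using a made-up byte rather than the real message data:

binary = "\xFF".force_encoding(Encoding::BINARY)
binary.encode(Encoding::ASCII)
# => raises Encoding::UndefinedConversionError, 0xFF has no 7-bit equivalent

# Treating the data as plain numbers sidesteps encodings entirely:
binary.bytes.to_a
# => [255]

Once the bytes are numbers, you can analyse them without ever asking Ruby to map them to printable characters.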

Related

Convert UTF-8 to CP1252 ruby 2.2

How to keep all characters when converting from UTF-8 to CP1252 on Ruby 2.2?
This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be CP1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this info to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
> d:/1 descricao.txt
PS: it is a Ruby script, not a Ruby on Rails application.
UTF-8 covers the entire range of Unicode, but CP1252 includes only a small subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327, is a combining character and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
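To make the two forms concrete, here is a small illustration (the codepoints are from the Unicode tables, not from the question's actual file name):

decomposed = "c\u0327"                    # 'c' followed by the combining cedilla U+0327
composed   = decomposed.unicode_normalize(:nfc)
composed                                  # => "ç", the single character U+00E7
composed.encode('cp1252')                 # works, because U+00E7 exists in CP1252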

Ruby, pack encoding (ASCII-8BIT that cannot be converted to UTF-8)

puts "C3A9".lines.to_a.pack('H*').encoding
results in
ASCII-8BIT
but I would prefer this text in UTF-8. However,
"C3A9".lines.to_a.pack('H*').encode("UTF-8")
results in
`encode': "\xC3" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
why? How can I convert it to UTF-8?
You're going about this the wrong way. If you have URI-encoded data like this:
%C5%BBaba
Then you should use URI.unescape to decode it:
1.9.2-head :004 > URI.unescape('%C5%BBaba')
=> "Żaba"
If that doesn't work then force the encoding to UTF-8:
1.9.2-head :004 > URI.unescape('%C5%BBaba').force_encoding('utf-8')
=> "Żaba"
ASCII-8BIT is a pretend encoding native to Ruby. It has the alias BINARY, and it is just that: ASCII-8BIT is not a character encoding, but rather a way of saying that a string is binary data and not to be processed like text. Because pack/unpack functions are designed to operate on binary data, you should never assume that what is returned is printable under any encoding unless the ENTIRE pack string is made up of character derivatives. If you clarify what the overall goal is, maybe we could give you a better solution.
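Tying that back to the original snippet: if you already know the packed bytes form valid UTF-8, retagging the string (rather than transcoding it) is enough. A sketch:

s = ["C3A9"].pack('H*')      # => "\xC3\xA9", tagged ASCII-8BIT
s.force_encoding('UTF-8')    # => "é"
s.valid_encoding?            # => true, so the retag was safe

encode tries to convert characters between character sets and fails because ASCII-8BIT carries no character information; force_encoding merely changes the label on bytes that are already correct.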
If you isolate a hex UTF-8 code into a variable, say code, which is a string in hexadecimal format minus the percent sign:
utf_char = [code.to_i(16)].pack("U")
Combine these with the rest of the string and you can build your string.
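Note that pack("U") expects a Unicode codepoint, so this works when code holds a codepoint in hex. A hypothetical example (Ż is U+017B):

code = "17B"
utf_char = [code.to_i(16)].pack("U")   # => "Ż", already tagged UTF-8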

Converting integers to UTF-8 (Korean)

I'm running Ruby 1.9.2 and trying to fix some broken UTF-8 text input where the text is literally "\\354\\203\\201\\355\\221\\234\\353\\252\\205" and change it into its correct Korean "상표명"
However after searching for a while and trying a few methods I still get out gibberish.
It's confusing, as the escaped-characters example on line 3 works fine:
# encoding: utf-8
puts "상표명" # Target string
# Output: "상표명"
puts "\354\203\201\355\221\234\353\252\205" # Works with escaped characters like this
# Output: "상표명"
# Real input is a string
input = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
# After some manipulation got it into an array of numbers
puts [354, 203, 201, 355, 221, 234, 353, 252, 205].pack('U*').force_encoding('UTF-8')
# Output: ŢËÉţÝêšüÍ (gibberish)
I'm sure this must have been answered somewhere but I haven't managed to find it.
This is what you want to do to get your UTF-8 Korean text:
s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
k = s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").force_encoding('utf-8')
# "상표명"
And this is how it works:
The input string is nice and regular, so we can use scan to pull out the individual numbers.
Then a map with to_i(8) to convert the octal values (as noted by Henning Makholm) to integers.
Now we need to convert our list of integers to bytes so we pack('C*') to get a byte string. This string will have the BINARY encoding (AKA ASCII-8BIT).
We happen to know that the bytes really do represent UTF-8 so we can force the issue with force_encoding('utf-8').
The main thing that you were missing was your pack format: 'U' means "UTF-8 character" and expects an array of Unicode codepoints, each represented by a single integer; 'C' expects an array of bytes, and that's what we had.
The \354 and so forth are octal escapes, not decimal, so you cannot just write them as 354 to get the integer values of the bytes.
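For anyone following along in irb, here are the intermediate values of that chain (easy to verify yourself):

s = "\\354\\203\\201\\355\\221\\234\\353\\252\\205"
s.scan(/\d+/)
# => ["354", "203", "201", "355", "221", "234", "353", "252", "205"]
s.scan(/\d+/).map { |n| n.to_i(8) }
# => [236, 131, 129, 237, 145, 156, 235, 170, 133]
s.scan(/\d+/).map { |n| n.to_i(8) }.pack("C*").encoding
# => #<Encoding:ASCII-8BIT>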

display iso-8859-1 encoded data gives strange characters

I have an ISO-8859-1 encoded CSV file that I try to open and parse with Ruby:
require 'csv'
filename = File.expand_path('~/myfile.csv')
file = File.open(filename, "r:ISO-8859-1")
CSV.parse(file.read, col_sep: "\t") do |row|
  puts row
end
If I leave out the encoding from the call to File.open, I get an error
ArgumentError: invalid byte sequence in UTF-8
My problem is that the call to puts row displays strange characters instead of the Norwegian characters æ, ø, å:
BOKF�RINGSDATO
I get the same if I open the file in TextMate, forcing it to use UTF-8 encoding.
By assigning the file content to a string, I can check the encoding used for the string. As expected, it shows ISO-8859-1.
So when I puts each row, why does it output the string as UTF-8?
Is it something to do with the csv-library?
I use ruby 1.9.2.
Found myself an answer by trying different things from the documentation:
require 'csv'
filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
    # ↳ encode returns a copy transcoded to UTF-8
    puts row
  end
end
As you can see, all I have done is encode the string to a UTF-8 string before the CSV parser gets it.
Edit:
Trying this solution on macruby-head, I get the following error message from encode():
Encoding::InvalidByteSequenceError: "\xD8" on UTF-8
Even though I specify the encoding when opening the file, MacRuby uses UTF-8.
This seems to be a known MacRuby limitation: encoding is always UTF-8.
Maybe you could use Iconv to convert the file contents to UTF-8 before parsing?
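A rough sketch of that Iconv fallback (Iconv ships with Ruby 1.9, though it was deprecated later; whether it works on MacRuby is an open question):

require 'csv'
require 'iconv'

raw  = File.open(File.expand_path('~/myfile.csv'), 'rb') { |f| f.read }
utf8 = Iconv.conv('UTF-8', 'ISO-8859-1', raw)
CSV.parse(utf8, col_sep: "\t") do |row|
  puts row
end

Reading in binary mode keeps Ruby from applying any default external encoding before Iconv does the conversion.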
ISO-8859-1 and Win-1252 are really close in their character sets. Could some app have processed the file and converted it? Or could it have been received from a machine that was defaulting to Win-1252, which is Windows' standard setting?
Software that senses the code-set can get the encoding wrong if there are no characters in the 0x80 to 0x9F byte range, so you might try changing file = File.open(filename, "r:ISO-8859-1") to file = File.open(filename, "r:Windows-1252"). (I think "Windows-1252" is the right encoding name.)
I used to write spiders, and HTML is notorious for being mis-labeled or for having encoded binary characters from one character set embedded in another. I used some bad language many times over these problems several years ago, before most languages had implemented UTF-8 and Unicode so I understand the frustration.
See also: ISO/IEC 8859-1, Windows-1252.

How to remove all non-ASCII characters from a string in Ruby

It seems to be a very simple and much-needed method: I need to remove all non-ASCII characters, e.g. ©, from a string. See the following example.
#coding: utf-8
s = " Hello this a mixed string © that I made."
puts s.encoding
puts s.encode
output:
UTF-8
Hello this a mixed string © that I made.
When I feed this to Watir, it produces the following error: incompatible character encodings: UTF-8 and ASCII-8BIT.
So my problem is that I want to get rid of all non-ASCII characters before using it. I will not know which encoding the source string "s" uses.
I have been searching and experimenting for quite some time now.
If I try to use
puts s.encode('ASCII-8BIT')
It gives the error:
`encode': "\xC2\xA9" from UTF-8 to ASCII-8BIT (Encoding::UndefinedConversionError)
You can just literally translate what you asked into a Regexp. You wrote:
I want to get rid of all non ASCII characters
We can rephrase that a little bit:
I want to substitute all characters which don't have the ASCII property with nothing
And that's a statement that can be directly expressed in a Regexp:
s.gsub!(/\P{ASCII}/, '')
As an alternative, you could also use String#delete!:
s.delete!("^\u{0000}-\u{007F}")
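Applied to the sample string from the question, both approaches give the same result (using the non-destructive variants here):

s = " Hello this a mixed string © that I made."
s.gsub(/\P{ASCII}/, '')
# => " Hello this a mixed string  that I made."
s.delete("^\u{0000}-\u{007F}")
# => " Hello this a mixed string  that I made."

The bang versions (gsub! and delete!) modify s in place and return nil when nothing was removed, which is worth remembering if you chain calls.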
Strip out the characters using a regex. This example is in C#, but the regex should be the same: How can you strip non-ASCII characters from a string? (in C#).
Translating it into Ruby using gsub should not be difficult.
UTF-8 is a variable-length encoding. When a character occupies one byte, its value coincides with 7-bit ASCII. So why don't you just look for bytes with a '1' in the MSB, and then remove both them and their trailers? A byte beginning with '110' will be followed by one additional byte. A byte beginning with '1110' will be followed by two. And a byte beginning with '11110' will be followed by three, the maximum supported by UTF-8.
This is all just off the top of my head. I could be wrong.
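For what it's worth, that byte-walking idea can be sketched directly in Ruby. This is only an illustration of the scheme described above and assumes the input is valid UTF-8; strip_non_ascii is a made-up helper name:

def strip_non_ascii(str)
  bytes = str.bytes.to_a
  out   = []
  i     = 0
  while i < bytes.length
    b = bytes[i]
    if b < 0x80        # 0xxxxxxx: single-byte ASCII character, keep it
      out << b
      i += 1
    elsif b >= 0xF0    # 11110xxx: lead byte of a 4-byte sequence, skip all four
      i += 4
    elsif b >= 0xE0    # 1110xxxx: lead byte of a 3-byte sequence, skip all three
      i += 3
    else               # 110xxxxx: lead byte of a 2-byte sequence, skip both
      i += 2
    end
  end
  out.pack('C*').force_encoding('UTF-8')
end

strip_non_ascii(" Hello this a mixed string © that I made.")
# => " Hello this a mixed string  that I made."

In practice the gsub(/\P{ASCII}/, '') answer above does the same job with far less code, but walking the bytes shows why multi-byte sequences disappear as whole units.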
