hex representation of german umlaut to string - ruby

I have a String that has non ascii characters encoded as "\\'fc" (without quotes), where fc is hex 252 which corresponds to the german ü umlaut.
I managed to find all occurences and can replace them. But I have not been able to convert the fc to an ü.
"fc".hex.chr
gives me another representation...but if I do
puts "fc".hex.chr
I get nothing back...
Thanks in advance
PS: I'm working on ruby 1.9 and have
# coding: utf-8
at the top of the file.

fc is not the correct UTF-8 codepoint for that character; that's iso-8859-1 or windows-1252. The UTF-8 encoding for ü is the two-byte sequence, c3bc. Further, FC is not a valid UTF-8 sequence.
Since UTF-8 is assumed in Ruby 1.9, you should be able to get the literal u-umlaut with: "\xc3\xbc"

Have you tried
puts "fc".hex.chr(Encoding::UTF_8)
Ruby docs:
int.chr
Encoding
UPDATE:
Jason True is right. fc is invalid UTF-8. I have no idea why my example works!

Related

Text file encoding properly as utf-8 in Scite editor but fails to encode to uft-8 in ruby

I have a text file which if viewed in the Scite editor with the encoding set to utf-8, displays all text correctly, including capital letters with an accute accent (i.e. Á).
However, if I write a ruby script and use mystring.encode("utf-8") it will give me this error on capital letters that carry an acute accent (i.e. Á):
encode': "\x81" to UTF-8 in conversion from Windows-1252 to UTF-8 (Encoding::UndefinedConversionError)
Is this expected behaviour? How can I encode the whole text to utf-8 using ruby, knowing that otherwise it does get successfully encoded in the Scite editor?
Code:
ine_file = File.open("../../_data/ine_spain_demographics.csv", 'r')
ine_towns_population_hash = Hash.new
ine_file.each do|line|
values = line.split(";")
town_name = values[3]
population = values[4]
begin
ine_towns_population_hash[town_name.encode("utf-8")] = population
rescue
puts "problematic string: " + town_name
end
end
You say that ine_file.external_encoding says Windows-1252 so the file is being opened as a Windows-1252 encoded file. Then you say town_name.encode("utf-8") in an attempt to encoded a string as UTF-8 and Ruby complains. But the file is actually UTF-8; reading UTF-8 bytes as Windows-1252 and then trying to recode those bytes as UTF-8 isn't going to work.
You need to open the file in UTF-8 mode:
File.open("../../_data/ine_spain_demographics.csv", 'r:UTF-8')
and stop trying to change the encoding of town_name, just use town_name as-is.
It seems like it's misinterpreting the encoding of ine_spain_demographics.csv.
Looking at the doc's for encode and open you have two options:
Use replace in encode to tell Ruby what character to use town_name.encode("utf-8", replace: '').
Identify the correct file encoding and specify it: File.open("../../_data/ine_spain_demographics.csv", 'r:ISO-8859-1')

Text file encoding

I'm having a problem trying to read in windows a CSV file generated in MAC.
My question is how can I convert the encoding to UTF-8 or even ISO-8859-1.
I've already tried iconv with no success.
Inside "vim" I can understand that in this file linebreaks are marked with ^M and the accent ã is marked with <8b>, Ç = <82> and so on.
Any ideas?
To convert from encoding a to encoding b, you need to know what encoding a is.
Given that
ã is marked with <8b>, Ç = <82>
encoding a is very likely Mac OS Roman.
So call iconv with macintosh* as from argument, and utf-8 as to argument.
*try macroman, x-mac-roman etc if macintosh is not found

Confusion with ARGF#set_encoding

ARGF.set_encoding says:
If single argument is specified, strings read from ARGF are tagged with the encoding specified.
If two encoding names separated by a colon are given, e.g. "ascii:utf-8", the read string is converted from the first encoding (external encoding) to the second encoding (internal encoding), then tagged with the second encoding.
So I tried the below:
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding('ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
No encoding change is found. So please advice me the correct approach.
Encoding IBM437 and ASCII (and UTF-8) has the same byte sequence for ASCII characters. So you won't see the difference from String#inspect. However, you can check the String#encoding value for the input strings.
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/).map{|s| s.encoding}
In Ruby (1.9 and higher version), String is a byte sequence tagged with some encoding. You can get the encoding from String#encoding.
So the Chinese word "中" can be represented different ways:
e4 b8 ad # tagged with encoding UTF-8
d6 d0 # tagged with encoding GBK
2d 4e # tagged with encoding UTF-16le
I will always write my script in UTF-8, that is, the internal encoding for my script is UTF-8. Some times I want to process text file (e.g. named "a.txt" and has content "中") encoded with GBK. Then I can set the external encoding and the internal encoding for the IO object and Ruby will do the conversion for me.
ARGF.set_encoding('GBK', 'UTF-8')
str = ARGF.readline
puts str.encoding
# run $ script.rb a.txt
Ruby reads "\xd6\xd0" from "a.txt" and since I have specified the external encoding as GBK, it tags the data with encoding GBK. And I have specified the internal encoding as UTF-8 so Ruby do a conversion from GBK byte sequence to UTF-8, which results in "\xe4\xb8\xad" with tag UTF-8. And this string has the same encoding as other strings in my script, so I can use it with ease.
This is useful because a lot of String methods fail when the two String operands has different, incompatible encoding. For example:
# encoding: utf-8
a = "中" # tagged with UTF-8
b = "中".encode('gbk') # tagged with GBK
puts a + b
#=> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and GBK

`gsub': incompatible character encodings: UTF-8 and IBM437

I try to use search, google but with no luck.
OS: Windows XP
Ruby version 1.9.3po
Error:
`gsub': incompatible character encodings: UTF-8 and IBM437
Code:
require 'rubygems'
require 'hpricot'
require 'net/http'
source = Net::HTTP.get('host', '/' + ARGV[0] + '.asp')
doc = Hpricot(source)
doc.search("p.MsoNormal/a").each do |a|
puts a.to_plain_text
end
Program output few strings but when text is ”NOŻYCE” I am getting error above.
Could somebody help?
You could try converting your HTML to UTF-8 since it appears the original is in vintage-retro DOS format:
source.encode!('UTF-8')
That should flip it from 8-bit ASCII to UTF-8 as expected by the Hpricot parser.
The inner encoding of the source variable is UTF-8 but that is not what you want.
As tadman wrote, you must first tell Ruby that the actual characters in the string are in the IBM437 encoding. Then you can convert that string to your favourite encoding, but only if such a conversion is possible.
source.force_encoding('IBM437').encode('UTF-8')
In your case, you cannot convert your string to ISO-8859-2 because not all IBM437 characters can be converted to that charset. Sticking to UTF-8 is probably your best option.
Anyway, are you sure that that file is actually transmitted in IBM437? Maybe it is stored as such in the HTTP server but it is sent over-the-wire with another encoding. Or it may not even be exactly in IBM437, it may be CP852, also called MS-DOC Latin 2 (different from ISO Latin 2).

Determine if a text file without BOM is UTF8 or ASCII

Long story short:
+ I'm using ffmpeg to check the artist name of a MP3 file.
+ If the artist has asian characters in its name the output is UTF8.
+ If it just has ASCII characters the output is ASCII.
The output does not use any BOM indication at the beginning.
The problem is if the artist has for example a "ä" in the name it is ASCII, just not US-ASCII so "ä" is not valid UTF8 and is skipped.
How can I tell whether or not the output text file from ffmpeg is UTF8 or not? The application does not have any switches and I just think it's plain dumb not to always go with UTF8. :/
Something like this would be perfect:
http://linux.die.net/man/1/isutf8
If anyone knows of a Windows version?
Thanks a lot in before hand guys!
This program/source might help you:
Detect Encoding for In- and Outgoing
Detect the encoding of a text without BOM (Byte Order Mask) and choose the best Encoding ...
You say, "ä" is not valid UTF-8 ... This is not correct...
It seems you don't have a clear understanding of what UTF-8 is. UTF-8 is a system of how to encode Unicode Codepoints. The issue of validity is notin the character itself, it is a question of how has it been encoded...
There are many systems which can encode Unicode Codepoints; UTF-8 is one and UTF16 is another... "ä" is quite legal in the UTF-8 system.. Actually all characters are valid, so long as that character has a Unicode Codepoint.
However, ASCII has only 128 valid values, which equate identically to the first 128 characters in the Unicode Codepoint system. Unicode itself is nothing more that a big look-up table. What does the work is teh encoding system; eg. UTF-8.
Because the 128 ASCII characters are identical to the first 128 Unicode characters, and because UTF-8 can represent these 128 values is a single byte, just as ASCII does, this means that the data in an ASCII file is identical to a file with the same date but which you call a UTF-8 file. Simply put: ASCII is a subset of UTF-8... they are indistinguishable for data in the ASCII range (ie, 128 characters).
You can check a file for 7-bit ASCII compliance..
# If nothing is output to stdout, the file is 7-bit ASCII compliant
# Output lines containing ERROR chars -- to stdout
perl -l -ne '/^[\x00-\x7F]*$/ or print' "$1"
Here is a similar check for UTF-8 compliance..
perl -l -ne '/
^( ([\x00-\x7F]) # 1-byte pattern
|([\xC2-\xDF][\x80-\xBF]) # 2-byte pattern
|((([\xE0][\xA0-\xBF])|([\xED][\x80-\x9F])|([\xE1-\xEC\xEE-\xEF][\x80-\xBF]))([\x80-\xBF])) # 3-byte pattern
|((([\xF0][\x90-\xBF])|([\xF1-\xF3][\x80-\xBF])|([\xF4][\x80-\x8F]))([\x80-\xBF]{2})) # 4-byte pattern
)*$ /x or print' "$1"

Resources