Convert non-ASCII chars from ASCII-8BIT to UTF-8 - ruby

I'm pulling text from remote sites and trying to load it into a Ruby 1.9/Rails 3 app that uses utf-8 by default.
Here is an example of some offending text:
Cancer Res; 71(3); 1-11. ©2011 AACR.\n
That copyright symbol, with its bytes expanded, looks like this:
Cancer Res; 71(3); 1-11. \xC2\xA92011 AACR.\n
Ruby tells me that string is encoded as ASCII-8BIT, and feeding it into my Rails app gets me this:
incompatible character encodings: ASCII-8BIT and UTF-8
I can replace the offending non-ASCII bytes with '?' using this regex
str.gsub(/[^\x00-\x7F]/n, '?')
to produce this
Cancer Res; 71(3); 1-11. ??2011 AACR.\n
But how can I get a copyright symbol (and various other symbols such as greek letters) converted into the same symbols in UTF-8? Surely it is possible...
I see references to using force_encoding but this does not work:
str.force_encoding('utf-8').encode
I realize there are many other people with similar issues but I've yet to see a solution that works.

This works for me:
#encoding: ASCII-8BIT
str = "\xC2\xA92011 AACR"
p str, str.encoding
#=> "\xC2\xA92011 AACR"
#=> #<Encoding:ASCII-8BIT>
str.force_encoding('UTF-8')
p str, str.encoding
#=> "©2011 AACR"
#=> #<Encoding:UTF-8>

There are two possibilities:
1. The input data is already UTF-8, but Ruby just doesn't know it. That seems to be your case, as "\xC2\xA9" is valid UTF-8 for the copyright symbol. In that case you just need to tell Ruby that the data is already UTF-8, using force_encoding.
For example, "\xC2\xA9".force_encoding('ASCII-8BIT') would recreate the relevant bit of your input data, and "\xC2\xA9".force_encoding('ASCII-8BIT').force_encoding('UTF-8') would demonstrate that you can tell Ruby it is really UTF-8 and get the desired result.
2. The input data is in some other encoding and you need Ruby to transcode it to UTF-8. In that case you have to tell Ruby what the current encoding is (ASCII-8BIT is Ruby-speak for binary, not a real encoding), then tell Ruby to transcode it.
For example, say your input data was ISO-8859-1. In that encoding the copyright symbol is just "\xA9". "\xA9".force_encoding('ISO-8859-1') would generate such a bit of data, and "\xA9".force_encoding('ISO-8859-1').encode('UTF-8') would demonstrate transcoding it to UTF-8.
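Putting the two cases together, here's a minimal sketch (the helper name and the ISO-8859-1 fallback are my assumptions, not something from the question): retag the bytes if they are already valid UTF-8, otherwise transcode from the assumed source encoding.
# Hypothetical helper; the fallback encoding is an assumption.
def to_utf8(str, assumed = 'ISO-8859-1')
  utf8 = str.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?              # case 1: already UTF-8, just mislabeled
  str.dup.force_encoding(assumed).encode('UTF-8')  # case 2: transcode
end

to_utf8("\xC2\xA9".force_encoding('ASCII-8BIT'))  #=> "©"
to_utf8("\xA9".force_encoding('ASCII-8BIT'))      #=> "©"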

I used to do this for a script that scraped Greek Windows-encoded pages, using open-uri, iconv and Hpricot:
doc = open(DATA_URL)
doc.rewind
data = Hpricot(Iconv.conv('utf-8', "WINDOWS-1253", doc.readlines.join("\n")))
I believe that was Ruby 1.8.7, not sure how things are with ruby 1.9
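On Ruby 1.9+ the same can be done without Iconv. A rough sketch, swapping the unmaintained Hpricot for Nokogiri and assuming DATA_URL as above:
require 'open-uri'
require 'nokogiri'

# String#encode transcodes from Windows-1253 to UTF-8 directly.
doc = Nokogiri::HTML(open(DATA_URL).read.encode('UTF-8', 'WINDOWS-1253'))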

I've been having issues with character encoding, and the other answers have been helpful, but they didn't work for every case. Here's the solution I came up with: it forces the encoding when possible and transcodes, replacing invalid characters with '?', when it's not:
def encode str
  encoded = str.dup.force_encoding('UTF-8')
  unless encoded.valid_encoding?
    # Note: force_encoding mutates its receiver, hence the dup above.
    # Without it, str would already be tagged UTF-8 at this point, and
    # encoding a string to the encoding it already has is a no-op.
    encoded = str.encode('utf-8', invalid: :replace, undef: :replace, replace: '?')
  end
  encoded
end
force_encoding works most of the time, but I've encountered some strings where that fails. Strings like this will have invalid characters replaced:
str = "don't panic: \xD3"
str.valid_encoding?
false
str = str.encode("utf-8", invalid: :replace, undef: :replace, replace: '?')
"don't panic: ?"
str.valid_encoding?
true
Update: I have had some issues in production with the above code. I recommend that you set up unit tests with known problem text to make sure that this code works for you like you need it to. Once I come up with version 2 I'll update this answer.
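For what it's worth, on Ruby 2.1 or later String#scrub does this replacement directly and sidesteps the same-encoding no-op noted in the comment above:
str = "don't panic: \xD3"
str.valid_encoding?  #=> false
str.scrub('?')       #=> "don't panic: ?"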

Related

How to keep all characters converting from UTF-8 to CP1252 on ruby 2.2
This code:
file = 'd:/1 descrição.txt'
puts file.encode('cp1252')
gives this error:
`encode': U+0327 to WINDOWS-1252 in conversion from UTF-8 to WINDOWS-1252 (Encoding::UndefinedConversionError)
My application needs to be cp1252, but I can't find any way to keep all the characters.
I can't replace these characters, because later I will use this info to read the file from the file system.
puts file.encode('cp1252', undef: :replace, replace: '')
#=> d:/1 descricao.txt
ps: It is a ruby script not a ruby on rails application
UTF-8 covers the entire range of Unicode, but CP1252 includes only a small subset of it. Obviously this means that there are characters that can be encoded in UTF-8 but not in CP1252. This is the problem you are facing.
In your example it looks like the string only contains characters that should work in CP1252, but clearly it doesn’t.
The character in the error message, U+0327 is a combining character, and is not representable in CP1252. It combines with the preceding c to produce ç. ç can also be represented as a single character (U+00E7), which is representable in CP1252.
One option might be normalisation, which will convert the string into a form that is representable in CP1252.
file = 'd:/1 descrição.txt'.unicode_normalize(:nfc)
puts file.encode('cp1252')
(It appears that Stack Overflow is normalizing the string when displaying your question, which is probably why copying the code from the question and running it doesn’t produce any errors.)
This will avoid the error, but note that it is not necessarily possible to reverse the process to get the original string unless the original is in a known normalized form already.
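To see what the normalisation is doing, here's a small sketch comparing the decomposed and precomposed forms of ç:
nfd = "c\u0327"                    # 'c' followed by combining cedilla (U+0327)
nfc = nfd.unicode_normalize(:nfc)  # precomposed "ç" (U+00E7)
nfd.codepoints       #=> [99, 807]
nfc.codepoints       #=> [231]
nfc.encode('cp1252') # works
nfd.encode('cp1252') # raises Encoding::UndefinedConversionError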

Ruby URI.extract returns empty array or ArgumentError: invalid byte sequence in UTF-8

I'm trying to get a list of files from url like this:
require 'uri'
require 'open-uri'
url = 'http://www.wmprof.com/media/niti/download'
html = open(url).read
puts URI.extract(html).select{ |link| link[/(PL)/]}
This code raises ArgumentError: invalid byte sequence in UTF-8 on the line with URI.extract (even though html.encoding returns UTF-8).
I've found some solutions to encoding problems, but when I change the code to
html.encode('UTF-8', invalid: :replace, undef: :replace, replace: '?')
URI.extract returns an empty array, even when I'm not calling the select method on it. Any suggestions?
The character encoding of the website might be ISO-8859-1 or a related one. We can't tell for sure since there are only two occurrences of the same non-US-ASCII-character and it doesn't really matter anyway.
html.each_char.reject(&:ascii_only?) # => ["\xDC", "\xDC"]
Finding the actual encoding is done by guessing. The age of HTML 3.2 or the language(s) used might be a clue. In this case the content of the PDF file is especially helpful (it contains SPRÜH-EX and the file has the name TI_DE_SPR%dcH_EX.pdf). Then we only need to find the encoding for which "\xDC" and "Ü" are equal, either by knowing it or by writing some Ruby:
Encoding.list.select { |e| "Ü" == "\xDC".encode!(Encoding::UTF_8, e) rescue next }.map(&:name)
Of course, letting a program do the guessing is an option too. There is the libguess library. The web browser can do it as well, though you'd need to download the file first, since the server might tell the browser it's UTF-8 even when it isn't (as in this case). Any decent text editor will also try to detect the file encoding: ST3, for example, thinks it's Windows-1252, which is a superset of ISO-8859-1 (like UTF-8 is of US-ASCII).
Possible solutions are manually setting the string encoding to ISO-8859-1:
html.force_encoding(Encoding::ISO_8859_1)
Or (preferably) transcoding the string from ISO-8859-1 to UTF-8:
html.encode!(Encoding::UTF_8, Encoding::ISO_8859_1)
To answer the other question: URI.extract isn't the method you're looking for. Apparently it's obsolete and, more importantly, it doesn't extract relative URIs.
A simple alternative is a regular expression with String#scan. It works with this site, but it might not with others. For the best reliability you'd have to use an HTML parser (there might also be a gem for this). Here's an example that should do what you want:
html.scan(/href="(.*?PL.*?)"/).flatten # => ["SI_PL_ACTIV_bicompact.pdf", ...]
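If you do reach for a parser, a sketch with Nokogiri (assuming the gem is installed and html has been transcoded to UTF-8 as above) might look like this:
require 'nokogiri'

doc = Nokogiri::HTML(html)
doc.css('a[href]').map { |a| a['href'] }.grep(/PL/)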

Ruby character transliteration

What's the current best way to transliterate characters to 7-bit ASCII in Ruby? Most of the questions I've seen on SO are 3 or 4 years old, and the solutions don't fully work.
I want a method that will work for a wide range of Latin alphabets and, for example, convert
Your résumé’s a non–encyclopædia
to
Your resume's a non-encyclopaedia
but I cannot find a way that does that, particularly for folding 8-bit ASCII to 7-bit ASCII.
s = "Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia"
puts Iconv.iconv('ascii//ignore//translit', 'utf-8', s)
# => Your r'esum'e's a non-encyclopaedia
puts s.encode('ascii//ignore//translit', 'utf-8')
# => Encoding::ConverterNotFoundError: code converter not found (UTF-8 to ascii//ignore//translit)
puts s.encode('ascii', 'utf-8')
# Encoding::UndefinedConversionError: U+00E9 from UTF-8 to US-ASCII
puts s.encode('ascii', 'utf-8', invalid: :replace, undef: :replace)
# Your r?sum??s a non?encyclop?dia
puts I18n.transliterate(s)
# Your resume?s a non?encyclopaedia
Since Iconv is deprecated I'd rather not use that if I don't have to, but I'd do it if that is the only thing that works. Obviously I could put in custom 8-bit ASCII to 7-bit ASCII translations, but I'd prefer to use a supported solution that has been thoroughly tested.
The translation is handled fine by International Components for Unicode with its Latin-ASCII translation, but that is only available for Java and C.
UPDATE
What I ended up doing was writing my own character translation routines to take care of punctuation and whitespace, after which I could use I18n.transliterate to do the rest. I'd still prefer finding and using a well-maintained library function to handle the stuff I18n does not.
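A minimal sketch of that approach (the punctuation map is illustrative and incomplete, and it assumes the i18n gem is loaded as in the question):
# Hypothetical punctuation map; extend as needed.
PUNCT = {
  "\u2018" => "'", "\u2019" => "'",  # single quotes
  "\u201C" => '"', "\u201D" => '"',  # double quotes
  "\u2013" => '-', "\u2014" => '-'   # dashes
}.freeze

def to_ascii(s)
  I18n.transliterate(s.gsub(Regexp.union(PUNCT.keys), PUNCT))
end

to_ascii("Your r\u00e9sum\u00e9\u2019s a non\u2013encyclop\u00e6dia")
#=> "Your resume's a non-encyclopaedia"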
If you're willing to add a somewhat heavy dependency (unless you're already on Rails), ActiveSupport has support (pun not intended) for this:
"Your r\u00e9sum\u00e9\u2019s not an encyclop\u00e6dia".mb_chars.normalize(:kd).chars.to_a.delete_if { |c| !c.ascii_only? }.join('')
This works for all of the letters. It doesn't handle the apostrophe right yet though.
I guess the removeaccents script is just what you want.
Maybe UnicodeUtils gem can be useful, but only to remove the accents (not to convert things like æ AFAIK).

How to convert a string to UTF8 in Ruby

I'm writing a crawler which uses Hpricot. It downloads a list of strings from some webpage, then I try to write it to the file. Something is wrong with the encoding:
"\xC3" from ASCII-8BIT to UTF-8
I have items which are rendered on a webpage and printed this way:
Développement
str.encoding returns UTF-8, so force_encoding('UTF-8') doesn't help. How can I convert this to readable UTF-8?
Your string seems to have been encoded the wrong way round:
"Développement".encode("iso-8859-1").force_encoding("utf-8")
#=> "Développement"
Seems your string thinks it is UTF-8, but in reality, it is something else, probably ISO-8859-1.
Define (force) the correct encoding first, then convert it to UTF-8.
In your example:
puts "Développement".encode('iso-8859-1').encode('utf-8')
An alternative is:
puts "\xC3".force_encoding('iso-8859-1').encode('utf-8') #-> Ã
If the à makes no sense, then try another encoding.
"ruby 1.9: invalid byte sequence in UTF-8" described another good approach with less code:
file_contents.encode!('UTF-16', 'UTF-8')

display iso-8859-1 encoded data gives strange characters

I have an ISO-8859-1 encoded CSV file that I try to open and parse with Ruby:
require 'csv'
filename = File.expand_path('~/myfile.csv')
file = File.open(filename, "r:ISO-8859-1")
CSV.parse(file.read, col_sep: "\t") do |row|
  puts row
end
If I leave out the encoding from the call to File.open, I get an error
ArgumentError: invalid byte sequence in UTF-8
My problem is that the call to puts row displays strange characters instead of the Norwegian characters æ, ø, å:
BOKF�RINGSDATO
I get the same if I open the file in textmate, forcing it to use UTF-8 encoding.
By assigning the file content to a string, I can check the encoding used for the string. As expected, it shows ISO-8859-1.
So when I puts each row, why does it output the string as UTF-8?
Is it something to do with the csv-library?
I use ruby 1.9.2.
Found myself an answer by trying different things from the documentation:
require 'csv'
filename = File.expand_path('~/myfile.csv')
File.open(filename, "r:ISO-8859-1") do |file|
  CSV.parse(file.read.encode("UTF-8"), col_sep: "\t") do |row|
    # file.read.encode("UTF-8") returns a copy transcoded to UTF-8
    puts row
  end
end
As you can see, all I have done is encode the string to UTF-8 before the CSV parser gets it.
Edit:
Trying this solution on macruby-head, I get the following error message from encode():
Encoding::InvalidByteSequenceError: "\xD8" on UTF-8
Even though I specify the encoding when opening the file, MacRuby uses UTF-8.
This seems to be a known MacRuby limitation: the encoding is always UTF-8.
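As an aside, on CRuby 1.9+ you can also let File.open do the transcoding itself by giving it both an external and an internal encoding ("ext:int") in the mode string, reusing filename from above:
File.open(filename, "r:ISO-8859-1:UTF-8") do |file|
  CSV.parse(file.read, col_sep: "\t") do |row|
    puts row  # rows arrive as UTF-8 strings
  end
end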
Maybe you could use Iconv to convert the file contents to UTF-8 before parsing?
ISO-8859-1 and Win-1252 are really close in their character sets. Could some app have processed the file and converted it? Or could it have been received from a machine that was defaulting to Win-1252, which is Windows' standard setting?
Software that senses the code set can get the encoding wrong if there are no characters in the 0x80 to 0x9F byte range, so you might try changing file = File.open(filename, "r:ISO-8859-1") to file = File.open(filename, "r:Windows-1252"). (I think "Windows-1252" is the right encoding name.)
I used to write spiders, and HTML is notorious for being mislabeled or for having characters from one character set embedded in another. I used some bad language many times over these problems several years ago, before most languages had implemented UTF-8 and Unicode, so I understand the frustration.
See also: ISO/IEC 8859-1, Windows-1252
