Determining encoding for a file in Ruby

I have come up with a method to determine encoding (or at least a guess at it) for a file that I pass in:
def encoding_type(file_path)
  File.read(file_path).encoding.name
end
The problem with this is that I have a file that is 15 GB, which means the entire file is read into memory.
Is there any way to accomplish what I am doing in this method without needing to read the entire file into memory?

The file --mime command will return the MIME type and encoding of the file:
file --mime myfile
myfile: text/plain; charset=iso-8859-1
def detect_charset(file_path)
  `file --mime #{file_path}`.strip.split('charset=').last
rescue => e
  Rails.logger.warn "Unable to determine charset of #{file_path}"
  Rails.logger.warn "Error: #{e.message}"
end
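Since the shell-out only asks file(1) about the file, nothing is loaded into Ruby, so it works fine for the 15 GB case. A minimal usage sketch (the path is made up for illustration):
# Only file(1) touches the file; Ruby never reads its contents
charset = detect_charset('/data/big_export.csv')
puts charset  # => "iso-8859-1", for example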

The method you suggest in your question will not do what you think. It simply tags the data with Encoding.default_external (and transcodes it to Encoding.default_internal if that is set); both of these are usually UTF-8. The encoding you get back from that code will always be that default; it is not guessed or determined from the actual contents of the file.
If you have a file and you really don't know what encoding it is, you indeed will have to guess. There's no way to be 100% sure you've gotten it right as the author intended (and some files are corrupt and mixed encoding or not legal in any encoding).
There are libraries with heuristics meant to try and guess (they won't be right all the time).
Here's one, which I've never actually used myself, but the likeliest prospect I found in 10 minutes of googling: https://github.com/oleander/rchardet There might be other ruby gems for this. You could also use Ruby's system() to call a Linux command-line utility that tries to do this; another answer here mentions the Linux file command.
If you don't want to load the entire file in to test it, you can certainly just load part of it. The chardet library will probably be more reliable the more data it has, but, sure, just read the first X bytes of the file and then ask chardet to guess its encoding.
require 'chardet19'
first1000bytes = File.read(file, 1000)
cd = CharDet.detect(first1000bytes)
cd.encoding
cd.confidence
You can also always check to see if any string in ruby is valid for the encoding it's set at:
str.valid_encoding?
So you could simply go through a variety of encodings and see if it's valid:
orig_encoding = str.encoding
str.force_encoding("ISO-8859-1").valid_encoding?
str.force_encoding("UTF-8").valid_encoding?
str.force_encoding(orig_encoding) # put it back to what it was
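Wrapped up as a small helper, that idea looks something like this (a sketch I'm adding here, not part of the original answer; the candidate list is purely illustrative):
# Hypothetical helper: return the first candidate encoding the bytes are valid in
def guess_encoding(str, candidates = %w[UTF-8 ISO-8859-1 Windows-1252])
  original = str.encoding
  candidates.find { |enc| str.force_encoding(enc).valid_encoding? }
ensure
  str.force_encoding(original) # restore whatever tag the string had before
end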
But it's certainly possible for a file to be valid in more than one encoding, or to be valid in a given encoding but read as nonsense by humans in that encoding.
If you have your best-guess encoding, but the string still isn't valid_encoding? for it, it may just have a few bad bytes in it. You can remove them with String#scrub in Ruby 2.1, or with a pure-ruby backport of String#scrub on earlier Ruby versions.
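For example (Ruby 2.1 or later):
# Replace any invalid byte sequences with the Unicode replacement character
clean = str.scrub
# or supply your own replacement
clean = str.scrub('?')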
Hope this helps give you some idea of what you're dealing with and what your options are.

Related

Nokogiri outputs different strings on different systems

I am reading a local .html file using the following line:
myDoc = File.open("Ina.html") { |f| Nokogiri::HTML(f) }
I get a Node using xpath and then I simply print it
divNode = myDoc.at_xpath('//div[@id="mw-content-text"]/p[1]')
puts divNode
Fragment of Output on one system: Using ruby 2.3
<p><b>Ina:</b> Ñe’êpehê , ñe’ẽtéva rire (aha´aína)</p>
Fragment of Output on another system: Using ruby 2.1
<p><b>Ina:</b> Ñe’êpehê , ñe’ẽtéva rire (aha´aína)</p>
Any thoughts on what is going on with the encoding? All the suggestions of forcing the encoding and/or specifying the encoding have not been successful.
Well, I fixed the problem but I still don't fully understand why this way would not work.
So, the solution was to simply read the whole .html file and then instantiate the Nokogiri object by parsing the string of the file.
file = File.open(outputFolder + "/" + htmlName, "rb")
content = file.read
doc = Nokogiri::HTML.parse(content, nil, "UTF-8")
To me, this seemed equivalent to either one of the statements I tried:
myDoc = File.open("Ina.html") { |f| Nokogiri::HTML(f) }
myDoc = File.open("Ina.html", nil, "UTF-8") { |f| Nokogiri::HTML(f) }
nokogiri does weird stuff sometimes. I couldn't explain what nokogiri is "supposed" to do here -- both versions are 'correct' in representing the same thing in an HTML document. Is it exactly the same version of nokogiri? If so, it could be a different version of libxml, which nokogiri uses under the hood, and in some cases will use an existing system install. Or the ruby 2.1 vs 2.3 difference could matter, although that seems unlikely.
Basically, if you want exactly the same behavior, you need to use exactly the same version of everything -- ruby, nokogiri, libxml.
The first is just the straight unicode bytes, the second has non-ascii characters replaced by html character entities. Both should be rendered the same in a browser. If you want one of those behaviors and not the other (personally I think I'd rather have the unicode), that's kind of a different question, but there's probably a way to force nokogiri to do it. But I don't know it.
If you use Nokogiri::XML instead of Nokogiri::HTML, I'd wager it won't replace non-ascii with html character entities, but you also, if I recall right, won't get some "forgiving of not quite legal syntax" behavior the HTML parser uses.
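A quick way to test that (just a sketch, and it assumes the file is close enough to well-formed XML for the stricter parser to accept it):
# Compare how the XML parser serializes the same node
xml_doc = File.open("Ina.html") { |f| Nokogiri::XML(f, nil, "UTF-8") }
puts xml_doc.at_xpath('//div[@id="mw-content-text"]/p[1]')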
Wait, now looking closer, I'm thinking maybe the second one doesn't represent the same thing, it's html character entities, but I'm not sure they're really the right ones. Could encoding have gotten messed up? Depending on how you are reading the data in, and the OS, and what the LANG env variable is set to if it's a unix machine, it could be messing up the encoding.
Also, are you positive that the Ina.html file you are opening is really truly identical on both systems? Could it have become corrupted or transformed differently in the download process? Copy the file from one machine to the other to make sure the two files are really identical.

Two machines returning a different encoding result for the same file

I am running the below code on two different machines with the same file, but each one is returning a different character encoding type:
def encoding_type
  File.read(file_path).encoding.name
end
Does that make any sense?
I expect that the two machines are using different default encodings. You can verify this by inspecting the return value of Encoding.default_external - it should match the two different encodings you get from your File.read(file_path). If you were assuming the given file somehow declared its encoding in a way that Ruby detected, you were most likely wrong - it is possible for Ruby to determine the correct String encoding in some scenarios, but reading a file from disk is not one of them. In fact, many encodings cannot be distinguished from the file contents alone, and although it is possible to make a good guess, that is not something you should expect from any language's basic file-reading library.
The documentation on Encoding.default_external explains where it applies. It includes file reads where you have not specified the file encoding.
One possible root cause is different locale settings on each machine.
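You can confirm that by comparing something like this on the two machines (a quick sketch):
# Run on each machine and compare the output
puts Encoding.default_external # e.g. UTF-8 on one machine, US-ASCII on the other
puts ENV['LANG']               # the locale the default is usually derived from on Unix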
The best fix will vary depending on what your code needs to do. One simple fix, if you want your application to use consistent encoding everywhere and ignore machine settings, is to just set the value:
Encoding.default_external = 'UTF-8'
Another option, if the problem is specific to this file, and you want to use machine settings elsewhere in your app is to open the file with a specific encoding:
File.read(file_path, :encoding => 'UTF-8')
You could also alter the locale setting on the two machines, if that makes sense for other uses the two machines have.

Ruby 1.9 iso-8859-8-i encoding

I'm trying to create a piece of code that will download a page from the internet and do some manipulation on it. The page is encoded in iso-8859-1.
I can't find a way to handle this file. I need to search through the file in Hebrew and return the changed file to the user.
I tried to use string.encode, but I still get the wrong encoding.
When printing the response encoding, I get "encoding":{} as if it were undefined, and this is an example of what it returns:
\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd-\ufffd\ufffd\ufffd\ufffd\ufffd\ufffd \ufffd\ufffd\ufffd\ufffd
It should be Hebrew letters.
When I try with final.body.encode('iso-8859-8-i'), I get the error code converter not found (ASCII-8BIT to iso-8859-8-i).
When you have input where Ruby or the OS has incorrectly assigned the encoding, conversions will not work. That's because Ruby starts with the wrong assumption and tries to preserve the wrong characters when converting.
However, if you know from some other source what the correct encoding is, you can use force_encoding method to tell Ruby how to interpret the bytes it has loaded into a String. Note this alters the object in place.
E.g.
contents = final.body
contents.force_encoding('ISO-8859-8')
puts contents
At this point (provided it works), you now can make conversions (to e.g. UTF-8), because Ruby has been correctly told what characters it is dealing with.
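For example, continuing from the snippet above (a sketch):
# Now that the bytes are labeled correctly, transcoding is meaningful
utf8_contents = contents.encode('UTF-8')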
I could not find 'ISO-8859-8-I' on my version of Ruby. I am not sure yet how close 'ISO-8859-8' is to what you need (some Googling suggests that it may be OK for you, if the ...-I encoding is not available).

mime body guess charset (and convert to UTF-8)

I'm trying to parse incoming e-mails and want to store the body as a UTF-8 encoded string in a database, however I've quickly noticed that not all e-mails send charset information in the Content-Type header. After trying some manual quick fixes with String.force_encoding and String.encode I decided to ask the friendly people of SO.
To be honest, I was secretly hoping for String#encoding to automagically return the encoding used in the string, however it always comes back as ASCII-8BIT after I send a test e-mail to it. I started having this problem while implementing quoted-printable decoding as an option, which seemed to work as long as I had also received some ;charset=blabla info.
input = input.gsub(/\r\n/, "\n").unpack("M*").first
if charset
  return input.force_encoding(charset).encode("utf-8")
end
# This is obviously wrong, as the string is not always ISO-8859-1 encoded:
return input.force_encoding("ISO-8859-1").encode("utf-8")
I've been experimenting with several "solutions" I found on the internet, however most seemed to relate to file reading/writing, and the few gems I tried for detecting encoding either didn't really do the trick or were incredibly outdated. It should be possible, and it feels as if the answer is staring me right in the face; hopefully someone here will be able to shine some light on my situation and tell me what I've been doing completely wrong.
using ruby 1.9.3
You may use https://github.com/janx/chardet to detect the original encoding of your email text.
Example Here:
irb(main):001:0> require 'rubygems'
=> true
irb(main):002:0> require 'UniversalDetector'
=> false
irb(main):003:0> p UniversalDetector::chardet('hello')
{"encoding"=>"ascii", "confidence"=>1.0}
=> nil
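Tying that back to the snippet in the question, something like this should work (a sketch; it assumes the detector's 'encoding' value is a name Ruby's Encoding.find accepts):
# Hypothetical glue code: detect the charset, then relabel and transcode
detected = UniversalDetector::chardet(input)
charset  = detected['encoding']
input    = input.force_encoding(charset).encode('utf-8') if charset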
Have you tried https://github.com/fac/cmess ?
== DESCRIPTION
CMess bundles several tools under its hood that aim at dealing with
various problems occurring in the context of character sets and
encodings. Currently, there are:
guess_encoding:: Simple helper to identify the encoding of a given string. Includes the ability to automatically detect the encoding of an input.
[...]

Ruby 1.9 - Invalid multibyte character (utf-8)

I have a ruby file with only these two lines:
# encoding: utf-8
puts "—"
When I run it with ruby test_enc.rb it fails with:
test_enc.rb:2: invalid multibyte char (UTF-8)
test_enc.rb:2: unterminated string meets end of file
I don't know how to properly specify the character code of — (emdash), but vim tells me it is 151, Hex 97, Octal 227. It fails the same way with other characters like ã as well, so I doubt it is related specifically to that character.
I am running on Windows XP and the version of ruby I'm using is:
ruby 1.9.1p430 (2010-08-16 revision 28998) [i386-mingw32]
I feel like there is something very obvious I am missing here. Any ideas?
EDIT: Learned a valuable lesson about assumptions today - specifically assuming your editor IS using UTF-8 without actually checking it. Oops!
Thanks for the quick and accurate replies all!
EDIT AGAIN: The 'setting up vim properly for utf-8' grew too big and wasn't really relevant to this question, so it is now a separate question.
Given that Ruby is explicitly calling your attention to UTF-8, I strongly suspect that you haven't actually written out a UTF-8 file to start with. Make sure that Vim (or whatever text editor you're using to create the file) is really set to write out UTF-8.
Note that in UTF-8, any non-ASCII character will be represented by multiple bytes, not a single byte as you've described from the Vim diagnostics. I'd recommend using a binary file editor (or dump, or whatever) to really show what's in the text file though. Something that doesn't already have some preconceived notion of the encoding - something that isn't even trying to think of it as a text file.
Notepad lets you write out a file in UTF-8, so you might want to try that just to see what happens. (I don't have Ruby installed myself, otherwise I'd try it for you.)
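For instance, a quick byte dump from irb would settle it (a sketch; it assumes the script is saved as test_enc.rb):
# Show the raw bytes with no encoding assumptions
p File.binread("test_enc.rb").unpack('C*')
# a UTF-8 em dash is the three bytes 226, 128, 148 (0xE2 0x80 0x94);
# a lone 151 (0x97) means the file was saved as Windows-1252, not UTF-8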
Your file is in Windows-1252 (often loosely called Latin-1), where hex 97 is the em dash, so Ruby is right to reject it as UTF-8.
An em dash would be encoded as three bytes, not one, in UTF-8.
