Reading a file with ISO-8859 encoding - ruby

According to Mac OS X, I have a file with ISO-8859 encoding:
$ file filename.txt
filename.txt: ISO-8859 text, with CRLF line terminators
I try to read it with that encoding:
> filename = "/Users/myuser/Downloads/filename.txt"
> content = File.read(filename, encoding: "ISO-8859")
> content.encoding
=> #<Encoding:UTF-8>
It doesn't work. And consequently:
> content.split("\n")
ArgumentError: invalid byte sequence in UTF-8
Why doesn't it read the file as ISO-8859?

With your code, Ruby emits the following warning when reading the file:
warning: Unsupported encoding ISO-8859 ignored
This is because there is no single ISO 8859 encoding; there is a whole family of variants (ISO 8859-1 through ISO 8859-16). You need to specify the correct one explicitly, e.g.
content = File.read(filename, encoding: "ISO-8859-1")
# or equivalently
content = File.read(filename, encoding: Encoding::ISO_8859_1)
When dealing with text files produced on Windows machines (which is hinted at by the CRLF line endings), you might want to use Encoding::Windows_1252 (or the string "Windows-1252") instead. This is a superset of ISO 8859-1 and used to be the default encoding of many Windows programs and of the system itself.
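For illustration, here is a minimal sketch of that approach (assuming the file really is Windows-1252; filename is the variable from the question):
content = File.read(filename, encoding: "Windows-1252")
content.encoding                 # => #<Encoding:Windows-1252>
utf8 = content.encode("UTF-8")   # transcode so the rest of the app sees UTF-8
utf8.split("\r\n")               # the file has CRLF line terminators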

Try to use Encoding::ISO_8859_1 instead.

Related

MBCS to UTF-8: How to encode in Python

I am trying to create a duplicate file finder for Windows. My program works well in Linux. But it writes NUL characters to the log file in Windows. This is due to the MBCS default file system encoding of Windows, while the file system encoding in Linux is UTF-8. How can I convert MBCS to UTF-8 to avoid this error?
Tell Python to use UTF-8 on the log file. In Python 3 you do this by:
open(..., encoding='utf-8')
If you want to convert an MBCS string to UTF-8 you can switch string encodings:
filename.encode('mbcs').decode('utf-8')
Use filename.encode(sys.getdefaultencoding())... to make the code work on Linux, as well.
Just change the encoding to 'latin-1' (encoding='latin-1'):
Using pure Python:
open(..., encoding='latin-1')
Using Pandas:
pd.read_csv(..., encoding='latin-1')

Ruby incompatible character encodings

I am currently trying to write a script that iterates over an input file and checks data on a website. If it finds the new data, it prints to the terminal that it passes; if it doesn't, it tells me it fails. And vice versa for deleted data. It was working fine until the input file I was given contained the "™" character. When Ruby gets to that line, it spits out an error:
PDAPWeb.rb:73:in `include?': incompatible character encodings: UTF-8 and IBM437
(Encoding::CompatibilityError)
The offending line is a simple check to see if the text exists on the page.
if browser.text.include? (program_name)
Where the program_name variable is a parsed piece of information from the input file. In this instance, the program_name contains the 'TM' character mentioned before.
After some research I found that adding the line # encoding: utf-8 to the beginning of my script could help, but so far has not proven useful.
I added this to my program_name variable to see if it would help (and it allowed my script to run without errors), but now it is not properly finding the TM character when it should be.
program_name = record[2].gsub("\n", '').force_encoding("utf-8").encode("IBM437", replace: nil)
This seemed to convert the TM character to this: Γäó
I thought maybe I had the IBM437 and UTF-8 parts reversed, so I tried the opposite:
program_name = record[2].gsub("\n", '').force_encoding("IBM437").encode("utf-8", replace: nil)
and am now receiving this error when attempting to run the script
PDAPWeb.rb:48:in `encode': U+2122 from UTF-8 to IBM437 (Encoding::UndefinedConversionError)
I am using ruby 1.9.3p392 (2013-02-22) and I'm not sure if I should upgrade as this is the standard version installed in my company.
Is my encoding incorrect and causing it to convert the TM character with errors?
Here’s what it looks like is going on. Your input file contains a ™ character, and it is in UTF-8 encoding. However when you read it, since you don’t specify the encoding, Ruby assumes it is in your system’s default encoding of IBM437 (you must be on Windows).
This is basically the same as this:
>> input = "™"
=> "™"
>> input.encoding
=> #<Encoding:UTF-8>
>> input.force_encoding 'ibm437'
=> "\xE2\x84\xA2"
Note that force_encoding doesn’t change the actual string, just the label associated with it. This is the same outcome as in your case, only you arrive here via a different route (by reading the file).
The web page also has a ™ symbol, and is also encoded as UTF-8, but in this case Ruby has the encoding correct (Watir probably uses the headers from the page):
>> web_page = '™'
=> "™"
>> web_page.encoding
=> #<Encoding:UTF-8>
Now when you try to compare these two strings you get the compatibility error, because they have different encodings:
>> web_page.include? input
Encoding::CompatibilityError: incompatible character encodings: UTF-8 and IBM437
from (irb):11:in `include?'
from (irb):11
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
If either of the two strings contained only ASCII characters (i.e. code points less than 128) then this comparison would have worked. UTF-8 and IBM437 are both supersets of ASCII, and are only incompatible if they both contain characters outside of the ASCII range. This is why you only started seeing this behaviour when the input file had a ™.
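You can see this ASCII compatibility in irb (a quick sketch): two differently-labelled strings compare fine as long as at least one of them is ASCII-only:
>> ascii_only = "plain text".force_encoding("IBM437")
>> utf8 = "plain text ™"
>> utf8.include?(ascii_only)   # works: ascii_only has no bytes above 127
=> true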
The fix is to inform Ruby what the actual encoding of the input file is. You can do this with the already loaded string:
>> input.force_encoding 'utf-8'
=> "™"
You can also do this when reading the file, e.g. (there are a few ways of reading files, they all should allow you to explicitly specify the encoding):
input = File.read("input_file.txt", :encoding => "utf-8")
# now input will be in the correct encoding
Note in both of these the string isn’t being changed, it still contains the same bytes, but Ruby now knows its correct encoding.
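You can verify that only the label changes by inspecting the bytes (a sketch; [226, 132, 162] is 0xE2 0x84 0xA2, the UTF-8 encoding of ™):
>> input.bytes
=> [226, 132, 162]
>> input.force_encoding("utf-8").bytes   # identical bytes, new label
=> [226, 132, 162]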
Now the comparison should work okay:
>> web_page.include? input
=> true
There is no need to encode the string. Here’s what happens if you do. First if you correct the encoding to UTF-8 then encode to IBM437:
>> input.force_encoding("utf-8").encode("IBM437", replace: nil)
Encoding::UndefinedConversionError: U+2122 from UTF-8 to IBM437
from (irb):16:in `encode'
from (irb):16
from /Users/matt/.rvm/rubies/ruby-2.2.1/bin/irb:11:in `<main>'
IBM437 doesn’t include the ™ character, so you can’t encode a string containing it to this encoding without losing data. By default Ruby raises an exception when this happens. You can force the encoding by using the :undef option, but the symbol is lost:
>> input.force_encoding("utf-8").encode("IBM437", :undef => :replace)
=> "?"
If you go the other way, first using force_encoding to IBM437 then encoding to UTF-8 you get the string Γäó:
>> input.force_encoding("IBM437").encode("utf-8", replace: nil)
=> "Γäó"
The string is already in IBM437 encoding as far as Ruby is concerned, so force_encoding doesn’t do anything. The UTF-8 representation of ™ is the three bytes 0xe2 0x84 0xa2, and when interpreted as IBM437 these bytes correspond to the three characters seen here which are then converted into their UTF-8 representations.
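The byte-level view shows the same thing (a sketch):
>> "™".bytes.map { |b| "0x%02X" % b }
=> ["0xE2", "0x84", "0xA2"]
>> "\xE2\x84\xA2".force_encoding("IBM437").encode("utf-8")
=> "Γäó"   # 0xE2, 0x84 and 0xA2 are Γ, ä and ó in IBM437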
(These two outcomes are the other way round from what you describe in the question, hence my comment above. I’m assuming that this is just a copy-and-paste error.)

Unexpected encoding error using JSON.parse

I've got a rather large JSON file on my Windows machine and it contains stuff like \xE9. When I JSON.parse it, it works fine.
However, when I push the code to my server running CentOS, I always get this: "\xE9" on US-ASCII (Encoding::InvalidByteSequenceError)
Here is the output of file on both machines
Windows:
λ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
CentOS:
$ file data.json
data.json: UTF-8 Unicode English text, with very long lines, with no line terminators
Here is the error I get when trying to parse it:
$ ruby -rjson -e 'JSON.parse(File.read("data.json"))'
/usr/local/rvm/rubies/ruby-2.0.0-p353/lib/ruby/2.0.0/json/common.rb:155:in `encode': "\xC3" on US-ASCII (Encoding::InvalidByteSequenceError)
What could be causing this problem? I've tried using iconv to change the file into every possible encoding I can, but nothing seems to work.
"\xE9" is é in ISO-8859-1 (and various other ISO-8859-X encodings and Windows-1250 and ...) and is certainly not UTF-8.
You can get File.read to fix up the encoding for you by using the encoding options:
File.read('data.json',
          :external_encoding => 'iso-8859-1',
          :internal_encoding => 'utf-8')
That will give you a UTF-8 encoded string that you can hand to JSON.parse.
Or you could let JSON.parse deal with the encoding by using just :external_encoding to make sure the string comes off the disk with the right encoding flag:
JSON.parse(
  File.read('data.json',
            :external_encoding => 'iso-8859-1')
)
You should have a close look at data.json to figure out why file(1) thinks it is UTF-8. The file might incorrectly have a BOM when it is not UTF-8 or someone might be mixing UTF-8 and Latin-1 encoded strings in one file.
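One way to investigate is to locate the first byte that breaks UTF-8 validity (a diagnostic sketch, not part of the fix above):
# read raw bytes, label them UTF-8, and find the first invalid character
raw = File.read("data.json", mode: "rb").force_encoding("utf-8")
raw.each_char.with_index do |ch, i|
  next if ch.valid_encoding?
  printf "invalid byte 0x%02X at character index %d\n", ch.bytes.first, i
  break
end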

Encode a movie with Unicode filename in Windows using Popen

I want to encode a movie through IO.popen with Ruby (1.9.3) on Windows 7.
If the file name contains only ASCII characters, the encoding proceeds normally, but with a Unicode filename the script returns a "No such file or directory" error.
Like the following code:
#-*- encoding: utf-8 -*-
command = "ffmpeg -i ü.rm"
IO.popen(command){|pipe|
  pipe.each{|line|
    p line
  }
}
I couldn't find out whether the problem is caused by ffmpeg or by Ruby.
How can I fix this problem?
Windows doesn't use UTF-8 encoding. Ruby sends the byte sequence of the Unicode filename to the file system directly, and of course the file system won't recognize the UTF-8 sequence. It seems newer versions of Ruby have fixed this issue. (I'm not sure; I'm using 1.9.2p290 and it's still there.)
You need to convert the UTF-8 filename to the encoding your Windows uses.
# coding: utf-8
code_page = "cp#{`chcp`.chomp[/\d+$/]}" # detect code page automatically.
command = "ffmpeg -i ü.rm".encode(code_page)
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
Another way is to save your script with the same encoding Windows uses. And don't forget to update the encoding declaration. For example, I'm using Simplified Chinese Windows and it uses GBK(CP936) as default encoding:
# coding: GBK
# save this file in GBK
command = "ffmpeg -i ü.rm"
IO.popen(command) do |pipe|
  pipe.each do |line|
    p line
  end
end
BTW, by convention it is suggested to use do...end for multi-line code blocks rather than {...}, except in special cases.
UPDATE:
The underlying filesystem NTFS uses UTF-16 for file name encoding, so 가 is a valid filename character. However, GBK isn't able to encode 가, and neither is CP932 on your Japanese Windows. So you cannot send that specific filename to your cmd.exe, and it isn't likely you can process that file with IO.popen. For CP932-compatible filenames, the encoding approach provided above works fine. For filenames not compatible with CP932, it might be better to rename the files to something compatible.
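If you want to detect up front whether a particular filename survives the conversion, you can attempt it and rescue the failure (a sketch built on the chcp detection above; 가.rm is just an example name):
code_page = "cp#{`chcp`.chomp[/\d+$/]}"
begin
  safe_name = "가.rm".encode(code_page)
  # safe to interpolate safe_name into the ffmpeg command line
rescue Encoding::UndefinedConversionError
  warn "#{code_page} cannot represent this filename; rename the file first"
end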

Why do I get an "Invalid Byte Sequence in UTF-8" error reading a text file?

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error.
Here's the situation:
input_data = File.new(in_path, 'r').read
p input_data.encoding.name # UTF-8
break_char = "\r".encode("UTF-8")
p break_char # "\r"
p break_char.encoding.name # "UTF-8"
input_data.split(",".encode("UTF-8"))
p Encoding.compatible?(input_data, break_char) # #<Encoding:UTF-8>
This produces the error: in `split': invalid byte sequence in UTF-8 (ArgumentError)
I read http://blog.grayproductions.net/articles/ruby_19s_string and looked at other solutions to apparently the same problem, but still can't work out why it's happening when I believe I am controlling the encodings.
I'm on OSX working with ruby 1.9.2
Obviously your input file is not UTF-8 (or at least, not entirely). If you don't care about non-ASCII characters, you can simply assume your file is ASCII-8BIT encoded. BTW, your separators (break_char and the comma) are not causing problems, as both are encoded the same way in UTF-8 as in ASCII.
fname = 'test.in'
# create an example file and fill it with an invalid UTF-8 sequence
File.open(fname, 'w') do |f|
  f.write "\xc3\x28"
end
# then try to read and parse it
s = File.open(fname) do |f|                     # file opened as UTF-8
# s = File.open(fname, 'r:ascii-8bit') do |f|   # file opened as ASCII-8BIT
  f.read
end
p s.split ','
I fail to get an error here on Linux even when the input file is not UTF-8. (I'm using Ruby 1.9.2, as well.)
Logically, either this problem is linked with OS-X, or it's something to do with your input data. Does it happen regardless of the data in the input file?
(I realise that this is not a proper answer, but I lack the rep to add a comment. And since no-one has responded yet, I thought it better than nothing...)
You read the file using the default encoding your system provides, so Ruby tags the string as UTF-8, which doesn't mean it really is UTF-8 data. Try file <input file> to guess what kind of encoding is in there, then tell Ruby that's what it is (unclean: force_encoding(<encoding>); clean: tell the File object what the encoding is, I don't know how to do that), and then use encode!("utf8") to convert it to UTF-8.
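Spelled out, that recipe looks roughly like this (a sketch assuming file(1) reported ISO-8859-1; the "clean" variant the answer alludes to is the r:external:internal open-mode string):
# unclean: relabel after reading, then transcode
input_data = File.new(in_path, 'r').read
input_data.force_encoding('ISO-8859-1')   # relabel only, bytes unchanged
input_data.encode!('UTF-8')               # actual conversion
# clean: declare the encoding when opening the file instead
input_data = File.new(in_path, 'r:ISO-8859-1:UTF-8').read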
Here are 2 common situations and how to deal with them:
Situation 1
You have an UTF-8 input-file with possibly a few invalid bytes
Remove the invalid bytes:
test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
str.scrub('')
=> "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have an input-file that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
  str.encode!('UTF-8', 'ISO-8859-1', invalid: :replace)
end #unless
=> "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though this is almost always the case, you can make sure of it by starting your scripts with # encoding: UTF-8.
It is programmatically possible to detect invalid byte sequences in most multi-byte encodings like UTF-8 (in Ruby, see #valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect an invalid single-byte encoding like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. for detecting whether a String is valid ISO-8859-1.
Even though UTF-8 has become increasingly popular as the default encoding in computer systems, ISO-8859-1 and other Latin-1 flavors are still very popular in Western countries, especially in North America. Be aware that there are several single-byte encodings out there that are very similar to, but vary slightly from, ISO-8859-1. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15.
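The differences sit mostly in the 0x80-0x9F byte range: the same byte can decode to different characters depending on which sibling encoding you assume (a sketch):
>> "\x80".force_encoding("Windows-1252").encode("UTF-8")
=> "€"
>> "\x80".force_encoding("ISO-8859-1").encode("UTF-8")
=> "\u0080"   # a C1 control character, not €
>> "\xA4".force_encoding("ISO-8859-15").encode("UTF-8")
=> "€"
>> "\xA4".force_encoding("ISO-8859-1").encode("UTF-8")
=> "¤"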
Please try this one:
input_data = File.open("path/your_file.pdf", "rb") {|io| io.read}
Thanks