How can I use Net::Http to download a file with UTF-8 characters in it? - ruby

I have an application where users can upload text-based files (xml, csv, txt) that are persisted to S3. Some of these files are pretty big. There are a variety of operations that need to be performed on the data in these files, so rather than read them from S3 and have it time out occasionally I download the files locally, then turn the operations loose on them.
Here's the code I use to download the file from S3. Upload is the name of the AR model I use to store this information. This method is an instance method on the Upload model:
def download
basename = File.basename(self.text_file_name.path)
filename = Rails.root.join(basename)
host = MyFitment::Utility.get_host_without_www(self.text_file_name.url)
Net::HTTP.start(host) do |http|
f = open(filename)
begin
http.request_get(self.text_file_name.url) do |resp|
resp.read_body do |segment|
f.write(segment) # Fails when non-ASCII 8-bit characters are included.
end
end
ensure
f.close()
end
end
filename
end
So you see that line above where the load fails. This code somehow thinks all files that are downloaded are encoded in ASCII 8-bit. How can I:
1) Check the encoding of a remote file like that
2) Download it and write it successfully.
Here's the error that is happening with a particular file right now:
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
from /Users/me/code/myapp/app/models/upload.rb:47:in `write'
Thank you for any help you can offer!

How can I: 1) Check the encoding of a remote file like that.
You can check the Content-Type header of the response, which, if present, may look something like this:
Content-Type: text/plain; charset=utf-8
As you can see, the encoding is specified there. If there's no Content-Type header, or if the charset is not specified, or if the charset is specified incorrectly, then you can't know the encoding of the text. There are gems that can try to guess the encoding(with increasing accuracy), e.g. rchardet, charlock_holmes, but for complete accuracy, you have to know the encoding before reading the text.
This code somehow thinks all files that are downloaded are encoded in
ASCII 8-bit.
In ruby, ASCII-8BIT is equivalent to binary, which means the Net::HTTP library just gives you a string containing a series of single bytes, and it's up to you to decide how to interpret those bytes.
If you want to interpret those bytes as UTF-8, then you do that with String#force_encoding():
text = text.force_encoding("UTF-8")
You might want to do that if, for instance, you want to do some regex matching on the string, and you want to match full characters(which might be multi-byte) rather than just single bytes.
Encoding::UndefinedConversionError: "\x95" from ASCII-8BIT to UTF-8
Using String#encode('UTF-8') to convert ASCII-8BIT to UTF-8 doesn't work for bytes whose ascii codes are greater than 127:
(0..255).each do |ascii_code|
str = ascii_code.chr("ASCII-8BIT")
#puts str.encoding #=>ASCII-8BIT
begin
str.encode("UTF-8")
rescue Encoding::UndefinedConversionError
puts "Can't encode char with ascii code #{ascii_code} to UTF-8."
end
end
--output:--
Can't encode char with ascii code 128 to UTF-8.
Can't encode char with ascii code 129 to UTF-8.
Can't encode char with ascii code 130 to UTF-8.
...
...
Can't encode char with ascii code 253 to UTF-8.
Can't encode char with ascii code 254 to UTF-8.
Can't encode char with ascii code 255 to UTF-8.
Ruby just reads one byte at a time from the ASCII-8BIT string and tries to convert the character in the byte to UTF-8. So, while 128 may be a legal byte in UTF-8 when part of a multi-byte character sequence, 128 is not a legal UTF-8 character as a single byte.
As for writing the strings to a file, instead of this:
f = open(filename)
...if you want to output UTF-8 to the file, you would write:
f = open(filename, "w:UTF-8")
By default, ruby uses whatever the value of Encoding.default_external is to encode output to a file. The default_external encoding is pulled from your system's environment, or you can set it explicitly.

Related

Gettext failed to extract non-ASCII characters

In my source files I have string containing non-ASCII characters like
sCursorFormat = TRANSLATE("Frequency (Hz): %s\nDegree (°): %s");
But when I extract them they vanish like
msgid ""
"Frequency (Hz): %s\n"
"Degree (): %s"
msgstr ""
I have specified the encoding when extracting as
xgettext --from-code=UTF-8
I'm running under MS Windows and the source files are C++ (not that it should matter).
The encoding of your source file is probably not UTF-8, but ANSI, which stands for whatever the encoding for non-Unicode applications is (probably code page 1252). If you would open the file in some hex editor you would see byte 0x80 standing for degree symbol. This byte is not a valid UTF-8 character. In UTF-8 encoding degree symbol is represented with two bytes 0xC2 0xB0. This is why the byte vanishes when using --from-code=UTF-8.
The solution for your problem is to use --from-code=windows-1252. OR, better yet, to save all source files as UTF-8, and then use --from-code=UTF-8.

Text file encoding properly as utf-8 in Scite editor but fails to encode to uft-8 in ruby

I have a text file which if viewed in the Scite editor with the encoding set to utf-8, displays all text correctly, including capital letters with an accute accent (i.e. Á).
However, if I write a ruby script and use mystring.encode("utf-8") it will give me this error on capital letters that carry an acute accent (i.e. Á):
encode': "\x81" to UTF-8 in conversion from Windows-1252 to UTF-8 (Encoding::UndefinedConversionError)
Is this expected behaviour? How can I encode the whole text to utf-8 using ruby, knowing that otherwise it does get successfully encoded in the Scite editor?
Code:
ine_file = File.open("../../_data/ine_spain_demographics.csv", 'r')
ine_towns_population_hash = Hash.new
ine_file.each do|line|
values = line.split(";")
town_name = values[3]
population = values[4]
begin
ine_towns_population_hash[town_name.encode("utf-8")] = population
rescue
puts "problematic string: " + town_name
end
end
You say that ine_file.external_encoding says Windows-1252 so the file is being opened as a Windows-1252 encoded file. Then you say town_name.encode("utf-8") in an attempt to encoded a string as UTF-8 and Ruby complains. But the file is actually UTF-8; reading UTF-8 bytes as Windows-1252 and then trying to recode those bytes as UTF-8 isn't going to work.
You need to open the file in UTF-8 mode:
File.open("../../_data/ine_spain_demographics.csv", 'r:UTF-8')
and stop trying to change the encoding of town_name, just use town_name as-is.
It seems like it's misinterpreting the encoding of ine_spain_demographics.csv.
Looking at the doc's for encode and open you have two options:
Use replace in encode to tell Ruby what character to use town_name.encode("utf-8", replace: '').
Identify the correct file encoding and specify it: File.open("../../_data/ine_spain_demographics.csv", 'r:ISO-8859-1')

Why do I get ArgumentError - invalid byte sequence in UTF-8?

While trying to print Duplicaci¾n out of a CSV file, I get the following error:
ArgumentError - invalid byte sequence in UTF-8
I'm using Ruby 1.9.3-p362 and opening the file using:
CSV.foreach(fpath, headers: true) do |row|
How can I skip an invalid character without using iconv or str.encode(undef: :replace, invalid: :replace, replace: '')?
I tried answers from the following questions, but nothing worked:
ruby 1.9: invalid byte sequence in UTF-8
Ruby Invalid Byte Sequence in UTF-8
ruby 1.9: invalid byte sequence in UTF-8
This is from the CSV.open documentation:
You must provide a mode with an embedded Encoding designator unless your data is in Encoding::default_external(). CSV will check the Encoding of the underlying IO object (set by the mode you pass) to determine how to parse the data. You may provide a second Encoding to have the data transcoded as it is read just as you can with a normal call to IO::open(). For example, "rb:UTF-32BE:UTF-8" would read UTF-32BE data from the file but transcode it to UTF-8 before CSV parses it.
That applies to any method in CSV that opens a file.
Also start reading in the documentation at the part beginning with:
CSV and Character Encodings (M17n or Multilingualization)
Ruby is expecting UTF-8 but is seeing characters that don't fit. I'd suspect WIN-1252 or ISO-8859-1 or a variant.

Confusion with ARGF#set_encoding

ARGF.set_encoding says:
If single argument is specified, strings read from ARGF are tagged with the encoding specified.
If two encoding names separated by a colon are given, e.g. "ascii:utf-8", the read string is converted from the first encoding (external encoding) to the second encoding (internal encoding), then tagged with the second encoding.
So I tried the below:
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding('ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/)
output:
D:\Rubyscript\My ruby learning days>ruby true.rb a.txt
"2.0.0"
#<Encoding:IBM437>
["Hi! How are you?\n", "I am doing good,thanks."]
No encoding change is found. So please advice me the correct approach.
Encoding IBM437 and ASCII (and UTF-8) has the same byte sequence for ASCII characters. So you won't see the difference from String#inspect. However, you can check the String#encoding value for the input strings.
p RUBY_VERSION
p ARGF.external_encoding
ARGF.set_encoding(ARGF.external_encoding,'ascii')
p ARGF.readlines($/).map{|s| s.encoding}
In Ruby (1.9 and higher version), String is a byte sequence tagged with some encoding. You can get the encoding from String#encoding.
So the Chinese word "中" can be represented different ways:
e4 b8 ad # tagged with encoding UTF-8
d6 d0 # tagged with encoding GBK
2d 4e # tagged with encoding UTF-16le
I will always write my script in UTF-8, that is, the internal encoding for my script is UTF-8. Some times I want to process text file (e.g. named "a.txt" and has content "中") encoded with GBK. Then I can set the external encoding and the internal encoding for the IO object and Ruby will do the conversion for me.
ARGF.set_encoding('GBK', 'UTF-8')
str = ARGF.readline
puts str.encoding
# run $ script.rb a.txt
Ruby reads "\xd6\xd0" from "a.txt" and since I have specified the external encoding as GBK, it tags the data with encoding GBK. And I have specified the internal encoding as UTF-8 so Ruby do a conversion from GBK byte sequence to UTF-8, which results in "\xe4\xb8\xad" with tag UTF-8. And this string has the same encoding as other strings in my script, so I can use it with ease.
This is useful because a lot of String methods fail when the two String operands has different, incompatible encoding. For example:
# encoding: utf-8
a = "中" # tagged with UTF-8
b = "中".encode('gbk') # tagged with GBK
puts a + b
#=> Encoding::CompatibilityError: incompatible character encodings: UTF-8 and GBK

Why do I get an "Invalid Byte Sequence in UTF-8" error reading a text file?

I'm writing a Ruby script to process a large text file, and keep getting an odd encoding error.
Here's the situation:
input_data = File.new(in_path, 'r').read
p input_data.encoding.name # UTF-8
break_char = "\r".encode("UTF-8")
p break_char # "\r"
p break_char.encoding.name # "UTF-8"
input_data.split(",".encode("UTF-8"))
p Encoding.compatible?(input_data, break_char) # # Encoding:UTF-8>
This produces the error :in 'split': invalid byte sequence in UTF-8 (ArgumentError)
I read http://blog.grayproductions.net/articles/ruby_19s_string and looked at other solutions to apparently the same problem, but still can't work out why it's happening when I believe I am controlling the encodings.
I'm on OSX working with ruby 1.9.2
Obviously your input file is not UTF-8 (or at least, not entirely). If you don't care about non-ascii characters, you can simply assume your file is ascii-8bit encoded. BTW, your separator (break_char) is not causing problems as comma is encoded the same way in UTF-8 as in ASCII.
fname = 'test.in'
# create example file and fill it with invalid UTF-8 sequence
File.open(fname, 'w') do |f|
f.write "\xc3\x28"
end
# then try to read and parse it
s = File.open(fname) do |f| # file opened as UTF-8
#s = File.open(fname, 'r:ascii-8bit') do |f| # file opened as ascii-8bit
f.read
end
p s.split ','
I fail to get an error here on Linux even when the input file is not UTF-8. (I'm using Ruby 1.9.2, as well.)
Logically, either this problem is linked with OS-X, or it's something to do with your input data. Does it happen regardless of the data in the input file?
(I realise that this is not a proper answer, but I lack the rep to add a comment. And since no-one has responded yet, I thought it better than nothing...)
You read the file using the default encoding your system provides. So ruby tags the string as utf8, which doesn't mean it's really utf8-data. Try file <input file> to guess what kind of encoding is in there, then tell ruby it's that one (unclean: force_encoding(<encoding>), clean: tell the File object what encoding it is, I don't know how to do that) and then use encode!("utf8") to convert it to utf8.
Here are 2 common situations and how to deal with them:
Situation 1
You have an UTF-8 input-file with possibly a few invalid bytes
Remove the invalid bytes:
test = "Partly valid\xE4 UTF-8 encoding: äöüß"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
str.scrub('')
=> "Partly valid UTF-8 encoding: äöüß"
Situation 2
You have an input-file that could be in either UTF-8 or ISO-8859-1 encoding
Check which encoding it is and convert to UTF-8 (if necessary):
test = "String in ISO-8859-1 encoding: \xE4\xF6\xFC\xDF"
File.open( 'input_file', 'w' ) {|f| f.write(test)}
str = File.read( 'input_file' )
unless str.valid_encoding?
str.encode!( 'UTF-8', 'ISO-8859-1', invalid: :replace )
end #unless
=> "String in ISO-8859-1 encoding: äöüß"
Notes
The above code snippets assume that Ruby encodes all your strings in UTF-8 by default. Even though, this is almost always the case, you can make sure of this by starting your scripts with # encoding: UTF-8.
If invalid, it is programmatically possible to detect most multi-byte encodings like UTF-8 (in Ruby, see: #valid_encoding?). However, it is NOT possible (or at least extremely hard) to programmatically detect invalidity of single-byte-encodings like ISO-8859-1. Thus the above code snippet does not work the other way around, i.e. detecting if a String is valid ISO-8859-1 encoding.
Even though UTF-8 has become increasingly popular as the default encoding in computer-systems, ISO-8859-1 and other Latin1 flavors are still very popular in the Western countries, especially in North America. Be aware that there a several single-byte encodings out there that are very similar, but slightly vary from ISO-8859-1. Examples: CP1252 (a.k.a. Windows-1252), ISO-8859-15
[ruby] [encoding] [utf8] [file-encoding] [character-encoding]
Please try this one:-
input_data = File.open("path/your_file.pdf", "rb") {|io| io.read}
Thanks

Resources