File encoding issue when downloading file from AWS S3 - ruby

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:
s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)
It pulls the file from AWS and loads it in a new temp file called 'temp.csv'. For some files, the obj.get(..) line throws the following error:
WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'
The stack trace shows the error is initially raised by the .get call inside the AWS SDK for Ruby.
Things I've tried:
When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:
obj.upload_file({file path}, content_encoding: 'utf-8')
Also when you call .get you can set response_content_encoding:
obj.get(response_target: temp, response_content_encoding: 'utf-8')
Neither of those works; both result in the same error as above. I would really expect that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.
It does work when I create the tempfile as follows, in the first code snippet above:
temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')
But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?
Important to note: the problematic character in the error message appears to be just a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading that character correctly; it gets ignored when I parse the file anyway.

I don't have a full answer to all your questions, but I think I have a generalized solution: always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re-encoding:
Step 1 (put the Tempfile into binmode):
temp = Tempfile.new('temp.csv')
temp.binmode
You will, however, still have one problem: there is now a 3-byte BOM header in your UTF-8 file.
I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip it before uploading.
However, if you set things up as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM, and will return the string correctly whether or not the BOM header is present in the file:
Step 2 (process the file using bom|utf-8):
File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")
This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
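Putting both steps together, here is a minimal sketch of the whole flow (the bucket and key names below are placeholders, not from the original question):
require 'aws-sdk-s3'
require 'csv'
require 'tempfile'

s3  = Aws::S3::Resource.new
obj = s3.bucket('my-bucket').object('reports/export.csv') # placeholder names

temp = Tempfile.new('temp.csv')
temp.binmode                      # Step 1: raw bytes in, no re-encoding by the SDK
obj.get(response_target: temp)
temp.flush                        # make sure buffered bytes hit the disk before reading by path

rows = CSV.read(temp.path, encoding: 'bom|utf-8') # Step 2: a BOM, if present, is stripped transparently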
Another option (from OP)
Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
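A rough sketch of that in-memory route, assuming the object comfortably fits in memory (delete_prefix needs Ruby 2.5+):
require 'csv'

body = obj.get.body.read                                     # ASCII-8BIT (binary) string
text = body.force_encoding('utf-8').delete_prefix("\uFEFF")  # drop a leading BOM if present
rows = CSV.parse(text)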
Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby

I fixed this encoding issue by additionally using File.open(tmp, 'wb'). Here is what it looks like:
s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")
Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
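To then read the CSV back out of that tempfile, something along these lines should work (the bom|utf-8 part only matters if your files may carry a BOM):
require 'csv'
require 'tempfile'

temp = Tempfile.new('download.csv')
writer = File.open(temp.path, 'wb')    # binary mode sidesteps the transcoding error
s3_object.get(response_target: writer)
writer.close                           # make sure everything is flushed to disk
rows = CSV.read(temp.path, encoding: 'bom|utf-8')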

The Ruby SDK docs have an example of downloading an S3 item to the filesystem at https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.
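For reference, that example boils down to something like this (region, bucket and key below are placeholders). When response_target is a path string, the SDK creates and manages the output file itself, so you never hand it a text-mode Tempfile in the first place:
require 'aws-sdk-s3'

s3 = Aws::S3::Client.new(region: 'us-east-1')
s3.get_object(
  response_target: './local-copy.csv',  # a plain path; the SDK opens and writes this file
  bucket: 'my-bucket',
  key: 'reports/export.csv'
)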

Related

Decoding gb-2312 file in colab

I am trying to open a file in Colab that uses gb-2312 encoding. Here is the code I successfully ran in my IDE to read and decode:
file = open(r'file.txt')
opened = file.read()
decoded = opened.encode('latin1').decode('gb2312')
print(decoded)
When I run this code in colab, I get the following error:
'utf-8' codec can't decode byte 0xc6 in position 67: invalid continuation byte
But I can't decode without using read() or list() first, or else I get the following error:
'_io.TextIOWrapper' object has no attribute 'encode'
This seems like a catch-22. Is this a bug with Colab or is there some better way to approach the problem?
The default when opening a file is rt (read, text mode) and uses an OS-specific default encoding returned by locale.getpreferredencoding(False). Use the encoding parameter to override the default (which appears to be utf-8):
with open('file.txt', encoding='gb2312') as file:
    data = file.read()

ASCII incompatible encoding with normal run, not in debug mode

I'm really confused on this one, and maybe it's a bug in Ruby 2.6.2. I have files that were written as UTF-8 with BOM, so I'm using the following:
filelist = Dir.entries(@input_dirname).join(' ')
filelist = filelist.split(' ').grep(/xml/)
filelist.each do |indfile|
  filecontents_tmp = File.read("#{@input_dirname}/#{indfile}", :encoding => 'bom|utf-8')
  puts filecontents_tmp
end
If I put a debug breakpoint at the puts line, my file is read in properly. If I just run the simple script, I get the following error:
in `read': ASCII incompatible encoding needs binmode (ArgumentError)
I'm confused as to why this would work in debug, but not when run normally. Ideas?
Have you tried printing the default encoding when you run the file as opposed to when you debug the file? There are 3 ways to set / change the encoding in Ruby (that I'm aware of), so I wonder if it's different between running the file and debugging. You should be able to tell by printing the default encoding: puts Encoding.default_external.
As for actually fixing the issue, I ran into a similar problem and found this answer, which said to add binmode as an option to the File.open call, and it worked for me.
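In other words, something like this should behave the same in both debug and normal runs regardless of the default external encoding (Dir.glob here is just a simpler stand-in for the Dir.entries/grep combination above):
Dir.glob("#{@input_dirname}/*.xml").each do |path|
  contents = File.read(path, mode: 'rb:bom|utf-8') # binmode plus BOM-aware UTF-8
  puts contents
end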

Loading Ruby scripts in SketchUp: LoadError: (eval):0:in `load': no such file to load

I have been trying to manually load Ruby scripts into SketchUp using load. I always get an error back saying the file does not exist, even though it is right there in the directory.
Here is a sample of my code:
load "H:Document\sclf_color_by_z_1.6.1_1.rbz"
and the error messages:
Error: LoadError: (eval):0:in `load': no such file to load -- H:Document clf_color_by_z_1.6.1_1.rbz>
(eval)
(eval):0
Three issues here:
1. H:Document\sclf_color_by_z_1.6.1_1.rbz is not a valid path. After the drive specifier H: you should have a separator, \, like so: H:\Document\sclf_color_by_z_1.6.1_1.rbz
2. Beware of escape characters in strings when you program. \ is such a character.
To correct your string you'd have to write something like this:
"H:\\Document\\sclf_color_by_z_1.6.1_1.rbz"
https://en.wikibooks.org/wiki/Ruby_Programming/Strings#Escape_sequences
However, note that the convention for Ruby is to use forward slashes - even on Windows: "H:/Document/clf_color_by_z_1.6.1_1.rbz"
3. You are trying to load an RBZ file here. This is not the same as an RB file. An RBZ is a packaged SketchUp extension (actually a ZIP file). To programmatically install an RBZ you must use Sketchup.install_from_archive("H:/Document/clf_color_by_z_1.6.1_1.rbz")
http://www.sketchup.com/intl/en/developer/docs/ourdoc/sketchup#install_from_archive
Note that Sketchup.install_from_archive is nothing like load: it permanently installs the extension into SketchUp, whereas load would only load it for that session.
Whenever you have a file path that you think should be on disk, ask the system whether it can find it: File.exist?("H:\Document\sclf_color_by_z_1.6.1_1.rbz"). If that returns false, you know you need to carefully check your path again for syntax errors and typos.
You should use the File.join method. You can't use load for a .rbz file, but you can use Sketchup.install_from_archive() as thomthom said.
So in your case you can simply do:
file = File.join('H:', 'Document', 'sclf_color_by_z_1.6.1_1.rbz')
Sketchup.install_from_archive(file)

Reading a JSON file in Ruby prepending unknown characters

I have a simple JSON file like this:
{
"env": "Development",
"app_host": "https://localhost:3455",
"server_host": "localhost",
"server_port": "3455"
}
When I read this file using the code below, the output contains some unknown characters at the beginning.
contents = IO.read('config.json')
puts contents
output:
∩╗┐{
"env": "Development",
"app_host": "https://localhost:3455",
"server_host": "localhost",
"server_port": "3455"
}
Can someone let me know how to fix this?
These characters are the bytes of a UTF-8 byte order mark (BOM), being displayed as code page 437 characters.
From your comment, it seems Visual Studio is inserting a BOM into the files. When you then read the file in and display it in your console, the BOM shows up as ∩╗┐, since your console's encoding is set to CP437, and the three bytes that make up the BOM in UTF-8 (0xEF, 0xBB, 0xBF) correspond to those characters in that encoding.
You should probably look into changing the encoding your console is using, as well as seeing if you can configure VS not to add the BOM (I’m not on Windows so I don’t know how you would do either of those).
From the Ruby side, you could specify the encoding in your call to IO.read like this:
IO.read('config.json', :encoding => 'bom|utf-8')
This will strip the BOM when reading the file.
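So, assuming the goal is to parse the config afterwards, something like this should work (JSON.parse will typically refuse a string that still starts with a BOM, so stripping it matters):
require 'json'

contents = IO.read('config.json', :encoding => 'bom|utf-8')
config = JSON.parse(contents)
puts config['app_host'] # => "https://localhost:3455"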

Rails parse upload file "\xDE" from ASCII-8BIT to UTF-8

I'm trying to parse an uploaded *.txt file and get some information to import into the DB. But before saving it I try to get the string in UTF-8 format. When I do that I get this error:
"\xDE" from ASCII-8BIT to UTF-8
The first characters of the file:
Import data \xDE\xE4\xE5
The code before parsing:
# encoding: utf-8
require "iconv"
class HandlerController < ApplicationController
  def add_report
    utf8_format = "UTF-8"
    file_data = params[:import_file].tempfile.read.encode(utf8_format)
  end
end
P.S. I also tried doing this with iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255 - it cannot be converted to Unicode.
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1") which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
