Reading a JSON file in Ruby prepending unknown characters - ruby

I have a simple JSON file like this:
{
"env": "Development",
"app_host": "https://localhost:3455",
"server_host": "localhost",
"server_port": "3455"
}
When I read this file using the code below, the output contains some unknown characters at the beginning.
contents = IO.read('config.json')
puts contents
output:
∩╗┐{
"env": "Development",
"app_host": "https://localhost:3455",
"server_host": "localhost",
"server_port": "3455"
}
Can someone let me know how to fix this?

These characters are the bytes of a UTF-8 byte order mark (BOM), being displayed as code page 437 characters.
From your comment, it seems Visual Studio is inserting a BOM into the files. When you then read the file and print it to your console, the BOM shows up as ∩╗┐ because your console's encoding is set to CP437, and the three bytes that make up the BOM in UTF-8 (0xEF, 0xBB, 0xBF) correspond to those characters in that encoding.
You should probably look into changing the encoding your console is using, as well as seeing if you can configure VS not to add the BOM (I’m not on Windows so I don’t know how you would do either of those).
From the Ruby side, you could specify the encoding in your call to IO.read like this:
IO.read('config.json', :encoding => 'bom|utf-8')
This will strip the BOM when reading the file.
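As a rough illustration of both points, here is a minimal sketch. The file content and the helper names (bom_as_cp437, read_with_and_without_bom_option) are made up for the example; it writes a BOM-prefixed file to a temp location, shows what a CP437 console would make of the BOM bytes, and compares a plain UTF-8 read with a 'bom|utf-8' read.

```ruby
require "tempfile"

BOM = "\xEF\xBB\xBF".b # the three bytes of a UTF-8 byte order mark

# Reinterpret the BOM bytes the way a CP437 console would display them.
def bom_as_cp437
  BOM.dup.force_encoding("IBM437").encode("UTF-8")
end

# Write a BOM-prefixed JSON file (hypothetical content), then read it back
# once as plain UTF-8 and once with the BOM-aware encoding.
def read_with_and_without_bom_option
  Tempfile.create(["config", ".json"]) do |f|
    f.binmode
    f.write(BOM + '{"env": "Development"}'.b)
    f.flush
    plain    = IO.read(f.path, :encoding => "utf-8")
    stripped = IO.read(f.path, :encoding => "bom|utf-8")
    [plain, stripped]
  end
end

plain, stripped = read_with_and_without_bom_option
puts bom_as_cp437                  # => ∩╗┐
puts plain.bytes.first(3).inspect  # => [239, 187, 191]
puts stripped[0]                   # => {
```

The plain read keeps the BOM as the first character of the string; the 'bom|utf-8' read drops it, so the JSON starts with the opening brace.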

Related

File encoding issue when downloading file from AWS S3

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:
s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)
It pulls the file from AWS and loads it in a new temp file called 'temp.csv'. For some files, the obj.get(..) line throws the following error:
WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'
Stacktrace shows the error initially gets thrown by the .get from the AWS SDK for Ruby.
Things I've tried:
When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:
obj.upload_file({file path}, content_encoding: 'utf-8')
Also when you call .get you can set response_content_encoding:
obj.get(response_target: temp, response_content_encoding: 'utf-8')
Neither of those works; both result in the same error as above. I would really expect that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.
It does work when I do the following, in the first code snippet above:
temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')
But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?
Important to note: The problematic character in the error message appears to just be a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly, it gets ignored when I parse the file anyways.
I don't have a full answer to all your questions, but I think I have a generalized solution, and that is to always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re-encoding:
Step 1 (put the Tempfile into binmode):
temp = Tempfile.new('temp.csv')
temp.binmode
You will however have a problem, and that is the fact that there is a 3-byte BOM header in your UTF-8 file now.
I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3 byte BOM before uploading.
However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without a BOM, and will return the string correctly whether or not the BOM header is in the file:
Step 2 (process the file using bom|utf-8):
File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")
This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
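The two steps can be tried without touching S3. In this sketch, write_chunks is a hypothetical stand-in for obj.get(response_target: temp), on the assumption that the SDK writes raw binary chunks to the target; download_and_read and the sample data are also invented for illustration.

```ruby
require "tempfile"

# Hypothetical stand-in for the SDK writing response chunks to the target.
# Chunks are raw (ASCII-8BIT) bytes, as they would come off the wire.
def write_chunks(target, chunks)
  chunks.each { |chunk| target.write(chunk) }
  target.flush
end

def download_and_read(chunks)
  temp = Tempfile.new("temp.csv")
  temp.binmode                                # step 1: store the bytes untouched
  write_chunks(temp, chunks)
  File.read(temp.path, encoding: "bom|utf-8") # step 2: decode, dropping any BOM
ensure
  temp.close! if temp                         # close and delete the temp file
end

with_bom    = download_and_read(["\xEF\xBB\xBFid,na".b, "me\n1,caf\xC3\xA9\n".b])
without_bom = download_and_read(["id,name\n1,caf\xC3\xA9\n".b])
puts with_bom == without_bom  # => true
puts with_bom.encoding        # => UTF-8
```

Both inputs end up as the same UTF-8 string, so downstream parsing does not need to know whether the uploaded file carried a BOM.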
Another option (from OP)
Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby
I fixed this encoding issue by additionally using File.open(tmp, 'wb'). Here is what it looks like:
s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")
Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
The Ruby SDK docs have an example of downloading an S3 item to the filesystem in https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.

Rails parse upload file "\xDE" from ASCII-8BIT to UTF-8

I am trying to parse an uploaded *.txt file and extract some information to import into the DB. Before saving it, I try to convert the string to UTF-8. When I do that, I get this error:
"\xDE" from ASCII-8BIT to UTF-8
First file characters
Import data \xDE\xE4\xE5
Code that runs before parsing
# encoding: utf-8
require "iconv"
class HandlerController < ApplicationController
  def add_report
    utf8_format = "UTF-8"
    file_data = params[:import_file].tempfile.read.encode(utf8_format)
  end
end
P.S. I also tried doing this with iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255 - it cannot be converted to Unicode.
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1") which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
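To make the "gibberish" risk concrete, here is a small sketch using the bytes quoted in the question. The helper to_utf8 is invented for the example, and Windows-1251 is only my assumption about what the file might really be; the question does not say.

```ruby
# The first bytes shown in the question, as a binary (ASCII-8BIT) string.
RAW = "Import data \xDE\xE4\xE5".b

# Label the bytes with a guessed source encoding, then transcode to UTF-8.
def to_utf8(bytes, source_encoding)
  bytes.dup.force_encoding(source_encoding).encode("UTF-8")
end

begin
  RAW.encode("UTF-8") # no known source encoding, so this fails
rescue Encoding::UndefinedConversionError => e
  puts e.message      # the kind of error reported in the question
end

latin1 = to_utf8(RAW, "ISO-8859-1")
cp1251 = to_utf8(RAW, "Windows-1251") # assumption: a Cyrillic code page
puts latin1 # => Import data Þäå
puts cp1251 # => Import data Юде
```

Both conversions succeed and both results are valid UTF-8, so Ruby cannot tell you which guess is right; only knowing what text to expect in the file can.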

File.exist? not working when directory name has special characters

File.exist? is not working when the directory name has special characters. For something like the path below
path = "/home/cis/Desktop/'El%20POP%20que%20llevas%20dentro%20Vol.%202'/*.mp3"
it works fine, but if the name has letters like ñ it returns false.
Please help with this.
Try the following:
Make sure you're running Ruby 1.9.2 or later and put # encoding: UTF-8 at the top of your file (the file itself must be saved as UTF-8, and your editor must support that).
If you're running MRI (i.e. not JRuby or another implementation), you can set the environment variable RUBYOPT=-Ku instead of adding # encoding: UTF-8 to the top of each file.

Automatically open a file as binary with Ruby

I'm using Ruby 1.9 to open several files and copy them into an archive. Now there are some binary files, but some are not. Since Ruby 1.9 does not open binary files automatically as binaries, is there a way to open them automatically anyway? (So ".class" would be binary, ".txt" not)
Actually, the previous answer by Alex D is incomplete. While it's true that there is no "text" mode in Unix file systems, Ruby does make a difference between opening files in binary and non-binary mode:
s = File.open('/tmp/test.jpg', 'r') { |io| io.read }
s.encoding
=> #<Encoding:UTF-8>
is different from (note the "rb")
s = File.open('/tmp/test.jpg', 'rb') { |io| io.read }
s.encoding
=> #<Encoding:ASCII-8BIT>
The latter, as the docs say, sets the external encoding to ASCII-8BIT, which tells Ruby not to attempt to interpret the result as UTF-8. You can achieve the same thing by setting the encoding explicitly with s.force_encoding('ASCII-8BIT'). This is key if you want to read binary data into a string and move it around (e.g. saving it to a database).
Since Ruby 1.9.1 there is a separate method for binary reading (IO.binread) and since 1.9.3 there is one for writing (IO.binwrite) as well:
For reading:
content = IO.binread(file)
For writing:
IO.binwrite(file, content)
Since IO is the parent class of File, you could also do the following which is probably more expressive:
content = File.binread(file)
File.binwrite(file, content)
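A quick round trip shows the effect. This is only a sketch: binary_copy and the file names are invented for the example, and the sample bytes include a CR LF pair and a NUL to check that nothing is converted along the way.

```ruby
require "tmpdir"

# Copy a file byte-for-byte and return the content that was copied.
def binary_copy(src, dst)
  content = File.binread(src) # binary string, no newline or encoding conversion
  File.binwrite(dst, content)
  content
end

Dir.mktmpdir do |dir|
  src = File.join(dir, "sample.bin") # hypothetical file names
  dst = File.join(dir, "copy.bin")
  File.binwrite(src, "\xFF\xD8\xFF\r\n\x00".b)

  copied = binary_copy(src, dst)
  puts copied.encoding == Encoding::BINARY      # => true (BINARY is an alias of ASCII-8BIT)
  puts File.binread(dst).bytes == copied.bytes  # => true
end
```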
On Unix-like platforms, there is no difference between opening files in "binary" and "text" modes. On Windows, "text" mode converts line breaks to DOS style, and "binary" mode does not.
Unless you need linebreak conversion on Windows platforms, just open all the files in "binary" mode. There is no harm in reading a text file in "binary" mode.
If you really want to distinguish, you will have to match File.extname(filename) against a list of known extensions like ".txt" and ".class".
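If you do go the extension route, it could look roughly like this. The extension list and the method names (read_mode_for, read_for_archive) are placeholders of mine, not anything Ruby provides; anything not on the list is treated as binary, which is the safer default.

```ruby
# Assumption: this list is illustrative only; extend it for your own files.
TEXT_EXTENSIONS = %w[.txt .csv .json .xml .md].freeze

# "r" for known text extensions, "rb" for everything else.
def read_mode_for(filename)
  TEXT_EXTENSIONS.include?(File.extname(filename).downcase) ? "r" : "rb"
end

# Read a file with the mode chosen from its extension.
def read_for_archive(filename)
  File.open(filename, read_mode_for(filename)) { |io| io.read }
end

puts read_mode_for("Main.class") # => rb
puts read_mode_for("notes.TXT")  # => r
```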

Does Ruby auto-detect a file's codepage?

If I save a text file containing the character б (U+0431) as an ANSI code page file, Ruby returns ord = 63. Saving the file as UTF-8 returns ord = 208, 177.
Should I be specifically telling Ruby to handle the input encoded with a certain code page? If so, how do you do this?
Is that in Ruby source code or in a file that is read with File.open? If it's in the Ruby source code, you can (in Ruby 1.9) add this to the top of the file:
# encoding: utf-8
Or you could specify most other encodings (like iso-8859-1).
If you are reading a file with File.open, you could do something like this:
File.open("file.txt", "r:utf-8") {|f| ... }
As with the encoding comment, you can pass in different types of encodings here too.
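For a file saved in a legacy code page, the mode string can also name both the on-disk encoding and the encoding you want back. A minimal sketch, assuming the file is Windows-1251 (a guess for illustration; "ANSI" depends on the machine's locale) and using an invented helper, read_as_utf8:

```ruby
require "tempfile"

# Read a file, declaring its on-disk encoding and asking for UTF-8 strings back.
def read_as_utf8(path, external)
  File.open(path, "r:#{external}:utf-8") { |f| f.read }
end

Tempfile.create("sample") do |f|
  f.binmode
  f.write("\xE1".b) # the single byte for U+0431 in Windows-1251 (assumed code page)
  f.flush

  text = read_as_utf8(f.path, "windows-1251")
  puts text.ord            # => 1073 (0x431)
  puts text.bytes.inspect  # => [208, 177]
end
```

Ruby does not detect the code page on its own; it decodes the bytes according to whatever encoding you declare.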

Resources