L O S T :( Dumped some data with Ruby YAML, can't read it back

so I saved to disk some objects using the following code (this is Ruby 1.9.2 on Windows BTW):
open('1.txt', "wb") { |file|
  file.write(YAML::dump(results))
}
Now I'm trying to get that data back, but I get 'invalid byte sequence in UTF-8 (ArgumentError)'. I've tried everything I could think of to convert the data to a different format, but no luck. For example:
open('1.txt', 'rb') { |f| a1 = YAML::load(f.read) }
a1.each do |a|
  JSON.generate(a)
end
results in:
C:/m/ruby-1.9.2-p0-i386-mingw32/lib/ruby/1.9.1/json/common.rb:212:in `match':
invalid byte sequence
in UTF-8 (ArgumentError)
from C:/m/ruby-1.9.2-p0-i386-mingw32/lib/ruby/1.9.1/json/common.rb:212:in `generate'
from C:/m/ruby-1.9.2-p0-i386-mingw32/lib/ruby/1.9.1/json/common.rb:212:in `generate'
from merge3.rb:31:in `block in <main>'
from merge3.rb:29:in `each'
from merge3.rb:29:in `<main>'
What can I do?
EDIT: from the file:
---
- !ruby/object:Product
  name: HSF
- !ruby/object:Product
  name: "almer\xA2n"
The first product works OK, but the second gives the exception.

This is probably an encoding mismatch. You could try this:
Encoding.default_external = 'BINARY'
This reads the file in raw, without interpreting it as UTF-8. The stray byte is presumably an accented character from some ISO-8859-1-style single-byte encoding.
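A minimal sketch of what binary-mode reading buys you (the file name and contents here are made up for illustration):

```ruby
# Write a file containing a byte that is invalid in UTF-8 (illustrative data).
File.binwrite("sample.txt", "almer\xA2n".b)

# Reading in binary mode yields an ASCII-8BIT (BINARY) string: no UTF-8
# validation is performed, so no ArgumentError can be raised here.
data = File.read("sample.txt", mode: "rb")
data.encoding         # => #<Encoding:ASCII-8BIT>
data.valid_encoding?  # => true (every byte sequence is valid BINARY)
```

You still have to decide what the bytes mean before converting them to UTF-8, but at least the raw data is readable again.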

You need to read the file back in using the same encoding you used to write it out, obviously. Since you don't specify an encoding in either case, you end up with an environment-dependent encoding outside of your control, which is why you should always specify the encoding explicitly.
The snippet you posted is clearly not valid UTF-8, so the fact that you get an exception is perfectly appropriate.
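A sketch of the round trip with the encoding pinned on both sides (the file name and sample data are illustrative):

```ruby
require 'yaml'

results = [{ "name" => "almer\u00A2n" }]  # U+00A2 is the cent sign

# Pin UTF-8 explicitly on write and on read, so the round trip does not
# depend on Encoding.default_external.
File.open("data.yml", "w:UTF-8") { |f| f.write(YAML.dump(results)) }
restored = File.open("data.yml", "r:UTF-8") { |f| YAML.load(f.read) }

restored == results  # => true
```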

I'm not sure if this is what you're after, but currently your YAML file looks like:
---
- !ruby/object:Product
  name: HSF
- !ruby/object:Product
  name: "almer\xA2n"
If you remove the !ruby/object:Product from the array lines you'll get an array of hashes:
---
- name: HSF
- name: "almer\xA2n"
results in:
YAML::load_file('test.yaml') #=> [{"name"=>"HSF"}, {"name"=>"almer\xA2n"}]
If I print the second element's value with my terminal set to a Windows character set, I see the cent sign. So, if you're trying to regain access to the data, all you have to do is a bit of manipulation of the data file.
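If the stray byte came from a Windows code page (a guess, but \xA2 is the cent sign in Windows-1252), the loaded value can also be repaired in place by reinterpreting and transcoding it:

```ruby
# The raw byte sequence as loaded from the file (illustrative).
name = "almer\xA2n".b

# Tag the bytes with their presumed source encoding, then transcode to UTF-8.
fixed = name.force_encoding("Windows-1252").encode("UTF-8")
fixed  # => "almer¢n"
```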

Related

File encoding issue when downloading file from AWS S3

I have a CSV file in AWS S3 that I'm trying to open in a local temp file. This is the code:
s3 = Aws::S3::Resource.new
bucket = s3.bucket({bucket name})
obj = bucket.object({object key})
temp = Tempfile.new('temp.csv')
obj.get(response_target: temp)
It pulls the file from AWS and loads it in a new temp file called 'temp.csv'. For some files, the obj.get(..) line throws the following error:
WARN: Encoding::UndefinedConversionError: "\xEF" from ASCII-8BIT to UTF-8
WARN: /Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `write'
/Users/.rbenv/versions/2.5.0/lib/ruby/2.5.0/delegate.rb:349:in `block in delegating_block'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/http/response.rb:62:in `signal_data'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-core-3.21.2/lib/seahorse/client/net_http/handler.rb:83:in `block (3 levels) in transmit'
...
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/client.rb:2666:in `get_object'
/Users/.rbenv/versions/2.5.0/lib/ruby/gems/2.5.0/gems/aws-sdk-s3-1.13.0/lib/aws-sdk-s3/object.rb:657:in `get'
Stacktrace shows the error initially gets thrown by the .get from the AWS SDK for Ruby.
Things I've tried:
When uploading the file (object) to AWS S3, you can specify content_encoding, so I tried setting that to UTF-8:
obj.upload_file({file path}, content_encoding: 'utf-8')
Also when you call .get you can set response_content_encoding:
obj.get(response_target: temp, response_content_encoding: 'utf-8')
Neither of those works; both result in the same error as above. I would really expect that to do the trick. In the AWS S3 dashboard I can see that the content encoding is indeed set correctly via the code, but it doesn't appear to make a difference.
It does work when I do the following, in the first code snippet above:
temp = Tempfile.new('temp.csv', encoding: 'ascii-8bit')
But I'd prefer to upload and/or download the file from AWS S3 with the proper encoding. Can someone explain why specifying the encoding on the tempfile works? Or how to make it work through the AWS S3 upload/download?
Important to note: The problematic character in the error message appears to just be a random symbol added at the beginning of this auto-generated file I'm working with. I'm not worried about reading the character correctly, it gets ignored when I parse the file anyways.
I don't have a full answer to all your questions, but I think I have a generalized solution, and that is to always put the temp file into binary mode. That way the AWS gem will simply dump the data from the bucket into the file, without any further re/encoding:
Step 1 (put the Tempfile into binmode):
temp = Tempfile.new('temp.csv')
temp.binmode
You will however have a problem, and that is the fact that there is a 3-byte BOM header in your UTF-8 file now.
I don't know where this BOM came from. Was it there when the file was uploaded? If so, it might be a good idea to strip the 3 byte BOM before uploading.
However, if you set up your system as below, it will not matter, because Ruby supports transparent reading of UTF-8 with or without BOM, and will return the string correctly regardless of if the BOM header is in the file or not:
Step 2 (process the file using bom|utf-8):
File.read(temp.path, encoding: "bom|utf-8")
# or...
CSV.read(temp.path, encoding: "bom|utf-8")
This should cover all your bases I think. Whether you receive files encoded as BOM + UTF-8 or plain UTF-8, you will process them correctly this way, without any extra header characters appearing in the final string, and without errors when saving them with AWS.
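A small self-contained demonstration of the "bom|utf-8" behaviour described above (file names and contents are made up):

```ruby
# A UTF-8 file with a 3-byte BOM header, and the same content without one.
File.binwrite("with_bom.csv", "\xEF\xBB\xBFname\nfoo\n")
File.binwrite("no_bom.csv", "name\nfoo\n")

# "bom|utf-8" strips the BOM when present and is a no-op otherwise,
# so both reads return the same string.
a = File.read("with_bom.csv", encoding: "bom|utf-8")
b = File.read("no_bom.csv", encoding: "bom|utf-8")
a == b  # => true; neither string starts with a BOM
```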
Another option (from OP)
Use obj.get.body instead, which will bypass the whole issue with response_target and Tempfile.
Useful references:
Is there a way to remove the BOM from a UTF-8 encoded file?
How to avoid tripping over UTF-8 BOM when reading files
What's the difference between UTF-8 and UTF-8 without BOM?
How to write BOM marker to a file in Ruby
I fixed this encoding issue by additionally using File.open(tmp, 'wb'). Here is how it looks:
s3_object = Aws::S3::Resource.new.bucket("bucket-name").object("resource-key")
Tempfile.new.tap do |file|
  s3_object.get(response_target: File.open(file, "wb"))
end
The Ruby SDK docs have an example of downloading an S3 item to the filesystem in https://docs.aws.amazon.com/sdk-for-ruby/v3/developer-guide/s3-example-get-bucket-item.html. I just ran it and it works fine.

Why is Ruby failing to convert CP-1252 to UTF-8?

I have CSV files saved from Excel which are CP-1252/Windows-1252. I tried the following, but the output still comes out corrupted. Why?
csv_text = File.read(arg[:file], encoding: 'cp1252').encode('utf-8')
# csv_text = File.read(arg[:file], encoding: 'cp1252')
csv = CSV.parse csv_text, :headers => true
csv.each do |row|
  # create model
  p model
end
The result
>rake import:csv["../file.csv"] | grep Brien
... name: "Oâ?TBrien ...
However it works in the console
> "O\x92Brien".force_encoding("cp1252").encode("utf-8")
=> "O'Brien"
I can open the CSV file in Notepad++, choose Encoding > Character Sets > Western European > Windows-1252, see the correct characters, and then do Encoding > Convert to UTF-8. However, there are many files and I want Ruby to handle this.
Similar: How to change the encoding during CSV parsing in Rails. But this doesn't explain why this is failing.
Ruby 2.4, Reference: https://ruby-doc.org/core-2.4.3/IO.html#method-c-read
Wow, it was caused by the shitty grep in DevKit.
>rake import:csv["../file.csv"]
... name: "O'Brien ...
>where grep
C:\DevKit2\bin\grep.exe
I also did not need the .encode('utf-8').
Let that be a lesson kids. Never take anything for granted. Trust no one!
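For what it's worth, the conversion itself can be verified in isolation, away from rake and grep (the file name and byte are illustrative; 0x92 is the right single quote in CP-1252):

```ruby
# A CP-1252 byte sequence as Excel might write it.
File.binwrite("file.csv", "O\x92Brien".b)

# Read with the source encoding declared, then transcode to UTF-8.
text = File.read("file.csv", encoding: "cp1252").encode("utf-8")
text  # => "O’Brien"
```

If this prints correctly but your pipeline output doesn't, the problem is downstream of Ruby, as it was here.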

ruby open pathname with specified encoding

I am trying to open files telling Ruby 1.9.3 to treat them as UTF-8 encoding.
require 'pathname'
Pathname.glob("/Users/Wes/Desktop/uf2/*.ics").each { |f|
  puts f.read(["encoding:UTF-8"])
}
The class documentation goes through several levels of indirection, so I am not sure I am specifying the encoding properly. When I try it, however, I get this error message
ICS_scanner_strucdoc.rb:4:in `read': can't convert Array into Integer (TypeError)
from ICS_scanner_strucdoc.rb:4:in `read'
from ICS_scanner_strucdoc.rb:4:in `block in <main>'
from ICS_scanner_strucdoc.rb:3:in `each'
from ICS_scanner_strucdoc.rb:3:in `<main>'
This error message leads me to believe that read is trying to interpret the open_args as the optional leading argument, which would be the length of the read.
If I put the optional parameters in, as in puts f.read(100000, 0, ["encoding:UTF-8"]) I get an error message that says there are too many arguments.
What is the appropriate way to specify only the encoding? Would it be correct to say that this is an inconsistency between the documentation and the behavior of the class?
Mac OS 10.8
rvm current reports "ruby-1.9.3-p484"
I'm not sure whether you want to specify the encoding for the path name or for the file itself.
If it's the latter, this may be what you want:
Pathname.glob("/Users/Wes/Desktop/uf2/*.ics").each { |f|
  puts File.open(f, "r:UTF-8") { |io| io.read }
}
With Pathname#read you can write it like this:
Pathname.glob("/Users/Wes/Desktop/uf2/*.ics").each do |f|
  path = Pathname(f)
  puts path.read(encoding: "UTF-8")
end
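The original error comes from passing an array: Pathname#read forwards its arguments to File.read, so the encoding belongs in an options hash. A self-contained sketch (the paths are made up via a temp directory):

```ruby
require 'pathname'
require 'tmpdir'

dir = Pathname(Dir.mktmpdir)
(dir + "event.ics").write("BEGIN:VCALENDAR\n")

dir.glob("*.ics").each do |f|
  text = f.read(encoding: "UTF-8")  # options hash, not an array
  puts text.encoding                # prints UTF-8
end
```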

Rails parse upload file "\xDE" from ASCII-8BIT to UTF-8

I'm trying to parse an uploaded *.txt file to get some information to import into the DB. But before saving it I try to convert the string to UTF-8. When I do that I get the error:
"\xDE" from ASCII-8BIT to UTF-8
First file characters
Import data \xDE\xE4\xE5
Before parse code
# encoding: utf-8
require "iconv"

class HandlerController < ApplicationController
  def add_report
    utf8_format = "UTF-8"
    file_data = params[:import_file].tempfile.read.encode(utf8_format)
  end
end
P.S. I also tried doing that with iconv, but it didn't help.
You need to start from a known encoding with valid content (and compatible characters for input and output) before you will be able to successfully convert a string.
ASCII-8BIT doesn't assign Unicode-compatible characters to values 128..255 - it cannot be converted to Unicode.
The chances are that the input - as you say it is text - is in some other encoding to start with. You could start by assuming ISO-8859-1 ("Latin-1") which is quite a common encoding, although you may have some other clue, or know what characters to expect in the file, in which case you should try others.
I suggest you try something like this:
file_data = params[:import_file].tempfile.read.force_encoding('ISO-8859-1')
utf8_file_data = file_data.encode(utf8_format)
This probably will not give you an error, but if my guess at 'ISO-8859-1' is wrong, it will give you gibberish unfortunately.
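Taking the first bytes from the question as a worked example (still assuming ISO-8859-1, which is a guess):

```ruby
# The raw bytes as read from the tempfile (ASCII-8BIT).
raw = "Import data \xDE\xE4\xE5".b

# Declare the presumed source encoding, then transcode to UTF-8.
utf8 = raw.force_encoding("ISO-8859-1").encode("UTF-8")
utf8  # => "Import data Þäå"
```

If the decoded characters look plausible for your data, the guess was right; if they look like gibberish, try another single-byte encoding such as Windows-1251 or CP-1252.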

How to avoid undefined method error for NilClass

I use the dbf gem to read data out of a DBF file. I wrote some code:
# encoding: UTF-8
require 'dbf'
widgets = DBF::Table.new("patient.dbf")
widgets.each do |record|
  puts record.vorname
end
Basically the code works, but after Ruby writes about 400 record.vorname values to the console I get this error:
...
Gisela
G?nter
mycode.rb:5:in `block in <main>': undefined method `vorname' for nil:NilClass (NoMethodError)
from C:/RailsInstaller/Ruby1.9.3/lib/ruby/gems/1.9.1/gems/dbf-2.0.6/lib/dbf/table.rb:101:in `block in each'
......
My question is: how can I avoid this error? It would also be interesting to know why (as you can see in the error output) the record.vorname values containing ä, ö, ü are displayed as ?, e.g.:
Günter is transformed to G?nter
Thanks
For some reason, your DBF driver returns nil records. You can pretend that this problem doesn't exist by skipping those.
widgets.each do |record|
  puts record.vorname if record
end
About your question on the wrong characters, according to the dbf documentation:
Encodings (Code Pages)

dBase supports encoding non-English characters in different formats. Unfortunately, the format used is not always set, so you may have to specify it manually. For example, you have a DBF file from Russia and you are getting bad data. Try using the 'Russian OEM' encoding:

table = DBF::Table.new('dbf/books.dbf', nil, 'cp866')

See doc/supported_encodings.csv for a full list of supported encodings.
So make sure you use the right encoding to read from the DB.
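The same repair can be sketched on a plain string (the byte is illustrative; 0xFC is "ü" in CP-1252/Latin-1 — but note the real fix is passing the encoding to DBF::Table.new as shown above, so the gem decodes every field for you):

```ruby
# A single-byte-encoded name as it might come out of the DBF file.
raw = "G\xFCnter".b

# Declare the presumed code page, then transcode to UTF-8.
fixed = raw.force_encoding("cp1252").encode("UTF-8")
fixed  # => "Günter"
```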
To avoid the NoMethodError for nil:NilClass you can probably try this:
require 'dbf'

widgets = DBF::Table.new("patient.dbf")
widgets.each do |record|
  puts record.vorname unless record.blank?
end
