Using aws-sdk to download files from s3. Encoding not right - ruby

I am trying to use aws-sdk to load S3 files to local disk, and wondering why my PDF file (which just contains the text SAMPLE PDF) turns out apparently empty.
I guess it has something to do with the encoding... but how can I fix it?
Here is my code :
require 'aws-sdk'
bucket_name = "****"
access_key_id = "***"
secret_access_key = "**"
s3 = AWS::S3.new(
  access_key_id: access_key_id,
  secret_access_key: secret_access_key)
b = s3.buckets[bucket_name]
filen = File.basename("Sample.pdf")
path = "original/90/#{filen}"
o = b.objects[path]
require 'tempfile'
ext= File.extname(filen)
file = File.open("test.pdf", "w", encoding: "ascii-8bit")
# streaming download from S3 to a file on disk
o.read do |chunk|
  file.write(chunk)
end
file.close
If I take out the encoding: "ascii-8bit", I just get the error Encoding::UndefinedConversionError: "\xC3" from ASCII-8BIT to UTF-8

After some research and a tip from a cousin of mine, I finally got this to work.
Instead of using the AWS solution to load the file from Amazon and write it to disk (which was generating a strange PDF file: apparently identical to the original, but with blank content, and Adobe Reader "fixing" it when opening),
I am now using open-uri, with SSL verification disabled.
Here is the final code which made my day:
require 'open-uri'
open('test.pdf', 'wb') do |file|
  file << open('https://s3.amazonaws.com/mybucket/Sample.pdf', :ssl_verify_mode => OpenSSL::SSL::VERIFY_NONE).read
end
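In hindsight, the aws-sdk route can work too: the Encoding::UndefinedConversionError comes from pushing binary PDF bytes through a UTF-8 text-mode file handle, and opening the destination in binary mode ("wb") avoids any transcoding. A minimal sketch, with a stand-in byte string in place of the S3 o.read call:

```ruby
# Stand-in for the bytes returned by o.read; ".b" tags the string
# as ASCII-8BIT, which is effectively what an S3 response body is.
pdf_bytes = "%PDF-1.4\n\xC3\x28".b

# "wb" = write + binary: no newline translation, no transcoding,
# so the Encoding::UndefinedConversionError cannot occur.
File.open("test.pdf", "wb") { |f| f.write(pdf_bytes) }

File.binread("test.pdf") == pdf_bytes  # => true
```

The same applies to the streaming form: as long as the handle is binary, each chunk is written byte-for-byte.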

Related

Ruby ZIP file encoding for sending to Sidekiq / Redis

I build a ZIP file using the following code:
def compress_batch(directory_path)
  zip_file_path = File.join(File.expand_path("..", directory_path), SecureRandom.hex(10))
  Zip::File.open(zip_file_path, Zip::File::CREATE) do |zip_file|
    (Dir.entries(directory_path) - %w(. ..)).each do |file_name|
      zip_file.add file_name, File.join(directory_path, file_name)
    end
  end
  result = File.open(zip_file_path, 'rb').read
  File.unlink(zip_file_path)
  result
end
I store that ZIP file in memory:
@result = Payoff::DataFeed::Compress::ZipCompress.new.compress_batch(source_path)
I put it into a hash:
options = {
  data: @result
}
Then I submit it to my Sidekiq worker using perform_async:
DeliveryWorker.perform_async(options)
and get the following error:
[DEBUG] Starting store to: { "destination" => "sftp", "path" => "INBOUND/20191009.zip" }
Encoding::UndefinedConversionError: "\xBA" from ASCII-8BIT to UTF-8
from ruby/2.3.0/gems/activesupport-4.2.10/lib/active_support/core_ext/object/json.rb:34:in `encode'
However, if I use .new.perform instead of .perform_async, bypassing Sidekiq, it works fine!
DeliveryWorker.new.perform(options)
My best guess is that there is something wrong with my encoding, such that the job blows up on its way to Sidekiq / Redis. How should I have encoded it? Do I need to change how I create the ZIP file? Maybe I can convert the encoding upon submission to Sidekiq?
Sidekiq serializes arguments as JSON. You are trying to stuff binary data into JSON, which only supports UTF-8 strings. You will need to Base64 encode the data if you wish to pass it through Redis.
require 'base64'
encoded = Base64.encode64(filedata)
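A round-trip sketch of that advice (the worker call is shown commented out, since DeliveryWorker is the asker's own class):

```ruby
require 'base64'

zip_bytes = "PK\x03\x04\xBA\xFF".b  # stand-in for compress_batch's binary result

# strict_encode64 avoids the newlines that plain encode64 inserts
encoded = Base64.strict_encode64(zip_bytes)

# DeliveryWorker.perform_async(data: encoded)  # pure-ASCII, JSON-safe payload

# ...and inside the worker's #perform:
decoded = Base64.strict_decode64(encoded)
decoded == zip_bytes  # => true
```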

`scan': invalid byte sequence in UTF-8 (ArgumentError)

I'm trying to read a .txt file in ruby and split the text line-by-line.
Here is my code:
def file_read(filename)
  File.open(filename, 'r').read
end
puts f = file_read('alice_in_wonderland.txt')
This works perfectly. But when I add the method line_cutter like this:
def file_read(filename)
  File.open(filename, 'r').read
end

def line_cutter(file)
  file.scan(/\w/)
end
puts f = line_cutter(file_read('alice_in_wonderland.txt'))
I get an error:
`scan': invalid byte sequence in UTF-8 (ArgumentError)
I found a suggested fix online for untrusted input and tried to use it in my own code, but it's not working. How can I get rid of this error?
Link to the file: File
The linked text file contains the following line:
Character set encoding: ISO-8859-1
If converting it isn't desired or possible then you have to tell Ruby that this file is ISO-8859-1 encoded. Otherwise the default external encoding is used (UTF-8 in your case). A possible way to do that is:
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1')
s.encoding # => #<Encoding:ISO-8859-1>
Or even like this if you prefer your string UTF-8 encoded (see utf8everywhere.org):
s = File.read('alice_in_wonderland.txt', encoding: 'ISO-8859-1:UTF-8')
s.encoding # => #<Encoding:UTF-8>
It seems to work if you read the file directly from the page, maybe there's something funny about the local copy you have. Try this:
require 'net/http'
uri = 'http://www.ccs.neu.edu/home/vip/teach/Algorithms/7_hash_RBtree_simpleDS/hw_hash_RBtree/alice_in_wonderland.txt'
scanned = Net::HTTP.get_response(URI.parse(uri)).body.scan(/\w/)
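Folding the encoding fix back into the original helpers, with a small generated file standing in for alice_in_wonderland.txt:

```ruby
# Create a stand-in file containing "café latte" in ISO-8859-1
# (\xE9 is é in that encoding, but an invalid byte in UTF-8).
File.binwrite('sample.txt', "caf\xE9 latte")

def file_read(filename)
  # 'ISO-8859-1:UTF-8' reads as ISO-8859-1 and transcodes to UTF-8
  File.read(filename, encoding: 'ISO-8859-1:UTF-8')
end

def line_cutter(file)
  file.scan(/\w/)  # no longer raises ArgumentError
end

text = file_read('sample.txt')
text.valid_encoding?  # => true
words = line_cutter(text)
```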

how to save StringIO (pdf) data into file

I want to save a PDF file, located on an external remote server, with Ruby. The PDF data arrives as a StringIO. I tried saving the data with File.write, but it is not working; I received the error below.
ArgumentError: string contains null byte
How can I save it?
require 'stringio'
sio = StringIO.new("he\x00llo")
File.open('data.txt', 'w') do |f|
  f.puts(sio.read)
end
$ cat data.txt
hello
Response to comment:
Okay, try this:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:utf-8') do |f|
  f.puts(sio.read)
end
--output:--
1.rb:7:in `write': "\xB5" from ASCII-8BIT to UTF-8 (Encoding::UndefinedConversionError)
To get rid of that error, you can set the encoding of the StringIO to UTF-8:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
sio.set_encoding('UTF-8') #Change the encoding to what it should be.
File.open('data.txt', 'w:UTF-8') do |f|
  f.puts(sio.read)
end
Or, you can use the File.open modes:
require 'stringio'
sio = StringIO.new("\c2\xb5")
sio.set_encoding('ASCII-8BIT') #Apparently, this is what you have.
File.open('data.txt', 'w:UTF-8:ASCII-8BIT') do |f|
  f.puts(sio.read)
end
But, that assumes the data is encoded in UTF-8. If you actually have binary data, i.e. data that isn't encoded because it represents a .jpg file for instance, then that won't work.
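For the original PDF case, a binary-mode sketch sidesteps the encoding question entirely (the file name and byte string here are illustrative):

```ruby
require 'stringio'

# Stand-in for the remote PDF: binary data including a null byte
sio = StringIO.new("%PDF-1.4\x00\xC2\xB5".b)

# 'wb' writes raw bytes; IO#write (not #puts) avoids appending "\n"
File.open('sample.pdf', 'wb') { |f| f.write(sio.read) }

File.binread('sample.pdf').bytesize  # => 11
```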

getting BadDigest error while trying to upload compressed file to s3 on ruby 1.9.3

As stated, I am trying to upload a file to S3:
require 'digest/md5'
require 'base64'
require 'aws-sdk'

def digest f
  f.rewind
  Digest::MD5.new.tap do |dig|
    f.each_chunk { |ch| dig << ch }
  end.base64digest
ensure
  f.rewind
end

file = File.new(compress file) # file zipped with zip/zip
total = file.size
digest = digest(file)

s3 = AWS::S3::new(:access_key_id => @access_key_id, :secret_access_key => @secret_access_key)
bucket = s3.buckets['mybucket']
bucket.objects["myfile"].write :content_md5 => digest, :content_length => total do |buf, len|
  buf.write(file.read len)
end
But I constantly get an AWS::S3::Errors::BadDigest exception.
If I try to upload the file without passing :content_md5, everything goes well; the archive downloads and opens correctly.
Also, as I just found out, this fails on Ruby 1.9.3 but works well on 1.9.2.
Fixed by changing the digest function to:
def digest f
  Digest::MD5.file(f.path).base64digest
end
I think the issue was that the file passed to it was still open.
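A self-contained sketch of the working digest, using a generated stand-in archive:

```ruby
require 'digest/md5'

# Stand-in for the compressed archive
File.binwrite('archive.zip', "PK\x03\x04 test data".b)

# Hash by path: MD5.file streams the file itself, so it cannot be
# thrown off by the read position of an already-open handle.
def digest f
  Digest::MD5.file(f.path).base64digest
end

f = File.new('archive.zip')
d = digest(f)
f.close

d == Digest::MD5.base64digest(File.binread('archive.zip'))  # => true
```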

Can't read files from amazon s3 bucket using aws_s3 (ruby gem) in correct encoding?

I have a problem when creating a file with encoding 'utf-8' and then reading it from an Amazon S3 bucket.
I create a file.
file = File.open('new_file', 'w', :encoding => 'utf-8')
string = "Some ££££ sings"
file.write(string)
file.close
When read from local everything is ok.
open('new_file').read
=> "Some ££££ sings"
Now I upload the file to amazon s3 using aws_s3.
AWS::S3::S3Object.store('new_file', open('new_file'), 'my_bucket')
=> #<AWS::S3::S3Object::Response:0x2214462560 200 OK>
When I read from amazon s3
AWS::S3::S3Object.find('new_file', 'my_bucket').value
=> "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings"
open(AWS::S3::S3Object.find('new_file','my_bucket').url).read
=> "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings"
I've been trying many things and still can't find a solution.
Many thanks for all the help,
M
I found a solution on a different forum.
The way to do it is to make sure you pass/upload the text file as 'utf-8' in the first place. This by itself will not solve the problem, but it lets you safely force the string encoding on the way back:
open(AWS::S3::S3Object.find('new_file','my_bucket').url).read.force_encoding('utf-8')
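What force_encoding does here is re-tag the bytes, not change them, which is safe exactly because the upload was known to be UTF-8. A quick sketch:

```ruby
# What S3 hands back: valid UTF-8 bytes tagged as ASCII-8BIT
raw = "Some \xC2\xA3\xC2\xA3\xC2\xA3\xC2\xA3 sings".b

utf8 = raw.force_encoding('utf-8')  # same bytes, new encoding tag
utf8                  # => "Some ££££ sings"
utf8.valid_encoding?  # => true
```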
I think there is a better solution. Put the file you are writing to in binmode.
file = File.open("test.txt", "wb")
# or use File#binmode
file = File.open("test.txt")
file.binmode
# binmode also works with Tempfile
file = Tempfile.new
file.binmode
# then proceed to downloading
s3 = AWS::S3.new
s3.buckets["foo"]["test.txt"].read do |chunk|
  file.write(chunk)
end
