Reading in gzipped data from S3 in Ruby

My company has data messages (JSON) stored in gzipped files on Amazon S3. I want to use Ruby to iterate through the files and do some analytics. I started to use the 'aws/s3' gem, and can get each file as an object:
#<AWS::S3::S3Object:0x4xxx4760 '/my.company.archive/data/msg/20131030093336.json.gz'>
But once I have this object, I do not know how to unzip it or even access the data inside of it.

You can see the documentation for S3Object here: http://amazon.rubyforge.org/doc/classes/AWS/S3/S3Object.html.
You can fetch the content by calling your_object.value; see if you can get that far. Then it should be a question of unpacking the gzip blob. Zlib should be able to handle that.
I'm not sure if .value returns you a big string of binary data or an IO object. If it's a string, you can wrap it in a StringIO object to pass it to Zlib::GzipReader.new, e.g.
json_data = Zlib::GzipReader.new(StringIO.new(your_object.value)).read
S3Object has a stream method, which I would hope behaves like an IO object (I can't test that here, sorry). If so, you could do this:
json_data = Zlib::GzipReader.new(your_object.stream).read
Once you have the unzipped JSON content, you can just call JSON.parse on it, e.g.
JSON.parse Zlib::GzipReader.new(StringIO.new(your_object.value)).read
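Putting those pieces together, here is a minimal sketch of a helper for a single object, assuming (as described above) that #value returns the gzipped bytes as a String:
require 'zlib'
require 'stringio'
require 'json'

# Decompress a gzipped S3Object fetched with the 'aws/s3' gem and parse
# the JSON inside it. Assumes your_object.value returns the raw bytes.
def parse_gzipped_json(your_object)
  gz = Zlib::GzipReader.new(StringIO.new(your_object.value))
  JSON.parse(gz.read)
ensure
  gz.close if gz
end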

For me, the following set of steps worked:
Step 1: read the csv.gz from the S3 client and write it to a local file.
Step 2: open the local csv.gz file with Zlib::GzipReader and read the CSV from it.
require 'zlib'
require 'fastest_csv'

# Step 1: stream the gzipped object from S3 into a local file
file_path = "/tmp/gz/x.csv.gz"
File.open(file_path, "wb") do |f|
  s3_client.get_object(bucket: bucket, key: key) do |gzfiledata|
    f.write gzfiledata
  end
end

# Step 2: decompress the local file and collect the CSV rows
data = []
Zlib::GzipReader.open(file_path) do |gz_reader|
  csv_reader = ::FastestCSV.new(gz_reader)
  csv_reader.each do |csv|
    data << csv
  end
end
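If you would rather avoid the temporary file, here is a minimal sketch of an alternative, assuming an aws-sdk v2/v3 client whose get_object response body is an IO-like StringIO, and using the standard library CSV in place of FastestCSV:
require 'zlib'
require 'csv'

# Fetch the object; without a block or :response_target, the response
# body behaves like an IO object held in memory.
resp = s3_client.get_object(bucket: bucket, key: key)

# Decompress in memory and parse the CSV rows without touching the disk.
gz = Zlib::GzipReader.new(resp.body)
data = CSV.new(gz).to_a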

The S3Object documentation has been updated and the stream method is no longer available: https://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/S3Object.html
So, the best way to read data from an S3 object would be this:
json_data = Zlib::GzipReader.new(StringIO.new(your_object.read)).read
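To iterate over all of the archived messages with that interface, here is a sketch assuming the aws-sdk v1 API from the documentation linked above, where a bucket's objects can be enumerated by prefix and #read returns the gzipped bytes (the bucket name and prefix are placeholders):
require 'aws-sdk'
require 'zlib'
require 'stringio'
require 'json'

s3 = AWS::S3.new
bucket = s3.buckets['my.company.archive']

messages = []
bucket.objects.with_prefix('data/msg/').each do |obj|
  next unless obj.key.end_with?('.json.gz')
  # #read returns the raw gzipped bytes; wrap them in StringIO for GzipReader.
  json = Zlib::GzipReader.new(StringIO.new(obj.read)).read
  messages << JSON.parse(json)
end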

Related

Upload CSV payload to Google Storage using Ruby

I am trying to upload string payload to Google Storage directly using Ruby. But, it seems there's no direct way to do this without creating a temporary file in the disk.
I am using the CSV library to generate a string payload.
The current approach is to store the string payload in a temporary file and then use code like the following to upload the file to Google Cloud Storage:
require "google/cloud/storage"
storage = Google::Cloud::Storage.new
bucket = storage.bucket bucket_name
file = bucket.create_file local_file_path, file_name
Is there a way to avoid creating a temporary file to upload?
I found the documentation where it states that we can use any File-like object such as StringIO to upload string payload directly.
Here's my code:
require "google/cloud/storage"
storage = Google::Cloud::Storage.new
bucket = storage.bucket "my-todo-app"
bucket.create_file StringIO.new("Hello world!"), "hello-world.txt"
See the Creating a File section of the documentation linked below for an example.
Documentation link:
https://googleapis.dev/ruby/google-cloud-storage/latest/Google/Cloud/Storage/Bucket.html#create_file-instance_method
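For the CSV case specifically, here is a minimal sketch under the same approach, assuming the payload is built with the standard library's CSV.generate and that create_file accepts a content_type option (the bucket and object names are placeholders):
require "csv"
require "stringio"
require "google/cloud/storage"

# Build the CSV payload in memory as a String.
csv_payload = CSV.generate do |csv|
  csv << ["id", "name"]
  csv << [1, "Alice"]
end

storage = Google::Cloud::Storage.new
bucket  = storage.bucket "my-todo-app"

# Wrap the String in StringIO so it can be uploaded without a temp file.
bucket.create_file StringIO.new(csv_payload), "export.csv", content_type: "text/csv"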

axlsx serialize spreadsheet to string

For testing purposes, I'd like to serialize an axlsx spreadsheet to a string. The axlsx documentation indicates it is possible to "Output to file or StringIO". But I haven't found documentation or a code sample that explains how to output to a StringIO. How is it done?
From the code:
# Serialize to a stream
s = package.to_stream()
File.open('example_streamed.xlsx', 'w') { |f| f.write(s.read) }
In the end, an xlsx file is a zip archive containing multiple xml files and other assets. You can use Package#to_stream to generate an IO stream for streaming purposes, but viewing that archive as a string is probably not what you are looking to do.
If you are just looking to investigate the xml for a specific Worksheet, you can use Worksheet#to_xml_string which will return a String object with all the goodies in there. (That is how worksheet validation works, we parse that XML and validate it against the schema for the object)
Hope this helps!
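To answer the original question directly, here is a minimal sketch for getting the serialized bytes as a String in a test, assuming Package#to_stream returns an IO-like object as shown above:
require 'axlsx'

package = Axlsx::Package.new
package.workbook.add_worksheet(name: 'Sheet1') do |sheet|
  sheet.add_row ['hello', 'world']
end

# Reading the stream yields the xlsx archive as a binary String, which can
# be inspected or compared in a test without writing anything to disk.
xlsx_string = package.to_stream.read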

How to deserialize BSON::Binary back into ruby hash?

I'm using Anemone to store crawled pages into MongoDB. It mostly works, except for accessing the page headers when I retrieve a page from MongoDB.
When I call collection.find_one("http://stackoverflow.com") I'll get the correct object from the data store, but I can't access the headers.
Anemone stores the headers as a hash, so theoretically, after retrieving the document, I should be able to do something like
document["headers"]["content-type"]
but that doesn't work because document["headers"] is a BSON::Binary.
puts document["headers"]
displays a mixture of text and binary characters.
How can I create a usable ruby hash object from the binary data that comes back from MongoDB?
EDIT: I haven't solved the original problem, but was able to modify Anemone so that I can have it load the data for me, which seems to work:
class NewMongo < Anemone::Storage::MongoDB
  def initialize(mongo_db, collection_name)
    @db = mongo_db
    @collection = @db[collection_name]
    # Do not delete the collection! I need it!
    # @collection.remove
    @collection.create_index 'url'
  end
end
And then later on...
repo = NewMongo.new(db, "pages")
repo.each do |url, page|
  puts page.content_type
end
If the data was stored in a Binary format by the Anemone storage backend, there isn't much you can do unless you know the format or they provide a deserializer. It sounds like a bad choice for storing the headers, since a hash would be a more natural form for them.

Updating content-type after file upload on Amazon S3 with Amazon-SDK Ruby gem

I'm running a script that updates a metadata field on some of my S3 objects after they have already been uploaded to the S3 bucket. On initialization, I am setting the content-type by checking the file name.
def save_to_amazon(file, s3_object, file_name, meta_path)
  puts "uploaded #{file} to Amazon S3"
  content_type = set_content_type(file_name)
  s3_object.write(file.get_input_stream.read, :metadata => { :folders => meta_path }, :content_type => content_type)
end
At this point, the S3 content-type works fine for these objects. The problem arises when I update the metadata later on. I run something like this:
s3_object.metadata['folders'] = "some string"
At this point, I get an empty string back when I run s3_object.content_type after updating the metadata.
There is no s3_object.content_type= setter available.
As far as I can tell from reading the RDoc, there isn't a way to assign the content-type after uploading the S3 file. I have tried using the metadata method like
s3.object.metadata['content_type'] = "some string"
s3.object.metadata['content-type'] = "some string"
Both of these appear to assign a new custom metadata attribute instead of updating the object's mime type.
Is there a way to set this, or do I need to completely re-upload the file again?
To elaborate on tkotisis' response, here is what I did to update the content-type using copy_to. You can use s3object.head[:metadata] to pull out the existing metadata and copy it over, as referenced here.
amazon_bucket.objects.each do |ob|
  metadata = ob.head[:metadata]
  content_type = "foo/bar"
  ob.copy_to(ob.key, :metadata => metadata, :content_type => content_type)
end
EDIT
amazon_bucket.objects.each do |ob|
  metadata = ob.metadata
  content_type = "foo/bar"
  ob.copy_to(ob.key, :metadata => { :foo => metadata[:foo] }, :content_type => content_type)
end
Your example code only modifies your in-memory object.
To modify the metadata of the actual S3 object, issue a copy request with the destination key set to that of your current object.
EDIT
According to the documentation
Using the copy operation, you can rename objects by copying them and
deleting the original ones.
When copying an object, you might decide to update some of the
metadata values. For example, if your source object is configured to
use standard storage, you might choose to use reduced redundancy
storage for the object copy. You might also decide to alter some of
the user-defined metadata values present on the source object. Note
that if you choose to update any of the object's user-configurable
metadata (system or user-defined) during the copy, then you must
explicitly specify all of the user-configurable metadata present on the
source object in your request, even if you are changing only one of the
metadata values.
I haven't tried it, but using the Ruby SDK this is probably achieved through the
- (S3Object) copy_to(target, options = {})
method.
I'm using the gem "aws-sdk", "~> 2" (2.2.3).
Assume you have an existing file whose content-type was not set (it defaults to "binary/octet-stream").
How do you check a file's content-type?
You can use RestClient as follows (here, object is an Aws::S3::Object):
require 'aws-sdk'
require 'rest-client'

bucket = Aws::S3::Bucket.new(bucket_name)
object = bucket.object(key)
RestClient.head(object.presigned_url(:head)) do |resp|
  puts resp.headers
  puts resp.headers[:content_type]
end
How do you change a file's content-type?
In my case, I wanted to change the content-type to 'image/jpeg' for an object whose current content-type was 'binary/octet-stream', so you can do:
object.copy_from(
  object,
  content_type: 'image/jpeg',
  metadata_directive: 'REPLACE'
)
Make sure you set the ACL to :public_read, otherwise your files will be unavailable after copying.
This did the trick for me:
bucket.objects.with_prefix('my_assets').each do |obj|
  metadata = obj.head[:metadata]
  content_type = "application/pdf"
  obj.copy_to(obj.key, :metadata => metadata, :content_type => content_type)
  obj.acl = :public_read
end
Although not Ruby, I found this project, which automatically guesses the MIME type based on the extension and resets it via the same copy method the other answers refer to. It's not terribly quick since it has to copy the blob. If you needed to make it happen faster, you could probably divide up the work and copy in parallel via something like IronWorker. I did a similar thing for resetting permissions.
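A rough Ruby equivalent of that idea, sketched against the aws-sdk v1 interface used in the answers above, with the mime-types gem assumed for the extension lookup (the prefix and fallback type are placeholders):
require 'aws-sdk'    # v1 interface, as used in the answers above
require 'mime/types'

bucket.objects.with_prefix('my_assets').each do |obj|
  # Guess the MIME type from the key's file extension, with a fallback.
  guessed = MIME::Types.type_for(obj.key).first
  content_type = guessed ? guessed.content_type : 'binary/octet-stream'

  # Re-copy the object onto itself to replace its content-type.
  obj.copy_to(obj.key, :metadata => obj.head[:metadata], :content_type => content_type)
  obj.acl = :public_read
end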

How to get the real file from S3 using CarrierWave

I have an application that reads the content of a file and indexes it. I was storing them in the disk itself, but now I'm using Amazon S3, so the following method doesn't work anymore.
It was something like this:
def perform(docId)
  @document = Document.find(docId)
  if @document.file?
    # You shouldn't create a new version
    @document.versionless do |doc|
      @document.file_content = Cloudoc::Extractor.new.extract(@document.file.file)
      @document.save
    end
  end
end
@document.file returns the FileUploader, and doc.file.file returns the CarrierWave::Storage::Fog::File class.
How can I get the real file?
Calling @document.file.read will get you the contents of the file from S3 in CarrierWave.
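Here is a minimal sketch of how that might slot into the job above, assuming Cloudoc::Extractor#extract can accept an IO-like object rather than a storage-specific file (if it only accepts a real file, write the bytes to a Tempfile first):
require 'stringio'

def perform(doc_id)
  @document = Document.find(doc_id)
  return unless @document.file?

  @document.versionless do |doc|
    # #read pulls the file contents down from S3 via CarrierWave/Fog.
    contents = @document.file.read
    @document.file_content = Cloudoc::Extractor.new.extract(StringIO.new(contents))
    @document.save
  end
end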
