How to retrieve CSV headers only from S3 [duplicate] - ruby

Below is the code I'm using to parse the CSV from within the app, but I want to parse a file located in an Amazon S3 bucket. It needs to work when pushed to Heroku as well.
namespace :csvimport do
  desc "Import CSV Data to Inventory."
  task :wiwt => :environment do
    require 'csv'
    csv_file_path = Rails.root.join('public', 'wiwt.csv.txt')
    CSV.foreach(csv_file_path) do |row|
      p = Wiwt.create!({
        :user_id => row[0],
        :date_worn => row[1],
        :inventory_id => row[2],
      })
    end
  end
end

With S3, an object's permissions may disallow public access. Ruby's built-in file and URL helpers assume the path is directly readable and know nothing about S3 credentials, so in that case fetch the object through the AWS SDK:
s3 = Aws::S3::Resource.new
bucket = s3.bucket("bucket_name_here")
str = bucket.object("file_path_here").get.body.string
content = CSV.parse(str, col_sep: "\t", headers: true).map(&:to_h)
Line-by-line explanation, using the AWS SDK:
Line 1. Initialize the S3 resource.
Line 2. Choose a bucket.
Line 3. Choose an object and get its body as a String.
Line 4. Effectively CSV.parse('the string'), but I also added a couple of options and mapped each row to a Hash, in case that helps you.
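Since the question title asks for the headers only, here is a minimal sketch of how that could look once the object body is in hand. It uses a StringIO as a stand-in for the S3 response body (the answer above calls .string on it, which suggests it is IO-like). The helper names headers_from_io and rows_from_io and the sample data are assumptions for illustration, not part of the AWS SDK.

```ruby
require 'csv'
require 'stringio'

# Hypothetical helper: returns only the header names by parsing the first line.
# Assumes the header row does not contain quoted line breaks.
def headers_from_io(io)
  CSV.parse_line(io.gets, col_sep: "\t")
end

# Hypothetical helper: parses every row into a Hash keyed by header.
def rows_from_io(io)
  CSV.new(io, col_sep: "\t", headers: true).map(&:to_h)
end

# Invented sample data; in the answer above this would come from
# bucket.object("file_path_here").get.body (assumed to be IO-like).
sample = "user_id\tdate_worn\tinventory_id\n1\t2014-01-01\t42\n"

p headers_from_io(StringIO.new(sample))
p rows_from_io(StringIO.new(sample))
```

Note that this only avoids parsing the rest of the file; the object body is still fetched in full before the first line is read.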

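You can do it like this (on Ruby 3.0 and later a bare open no longer fetches URLs, so use URI.open from open-uri):
require 'csv'
require 'open-uri'
CSV.new(URI.open(path_to_s3)).each do |row|
  ...
end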

This worked for me
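require 'csv'
require 'open-uri'
# URI.open replaces the bare open for URLs on Ruby 3.0 and later
URI.open(s3_file_path) do |file|
  # CSV.new reads from the already opened stream; options are passed as keywords
  CSV.new(file, headers: true, header_converters: :symbol).each do |row|
    Model.create(row.to_hash)
  end
end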

You can get the CSV file from S3 like this:
require 'csv'
require 'net/http'
# Net::HTTP.get expects a URI object (or a host and path), not a plain String
CSV.parse(Net::HTTP.get(URI(s3_file_url)), headers: true).each do |row|
  # code for processing row here
end

Related

Working with large CSV files in Ruby

I want to parse two CSV files of the MaxMind GeoIP2 database, do some joining based on a column and merge the result into one output file.
I used Ruby's standard CSV library, but it is very slow. I think it tries to load the whole file into memory.
block_file = File.read(block_path)
block_csv = CSV.parse(block_file, :headers => true)
location_file = File.read(location_path)
location_csv = CSV.parse(location_file, :headers => true)
CSV.open(output_path, "wb",
         :write_headers => true,
         :headers => ["geoname_id","Y","Z"]) do |csv|
  block_csv.each do |block_row|
    puts "#{block_row['geoname_id']}"
    location_csv.each do |location_row|
      if (block_row['geoname_id'] === location_row['geoname_id'])
        puts " match :"
        csv << [block_row['geoname_id'],block_row['Y'],block_row['Z']]
        break location_row
      end
    end
  end
end
Is there another Ruby library that supports processing in chunks?
block_csv is 800MB and location_csv is 100MB.
Just use CSV.open(block_path, 'r', :headers => true).each do |line| instead of File.read and CSV.parse. It will parse the file line by line.
In your current version, you explicitly tell it to read the whole file with File.read and then parse the whole thing as a string with CSV.parse, so it does exactly what you told it to do.
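A minimal sketch of the streaming approach applied to the join from the question. It also replaces the nested loop with a Set lookup, which is my own addition rather than part of the answer above. The helper name join_blocks is hypothetical, and the column names Y and Z are taken from the question as written.

```ruby
require 'csv'
require 'set'

# Hypothetical helper: writes every block row whose geoname_id also
# appears in the location file. Both inputs are streamed row by row.
def join_blocks(block_path, location_path, output_path)
  # Collect the ids from the smaller file; only the ids are kept in memory.
  location_ids = Set.new
  CSV.foreach(location_path, headers: true) do |location_row|
    location_ids << location_row['geoname_id']
  end

  CSV.open(output_path, 'wb', write_headers: true,
                              headers: %w[geoname_id Y Z]) do |out|
    # Stream the large file and write each matching row immediately.
    CSV.foreach(block_path, headers: true) do |block_row|
      next unless location_ids.include?(block_row['geoname_id'])
      out << [block_row['geoname_id'], block_row['Y'], block_row['Z']]
    end
  end
end
```

Called as join_blocks(block_path, location_path, output_path), this reads each file once instead of rescanning the location file for every block row.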

Download files from URL's in array naming them by items in another array

I have a CSV with two columns, and I am pushing each column's data into an array. Column 2 contains URLs of images that I would like to download. How do I name each file after its corresponding value from column 1?
require "open-uri"
require "csv"
members = []
photos = []
CSV.foreach('members.csv', :headers => true) do |csv_obj|
members << csv_obj[0]
photos << csv_obj[1]
end
photos.each {
|x| File.open({value from members array}, 'wb') do |fo|
fo.write open(x).read
end
}
Try this:
require "open-uri"
require "csv"
members = []
photos = []
CSV.foreach('members.csv', :headers => true) do |csv_obj|
members << csv_obj[0]
photos << csv_obj[1]
end
photos.each_with_index do |photo, index|
  File.open(members[index], 'wb') do |fo|
    # URI.open (from open-uri) is required for URLs on Ruby 3.0 and later
    fo.write URI.open(photo) { |file| file.read }
  end
end
Notes:
Try to post a snippet of the CSV file too; it helps with testing the code.
The code assumes that the members array contains file names with extensions.
open is called with a block while downloading so that the stream is closed automatically when the block finishes.
I suggest using long, descriptive variable names; they document your intent and make the code more readable.
The wb argument to File.open ensures the file is written in binary mode.
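As a variation, the two arrays can be skipped entirely by downloading while iterating over the rows. This is only a sketch: download_photos and its fetch hook are hypothetical names introduced here so the download step can be replaced (for example in a test), and the default assumes open-uri's URI.open.

```ruby
require 'csv'
require 'open-uri'

# Hypothetical helper: saves the URL in column 2 under the name in column 1.
# `fetch` is an assumed hook so the download step can be swapped out.
def download_photos(csv_path, dest_dir, fetch: ->(url) { URI.open(url, &:read) })
  CSV.foreach(csv_path, headers: true) do |row|
    file_name, url = row[0], row[1]
    # 'wb' writes the image bytes in binary mode.
    File.open(File.join(dest_dir, file_name), 'wb') do |out|
      out.write(fetch.call(url))
    end
  end
end
```

Keeping the name and URL together in one row avoids any risk of the two arrays drifting out of step.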

Removing whitespaces in a CSV file

I have a string with extra whitespace:
First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
I want to parse this line and remove the whitespaces.
My code looks like:
namespace :db do
task :populate_contacts_csv => :environment do
require 'csv'
csv_text = File.read('file_upload_example.csv')
csv = CSV.parse(csv_text, :headers => true)
csv.each do |row|
puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
end
end
@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
                    :header_converters => lambda {|f| f.strip},
                    :converters => lambda {|f| f ? f.strip : nil})
The nil test is in the field converter but not in the header converter, on the assumption that headers are never nil while data fields might be, and nil doesn't have a strip method. I'm really surprised that, AFAIK, :strip is not a predefined converter!
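A self-contained sketch of the same idea, runnable without a file on disk. The sample text is modelled on the header line from the question with invented values, and parse_stripped is a hypothetical helper name.

```ruby
require 'csv'

# Hypothetical helper: parses CSV text, stripping whitespace from
# every header and every field. nil (an empty field) passes through.
def parse_stripped(csv_text)
  strip = ->(value) { value.nil? ? nil : value.strip }
  CSV.parse(csv_text, headers: true,
                      header_converters: strip,
                      converters: strip)
end

# Invented sample data with trailing spaces in a header and a field.
sample = "First,Last,Email ,Mobile Phone \nJane,Doe,jane@example.com ,\n"

parse_stripped(sample).each do |row|
  puts "First Name: #{row['First']} | Email: #{row['Email']}"
end
```

Because the header converter runs too, row['Email'] works even though the raw header is 'Email ' with a trailing space.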
You can strip your hash first:
csv.each do |unstripped_row|
  row = {}
  # v&.strip leaves empty (nil) fields untouched instead of raising
  unstripped_row.each{|k, v| row[k.strip] = v&.strip}
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Edited to strip hash keys too
CSV supports "converters" for the headers and fields, which let you get inside the data before it's passed to your each loop.
Writing a sample CSV file:
csv = "First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
first,last,email ,mobile phone ,company,title ,street,city,state,zip,country, birthday,gender ,contact type
"
File.write('file_upload_example.csv', csv)
Here's how I'd do it:
require 'csv'
csv = CSV.open('file_upload_example.csv', :headers => true)
[:convert, :header_convert].each { |c| csv.send(c) { |f| f.strip } }
csv.each do |row|
puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Which outputs:
First Name: first
Last Name: last
Email: email
The converters simply strip leading and trailing whitespace from each header and each field as they're read from the file.
Also, as a programming design choice, don't read your file into memory using:
csv_text = File.read('file_upload_example.csv')
Then parse it:
csv = CSV.parse(csv_text, :headers => true)
Then loop over it:
csv.each do |row|
Ruby's IO system supports enumerating over a file line by line. Once my code calls CSV.open the file is readable, and each reads one line at a time. The entire file never needs to be in memory at once; slurping it isn't scalable (though on new machines it's becoming a lot more reasonable), and, if you test, you'll find that reading a file using each is extremely fast, probably as fast as reading it all, parsing it, and then iterating over the parsed result.

Can I delete columns from CSV using Ruby?

Looking at the documentation for the CSV library of Ruby, I'm pretty sure this is possible and easy.
I simply need to delete the first three columns of a CSV file using Ruby, but I haven't had any success getting it to run.
csv_table = CSV.read(file_path_in, :headers => true)
csv_table.delete("header_name")
csv_table.to_csv # => The new CSV in string format
Check the CSV::Table documentation: http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Table.html
csv_table = CSV.read("../path/to/file.csv", :headers => true)
keep = ["x", "y", "z"]
new_csv_table = csv_table.by_col!.delete_if do |column_name,column_values|
  !keep.include? column_name
end
new_csv_table.to_csv
What about:
require 'csv'
File.open("resfile.csv","w+") do |f|
CSV.foreach("file.csv") do |row|
f.puts(row[3..-1].join(","))
end
end
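A variation on the snippet above, as a sketch: writing through CSV instead of join(",") keeps fields that contain commas or quotes correctly quoted, and Array#drop copes with rows shorter than three fields. The helper name drop_leading_columns is hypothetical.

```ruby
require 'csv'

# Hypothetical helper: copies a CSV file, dropping the first `count` columns.
def drop_leading_columns(in_path, out_path, count = 3)
  CSV.open(out_path, 'w') do |out|
    CSV.foreach(in_path) do |row|
      # drop returns an empty array for short rows; CSV re-quotes fields on output.
      out << row.drop(count)
    end
  end
end
```

For example, drop_leading_columns("file.csv", "resfile.csv") would reproduce the behaviour of the snippet above for simple data.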
I have built on a few of the answers here (I really liked what @fguillen did with CSV::Table), but made it a bit simpler to drop into an existing project, target a file, and make a quick change.
I have added byebug because ... yes. I also retained the headers from the original file (assuming they exist, for anyone wanting to use this snippet).
The output file is overwritten each time, in case you want to test or tinker.
require 'csv'
require 'byebug'
in_file = './db/data/inbox/change__to_file_name.csv'
out_file = in_file + ".out"
target_col = "change_to_column_name"
csv_table = CSV.read(in_file, headers: true)
csv_table.delete(target_col)
CSV.open(out_file, 'w+', force_quotes: true) do |csv|
  csv << csv_table.headers
  csv_table.each do |row|
    csv << row
  end
end

How to do the equivalent of 's3cmd ls s3://some_bucket/foo/bar' in Ruby?

How do I do the equivalent of 's3cmd ls s3://some_bucket/foo/bar' in Ruby?
I found the Amazon S3 gem for Ruby and also the Right AWS S3 library, but somehow it's not immediately obvious how to do a simple 'ls' like command on an S3 'folder' like location.
Using the aws gem this should do the trick:
s3 = Aws::S3.new(YOUR_ID, YOUR_SECRET_KEY)
bucket = s3.bucket('some_bucket')
bucket.keys('prefix' => 'foo/bar')
I found a similar question here: Listing directories at a given level in Amazon S3
Based on that I created a method that behaves as much as possible as 's3cmd ls <path>':
require 'right_aws'
module RightAws
  class S3
    class Bucket
      def list(prefix, delimiter = '/')
        list = []
        @s3.interface.incrementally_list_bucket(@name, {'prefix' => prefix, 'delimiter' => delimiter}) do |item|
          if item[:contents].empty?
            list << item[:common_prefixes]
          else
            list << item[:contents].map{|n| n[:key]}
          end
        end
        list.flatten
      end
    end
  end
end
s3 = RightAws::S3.new(ID, SECRET_KEY)
bucket = s3.bucket('some_bucket')
puts bucket.list('foo/bar/').inspect
In case someone is looking for the answer to this question for aws-sdk version 2, you can do it very easily this way:
creds = Aws::SharedCredentials.new(profile_name: 'my_credentials')
s3_client = Aws::S3::Client.new(region: 'us-east-1',
credentials: creds)
response = s3_client.list_objects(bucket: "mybucket",
delimiter: "/")
Now, if you call
response.common_prefixes
it will give you the "folders" at that level, and if you call
response.contents
it will give you the files at that level.
The official Ruby AWS SDK now supports this: http://docs.aws.amazon.com/AWSRubySDK/latest/AWS/S3/Tree.html
You can also add the following convenience method:
class AWS::S3::Bucket
def ls(path)
as_tree(:prefix => path).children.select(&:branch?).map(&:prefix)
end
end
Then use it like this:
mybucket.ls 'foo/bar' # => ["/foo/bar/dir1/", "/foo/bar/dir2/"]
A quick and simple way to list files in a bucket folder using the Ruby aws-sdk:
require 'aws-sdk'
s3 = AWS::S3.new
your_bucket = s3.buckets['bucket_o_files']
your_bucket.objects.with_prefix('lots/of/files/in/2014/09/03/').each do |file|
puts file.key
end
Notice the '/' at the end of the prefix; it is important.
I like the idea of opening the Bucket class and adding an 'ls' method.
I would have done it like this:
class AWS::S3::Bucket
def ls(path)
objects.with_prefix("#{path}").as_tree.children.select(&:leaf?).collect(&:member).collect(&:key)
end
end
s3 = AWS::S3.new
your_bucket = s3.buckets['bucket_o_files']
your_bucket.ls('lots/of/files/in/2014/09/03/')
