How do you upload a zip file and unzip to s3? - ruby

I am working on an application where I have to upload a zip file. The zip file is basically a static website, so it has many files and a couple of subdirectories. I have been playing with the rubyzip gem for a while now and cannot figure out how to simply extract the files from it. Any pointers on where I can read up on some examples? I am sure someone has run into this problem before. The documentation for rubyzip is not very good, so I am hoping someone can give me some pointers.

Here you go, one super magical multithreaded zip-to-S3 uploader which I haven't tested at all - go nuts! Looks like I'm three years too late though.
require 'zip'     # rubyzip
require 'aws-sdk' # v1-era SDK providing AWS::S3
require 'thwait'

class S3ZipUploader
  attr_reader :bucket, :s3, :zip, :failed_uploads

  def initialize(zipfilepath, mys3creds)
    # next 4 lines are important
    @s3 = AWS::S3.new(access_key_id: mys3creds[Rails.env]['aws_access_key'],
                      secret_access_key: mys3creds[Rails.env]['aws_secret_access_key'],
                      region: 'us-west-2')
    @bucket = @s3.buckets[mys3creds[Rails.env]['bucket']]
    @failed_uploads = []
    @zip = Zip::File.open(zipfilepath)
  end

  def upload_zip_contents
    rootpath = "mypath/"
    desired_threads = 10
    total_entries = zip.entries.count
    slice_size = (total_entries.to_f / desired_threads).ceil
    threads = []
    zip.entries.each_slice(slice_size) do |e_arr|
      threads << Thread.new do
        e_arr.each do |e|
          result = upload_file_to_s3(rootpath + e.name, e.get_input_stream.read)
          failed_uploads << { name: e.name, entry: e } unless result
        end
      end
    end
    ThreadsWait.all_waits(*threads)
  end

  def upload_file_to_s3(path, filedata)
    retries = 0
    success = false
    while !success && retries < 3
      success = begin
        obj = bucket.objects[path]
        obj.write(filedata)
        obj.acl = :public_read
        true
      rescue
        retries += 1
        false
      end
    end
    success
  end
end
uploader = S3ZipUploader.new("/path/to/myzip.zip", MYS3CREDS)
uploader.upload_zip_contents
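If you were doing this today, a minimal single-threaded sketch with the current aws-sdk-s3 gem could look like this (the bucket name, region, and "site/" key prefix are assumptions; credentials come from the environment):

require 'zip'
require 'aws-sdk-s3'

s3 = Aws::S3::Resource.new(region: 'us-west-2')
bucket = s3.bucket('my-bucket')

Zip::File.open('/path/to/myzip.zip') do |zip|
  zip.each do |entry|
    next if entry.directory?
    # Stream each file entry straight into S3 under a common prefix.
    bucket.object("site/#{entry.name}").put(
      body: entry.get_input_stream.read,
      acl: 'public-read'
    )
  end
end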

Related

RubyZip docx issues with write_buffer instead of open

I'm adapting the RubyZip recursive zipping example (found here) to work with write_buffer instead of open and am coming across a host of issues. I'm doing this because the zip archive I'm producing has word documents in it and I'm getting errors on opening those word documents. Therefore, I'm trying the work-around that RubyZip suggests, which is using write_buffer instead of open (example found here).
The problem is, I'm getting errors because I'm using an absolute path, but I'm not sure how to get around that. I'm getting the error "name must not start with /".
Second, I'm not sure what to do to mitigate the issue with Word documents. When I used my original code, which worked and created an actual zip file, any Word document in that zip file had the following error upon opening: "Word found unreadable content in [the document]. Do you want to recover the contents of this document? If you trust the source of this document, click Yes." The unreadable-content error is the reason why I went down the road of attempting to use write_buffer.
Any help would be appreciated.
Here is the code that I'm currently using:
require 'zip'
require 'zip/zipfilesystem'

module AdvisoryBoard
  class ZipService
    def initialize(input_dir, output_file)
      @input_dir = input_dir
      @output_file = output_file
    end

    # Zip the input directory.
    def write
      entries = Dir.entries(@input_dir) - %w[. ..]
      path = ""
      buffer = Zip::ZipOutputStream.write_buffer do |zipfile|
        entries.each do |e|
          zipfile_path = path == '' ? e : File.join(path, e)
          disk_file_path = File.join(@input_dir, zipfile_path)
          @file = nil
          @data = nil
          if !File.directory?(disk_file_path)
            @file = File.open(disk_file_path, "r+b")
            @data = @file.read
            unless [@output_file, @input_dir].include?(e)
              zipfile.put_next_entry(e)
              zipfile.write @data
            end
            @file.close
          end
        end
        zipfile.put_next_entry(@output_file)
        zipfile.put_next_entry(@input_dir)
      end
      File.open(@output_file, "wb") { |f| f.write(buffer.string) }
    end
  end
end
I was able to get Word documents to open without any warnings or corruption! Here's what I ended up doing:
require 'nokogiri'
require 'zip'
require 'zip/zipfilesystem'

class ZipService
  # Initialize with the directory to zip and the location of the output archive.
  def initialize(input_dir, output_file)
    @input_dir = input_dir
    @output_file = output_file
  end

  # Zip the input directory.
  def write
    entries = Dir.entries(@input_dir) - %w[. ..]
    ::Zip::File.open(@output_file, ::Zip::File::CREATE) do |zipfile|
      write_entries entries, '', zipfile
    end
  end

  private

  # A helper method to make the recursion work.
  def write_entries(entries, path, zipfile)
    entries.each do |e|
      zipfile_path = path == '' ? e : File.join(path, e)
      disk_file_path = File.join(@input_dir, zipfile_path)
      if File.directory? disk_file_path
        recursively_deflate_directory(disk_file_path, zipfile, zipfile_path)
      else
        put_into_archive(disk_file_path, zipfile, zipfile_path, e)
      end
    end
  end

  def recursively_deflate_directory(disk_file_path, zipfile, zipfile_path)
    zipfile.mkdir zipfile_path
    subdir = Dir.entries(disk_file_path) - %w[. ..]
    write_entries subdir, zipfile_path, zipfile
  end

  def put_into_archive(disk_file_path, zipfile, zipfile_path, entry)
    if File.extname(zipfile_path) == ".docx"
      Zip::File.open(disk_file_path) do |zip|
        doc = zip.read("word/document.xml")
        xml = Nokogiri::XML.parse(doc)
        zip.get_output_stream("word/document.xml") { |f| f.write(xml.to_s) }
      end
      zipfile.add(zipfile_path, disk_file_path)
    else
      zipfile.add(zipfile_path, disk_file_path)
    end
  end
end
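For reference, a write_buffer version along the lines the question originally attempted might look like this sketch (untested; the key point is that entry names are made relative to the input directory, which avoids the "name must not start with /" error):

require 'zip'

def zip_directory(input_dir, output_file)
  buffer = Zip::OutputStream.write_buffer do |out|
    Dir.glob(File.join(input_dir, '**', '*')).each do |disk_path|
      next if File.directory?(disk_path)
      # Strip the input directory so entry names are relative, not absolute.
      entry_name = disk_path.sub(%r{\A#{Regexp.escape(input_dir)}/?}, '')
      out.put_next_entry(entry_name)
      out.write File.binread(disk_path)
    end
  end
  File.binwrite(output_file, buffer.string)
end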

Web scraping with Kimurai gem

I am doing some web scraping with the Kimurai Ruby gem. I have this script that works great:
require 'kimurai'

class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on job link and get description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      puts '*******'
      puts title
      puts link
      puts description
      puts count += 1
    end
    puts "There are #{count} jobs total"
  end
end

SimpleSpider.crawl!
However, I want this to return an array of objects - jobs, in this case. I'd like to create a jobs array in the parse method, do something like jobs << [title, link, description, company] inside the returned_jobs loop, and have that array returned when I call SimpleSpider.crawl! - but that doesn't work.
Any help appreciated.
You can slightly modify your code like this:
class SimpleSpider < Kimurai::Base
  @name = "simple_spider"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    # Update response to current response after interaction with a browser
    count = 0
    # browser.click_button "Show more"
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    jobs = []
    returned_jobs.css('li').each do |char_element|
      # puts char_element
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on job link and get description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      jobs << [title, link, description]
    end
    puts "There are #{jobs.count} jobs total"
    puts jobs
  end
end
I am not sure about company, as I don't see that variable in your code, but the array above shows the idea you can build on.
I also have a blog post here about how to use the Kimurai framework from a Ruby on Rails application.
Turns out there is a parse method that allows a value to be returned. Here is a working example:
require 'open-uri'
require 'nokogiri'
require 'kimurai'

class TaxJar < Kimurai::Base
  @name = "tax_jar"
  @engine = :selenium_chrome
  @start_urls = ["https://apply.workable.com/taxjar/"]

  def parse(response, url:, data: {})
    jobs = Array.new
    doc = browser.current_response
    returned_jobs = doc.css('.careers-jobs-list-styles__jobsList--3_v12')
    returned_jobs.css('li').each do |char_element|
      title = char_element.css('a')[0]['aria-label']
      link = "https://apply.workable.com" + char_element.css('a')[0]['href']
      # click on job link and get description
      browser.visit(link)
      job_page = browser.current_response
      description = job_page.xpath('/html/body/div[1]/div/div[1]/div[2]/div[2]/div[2]').text
      company = 'TaxJar'
      puts "title is: #{title}, link is: #{link}, \n description is: #{description}"
      jobs << [title, link, description, company]
    end
    return jobs
  end
end
jobs = TaxJar.parse!(:parse, url: "https://apply.workable.com/taxjar/")
puts jobs.inspect
If you are scraping JS websites, this gem seems pretty robust compared with others (watir/selenium) I have tried.

Download files asynchronously

I was trying to make a script that downloads all images or videos from a thread on my favourite imageboard, 2ch.hk.
I was successful until I wanted to download the files asynchronously (to improve performance, for example).
Here is the code: http://ideone.com/k2l4Hm
require 'net/http'

multithreading = false

Net::HTTP.start("2ch.hk", :use_ssl => true) do |http|
  thread = http.get("/b/res/133467978.html").body
  sources = []
  thread.scan(/<a class="desktop" target="_blank" href=".+">.+<\/a>/).each do |a|
    source = "/b#{/<a class="desktop" target="_blank" href="\.\.(.+)">.+<\/a>/.match(a).to_a[1]}"
    sources << source
  end
  i = 0
  start = Time.now
  if multithreading
    threads = []
    sources.each do |source|
      threads << Thread.new(i) do |j|
        file = http.get(source).body # breaks everything
        # type = /.+\.(.+)/.match(source)[1]
        # open("#{j}.#{type}","wb") { |new_file|
        #   new_file.write(file)
        # }
      end
      i += 1
    end
    threads.each do |thr|
      thr.join
    end
    # until downloade=sources.size
    # end
  else
    sources.each do |source|
      file = http.get(source).body
      type = /.+\.(.+)/.match(source)[1]
      open("#{i}.#{type}", "wb") { |new_file|
        new_file.write(file)
      }
      i += 1
      print "#{((i.to_f / sources.size) * 100).round(2)}% "
    end
    puts
  end
  puts "Done. #{i} files were downloaded. It took #{Time.now - start} seconds"
end
I suppose that this line crashes everything:
file = http.get(source).body
Or maybe this is the problem:
threads.each do |thr|
thr.join
end
The error messages are always different, from Bad file descriptor and IO errors to "You may have encountered a bug in the Ruby interpreter or extension libraries."
If you want to try running my code, please substitute the thread link on line 4 with a fresh thread (from 2ch.hk/b), because the one in my code may be deleted by the time you run it.
Ruby version: 2.3.1; OS: Xubuntu 16.10.
You'll probably have much better performance using a Ruby HTTP library that supports parallel requests, such as Typhoeus:
https://github.com/typhoeus/typhoeus
For example:
hydra = Typhoeus::Hydra.new
10.times.map{ hydra.queue(Typhoeus::Request.new("www.example.com", followlocation: true)) }
hydra.run
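If you go that route, keep references to the queued requests so you can read each response after hydra.run. A sketch (the URL prefix and output filenames are assumptions):

require 'typhoeus'

hydra = Typhoeus::Hydra.new(max_concurrency: 10)
requests = sources.map do |source|
  request = Typhoeus::Request.new("https://2ch.hk#{source}", followlocation: true)
  hydra.queue(request)
  request
end
hydra.run # blocks until every queued request has finished
requests.each_with_index do |request, i|
  File.binwrite("#{i}#{File.extname(request.base_url)}", request.response.body)
end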
The problem with my code is that I can't make multiple requests on a Net::HTTP instance at the same time.
The solution is to open an HTTP connection for each thread.
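A minimal sketch of that fix, keeping the rest of the script as it is (one thread per file for brevity; a real version should cap the thread count):

threads = sources.each_with_index.map do |source, j|
  Thread.new do
    # Each thread gets its own connection, so requests don't collide.
    Net::HTTP.start("2ch.hk", :use_ssl => true) do |http|
      file = http.get(source).body
      type = /.+\.(.+)/.match(source)[1]
      File.open("#{j}.#{type}", "wb") { |f| f.write(file) }
    end
  end
end
threads.each(&:join)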

Ruby: Decorator pattern slows simple program by a lot

I recently wrote a program to return a bunch of stocks from the stock market that are unhealthy. The basic algorithm is this:
1. Look up the quotes of every stock in an exchange (either NYSE or NASDAQ).
2. Find the ones from step 1 that trade under 5 dollars.
3. Find the ones from step 2 that are down 3 days in a row and have large volume (expensive, because I have to make a request for each stock, which is ~700 requests for NASDAQ currently).
4. Scan the news for the ones returned by step 3.
I had this all in one file:
Original implementation (https://github.com/EdmundMai/minion/blob/aa14bc3234a4953e7273ec502276c6f0073b459d/lib/minion.rb):
require 'bundler/setup'
require "minion/version"
require "yahoo-finance"
require "business_time"
require 'nokogiri'
require 'open-uri'

module Minion
  class << self
    def query(exchange)
      client = YahooFinance::Client.new
      all_companies = CSV.read("#{exchange}.csv")
      small_caps = []
      ticker_symbols = all_companies.map { |row| row[0] }
      ticker_symbols.each_slice(200) do |batch|
        data = client.quotes(batch, [:symbol, :last_trade_price, :average_daily_volume])
        small_caps << data.select { |stock| stock.last_trade_price.to_f < 5.0 }
      end
      attractive = []
      small_caps.flatten!.each_with_index do |small_cap, index|
        begin
          data = client.historical_quotes(small_cap.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
          closing_prices = data.map(&:close).map(&:to_f)
          volumes = data.map(&:volume).map(&:to_i)
          negative_3_days_in_a_row = closing_prices == closing_prices.sort
          larger_than_average_volume = volumes.reduce(:+) / volumes.count > small_cap.average_daily_volume.to_i
          if negative_3_days_in_a_row && larger_than_average_volume
            attractive << small_cap.symbol
            puts "Qualified: #{small_cap.symbol}, finished with #{index} out of #{small_caps.count}"
          else
            puts "Not qualified: #{small_cap.symbol}, finished with #{index} out of #{small_caps.count}"
          end
        rescue => e
          puts e.inspect
        end
      end
      final_results = []
      attractive.each do |symbol|
        rss_feed = Nokogiri::HTML(open("http://feeds.finance.yahoo.com/rss/2.0/headline?s=#{symbol}&region=US&lang=en-US"))
        html_body = rss_feed.css('body')[0].text
        diluting = false
        ['warrant', 'cashless exercise'].each do |keyword|
          diluting = true if html_body.match(/#{keyword}/i)
        end
        final_results << symbol if diluting
      end
      final_results
    end
  end
end
This was really fast and would finish processing ~700 stocks in a minute or less.
Then I tried refactoring, splitting the algorithm into different classes and files without changing it at all. I decided on the decorator pattern since it seemed to fit. However, when I run the program now, each request is really slow (15+ minutes in total). I know this because my puts statements get printed out really slowly.
New and slower implementation (https://github.com/EdmundMai/minion/blob/master/lib/minion.rb)
require 'bundler/setup'
require "minion/version"
require "yahoo-finance"
require "minion/dilution_finder"
require "minion/negative_finder"
require "minion/small_cap_finder"
require "minion/market_fetcher"

module Minion
  class << self
    def query(exchange)
      all_companies = CSV.read("#{exchange}.csv")
      all_tickers = all_companies.map { |row| row[0] }
      short_finder = DilutionFinder.new(NegativeFinder.new(SmallCapFinder.new(MarketFetcher.new(all_tickers))))
      short_finder.results
    end
  end
end
The part where it's lagging, according to my puts statements:
require "yahoo-finance"
require "business_time"
require_relative "stock_finder"
class NegativeFinder < StockFinder
def results
client = YahooFinance::Client.new
results = []
finder.results.each_with_index do |stock, index|
begin
data = client.historical_quotes(stock.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
closing_prices = data.map(&:close).map(&:to_f)
volumes = data.map(&:volume).map(&:to_i)
negative_3_days_in_a_row = closing_prices == closing_prices.sort
larger_than_average_volume = volumes.reduce(:+) / volumes.count > stock.average_daily_volume.to_i
if negative_3_days_in_a_row && larger_than_average_volume
results << stock
puts "Qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}"
else
puts "Not qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}"
end
rescue => e
puts e.inspect
end
end
results
end
end
It's lagging on step 3 (making one request per stock). Not sure what's going on, so any advice would be appreciated. If you want to clone the program and run it, just uncomment the last line in lib/minion.rb and run ruby lib/minion.rb.
After debugging it, I figured it out. It was because I was calling finder.results (results being the decorated method) inside the loop, as shown below:
require "yahoo-finance"
require "business_time"
require_relative "stock_finder"
class NegativeFinder < StockFinder
def results
client = YahooFinance::Client.new
results = []
finder.results.each_with_index do |stock, index|
begin
data = client.historical_quotes(stock.symbol, { start_date: 2.business_days.ago, end_date: Time.now })
closing_prices = data.map(&:close).map(&:to_f)
volumes = data.map(&:volume).map(&:to_i)
negative_3_days_in_a_row = closing_prices == closing_prices.sort
larger_than_average_volume = volumes.reduce(:+) / volumes.count > stock.average_daily_volume.to_i
if negative_3_days_in_a_row && larger_than_average_volume
results << stock
// HERE!!!!!!!!!!!!!!!!!!!!!!!!!
puts "Qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}" <------------------------------------
else
// AND HERE!!!!!!!!!!!!!!!!!!!!!!!!!
puts "Not qualified: #{stock.symbol}, finished with #{index} out of #{finder.results.count}" <-----------------------------------------------------------
end
rescue => e
puts e.inspect
end
end
results
end
end
This caused a cascade of requests every time I iterated through the loop in NegativeFinder. Removing that call fixed it. Lesson: when using the decorator pattern, call an expensive decorated method only once, or cache its return value in an instance variable so it isn't recomputed on every call.
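For illustration, a minimal sketch of the caching fix (negative_three_days? is a hypothetical helper standing in for the existing historical-quotes check):

class NegativeFinder < StockFinder
  def results
    # Memoize so the expensive decorated call runs at most once per object.
    @results ||= finder.results.select do |stock|
      negative_three_days?(stock) # hypothetical helper wrapping the check above
    end
  end
end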
Also, as a side note, I've decided not to go with the decorator pattern, because I don't think it applies well here. Something like SmallCapFinder.new(SmallCapFinder.new(MarketFetcher.new(all_tickers))) doesn't add functionality (the primary purpose of the decorator pattern), so chaining decorators doesn't buy anything here. I'm just going to make them plain methods instead of adding unnecessary complexity.
There are some things missing from the code you gave us (the StockFinder base class and MarketFetcher), but I think you are now instantiating more than one YahooFinance::Client. Input/output to other systems is very often the cause of speed problems.
I suggest that you first encapsulate the finance client and access to financial data. That makes it easier to switch your financial data provider or add another one later. Instead of the decorator pattern, I would just use plain old methods for finding small caps, finding negative stocks, and so on.
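For instance, a sketch of that plain-method shape, reusing the calls from the original code (the class and method names here are made up):

require 'yahoo-finance'
require 'business_time'

class StockScreener
  def initialize(tickers)
    @tickers = tickers
    @client = YahooFinance::Client.new # one client shared by every step
  end

  def results
    find_negative(find_small_caps(@tickers))
  end

  private

  # Each step is a plain method: easy to test, no decorator chaining.
  def find_small_caps(tickers)
    tickers.each_slice(200).flat_map do |batch|
      quotes = @client.quotes(batch, [:symbol, :last_trade_price, :average_daily_volume])
      quotes.select { |stock| stock.last_trade_price.to_f < 5.0 }
    end
  end

  def find_negative(stocks)
    stocks.select do |stock|
      data = @client.historical_quotes(stock.symbol, start_date: 2.business_days.ago, end_date: Time.now)
      closes = data.map { |d| d.close.to_f }
      closes == closes.sort
    end
  end
end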

Calling multiple methods on a CSV object

I have constructed an EventManager class that performs parsing actions on a CSV file and produces HTML letters using ERB. It is part of a Jumpstart Labs tutorial.
The program works fine, but I am unable to call multiple methods on an object without the earlier methods interfering with the later ones. As a result, I have opted to create multiple objects to call instance methods on, which seems like a clunky, inelegant solution. Is there a better way to do this, where I can create a single object and call methods on it?
Like so:
eventmg = EventManager.new("event_attendees.csv")
eventmg.print_valid_phone_numbers
eventmg_2 = EventManager.new("event_attendees.csv")
eventmg_2.print_zipcodes
eventmg_3 = EventManager.new("event_attendees.csv")
eventmg_3.time_targeter
eventmg_4 = EventManager.new("event_attendees.csv")
eventmg_4.day_of_week
eventmg_5 = EventManager.new("event_attendees.csv")
eventmg_5.create_thank_you_letters
The complete code is as follows
require 'csv'
require 'sunlight/congress'
require 'erb'

class EventManager
  INVALID_PHONE_NUMBER = "0000000000"
  Sunlight::Congress.api_key = "e179a6973728c4dd3fb1204283aaccb5"

  def initialize(file_name, list_selections = [])
    puts "EventManager Initialized."
    @file = CSV.open(file_name, { :headers => true,
                                  :header_converters => :symbol })
    @list_selections = list_selections
  end

  def clean_zipcode(zipcode)
    zipcode.to_s.rjust(5, "0")[0..4]
  end

  def print_zipcodes
    puts "Valid Participant Zipcodes"
    @file.each do |line|
      zipcode = clean_zipcode(line[:zipcode])
      puts zipcode
    end
  end

  def clean_phone(phone_number)
    converted = phone_number.scan(/\d/).join('').split('')
    if converted.count == 10
      phone_number
    elsif phone_number.to_s.length < 10
      INVALID_PHONE_NUMBER
    elsif phone_number.to_s.length == 11 && converted[0] == 1
      phone_number.shift
      phone_number.join('')
    elsif phone_number.to_s.length == 11 && converted[0] != 1
      INVALID_PHONE_NUMBER
    else
      phone_number.to_s.length > 11
      INVALID_PHONE_NUMBER
    end
  end

  def print_valid_phone_numbers
    puts "Valid Participant Phone Numbers"
    @file.each do |line|
      clean_number = clean_phone(line[:homephone])
      puts clean_number
    end
  end

  def time_targeter
    busy_times = Array.new(24) { 0 }
    @file.each do |line|
      registration = line[:regdate]
      prepped_time = DateTime.strptime(registration, "%m/%d/%Y %H:%M")
      prepped_time = prepped_time.hour.to_i
      # inserts filtered hour into the array 'list_selections'
      @list_selections << prepped_time
    end
    # tallies number of registrations for each hour
    i = 0
    while i < @list_selections.count
      busy_times[@list_selections[i]] += 1
      i += 1
    end
    # delivers a result showing the hour and the number of registrations
    puts "Number of Registered Participants by Hour:"
    busy_times.each_with_index { |counter, hours| puts "#{hours}\t#{counter}" }
  end

  def day_of_week
    busy_day = Array.new(7) { 0 }
    d_of_w = ["Monday:", "Tuesday:", "Wednesday:", "Thursday:", "Friday:", "Saturday:", "Sunday:"]
    @file.each do |line|
      registration = line[:regdate]
      # you have to reformat the date because of the parser format
      prepped_date = Date.strptime(registration, "%m/%d/%y")
      prepped_date = prepped_date.wday
      # adds filtered day of week into array 'list_selections'
      @list_selections << prepped_date
    end
    i = 0
    while i < @list_selections.count
      # i is minus one since days of week begin at '1' and arrays begin at '0'
      busy_day[@list_selections[i - 1]] += 1
      i += 1
    end
    # busy_day.each_with_index { |counter, day| puts "#{day}\t#{counter}" }
    prepared = d_of_w.zip(busy_day)
    puts "Number of Registered Participants by Day of Week"
    prepared.each { |date| puts date.join(" ") }
  end

  def legislators_by_zipcode(zipcode)
    Sunlight::Congress::Legislator.by_zipcode(zipcode)
  end

  def save_thank_you_letters(id, form_letter)
    Dir.mkdir("output") unless Dir.exists?("output")
    filename = "output/thanks_#{id}.html"
    File.open(filename, 'w') do |file|
      file.puts form_letter
    end
  end

  def create_thank_you_letters
    puts "Thank You Letters Available in Output Folder"
    template_letter = File.read "form_letter.erb"
    erb_template = ERB.new template_letter
    @file.each do |line|
      id = line[0]
      name = line[:first_name]
      zipcode = clean_zipcode(line[:zipcode])
      legislators = legislators_by_zipcode(zipcode)
      form_letter = erb_template.result(binding)
      save_thank_you_letters(id, form_letter)
    end
  end
end
The reason you're experiencing this problem is that when you apply each to the result of CSV.open, you're moving the file pointer each time. When you reach the end of the file with one of your methods, there is nothing left for the others to read.
An alternative is to read the contents of the file into an instance variable at initialization with readlines. You'll get an array of rows which you can operate on with each just as easily.
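A sketch of that change against the original class (only initialize and one method shown; the other methods would iterate @rows the same way):

require 'csv'

class EventManager
  def initialize(file_name, list_selections = [])
    puts "EventManager Initialized."
    # Read all rows into memory once; each method then iterates the
    # array instead of sharing (and exhausting) one file pointer.
    @rows = CSV.open(file_name, headers: true, header_converters: :symbol).readlines
    @list_selections = list_selections
  end

  def print_zipcodes
    puts "Valid Participant Zipcodes"
    @rows.each do |line|
      puts clean_zipcode(line[:zipcode])
    end
  end
end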
"Is there a better way to do this, where I can create a single new object and call methods on it?"
Probably. If your methods are interfering with one another, it means you're changing state within the manager, instead of working on local variables.
Sometimes, it's the right thing to do (e.g. Array#<<); sometimes not (e.g. Fixnum#+)... Seeing your method names, it probably isn't.
Nail the offenders down and adjust the code accordingly. (I only scanned your code, but those Array#<< calls on an instance variable, in particular, look fishy.)
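To make that concrete, time_targeter could tally into locals instead of pushing to @list_selections; a sketch building on the @rows idea from the previous answer:

def time_targeter
  busy_times = Array.new(24) { 0 }
  hours = [] # local, so repeated calls don't accumulate stale state
  @rows.each do |line|
    hours << DateTime.strptime(line[:regdate], "%m/%d/%Y %H:%M").hour
  end
  hours.each { |h| busy_times[h] += 1 }
  puts "Number of Registered Participants by Hour:"
  busy_times.each_with_index { |counter, hour| puts "#{hour}\t#{counter}" }
end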
