Since Shopify's default products importer (via CSV) is really slow, I'm using multithreading to add ~24000 products to a Shopify store using the API. The API has a call limit of 2 per second. With 4 threads the calls are within the limit.
But after a while all threads stop working except one. I don't get any error messages, the code keeps running but doesn't print any product information. I can't seem to figure out what's going wrong.
Here's the code I'm using:
require 'shopify_api'
require 'open-uri'
require 'json'
require 'base64'
begin_time = Time.now
my_threads = []
shop_url = "https://<API_KEY>:<PASSWORD>@<SHOPNAME>.myshopify.com/admin"
ShopifyAPI::Base.site = shop_url
raw_product_data = JSON.parse(open('<REDACTED>') {|f| f.read }.force_encoding('UTF-8'))
# Split raw product data
# Use ceil so the data never splits into a fifth, silently dropped slice
one, two, three, four = raw_product_data.each_slice( (raw_product_data.size/4.0).ceil ).to_a
def category_to_tag(input)
<REDACTED>
end
def bazookah(array, number)
array.each do |item|
single_product_begin_time = Time.now
# Store item data in variables
vendor = item['brand'].nil? ? 'Overige' : item['brand']
title = item['description']
item_size = item['salesUnitSize']
body = "#{vendor} - #{title} - #{item_size}"
type = item['category'].nil? ? 'Overige' : item['category']
tags = category_to_tag(item['category']) unless item['category'].nil?
variant_sku = item['itemId']
variant_price = item['basePrice']['price']
if !item['images'].nil? && !item['images'][2].nil?
image_src = item['images'][2]['url']
end
image_time_begin = Time.now
image = Base64.encode64(open(image_src) { |io| io.read }) unless image_src.nil?
image_time_end = Time.now
total_image_time = image_time_end - image_time_begin
# Create new product
new_product = ShopifyAPI::Product.new
new_product.title = title
new_product.body_html = body
new_product.product_type = type
new_product.vendor = vendor
new_product.tags = item['category'].nil? ? 'Overige' : tags
new_product.variants = [ <REDACTED> ]
new_product.images = [ <REDACTED> ]
new_product.save
creation_time = Time.now - single_product_begin_time
puts "#{number}: #{variant_sku} - P: #{creation_time.round(2)} - I: #{image_src.nil? ? 'No image' : total_image_time.round(3)}"
end
end
puts '====================================================================================='
puts "#{raw_product_data.size} products loaded. Starting import at #{begin_time}..."
puts '-------------------------------------------------------------------------------------'
my_threads << Thread.new { bazookah(one, 'one') }
my_threads << Thread.new { bazookah(two, 'two') }
my_threads << Thread.new { bazookah(three, 'three') }
my_threads << Thread.new { bazookah(four, 'four') }
my_threads.each { |thr| thr.join }
puts '-------------------------------------------------------------------------------------'
puts "Done. It took #{((Time.now - begin_time) / 60).round(1)} minutes."
puts '====================================================================================='
What could I try to solve this?
It most likely has something to do with this:
http://docs.shopify.com/api/introduction/api-call-limit
I'd suspect that you are being rate limited by Shopify. You are trying to add 24000 records, via the API, from a single IP address. Most people don't like that kind of thing.
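If rate limiting is the cause, one option is to share a single throttle between all threads so that together they stay under the limit. The following is only a minimal sketch: CallThrottle is a hypothetical helper, not part of shopify_api, and the limit of 2 calls per second is taken from the question.

```ruby
# Hypothetical helper (not part of shopify_api): spaces out calls so that
# all threads together never exceed `per_second` calls per second.
class CallThrottle
  def initialize(per_second)
    @interval = 1.0 / per_second
    @mutex = Mutex.new
    @next_slot = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  end

  # Blocks the calling thread until its reserved time slot arrives.
  def wait
    delay = @mutex.synchronize do
      now = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      slot = [@next_slot, now].max   # earliest free slot
      @next_slot = slot + @interval  # reserve it for this caller
      slot - now
    end
    sleep(delay) if delay > 0
  end
end

throttle = CallThrottle.new(2) # assumed limit: 2 calls per second

# In the import loop, each thread would reserve a slot before saving:
#   throttle.wait
#   new_product.save
```

The slot is reserved inside the mutex but the sleep happens outside it, so waiting threads do not block each other longer than necessary.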
This code gives me these errors in my ruby console:
1) warning: else without rescue is useless
2) syntax error, unexpected end-of-input, expecting keyword_end
Why am I getting both of these errors at the same time?
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'awesome_print'
require 'watir'
def input #takes user input and grabs the url for that particular search
puts "1) Enter the job title that you want to search for \n"
j_input = gets.chomp
job = j_input.split(/ /).join("+")
puts "================================= \n"
puts "1/2)Do you want to input city-and-state(1) or zipcode(2)? \n"
choice = gets.chomp
if choice == "1"
puts "2) Enter the city that you want to search for \n"
city_input = gets.chomp
city = city_input.split(/ /).join("+")
puts "================================= \n"
puts "3) Enter the state that you want to search for \n"
state_input = gets.chomp
state = "+" + state_input
puts target_url = "https://www.indeed.com/resumes/?q=#{job}&l=#{city}%2C#{state}&cb=jt"
elsif choice == "2"
puts "Enter the zipcode that you want to search for \n"
zipcode = gets.chomp
puts target_url = "https://www.indeed.com/resumes?q=#{job}&l=#{zipcode}&cb=jt"
else
puts "error"
end
unparsed_page = HTTParty.get(target_url)
parsed_page = Nokogiri::HTML(unparsed_page)
resume_listing = parsed_page.css('div.sre-entry')
per_page = resume_listing.count
resumes = Array.new
counter = 0
result_count = parsed_page.css('div#result_count').text.split(' ')[0].to_f
page_count = (result_count.to_f / per_page.to_f ).ceil
current_count = 0
byebug
if counter <= 0
unparsed_page = HTTParty.get(target_url)
parsed_page = Nokogiri::HTML(unparsed_page)
resume_listing = parsed_page.css('div.sre-entry')
per_page = resume_listing.count
pagination_resume_listing.each do |resume_listing|
#resume_info = {
# title:
# link:
# skills:
# education:
#}
#resumes << resume_info
puts "Added #{resume_info[:title]}"
else
while current_count <= page_count * per_page
pagination_url = "https://www.indeed.com/resumes?q=#{job}&l=#{zipcode}&co=US&cb=jt&start=#{current_count}"
unparsed_pagination_page = HTTParty.get(pagination_url)
pagination_parsed_page = Nokogiri::HTML(unparsed_pagination_page)
pagination_resume_listing = pagination_parsed_page.css('div.sre-entry')
pagination_resume_listing.each do |resume_listing|
#resume_info = {
# title:
# link:
# skills:
# education:
#}
#resumes << resume_info
puts "Added #{resume_info[:title]}"
current_count += 50
end
end
end
end
end
I can't fix the "else without rescue" warning: when I try, Ruby tells me it expects an extra end at the end of my code. When I add that end, nothing changes and it asks for yet another end.
I would say that your code is horribly formatted, but it would first have to be formatted at all to be even that much. Once you format it, the answer is quite obvious: you have a misplaced end.
puts "Added #{resume_info[:title]}"
# There should be an end here for the "do" block above
else
Here is what it should be:
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'awesome_print'
require 'watir'
def input #takes user input and grabs the url for that particular search
  puts "1) Enter the job title that you want to search for \n"
  j_input = gets.chomp
  job = j_input.split(/ /).join("+")
  puts "================================= \n"
  puts "1/2)Do you want to input city-and-state(1) or zipcode(2)? \n"
  choice = gets.chomp
  if choice == "1"
    puts "2) Enter the city that you want to search for \n"
    city_input = gets.chomp
    city = city_input.split(/ /).join("+")
    puts "================================= \n"
    puts "3) Enter the state that you want to search for \n"
    state_input = gets.chomp
    state = "+" + state_input
    puts target_url = "https://www.indeed.com/resumes/?q=#{job}&l=#{city}%2C#{state}&cb=jt"
  elsif choice == "2"
    puts "Enter the zipcode that you want to search for \n"
    zipcode = gets.chomp
    puts target_url = "https://www.indeed.com/resumes?q=#{job}&l=#{zipcode}&cb=jt"
  else
    puts "error"
  end
  unparsed_page = HTTParty.get(target_url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  resume_listing = parsed_page.css('div.sre-entry')
  per_page = resume_listing.count
  resumes = Array.new
  counter = 0
  result_count = parsed_page.css('div#result_count').text.split(' ')[0].to_f
  page_count = (result_count.to_f / per_page.to_f ).ceil
  current_count = 0
  byebug
  if counter <= 0
    unparsed_page = HTTParty.get(target_url)
    parsed_page = Nokogiri::HTML(unparsed_page)
    resume_listing = parsed_page.css('div.sre-entry')
    per_page = resume_listing.count
    pagination_resume_listing.each do |resume_listing|
      #resume_info = {
      # title:
      # link:
      # skills:
      # education:
      #}
      #resumes << resume_info
      puts "Added #{resume_info[:title]}"
    end
  else
    while current_count <= page_count * per_page
      pagination_url = "https://www.indeed.com/resumes?q=#{job}&l=#{zipcode}&co=US&cb=jt&start=#{current_count}"
      unparsed_pagination_page = HTTParty.get(pagination_url)
      pagination_parsed_page = Nokogiri::HTML(unparsed_pagination_page)
      pagination_resume_listing = pagination_parsed_page.css('div.sre-entry')
      pagination_resume_listing.each do |resume_listing|
        #resume_info = {
        # title:
        # link:
        # skills:
        # education:
        #}
        #resumes << resume_info
        puts "Added #{resume_info[:title]}"
        current_count += 50
      end
    end
  end
end
Lesson here is to ALWAYS format your code, for everyone's sake, most of all your own. There is no excuse to not be formatted, and not doing so leads to trivial problems like this that are difficult to find.
NOTE
I did not test this or run it, simply formatted, which made the mis-matched end obvious.
I have a Rails 3.2.21 app where I'm building time clock functionality. I'm currently writing a to_csv method that should do the following:
Create a header row with column names
Iterate through a block of input (records) and display the employee username, clock_in, clock_out, station, and comment objects, then finally on the last line of the block display the total hours.
In between each user I want to display a sum of their total hours. As you can see in the to_csv method I'm able to get this to work "hackish" by shoveling an array of csv << [TimeFormatter.format_time(ce.user.clock_events.sum(&:total_hours))] into the CSV. The end result is it does give me the proper total hours for each employee's clock_events, but it repeats it after every entry because I'm obviously iterating over a block.
I'd like to figure out a way to abstract this outside of the block and figure out how to shovel in another array that calculates total_hours for all clock events by user without duplicate entries.
Below is my model, so if something is not clear, please let me know. Also if my question is confusing or doesn't make sense let me know and I'll be happy to clarify.
class ClockEvent < ActiveRecord::Base
attr_accessible :clock_in, :clock_out, :user_id, :station_id, :comment
belongs_to :user
belongs_to :station
scope :incomplete, -> { where(clock_out: nil) }
scope :complete, -> { where("clock_out IS NOT NULL") }
scope :current_week, -> {where("clock_in BETWEEN ? AND ?", Time.zone.now.beginning_of_week - 1.day, Time.zone.now.end_of_week - 1.day)}
scope :search_between, lambda { |start_date, end_date| where("clock_in BETWEEN ? AND ?", start_date.beginning_of_day, end_date.end_of_day)}
scope :search_by_start_date, lambda { |start_date| where('clock_in BETWEEN ? AND ?', start_date.beginning_of_day, start_date.end_of_day) }
scope :search_by_end_date, lambda { |end_date| where('clock_in BETWEEN ? AND ?', end_date.beginning_of_day, end_date.end_of_day) }
def punch_in(station_id)
self.clock_in = Time.zone.now
self.station_id = station_id
end
def punch_out
self.clock_out = Time.zone.now
end
def completed?
clock_in.present? && clock_out.present?
end
def total_hours
self.clock_out.to_i - self.clock_in.to_i
end
def formatted_clock_in
clock_in.try(:strftime, "%m/%d/%y-%H:%M")
end
def formatted_clock_out
clock_out.try(:strftime, "%m/%d/%y-%H:%M")
end
def self.search(search)
search ||= { type: "all" }
results = scoped
# If searching with BOTH a start and end date
if search[:start_date].present? && search[:end_date].present?
results = results.search_between(Date.parse(search[:start_date]), Date.parse(search[:end_date]))
# If search with any other date parameters (including none)
else
results = results.search_by_start_date(Date.parse(search[:start_date])) if search[:start_date].present?
results = results.search_by_end_date(Date.parse(search[:end_date])) if search[:end_date].present?
end
results
end
def self.to_csv(records = [], options = {})
CSV.generate(options) do |csv|
csv << ["Employee", "Clock-In", "Clock-Out", "Station", "Comment", "Total Shift Hours"]
records.each do |ce|
csv << [ce.user.try(:username), ce.formatted_clock_in, ce.formatted_clock_out, ce.station.try(:station_name), ce.comment, TimeFormatter.format_time(ce.total_hours)]
csv << [TimeFormatter.format_time(ce.user.clock_events.sum(&:total_hours))]
end
csv << [TimeFormatter.format_time(records.sum(&:total_hours))]
end
end
end
With some help from a friend and some research in the API docs, I was able to refactor the method like so:
def self.to_csv(records = [], options = {})
CSV.generate(options) do |csv|
csv << ["Employee", "Clock-In", "Clock-Out", "Station", "Comment", "Total Shift Hours"]
# records.group_by{ |r| r.user }.each do |user, records|
records.each do |ce|
csv << [ce.user.try(:username), ce.formatted_clock_in, ce.formatted_clock_out, ce.station.try(:station_name), ce.comment, TimeFormatter.format_time(ce.total_hours)]
#csv << [TimeFormatter.format_time(records.select{ |r| r.user == ce.user }.sum(&:total_hours))]
end
records.map(&:user).uniq.each do |user|
csv << ["Total Hours for: #{user.username}"]
csv << [TimeFormatter.format_time(records.select{ |r| r.user == user}.sum(&:total_hours))]
end
csv << ["Total Payroll Hours"]
csv << [TimeFormatter.format_time(records.sum(&:total_hours))]
end
end
I have two CSV files with some common headers and others that only appear in one or in the other, for example:
# csv_1.csv
H1,H2,H3
V11,V22,V33
V14,V25,V35
# csv_2.csv
H1,H4
V1a,V4b
V1c,V4d
I would like to merge both and obtain a new CSV file that combines all the information for the previous CSV files. Injecting new columns when needed, and feeding the new cells with null values.
Result example:
H1,H2,H3,H4
V11,V22,V33,
V14,V25,V35,
V1a,,,V4b
V1c,,,V4d
Challenge accepted :)
#!/usr/bin/env ruby
require "csv"
module MergeCsv
class << self
def run(csv_paths)
csv_files = csv_paths.map { |p| CSV.read(p, headers: true) }
merge(csv_files)
end
private
def merge(csv_files)
headers = csv_files.flat_map(&:headers).uniq.sort
hash_array = csv_files.flat_map(&method(:csv_to_hash_array))
CSV.generate do |merged_csv|
merged_csv << headers
hash_array.each do |row|
merged_csv << row.values_at(*headers)
end
end
end
# Probably not the most performant way, but easy
def csv_to_hash_array(csv)
csv.to_a[1..-1].map { |row| csv.headers.zip(row).to_h }
end
end
end
if(ARGV.length == 0)
puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
exit 1
end
puts MergeCsv.run(ARGV)
I have the answer; I just wanted to help people who are looking for the same solution.
require "csv"
module MergeCsv
def self.run(csv_1_path, csv_2_path)
merge(File.read(csv_1_path), File.read(csv_2_path))
end
def self.merge(csv_1, csv_2)
csv_1_table = CSV.parse(csv_1, :headers => true)
csv_2_table = CSV.parse(csv_2, :headers => true)
return csv_2_table.to_csv if csv_1_table.headers.empty?
return csv_1_table.to_csv if csv_2_table.headers.empty?
headers_in_1_not_in_2 = csv_1_table.headers - csv_2_table.headers
headers_in_1_not_in_2.each do |header_in_1_not_in_2|
csv_2_table[header_in_1_not_in_2] = nil
end
headers_in_2_not_in_1 = csv_2_table.headers - csv_1_table.headers
headers_in_2_not_in_1.each do |header_in_2_not_in_1|
csv_1_table[header_in_2_not_in_1] = nil
end
csv_2_table.each do |csv_2_row|
csv_1_table << csv_1_table.headers.map { |csv_1_header| csv_2_row[csv_1_header] }
end
csv_1_table.to_csv
end
end
if(ARGV.length != 2)
puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
exit 1
end
puts MergeCsv.run(ARGV[0], ARGV[1])
And execute it from the console this way:
$ ruby merge_csv.rb csv_1.csv csv_2.csv
Any other, maybe cleaner, solution is welcome.
Simplified version of the first answer:
How to use it:
listPart_A = CSV.read(csv_path_A, headers:true)
listPart_B = CSV.read(csv_path_B, headers:true)
listPart_C = CSV.read(csv_path_C, headers:true)
list = merge(listPart_A,listPart_B,listPart_C)
Function:
def merge(*csvs)
headers = csvs.map {|csv| csv.headers }.flatten.compact.uniq.sort
csvs.flat_map(&method(:csv_to_hash_array))
end
def csv_to_hash_array(csv)
csv.to_a[1..-1].map do |row|
Hash[csv.headers.zip(row)]
end
end
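The function above returns an array of hashes rather than CSV text. A minimal sketch of writing the merged rows straight back out as CSV, using the sample data from the question, could look like this. The name merge_tables is hypothetical, and the headers are kept in first-seen order instead of being sorted.

```ruby
require "csv"

# Hypothetical helper: merges any number of CSV::Table objects into one CSV string.
def merge_tables(*tables)
  headers = tables.flat_map(&:headers).compact.uniq
  CSV.generate do |out|
    out << headers
    tables.each do |table|
      # row[h] is nil for a header the row lacks, which is written as an empty cell
      table.each { |row| out << headers.map { |h| row[h] } }
    end
  end
end

a = CSV.parse("H1,H2,H3\nV11,V22,V33\nV14,V25,V35\n", headers: true)
b = CSV.parse("H1,H4\nV1a,V4b\nV1c,V4d\n", headers: true)
puts merge_tables(a, b)
```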
I had to do something very similar: merge n CSV files that might share some columns but not others.
If you want to keep the structure and do it easily, I think the best way is to convert each file to a hash and then convert the result back to CSV.
My solution:
#!/usr/bin/env ruby
require "csv"
def join_multiple_csv(csv_path_array)
return nil if csv_path_array.nil? or csv_path_array.empty?
f = CSV.parse(File.read(csv_path_array[0]), :headers => true)
f_h = {}
f.headers.each {|header| f_h[header] = f[header]}
n_rows = f.size
csv_path_array.shift(1)
csv_path_array.each do |csv_file|
curr_csv = CSV.parse(File.read(csv_file), :headers => true)
curr_h = {}
curr_csv.headers.each {|header| curr_h[header] = curr_csv[header]}
new_headers = curr_csv.headers - f_h.keys
exist_headers = curr_csv.headers - new_headers
new_headers.each { |new_header|
f_h[new_header] = Array.new(n_rows) + curr_csv[new_header]
}
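exist_headers.each {|exist_header|
f_h[exist_header] = f_h[exist_header] + curr_csv[exist_header]
}
# Pad columns that this file lacks so rows from later files stay aligned
(f_h.keys - curr_csv.headers).each {|missing_header|
f_h[missing_header] = f_h[missing_header] + Array.new(curr_csv.size)
}
n_rows = n_rows + curr_csv.size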
end
csv_string = CSV.generate do |csv|
csv << f_h.keys
(0..n_rows-1).each do |i|
row = []
f_h.each_key do |header|
row << f_h[header][i]
end
csv << row
end
end
return csv_string
end
if(ARGV.length < 2)
puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2> .. <file_path_csv_n>"
exit 1
end
csv_str = join_multiple_csv(ARGV)
# File.write opens, writes, and closes the file in one call
File.write("results.csv", csv_str)
puts "CSV merge is done"
I'm trying to run this code on Red Hat Linux, and it won't launch a browser. The only way I can get it to work is if I also launch a browser outside of the thread, which makes no sense to me. Here is what I mean:
require 'watir-webdriver'
$alphabet = ["A", "B", "C"]
$alphabet.each do |z|
puts "pshaw"
Thread.new{
Thread.current["testPuts"] = "ohai " + z.to_s
Thread.current["myBrowser"] = Watir::Browser.new :ff
puts Thread.current["testPuts"] }
$browser = Watir::Browser.new :ff
end
the output is:
pshaw
(launches browser)
ohai A
(launches browser)
pshaw
(launches browser)
ohai B
(launches browser)
pshaw
(launches browser)
ohai C
(launches browser)
However, if I remove the browser launch that is outside of the thread, as so:
require 'watir-webdriver'
$alphabet = ["A", "B", "C"]
$alphabet.each do |z|
puts "pshaw"
Thread.new{
Thread.current["testPuts"] = "ohai " + z.to_s
Thread.current["myBrowser"] = Watir::Browser.new :ff
puts Thread.current["testPuts"] }
end
The output is:
pshaw
pshaw
pshaw
What is going on here? How do I fix this so that I can launch a browser inside a thread?
EDIT TO ADD:
The solution Justin Ko provided worked on the pseudocode above, but it's not helping with my actual code:
require 'watir-webdriver'
require_relative 'Credentials'
require_relative 'ReportGenerator'
require_relative 'installPageLayouts'
require_relative 'PackageHandler'
Dir[(Dir.pwd.to_s + "/bmx*")].each {|file| require_relative file } #this includes all the files in the directory with names starting with bmx
module Runner
def self.runTestCases(orgType, *caseNumbers)
$testCaseArray = Array.new
caseNumbers.each do |thisCaseNum|
$testCaseArray << thisCaseNum
end
$allTestCaseResults = Array.new
$alphabet = ["A", "B", "C"]
@count = 0
@multiOrg = 0
@peOrg = 0
@eeOrg = 0
@threads = Array.new
$testCaseArray.each do |thisCase|
$alphabet[@count] = Thread.new {
puts "working one"
Thread.current["tBrowser"] = Watir::Browser.new :ff
puts "working two"
if ((thisCase.declareOrg().downcase == "multicurrency") || (thisCase.declareOrg().downcase == "mc"))
currentOrg = $multicurrencyOrgArray[@multiOrg]
@multiOrg += 1
elsif ((thisCase.declareOrg().downcase == "enterprise") || (thisCase.declareOrg().downcase == "ee"))
currentOrg = $eeOrgArray[@eeOrg]
@eeOrg += 1
else #default to single currency PE
currentOrg = $peOrgArray[@peOrg]
@peOrg += 1
end
setupOrg(currentOrg, thisCase.testCaseID, currentOrg.layoutDirectory)
runningTest = thisCase.actualTest()
if runningTest.crashed != "crashed" #changed this to read the attr_reader instead of the deleted caseStatus method from TestCase.rb
cleanupOrg(thisCase.testCaseID, currentOrg.layoutDirectory)
end
@threads << Thread.current
}
@count += 1
end
@threads.each do |thisThread|
thisThread.join
end
writeReport($allTestCaseResults)
end
def self.setupOrg(thisOrg, caseID, layoutPath)
begin
thisOrg.logIn
pkg = PackageHandler.new
basicInstalled = "false"
counter = 0
until ((basicInstalled == "true") || (counter == 5))
pkg.basicInstaller()
if Thread.current["tBrowser"].text.include? "You have attempted to access a page"
thisOrg.logIn
else
basicInstalled = "true"
end
counter +=1
end
if !((caseID.include? "bmxb") || (caseID.include? "BMXB"))
moduleInstalled = "false"
counter2 = 0
until ((moduleInstalled == "true") || (counter == 5))
pkg.packageInstaller(caseID)
if Thread.current["tBrowser"].text.include? "You have attempted to access a page"
thisOrg.logIn
else
moduleInstalled = "true"
end
counter2 +=1
end
end
installPageLayouts(layoutPath)
rescue
$allTestCaseResults << TestCaseResult.new(caseID, caseID, 1, "SETUP FAILED!" + "<p>#{$!}</p><p>#{$@}</p>").hashEmUp
writeReport($allTestCaseResults)
end
end
def self.cleanupOrg(caseID, layoutPath)
begin
uninstallPageLayouts(layoutPath)
pkg = PackageHandler.new
pkg.packageUninstaller(caseID)
Thread.current["tBrowser"].close
rescue
$allTestCaseResults << TestCaseResult.new(caseID, caseID, 1, "CLEANUP FAILED!" + "<p>#{$!}</p><p>#{$@}</p>").hashEmUp
writeReport($allTestCaseResults)
end
end
end
The output it's generating is:
working one
working one
working one
It's not opening a browser or doing any of the subsequent code.
It looks like the code is having the problem mentioned in the Thread class documentation:
If we don't call thr.join before the main thread terminates, then all
other threads including thr will be killed.
Basically, your main thread finishes almost instantly, while the threads that create browsers take much longer. As a result, the threads are terminated before the browsers open.
By adding a long sleep at the end, you can see that your browsers can be opened by your code:
require 'watir-webdriver'
$chunkythread = ["A", "B", "C"]
$chunkythread.each do |z|
puts "pshaw"
Thread.new{
Thread.current["testwords"] = "ohai " + z.to_s
Thread.current["myBrowser"] = Watir::Browser.new :ff
puts Thread.current["testwords"] }
end
sleep(300)
However, for more reliability, you should join all the threads at the end:
require 'watir-webdriver'
threads = []
$chunkythread = ["A", "B", "C"]
$chunkythread.each do |z|
puts "pshaw"
threads << Thread.new{
Thread.current["testwords"] = "ohai " + z.to_s
Thread.current["myBrowser"] = Watir::Browser.new :ff
puts Thread.current["testwords"] }
end
threads.each { |thr| thr.join }
For the actual code example, putting @threads << Thread.current inside the thread block will not work: the main thread reaches the join loop before any thread has added itself, so @threads is still empty and nothing is joined. You could try the following instead:
$testCaseArray.each do |thisCase|
@threads << Thread.new {
puts "working one"
Thread.current["tBrowser"] = Watir::Browser.new :ff
# Do your other thread stuff
}
$alphabet[@count] = @threads.last
@count += 1
end
@threads.each do |thisThread|
thisThread.join
end
Note that I am not sure why you want to store the threads in $alphabet. I kept $alphabet[@count] = @threads.last, but it can be removed if it is not used.
I uninstalled Watir 5.0.0 and installed Watir 4.0.2, and now it works fine.
I have a program using the spreadsheet gem to create a CSV file; I have not been able to find the way to configure the functionality that I need.
This is what I would like the gem to do: The model number and additional_image field should be "in sync", that is, each additional image written to the spreadsheet doc should be a new line and should not be wrapped.
Here are some snippets of the desired output in contrast with the current. These fields are defined by XPath objects that are screen scraped using another gem. The program won't know for sure how many objects it will encounter in the additional image field but due to business logic the number of objects in the additional image field should mirror the number of model number objects that are written to the spreadsheet.
model        additional_image
168868837a   1688688371.jpg
168868837a   1688688372.jpg
168868837a   1688688373.jpg
168868837a   1688688374.jpg
168868837a   1688688375.jpg
168868837a   1688688376.jpg
This is the current code:
require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"
LOCAL_DIR = 'data-hold/images'
FileUtils.makedirs(LOCAL_DIR) unless File.exist?(LOCAL_DIR)
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
include Capybara::DSL
def initialize
@excel = Spreadsheet::Workbook.new
@work_list = @excel.create_worksheet
@row = 0
end
def go
visit_main_link
end
def retryable(options = {}, &block)
opts = { :tries => 1, :on => Exception }.merge(options)
retry_exception, retries = opts[:on], opts[:tries]
begin
return yield
rescue retry_exception
retry if (retries -= 1) > 0
end
yield
end
def visit_main_link
retryable(:tries => 1, :on => OpenURI::HTTPError) do
visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
item = []
results.each do |a|
item << a[:href]
end
item.each do |link|
visit link
save_item
end
@excel.write "inventory.csv"
end
end
def save_item
data = all("//*[@id='content-wrapper']/div[2]/div/div")
data.each do |info|
@work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
price = info.first("//div[contains(@class, 'price font left')]")
@work_list[@row, 1] = (price.text.to_f * 1.33).round(2) if price
@work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
@work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
color = info.all("//dd[1]//select[contains(@name, 'options')]//*[@price='0']")
@work_list[@row, 4] = color.collect(&:text).join(', ')
size = info.all("//dd[2]//select[contains(@name, 'options')]//*[@price='0']")
@work_list[@row, 5] = size.collect(&:text).join(', ')
model = File.basename(info.find("//*[@id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
@work_list[@row, 6] = model.gsub!(/\D/, "")
@work_list[@row, 7] = File.basename(info.find("//*[@id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
additional_image = info.all("//*[@rel='lightbox[rotation]']")
@work_list[@row, 8] = additional_image.map { |link| File.basename(link['href']) }.join(', ')
images = imagelink.map { |link| link['href'] }
images.each do |image|
File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|
f.write(open(image).read)
end
end
@row = @row + 1
end
end
end
tomtop = Tomtop.new
tomtop.go
I would like this to do two things that I'm not sure how to do:
Each additional image should print to a new line (currently it prints all in one cell).
I would like the model field to be duplicated exactly as many times as there are additional_images in the same new line manner.
Use Ruby's CSV library. I took the long way of writing this so you can see how it works.
require 'csv'
DOC = "file.csv"
profile = []
profile[0] = "model"
CSV.open(DOC, "a") do |me|
me << profile
end
img_url = ['pic_1.jpg','pic_2.jpg','pic_3.jpg','pic_4.jpg','pic_5.jpg','pic_6.jpg']
a = 0
b = img_url.length
while a < b
profile = []
profile[0] = img_url[a]
CSV.open(DOC, "a") do |me|
me << profile
end
a += 1
end
The CSV file should look like this:
model
pic_1.jpg
pic_2.jpg
pic_3.jpg
pic_4.jpg
pic_5.jpg
pic_6.jpg
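To get the model repeated next to every image, as the question asks, each image can be written as its own two-column row. This is a minimal sketch that assumes the model number and the image file names have already been scraped; the sample values come from the question, and CSV.generate builds a string where CSV.open(DOC, "a") would append to a file.

```ruby
require "csv"

model = "168868837a" # sample values taken from the question
images = ["1688688371.jpg", "1688688372.jpg", "1688688373.jpg"]

csv_text = CSV.generate do |csv|
  csv << ["model", "additional_image"]
  # One row per image, with the model number repeated in the first column
  images.each { |image| csv << [model, image] }
end

puts csv_text
```

Opening the output once and writing all rows inside the block also avoids reopening the file for every row.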
For your last question:
whatever = []
whatever = temp[1] + " " + temp[2]
profile[x] = whatever
OR
profile[x] = temp[1] + " " + temp[2]
To avoid a nil error when temp[2] is missing:
if temp[2].nil?
profile[x] = temp[1]
else
profile[x] = temp[1] + " " + temp[2]
end
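A shorter way to express the same nil check, as a minimal sketch with made-up sample values for temp, is to drop nil entries before joining:

```ruby
temp = ["id", "first", nil] # sample data; temp[2] is missing here
profile = []

# compact removes nil entries, so join never sees a nil
profile[0] = [temp[1], temp[2]].compact.join(" ")

temp = ["id", "first", "second"]
profile[1] = [temp[1], temp[2]].compact.join(" ")
```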