I have a program using the spreadsheet gem to create a CSV file, and I have not been able to find a way to configure the output the way I need.
This is what I would like the gem to do: the model number and additional_image fields should be "in sync", that is, each additional image written to the spreadsheet doc should go on a new row rather than being wrapped into a single cell.
Here are some snippets of the desired output in contrast with the current output. These fields come from XPath objects that are screen-scraped with another gem. The program won't know in advance how many objects it will encounter in the additional image field, but due to business logic the number of objects in the additional image field should mirror the number of model number objects written to the spreadsheet.
model
168868837a
168868837a
168868837a
168868837a
168868837a
168868837a
additional_image
1688688371.jpg
1688688372.jpg
1688688373.jpg
1688688374.jpg
1688688375.jpg
1688688376.jpg
This is the current code:
require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"
LOCAL_DIR = 'data-hold/images'
FileUtils.makedirs(LOCAL_DIR) unless File.exists?LOCAL_DIR
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
include Capybara::DSL
def initialize
#excel = Spreadsheet::Workbook.new
#work_list = #excel.create_worksheet
#row = 0
end
def go
visit_main_link
end
def retryable(options = {}, &block)
opts = { :tries => 1, :on => Exception }.merge(options)
retry_exception, retries = opts[:on], opts[:tries]
begin
return yield
rescue retry_exception
retry if (retries -= 1) > 0
end
yield
end
def visit_main_link
retryable(:tries => 1, :on => OpenURI::HTTPError) do
visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
results = all("//h5/a[contains(#onclick, 'analyticsLog')]")
item = []
results.each do |a|
item << a[:href]
end
item.each do |link|
visit link
save_item
end
#excel.write "inventory.csv"
end
end
def save_item
data = all("//*[#id='content-wrapper']/div[2]/div/div")
data.each do |info|
#work_list[#row, 0] = info.find("//*[#id='productright']/div/div[1]/h1").text
price = info.first("//div[contains(#class, 'price font left')]")
#work_list[#row, 1] = (price.text.to_f * 1.33).round(2) if price
#work_list[#row, 2] = info.find("//*[#id='productright']/div/div[11]").text
#work_list[#row, 3] = info.find("//*[#id='tabcontent1']/div/div").text.strip
color = info.all("//dd[1]//select[contains(#name, 'options')]//*[#price='0']")
#work_list[#row, 4] = color.collect(&:text).join(', ')
size = info.all("//dd[2]//select[contains(#name, 'options')]//*[#price='0']")
#work_list[#row, 5] = size.collect(&:text).join(', ')
model = File.basename(info.find("//*[#id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
#work_list[#row, 6] = model.gsub!(/\D/, "")
#work_list[#row, 7] = File.basename(info.find("//*[#id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
additional_image = info.all("//*[#rel='lightbox[rotation]']")
#work_list[#row, 8] = additional_image.map { |link| File.basename(link['href']) }.join(', ')
images = imagelink.map { |link| link['href'] }
images.each do |image|
File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|
f.write(open(image).read)
end
end
#row = #row + 1
end
end
end
tomtop = Tomtop.new
tomtop.go
I would like this to do two things that I'm not sure how to do:
Each additional image should print to a new line (currently they all print into one cell).
The model field should be duplicated exactly as many times as there are additional_images, in the same one-row-per-image manner.
Use the CSV gem. I took the long way of writing this so you can see how it works.
require 'csv'

DOC = "file.csv"

profile = []
profile[0] = "model"
CSV.open(DOC, "a") do |me|
  me << profile
end

img_url = ['pic_1.jpg', 'pic_2.jpg', 'pic_3.jpg', 'pic_4.jpg', 'pic_5.jpg', 'pic_6.jpg']

a = 0
b = img_url.length
while a < b
  profile = []
  profile[0] = img_url[a]
  CSV.open(DOC, "a") do |me|
    me << profile
  end
  a += 1
end
The CSV file should look like this:
model
pic_1.jpg
pic_2.jpg
pic_3.jpg
pic_4.jpg
pic_5.jpg
pic_6.jpg
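To get the model column and the images in sync (your other requirement), you can write the model value on every image row. A minimal sketch, assuming model and img_url hold the values you scraped (both are hypothetical stand-ins here):

require 'csv'

DOC = "file.csv"
model = "168868837a"                               # hypothetical scraped model number
img_url = ['pic_1.jpg', 'pic_2.jpg', 'pic_3.jpg']  # hypothetical scraped image names

CSV.open(DOC, "a") do |me|
  me << ["model", "additional_image"]
  img_url.each do |img|
    me << [model, img]  # model repeats once per image, keeping both columns in sync
  end
end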
For your last question:

whatever = temp[1] + " " + temp[2]
profile[x] = whatever

OR

profile[x] = temp[1] + " " + temp[2]

To avoid a NIL error in the array:

if temp[2].nil?
  profile[x] = temp[1]
else
  profile[x] = temp[1] + " " + temp[2]
end
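A more compact way to guard against the nil, using the same hypothetical temp and profile variables:

profile[x] = [temp[1], temp[2]].compact.join(" ")  # compact drops the nil before joining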
Related
The CSV file gets created when I scrape, but nothing displays in the terminal. I want to display each row on the screen as well.
def get_aspley_data
  url = "https://www.domain.com.au/rent/aspley-qld-4034/?price=0-900"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  house_listings = parsed_page.css('.listing-result__details')
  house_listings.each do |hl|
    prop_type = hl.css('.listing-result__property-type')[0]
    price = hl.css('.listing-result__price')[0]
    suburb_address = hl.css('span[itemprop=streetAddress]')[0]
    house_array = [house_listings]
    house_array.push("#{prop_type} #{price}")
    aspley_dis = CSV.open($aspley_file, "ab", { :col_sep => "|" }) do |csv|
      csv << [prop_type, price, suburb_address]
    end
  end
end
Try the one below. It prints each row to the terminal with puts as it is scraped, then writes all of the collected rows to the file at the end:

def get_aspley_data
  url = "https://www.domain.com.au/rent/aspley-qld-4034/?price=0-900"
  unparsed_page = HTTParty.get(url)
  parsed_page = Nokogiri::HTML(unparsed_page)
  house_listings_data = []
  house_listings = parsed_page.css('.listing-result__details')
  house_listings.each do |hl|
    prop_type = hl.css('.listing-result__property-type')[0]
    price = hl.css('.listing-result__price')[0]
    suburb_address = hl.css('span[itemprop=streetAddress]')[0]
    house_listings_data << [prop_type, price, suburb_address]
    # Echo the row to the screen as it is scraped
    puts [prop_type, price, suburb_address].to_csv(col_sep: "|")
  end
  File.open($aspley_file, "ab") do |f|
    data = house_listings_data.map { |d| d.to_csv(col_sep: "|") }.join
    f.write(data)
  end
end
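One note on the snippet above: Array#to_csv only exists once the csv standard library has been loaded, so make sure the script has this at the top if it isn't there already:

require 'csv'  # provides Array#to_csv and CSV.open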
Since Shopify's default products importer (via CSV) is really slow, I'm using multithreading to add ~24000 products to a Shopify store using the API. The API has a call limit of 2 per second. With 4 threads the calls are within the limit.
But after a while all threads stop working except one. I don't get any error messages, the code keeps running but doesn't print any product information. I can't seem to figure out what's going wrong.
Here's the code I'm using:
require 'shopify_api'
require 'open-uri'
require 'json'
require 'base64'

begin_time = Time.now
my_threads = []

shop_url = "https://<API_KEY>:<PASSWORD>@<SHOPNAME>.myshopify.com/admin"
ShopifyAPI::Base.site = shop_url

raw_product_data = JSON.parse(open('<REDACTED>') { |f| f.read }.force_encoding('UTF-8'))

# Split raw product data into four roughly equal slices
one, two, three, four = raw_product_data.each_slice((raw_product_data.size / 4.0).round).to_a

def category_to_tag(input)
  <REDACTED>
end

def bazookah(array, number)
  array.each do |item|
    single_product_begin_time = Time.now

    # Store item data in variables
    vendor = item['brand'].nil? ? 'Overige' : item['brand']
    title = item['description']
    item_size = item['salesUnitSize']
    body = "#{vendor} - #{title} - #{item_size}"
    type = item['category'].nil? ? 'Overige' : item['category']
    tags = category_to_tag(item['category']) unless item['category'].nil?
    variant_sku = item['itemId']
    variant_price = item['basePrice']['price']
    if !item['images'].nil? && !item['images'][2].nil?
      image_src = item['images'][2]['url']
    end

    image_time_begin = Time.now
    image = Base64.encode64(open(image_src) { |io| io.read }) unless image_src.nil?
    image_time_end = Time.now
    total_image_time = image_time_end - image_time_begin

    # Create new product
    new_product = ShopifyAPI::Product.new
    new_product.title = title
    new_product.body_html = body
    new_product.product_type = type
    new_product.vendor = vendor
    new_product.tags = item['category'].nil? ? 'Overige' : tags
    new_product.variants = [ <REDACTED> ]
    new_product.images = [ <REDACTED> ]
    new_product.save

    creation_time = Time.now - single_product_begin_time
    puts "#{number}: #{variant_sku} - P: #{creation_time.round(2)} - I: #{image_src.nil? ? 'No image' : total_image_time.round(3)}"
  end
end

puts '====================================================================================='
puts "#{raw_product_data.size} products loaded. Starting import at #{begin_time}..."
puts '-------------------------------------------------------------------------------------'

my_threads << Thread.new { bazookah(one, 'one') }
my_threads << Thread.new { bazookah(two, 'two') }
my_threads << Thread.new { bazookah(three, 'three') }
my_threads << Thread.new { bazookah(four, 'four') }
my_threads.each { |thr| thr.join }

puts '-------------------------------------------------------------------------------------'
puts "Done. It took #{Time.now - begin_time} seconds."
puts '====================================================================================='
What could I try to solve this?
It most likely has something to do with this:
http://docs.shopify.com/api/introduction/api-call-limit
I'd suspect that you are being rate limited by Shopify. You are trying to add 24000 records, via the API, from a single IP address. Most people don't like that kind of thing.
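If it is the call limit, wrapping the save in a backoff-and-retry helper may keep all four threads alive instead of letting them die silently. This is only a sketch: the exact exception class and response shape depend on your shopify_api/ActiveResource versions, so treat those details as assumptions:

# Hypothetical helper: back off when Shopify answers HTTP 429 (throttled).
def save_with_backoff(product, max_attempts = 5)
  attempts = 0
  begin
    product.save
  rescue ActiveResource::ClientError => e  # assumed class for 4xx responses
    attempts += 1
    raise if attempts >= max_attempts || e.response.code.to_i != 429
    sleep 2 ** attempts  # exponential backoff before retrying
    retry
  end
end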
I have two CSV files with some common headers and others that only appear in one or in the other, for example:
# csv_1.csv
H1,H2,H3
V11,V22,V33
V14,V25,V35
# csv_2.csv
H1,H4
V1a,V4b
V1c,V4d
I would like to merge both and obtain a new CSV file that combines all the information from the previous CSV files, injecting new columns when needed and filling the new cells with null values.
Result example:
H1,H2,H3,H4
V11,V22,V33,
V14,V25,V35,
V1a,,,V4b
V1c,,,V4d
Challenge accepted :)
#!/usr/bin/env ruby

require "csv"

module MergeCsv
  class << self
    def run(csv_paths)
      csv_files = csv_paths.map { |p| CSV.read(p, headers: true) }
      merge(csv_files)
    end

    private

    def merge(csv_files)
      headers = csv_files.flat_map(&:headers).uniq.sort
      hash_array = csv_files.flat_map(&method(:csv_to_hash_array))
      CSV.generate do |merged_csv|
        merged_csv << headers
        hash_array.each do |row|
          merged_csv << row.values_at(*headers)
        end
      end
    end

    # Probably not the most performant way, but easy
    def csv_to_hash_array(csv)
      csv.to_a[1..-1].map { |row| csv.headers.zip(row).to_h }
    end
  end
end

if ARGV.length == 0
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
  exit 1
end

puts MergeCsv.run(ARGV)
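Run it from the console with the files to merge; since run receives the whole ARGV array, it accepts two or more files:

$ ruby merge_csv.rb csv_1.csv csv_2.csv csv_3.csv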
I have the answer; I just wanted to help people who are looking for the same solution.
require "csv"
module MergeCsv
def self.run(csv_1_path, csv_2_path)
merge(File.read(csv_1_path), File.read(csv_2_path))
end
def self.merge(csv_1, csv_2)
csv_1_table = CSV.parse(csv_1, :headers => true)
csv_2_table = CSV.parse(csv_2, :headers => true)
return csv_2_table.to_csv if csv_1_table.headers.empty?
return csv_1_table.to_csv if csv_2_table.headers.empty?
headers_in_1_not_in_2 = csv_1_table.headers - csv_2_table.headers
headers_in_1_not_in_2.each do |header_in_1_not_in_2|
csv_2_table[header_in_1_not_in_2] = nil
end
headers_in_2_not_in_1 = csv_2_table.headers - csv_1_table.headers
headers_in_2_not_in_1.each do |header_in_2_not_in_1|
csv_1_table[header_in_2_not_in_1] = nil
end
csv_2_table.each do |csv_2_row|
csv_1_table << csv_1_table.headers.map { |csv_1_header| csv_2_row[csv_1_header] }
end
csv_1_table.to_csv
end
end
if(ARGV.length != 2)
puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
exit 1
end
puts MergeCsv.run(ARGV[0], ARGV[1])
And execute it from the console this way:
$ ruby merge_csv.rb csv_1.csv csv_2.csv
Any other, maybe cleaner, solution is welcome.
Simplified version of the first answer:
How to use it:
listPart_A = CSV.read(csv_path_A, headers: true)
listPart_B = CSV.read(csv_path_B, headers: true)
listPart_C = CSV.read(csv_path_C, headers: true)
list = merge(listPart_A, listPart_B, listPart_C)
Function:
def merge(*csvs)
  headers = csvs.map { |csv| csv.headers }.flatten.compact.uniq.sort
  csvs.flat_map(&method(:csv_to_hash_array)).map do |row|
    # Make sure every merged row exposes the full set of headers
    headers.each { |h| row[h] ||= nil }
    row
  end
end

def csv_to_hash_array(csv)
  csv.to_a[1..-1].map do |row|
    Hash[csv.headers.zip(row)]
  end
end
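Note that merge here returns an array of hashes rather than CSV text. If you need a file at the end, a short follow-up sketch (the merged.csv name is just an example) could be:

require 'csv'

headers = list.flat_map(&:keys).uniq.sort
CSV.open("merged.csv", "w") do |csv|
  csv << headers
  list.each { |row| csv << row.values_at(*headers) }
end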
I had to do something very similar: merge n CSV files that might share some of their columns but not others. If you want to keep the structure and do it easily, I think the best way is to convert to a hash and then re-convert to a CSV file. My solution:
#!/usr/bin/env ruby

require "csv"

def join_multiple_csv(csv_path_array)
  return nil if csv_path_array.nil? || csv_path_array.empty?

  f = CSV.parse(File.read(csv_path_array[0]), :headers => true)
  f_h = {}
  f.headers.each { |header| f_h[header] = f[header] }
  n_rows = f.size

  csv_path_array.shift(1)
  csv_path_array.each do |csv_file|
    curr_csv = CSV.parse(File.read(csv_file), :headers => true)
    curr_h = {}
    curr_csv.headers.each { |header| curr_h[header] = curr_csv[header] }

    new_headers = curr_csv.headers - f_h.keys
    exist_headers = curr_csv.headers - new_headers
    # Pad brand-new columns with nils for the rows read so far
    new_headers.each do |new_header|
      f_h[new_header] = Array.new(n_rows) + curr_csv[new_header]
    end
    exist_headers.each do |exist_header|
      f_h[exist_header] = f_h[exist_header] + curr_csv[exist_header]
    end
    n_rows = n_rows + curr_csv.size
  end

  csv_string = CSV.generate do |csv|
    csv << f_h.keys
    (0..n_rows - 1).each do |i|
      row = []
      f_h.each_key do |header|
        row << f_h[header][i]
      end
      csv << row
    end
  end
  csv_string
end

if ARGV.length < 2
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2> .. <file_path_csv_n>"
  exit 1
end

csv_str = join_multiple_csv(ARGV)
File.open("results.csv", "w") { |f| f.write(csv_str) }
puts "CSV merge is done"
I've been trying to use Jekyll categories in a hierarchical fashion, i.e.
A: ['class', 'topic', 'foo']
AA: ['class', 'topic', 'foo', 'bar']
AB: ['class', 'topic', 'foo', 'baz']
AAA: ['class', 'topic', 'foo', 'bar', 'qux']
I'm trying to create a listing of all immediate subdirectories programmatically. That is, on a page with categories (A), I wish to be able to list the posts with categories (AA) and (AB), but not (AAA). Is this possible with Jekyll's vanilla structure, or should I consider using a plugin?
You need to use a plugin.
I've managed to make most of what you describe happen, but I'm not using categories at all, just representing a directory tree.
require 'digest/md5'
require 'open-uri'

module Jekyll
  # Add accessor for directory
  class Page
    attr_reader :dir
  end

  class NavTree < Liquid::Tag
    def render(context)
      site = context.registers[:site]
      @page_url = context.environments.first["page"]["url"]
      @folder_weights = site.data['folder_weights']
      @folder_icons = site.data['folder_icons']["icons"]
      @nodes = {}
      tree = {}
      sorted_tree = {}

      site.pages.each do |page|
        # Exclude all pages that are hidden in front-matter
        if page.data["navigation"]["show"] != false
          path = page.url
          path = path.index('/') == 0 ? path[1..-1] : path
          @nodes[path] = page.data
        end
      end

      # Let's sort the pages by weight
      array = []
      @nodes.each do |path, data|
        array.push(:path => path, :weight => data["weight"], :title => data["title"])
      end
      sorted_nodes = array.sort_by { |h| [-(h[:weight] || 0), h[:path]] }

      sorted_nodes.each do |node|
        current = tree
        node[:path].split("/").inject("") do |sub_path, dir|
          sub_path = File.join(sub_path, dir)
          current[sub_path] ||= {}
          current = current[sub_path]
          sub_path
        end
      end

      tree.each do |base, subtree|
        folder_weight = @folder_weights[base] ? @folder_weights[base] : 0
        tree[base] = { "weight" => folder_weight, "subtree" => subtree }
      end

      tree_array = []
      tree.each do |key, value|
        tree_array.push(:base => key, :weight => value["weight"], :subtree => value["subtree"])
      end
      sorted_tree = tree_array.sort_by { |node| [-(node[:weight]), node[:base]] }

      puts "generating nav tree for #{@page_url}"
      files_first_traverse "", sorted_tree, 0
    end

    def files_first_traverse(prefix, nodes = [], depth = 0)
      output = ""
      if depth == 0
        id = 'id="nav-menu"'
      end
      output += "#{prefix}<ul #{id} class=\"nav nav-list\">"

      # First pass: emit the leaf pages at this level
      nodes.each do |node|
        base = node[:base]
        subtree = node[:subtree]
        name = base[1..-1]
        if name.index('.') != nil
          icon_name = @nodes[name]["icon"]
          name = @nodes[name]["title"]
        end
        li_class = ""
        if base == @page_url
          li_class = "active"
          if icon_name
            icon_name = icon_name + " icon-white"
          end
        end
        icon_html = "<span class=\"#{icon_name}\"></span>" unless icon_name.nil?
        output += "#{prefix}<li class=#{li_class}>#{icon_html}#{name}</li>" if subtree.empty?
      end

      # Second pass: recurse into the folders
      nodes.each do |node|
        base = node[:base]
        subtree = node[:subtree]
        next if subtree.empty?

        href = base
        name = base[1..-1]
        if name.index('.') != nil
          is_parent = false
          name = @nodes[name]["title"]
        else
          is_parent = true
          href = base + '/index.html'
          if name.index('/')
            name = name[name.rindex('/') + 1..-1]
          end
        end
        name.gsub!(/_/, ' ')

        li_class = ""
        if @page_url.index(base)
          list_class = "collapsibleListOpen"
        else
          list_class = "collapsibleListClosed"
        end
        if href == @page_url
          li_class = "active"
        end

        if is_parent
          id = Digest::MD5.hexdigest(base)
          icon_name = @folder_icons[base]
          icon_html = icon_name.nil? ? "" : "<span class=\"#{icon_name}\"></span>"
          li = "<li id=\"node-#{id}\" class=\"parent #{list_class}\"><div class=\"subtree-name\">#{icon_html}#{name}</div>"
        else
          icon_name = @nodes[name]["icon"]
          if icon_name && li_class == "active"
            icon_name = icon_name + " icon-white"
          end
          icon_html = icon_name.nil? ? "" : "<i class=\"#{icon_name}\"></i>"
          li = "<li class=\"#{li_class}\">#{icon_html}#{name}</li>"
        end
        output += "#{prefix} #{li}"

        subtree_array = []
        subtree.each do |sub_base, sub_subtree|
          subtree_array.push(:base => sub_base, :subtree => sub_subtree)
        end
        depth = depth + 1
        output += files_first_traverse(prefix + ' ', subtree_array, depth)

        if is_parent
          output += "</li>"
        end
      end

      output += "#{prefix} </ul>"
      output
    end
  end
end

Liquid::Template.register_tag("navigation", Jekyll::NavTree)
It is ugly code, but it's a start. You can see it in action on https://sendgrid.com/docs
The following program does almost everything I want it to, but it won't write the scraped image files to disk. The latest error is "No such file or directory" for the basename of one of the image files: Error: No such file or directory - h3130gy1-3-7ec5.jpg. It should be writing the new file, so I guess I'm doing something wrong. Ideally this program would write each image to disk, naming each file with the basename of the absolute URL it was fetched from. I would also like the spreadsheet element to write the basename of each scraped image to the output file being compiled.
require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"
LOCAL_DIR = 'data-hold/images'
FileUtils.makedirs(LOCAL_DIR) unless File.exists?LOCAL_DIR
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
include Capybara::DSL
def initialize
#excel = Spreadsheet::Workbook.new
#work_list = #excel.create_worksheet
#row = 0
end
def go
visit_main_link
end
def visit_main_link
visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
results = all("//h5/a[contains(#onclick, 'analyticsLog')]")
item = []
results.each do |a|
item << a[:href]
end
item.each do |link|
visit link
save_item
end
#excel.write "inventory.csv"
end
def save_item
data = all("//*[#id='content-wrapper']/div[2]/div/div")
data.each do |info|
#work_list[#row, 0] = info.find("//*[#id='productright']/div/div[1]/h1").text
price = info.first("//div[contains(#class, 'price font left')]")
#work_list[#row, 1] = (price.text.to_f * 1.33).round(2) if price
#work_list[#row, 2] = info.find("//*[#id='productright']/div/div[11]").text
#work_list[#row, 3] = info.find("//*[#id='tabcontent1']/div/div").text.strip
color = info.all("//dd[1]//select[contains(#name, 'options')]//*[#price='0']")
#work_list[#row, 4] = color.collect(&:text).join(', ')
size = info.all("//dd[2]//select[contains(#name, 'options')]//*[#price='0']")
#work_list[#row, 5] = size.collect(&:text).join(', ')
imagelink = info.all("//*[#rel='lightbox[rotation]']")
#work_list[#row, 6] = imagelink.map { |link| link['href'] }.join(', ')
image = imagelink.map { |link| link['href'] }
File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|
f.write(open(image).read)
end
#row = #row + 1
end
end
end
tomtop = Tomtop.new
tomtop.go
It appears you have a misplaced parenthesis. This line:
File.open (File.basename("#{LOCAL_DIR}/#{image}", 'w')) do |f|
Should be this:
File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|
But actually, on further investigation of your code, it appears that File.basename is acting on the wrong string here. After getting your code to run, it filled the root folder of scraper.rb with images. So what I think you really want for that line is this:

# Only grab the basename of the image, then append it to LOCAL_DIR:
filename = "#{LOCAL_DIR}/#{File.basename(image)}"
File.open(filename, 'w') do |f|
After running this, I got to the next problem: 'image' is an array containing many URLs.
Depending on what you are trying to achieve, you may need some additional filtering to get it down to a single image, or rename it to 'images' and use the following code:
images = imagelink.map { |link| link['href'] }
images.each do |image|
  File.open("#{LOCAL_DIR}/#{File.basename(image)}", 'w') do |f|
    f.write(open(image).read)
  end
end
@row = @row + 1
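One extra caveat of my own, not part of the original answer: image data should really be written in binary mode ('wb' instead of 'w'), otherwise the files can be corrupted on platforms that translate line endings:

images.each do |image|
  File.open("#{LOCAL_DIR}/#{File.basename(image)}", 'wb') do |f|  # 'wb' = binary-safe write
    f.write(open(image).read)
  end
end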