How to make my REXML/Nokogiri script run faster - Ruby

I have this Ruby script that collects 46,344 XML links and then reads 16 element nodes from every XML file. The last step of the process stores the data in a CSV file. The problem is that it takes too long: more than 1-2 hours.
Below is the script, without the URL of the page that lists all the XML links. I can't provide the link because it is company data; I hope that's OK.
The script works, but it takes too long:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
@typeOfResponsibleBody = Array.new
@courseTyp = Array.new
@credits = Array.new
@degree = Array.new
@preAcademic = Array.new
@subjectCodeVhs = Array.new
@descriptionSv = Array.new
@visibleToSweApplicants = Array.new
@lastedited = Array.new
@expires = Array.new
# Fetch the page that lists all the XML links
htmldoc = Nokogiri::HTML(open('A SITE THAT HAVE ALL THE LINKS'))
# Collect the links to the XML files and store them in the @urls array
htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end
@urls.each do |url|
  # Loop through the XML files and grab the element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Get the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/ns:educationInfo/ns:titles/ns:title[1]"){
    |e| @titleSv << e.text
  }
  # TitleEn
  xmldoc.elements.each("/ns:educationInfo/ns:titles/ns:title[2]"){
    |e| @titleEn << e.text
  }
  # Identifier
  xmldoc.elements.each("/ns:educationInfo/ns:identifier"){
    |e| @identifier << e.text
  }
  # typeOfLevel
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:typeOfLevel"){
    |e| @typeOfLevel << e.text
  }
  # typeOfResponsibleBody
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:typeOfResponsibleBody"){
    |e| @typeOfResponsibleBody << e.text
  }
  # courseTyp
  xmldoc.elements.each("/ns:educationInfo/ns:educationLevelDetails/ns:academic/ns:courseOfferingPackage/ns:type"){
    |e| @courseTyp << e.text
  }
  # credits
  xmldoc.elements.each("/ns:educationInfo/ns:credits/ns:exact"){
    |e| @credits << e.text
  }
  # degree
  xmldoc.elements.each("/ns:educationInfo/ns:degrees/ns:degree"){
    |e| @degree << e.text
  }
  # preAcademic
  xmldoc.elements.each("/ns:educationInfo/ns:prerequisites/ns:academic"){
    |e| @preAcademic << e.text
  }
  # subjectCodeVhs
  xmldoc.elements.each("/ns:educationInfo/ns:subjects/ns:subject/ns:code"){
    |e| @subjectCodeVhs << e.text
  }
  # DescriptionSv
  xmldoc.elements.each("/educationInfo/descriptions/ct:description/ct:text"){
    |e| @descriptionSv << e.text
  }
  # Get the document's expiry date
  @expires << root.attributes["expires"]
  # Get the document's last-edited date
  @lastedited << root.attributes["lastEdited"]
  # Store them in the CSV file
  CSV.open("eduction_normal.csv", "wb") do |row|
    (0..@ID.length - 1).each do |index|
      row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
    end
  end
end

If the bottleneck is network access, you could fetch the files in several threads and/or switch to JRuby, which can use all the cores of your processor. If you have to run this often, you will have to work out a read/write strategy that serves you best without blocking.
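
A minimal sketch of the threading idea, assuming the time is mostly spent waiting on HTTP responses. `fetch_all` and its `workers` parameter are hypothetical names, not part of the original script; the block passed in stands in for the real download-and-parse step (for example `URI.open(url).read`), so the sketch runs without network access.

```ruby
# Sketch: process many URLs with a small pool of worker threads.
# `fetch_all` and `workers` are illustrative names (assumptions).
def fetch_all(urls, workers: 8, &fetch)
  queue = Queue.new
  urls.each_with_index { |url, index| queue << [url, index] }
  results = Array.new(urls.length)

  threads = Array.new(workers) do
    Thread.new do
      loop do
        begin
          # non-blocking pop raises ThreadError once the queue is empty
          url, index = queue.pop(true)
        rescue ThreadError
          break
        end
        # each job writes to its own slot, so input order is preserved
        results[index] = fetch.call(url)
      end
    end
  end
  threads.each(&:join)
  results
end

# Stand-in for the real download step, so the example needs no network.
pages = fetch_all(%w[a.xml b.xml c.xml], workers: 2) { |url| "body of #{url}" }
p pages
```

Collecting the results first and writing the CSV once at the end also avoids reopening the output file for every URL.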

Related

For loop and if in puts function - Ruby

I am trying to use a for loop and an if condition while creating a file with File.open and puts. My code is
I want to write these entries only if they are not null. How can I do that?
Edit: Full code is
require 'fileutils'
require 'json'
require 'open-uri'
require 'pp'
data = JSON.parse('data')
array = data
if array&.any?
  drafts_dir = File.expand_path('../drafts', __dir__)
  FileUtils.mkdir_p(drafts_dir)
  array.each do |entry|
    File.open(File.join(drafts_dir, "#{entry['twitter']}.md"), 'wb') do |draft|
      keys = 1.upto(6).map { |i| "key_#{i}" }
      values = keys.map { |k| "<img src='#{entry['image']}' alt='image'>" if entry['image']}
      # you can also do values = entry.values_at(*keys)
      str = values.reject do |val|
        val.nil? || val.length == 0
      end.join("\n")
      draft.puts str
    end
  end
end
I need the file `mark.md` to contain
<img src='https://somesite.com/image.png' alt='image'>
<a href='https://twitter.com/mark'>mark</a>
and `kevin.md` likewise.
You can build the string from an array, rejecting the null values:
keys = 1.upto(6).map { |i| "key_#{i}" }
values = keys.map { |k| entry[k] }
# you can also do values = entry.values_at(*keys)
str = values.reject do |val|
  val.nil? || val.length == 0
end.join("\n")
draft.puts str
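
As a self-contained illustration of the snippet above, here is a small sketch. The helper name `draft_lines` and the sample `entry` hash are made up for the example; only the reject-and-join technique comes from the answer.

```ruby
# Sketch: keep only the keys that have a non-empty value, then join them.
# `draft_lines` and the sample data are illustrative (assumptions).
def draft_lines(entry, keys)
  entry.values_at(*keys)                             # nil for missing keys
       .reject { |val| val.nil? || val.to_s.empty? } # drop nil and ""
       .join("\n")
end

entry = { 'key_1' => 'first', 'key_2' => nil, 'key_3' => '', 'key_4' => 'fourth' }
keys = 1.upto(6).map { |i| "key_#{i}" }
puts draft_lines(entry, keys)
```

With this sample data only `first` and `fourth` are written, each on its own line.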
Update in response to your changed question. Do this:
array.each do |entry|
  # skip incomplete entries before the file is created
  next unless ['image', 'twitter'].all? { |k| entry[k].to_s.length > 1 }
  File.open(File.join(drafts_dir, "#{entry['twitter']}.md"), 'wb') do |draft|
    str = [
      "<img src='#{entry['image']}' alt='image'>",
      "<a href='https://twitter.com/#{entry['twitter']}'>#{entry['twitter']}</a>"
    ].join("\n")
    draft.puts str
  end
end
Assuming your entry is a hash:
final_string = ''
entry.each_value { |value| final_string << "#{value}\n" }
puts final_string

How to call hash values outside class from defined hash map inside class methods?

Read a CSV file and dynamically construct a new class named after the file. So if the CSV is persons.csv, the Ruby class should be person; if it's places.csv, the Ruby class should be places.
Also create methods for reading and displaying each value in the CSV file; the values in the first row of the CSV file act as the method names.
Construct an array of objects and associate each object with a row of the CSV file. For example, the content of the CSV file could be
name,age,city
abd,45,TUY
kjh,65,HJK
Previous code:
require 'csv'
class Feed
  def initialize(source_name, column_names = [])
    if column_names.empty?
      column_names = CSV.open(source_name, 'r', &:first)
    end
    columns = column_names.reduce({}) { |columns, col_name| columns[col_name] = []; columns }
    define_singleton_method(:columns) { column_names }
    column_names.each do |col_name|
      define_singleton_method(col_name.to_sym) { columns[col_name] }
    end
    CSV.foreach(source_name, headers: true) do |row|
      column_names.each do |col_name|
        columns[col_name] << row[col_name]
      end
    end
  end
end
feed = Feed.new('input.csv')
puts feed.columns #["name", "age", "city"]
puts feed.name # ["abd", "kjh"]
puts feed.age # ["45", "65"]
puts feed.city # ["TUY", "HJK"]
I am trying to refine this solution by using class methods and splitting the code into smaller methods. When I call the values outside the class by their key names, I get errors like "undefined method `age' for Feed:Class". Is there a way to access the values outside the class?
My solution looks like this:
require 'csv'
class Feed
  attr_accessor :column_names
  def self.col_name(source_name, column_names = [])
    if column_names.empty?
      @column_names = CSV.open(source_name, :headers => true)
    end
    columns = @column_names.reduce({}) { |columns, col_name| columns[col_name] = []; columns }
  end
  def self.get_rows(source_name)
    col_name(source_name, column_names = [])
    define_singleton_method(:columns) { column_names }
    column_names.each do |col_name|
      define_singleton_method(col_name.to_sym) { columns[col_name] }
    end
    CSV.foreach(source_name, headers: true) do |row|
      @column_names.each do |col_name|
        columns[col_name] << row[col_name]
      end
    end
  end
end
obj = Feed.new
Feed.get_rows('Input.csv')
puts obj.class.columns
puts obj.class.name
puts obj.class.age
puts obj.class.city
Expected Result -
input = Input.new
p input.name # ["abd", "kjh"]
p input.age # ["45", "65"]
input.name ='XYZ' # Value must be appended to array
input.age = 25
p input.name # ["abd", "kjh", "XYZ"]
p input.age # ["45", "65", "25"]
Let's create the CSV file.
str =<<END
name,age,city
abd,45,TUY
kjh,65,HJK
END
FName = 'temp/persons.csv'
File.write(FName, str)
#=> 36
Now let's create a class:
klass = Class.new
#=> #<Class:0x000057d0519de8a0>
and name it:
class_name = File.basename(FName, ".csv").capitalize
#=> "Persons"
Object.const_set(class_name, klass)
#=> Persons
Persons.class
#=> Class
See File::basename, String#capitalize and Module#const_set.
Next read the CSV file with headers into a CSV::Table object:
require 'csv'
csv = CSV.read(FName, headers: true)
#=> #<CSV::Table mode:col_or_row row_count:3>
csv.class
#=> CSV::Table
See CSV::read. We may now create the methods name, age and city.
csv.headers.each { |header| klass.define_method(header) { csv[header] } }
See CSV::Table#headers, Module#define_method and CSV::Table#[].
We can now confirm they work as intended:
k = klass.new
k.name
#=> ["abd", "kjh"]
k.age
#=> ["45", "65"]
k.city
#=> ["TUY", "HJK"]
or
p = Persons.new
#=> #<Persons:0x0000598dc6b01640>
p.name
#=> ["abd", "kjh"]
and so on.
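
The question also asks for writers such as `input.name = 'XYZ'` that append to the column. A possible sketch, under the assumption that each instance should hold its own copy of the columns: `build_column_class` is a hypothetical helper, and the CSV text is passed as a string so the example runs without a file on disk.

```ruby
require 'csv'

# Sketch: build an anonymous class whose readers return a column's values
# and whose writers append to that column.
# `build_column_class` is an illustrative name (assumption).
def build_column_class(csv_text)
  table = CSV.parse(csv_text, headers: true)
  Class.new do
    define_method(:initialize) do
      # every instance starts from its own copy of the columns
      @columns = table.headers.to_h { |header| [header, table[header].dup] }
    end
    table.headers.each do |header|
      define_method(header) { @columns[header] }
      define_method("#{header}=") { |value| @columns[header] << value.to_s }
    end
  end
end

# The class could be named with Object.const_set, as shown earlier.
klass = build_column_class("name,age,city\nabd,45,TUY\nkjh,65,HJK\n")
input = klass.new
input.name = 'XYZ'
input.age = 25
p input.name
p input.age
```

Appended values are converted with `to_s` so they match the strings read from the CSV data.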

Ruby how to merge two CSV files with slightly different headers

I have two CSV files with some common headers and others that only appear in one or in the other, for example:
# csv_1.csv
H1,H2,H3
V11,V22,V33
V14,V25,V35
# csv_2.csv
H1,H4
V1a,V4b
V1c,V4d
I would like to merge both and obtain a new CSV file that combines all the information for the previous CSV files. Injecting new columns when needed, and feeding the new cells with null values.
Result example:
H1,H2,H3,H4
V11,V22,V33,
V14,V25,V35,
V1a,,,V4b
V1c,,,V4d
Challenge accepted :)
#!/usr/bin/env ruby
require "csv"
module MergeCsv
  class << self
    def run(csv_paths)
      csv_files = csv_paths.map { |p| CSV.read(p, headers: true) }
      merge(csv_files)
    end
    private
    def merge(csv_files)
      headers = csv_files.flat_map(&:headers).uniq.sort
      hash_array = csv_files.flat_map(&method(:csv_to_hash_array))
      CSV.generate do |merged_csv|
        merged_csv << headers
        hash_array.each do |row|
          merged_csv << row.values_at(*headers)
        end
      end
    end
    # Probably not the most performant way, but easy
    def csv_to_hash_array(csv)
      csv.to_a[1..-1].map { |row| csv.headers.zip(row).to_h }
    end
  end
end
if(ARGV.length == 0)
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
  exit 1
end
puts MergeCsv.run(ARGV)
I have the answer; I just wanted to help people who are looking for the same solution:
require "csv"
module MergeCsv
  def self.run(csv_1_path, csv_2_path)
    merge(File.read(csv_1_path), File.read(csv_2_path))
  end
  def self.merge(csv_1, csv_2)
    csv_1_table = CSV.parse(csv_1, :headers => true)
    csv_2_table = CSV.parse(csv_2, :headers => true)
    return csv_2_table.to_csv if csv_1_table.headers.empty?
    return csv_1_table.to_csv if csv_2_table.headers.empty?
    headers_in_1_not_in_2 = csv_1_table.headers - csv_2_table.headers
    headers_in_1_not_in_2.each do |header_in_1_not_in_2|
      csv_2_table[header_in_1_not_in_2] = nil
    end
    headers_in_2_not_in_1 = csv_2_table.headers - csv_1_table.headers
    headers_in_2_not_in_1.each do |header_in_2_not_in_1|
      csv_1_table[header_in_2_not_in_1] = nil
    end
    csv_2_table.each do |csv_2_row|
      csv_1_table << csv_1_table.headers.map { |csv_1_header| csv_2_row[csv_1_header] }
    end
    csv_1_table.to_csv
  end
end
if(ARGV.length != 2)
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2>"
  exit 1
end
puts MergeCsv.run(ARGV[0], ARGV[1])
And execute it from the console this way:
$ ruby merge_csv.rb csv_1.csv csv_2.csv
Any other, maybe cleaner, solution is welcome.
Simplified version of the first answer:
How to use it:
listPart_A = CSV.read(csv_path_A, headers: true)
listPart_B = CSV.read(csv_path_B, headers: true)
listPart_C = CSV.read(csv_path_C, headers: true)
list = merge(listPart_A, listPart_B, listPart_C)
Function:
def merge(*csvs)
  headers = csvs.map {|csv| csv.headers }.flatten.compact.uniq.sort
  csvs.flat_map(&method(:csv_to_hash_array))
end
def csv_to_hash_array(csv)
  csv.to_a[1..-1].map do |row|
    Hash[csv.headers.zip(row)]
  end
end
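
The simplified function above returns an array of hashes rather than CSV text. A compact sketch that goes all the way to a CSV string follows; `merge_tables` is an illustrative name, and unlike the answers above it keeps headers in first-seen order instead of sorting them. The inputs are parsed from strings so the example runs as-is.

```ruby
require 'csv'

# Sketch: merge any number of CSV::Table objects into one CSV string.
# `merge_tables` is an illustrative name (assumption).
def merge_tables(*tables)
  headers = tables.flat_map(&:headers).uniq
  CSV.generate do |out|
    out << headers
    tables.each do |table|
      # values_at yields nil for headers a row lacks, which CSV writes as empty
      table.each { |row| out << row.to_h.values_at(*headers) }
    end
  end
end

csv_1 = CSV.parse("H1,H2,H3\nV11,V22,V33\nV14,V25,V35\n", headers: true)
csv_2 = CSV.parse("H1,H4\nV1a,V4b\nV1c,V4d\n", headers: true)
merged = merge_tables(csv_1, csv_2)
puts merged
```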
I had to do something very similar: merge n CSV files that might share some columns but not others.
If you want to keep the structure and do it easily, I think the best way is to convert to a hash and then convert back to a CSV file.
My solution:
#!/usr/bin/env ruby
require "csv"
def join_multiple_csv(csv_path_array)
  return nil if csv_path_array.nil? or csv_path_array.empty?
  f = CSV.parse(File.read(csv_path_array[0]), :headers => true)
  f_h = {}
  f.headers.each {|header| f_h[header] = f[header]}
  n_rows = f.size
  csv_path_array.shift(1)
  csv_path_array.each do |csv_file|
    curr_csv = CSV.parse(File.read(csv_file), :headers => true)
    curr_h = {}
    curr_csv.headers.each {|header| curr_h[header] = curr_csv[header]}
    new_headers = curr_csv.headers - f_h.keys
    exist_headers = curr_csv.headers - new_headers
    new_headers.each { |new_header|
      f_h[new_header] = Array.new(n_rows) + curr_csv[new_header]
    }
    exist_headers.each {|exist_header|
      f_h[exist_header] = f_h[exist_header] + curr_csv[exist_header]
    }
    # pad columns missing from this file so later files stay aligned
    (f_h.keys - curr_csv.headers).each {|missing_header|
      f_h[missing_header] = f_h[missing_header] + Array.new(curr_csv.size)
    }
    n_rows = n_rows + curr_csv.size
  end
  csv_string = CSV.generate do |csv|
    csv << f_h.keys
    (0..n_rows-1).each do |i|
      row = []
      f_h.each_key do |header|
        row << f_h[header][i]
      end
      csv << row
    end
  end
  return csv_string
end
if(ARGV.length < 2)
  puts "Use: ruby merge_csv.rb <file_path_csv_1> <file_path_csv_2> .. <file_path_csv_n>"
  exit 1
end
csv_str = join_multiple_csv(ARGV)
# File.write closes the file for us
File.write("results.csv", csv_str)
puts "CSV merge is done"

How do I customize the spreadsheet gem/output?

I have a program using the spreadsheet gem to create a CSV file; I have not been able to find the way to configure the functionality that I need.
This is what I would like the gem to do: The model number and additional_image field should be "in sync", that is, each additional image written to the spreadsheet doc should be a new line and should not be wrapped.
Here are some snippets of the desired output in contrast with the current. These fields are defined by XPath objects that are screen scraped using another gem. The program won't know for sure how many objects it will encounter in the additional image field but due to business logic the number of objects in the additional image field should mirror the number of model number objects that are written to the spreadsheet.
model
168868837a
168868837a
168868837a
168868837a
168868837a
168868837a
additional_image
1688688371.jpg
1688688372.jpg
1688688373.jpg
1688688374.jpg
1688688375.jpg
1688688376.jpg
This is the current code:
require "capybara/dsl"
require "spreadsheet"
require "fileutils"
require "open-uri"
LOCAL_DIR = 'data-hold/images'
FileUtils.makedirs(LOCAL_DIR) unless File.exist?(LOCAL_DIR)
Capybara.run_server = false
Capybara.default_driver = :selenium
Capybara.default_selector = :xpath
Spreadsheet.client_encoding = 'UTF-8'
class Tomtop
  include Capybara::DSL
  def initialize
    @excel = Spreadsheet::Workbook.new
    @work_list = @excel.create_worksheet
    @row = 0
  end
  def go
    visit_main_link
  end
  def retryable(options = {}, &block)
    opts = { :tries => 1, :on => Exception }.merge(options)
    retry_exception, retries = opts[:on], opts[:tries]
    begin
      return yield
    rescue retry_exception
      retry if (retries -= 1) > 0
    end
    yield
  end
  def visit_main_link
    retryable(:tries => 1, :on => OpenURI::HTTPError) do
      visit "http://www.example.com/clothing-accessories?dir=asc&limit=72&order=position"
      results = all("//h5/a[contains(@onclick, 'analyticsLog')]")
      item = []
      results.each do |a|
        item << a[:href]
      end
      item.each do |link|
        visit link
        save_item
      end
      @excel.write "inventory.csv"
    end
  end
  def save_item
    data = all("//*[@id='content-wrapper']/div[2]/div/div")
    data.each do |info|
      @work_list[@row, 0] = info.find("//*[@id='productright']/div/div[1]/h1").text
      price = info.first("//div[contains(@class, 'price font left')]")
      @work_list[@row, 1] = (price.text.to_f * 1.33).round(2) if price
      @work_list[@row, 2] = info.find("//*[@id='productright']/div/div[11]").text
      @work_list[@row, 3] = info.find("//*[@id='tabcontent1']/div/div").text.strip
      color = info.all("//dd[1]//select[contains(@name, 'options')]//*[@price='0']")
      @work_list[@row, 4] = color.collect(&:text).join(', ')
      size = info.all("//dd[2]//select[contains(@name, 'options')]//*[@price='0']")
      @work_list[@row, 5] = size.collect(&:text).join(', ')
      model = File.basename(info.find("//*[@id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
      @work_list[@row, 6] = model.gsub!(/\D/, "")
      @work_list[@row, 7] = File.basename(info.find("//*[@id='content-wrapper']/div[2]/div/div/div[1]/div[1]/a")['href'])
      additional_image = info.all("//*[@rel='lightbox[rotation]']")
      @work_list[@row, 8] = additional_image.map { |link| File.basename(link['href']) }.join(', ')
      images = additional_image.map { |link| link['href'] }
      images.each do |image|
        File.open(File.basename("#{LOCAL_DIR}/#{image}"), 'w') do |f|
          f.write(open(image).read)
        end
      end
      @row = @row + 1
    end
  end
end
tomtop = Tomtop.new
tomtop.go
I would like this to do two things that I'm not sure how to do:
Each additional image should print to a new line (currently it prints all in one cell).
I would like the model field to be duplicated exactly as many times as there are additional_images in the same new line manner.
Use the CSV gem. I took the long way of writing this so you can see how it works.
require 'csv'
DOC = "file.csv"
profile = []
profile[0] = "model"
CSV.open(DOC, "a") do |me|
  me << profile
end
img_url = ['pic_1.jpg','pic_2.jpg','pic_3.jpg','pic_4.jpg','pic_5.jpg','pic_6.jpg']
a = 0
b = img_url.length
while a < b
  profile = []
  profile[0] = img_url[a]
  CSV.open(DOC, "a") do |me|
    me << profile
  end
  a += 1
end
The csv file should look like this
model
pic_1.jpg
pic_2.jpg
pic_3.jpg
pic_4.jpg
pic_5.jpg
pic_6.jpg
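
To get the model number repeated next to every additional image, as the question asks, the same idea can be extended to two columns. This is only a sketch: `image_rows` is a hypothetical helper, the sample values are taken from the desired output in the question, and the CSV is generated in memory rather than written to a file.

```ruby
require 'csv'

# Sketch: one CSV row per additional image, repeating the model number.
# `image_rows` is an illustrative name (assumption).
def image_rows(model, image_names)
  image_names.map { |name| [model, name] }
end

csv_text = CSV.generate do |csv|
  csv << %w[model additional_image]
  image_rows('168868837a', %w[1688688371.jpg 1688688372.jpg]).each do |row|
    csv << row
  end
end
puts csv_text
```

Because the rows are built from the image list itself, the number of model rows always matches the number of images.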
For your last question:
whatever = []
whatever = temp[1] + " " + temp[2]
profile[x] = whatever
or
profile[x] = temp[1] + " " + temp[2]
To avoid a nil error in the array:
if temp[2] == nil
  profile[x] = temp[1]
else
  profile[x] = temp[1] + " " + temp[2]
end

Convert Hashes to CSV

I have a CSV file in which I would like to save all my hash values. I am using the Nokogiri SAX parser to parse an XML document and then save it to a CSV file.
The SAX parser:
require 'rubygems'
require 'nokogiri'
require 'csv'
class MyDocument < Nokogiri::XML::SAX::Document
  HEADERS = [ :titles, :identifier, :typeOfLevel, :typeOfResponsibleBody,
              :type, :exact, :degree, :academic, :code, :text ]
  def initialize
    @infodata = {}
    @infodata[:titles] = Array.new([])
  end
  def start_element(name, attrs)
    @attrs = attrs
    @content = ''
  end
  def end_element(name)
    if name == 'title'
      Hash[@attrs]["xml:lang"]
      @infodata[:titles] << @content
      @content = nil
    end
    if name == 'identifier'
      @infodata[:identifier] = @content
      @content = nil
    end
    if name == 'typeOfLevel'
      @infodata[:typeOfLevel] = @content
      @content = nil
    end
    if name == 'typeOfResponsibleBody'
      @infodata[:typeOfResponsibleBody] = @content
      @content = nil
    end
    if name == 'type'
      @infodata[:type] = @content
      @content = nil
    end
    if name == 'exact'
      @infodata[:exact] = @content
      @content = nil
    end
    if name == 'degree'
      @infodata[:degree] = @content
      @content = nil
    end
    if name == 'academic'
      @infodata[:academic] = @content
      @content = nil
    end
    if name == 'code'
      Hash[@attrs]['source="vhs"']
      @infodata[:code] = @content
      @content = nil
    end
    if name == 'ct:text'
      @infodata[:beskrivning] = @content
      @content = nil
    end
  end
  def characters(string)
    @content << string if @content
  end
  def cdata_block(string)
    characters(string)
  end
  def end_document
    File.open("infodata.csv", "ab") do |f|
      csv = CSV.generate_line(HEADERS.map {|h| @infodata[h] })
      csv << "\n"
      f.write(csv)
    end
  end
end
Creating the parser and running it on every file stored in a folder (47,000 XML files):
parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
counter = 0
Dir.glob('/Users/macbookpro/Desktop/sax/info_xml/*.xml') do |item|
  parser.parse(File.open(item, 'rb'))
  counter += 1
  puts "Writing file nr: #{counter}"
end
The issue:
I don't get a new line for every new set of values. Any ideas?
Three XML files for trying the code:
https://gist.github.com/2378898
https://gist.github.com/2378901
https://gist.github.com/2378904
You need to open the file using "a" mode (opening a file with "w" clears any previous content).
Appending an array to the csv object will automatically insert newlines. Hash#values returns an array of the values, but it would be safer to force the order. Flattening the array will potentially lead to misaligned columns (e.g. [[:title1, :title2], 'other-value'] will result in [:title1, :title2, 'other-value']). Try something like this:
HEADERS = [:titles, :identifier, ...]
def end_document
  # with ruby 1.8.7
  File.open("infodata.csv", "ab") do |f|
    csv = CSV.generate_line(HEADERS.map { |h| @infodata[h] })
    csv << "\n"
    f.write(csv)
  end
  # with ruby 1.9.x
  CSV.open("infodata.csv", "ab") do |csv|
    csv << HEADERS.map { |h| @infodata[h] }
  end
end
The above change can be verified by executing the following:
require "csv"
class CsvAppender
  HEADERS = [ :titles, :identifier, :typeOfLevel, :typeOfResponsibleBody, :type,
              :exact, :degree, :academic, :code, :text ]
  def initialize
    @infodata = { :titles => ["t1", "t2"], :identifier => 0 }
  end
  def end_document
    @infodata[:identifier] += 1
    # with ruby 1.8.7
    File.open("infodata.csv", "ab") do |f|
      csv = CSV.generate_line(HEADERS.map { |h| @infodata[h] })
      csv << "\n"
      f.write(csv)
    end
    # with ruby 1.9.x
    #CSV.open("infodata.csv", "ab") do |csv|
    #  csv << HEADERS.map { |h| @infodata[h] }
    #end
  end
end
appender = CsvAppender.new
3.times do
  appender.end_document
end
File.read("infodata.csv").split("\n").each do |line|
  puts line
end
After running the above the infodata.csv file will contain:
"[""t1"", ""t2""]",1,,,,,,,,
"[""t1"", ""t2""]",2,,,,,,,,
"[""t1"", ""t2""]",3,,,,,,,,
I guess you need an extra loop. Something similar to
CSV.open("infodata.csv", "wb") do |csv|
  csv << @infodata.keys
  @infodata.each do |key, value|
    csv << value
  end
end
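
In the sample output shown earlier, the `:titles` array ends up in a single cell in its inspected form (`["t1", "t2"]`). One possible way to avoid that, sketched here under assumptions, is to join array values before appending the row. `append_row`, the `" | "` separator and the shortened header list are all illustrative, and a temporary file stands in for infodata.csv.

```ruby
require 'csv'
require 'tempfile'

# Sketch: append one row per parsed document, in a fixed header order,
# joining array values into a single cell.
# `append_row` and the " | " separator are illustrative (assumptions).
def append_row(path, headers, infodata)
  row = headers.map do |header|
    value = infodata[header]
    value.is_a?(Array) ? value.join(' | ') : value
  end
  # "ab" appends, so earlier rows are kept; CSV adds the line ending itself
  CSV.open(path, 'ab') { |csv| csv << row }
end

file = Tempfile.new(['infodata', '.csv'])
headers = [:titles, :identifier, :typeOfLevel]
append_row(file.path, headers, { titles: %w[t1 t2], identifier: 1 })
append_row(file.path, headers, { titles: %w[t3 t4], identifier: 2 })
puts File.read(file.path)
```

Mapping over a fixed header list keeps the columns aligned even when a document lacks some of the values.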
