Speed up csv import - ruby

I want to import a large amount of CSV data (not directly into AR, but after some fetching), and my code is very slow.
def csv_import
  require 'csv'
  file = File.open("/#{Rails.public_path}/uploads/shate.csv")
  csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})
  csv.each do |row|
    #ename,esupp= row[1].split(/_/)
    #(ename,esupp,foo) = row[1]..split('_')
    abrakadabra = row[0].to_s()
    (ename, esupp) = abrakadabra.split(/_/)
    eprice = row[6]
    eqnt = row[1]
    # logger.info("1) ")
    # logger.info(ename)
    # logger.info("---")
    # logger.info(esupp)
    #----
    #ename = row[4]
    #eprice = row[7]
    #eqnt = row[10]
    #esupp = row[12]
    if ename.present? && ename.size > 3
      search_condition = "*" + ename.upcase + "*"
      if esupp.present?
        #supplier = @suppliers.find{|item| item['SUP_BRAND'] =~ Regexp.new(".*#{esupp}.*") }
        supplier = Supplier.where("SUP_BRAND like ?", "%#{esupp}%").first
        logger.warn("!!! *** supp !!!")
        #logger.warn(supplier)
      end
      if supplier.present?
        @search = ArtLookup.find(:all, :conditions => ['MATCH (ARL_SEARCH_NUMBER) AGAINST(? IN BOOLEAN MODE)', search_condition.gsub(/[^0-9A-Za-z]/, '')])
        @articles = Article.find(:all, :conditions => { :ART_ID => @search.map(&:ARL_ART_ID)})
        @art_concret = @articles.find_all{|item| item.ART_ARTICLE_NR.gsub(/[^0-9A-Za-z]/, '').include?(ename.gsub(/[^0-9A-Za-z]/, '')) }
        @aa = @art_concret.find{|item| item['ART_SUP_ID']==supplier.SUP_ID} #| @articles
        if @aa.present?
          @art = Article.find_by_ART_ID(@aa)
        end
        if @art.present?
          @art.PRICEM = eprice
          @art.QUANTITYM = eqnt
          @art.datetime_of_update = DateTime.now
          @art.save
        end
      end
      logger.warn("------------------------------")
    end
    #logger.warn(esupp)
  end
end
Even if I delete everything else and leave only this, it is still slow.
def csv_import
  require 'csv'
  file = File.open("/#{Rails.public_path}/uploads/shate.csv")
  csv = CSV.open(file, "r:ISO-8859-15:UTF-8", {:col_sep => ";", :row_sep => :auto, :headers => :first_row})
  csv.each do |row|
  end
end
Could anybody help me increase the speed using fastercsv?

I don't think it will get much faster.
That said, some testing shows that a significant part of the time is spent on transcoding (about 15% for my test case). So if you could skip that (e.g. by creating the CSV in UTF-8 in the first place), you would see some improvement.
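For example, a minimal sketch of that idea, assuming a one-off conversion of the uploaded file is acceptable (the shate_utf8.csv path is made up for illustration):

require 'csv'

src = "/#{Rails.public_path}/uploads/shate.csv"
dst = "/#{Rails.public_path}/uploads/shate_utf8.csv"  # hypothetical converted copy

# transcode the whole file once up front...
File.write(dst, File.read(src, :encoding => 'ISO-8859-15:UTF-8'))

# ...then parse it without any per-row transcoding
CSV.foreach(dst, :col_sep => ';', :row_sep => :auto, :headers => :first_row) do |row|
  # use row here...
end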
Besides, according to ruby-doc.org, the "primary" interface for reading CSV files is foreach, so it should be preferred:
def csv_import
  require 'csv'
  CSV.foreach("/#{Rails.public_path}/uploads/shate.csv", {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ';', :row_sep => :auto, :headers => :first_row}) do |row|
    # use row here...
  end
end
Update
You could also try splitting the parsing into several threads. I saw some performance increase experimenting with this code (handling of the header row is left out):
N = 10000

def csv_import
  all_lines = File.read("/#{Rails.public_path}/uploads/shate.csv").lines
  # parts will contain the parsed CSV data of the different chunks/slices
  # threads will contain the threads
  parts, threads = [], []
  # iterate over chunks/slices of N lines of the CSV file
  all_lines.each_slice(N) do |plines|
    # add an array object for the current chunk to parts
    parts << result = []
    # create a thread for parsing the current chunk, hand it over the chunk
    # and the current parts sub-array
    threads << Thread.new(plines.join, result) do |tsrc, tresult|
      # parse the chunk
      parsed = CSV.parse(tsrc, {:encoding => 'ISO-8859-15:UTF-8', :col_sep => ";", :row_sep => :auto})
      # add the parsed data to the parts sub-array
      tresult.replace(parsed.to_a)
    end
  end
  # wait for all threads to finish
  threads.each(&:join)
  # merge all the parts sub-arrays into one big array and iterate over it
  parts.flatten(1).each do |row|
    # use row (Array)
  end
end
This splits the input into chunks of 10000 lines and creates a parsing thread for each of the chunks. Each thread gets handed a sub-array of the array parts for storing its result. When all threads have finished (after threads.each(&:join)), the results of all chunks in parts are joined, and that's it.

As its name implies, FasterCSV is, well, faster :)
http://fastercsv.rubyforge.org
Also see this for some more info:
Ruby on Rails Moving from CSV to FasterCSV
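If you are still on Ruby 1.8, a minimal sketch using the gem (untested; FasterCSV exposes essentially the same interface that became the built-in CSV class in 1.9, so on 1.9+ there is nothing extra to install):

require 'rubygems'
require 'fastercsv'

FasterCSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
                  :col_sep => ';', :headers => :first_row) do |row|
  # use row here...
end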

I'm curious how big the file is, and how many columns it has.
Using CSV.foreach is the preferred way. It would be interesting to see the memory profile as your app is running. (Sometimes the slowness is due to printing, so make sure you don't do more of that than you need.)
You might be able to preprocess the file and exclude any row that doesn't have an esupp, since it looks like your code only cares about those rows. Also, you could truncate any right-hand columns you don't care about.
Another technique would be to gather up the unique components and put them in a hash; it looks like you are firing the same query multiple times.
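For instance, a rough sketch of that caching idea, reusing the Supplier lookup from the question (the supplier_cache name is made up for illustration; untested):

require 'csv'

# each distinct esupp value hits the database only once
supplier_cache = Hash.new do |cache, name|
  cache[name] = Supplier.where("SUP_BRAND like ?", "%#{name}%").first
end

CSV.foreach("/#{Rails.public_path}/uploads/shate.csv",
            :encoding => 'ISO-8859-15:UTF-8', :col_sep => ';',
            :row_sep => :auto, :headers => :first_row) do |row|
  ename, esupp = row[0].to_s.split('_')
  next unless ename.present? && ename.size > 3 && esupp.present?
  supplier = supplier_cache[esupp]
  # ... the rest of the per-row work from the question
end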
You just need to profile it and see where it's spending its time.

Check out the gem smarter_csv! It can read CSV files in chunks, and you can then create Resque jobs to further process those chunks and insert them into a database.
https://github.com/tilo/smarter_csv
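A rough sketch of that chunked approach (the chunk size, the ; separator, and the CsvChunkJob worker name are illustrative assumptions, not smarter_csv defaults):

require 'smarter_csv'

SmarterCSV.process("/#{Rails.public_path}/uploads/shate.csv",
                   :chunk_size => 100, :col_sep => ';') do |chunk|
  # chunk is an array of row hashes; hand it off to a background worker
  Resque.enqueue(CsvChunkJob, chunk)
end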

Related

Filter unique values from a tsv file

I have a tsv file that has four columns. I'm having difficulty isolating the first column of the file (UUID), so I can strip out the 'UUID=' from each element and also filter for unique values.
What am I doing wrong in my code? I've been pretty stuck on figuring this out. Thank you in advance!
Here's the link to the file, and my code below.
https://drive.google.com/file/d/1mGaK3n3YCrzrwOgSo5QQZ62FXDKJ3nZ8/view?usp=sharing
require "csv"
log_file = CSV.foreach("output_file.tsv",{:col_sep => "\t", :headers => true}) do |row|
uuid = row["UUID"]
ip = row["IP"]
time = row["TIME"]
ua = row["UA"]
uuid = uuid.drop(1)
ip = ip.drop(1)
time = time.drop(1)
ua = ua.drop(1)
uuid = uuid.map { |element|
element = element[5..-1]}
unique_logins = uuid.uniq
puts uuid.uniq.length
Probably you're confused a bit and think that CSV.foreach reads in the whole column, but it actually reads your file row by row. That's why there is no need for the drop(1) calls.
This is some minimal code which collects the UUIDs from the file, prints how many there are, and then prints the number of unique UUIDs:
require "csv"
uuids = []
log_file = CSV.foreach("output_file.tsv",{:col_sep => "\t", :headers => true}) do |row|
uuids << row["UUID"]
end
uuids = uuids.map { |element| element = element[5..-1]}
p uuids.length
unique_logins = uuids.uniq
p unique_logins.length
If your file isn't that big, you could also just read the entire file in at once, and then use the returned CSV::Table to read the entire column out and operate on that:
require 'csv'
tsv = CSV.read("output_file.tsv", col_sep: "\t", headers: true)
uuids = tsv['UUID'].map { |uuid| uuid[/\AUUID=(.+)\z/, 1] }.uniq
# => ["e9fc3b6e6641e69fb8cfbdfac48709ae", "f296020354e8c913454f62732d0e3dc4",
# "0300481b1e495e3c919b5214dda7b26c", "9ccc4096ed1d11d1b4c9e57ca1192176",
# "c0580eeb3f98d9c3fe232fc48694bf8e", "25ee63a754b9d4590b69b9ab2a4668cd",
# "aa61387f01797a839ca6f55daeb69b30", "9c7f37f5c187f662eaf7d0df83ac8804"]

How to remove a row from a CSV with Ruby

Given the following CSV file, how would you remove all rows that contain the word 'true' in the column 'foo'?
Date,foo,bar
2014/10/31,true,derp
2014/10/31,false,derp
I have a working solution; however, it requires making a secondary CSV object, @csv_no_foo.
@csv = CSV.read(@csvfile, headers: true) # http://bit.ly/1mSlqfA
@headers = CSV.open(@csvfile, 'r', :headers => true).read.headers

# Make a new CSV
@csv_no_foo = CSV.new(@headers)

@csv.each do |row|
  # puts row[5]
  if row[@headersHash['foo']] == 'false'
    @csv_no_foo.add_row(row)
  else
    puts "not pushing row #{row}"
  end
end
Ideally, I would just remove the offending row from the CSV like so:
...
if row[@headersHash['foo']] == 'false'
  @csv.delete(true) # Doesn't work
...
Looking at the Ruby documentation, it looks like the row class has a delete_if function. I'm confused about the syntax that function requires. Is there a way to remove the row without making a new CSV object?
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV/Row.html#method-i-each
You should be able to use CSV::Table#delete_if, but you need to use CSV::table instead of CSV::read, because the former will give you a CSV::Table object, whereas the latter results in an Array of Arrays. Be aware that this setting will also convert the headers to symbols.
table = CSV.table(@csvfile)

table.delete_if do |row|
  row[:foo] == 'true'
end

File.open(@csvfile, 'w') do |f|
  f.write(table.to_csv)
end
You might want to filter the rows in a Ruby manner:

require 'csv'

csv = CSV.parse(File.read(@csvfile), {
  :col_sep => ",",
  :headers => true
}).select { |item| item['foo'] != 'true' }

Hope it helps.

Taking json data and converting it to a CSV file

Okay... so new to Ruby here but loving it so far. My problem is I cannot get the data to go into the CSV files.
#!/usr/bin/env ruby
require 'date'
require_relative 'amf'
require 'json'
require 'csv'

amf = Amf.new

# This makes it go out 3 days
apps = amf.post( 'Appointments.getBetweenDates',
                 { 'startDate' => Date.today, 'endDate' => Date.today + 4 }
               )

apps.each do |app|
  cor_md_params = { 'appId' => app['appID'], 'relId' => 7 }
  cor_md = amf.post( 'Clinicians.getByAppIdAndRelId', cor_md_params ).first
  # this is where it breaks ----->
  CSV.open("ile.csv", "wb") do |csv|
    csv << ["column1", "column2", "etc.", "etc.."]
    csv << ([
      # if added puts ([ I can display the info and then make a csv...
      app['patFirstName'],
      app['patMiddleName'],
      app['patLastName'],
      app['patBirthdate'],
      app['patHin'],
      app['patPhone'],
      app['patCellPhone'],
      app['patBusinessPhone'],
      app['appTime'],
      app['appID'],
      app['patPostalCode'],
      app['patProvince'],
      app['locName'],
      # note that this is not exactly accurate for follow-ups,
      # where you have to replace the "1" with the actual value
      # in weeks, days, months, etc
      #app[ 'bookName' ], => not sure this is needed
      cor_md['id'],
      cor_md['providerCode'],
      cor_md['firstName'],
      cor_md['lastName']
    ].join(', '))
  end
end
Now, if I remove the attempt to make the ile.csv file and just output it with a puts, all the data shows. But I don't want to have to go into the terminal and create a CSV file... I would rather just run the .rb program and have it created. Also, hopefully I am building the columns correctly as well...
The thought occurred to me that I could just add another puts above the output.
Or, better, insert a row into the array before I output it...
I'm really not sure what best practice and standards are here.
This is what I have done and attempted. How can I get it to cleanly output to a CSV file, since my attempts are not working?
Also, to clarify where it breaks: it does add the column names, just not the parsed JSON info. I could also be completely doing this the wrong way, or in a way that isn't possible. I just do not know.
What kind of error do you get? Is it this one:
`<<': undefined method `map' for "something":String (NoMethodError)
I think you should remove the .join(', ').
The << method of CSV accepts an Array, but not a String.
http://ruby-doc.org/stdlib-1.9.2/libdoc/csv/rdoc/CSV.html#method-i-3C-3C
So instead of:
cor_md['lastName']
].join(', '))
rather:
cor_md['lastName']
])
The problem with the loop (why it writes only one row of data):
In the body of your loop you always reopen the file and overwrite what you added before. What you want to do is probably this:
CSV.open("ile3.csv", "wb") do |csv|
csv << ["column1", "column2", "etc.", "etc.."]
apps.each do |app|
cor_md_params = { 'appId' => app['appID'], 'relId' => 7 }
cor_md = amf.post( 'Clinicians.getByAppIdAndRelId', cor_md_params ).first
#csv << your long array
end
end

Ruby - Builder - Trying to convert CSV rows as data sets for constructing several XML's

Here's what I'm trying to accomplish. I need to have a single CSV with headers and several rows. I'm iterating through the headers, storing them, and then associating the row data with the headers. I need to be able to iterate through each of the rows in the CSV to use for constructing an XML's data. The constructed XML is then dumped as a .xml file, and the program starts on the next row in the CSV. Each row has a column that provides the name of the XML file.
Here's what I've got so far.
Read in the data from the CSV file. Collect the header and row data.
def get_rows
  raw_data = CSV.read('test.csv', {:skip_blanks => false, :headers => true})
  data = []
  raw_data.each { |row| data << row }
  return build_header(data, raw_data)
end
Take the header and row data and marry them up.
def build_header(data, raw_data)
  (0..(data.length - 1)).each do |ri|
    h = {}
    raw_data.headers.each_with_index do |v, i|
      h[v] = data[ri].fields[i]
    end
    return build_ostruct(h)
  end
end
Take the hash h and make an OpenStruct of it.
def build_ostruct(h)
  x = OpenStruct.new(h)
  uniq_name = x.tc_name
  y = uniq_name.to_s + ".xml"
  # marshal dump for debugging
  x.marshal_dump.each { |k, v| puts "#{k} => #{v}" }
  return xml_builder(x, y)
end
Below this I'm taking the new ostruct "x" and calling the column headers from the CSV to populate the XML nodes. For example: x.column1, x.column2, x.column3.
Now the part I'm getting hung up on is getting the ostruct to receive a new row of data on each iteration. The objective is to have the ostruct populated with each row from the CSV. Currently the hash is displaying the proper data set and my XML is populating as expected, but only with the first row of data. How do I get this to iterate through all the rows and populate the ostruct with each row's data so I can create a bulk set of XMLs?
Thanks in advance for any and all help!
Something like this should work:
require 'csv'
require 'nokogiri'

CSV.foreach('test.csv', :headers => true) do |row|
  builder = Nokogiri::XML::Builder.new do |xml|
    xml.root do |root|
      row.each do |k, v|
        root.send k, v
      end
    end
  end
  File.open("#{row['tc_name']}.xml", 'w') { |f| f << builder.to_xml }
end
You are calling return in build_header, which ends the method call. You need to collect your results in some way without immediately returning the first one, so that build_header can run over the entire set of rows.
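For example, a rough sketch of that idea (it reuses the xml_builder and tc_name names from the question; the build_all method name is made up, and this is untested):

require 'csv'
require 'ostruct'

def build_all
  CSV.foreach('test.csv', :headers => true) do |row|
    x = OpenStruct.new(row.to_hash)     # one OpenStruct per CSV row
    xml_builder(x, "#{x.tc_name}.xml")  # build and dump one XML file per row
  end
end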

import from CSV into Ruby array, with 1st field as hash key, then lookup a field's value given header row

Maybe somebody can help me.
Starting with a CSV file like so:
Ticker,"Price","Market Cap"
ZUMZ,30.00,933.90
XTEX,16.02,811.57
AAC,9.83,80.02
I manage to read them into an array:
require 'csv'
tickers = CSV.read("stocks.csv", {:headers => true, :return_headers => true, :header_converters => :symbol, :converters => :all} )
To verify data, this works:
puts tickers[1][:ticker]
ZUMZ
However this doesn't:
puts tickers[:ticker => "XTEX"][:price]
How would I go about turning this array into a hash using the ticker field as unique key, such that I could easily look up any other field associatively as defined in line 1 of the input? Dealing with many more columns and rows.
Much appreciated!
Like this (it works with other CSVs too, not just the one you specified):
require 'csv'

tickers = {}
CSV.foreach("stocks.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
  tickers[row.fields[0]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end
Result:
{"ZUMZ"=>{:price=>30.0, :market_cap=>933.9}, "XTEX"=>{:price=>16.02, :market_cap=>811.57}, "AAC"=>{:price=>9.83, :market_cap=>80.02}}
You can access elements in this data structure like this:
puts tickers["XTEX"][:price] #=> 16.02
Edit (according to comment): For selecting elements, you can do something like
tickers.select { |ticker, vals| vals[:price] > 10.0 }
CSV.read(file_path, headers: true, header_converters: :symbol, converters: :all).collect do |row|
  Hash[row.collect { |c, r| [c, r] }]
end

Or, more simply:

CSV.read(file_path, headers: true, header_converters: :symbol, converters: :all).collect do |row|
  row.to_h
end
To add on to Michael Kohl's answer, if you want to access the elements in the following manner
puts tickers[:price]["XTEX"] #=> 16.02
You can try the following code snippet:
CSV.foreach("Workbook1.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
hash_row = row.headers[1..-1].zip( (Array.new(row.fields.length-1, row.fields[0]).zip(row.fields[1..-1])) ).to_h
hash_row.each{|key, value| tickers[key] ? tickers[key].merge!([value].to_h) : tickers[key] = [value].to_h}
end
To get the best of both worlds (very fast reading from a huge file AND the benefits of a native Ruby CSV object), my code has since evolved into this method:
$stock="XTEX"
csv_data = CSV.parse IO.read(%`|sed -n "1p; /^#{$stock},/p" stocks.csv`), {:headers => true, :return_headers => false, :header_converters => :symbol, :converters => :all}
# Now the 1-row CSV object is ready for use, eg:
$company = csv_data[:company][0]
$volatility_month = csv_data[:volatility_month][0].to_f
$sector = csv_data[:sector][0]
$industry = csv_data[:industry][0]
$rsi14d = csv_data[:relative_strength_index_14][0].to_f
which is closer to my original method, but only reads in one record plus line 1 of the input CSV file containing the headers. The inline sed instructions take care of that, and the whole thing is noticeably instant. This is better than the last approach because now I can access all the fields from Ruby, and associatively, without caring about column numbers anymore, as was the case with awk.
Not as one-liner-y, but this was clearer to me.
csv_headers = CSV.parse(STDIN.gets)
csv = CSV.new(STDIN)
kick_list = []
csv.each_with_index do |row, i|
  row_hash = {}
  row.each_with_index do |field, j|
    row_hash[csv_headers[0][j]] = field
  end
  kick_list << row_hash
end
While this isn't a 100% native Ruby solution to the original question, should others stumble here and wonder what awk call I wound up using for now, here it is:
$dividend_yield = IO.readlines("|awk -F, '$1==\"#{$stock}\" {print $9}' datafile.csv")[0].to_f
where $stock is the variable I had previously assigned to a company's ticker symbol (the wannabe key field).
It conveniently survives problems by returning 0.0 if the ticker, the file, or field #9 is not found or empty, or if the value cannot be cast to a float. So any trailing '%' in my case gets nicely truncated.
Note that at this point one could easily add more filters within awk to have IO.readlines return a one-dimensional array of output lines from the smaller resulting CSV, e.g.
awk -F, '$9 >= 2.01 && $2 > 99.99 {print $0}' datafile.csv
outputs in bash which lines have a DivYld (col 9) over 2.01 and a price (col 2) over 99.99. (Unfortunately I'm not using the header row to determine field numbers, which is where I was ultimately hoping for some searchable associative Ruby array.)
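For what it's worth, a small sketch of reading the header row once to find the field number, instead of hard-coding $9 (the DivYld header name is assumed here; untested):

require 'csv'

headers = CSV.open("datafile.csv", &:shift)   # read just the header row
col = headers.index("DivYld") + 1             # awk fields are 1-based
$dividend_yield = IO.readlines("|awk -F, '$1==\"#{$stock}\" {print $#{col}}' datafile.csv")[0].to_f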
