trying to get the delta between columns using FasterCSV - ruby

A bit of a noob here, so apologies in advance.
I am trying to read a CSV file that has a number of columns. I would like to see whether the string "foo" exists anywhere in the file and, if so, grab the string one cell over (same row, next column) and write it to a file.
my file c.csv:
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
So in this case, I would want "bar" and "tom" in a new CSV file.
Here's what I have so far:
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
rows = FasterCSV.read("c.csv")
acolumn = rows.collect{|row| row[0]}
if acolumn.select{|v| v =~ /foo/} == 1
i = 0
for z in i..(acolumn).count
puts rows[1][i]
end
I've looked at https://github.com/circle/fastercsv/blob/master/examples/csv_table.rb, but I am obviously not understanding it. My best guess is that I'd have to use Table to do what I want, but after banging my head against the wall for a bit, I decided to ask the experienced folks for advice. Help, please?

Given your input file c.csv
foo,bar,yip
12,apple,yap
23,orange,yop
foo,tom,yum
then this script:
#!/usr/bin/ruby1.8
require 'fastercsv'
FasterCSV.open('output.csv', 'w') do |output|
  FasterCSV.foreach('c.csv') do |row|
    foo_index = row.index('foo')
    if foo_index
      value_to_the_right_of_foo = row[foo_index + 1]
      output << value_to_the_right_of_foo
    end
  end
end
will create the file output.csv
bar
tom
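FasterCSV became Ruby's standard `CSV` library in Ruby 1.9, so on a current Ruby the same idea can be sketched without the gem. This is a minimal sketch, not the answerer's code: the helper name `extract_right_of` and its parameters are hypothetical. As far as I know, the standard library's `CSV#<<` expects an Array (or a `CSV::Row`), so the single value is wrapped in an array here.

```ruby
require 'csv'

# Hypothetical helper: for every row containing `needle`, write the cell
# immediately to its right to `output_path`, one value per output row.
def extract_right_of(input_path, output_path, needle)
  CSV.open(output_path, 'w') do |output|
    CSV.foreach(input_path) do |row|
      index = row.index(needle)
      next unless index

      value = row[index + 1]
      # Skip matches with nothing to the right (last column or an empty cell).
      next if value.nil?

      # Wrap the single value so it is written as a one-field row.
      output << [value]
    end
  end
end
```

With the sample file above, `extract_right_of('c.csv', 'output.csv', 'foo')` should write the same two lines, bar and tom.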

Related

Ignoring multiple header lines in a CSV

I've worked a bit with Ruby's CSV module, but am having some problems getting it to ignore multiple header lines.
Specifically, here are the first twenty lines of a file I want to parse:
USGS Digital Spectral Library splib06a
Clark and others 2007, USGS, Data Series 231.
For further information on spectrsocopy, see: http://speclab.cr.usgs.gov
ASCII Spectral Data file contents:
line 15 title
line 16 history
line 17 to end: 3-columns of data:
wavelength reflectance standard deviation
(standard deviation of 0.000000 means not measured)
( -1.23e34 indicates a deleted number)
----------------------------------------------------
Olivine GDS70.a Fo89 165um W1R1Bb AREF
copy of splib05a r 5038
0.205100 -1.23e34 0.090781
0.213100 -1.23e34 0.018820
0.221100 -1.23e34 0.005416
0.229100 -1.23e34 0.002928
The actual headers are given on the tenth line, and the seventeenth line is where the actual data start.
Here's my code:
require "nyaplot"
# Note that DataFrame basically just inherits from Ruby's CSV module.
class SpectraHelper < Nyaplot::DataFrame
  class << self
    def from_csv filename
      df = super(filename, col_sep: ' ') do |csv|
        csv.convert do |field, info|
          STDERR.puts "Field is #{field}"
        end
      end
    end
  end
  def csv_headers
    [:wavelength, :reflectance, :standard_deviation]
  end
end
def read_asc filename
  f = File.open(filename, "r")
  16.times do
    line = f.gets
    puts "Ignoring #{line}"
  end
  d = SpectraHelper.from_csv(f)
end
The output suggests that my calls to f.gets are not actually ignoring those lines, and I can't understand why. Here are the first few lines of output:
Field is Clark
Field is and
Field is others
Field is 2007,
Field is USGS,
I tried looking for a tutorial or example which shows processing of more complicated CSV files, but haven't had much luck. If someone could point me towards a resource which answers this question, I would be grateful (and would prefer to mark that as accepted over a solution to my specific problem — but both would be appreciated).
Using Ruby 2.1.
I believe that you are using ::open, which uses IO.open. This method will open the file again.
I modified the script a bit:
require 'csv'
class SpectraHelper < CSV
  def self.from_csv(filename)
    df = open(filename, 'r' , col_sep: ' ') do |csv|
      csv.drop(16).each {|c| p c}
    end
  end
end
def read_asc(filename)
  SpectraHelper.from_csv(filename)
end
read_asc "data/csv1.csv"
It turns out the problem here was not with my understanding of CSV, but rather with how Nyaplot::DataFrame handles CSV files.
Basically, Nyaplot doesn't actually store things as CSVs. CSV is just an intermediate format. So a simple way to handle the files makes use of @khelli's suggestion:
def read_asc filename
  Nyaplot::DataFrame.new(CSV.open(filename, 'r',
                                  col_sep: ' ',
                                  headers: [:wavelength, :reflectance, :standard_deviation],
                                  converters: :numeric).
                         drop(16).
                         map do |csv_row|
                           csv_row.to_h.delete_if { |k,v| k.nil? }
                         end)
end
Thanks, everyone, for the suggestions.
I wouldn't use the CSV module, since your file is not well formatted. The following code will read the file and give you an array of your records:
# File.readlines closes the file for you, unlike File.open(...).readlines
lines = File.readlines(filename)
lines.slice!(0,16)
records = lines.map {|line| line.chomp.split}
The records output:
[["0.205100", "-1.23e34", "0.090781"], ["0.213100", "-1.23e34", "0.018820"], ["0.221100", "-1.23e34", "0.005416"], ["0.229100", "-1.23e34", "0.002928"]]
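The records above are still strings. A minimal sketch of the same approach that also converts each field to a number might look like this; the helper name `read_spectrum` and the default of 16 preamble lines are assumptions taken from the question, not part of the answer.

```ruby
# Hypothetical helper: skip a fixed number of preamble lines, then split the
# remaining lines on whitespace and convert every field to a Float.
def read_spectrum(filename, header_lines = 16)
  File.foreach(filename)
      .drop(header_lines)
      .reject { |line| line.strip.empty? }   # ignore blank lines
      .map { |line| line.split.map { |field| Float(field) } }
end
```

`Float(...)` raises an ArgumentError on a malformed field instead of silently returning 0.0 the way `String#to_f` does, which seems the safer choice for measurement data. The deleted-value marker -1.23e34 is kept as an ordinary Float here and would still need to be filtered by the caller.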

How to properly automate xml to xls

I have been getting a lot of XML files recently that I want to analyse in Excel. Instead of using the XML conversion built into (newer versions of) Excel, I want to use Ruby code that does it for a number of files automatically.
I am not very familiar with REXML, however. After half a day's work I got the code to convert just one(!) XML node. This is how it looks:
require 'rexml/document'
Dir.glob("FILES/archive/*.xml") do |eksemel|
  puts "converting #{eksemel}"
  filename = (/\d+/.match(eksemel)).to_s
  xml_file = File.open("#{eksemel}", "r")
  csv_file = File.new("#{filename}.csv", "w")
  xml = REXML::Document.new( xml_file )
  counter = 0
  xml.elements.each("RESULTS") do |e|
    e.elements.each("component") do |f|
      f.elements.each("paragraph") do |g|
        counter = counter + 1
        csv_file.puts g.text
      end
    end
  end
end
Is there a way to a) let Ruby discover the element names and their number automatically, instead of my defining them, and b) save all of these as separate columns in a CSV file?
It isn't clear what you are using counter for. It would also help if you clarified what kind of structure the XML file has (for instance, are there many <paragraph> elements within each <component> element?). But here is a cleaner way to write what I think you are shooting for:
require 'rexml/document'
require 'csv'
Dir.glob('FILES/archive/*.xml') do |eksemel|
  puts "converting #{eksemel}"
  # I assume you are creating a .csv file with the same name as your .xml file
  xml_file = File.new(eksemel)
  csv_file = CSV.open(eksemel.sub(/\.xml$/, '.csv'), 'w')
  xml = REXML::Document.new(xml_file)
  counter = xml.elements.to_a('RESULTS//component//paragraph').length
  xml.elements.each('RESULTS//component') do |component|
    # Write the text of each <paragraph>, not the element objects themselves
    csv_file << component.elements.to_a('paragraph').map(&:text)
  end
  [xml_file, csv_file].each {|f| f.close}
end
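To try the conversion without touching the file system, the core step can be sketched on an in-memory string. The helper name `components_to_csv` is hypothetical, and the RESULTS/component/paragraph layout is assumed from the question rather than confirmed.

```ruby
require 'rexml/document'
require 'csv'

# Hypothetical helper: one CSV row per <component>, one column per
# <paragraph> child. The element layout is an assumption from the question.
def components_to_csv(xml_string)
  doc = REXML::Document.new(xml_string)
  CSV.generate do |csv|
    doc.elements.each('RESULTS/component') do |component|
      # Take the text content of each paragraph as one field.
      csv << component.elements.to_a('paragraph').map(&:text)
    end
  end
end
```

Components with different numbers of paragraphs simply produce rows of different lengths, which spreadsheet programs generally accept.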

How do I make an array of arrays out of a CSV?

I have a CSV file that looks like this:
Jenny, jenny@example.com ,
Ricky, ricky@example.com ,
Josefina josefina@example.com ,
I'm trying to get this output:
users_array = [
  ['Jenny', 'jenny@example.com'], ['Ricky', 'ricky@example.com'], ['Josefina', 'josefina@example.com']
]
I've tried this:
users_array = Array.new
file = File.new('csv_file.csv', 'r')
file.each_line("\n") do |row|
  puts row + "\n"
  columns = row.split(",")
  users_array.push columns
  puts users_array
end
Unfortunately, in Terminal, this returns:
Jenny
jenny@example.com
Ricky
ricky@example.com
Josefina
josefina@example.com
Which I don't think will work for this:
users_array.each_with_index do |user|
  add_page.form_with(:id => 'new_user') do |f|
    f.field_with(:id => "user_email").value = user[0]
    f.field_with(:id => "user_name").value = user[1]
  end.click_button
end
What do I need to change? Or is there a better way to solve this problem?
Ruby's standard library has a CSV class with an API similar to File's, plus a number of useful methods for working with tabular data. To get the output you want, all you need to do is this:
require 'csv'
users_array = CSV.read('csv_file.csv')
PS - I think you are getting the output you expected with your file parsing as well, but maybe you're thrown off by how it is printing to the terminal. puts behaves differently with arrays, printing each member object on a new line instead of as a single array. If you want to view it as an array, use puts my_array.inspect.
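The difference between the two ways of printing can be seen with a small illustrative array (the data here is made up for the example):

```ruby
users_array = [['Jenny', 'jenny@example.com'], ['Ricky', 'ricky@example.com']]

# puts flattens nested arrays and prints every element on its own line.
puts users_array

# inspect returns the bracketed, literal-style representation as a String;
# p prints that same representation directly.
puts users_array.inspect
p users_array
```

The first call prints four separate lines, while the last two each print the whole nested array on a single line.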
Assuming that your CSV file actually has a comma between the name and email address on the third line:
require 'csv'
users_array = []
CSV.foreach('csv_file.csv') do |row|
  users_array.push row.delete_if(&:nil?).map(&:strip)
end
users_array
# => [["Jenny", "jenny@example.com"],
#     ["Ricky", "ricky@example.com"],
#     ["Josefina", "josefina@example.com"]]
There may be a simpler way, but what I'm doing there is discarding the nil field created by the trailing comma and stripping the spaces around the email addresses.

How not to save to csv when array is empty

I'm parsing a website, and I'm looking at potentially many millions of rows of content. However, CSV/Excel/ODS doesn't allow for more than a million rows.
That is why I'm trying to use a conditional to skip saving empty content. However, it's not working: my code keeps creating empty rows in the CSV.
This is the code I have:
# create csv
CSV.open("neverending.csv", "w") do |csv|
csv << ["kuk","date","name"]
# loop through all urls
File.foreach("neverendingurls.txt") do |line|
begin
doorzoekbarefile = Nokogiri::HTML(open(line))
for k in 1..999 do
# PROVISIONARY / CONDITIONAL
unless doorzoekbarefile.at_xpath("//td[contains(style, '60px')])[#{k}]").nil?
# xpaths
kuk = doorzoekbarefile.at_xpath("(//td[contains(@style,'60px')])[#{k}]")
date = doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]/following-sibling::*[1]")
name = doorzoekbarefile.at_xpath("(//td[contains(@style, '60px')])[#{k}]/following-sibling::*[2]")
# save to csv
csv << [kuk,date,name]
end
end
end
rescue
puts "error bij url #{line}"
end
end
end
Anybody have a clue what's going wrong or how to solve the problem? Basically I simply need to change the code so that it doesn't create a new row of csv data when the xpaths are empty.
This really doesn't have to do with XPath. It's a simple Array#empty? check: write the row only when at least one cell is non-nil.
row = [kuk,date,name]
csv << row unless row.compact.empty?
BTW, your code is a mess. Learn how to indent at least before posting again.
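The check can be tried in isolation with plain values instead of Nokogiri nodes. This is a minimal sketch: the helper name `rows_to_csv` is hypothetical, and treating whitespace-only cells as empty is an extra assumption that goes beyond the nil check above.

```ruby
require 'csv'

# Hypothetical helper: build CSV text from candidate rows, skipping any row
# whose cells are all nil or blank.
def rows_to_csv(rows)
  CSV.generate do |csv|
    rows.each do |row|
      next if row.all? { |cell| cell.nil? || cell.to_s.strip.empty? }

      csv << row
    end
  end
end
```

A row with at least one real value is still written, with its nil cells left as empty fields.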

trying to find the 1st instance of a string in a CSV using fastercsv

I'm trying to open a CSV file, look up a string, and then return the 2nd column of the CSV file, but only the first instance of it. I've gotten as far as the following, but unfortunately it returns every instance. I'm a bit flummoxed.
Can the gods of Ruby help? Thanks much in advance.
M
For the purpose of this example, let's say names.csv is a file with the following:
foo, happy
foo, sad
bar, tired
foo, hungry
foo, bad
#!/usr/local/bin/ruby -w
require 'rubygems'
require 'fastercsv'
require 'pp'
FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      output << lookup[2]
    end
  end
end
OK, so if I want to return all instances of foo, but in a CSV, how does that work?
What I'd like as an outcome is happy, sad, hungry, bad. I thought it would be:
FasterCSV.open('newfile.csv', 'w') do |output|
  FasterCSV.foreach('names.csv') do |lookup|
    index_PL = lookup.index('foo')
    if index_PL
      build_str << "," << lookup[2]
    end
    output << build_str
  end
end
but it does not seem to work
Replace foreach with open (to get an Enumerable) and find:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << FasterCSV.open('names.csv').find { |r| r.index('foo') }[2]
end
The index call will return nil if it doesn't find anything; that means that the find will give you the first row that has 'foo' and you can pull out the column at index 2 from the result.
If you're not certain that names.csv will have what you're looking for then a bit of error checking would be advisable:
FasterCSV.open('newfile.csv', 'w') do |output|
  foos_row = FasterCSV.open('names.csv').find { |r| r.index('foo') }
  if foos_row
    output << foos_row[2]
  else
    # complain or something
  end
end
Or, if you want to silently ignore the lack of 'foo' and use an empty string instead, you could do something like this:
FasterCSV.open('newfile.csv', 'w') do |output|
  output << (FasterCSV.open('names.csv').find { |r| r.index('foo') } || ['','',''])[2]
end
I'd probably go with the "complain if it isn't found" version though.
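For readers on a current Ruby, where FasterCSV lives on as the standard `CSV` library, both the "first instance" and the "all instances" variants can be sketched as below. The helper names and the `column` parameter are hypothetical. Note that `column` is a zero-based index, so in the two-column sample file the second column is index 1, and the fields are stripped because the sample has a space after each comma.

```ruby
require 'csv'

# Hypothetical helper: return the chosen cell from the first row that
# contains `needle`, or nil when no row matches.
def first_match(path, needle, column = 1)
  row = CSV.foreach(path).find { |r| r.include?(needle) }
  row ? row[column].to_s.strip : nil
end

# Hypothetical helper: return the chosen cell from every matching row.
def all_matches(path, needle, column = 1)
  CSV.foreach(path)
     .select { |r| r.include?(needle) }
     .map { |r| r[column].to_s.strip }
end
```

Calling `CSV.foreach` without a block returns an enumerator, which is what makes `find` and `select` available. To get all matches on a single output line, the array from `all_matches('names.csv', 'foo')` can be appended as one row, for example `CSV.open('newfile.csv', 'w') { |out| out << all_matches('names.csv', 'foo') }`.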

Resources