Taking multiple lines from a file and creating hash - ruby

I'm reading in a file's contents and creating a hash based on newlines. I've been able to make a hash from the contents of each line, but how can I create a hash from everything before the next blank line? Below is what I have so far.
Input:
Title 49th parallel
URL http://artsweb.bham.ac.uk/
Domain artsweb.bham.ac.uk
Title ABAA booknet
URL http://abaa.org/
Domain abaa.org
Code:
File.readlines('A.cfg').each do |line|
  unless line.strip.empty?
    hash = Hash[*line.strip.split("\t")]
    puts hash
  end
  puts "\n" if line.strip.empty?
end
Outputs:
{"Title"=>"49th parallel"}
{"URL"=>"http://artsweb.bham.ac.uk/"}
{"Domain"=>"artsweb.bham.ac.uk"}
{"Title"=>"ABAA booknet"}
{"URL"=>"http://abaa.org/"}
{"Domain"=>"abaa.org"}
Desired Output:
{"Title"=>"49th parallel", "URL"=>"http://artsweb.bham.ac.uk/", "Domain"=>"artsweb.bham.ac.uk"}
{"Title"=>"ABAA booknet", "URL"=>"http://abaa.org/", "Domain"=>"abaa.org"}

Modifying your existing code, this does what you want:
hash = {}
File.readlines('A.cfg').each do |line|
  if line.strip.empty?
    puts hash unless hash.empty?
    hash = {}
    puts "\n"
  else
    hash.merge!(Hash[*line.strip.split("\t")])
  end
end
puts hash unless hash.empty? # flush the final record
You can likely simplify that depending on what you're actually doing with the data.
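For instance, here is a minimal sketch of one such simplification, assuming the same tab-separated A.cfg format: group the lines into records with Enumerable#slice_when, then build one hash per record.
records = File.readlines('A.cfg')
  .map(&:strip)
  .slice_when { |_, line| line.empty? }  # start a new chunk at each blank line
  .map { |chunk| chunk.reject(&:empty?) }
  .reject(&:empty?)
  .map { |chunk| chunk.map { |line| line.split("\t", 2) }.to_h }
records.each { |h| puts h }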

open('A.cfg', &:read)
  .strip.split(/\n{2,}/)                      # records are separated by blank lines
  .map { |s| Hash[s.scan(/^(\S+)\s+(.+)$/)] } # first word is the key, the rest of the line is the value
gives
[
  {
    "Title" => "49th parallel",
    "URL" => "http://artsweb.bham.ac.uk/",
    "Domain" => "artsweb.bham.ac.uk"
  },
  {
    "Title" => "ABAA booknet",
    "URL" => "http://abaa.org/",
    "Domain" => "abaa.org"
  }
]

Read the whole content of the file using read:
contents = ""
File.open('A.cfg') do |file|
  contents = file.read
end
And then split the contents on two newline characters:
contents.split("\n\n")
And lastly, write a function, pretty similar to what you already have, to parse those chunks (see the sketch below).
Please note that if you are working on Windows you may need to split on a different sequence because of the carriage return character.
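A minimal sketch of that approach, assuming tab-separated fields and splitting on /\r?\n\r?\n/ so Windows line endings work too:
contents = File.read('A.cfg')
contents.split(/\r?\n\r?\n/).each do |chunk|
  # one chunk per record; each line is "key<TAB>value"
  hash = chunk.lines.map { |line| line.strip.split("\t", 2) }.to_h
  puts hash
end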

Related

How to read multiple XML files then output to multiple CSV files with the same XML filenames

I am trying to parse multiple XML files and output them into CSV files with the proper rows and columns.
I was able to do so by processing one file at a time, defining the input filename and the output filename explicitly:
File.open('H:/output/xmloutput.csv','w')
I would like to write to multiple files, making each name the same as the corresponding XML filename, without hard coding it. I have tried doing it multiple ways but have had no luck so far.
Sample XML:
<?xml version="1.0" encoding="UTF-8"?>
<record:root>
<record:Dataload_Request>
<record:name>Bob Chuck</record:name>
<record:Address_Data>
<record:Street_Address>123 Main St</record:Street_Address>
<record:Postal_Code>12345</record:Postal_Code>
</record:Address_Data>
<record:Age>45</record:Age>
</record:Dataload_Request>
</record:root>
Here is what I've tried:
require 'nokogiri'
require 'set'

files = ''
input_folder = "H:/input"
output_folder = "H:/output"
if input_folder[input_folder.length-1,1] == '/'
  input_folder = input_folder[0,input_folder.length-1]
end
if output_folder[output_folder.length-1,1] != '/'
  output_folder = output_folder + '/'
end
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
doc = Nokogiri::XML(file)
record = {} # hashes
keys = Set.new
records = [] # array
csv = ""
doc.traverse do |node|
  value = node.text.gsub(/\n +/, '')
  if node.name != "text" # skip these nodes: if class isnt text then skip
    if value.length > 0 # skip empty nodes
      key = node.name.gsub(/wd:/,'').to_sym
      if key == :Dataload_Request && !record.empty?
        records << record
        record = {}
      elsif key[/^root$|^document$/]
        # neglect these keys
      else
        key = node.name.gsub(/wd:/,'').to_sym
        # in case our value is html instead of text
        record[key] = Nokogiri::HTML.parse(value).text
        # add to our key set only if not already in the set
        keys << key
      end
    end
  end
end
# build our csv
File.open('H:/output/.*csv', 'w') do |file|
  file.puts %Q{"#{keys.to_a.join('","')}"}
  records.each do |record|
    keys.each do |key|
      file.write %Q{"#{record[key]}",}
    end
    file.write "\n"
  end
  print ''
  print 'output files ready!'
  print ''
end
I have been getting 'read memory': no implicit conversion of Array into String (TypeError) and other errors.
Here's a quick peer-review of your code, something like you'd get in a corporate environment...
Instead of writing:
input_folder = "H:/input"
input_folder[input_folder.length-1,1] == '/' # => false
Consider doing it using the -1 offset from the end of the string to access the character:
input_folder[-1] # => "t"
That simplifies your logic and makes it more readable by removing unnecessary visual noise:
input_folder[-1] == '/' # => false
See [] and []= in the String documentation.
This looks like a bug to me:
files = Dir[input_folder + '/*.xml'].sort_by{ |f| File.mtime(f)}
file = File.read(input_folder + '/' + files)
files is an array of filenames. input_folder + '/' + files is appending an array to a string:
foo = ['1', '2'] # => ["1", "2"]
'/parent/' + foo # =>
# ~> -:9:in `+': no implicit conversion of Array into String (TypeError)
# ~> from -:9:in `<main>'
How you want to deal with that is left as an exercise for the programmer.
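A minimal sketch of one way to deal with it (assuming you want one output per input): iterate over the array one path at a time. Note that Dir[] already returns paths that include the folder, so there is no need to prepend input_folder again.
files = Dir[input_folder + '/*.xml'].sort_by { |f| File.mtime(f) }
files.each do |path|                  # path already includes input_folder
  doc = Nokogiri::XML(File.read(path))
  # ... process one document at a time, then write its own CSV ...
end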
doc.traverse do |node|
is icky because it sidesteps Nokogiri's strength: searching directly for the tags you want. We very rarely need to iterate over a document tag by tag, usually only when we're peeking at its structure and layout. traverse is also slower, so use it as a very last resort.
length is nice but isn't needed when checking whether a string has content:
value = 'foo'
value.length > 0 # => true
value > '' # => true
value = ''
value.length > 0 # => false
value > '' # => false
Programmers coming from Java like to use the accessors but I like being lazy, probably because of my C and Perl backgrounds.
Be careful with sub and gsub, as they don't always do what you think. Both expect a regular expression; they will take a string, but they escape it before beginning their scan.
You're passing in a regular expression, which is OK in this case, but it could cause unexpected problems if you don't remember all the rules for pattern matching, and gsub scans until the end of the string:
foo = 'wd:barwd:' # => "wd:barwd:"
key = foo.gsub(/wd:/,'') # => "bar"
In general I recommend people think a couple times before using regular expressions. I've seen some gaping holes opened up in logic written by fairly advanced programmers because they didn't know what the engine was going to do. They're wonderfully powerful, but need to be used surgically, not as a universal solution.
The same thing happens with a string, because gsub doesn't know when to quit:
key = foo.gsub('wd:','') # => "bar"
So, if you're looking to change just the first instance use sub:
key = foo.sub('wd:','') # => "barwd:"
I'd do it a little differently though.
foo = 'wd:bar'
I can check to see what the first three characters are:
foo[0,3] # => "wd:"
Or I can replace them with something else using string indexing:
foo[0,3] = ''
foo # => "bar"
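On Ruby 2.5+ there's also String#delete_prefix, which says exactly what it means:
foo = 'wd:bar'
foo.delete_prefix('wd:')  # => "bar" (non-destructive)
foo.delete_prefix!('wd:') # => "bar" (mutates foo; returns nil if nothing was removed)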
There's more but I think that's enough for now.
You should use Ruby's CSV class. Also, you don't need any string matching or regex gymnastics: use Nokogiri to target the elements. If the node names in the XML are consistent, it should be pretty simple. I'm not sure this is exactly the output you want, but it should get you headed in the right direction:
require 'nokogiri'
require 'csv'

def xml_to_csv(filename)
  xml_str = File.read(filename)
  xml_str.gsub!('record:', '') # remove the record: namespace
  doc = Nokogiri::XML xml_str
  csv_filename = filename.gsub('.xml', '.csv')
  CSV.open(csv_filename, 'wb') do |row|
    row << ['name', 'street_address', 'postal_code', 'age']
    row << [
      doc.xpath('//name').text,
      doc.xpath('//Street_Address').text,
      doc.xpath('//Postal_Code').text,
      doc.xpath('//Age').text,
    ]
  end
end

# iterate over all xml files
Dir.glob('*.xml').each { |filename| xml_to_csv(filename) }

Change Headers for Certain Columns in CSV File

I have a CSV file that I want to change the headers only for certain columns (about 20 of them in my actual file). Here's a sample CSV file:
CSV File
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
I have been trying this with the File/IO classes, but when the file is read and the gsub runs, all of the quotation marks around the comma-separated strings are removed. Here's the code I'm using:
Ruby Code
file = 'file.csv'
replacements = {
  'blah_01_blah' => 'newblah1',
  'foo_01_foo' => 'coolfoo1',
  'bacon_01_bacon' => 'goodpig1',
  'bacon_02_bacon' => 'goodpig2'
}
matcher = /#{replacements.keys.join('|')}/
outdata = File.read(file).gsub(matcher, replacements)
File.open(file, 'w') do |out|
  out << outdata
end
What I end up with is this in the CSV file:
New CSV File
name,blah_01_blah,foo_1_01_foo,bacon_01_bacon,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae
It's keeping the quotation marks in the blank fields but dropping them around the other strings. I want to retain those quotation marks in case a rogue comma ends up inside a string somewhere, so the parsing doesn't get thrown off. How can I change the headers without losing the quotation marks around the strings?
EDIT - This is what I want the file to look like at the end.
Expected Result CSV File
"name","newblah1","coolfoo1","goodpig1","goodpig2"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
Thanks!
You don’t need to handle CSV at all:
File.write(
  file,
  File.readlines(file).tap do |lines|
    lines.first.gsub!(matcher, replacements)
  end.join
)
See File.readlines.
The trick here is that we deal with the first line only, treating it as plain text.
Let's first create the input CSV file.
text = <<_
"name","blah_01_blah","foo_1_01_foo","bacon_01_bacon","bacon_02_bacon"
"John","yucky","summer","yum","food"
"Mary","","","cool","sundae"
_
file_in = 'file_in.csv'
file_out = 'file_out.csv'
File.write(file_in, text)
#=> 137
Here is the replacements hash, which I simplified slightly.
replacements = { 'blah_01_blah' => 'newblah1', 'foo_01_foo' => 'coolfoo1',
                 'bacon_01_bacon' => 'goodpig1' }
The first task is to modify this hash so that if it has no key k, replacements[k] will return k. For this we use the method Hash#default_proc=.
replacements.default_proc = ->(_,k) { k }
Here are two examples of how this hash is used.
replacements['bacon_01_bacon']
#=> "goodpig1"
replacements['name']
#=> "name"`
The latter follows because replacements has no key 'name'.
The code is as follows.
require 'csv'

f_in = CSV.read(file_in, headers: true)
CSV.open(file_out, 'w') do |csv_out|
  csv_out << replacements.values_at(*f_in.headers)
  f_in.each { |row| csv_out << row }
end
#=> #<CSV::Table mode:col_or_row row_count:3>
Note that
f_in.headers
#=> ["name", "blah_01_blah", "foo_1_01_foo", "bacon_01_bacon", "bacon_02_bacon"]
Let's look at the output file.
puts File.read(file_out)
prints
name,newblah1,foo_1_01_foo,goodpig1,bacon_02_bacon
John,yucky,summer,yum,food
Mary,"","",cool,sundae
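If, like the asker, you want every field in the output quoted, CSV's force_quotes option handles that; a small variation on the code above:
CSV.open(file_out, 'w', force_quotes: true) do |csv_out|
  csv_out << replacements.values_at(*f_in.headers)
  f_in.each { |row| csv_out << row }
end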

Ruby iterate over multiple values for single key

Okay, so I have a hash where some keys contain multiple values. I am trying to create new files, with each key being the filename and its values written to the text file (one value per line). Here is what I've got:
@agencyList.each do |domain, email|
  File.open(domain.to_s, "w") { |file| file.write(email) }
end
The issue is that only the first element of the value set is being written to the file. Any ideas?
If I understand correctly, @agencyList is an array of hashes. For example:
@agencyList = [
  {domain: 'domain1', email: 'email11'},
  {domain: 'domain1', email: 'email12'},
  {domain: 'domain2', email: 'email21'},
]
In this case File.open(domain.to_s, "w") uses the wrong file mode: "w" truncates and recreates the file on every iteration, so each file ends up containing only one value, the last one.
Try opening the files in append mode ("a") instead, and write via puts so each value lands on its own line:
@agencyList.each do |hash|
  File.open(hash[:domain].to_s, "a") { |file| file.puts(hash[:email]) }
end
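(Alternatively, a sketch that avoids reopening files: group the array by domain first, so each file is opened and written exactly once.)
@agencyList
  .group_by { |h| h[:domain] }
  .each do |domain, entries|
    File.write(domain.to_s, entries.map { |h| h[:email] }.join("\n") + "\n")
  end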
OR
But if you are saying
I have a hash with a keys where some contain multiple values per key
A Hash cannot contain two entries with the same key, so your @agencyList would be a Hash whose values are arrays:
@agencyList = {
  'key1' => ['val11', 'val12'],
  'key2' => ['val21'],
}
If so, your code should be something like this:
@agencyList.each do |domain, emails|
  File.open(domain.to_s, "w") do |file|
    emails.each do |email|
      file.puts(email)
    end
  end
end
You'd need to iterate over the emails set:
domains.each { |domain, emails|
  File.open(domain, 'w') { |f|
    emails.each { |email|
      f.puts(email)
    }
  }
}

How do I make an array of arrays out of a CSV?

I have a CSV file that looks like this:
Jenny, jenny@example.com ,
Ricky, ricky@example.com ,
Josefina josefina@example.com ,
I'm trying to get this output:
users_array = [
  ['Jenny', 'jenny@example.com'], ['Ricky', 'ricky@example.com'], ['Josefina', 'josefina@example.com']
]
I've tried this:
users_array = Array.new
file = File.new('csv_file.csv', 'r')
file.each_line("\n") do |row|
  puts row + "\n"
  columns = row.split(",")
  users_array.push columns
  puts users_array
end
Unfortunately, in Terminal, this returns:
Jenny
jenny@example.com
Ricky
ricky@example.com
Josefina
josefina@example.com
Which I don't think will work for this:
users_array.each_with_index do |user|
  add_page.form_with(:id => 'new_user') do |f|
    f.field_with(:id => "user_email").value = user[0]
    f.field_with(:id => "user_name").value = user[1]
  end.click_button
end
What do I need to change? Or is there a better way to solve this problem?
Ruby's standard library has a CSV class with an API similar to File's, plus a number of useful methods for working with tabular data. To get the output you want, all you need to do is this:
require 'csv'
users_array = CSV.read('csv_file.csv')
PS - I think you are getting the output you expected from your file parsing as well, but you may be thrown off by how it prints to the terminal. puts behaves differently with arrays, printing each member on its own line instead of as a single array. If you want to view it as an array, use puts my_array.inspect.
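For example:
my_array = [['Jenny', 'jenny@example.com']]
puts my_array         # prints each element on its own line:
                      #   Jenny
                      #   jenny@example.com
puts my_array.inspect # prints the structure: [["Jenny", "jenny@example.com"]]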
Assuming that your CSV file actually has a comma between the name and email address on the third line:
require 'csv'

users_array = []
CSV.foreach('csv_file.csv') do |row|
  users_array.push row.delete_if(&:nil?).map(&:strip)
end
users_array
# => [["Jenny", "jenny@example.com"],
#     ["Ricky", "ricky@example.com"],
#     ["Josefina", "josefina@example.com"]]
There may be a simpler way (one candidate is sketched below), but what I'm doing there is discarding the nil field created by the trailing comma and stripping the spaces around the email addresses.
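A sketch of one such simpler form, under the same assumptions about the trailing comma and stray spaces:
require 'csv'
users_array = CSV.read('csv_file.csv').map { |row| row.compact.map(&:strip) }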

Removing whitespaces in a CSV file

I have a string with extra whitespace:
First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
I want to parse this line and remove the extra whitespace.
My code looks like:
namespace :db do
  task :populate_contacts_csv => :environment do
    require 'csv'
    csv_text = File.read('file_upload_example.csv')
    csv = CSV.parse(csv_text, :headers => true)
    csv.each do |row|
      puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
    end
  end
end
@prices = CSV.parse(IO.read('prices.csv'), :headers => true,
  :header_converters => lambda { |f| f.strip },
  :converters => lambda { |f| f ? f.strip : nil })
The nil test is added to the field converter but not the header converter, on the assumption that headers are never nil while the data might be, and nil doesn't have a strip method. I'm really surprised that, AFAIK, :strip is not a pre-defined converter!
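You can register one yourself, though. A sketch:
require 'csv'
# Make :strip available by name in both converter registries.
CSV::Converters[:strip] = ->(f) { f.respond_to?(:strip) ? f.strip : f }
CSV::HeaderConverters[:strip] = ->(h) { h.strip }
@prices = CSV.parse(IO.read('prices.csv'), headers: true,
                    header_converters: :strip, converters: :strip)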
You can strip your hash first:
csv.each do |unstripped_row|
  row = {}
  unstripped_row.each { |k, v| row[k.strip] = v ? v.strip : v } # guard against nil fields
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Edited to strip hash keys too
CSV supports "converters" for the headers and fields, which let you get inside the data before it's passed to your each loop.
Writing a sample CSV file:
csv = "First,Last,Email ,Mobile Phone ,Company,Title ,Street,City,State,Zip,Country, Birthday,Gender ,Contact Type
first,last,email ,mobile phone ,company,title ,street,city,state,zip,country, birthday,gender ,contact type
"
File.write('file_upload_example.csv', csv)
Here's how I'd do it:
require 'csv'
csv = CSV.open('file_upload_example.csv', :headers => true)
[:convert, :header_convert].each { |c| csv.send(c) { |f| f.strip } }
csv.each do |row|
  puts "First Name: #{row['First']} \nLast Name: #{row['Last']} \nEmail: #{row['Email']}"
end
Which outputs:
First Name: first
Last Name: last
Email: email
The converters simply strip leading and trailing whitespace from each header and each field as they're read from the file.
Also, as a programming design choice, don't read your file into memory using:
csv_text = File.read('file_upload_example.csv')
Then parse it:
csv = CSV.parse(csv_text, :headers => true)
Then loop over it:
csv.each do |row|
Ruby's IO system supports enumerating over a file line by line. Once my code does CSV.open, the file is readable and each reads one line at a time. The entire file doesn't need to be in memory at once, which wouldn't be scalable (though on new machines it's becoming a lot more reasonable), and, if you benchmark it, you'll find that reading a file using each is extremely fast, probably just as fast as reading it, parsing it, then iterating over the parsed content.
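A minimal sketch of that line-at-a-time style, using CSV.foreach, which opens, enumerates, and closes the file for you:
require 'csv'
CSV.foreach('file_upload_example.csv', headers: true) do |row|
  # only one record is in memory at a time
  puts "First Name: #{row['First']}"
end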
