I have a bunch of pipe-delimited files that weren't properly escaped for carriage returns when they were generated, so I can't use the CR or newline characters to delimit the rows. I DO know, however, that each record has to have exactly 7 fields.
Splitting the fields is easy with the CSV library in Ruby 1.9 setting the 'col_sep' argument, but the 'row_sep' argument cannot be set because I have newlines within the fields.
Is there a way to parse a pipe-delimited file using a fixed number of fields as the row delimiter?
Thanks!
Here's one way of doing it:
Build a sample string of seven words, with an embedded new-line in the
middle of the string. There are three lines worth.
text = (["now is the\ntime for all good"] * 3).join(' ').gsub(' ', '|')
puts text
# >> now|is|the
# >> time|for|all|good|now|is|the
# >> time|for|all|good|now|is|the
# >> time|for|all|good
Process like this:
lines = []
chunks = text.gsub("\n", '|').split('|')
while (chunks.any?)
lines << chunks.slice!(0, 7).join(' ')
end
puts lines
# >> now is the time for all good
# >> now is the time for all good
# >> now is the time for all good
So, that shows we can rebuild the rows.
Pretending that the words are actually columns from the pipe-delimited file we can make the code do the real thing by taking out the .join(' '):
while (chunks.any?)
lines << chunks.slice!(0, 7)
end
ap lines
# >> [
# >> [0] [
# >> [0] "now",
# >> [1] "is",
# >> [2] "the",
# >> [3] "time",
# >> [4] "for",
# >> [5] "all",
# >> [6] "good"
# >> ],
# >> [1] [
# >> [0] "now",
# >> [1] "is",
# >> [2] "the",
# >> [3] "time",
# >> [4] "for",
# >> [5] "all",
# >> [6] "good"
# >> ],
# >> [2] [
# >> [0] "now",
# >> [1] "is",
# >> [2] "the",
# >> [3] "time",
# >> [4] "for",
# >> [5] "all",
# >> [6] "good"
# >> ]
# >> ]
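The example above treats every newline as if it were a field separator. If the newlines are instead part of a field's text, a variation is to keep appending physical lines to a buffer until it holds enough delimiters for a full record. The following is only a sketch: `rebuild_records` is a hypothetical helper name, and it assumes the last field of a record never contains a newline (otherwise the record boundary is ambiguous).

```ruby
# Hypothetical helper: rebuilds records by counting delimiters.
# Assumes the final field of each record contains no embedded newline.
def rebuild_records(text, field_count = 7, delimiter = '|')
  records = []
  buffer = nil
  text.each_line do |line|
    line = line.chomp
    # Re-join a broken physical line with the newline it originally contained.
    buffer = buffer.nil? ? line : "#{buffer}\n#{line}"
    # A complete record has field_count - 1 delimiters.
    if buffer.count(delimiter) >= field_count - 1
      records << buffer.split(delimiter, field_count)
      buffer = nil
    end
  end
  # Keep any incomplete trailing data rather than silently dropping it.
  records << buffer.split(delimiter, field_count) unless buffer.nil?
  records
end

sample = "1|a|b\nc|d|e|f|g\n2|h|i|j|k|l|m\n"
rebuilt = rebuild_records(sample)
rebuilt.each { |fields| p fields }
```

Passing the field count as a limit to `split` keeps trailing empty fields, which a plain `split('|')` would drop.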
Say for instance you wanted to parse all charities in the IRS txt file that is pipe delimited.
Say you had a model called Charity that had all the same fields as your pipe delimited file.
class Charity < ActiveRecord::Base
# http://apps.irs.gov/app/eos/forwardToPub78DownloadLayout.do
# http://apps.irs.gov/app/eos/forwardToPub78Download.do
attr_accessible :city, :country, :deductibility_status, :deductibility_status_description, :ein, :legal_name, :state
end
You can make a rake task called import.rake
namespace :import do
  desc "Import Pipe Delimited IRS 501(c)(3) Data"
  task :irs_data => :environment do
    require 'csv'
    txt_file_path = 'db/irs_5013cs.txt'
    # readlines returns one string per physical line of the file
    results = File.readlines(txt_file_path)
    # Order  Field                             Notes
    # 1      EIN                               Required
    # 2      Legal Name                        Optional
    # 3      City                              Optional
    # 4      State                             Optional
    # 5      Deductibility Status              Optional
    # 6      Country                           Optional - If Country is null, then Country is assumed to be United States
    # 7      Deductibility Status Description  Optional
    results.each do |row|
      # chomp the line ending, then keep the first seven fields
      row = row.chomp.split('|').first(7)
      Charity.create!({
        :ein => row[0],
        :legal_name => row[1],
        :city => row[2],
        :state => row[3],
        :deductibility_status => row[4],
        :country => row[5],
        :deductibility_status_description => row[6]
      })
    end
  end
end
Finally, you can run this import by typing the following on the command line from your Rails app:
rake import:irs_data
Here's one idea, use a regex:
#!/opt/local/bin/ruby
fp = File.open("pipe_delim.txt")
r1 = /.*?\|.*?\|.*?\|.*?\|.*?\|.*?\|.*?\|/m
results = fp.gets.scan(r1)
results.each do |result|
puts result
end
This regex seems to trip up on newlines within a field, but I'm sure you could tweak it to work properly.
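One likely reason the snippet stumbles is that `gets` returns only a single line, so the regex never sees a field that continues on the next line. A possible tweak, sketched here with an inline string standing in for `File.read("pipe_delim.txt")`, is to scan the whole text at once. The `RECORD` pattern is an assumption of this sketch: six fields that may contain newlines, then a seventh field that may not.

```ruby
# Six "field|" groups (fields may span lines), then a final field that
# ends at a newline or at the end of the input.
RECORD = /((?:[^|]*\|){6}[^|\n]*)(?:\n|\z)/

# Stand-in for File.read("pipe_delim.txt")
text = "1|a|b\nc|d|e|f|g\n2|h|i|j|k|l|m\n"

# scan returns one single-element array per match because of the capture group.
records = text.scan(RECORD).flatten.map { |record| record.split('|', 7) }
records.each { |fields| p fields }
```

As with any approach that relies only on the field count, this cannot tell where a record ends if the seventh field itself contains a newline.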
Just a thought, but the cucumber testing gem has a Cucumber::Ast::Table class you could use to process this file.
Cucumber::Ast::Table.new(File.read(file))
Then I think it's the rows method you can use to read it out.
Try using String#split and Enumerable#each_slice:
result = []
text.split('|').each_slice(7) { |record| result << record }
Related
I have a CSV file that looks something like this:
ID,Name,Age
1,John,99
I've required csv in my Ruby script.
But using CSV, how do I loop thru the header row? How do I find the position number for ID,Name and Age?
After copying your data to a file x.csv, I executed the following in irb:
2.3.0 :009 > require 'csv'
=> false
2.3.0 :010 > csv = CSV.read 'x.csv'
=> [["ID", "Name", "Age"], ["1", "John", "99"]]
2.3.0 :011 > header_line = csv[0]
=> ["ID", "Name", "Age"]
2.3.0 :012 > header_line[0]
=> "ID"
2.3.0 :013 > header_line[1]
=> "Name"
2.3.0 :014 > header_line[2]
=> "Age"
...so this is one way you can do it; use read to get an array of arrays, and assume the first is an array of column headings.
In the real world you probably won't want to read the entire file into memory at once and would use CSV.foreach:
#!/usr/bin/env ruby
require 'csv'

data = []
CSV.foreach('x.csv') do |values_in_row|
  if @column_names # column names already read; this must be a data line
    data << values_in_row # just an example
    # do something with values_in_row
  else
    @column_names = values_in_row
  end
end
puts "Column names are: #{@column_names.join(', ')}"
puts "Data lines are:"
puts data
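As an alternative sketch, the CSV library can treat the first row as headers itself when given `headers: true`. The inline `data` string below stands in for the contents of x.csv, and `positions` is just an illustrative variable name.

```ruby
require 'csv'

# Stand-in for the contents of x.csv
data = "ID,Name,Age\n1,John,99\n"

# With headers: true the first row becomes the header list of a CSV::Table.
table = CSV.parse(data, headers: true)

# Map each header name to its column position.
positions = table.headers.each_with_index.to_h
p positions

# Rows can then be read by header name instead of by index.
table.each { |row| puts "#{row['Name']} is #{row['Age']}" }
```

The same `headers: true` option can be passed to `CSV.foreach` to avoid loading the whole file into memory.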
If using the 'csv' library in ruby, how would you replace the headers without re-reading in a file?
foo.csv
'date','foo','bar'
1,2,3
4,5,6
Using a CSV::Table because of this answer
Here is a working solution, however it requires writing and reading from a file twice.
require 'csv'
@csv = CSV.table('foo.csv')
# Perform additional operations, like remove specific pieces of information.
# Save fixed csv to a file (with incorrect headers)
File.open('bar.csv','w') do |f|
  f.write(@csv.to_csv)
end
# New headers
new_keywords = ['dur','hur', 'whur']
# Reopen the file, replace the headers, and print it out for debugging
# Not sure how to replace the headers of a CSV::Table object, however I *can* replace the headers of an array of arrays (hence the file.open)
lines = File.readlines('bar.csv')
lines.shift
lines.unshift(new_keywords.join(',') + "\n")
puts lines.join('')
# TODO: re-save file to disk
How could I modify the headers without reading from disk twice?
'dur','hur','whur'
1,x,3
4,5,x
Update
For those curious, here is the unabridged code. In order to use things like delete_if() the CSV must be imported with the CSV.table() function.
Perhaps the headers could be changed by converting the csv table into an array of arrays, however I'm not sure how to do that.
Given a test.csv file whose contents look like this:
id,name,age
1,jack,8
2,jill,9
You can replace the header row using this:
require 'csv'
array_of_arrays = CSV.read('test.csv')
p array_of_arrays # => [["id", "name", "age"],
# => ["1", "jack", "8"],
# => ["2", "jill", "9"]]
new_keywords = ['dur','hur','whur']
array_of_arrays[0] = new_keywords
p array_of_arrays # => [["dur", "hur", "whur"],
# => ["1", "jack", "8"],
# => ["2", "jill", "9"]]
Or if you'd rather preserve your original two-dimensional array:
new_array = Array.new(array_of_arrays)
new_array[0] = new_keywords
p new_array # => [["dur", "hur", "whur"],
# => ["1", "jack", "8"],
# => ["2", "jill", "9"]]
p array_of_arrays # => [["id", "name", "age"],
# => ["1", "jack", "8"],
# => ["2", "jill", "9"]]
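If the data is already held in a CSV::Table, one way to swap the headers without a second trip to disk is to write the new header row followed by each row's fields through `CSV.generate`. This is a sketch under a few assumptions: the inline `source` string stands in for foo.csv (with the quotes dropped), and the table is built with `CSV.parse(..., headers: true)` rather than `CSV.table`.

```ruby
require 'csv'

# Stand-in for the contents of foo.csv
source = "date,foo,bar\n1,2,3\n4,5,6\n"
table = CSV.parse(source, headers: true)

new_keywords = ['dur', 'hur', 'whur']

# Build the output in memory: new header row first, then the data rows.
output = CSV.generate do |csv|
  csv << new_keywords
  table.each { |row| csv << row.fields }
end

puts output
```

The resulting string can be written to disk once, so the intermediate bar.csv is not needed.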
I'm trying to solve this with a regex pattern, and even though my test passes with this solution, I would like split to only have ["1", "2"] inside the array. Is there a better way of doing this?
irb testing:
s = "//;\n1;2" # when given a delimiter of ';'
s2 = "1,2,3" # should read between commas
s3 = "//+\n2+2" # should read between delimiter of '+'
s.split(/[,\n]|[^0-9]/)
=> ["", "", "", "", "1", "2"]
Production:
module StringCalculator
def self.add(input)
solution = input.scan(/\d+/).map(&:to_i).reduce(0, :+)
input.end_with?("\n") ? nil : solution
end
end
Test:
context 'when given a newline delimiter' do
it 'should read between numbers' do
expect(StringCalculator.add("1\n2,3")).to eq(6)
end
it 'should not end in a newline' do
expect(StringCalculator.add("1,\n")).to be_nil
end
end
context 'when given different delimiter' do
it 'should support that delimiter' do
expect(StringCalculator.add("//;\n1;2")).to eq(3)
end
end
Very simple using String#scan :
s = "//;\n1;2"
s.scan(/\d/) # => ["1", "2"]
/\d/ - A digit character ([0-9])
Note :
If you have a string like below then, you should use /\d+/.
s = "//;\n11;2"
s.scan(/\d+/) # => ["11", "2"]
You're getting data that looks like this string: //1\n212
If you're getting the data as a file, then treat it as two separate lines. If it's a string, then, again, treat it as two separate lines. In either case it'd look like
//1
212
when output.
If it's a string:
input = "//1\n212".split("\n")
delimiter = input.first[2] # => "1"
values = input.last.split(delimiter) # => ["2", "2"]
If it's a file:
line = File.foreach('foo.txt')
delimiter = line.next[2] # => "1"
values = line.next.chomp.split(delimiter) # => ["2", "2"]
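Putting the two-line idea together with the calculator from the question might look like the sketch below. `StringCalculatorSketch` is a hypothetical name, and the rule that a custom delimiter is announced by a leading `//` line is taken from the question's test cases.

```ruby
module StringCalculatorSketch
  def self.add(input)
    # Mirrors the question's rule: input may not end in a newline.
    return nil if input.end_with?("\n")

    if input.start_with?('//')
      # First line declares the delimiter, the rest holds the numbers.
      header, body = input.split("\n", 2)
      delimiter = header[2..-1]
      numbers = body.to_s.split(delimiter)
    else
      # Default delimiters: comma or newline.
      numbers = input.split(/[,\n]/)
    end

    numbers.map(&:to_i).reduce(0, :+)
  end
end

p StringCalculatorSketch.add("//;\n1;2")
p StringCalculatorSketch.add("1\n2,3")
```

Because the delimiter is passed to `split` as a string, characters such as `+` are matched literally rather than as regex operators.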
I would like to extract some information from a string in Ruby by only reading the String once (O(n) time complexity).
Here is an example:
The string looks like this: -location here -time 7:30pm -activity biking
I have a Ruby object I want to populate with this info. All the keywords are known, and they are all optional.
class ActivityInfo
  attr_reader :location, :time, :activity

  def initialize(str)
    @location, @time, @activity = DEFAULT_LOCATION, DEFAULT_TIME, DEFAULT_ACTIVITY
    # Here is how I was planning on implementing this
    current_string = ""
    next_parameter = nil # A reference to keep track of which parameter the current string is referring to
    words = str.split
    while !words.empty?
      word = words.shift
      case word
      when "-location"
        if !next_parameter.nil?
          next_parameter.parameter = current_string # Set the parameter value to the current_string
          current_string = ""
        else
          next_parameter = @location
        end
      when "-time"
        if !next_parameter.nil?
          next_parameter.parameter = current_string
          current_string = ""
        else
          next_parameter = @time
        end
      when "-activity"
        if !next_parameter.nil?
          next_parameter.parameter = current_string
          current_string = ""
        else
          next_parameter = @activity
        end
      else
        if !current_string.empty?
          current_string += " "
        end
        current_string += word
      end
    end
  end
end
So basically I just don't know how to make a variable be the reference of another variable or method, so that I can then set it to a specific value. Or maybe there is just another more efficient way to achieve this?
Thanks!
The string looks suspiciously like a command-line, and there are some good Ruby modules to parse those, such as optparse.
Assuming it's not, here's a quick way to parse the commands in your sample into a hash:
cmd = '-location here -time 7:30pm -activity biking'
Hash[*cmd.scan(/-(\w+) (\S+)/).flatten]
Which results in:
{
"location" => "here",
"time" => "7:30pm",
"activity" => "biking"
}
Expanding it a bit farther:
class ActivityInfo
  def initialize(h)
    @location = h['location']
    @time     = h['time'    ]
    @activity = h['activity']
  end
end
act = ActivityInfo.new(Hash[*cmd.scan(/-(\w+) (\S+)/).flatten])
Which sets act to an instance of ActivityInfo looking like:
#<ActivityInfo:0x101142df8
@activity = "biking",
@location = "here",
@time = "7:30pm"
>
--
The OP asked how to deal with situations where the commands are not flagged with - or are multiple words. These are equivalent, but I prefer the first stylistically:
irb(main):003:0> cmd.scan(/-((?:location|time|activity)) \s+ (\S+)/x)
[
[0] [
[0] "location",
[1] "here"
],
[1] [
[0] "time",
[1] "7:30pm"
],
[2] [
[0] "activity",
[1] "biking"
]
]
irb(main):004:0> cmd.scan(/-(location|time|activity) \s+ (\S+)/x)
[
[0] [
[0] "location",
[1] "here"
],
[1] [
[0] "time",
[1] "7:30pm"
],
[2] [
[0] "activity",
[1] "biking"
]
]
If the commands are multiple words, such as "at location":
irb(main):009:0> cmd = '-at location here -time 7:30pm -activity biking'
"-at location here -time 7:30pm -activity biking"
irb(main):010:0>
irb(main):011:0* cmd.scan(/-((?:at \s location|time|activity)) \s+ (\S+)/x)
[
[0] [
[0] "at location",
[1] "here"
],
[1] [
[0] "time",
[1] "7:30pm"
],
[2] [
[0] "activity",
[1] "biking"
]
]
If you need even more flexibility look at Ruby's strscan module. You can use that to tear apart a string and find the commands and their parameters.
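For illustration, here is a rough sketch of what a StringScanner-based pass could look like. It walks the string once, and it assumes that a value runs until the next whitespace followed by `-` and a word character, so multi-word values are allowed. `parse_flags` is a hypothetical helper name.

```ruby
require 'strscan'

# Hypothetical helper: returns a hash of flag name => value.
def parse_flags(str)
  scanner = StringScanner.new(str)
  options = {}
  until scanner.eos?
    scanner.skip(/\s+/)
    # Stop if the next token is not a flag such as "-location".
    break unless scanner.scan(/-(\w+)/)
    key = scanner[1]
    # Consume characters up to (not including) the next " -flag".
    value = scanner.scan(/(?:(?!\s-\w).)*/m)
    options[key] = value.to_s.strip
  end
  options
end

flags = parse_flags('-location down town -time 7:30pm -activity mountain biking')
p flags
```

A value that itself contains whitespace followed by a dash and a letter would be cut short, so this only suits input where that cannot happen.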
Convert String to Options Hash
If you just want easy access to your flags and their values, you can split your string into a hash where each flag is a key. For example:
options = Hash[ str.scan /-(\w+)\s+(\S+)/ ]
=> {"location"=>"here", "time"=>"7:30pm", "activity"=>"biking"}
You can then reference values directly (e.g. options['location']) or iterate through your hash in key/value pairs. For example:
options.each_pair { |k, v| puts "%s %s" % [k, v] }
A Dash of Metaprogramming
Okay, this is serious over-engineering, but I spent a little extra time on this question because I found it interesting. I'm not claiming the following is useful; I'm just saying it was fun for me to do.
If you want to parse your option flags and dynamically create a set of attribute readers and set some instance variables without having to define each flag or variable separately, you can do this with a dash of metaprogramming.
# Set attribute readers and instance variables dynamically
# using BasicObject#instance_eval.
class ActivityInfo
  def initialize(str)
    options = Hash[str.scan(/-(\w+)\s+(\S+)/)]
    options.each_pair do |k, v|
      self.class.instance_eval { attr_reader k.to_sym }
      instance_variable_set("@#{k}", v)
    end
  end
end
ActivityInfo.new '-location here -time 7:30pm -activity biking'
=> #<ActivityInfo:0x00000001b49398
@activity="biking",
@location="here",
@time="7:30pm">
Honestly, I think setting your variables explicitly from an options hash such as:
@activity = options['activity']
will convey your intent more clearly (and be more readable), but it's always good to have alternatives. Your mileage may vary.
Why reinvent the wheel when Thor can do the heavy lifting for you?
class ActivityInfo < Thor
desc "record", "record details of your activity"
method_option :location, :type => :string, :aliases => "-l", :required => true
method_option :time, :type => :datetime, :aliases => "-t", :required => true
method_option :activity, :type => :string, :aliases => "-a", :required => true
def record
location = options[:location]
time = options[:time]
activity = options[:activity]
# record details of the activity
end
end
The options will be parsed for you based on the datatype you specified. You can invoke it programmatically:
task = ActivityInfo.new([], {location: 'NYC', time: Time.now, activity: 'Chilling out'})
task.record
Or from command line: thor activity_info:record -l NYC -t "2012-06-23 02:30:00" -a "Chilling out"
I have a string:
TFS[MAD,GRO,BCN],ALC[GRO,PMI,ZAZ,MAD,BCN],BCN[ALC,...]...
I want to convert it into a list:
list = (
[0] => "TFS"
[0] => "MAD"
[1] => "GRO"
[2] => "BCN"
[1] => "ALC"
[0] => "GRO"
[1] => "PMI"
[2] => "ZAZ"
[3] => "MAD"
[4] => "BCN"
[2] => "BCN"
[1] => "ALC"
[2] => ...
[3] => ...
)
How do I do this in Ruby?
I tried:
(([A-Z]{3})\[([A-Z]{3},+))
But it returns only the first element in [] and doesn't make a comma optional (at the end of "]").
You need to tell the regex that the , is not required after each element, but instead in front of each argument except the first. This leads to the following regex:
str="TFS[MAD,GRO,BCN],ALC[GRO,PMI,ZAZ,MAD,BCN],BCN[ALC]"
str.scan(/[A-Z]{3}\[[A-Z]{3}(?:,[A-Z]{3})*\]/)
#=> ["TFS[MAD,GRO,BCN]", "ALC[GRO,PMI,ZAZ,MAD,BCN]", "BCN[ALC]"]
You can also use scan's behavior with capturing groups, to split each match into the part before the brackets and the part inside the brackets:
str.scan(/([A-Z]{3})\[([A-Z]{3}(?:,[A-Z]{3})*)\]/)
#=> [["TFS", "MAD,GRO,BCN"], ["ALC", "GRO,PMI,ZAZ,MAD,BCN"], ["BCN", "ALC"]]
You can then use map to split each part inside the brackets into multiple tokens:
str.scan(/([A-Z]{3})\[([A-Z]{3}(?:,[A-Z]{3})*)\]/).map do |x,y|
[x, y.split(",")]
end
#=> [["TFS", ["MAD", "GRO", "BCN"]],
# ["ALC", ["GRO", "PMI", "ZAZ", "MAD", "BCN"]],
# ["BCN", ["ALC"]]]
Here's another way using a hash to store your contents, and less regex.
string = "TFS[MAD,GRO,BCN],ALC[GRO,PMI,ZAZ,MAD,BCN],BCN[ALC]"
z=Hash.new([])
string.split(/][ \t]*,/).each do |x|
o,p=x.split("[")
z[o]=p.split(",")
end
z.each_pair{|x,y| print "#{x}:#{y}\n"}
output
$ ruby test.rb
TFS:["MAD", "GRO", "BCN"]
ALC:["GRO", "PMI", "ZAZ", "MAD", "BCN"]
BCN:["ALC]"]
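Note the stray `]` left on the last entry, which comes from splitting on `],`. A variation that keeps the hash but lets `scan` pick out each code and its bracketed list avoids that. This is only a sketch; `routes` is an illustrative name and the pattern assumes three-letter upper-case codes as in the example.

```ruby
string = "TFS[MAD,GRO,BCN],ALC[GRO,PMI,ZAZ,MAD,BCN],BCN[ALC]"

# Each match yields [code, "comma,separated,list"]; the brackets are not captured.
routes = string.scan(/([A-Z]{3})\[([^\]]*)\]/)
               .map { |code, list| [code, list.split(',')] }
               .to_h

routes.each_pair { |code, destinations| puts "#{code}:#{destinations}" }
```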
First, split the groups:
groups = s.scan(/[^,][^\[]*\[[^\[]*\]/)
# => ["TFS[MAD,GRO,BCN]", "ALC[GRO,PMI,ZAZ,MAD,BCN]"]
Now you have the groups, the rest is pretty straightforward:
groups.map {|x| [x[0..2], x[4..-2].split(',')] }
# => [["TFS", ["MAD", "GRO", "BCN"]], ["ALC", ["GRO", "PMI", "ZAZ", "MAD", "BCN"]]]
If I understood correctly, you may want to get such array.
yourexamplestring.scan(/([A-Z]{3})\[([^\]]+)/).map{|a,b|[a,b.split(',')]}
[["TFS", ["MAD", "GRO", "BCN"]], ["ALC", ["GRO", "PMI", "ZAZ", "MAD", "BCN"]], ["BCN", ["ALC", "..."]]]