Parsing a dictionary text file in ruby - ruby

I am using ruby to try and parse a text file that has the form...
AAB eel bbc
ABA did eye non pap mom ere bob nun eve pip gig dad nan ana gog aha
mum sis ada ava ewe pop tit gag tat bub pup
eke ele hah huh pep sos tot wow aba ala
bib dud tnt
ABB all see off too ill add lee ass err xii ann fee vii inn egg odd bee dee goo
woo cnn pee fcc tee wee ebb edd gee ott ree vee ell orr rcc att boo cee cii
coo kee moo mss soo doo faa hee icc iss itt kii loo mee nee nuu ogg opp pii
tll upp voo zee
I need to be able to search by the first column, such as "AAB", and then search through all values that are associated with that key. I have tried to import the text file into a hash of arrays but could never get more than the first value to store. I have no preference as to how I search the file, whether that means storing the data in some data structure or just searching the text file every time; I just need to be able to do it. I am at a loss as to how to proceed, and any help would be greatly appreciated. Thanks
-amc25114

This will read your dictionary file. I'm storing the content in a string, then
turning it into a StringIO object to let me pretend it's a file. You can use
File.readlines to read directly from the file itself:
require 'pp'
require 'stringio'
text = 'AAB eel bbc
ABA did eye non pap mom ere bob nun eve pip gig dad nan ana gog aha
mum sis ada ava ewe pop tit gag tat bub pup
eke ele hah huh pep sos tot wow aba ala
bib dud tnt
ABB all see off too ill add lee ass err xii ann fee vii inn egg odd bee dee goo
woo cnn pee fcc tee wee ebb edd gee ott ree vee ell orr rcc att boo cee cii
coo kee moo mss soo doo faa hee icc iss itt kii loo mee nee nuu ogg opp pii
tll upp voo zee
'
file = StringIO.new(text)
dictionary = Hash[
file.readlines.slice_before(/^\S/).map{ |ary|
key, *values = ary.map(&:strip).join(' ').split(' ')
[key, values]
}
]
dictionary is a hash looking like:
{
"AAB"=>[
"eel", "bbc"
],
"ABA"=>[
"did", "eye", "non", "pap", "mom", "ere", "bob", "nun", "eve", "pip",
"gig", "dad", "nan", "ana", "gog", "aha", "mum", "sis", "ada", "ava",
"ewe", "pop", "tit", "gag", "tat", "bub", "pup", "eke", "ele", "hah",
"huh", "pep", "sos", "tot", "wow", "aba", "ala", "bib", "dud", "tnt"
],
"ABB"=>[
"all", "see", "off", "too", "ill", "add", "lee", "ass", "err", "xii",
"ann", "fee", "vii", "inn", "egg", "odd", "bee", "dee", "goo", "woo",
"cnn", "pee", "fcc", "tee", "wee", "ebb", "edd", "gee", "ott", "ree",
"vee", "ell", "orr", "rcc", "att", "boo", "cee", "cii", "coo", "kee",
"moo", "mss", "soo", "doo", "faa", "hee", "icc", "iss", "itt", "kii",
"loo", "mee", "nee", "nuu", "ogg", "opp", "pii", "tll", "upp", "voo", "zee"
]
}
You can look up using the keys:
dictionary['AAB']
=> ["eel", "bbc"]
And search inside the array using include?:
dictionary['AAB'].include?('eel')
=> true
dictionary['AAB'].include?('foo')
=> false
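If you would rather stream the file line by line than build one big string first, a sketch along these lines may work. The helper name parse_dictionary and the rule that a key is an all-uppercase first word are assumptions based on the sample data, not something the question guarantees:

```ruby
require 'stringio'

# Hypothetical helper: builds { key => [words] } from any IO-like object.
# Assumption: a key is an all-uppercase word at the start of a line;
# lines without one continue the previous key.
def parse_dictionary(io)
  current = nil
  io.each_line.with_object(Hash.new { |h, k| h[k] = [] }) do |line, dict|
    words = line.split
    next if words.empty?
    current = words.shift if words.first =~ /\A[A-Z]+\z/
    dict[current].concat(words) if current
  end
end

# A shortened stand-in for the real file; File.open(path) would work the same way.
sample = StringIO.new("AAB eel bbc\nABA did eye non\nmum sis ada\nABB all see off\n")

dictionary = parse_dictionary(sample)
p dictionary['ABA']              # words from both ABA lines
p dictionary['ABA'].grep(/\As/)  # values under ABA that start with "s"
```

Array#grep accepts any regex, so the same call covers most "search within one key" cases.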

class A
  def initialize
    @h, key = readlines.inject({}) do |m, s|
      a = s.split
      m[key = a.shift] = [] if s =~ /^[^\s]/
      m[key] += a
      m
    end
  end
  def lookup k, v # not sure what you really want to do here
    p [k, v, (@h[k].index v)]
  end
  self
end.new.lookup 'ABA', 'wow'

My 2 cents:
file = File.open("/path_to_file_here")
recent_key = ""
results = Hash.new
while (line = file.gets)
key = line[/[A-Z]+/]
recent_key = key if key
line.scan(/[a-z]+/).each do |val|
results[recent_key.to_sym] = [] if !results[recent_key.to_sym]
results[recent_key.to_sym] << val
end
end
puts results
This will give you this output:
{
:AAB=>["eel", "bbc"],
:ABA=>["did", "eye", "non", "pap", "mom", "ere", "bob", "nun", "eve", "pip", "gig", "dad", "nan", "ana", "gog", "aha", "mum", "sis", "ada", "ava", "ewe", "pop", "tit", "gag", "tat", "bub", "pup", "eke", "ele", "hah", "huh", "pep", "sos", "tot", "wow", "aba", "ala", "bib", "dud", "tnt"],
:ABB=>["all", "see", "off", "too", "ill", "add", "lee", "ass", "err", "xii", "ann", "fee", "vii", "inn", "egg", "odd", "bee", "dee", "goo", "woo", "cnn", "pee", "fcc", "tee", "wee", "ebb", "edd", "gee", "ott", "ree", "vee", "ell", "orr", "rcc", "att", "boo", "cee", "cii", "coo", "kee", "moo", "mss", "soo", "doo", "faa", "hee", "icc", "iss", "itt", "kii", "loo", "mee", "nee", "nuu", "ogg", "opp", "pii", "tll", "upp", "voo", "zee"]
}

Related

How can i separate a full name?

I have to take the right part, clean it, then compare it with the middle part and save it if they are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'
# Extending String class for blank? method
class String
def blank?
self.strip.empty?
end
end
# In
lines = CSV.read('entrada.csv')
lines.each do |line|
id = line[0].upcase.strip
left = line[1].upcase.strip
right = line[2].upcase.strip
eduardo = line[2].upcase.split(' ','de')
line[0] = id
line[1] = left
line[2] = right
line[4] = eduardo[0]+eduardo[1]
distance = Levenshtein.distance left, right
line << 99 if (left.blank? or right.blank?)
line << distance unless (left.blank? or right.blank?)
end
# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
lines.each do |line|
# counter = counter + 1 if line[3] <= 3
csv << line
end
end
# p counter
The middle column is the correct one; the right column is the one I need to correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda |De | Leon -> Marquilla Petronilda
First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
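The joining step could be sketched roughly as follows. The helper name join_composite_surnames and the exact rules (a particle never attaches to the first word, which is assumed to be the forename) are my own assumptions for illustration; real data will likely need more cases:

```ruby
# Linking words that glue surname parts together (from the rules above).
PARTICLES = %w[de y e].freeze

# Hypothetical helper: returns the name with composite surnames merged
# into single elements.
def join_composite_surnames(words)
  words.each_with_object([]) do |word, parts|
    if parts.empty?
      parts << word
    elsif PARTICLES.include?(parts.last.split.last.downcase)
      # The previous part ends in a particle, so this word completes it.
      parts[-1] = "#{parts.last} #{word}"
    elsif PARTICLES.include?(word.downcase) && parts.length > 1
      # A particle after a surname links onto that surname.
      parts[-1] = "#{parts.last} #{word}"
    else
      # Ordinary word, or a particle right after the forename
      # (as in Angel De Leon), starts a new part.
      parts << word
    end
  end
end

p join_composite_surnames(%w[Marquilla Gutierrez Petronilda De Leon])
p join_composite_surnames(%w[Angel De Leon])
```

For the two inputs above this yields three parts and two parts respectively, which is the shape the later step expects.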
Once that's done, figuring out which part is which becomes easier.
name = {}
if parts.length == 1
  # A single part cannot be split into forename and surname.
  raise ArgumentError, "cannot split a one-word name: #{parts.inspect}"
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
  name = {
    forename: parts[0],
    paternal_surname: parts[1]
  }
# forename paternal_surname maternal_surname
else
  # The forename can have multiple parts, so work from the
  # end and whatever's left is their forename.
  name[:maternal_surname] = parts.pop
  name[:paternal_surname] = parts.pop
  name[:forename] = parts.join(" ")
end
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. For example, I'm pretty sure handling of de is not that simple. For example...
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'] which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' } or 'Marquilla Gutierrez Petronilda', who is married to someone whose paternal surname is Leon.
Good luck.
I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!

Ruby parse csv to usable hash

I'm making a graph of NBA team payrolls. I have this csv: http://www.basketball-reference.com/contracts/
Rk,Team,2016-17,2017-18,2018-19,2019-20,2020-21,2021-22
1,Cleveland Cavaliers,$123590274,$118585590,$111958508,$65464580,,
2,Los Angeles Clippers,$118663837,$111415942,$59741545,$2500725,,
3,Portland Trail Blazers,$115817639,$138409348,$121862764,$118747593,$62965110,
4,Dallas Mavericks,$112488859,$94965622,$62766454,$35361887,,
5,Memphis Grizzlies,$114096737,$97742075,$88773581,$86212960,$34504132,
6,San Antonio Spurs,$110631827,$96744068,$56909494,$25990334,,
7,Detroit Pistons,$110492645,$103423474,$97903566,$62944822,$28751775,
8,Orlando Magic,$110283846,$74591158,$60217493,$43122365,$17000000,
9,Toronto Raptors,$109253824,$102105184,$84967199,$51464677,$27739975,
10,Washington Wizards,$107257619,$99510391,$103018516,$49454077,$28751775,
11,Miami Heat,$105296376,$95803765,$94126541,$65558388,,
12,Golden State Warriors,$104677735,$66269011,$40907537,$20844187,,
13,New York Knicks,$104501658,$77290233,$78325386,$41195895,,
14,Milwaukee Bucks,$103338094,$112928334,$89209363,$76367783,$29393637,$1865547
15,Los Angeles Lakers,$102354756,$84416158,$62677311,$56232985,,
16,New Orleans Pelicans,$102177578,$78434268,$67966414,$65355463,$28751775,
17,Charlotte Hornets,$101879187,$91900885,$74203714,$53571467,$27130434,
18,Atlanta Hawks,$101793777,$70277529,$46071092,$25355630,,
19,Sacramento Kings,$99168232,$84890581,$27219195,$8350732,,
20,Houston Rockets,$97639130,$97370875,$73164132,$68025858,,
21,Chicago Bulls,$96908430,$81438211,$42663896,$24345313,,
22,Oklahoma City Thunder,$96181060,$68581249,$68960837,$8863055,,
23,Boston Celtics,$95289212,$77610987,$50413474,$45792877,,
24,Indiana Pacers,$91950761,$80519787,$62913923,,,
25,Minnesota Timberwolves,$84638527,$68085957,$37620814,$5348007,,
26,Utah Jazz,$84386693,$71138358,$17001288,,,
27,Phoenix Suns,$84297090,$71123131,$66028579,$26744625,,
28,Denver Nuggets,$79627212,$81852764,$49452791,$10497490,,
29,Brooklyn Nets,$78769729,$60910873,$25603813,$9344638,,
30,Philadelphia 76ers,$75336267,$50155264,$26385506,$14125600,,
In ruby I wrote:
def payrolls
payrolls = {}
CSV.foreach("payrolls.csv", :headers => true, :header_converters => :symbol, :converters => :all) do |row|
payrolls[row.fields[1]] = Hash[row.headers[1..-1].zip(row.fields[1..-1])]
end
puts payrolls.inspect
end
Which outputs:
{
"Cleveland Cavaliers"=>{:team=>"Cleveland Cavaliers", :"201617"=>"$123590274", :"201718"=>"$118585590", :"201819"=>"$111958508", :"201920"=>"$65464580", :"202021"=>nil, :"202122"=>nil},
"Los Angeles Clippers"=>{:team=>"Los Angeles Clippers", :"201617"=>"$118663837", :"201718"=>"$111415942", :"201819"=>"$59741545", :"201920"=>"$2500725", :"202021"=>nil, :"202122"=>nil
}
Which is fairly usable. However, since the years heading is a number, when I use
payrolls[Cleveland Cavaliers][:201617]
I'm getting this error:
payrolls.rb:31: syntax error, unexpected tINTEGER, expecting tSTRING_CONTENT or tSTRING_DBEG or tSTRING_DVAR or tSTRING_END
puts payrolls["Cleveland Cavaliers"][:201617]
Thus, what is the best way to get the salary data for the graph?
Your code has a syntax error which I fixed: you didn't have the second to last }:
hash = {
"Cleveland Cavaliers"=>{:team=>"Cleveland Cavaliers", :"201617"=>"$123590274", :"201718"=>"$118585590", :"201819"=>"$111958508", :"201920"=>"$65464580", :"202021"=>nil, :"202122"=>nil},
"Los Angeles Clippers"=>{:team=>"Los Angeles Clippers", :"201617"=>"$118663837", :"201718"=>"$111415942", :"201819"=>"$59741545", :"201920"=>"$2500725", :"202021"=>nil, :"202122"=>nil}
}
You should access the hash by the key. Neither Cleveland Cavaliers nor :201617 are valid objects, thus you get the error.
The key you are looking at is :"201617":
hash["Cleveland Cavaliers"][:"201617"]
#=> "$123590274"
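Alternatively, drop the :header_converters => :symbol option so the headers stay plain strings, then index with the header text as it appears in the file:
payrolls["Cleveland Cavaliers"]["2016-17"]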
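For graphing you will probably want numbers rather than strings such as "$123590274". One possible sketch, assuming a cut-down copy of the CSV held in a string and a hypothetical dollars converter (neither is from the original code), keeps the headers as plain strings and strips the dollar sign:

```ruby
require 'csv'

# Shortened, illustrative copy of the payroll data.
csv_text = <<~ROWS
  Rk,Team,2016-17,2017-18,2018-19
  1,Cleveland Cavaliers,$123590274,$118585590,$111958508
  24,Indiana Pacers,$91950761,$80519787,
ROWS

# Turn "$123590274" into 123590274; leave every other field (including nil) alone.
dollars = ->(field) { field.to_s.match?(/\A\$\d+\z/) ? field.delete('$').to_i : field }

table = CSV.parse(csv_text, headers: true, converters: [dollars])

payrolls = table.each_with_object({}) do |row, result|
  # Keep only the season columns for each team.
  result[row['Team']] = row.to_h.reject { |header, _| %w[Rk Team].include?(header) }
end

p payrolls['Cleveland Cavaliers']['2016-17']
p payrolls['Indiana Pacers']['2018-19']
```

Seasons with no committed salary come back as nil, so the graphing code needs to decide whether to skip them or treat them as zero.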

Reading a line from a text file, splitting the string version of that line into two parts

Newbie learning Ruby.
I am trying to take a txt file and, on each line, take the first 3 characters and assign them as a key, and the rest of the string as that key's value.
f = File.open("textfile.txt", "r")
finalHash = {"Key" => "Data"}
lineString = ""
while f.gets != nil do
lineString = f.gets
part1 = lineString.slice(0, 2)
part2 = lineString.slice(3, lineString.length)
finalHash[:part1] = part2
end
puts finalHash
Any advice is appreciated!
The 2nd parameter of slice is the length, not the end index, so change:
part1 = lineString.slice(0, 2)
to:
part1 = lineString.slice(0, 3)
If passed a start index and a length, returns a substring containing
length characters starting at the index
Also, you don't need to compute a length for the second part (this is not a bug, though):
part2 = lineString.slice(3, lineString.length)
A range running to the end of the string is enough:
part2 = lineString.slice(3..-1)
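A quick way to check the different forms of slice is to try them on one sample string. The string below is only an illustration:

```ruby
line = "time for all"

start_len2 = line.slice(0, 2)   # start index 0, length 2
start_len3 = line.slice(0, 3)   # start index 0, length 3
rest       = line.slice(3..-1)  # from index 3 to the end of the string
single     = line.slice(3)      # a lone index returns just one character

p start_len2  # "ti"
p start_len3  # "tim"
p rest        # "e for all"
p single      # "e"
```

Note that a lone integer index returns a single character, so the range form is the one to use for "everything after the key".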
Let's first create a file:
text = <<_
Now is the
time for all
good Rubiests
to come to the
aid of their
bowling team.
_
FName = 'temp'
File.write(FName, text)
#=> 80
Now read the file a line at a time and construct the desired hash:
File.foreach(FName).with_object({}) do |line, h|
h[line.slice!(0,3)] = line.chomp
end
#=> {"Now"=>" is the", "tim"=>"e for all", "goo"=>"d Rubiests",
# "to "=>"come to the", "aid"=>" of their", "bow"=>"ling team."}
After reading the first line,
h = { "Now"=>" is the" }
line = "time for all\n"
a = line.chomp
#=> "time for all"
b = a.slice!(0,3)
#=> "tim"
a #=> "e for all"
h[b] = a
#=> "e for all"
h #=> {"Now"=>" is the", "tim"=>"e for all"}
No direction is given if a line contains fewer than three characters. That may be something to consider.
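If short lines can occur, one option is simply to skip them. This is a sketch under that assumption (the question does not say what should happen), using a StringIO in place of the real file:

```ruby
require 'stringio'

# Hypothetical input: the second line is shorter than three characters.
io = StringIO.new("Now is the\nto\ntime for all\n")

hash = io.each_line.with_object({}) do |line, h|
  line = line.chomp
  next if line.length < 3    # assumption: lines too short to hold a key are ignored
  h[line[0, 3]] = line[3..-1]
end

p hash
```

A line of exactly three characters still produces a key, with an empty string as its value.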
lines = File.open("textfile.txt").read.split("\n")
hsh = {}
lines.each do |line|
next if line == ""
hsh[line[0..2]] = line[3..-1]
end
Or, using your method of slowly nibbling at the file:
f = File.open("textfile.txt")
hsh = {}
loop do
x = f.gets
break unless x
hsh[x[0..2]] = x[3..-1]
end
Borrowing @Cary's sample file...
text = <<_
Now is the
time for all
good Rubiests
to come to the
aid of their
bowling team.
_
FName = 'temp'
File.write(FName, text)
Now the file exists. Convert it to a 2 dimensional array. This array is trivially converted to a hash
File.foreach(FName).map{|x| [x.slice!(0,3), x]}.to_h
=> {"Now"=>" is the\n", "tim"=>"e for all\n", "goo"=>"d Rubiests\n", "to "=>"come to the\n", "aid"=>" of their\n", "bow"=>"ling team.\n"}
Here you go :
Sample data:
[zatcsv]$ cat foo.txt
TOK UPDATE DATE SHOT TIME AUXHEAT PHASE STATE PGASA PGASZ BGASA BGASZ BGASA2 BGASZ2 PIMPA
PIMPZ PELLET RGEO RMAG AMIN SEPLIM XPLIM KAPPA DELTA INDENT AREA VOL CONFIG IGRADB WALMAT DIVMAT LIMMAT EVAP
BT IP VSURF Q95 BEPMHD BETMHD BEPDIA NEL DNELDT ZEFF PRAD POHM ENBI PINJ BSOURCE PINJ2 BSOURCE2 COCTR PNBI ECHFREQ
ECHMODE ECHLOC PECH ICFREQ ICSCHEME ICANTEN PICRH LHFREQ LHNPAR PLH IBWFREQ PIBW TE0 TI0 WFANI WFICRH MEFF ISEQ WTH WTOT
JET 20031201 20001006 53521 1.000E+01 NBIC HSELM TRANS 2.000E+00 1.000E+00 2 1 0 0 1.658E+01 8.152E+00 NONE 2.888E+00
HEEH OIJ OIJJ 3.047E+00 9.807E-01 2.924E-02 7.304E-02 1.572E+00 1.781E-01 0.000E+00 4.572E+00 8.161E+01 LSN 1 IN/
2.000E+06 1.013E-01 6.001E+00 1.053E+00 9.252E-01 1.128E+00 3.106E+19 3.106E+19 6.612E+00 4.515E+06 5.122E+04 1.000E+05 1.466E+07
771706 0.000E+00 652114 1.000E+00 1.420E+07 -9.999E-09 NONE NONE 0.000E+00 5.100E+07 HMIN MONOPOLE 4.027E+06 3.700E+09 1.840E+00
2.000E+06 -9.999E-09 0.000E+00 9.295E+03 1.373E+04 6.913E-01 7.319E+05 2.000E+00 NONE 3.715E+06 5.381E+06 1.282E+06 1.297E+07 1.210E+07
something like this will do it for you :
[za csv]$cat text_to_hash.rb
#!/usr/bin/env ruby
file_dir = "/dir/to_folder/foo.txt"
thehash = Hash.new
File.foreach(file_dir) do |line|
  key = line.slice(0..2)    # first three characters
  val = line.slice(3..-1)   # rest of the line
  thehash[key] = val        # a repeated key overwrites the earlier value
  puts " Key: #{key} Value: #{val}"
end
Outputs:
[za csv]$ ./text_to_hash.rb
Key: TOK Value: UPDATE DATE SHOT TIME AUXHEAT PHASE STATE PGASA PGASZ BGASA BGASZ BGASA2 BGASZ2 PIMPA
Key: PIM Value: PZ PELLET RGEO RMAG AMIN SEPLIM XPLIM KAPPA DELTA INDENT AREA VOL CONFIG IGRADB WALMAT DIVMAT LIMMAT EVAP
Key: ECH Value: MODE ECHLOC PECH ICFREQ ICSCHEME ICANTEN PICRH LHFREQ LHNPAR PLH IBWFREQ PIBW TE0 TI0 WFANI WFICRH MEFF ISEQ WTH WTOT
Key: JET Value: 20031201 20001006 53521 1.000E+01 NBIC HSELM TRANS 2.000E+00 1.000E+00 2 1 0 0 1.658E+01 8.152E+00 NONE 2.888E+00
Key: HEE Value: H OIJ OIJJ 3.047E+00 9.807E-01 2.924E-02 7.304E-02 1.572E+00 1.781E-01 0.000E+00 4.572E+00 8.161E+01 LSN 1 IN/
Key: 2.0 Value: 00E+06 1.013E-01 6.001E+00 1.053E+00 9.252E-01 1.128E+00 3.106E+19 3.106E+19 6.612E+00 4.515E+06 5.122E+04 1.000E+05 1.466E+07
Key: 771 Value: 706 0.000E+00 652114 1.000E+00 1.420E+07 -9.999E-09 NONE NONE 0.000E+00 5.100E+07 HMIN MONOPOLE 4.027E+06 3.700E+09 1.840E+00
Key: 2.0 Value: 00E+06 -9.999E-09 0.000E+00 9.295E+03 1.373E+04 6.913E-01 7.319E+05 2.000E+00 NONE 3.715E+06 5.381E+06 1.282E+06 1.297E+07 1.210E+07
Key: 4.4 Value: 45E-01 2.194E-01

Create array from csv using readlines ruby

I can’t seem to get this to work
I know I can do this with the csv gem, but I'm trying out new stuff and I want to do it this way. All I'm trying to do is read lines in from a csv and then create one array from each line. I then want to put the second element in each array.
So far I have
filed="/Users/me/Documents/Workbook3.csv"
if File.exist?(filed)
  File.readlines(filed).map {|d| puts d.split(",").to_a}
else
  puts "No file here"
end
The problem is that this creates one array which has all the lines in it whereas I want a separate array for each line (perhaps an array of arrays?)
Test data
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
What I would like
S5411
B5406
S5398
Let's write your data to a file:
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
IO.write('temp',s)
#=> 363
We can then do this:
arr = File.readlines('temp').map { |s| s.split(',') }
#=> [["Trade date", "Settle date", "Reference", "Description", "Unit cost (p)",
"Quantity", "Value (pounds)\n"],
["04/09/2014", "09/09/2014", "S5411",
"Plus500 Ltd ILS0.01 152 # 419", "419", "152", "624.93\n"],
["02/09/2014", "05/09/2014", "B5406",
"Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75",
"4284.75", "150", "-6439.08\n"],
["29/08/2014", "03/09/2014", "S5398",
"Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84", "1116.84",
"520", "5795.62\n"]]
The values you want begin at the second element of arr, and each one is the third element of its array. Therefore, you can pluck them out as follows:
arr[1..-1].map { |a| a[2] }
#=> ["S5411", "B5406", "S5398"]
Adopting @Stefan's suggestion of putting [2] within the block containing split, we can write this more compactly as follows:
File.readlines('temp')[1..-1].map { |s| s.split(',')[2] }
#=> ["S5411", "B5406", "S5398"]
You can also use built-in class CSV to do this very easily.
require "csv"
s =<<THE_BITTER_END
Trade date,Settle date,Reference,Description,Unit cost (p),Quantity,Value (pounds)
04/09/2014,09/09/2014,S5411,Plus500 Ltd ILS0.01 152 # 419,419,152,624.93
02/09/2014,05/09/2014,B5406,Biomarin Pharmaceutical Com Stk USD0.001 150 # 4284.75,4284.75,150,-6439.08
29/08/2014,03/09/2014,S5398,Hargreaves Lansdown plc Ordinary 0.4p 520 # 1116.84,1116.84,520,5795.62
THE_BITTER_END
arr = CSV.parse(s, :headers=>true).collect { |row| row["Reference"] }
p arr
#=> ["S5411", "B5406", "S5398"]
PS: I have borrowed the string from @Cary's answer
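One practical reason to prefer CSV over split(",") is quoted fields. The row below is hypothetical (the test data has no quoted commas), but it shows how the two approaches can diverge:

```ruby
require 'csv'

# Hypothetical row whose description contains a comma inside quotes.
line = %(04/09/2014,09/09/2014,S5411,"Plus500 Ltd, ILS0.01",419,152,624.93\n)

naive  = line.chomp.split(',')   # splits on every comma, quoted or not
parsed = CSV.parse_line(line)    # honours the quotes

p naive.length   # the quoted description is cut in two
p parsed.length
p parsed[3]
```

If the data is guaranteed never to contain quoted commas, the plain split version is fine.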

Interpreting this raw text - a strategy?

I have this raw text:
________________________________________________________________________________________________________________________________
Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
3 37 Bruce Cook Bruce Cook Ford Escort 3759 10 9:56.4388 4 0:58.3359
4 18 Troy Marinelli Troy Marinelli Nissan Silvia 3396 10 9:56.7758 2 0:58.4443
5 75 Anthony Gilbertson Anthony Gilbertson BMW M3 3200 10 10:02.5842 3 0:58.9336
6 26 Trent Purcell Trent Purcell Mazda RX7 2354 10 10:07.6285 4 0:59.0546
7 12 Scott Hunter Scott Hunter Toyota Corolla 2000 10 10:11.3722 5 0:59.8921
8 91 Graeme Wilkinson Graeme Wilkinson Ford Escort 2000 10 10:13.4114 5 1:00.2175
9 7 Justin Wade Justin Wade BMW M3 4000 10 10:18.2020 9 1:00.8969
10 55 Greg Craig Grag Craig Toyota Corolla 1840 10 10:18.9956 7 1:00.7905
11 46 Kyle Orgam-Moore Kyle Organ-Moore Holden VS Commodore 6000 10 10:30.0179 3 1:01.6741
12 39 Uptiles Strathpine Trent Spencer BMW Mini Cooper S 1500 10 10:40.1436 2 1:02.2728
13 177 Mark Hyde Mark Hyde Ford Escort 1993 10 10:49.5920 2 1:03.8069
14 34 Peter Draheim Peter Draheim Mazda RX3 2600 10 10:50.8159 10 1:03.4396
15 5 Scott Douglas Scott Douglas Datsun 1200 1998 9 9:48.7808 3 1:01.5371
16 72 Paul Redman Paul Redman Ford Focus 2lt 9 10:11.3707 2 1:05.8729
17 8 Matthew Speakman Matthew Speakman Toyota Celica 1600 9 10:16.3159 3 1:05.9117
18 74 Lucas Easton Lucas Easton Toyota Celica 1600 9 10:16.8050 6 1:06.0748
19 77 Dean Fuller Dean Fuller Mitsubishi Sigma 2600 9 10:25.2877 3 1:07.3991
20 16 Brett Batterby Brett Batterby Toyota Corolla 1600 9 10:29.9127 4 1:07.8420
21 95 Ross Hurford Ross Hurford Toyota Corolla 1600 8 9:57.5297 2 1:12.2672
DNF 13 Charles Wright Charles Wright BMW 325i 2700 9 9:47.9888 7 1:03.2808
DNF 20 Shane Satchwell Shane Satchwell Datsun 1200 Coupe 1998 1 1:05.9100 1 1:05.9100
Fastest Lap Av.Speed Is 152kph, Race Av.Speed Is 148kph
R=under lap record by greatest margin, r=under lap record, *=fastest lap time
________________________________________________________________________________________________________________________________
Issue# 2 - Printed Sat May 26 15:43:31 2012 Timing System By NATSOFT (03)63431311 www.natsoft.com.au/results
Amended
I need to parse it into an object with the obvious Position, Car, Driver etc fields. The issue is I have no idea on what sort of strategy to use. If I split it on whitespace, I would end up with a list like so:
["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
Can you see the issue? I cannot just interpret this list, because people may have just 1 name, or 3 words in a name, or many different words in a car. It makes it impossible to just reference the list using indexes alone.
What about using the offsets defined by the column names? I can't quite see how that could be used though.
Edit: So the current algorithm I am using works like this:
Split the text on new line giving a collection of lines.
Find the common whitespace characters FURTHEST RIGHT on each line. I.e. the positions (indexes) on each line where every other
line contains whitespace. EG:
Split the lines based on those common characters.
Trim the lines
Several issues exist:
If the names contain the same lengths like so:
Jason Adams
Bobby Sacka
Jerry Louis
Then it will interpret that as two separate items: (["Jason", "Adams", "Bobby", "Sacka", "Jerry", "Louis"]).
Whereas if they all differed like so:
Dominic Bou
Bob Adams
Jerry Seinfeld
Then it would correctly split on the last 'd' in Seinfeld (and thus we'd get a collection of three names(["Dominic Bou", "Bob Adams", "Jerry Seinfeld"]).
It's also quite brittle. I am looking for a nicer solution.
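The failure mode described above is easy to reproduce. This is only a sketch of the idea, with a hypothetical helper blank_columns standing in for the real implementation:

```ruby
# Hypothetical helper: indexes at which every line is blank (or already ended).
def blank_columns(lines)
  width = lines.map(&:length).max
  (0...width).select do |i|
    lines.all? { |line| line[i].nil? || line[i] == ' ' }
  end
end

same_length = ['Jason Adams', 'Bobby Sacka', 'Jerry Louis']
mixed       = ['Dominic Bou', 'Bob Adams', 'Jerry Seinfeld']

p blank_columns(same_length)  # a shared blank column inside the names
p blank_columns(mixed)        # no shared blank column
```

With the equal-length names a blank column appears in the middle of the field, which is why the names get split in two.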
This is not a good case for regex, you really want to discover the format and then unpack the lines:
lines = str.split "\n"
# you know the field names so you can use them to find the column positions
fields = ['Pos', 'Car', 'Competitor/Team', 'Driver', 'Vehicle', 'Cap', 'CL Laps', 'Race.Time', 'Fastest...Lap']
header = lines.shift until header =~ /^Pos/
positions = fields.map{|f| header.index f}
# use that to construct an unpack format string
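# each width is the gap between column starts; "A*" takes the rest of the line for the last column
format = 1.upto(positions.length-1).map{|x| "A#{positions[x] - positions[x-1]}"}.join + 'A*'
# A4A5A31A25A21A6A12A10A*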
lines.each do |line|
next unless line =~ /^(\d|DNF)/ # skip lines you're not interested in
data = line.unpack(format).map{|x| x.strip}
puts data.join(', ')
# or better yet...
car = Hash[fields.zip data]
puts car['Driver']
end
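To see the offsets-to-unpack idea in isolation, here is a small self-contained sketch. The miniature table and its column widths are made up for illustration, and it assumes single-byte (ASCII) text, since unpack counts bytes rather than characters:

```ruby
# Hypothetical miniature table; ljust pads each cell so the columns line up.
records = [
  ['Pos', 'Car', 'Driver', 'Vehicle'],
  ['1', '6', 'Jason Clements', 'BMW M3'],
  ['DNF', '20', 'Shane Satchwell', 'Datsun 1200 Coupe']
]
lines = records.map do |pos, car, driver, vehicle|
  pos.ljust(4) + car.ljust(4) + driver.ljust(17) + vehicle
end

fields = records.first
header = lines.shift

# Column start offsets come from where each name appears in the header.
starts = fields.map { |name| header.index(name) }

# "A<n>" reads n bytes and strips trailing spaces; "A*" takes the rest of the line.
template = starts.each_cons(2).map { |a, b| "A#{b - a}" }.join + 'A*'

rows = lines.map { |line| fields.zip(line.unpack(template)).to_h }

p template
rows.each { |row| p row }
```

Multi-word values such as the driver and vehicle names stay intact because the split is by position, not by whitespace.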
Slither, a DSL for parsing fixed-width text files, may solve your problem: http://blog.ryanwood.com/past/2009/6/12/slither-a-dsl-for-parsing-fixed-width-text-files
There are a few more examples, and the project is on GitHub.
Hope this helps!
I think it is easy enough to just use the fixed width on each line.
#!/usr/bin/env ruby
# ruby parsing_winner.rb winners_list.txt
# stop with a usage message when no file name is given
abort "Usage: ruby parsing_winner.rb winners_list.txt" if ARGV.empty?
winner_file = File.open(ARGV.shift)
array_of_race_results, array_of_race_results_array = [], []
class RaceResult
attr_accessor :position, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest, :fastest_lap
def initialize(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
@position = position
@car = car
@team = team
@driver = driver
@vehicle = vehicle
@cap = cap
@cl_laps = cl_laps
@race_time = race_time
@fastest = fastest
@fastest_lap = fastest_lap
end
def to_a
# ["1", "6", "Jason", "Clements", "Jason", "Clements", "BMW", "M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"]
[position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap]
end
end
# Pos Car Competitor/Team Driver Vehicle Cap CL Laps Race.Time Fastest...Lap
# 1 6 Jason Clements Jason Clements BMW M3 3200 10 9:48.5710 3 0:57.3228*
# 2 42 David Skillender David Skillender Holden VS Commodore 6000 10 9:55.6866 2 0:57.9409
# etc...
winner_file.each_line do |line|
next if line[/^____/] || line[/^\w{4,}|^\s|^Pos/] || line[0..3][/\=/]
position = line[0..3].strip
car = line[4..8].strip
team = line[9..39].strip
driver = line[40..64].strip
vehicle = line[65..85].strip
cap = line[86..91].strip
cl_laps = line[92..101].strip
race_time = line[102..113].strip
fastest = line[114..116].strip
fastest_lap = line[117..-1].strip
racer = RaceResult.new(position, car, team, driver, vehicle, cap, cl_laps, race_time, fastest, fastest_lap)
array_of_race_results << racer
array_of_race_results_array << racer.to_a
end
puts "Race Results Objects: #{array_of_race_results}"
puts "Race Results: #{array_of_race_results_array.inspect}"
Output =>
Race Results Objects: [#<RaceResult:0x007fcc4a84b7c8 #position="1", #car="6", #team="Jason Clements", #driver="Jason Clements", #vehicle="BMW M3", #cap="3200", #cl_laps="10", #race_time="9:48.5710", #fastest="3", #fastest_lap="0:57.3228*">, #<RaceResult:0x007fcc4a84aa08 #position="2", #car="42", #team="David Skillender", #driver="David Skillender", #vehicle="Holden VS Commodore", #cap="6000", #cl_laps="10", #race_time="9:55.6866", #fastest="2", #fastest_lap="0:57.9409">, #<RaceResult:0x007fcc4a849ce8 #position="3", #car="37", #team="Bruce Cook", #driver="Bruce Cook", #vehicle="Ford Escort", #cap="3759", #cl_laps="10", #race_time="9:56.4388", #fastest="4", #fastest_lap="0:58.3359">, #<RaceResult:0x007fcc4a8491f8 #position="4", #car="18", #team="Troy Marinelli", #driver="Troy Marinelli", #vehicle="Nissan Silvia", #cap="3396", #cl_laps="10", #race_time="9:56.7758", #fastest="2", #fastest_lap="0:58.4443">, #<RaceResult:0x007fcc4b091ab8 #position="5", #car="75", #team="Anthony Gilbertson", #driver="Anthony Gilbertson", #vehicle="BMW M3", #cap="3200", #cl_laps="10", #race_time="10:02.5842", #fastest="3", #fastest_lap="0:58.9336">, #<RaceResult:0x007fcc4b0916a8 #position="6", #car="26", #team="Trent Purcell", #driver="Trent Purcell", #vehicle="Mazda RX7", #cap="2354", #cl_laps="10", #race_time="10:07.6285", #fastest="4", #fastest_lap="0:59.0546">, #<RaceResult:0x007fcc4b091298 #position="7", #car="12", #team="Scott Hunter", #driver="Scott Hunter", #vehicle="Toyota Corolla", #cap="2000", #cl_laps="10", #race_time="10:11.3722", #fastest="5", #fastest_lap="0:59.8921">, #<RaceResult:0x007fcc4b090e88 #position="8", #car="91", #team="Graeme Wilkinson", #driver="Graeme Wilkinson", #vehicle="Ford Escort", #cap="2000", #cl_laps="10", #race_time="10:13.4114", #fastest="5", #fastest_lap="1:00.2175">, #<RaceResult:0x007fcc4b090a78 #position="9", #car="7", #team="Justin Wade", #driver="Justin Wade", #vehicle="BMW M3", #cap="4000", #cl_laps="10", #race_time="10:18.2020", #fastest="9", 
#fastest_lap="1:00.8969">, #<RaceResult:0x007fcc4b090668 #position="10", #car="55", #team="Greg Craig", #driver="Grag Craig", #vehicle="Toyota Corolla", #cap="1840", #cl_laps="10", #race_time="10:18.9956", #fastest="7", #fastest_lap="1:00.7905">, #<RaceResult:0x007fcc4b090258 #position="11", #car="46", #team="Kyle Orgam-Moore", #driver="Kyle Organ-Moore", #vehicle="Holden VS Commodore", #cap="6000", #cl_laps="10", #race_time="10:30.0179", #fastest="3", #fastest_lap="1:01.6741">, #<RaceResult:0x007fcc4b08fe48 #position="12", #car="39", #team="Uptiles Strathpine", #driver="Trent Spencer", #vehicle="BMW Mini Cooper S", #cap="1500", #cl_laps="10", #race_time="10:40.1436", #fastest="2", #fastest_lap="1:02.2728">, #<RaceResult:0x007fcc4b08fa38 #position="13", #car="177", #team="Mark Hyde", #driver="Mark Hyde", #vehicle="Ford Escort", #cap="1993", #cl_laps="10", #race_time="10:49.5920", #fastest="2", #fastest_lap="1:03.8069">, #<RaceResult:0x007fcc4b08f628 #position="14", #car="34", #team="Peter Draheim", #driver="Peter Draheim", #vehicle="Mazda RX3", #cap="2600", #cl_laps="10", #race_time="10:50.8159", #fastest="10", #fastest_lap="1:03.4396">, #<RaceResult:0x007fcc4b08f218 #position="15", #car="5", #team="Scott Douglas", #driver="Scott Douglas", #vehicle="Datsun 1200", #cap="1998", #cl_laps="9", #race_time="9:48.7808", #fastest="3", #fastest_lap="1:01.5371">, #<RaceResult:0x007fcc4b08ee08 #position="16", #car="72", #team="Paul Redman", #driver="Paul Redman", #vehicle="Ford Focus", #cap="2lt", #cl_laps="9", #race_time="10:11.3707", #fastest="2", #fastest_lap="1:05.8729">, #<RaceResult:0x007fcc4b08e9f8 #position="17", #car="8", #team="Matthew Speakman", #driver="Matthew Speakman", #vehicle="Toyota Celica", #cap="1600", #cl_laps="9", #race_time="10:16.3159", #fastest="3", #fastest_lap="1:05.9117">, #<RaceResult:0x007fcc4b08e5e8 #position="18", #car="74", #team="Lucas Easton", #driver="Lucas Easton", #vehicle="Toyota Celica", #cap="1600", #cl_laps="9", 
#race_time="10:16.8050", #fastest="6", #fastest_lap="1:06.0748">, #<RaceResult:0x007fcc4b08e1d8 #position="19", #car="77", #team="Dean Fuller", #driver="Dean Fuller", #vehicle="Mitsubishi Sigma", #cap="2600", #cl_laps="9", #race_time="10:25.2877", #fastest="3", #fastest_lap="1:07.3991">, #<RaceResult:0x007fcc4b08ddc8 #position="20", #car="16", #team="Brett Batterby", #driver="Brett Batterby", #vehicle="Toyota Corolla", #cap="1600", #cl_laps="9", #race_time="10:29.9127", #fastest="4", #fastest_lap="1:07.8420">, #<RaceResult:0x007fcc4a848348 #position="21", #car="95", #team="Ross Hurford", #driver="Ross Hurford", #vehicle="Toyota Corolla", #cap="1600", #cl_laps="8", #race_time="9:57.5297", #fastest="2", #fastest_lap="1:12.2672">, #<RaceResult:0x007fcc4a847948 #position="DNF", #car="13", #team="Charles Wright", #driver="Charles Wright", #vehicle="BMW 325i", #cap="2700", #cl_laps="9", #race_time="9:47.9888", #fastest="7", #fastest_lap="1:03.2808">, #<RaceResult:0x007fcc4a847010 #position="DNF", #car="20", #team="Shane Satchwell", #driver="Shane Satchwell", #vehicle="Datsun 1200 Coupe", #cap="1998", #cl_laps="1", #race_time="1:05.9100", #fastest="1", #fastest_lap="1:05.9100">]
Race Results: [["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3", "0:57.3228*"], ["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2", "0:57.9409"], ["3", "37", "Bruce Cook", "Bruce Cook", "Ford Escort", "3759", "10", "9:56.4388", "4", "0:58.3359"], ["4", "18", "Troy Marinelli", "Troy Marinelli", "Nissan Silvia", "3396", "10", "9:56.7758", "2", "0:58.4443"], ["5", "75", "Anthony Gilbertson", "Anthony Gilbertson", "BMW M3", "3200", "10", "10:02.5842", "3", "0:58.9336"], ["6", "26", "Trent Purcell", "Trent Purcell", "Mazda RX7", "2354", "10", "10:07.6285", "4", "0:59.0546"], ["7", "12", "Scott Hunter", "Scott Hunter", "Toyota Corolla", "2000", "10", "10:11.3722", "5", "0:59.8921"], ["8", "91", "Graeme Wilkinson", "Graeme Wilkinson", "Ford Escort", "2000", "10", "10:13.4114", "5", "1:00.2175"], ["9", "7", "Justin Wade", "Justin Wade", "BMW M3", "4000", "10", "10:18.2020", "9", "1:00.8969"], ["10", "55", "Greg Craig", "Grag Craig", "Toyota Corolla", "1840", "10", "10:18.9956", "7", "1:00.7905"], ["11", "46", "Kyle Orgam-Moore", "Kyle Organ-Moore", "Holden VS Commodore", "6000", "10", "10:30.0179", "3", "1:01.6741"], ["12", "39", "Uptiles Strathpine", "Trent Spencer", "BMW Mini Cooper S", "1500", "10", "10:40.1436", "2", "1:02.2728"], ["13", "177", "Mark Hyde", "Mark Hyde", "Ford Escort", "1993", "10", "10:49.5920", "2", "1:03.8069"], ["14", "34", "Peter Draheim", "Peter Draheim", "Mazda RX3", "2600", "10", "10:50.8159", "10", "1:03.4396"], ["15", "5", "Scott Douglas", "Scott Douglas", "Datsun 1200", "1998", "9", "9:48.7808", "3", "1:01.5371"], ["16", "72", "Paul Redman", "Paul Redman", "Ford Focus", "2lt", "9", "10:11.3707", "2", "1:05.8729"], ["17", "8", "Matthew Speakman", "Matthew Speakman", "Toyota Celica", "1600", "9", "10:16.3159", "3", "1:05.9117"], ["18", "74", "Lucas Easton", "Lucas Easton", "Toyota Celica", "1600", "9", "10:16.8050", "6", "1:06.0748"], ["19", "77", 
"Dean Fuller", "Dean Fuller", "Mitsubishi Sigma", "2600", "9", "10:25.2877", "3", "1:07.3991"], ["20", "16", "Brett Batterby", "Brett Batterby", "Toyota Corolla", "1600", "9", "10:29.9127", "4", "1:07.8420"], ["21", "95", "Ross Hurford", "Ross Hurford", "Toyota Corolla", "1600", "8", "9:57.5297", "2", "1:12.2672"], ["DNF", "13", "Charles Wright", "Charles Wright", "BMW 325i", "2700", "9", "9:47.9888", "7", "1:03.2808"], ["DNF", "20", "Shane Satchwell", "Shane Satchwell", "Datsun 1200 Coupe", "1998", "1", "1:05.9100", "1", "1:05.9100"]]
You can use the fixed_width gem.
Your given file can be parsed with the following code:
require 'fixed_width'
require 'pp'
FixedWidth.define :cars do |d|
d.head do |head|
head.trap { |line| line !~ /\d/ }
end
d.body do |body|
body.trap { |line| line =~ /^(\d|DNF)/ }
body.column :pos, 4
body.column :car, 5
body.column :competitor, 31
body.column :driver, 25
body.column :vehicle, 21
body.column :cap, 5
body.column :cl_laps, 11
body.column :race_time, 11
body.column :fast_lap_no, 4
body.column :fast_lap_time, 10
end
end
pp FixedWidth.parse(File.open("races.txt"), :cars)
The trap method identifies the lines in each section. I used regex:
The head regex looks for lines that don't contain a digit.
The body regex looks for lines starting with a digit or "DNF"
Sections must be contiguous: each section has to begin on the line immediately after the previous one ends. Each column definition gives the number of characters to read for that field. The library strips whitespace for you. If you wanted to produce a fixed-width file, you could add alignment parameters, but it doesn't appear you will need that.
The result is a hash that starts like this:
{:head=>[{}, {}, {}],
:body=>
[{:pos=>"1",
:car=>"6",
:competitor=>"Jason Clements",
:driver=>"Jason Clements",
:vehicle=>"BMW M3",
:cap=>"3200",
:cl_laps=>"10",
:race_time=>"9:48.5710",
:fast_lap_no=>"3",
:fast_lap_time=>"0:57.3228"},
{:pos=>"2",
:car=>"42",
:competitor=>"David Skillender",
:driver=>"David Skillender",
:vehicle=>"Holden VS Commodore",
:cap=>"6000",
:cl_laps=>"10",
:race_time=>"9:55.6866",
:fast_lap_no=>"2",
:fast_lap_time=>"0:57.9409"},
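If you would rather not add a gem, the same width-based slicing can be sketched in plain Ruby with String#unpack. This is only a rough equivalent, not how the fixed_width gem works internally. The widths are copied from the definition above; parse_fixed_row and the padded sample line are hypothetical names and data made up for the illustration, and the "A" directive counts bytes, so it assumes ASCII input.

```ruby
# Field widths copied from the FixedWidth definition above.
WIDTHS = {
  pos: 4, car: 5, competitor: 31, driver: 25, vehicle: 21,
  cap: 5, cl_laps: 11, race_time: 11, fast_lap_no: 4, fast_lap_time: 10
}.freeze

# Build an unpack format such as "A4A5A31...": each "A<n>" reads n bytes
# and drops trailing spaces.
UNPACK_FORMAT = WIDTHS.values.map { |width| "A#{width}" }.join

# Slice one data line into a hash keyed by column name (hypothetical helper).
def parse_fixed_row(line)
  values = line.unpack(UNPACK_FORMAT).map(&:strip)
  WIDTHS.keys.zip(values).to_h
end

# Hypothetical sample row, padded to the widths above for illustration only.
SAMPLE_VALUES = ["1", "6", "Jason Clements", "Jason Clements", "BMW M3",
                 "3200", "10", "9:48.5710", "3", "0:57.3228"].freeze
SAMPLE_LINE = SAMPLE_VALUES.zip(WIDTHS.values)
                           .map { |value, width| value.ljust(width) }
                           .join

row = parse_fixed_row(SAMPLE_LINE)
puts row[:driver]
puts row[:fast_lap_time]
```

You would still need to skip the header and footer lines yourself, for example with the same /^(\d|DNF)/ test used in the body trap.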
Depending on how consistent the formatting is, you can probably use regex for this.
Here is a sample regex that works for the current data - may need to be tweaked depending on precise rules, but it gives the idea:
^
# Pos
(\d+|DNF)
\s+
#Car
(\d+)
\s+
# Team
([\w-]+(?:[ ][\w-]+)+)
\s+
# Driver
([\w-]+(?:[ ][\w-]+)+)
\s+
# Vehicle
([\w-]+(?:[ ]?[\w-]+)+)
\s+
# Cap
(\d{4}|\dlt)
\s+
# CL Laps
(\d+)
\s+
# Race.Time
(\d+:\d+\.\d+)
\s+
# Fastest Lap
(\d+)
\s+
# Fastest Lap Time
(\d+:\d+\.\d+\*?)
\s*
$
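For reference, here is roughly how that pattern could be used from Ruby. Treat it as a sketch: the group names, the parse_result_line helper and the sample lines are my own additions, and the sample spacing is made up rather than copied from the real file. One thing to watch is that Ruby's /x (extended) flag ignores literal spaces in the pattern, so any space that has to match is written as [ ].

```ruby
# Extended-mode version of the pattern with named groups.
# Under /x a bare space is ignored, so a space that must match is "[ ]".
ROW_PATTERN = /
  ^
  (?<pos>\d+|DNF)                          \s+
  (?<car>\d+)                              \s+
  (?<team>[\w-]+(?:[ ][\w-]+)+)            \s+
  (?<driver>[\w-]+(?:[ ][\w-]+)+)          \s+
  (?<vehicle>[\w-]+(?:[ ]?[\w-]+)+)        \s+
  (?<cap>\d{4}|\dlt)                       \s+
  (?<cl_laps>\d+)                          \s+
  (?<race_time>\d+:\d+\.\d+)               \s+
  (?<fast_lap_no>\d+)                      \s+
  (?<fast_lap_time>\d+:\d+\.\d+\*?)        \s*
  $
/x

# Returns a hash of column name => text, or nil when the line is not a
# result row (hypothetical helper).
def parse_result_line(line)
  match = ROW_PATTERN.match(line)
  match && match.named_captures
end

# Hypothetical sample lines; columns are separated by several spaces here.
SAMPLE_RESULT_LINES = [
  "1     6    Jason Clements      Jason Clements      BMW M3      3200   10   9:48.5710   3   0:57.3228*",
  "DNF   13   Charles Wright      Charles Wright      BMW 325i    2700   9    9:47.9888   7   1:03.2808"
].freeze

SAMPLE_RESULT_LINES.each { |line| p parse_result_line(line) }
```

Lines that do not match (headers, blank lines, footers) come back as nil, so they are easy to filter out with compact.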
If you can verify that the whitespace is space characters rather than tabs, and that overlong text is always truncated to fit the column structure, then I'd hard-code the slice boundaries:
parsed = [rawLine[0:3],rawLine[4:7],rawLine[9:38], ...etc... ]
Depending on the data source, this may be brittle (if, for instance, every run has different column widths).
If the header row is always the same, you could extract the slice boundaries by searching for the known words of the header row.
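A rough Ruby sketch of the header idea, since the question is about Ruby: find where each known title starts in the header row and use those offsets as slice boundaries. The titles are the ones mentioned elsewhere in this thread. Everything else is an assumption on my part: the helper names, the demo widths, the demo header and row, and the premise that every value starts at or after the offset of its title (right-aligned columns would need the boundaries shifted).

```ruby
# Column titles as named in this thread; the header layout is an assumption.
HEADER_NAMES = ["Pos", "Car", "Competitor/Team", "Driver", "Vehicle",
                "Cap", "CL Laps", "Race.Time", "Fastest...Lap"].freeze

# Find where each title starts in the header row and turn those offsets
# into slice ranges; the last column runs to the end of the line.
def column_ranges(header_line, names = HEADER_NAMES)
  search_from = 0
  starts = names.map do |name|
    offset = header_line.index(name, search_from)
    raise ArgumentError, "header #{name.inspect} not found" if offset.nil?
    search_from = offset + name.length
    offset
  end
  starts.each_with_index.map do |start, i|
    next_start = starts[i + 1]
    next_start ? (start...next_start) : (start..-1)
  end
end

# Cut one data line at the given ranges and trim each piece.
def slice_row(line, ranges)
  ranges.map { |range| line[range].to_s.strip }
end

# Hypothetical demo data: header and row padded to made-up widths.
DEMO_WIDTHS = [4, 5, 31, 25, 21, 5, 11, 11, 14].freeze

def pad_cells(cells)
  cells.zip(DEMO_WIDTHS).map { |cell, width| cell.ljust(width) }.join
end

DEMO_HEADER = pad_cells(HEADER_NAMES)
DEMO_VALUES = ["1", "6", "Jason Clements", "Jason Clements", "BMW M3",
               "3200", "10", "9:48.5710", "3 0:57.3228*"].freeze
DEMO_ROW = pad_cells(DEMO_VALUES)

p slice_row(DEMO_ROW, column_ranges(DEMO_HEADER))
```

Because the ranges are recomputed from each file's header, this copes with runs that use different column widths, as long as the titles themselves do not change.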
Alright, I gotchu:
Edit: I forgot to mention, it's assuming you've stored your input text in the variable input_string
# Choose a delimiter that is unlikely to occur
DELIM = '|||'
# DRY -> extend String
class String
def split_on_spaces(min_spaces = 1)
self.strip.gsub(/\s{#{min_spaces},}/, DELIM).split(DELIM)
end
end
# just get the data lines
lines = input_string.split("\n")
lines = lines[2...(lines.length - 4)].delete_if { |line|
line.empty?
}
# Grab all the entries into a nice 2-d array
entries = lines.map { |line|
[
line[0..8].split_on_spaces,
line[9..85].split_on_spaces(3).map{ |string|
string.gsub(/\s+/, ' ') # replace whitespace with 1 space
},
line[85...line.length].split_on_spaces(2)
].flatten
}
# BONUS
# Make nice hashes
keys = [:pos, :car, :team, :driver, :vehicle, :cap, :cl_laps, :race_time, :fastest_lap]
objects = entries.map { |entry|
Hash[keys.zip entry]
}
Outputs:
entries # =>
["1", "6", "Jason Clements", "Jason Clements", "BMW M3", "3200", "10", "9:48.5710", "3 0:57.3228*"]
["2", "42", "David Skillender", "David Skillender", "Holden VS Commodore", "6000", "10", "9:55.6866", "2 0:57.9409"]
...
# all of length 9, no extra spaces
And in case arrays just don't cut it:
objects # =>
{:pos=>"1", :car=>"6", :team=>"Jason Clements", :driver=>"Jason Clements", :vehicle=>"BMW M3", :cap=>"3200", :cl_laps=>"10", :race_time=>"9:48.5710", :fastest_lap=>"3 0:57.3228*"}
{:pos=>"2", :car=>"42", :team=>"David Skillender", :driver=>"David Skillender", :vehicle=>"Holden VS Commodore", :cap=>"6000", :cl_laps=>"10", :race_time=>"9:55.6866", :fastest_lap=>"2 0:57.9409"}
...
I leave refactoring it into nice functions to you.
Unless there's a clear rule on how the columns are separated, you can't really do it.
The approach you have is good, assuming you know that each column value is properly indented to the column title.
Another approach could be to group words that are separated by exactly one space together (from the text you provided, I can see that this rule also holds).
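The single-space rule can be tried out in a couple of lines of Ruby: split wherever two or more whitespace characters occur, so words joined by one space stay together. This is a minimal sketch. The split_columns helper and the sample line are hypothetical, and it only works while every column gap is at least two spaces wide and no value contains a double space.

```ruby
# Split a row on runs of two or more whitespace characters, so words
# separated by exactly one space stay in the same field.
def split_columns(line)
  line.strip.split(/\s{2,}/)
end

# Hypothetical sample line; the real file's spacing may differ.
SPACED_LINE = "1    6    Jason Clements      Jason Clements      " \
              "BMW M3      3200   10   9:48.5710   3 0:57.3228*"

p split_columns(SPACED_LINE)
```

Note that the last field keeps the lap number and lap time together, because in this sample they are separated by a single space.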
Assuming the text will always be spaced the same, you could split the string based on position, then strip away extra spaces around each part. For example, in Python:
pos=row[0:3].strip()
car=row[4:7].strip()
and so on. Alternately, you could define a regular expression to capture each part:
([[:alnum:]]+)\s([[:digit:]]+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]+ )+)\s(([[:alpha:]]* )+)\s
and so on. (The exact syntax depends on your regexp grammar.) Note that the car regexp needs to handle the added spaces.
I'm not going to code this, but one way that definitely works for the above data set is by parsing it by white space and then assigning elements this way:
someArray = array of strings that were split by white space
Pos = someArray[0]
Car = someArray[1]
Competitor/Team = someArray[2] + " " + someArray[3]
Driver = someArray[4] + " " + someArray[5]
Vehicle = someArray[6] + " " + ... + " " + someArray[someArray.length - 6]
Cap = someArray[someArray.length - 5]
CL Laps = someArray[someArray.length - 4]
Race.Time = someArray[someArray.length - 3]
Fastest...Lap = someArray[someArray.length - 2] + " " + someArray[someArray.length - 1]
The vehicle part can be done by some sort of for or while loop.
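In Ruby the vehicle part does not even need a loop, because a range with a negative end index picks out the middle tokens. Here is a minimal sketch of the assignment above. The parse_by_token_position name, the hash keys and the sample line are my own, and it inherits the same assumptions: team and driver are always exactly two words, and the last five tokens are always cap, laps, race time, lap number and lap time.

```ruby
# Assign fields by counting whitespace-separated tokens from both ends.
# Returns nil when there are too few tokens to fill every field.
def parse_by_token_position(line)
  tokens = line.split
  # 2 (pos, car) + 2 (team) + 2 (driver) + at least 1 (vehicle) + 5 trailing
  return nil if tokens.length < 12

  {
    pos:         tokens[0],
    car:         tokens[1],
    team:        tokens[2..3].join(" "),
    driver:      tokens[4..5].join(" "),
    vehicle:     tokens[6..-6].join(" "),  # everything between driver and cap
    cap:         tokens[-5],
    cl_laps:     tokens[-4],
    race_time:   tokens[-3],
    fastest_lap: tokens[-2..-1].join(" ")
  }
end

# Hypothetical sample line built from one of the results shown earlier.
TOKEN_LINE = "DNF 20 Shane Satchwell Shane Satchwell Datsun 1200 Coupe " \
             "1998 1 1:05.9100 1 1:05.9100"

p parse_by_token_position(TOKEN_LINE)
```

A three-word team or driver name would shift every later field, so check the names in your data before relying on this.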
