Parse CSV file with headers when the headers are part way down the page - ruby

I have a CSV file that, as a spreadsheet, looks like this:
I want to parse the spreadsheet with the headers at row 19. Those headers won't always start at row 19, so my question is: is there a simple way to parse this spreadsheet and specify which row holds the headers, say by using the "Date" string to identify the header row?
Right now, I'm doing this:
CSV.foreach(params['logbook'].tempfile, headers: true) do |row|
Flight.create(row.to_hash)
end
but obviously that won't work because it doesn't get the right headers.
I feel like there should be a simple solution to this since it's pretty common to have CSV files in this format.

Let's first create the csv file that would be produced from the spreadsheet.
csv =<<-_
N211E,C172,2004,Cessna,172R,airplane,airplane
C-GPGT,C172,1976,Cessna,172M,airplane,airplane
N17AV,P28A,1983,Piper,PA-28-181,airplane,airplane
N4508X,P28A,1975,Piper,PA-28-181,airplane,airplane
,,,,,,
Flights Table,,,,,,
Date,AircraftID,From,To,Route,TimeOut,TimeIn
2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
2017-07-27,N17AV,W41,KHPN,,21:45,23:53
_
FName = 'test.csv'
File.write(FName, csv)
#=> 395
We only want the part of the string that begins "Date,". The easiest option is probably to first extract the relevant text. If the file is not humongous, we can slurp it into a string and then remove the unwanted bit.
str = File.read(FName).gsub(/\A.+?(?=^Date,)/m, '')
#=> "Date,AircraftID,From,To,Route,TimeOut,TimeIn\n2017-07-27,N17AV,
# KHPN,KHPN,KHPN KHPN,17:26,18:08\n2017-07-27,N17AV,KHSE,KFFA,,16:29,
# 17:25\n2017-07-27,N17AV,W41,KHPN,,21:45,23:53\n"
The regular expression that is gsub's first argument could be written in free-spacing mode, which makes it self-documenting:
/
\A # match the beginning of the string
.+? # match one or more characters, lazily
(?=^Date,) # match "Date," at the beginning of a line in a positive lookahead
/mx # multi-line and free-spacing regex definition modes
Now that we have the part of the file we want in the string str, we can use CSV::parse to create the CSV::Table object:
csv_tbl = CSV.parse(str, headers: true)
#=> #<CSV::Table mode:col_or_row row_count:4>
The option :headers => true is documented in CSV::new.
Here are a couple of examples of how csv_tbl can be used.
csv_tbl.each { |row| p row }
#=> #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHPN"\
# "To":"KHPN" "Route":"KHPN KHPN" "TimeOut":"17:26" "TimeIn":"18:08">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHSE"\
# "To":"KFFA" "Route":nil "TimeOut":"16:29" "TimeIn":"17:25">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"W41"\
# "To":"KHPN" "Route":nil "TimeOut":"21:45" "TimeIn":"23:53">
(I've used the character '\' to signify that the string continues on the following line, so that readers would not have to scroll horizontally to read the lines.)
csv_tbl.each { |row| p row["From"] }
# "KHPN"
# "KHSE"
# "W41"
Readers who want to know more about how Ruby's CSV class is used may wish to read Darko Gjorgjievski's piece, "A Guide to the Ruby CSV Library, Part 1 and Part 2".

You can use the smarter_csv gem for this. Parse the file once to determine how many rows you need to skip to get to the header row you want, and then use the skip_lines option:
header_offset = <code to determine number of lines above the header>
SmarterCSV.process(params['logbook'].tempfile, skip_lines: header_offset)
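One way to fill in the header_offset placeholder, sketched with the standard library only. It assumes the header row is the first line that starts with "Date," (unquoted, no byte-order mark); the method names header_offset and rows_below_header are made up for this example and are not part of smarter_csv or CSV.

```ruby
require 'csv'

# Returns the zero-based index of the first line starting with `marker`,
# i.e. the number of lines that sit above the header row (nil if not found).
def header_offset(path, marker = 'Date,')
  File.foreach(path).with_index do |line, index|
    return index if line.start_with?(marker)
  end
  nil
end

# Parses everything from the header row down and returns an array of hashes.
def rows_below_header(path, marker = 'Date,')
  offset = header_offset(path, marker)
  return [] if offset.nil?

  # Drop the lines above the header, then let CSV treat the first
  # remaining line as the header row.
  text = File.foreach(path).drop(offset).join
  CSV.parse(text, headers: true).map(&:to_h)
end
```

The value returned by header_offset could be passed as skip_lines, and each hash from rows_below_header has the same shape as row.to_hash in the question.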

From this format, I think the easiest way is to detect an empty line that comes before the header line. That would also work under changes to the header text. In terms of CSV, that would mean a whole line that has only empty cell items.

Related

How can I read CSV with strange quoting in ruby?

I have CSV file with some line like:
col1,col "two",col3
so I get an "Illegal quoting" error, which I fixed by setting :quote_char => "\x00"
["col1", "col\"two\"", "col3"]
but there is a line like
col1,col2,"col,3"
later in that file
["col1", "col2", "\"col", "3\""]
So I read the file line by line and call parse_csv inside a block with a rescue: I set :quote_char => "\"", rescue CSV::MalformedCSVError, and for those particular lines set :quote_char => "\x00" and retry.
All works perfectly until we get the line
col1,col "two","col,3"
In this case it rescues the exception, sets :quote_char => "\x00", and the result is
["col1", "col\"two\"", "\"col", "3\""]
Apple Numbers is able to open that file absolutely correctly.
Is there any setting for parse_csv to handle this without preprocessing the string in some way?
UPD: I show the CSV lines as they are in the file and the results (arrays) as they were printed by p. There are no actual \" in my strings.
This is an invalid csv file. If you have access to the source, you could (ask to) generate the data as follows:
col1,"col ""two""","col,3"
If not, the only option is to parse the data yourself:
pseudocode:
while(read_line) {
bool InsideQuotes = false
for each_char_in_line {
if(char == doublequote)
InsideQuotes = !InsideQuotes
if(char == ',' and !InsideQuotes)
// separator found - process field
}
}
This will also take care of escaped quotes like in col1,"col ""two""","col,3".
If the file contains multiline fields, some more work has to be done.
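A minimal Ruby sketch of the quote-toggling idea from the pseudocode, assuming one record per line (no multiline fields). The names split_lenient and unquote are invented for this example and are not part of Ruby's CSV library.

```ruby
# Strips one pair of surrounding quotes and collapses doubled quotes.
# Fields that are not fully wrapped in quotes are returned unchanged.
def unquote(field)
  return field unless field.length >= 2 && field.start_with?('"') && field.end_with?('"')

  field[1..-2].gsub('""', '"')
end

# Splits a single line on commas that are not inside double quotes.
def split_lenient(line)
  fields = []
  current = +''
  inside_quotes = false

  line.chomp.each_char do |char|
    if char == '"'
      inside_quotes = !inside_quotes   # every quote toggles the state
      current << char
    elsif char == ',' && !inside_quotes
      fields << current                # separator found: field is complete
      current = +''
    else
      current << char
    end
  end

  fields << current                    # last field has no trailing comma
  fields.map { |field| unquote(field) }
end

p split_lenient('col1,col "two","col,3"')
# prints ["col1", "col \"two\"", "col,3"]
```

Because a doubled quote toggles the state twice, correctly escaped input such as col1,"col ""two""","col,3" splits the same way.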
CSV is less a standard and more of a name that everyone thinks they're using correctly to describe their own quirky format, and this is despite there being an RFC standard for CSV, which is just another thing nobody pays attention to.
As such, a lot of programs that read CSV are very forgiving. Ruby's core CSV library is pretty good, but not as adaptable as others. That's because you've got Ruby there to get you out of a jam, and in Numbers you don't.
Try rewriting \" to "", which is conventional CSV formatting as defined in the RFC mentioned above:
CSV.parse(File.read('data.csv').gsub(/\\"/, '""')) # 'data.csv' is a placeholder path

How to read a file's content and search for a string in multiple files

I have a text file that has around 100 plus entries like out.txt:
domain\1esrt
domain\2345p
yrtfj
tkpdp
....
....
I have to read out.txt, line-by-line and check whether the strings like "domain\1esrt" are present in any of the files under a different directory. If present delete only that string occurrence and save the file.
I know how to read a file line-by-line and also know how to grep for a string in multiple files in a directory but I'm not sure how to join those two to achieve my above requirement.
You can create an array with all the words or strings you want to find and then delete/replace:
strings_to_delete = ['aaa', 'domain\1esrt', 'delete_me']
Then read the file and use reject to build an array of all the lines that match none of the elements in the array created before:
# read 'text.txt' and drop every line that equals a value in strings_to_delete
lines = File.readlines('text.txt').reject do |line|
  strings_to_delete.include?(line.strip)
end
Then open the file again, this time for writing, and write back all the lines that didn't match any value in the strings_to_delete array:
File.open('text.txt', 'w') do |file|
  lines.each do |line|
    file.write line
  end
end
The txt file looks like:
aaa
domain\1esrt
domain\2345p
yrtfj
tkpdp
....
....
delete_me
I don't know how it'll work with a bigger file, anyways, I hope it helps.
I would suggest using gsub here. It will run a regex search on the string and replace it with the second parameter. So if you only have to replace any single string, I believe you can simply run gsub on that string (including the newline) and replace it with an empty string:
new_file_text = text.gsub(/regex_string\n/, "")
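To join the two halves of the question (reading out.txt and editing the files in another directory), here is a rough sketch. It assumes the files are small enough to read whole, that plain substring removal is what is wanted, and that every non-blank line of the list is an entry; the method name remove_entries is made up for this example.

```ruby
# Removes every occurrence of each entry listed in `list_path` from all
# regular files under `dir`. Returns the paths of the files it rewrote.
def remove_entries(list_path, dir)
  entries = File.readlines(list_path, chomp: true).map(&:strip).reject(&:empty?)
  return [] if entries.empty?

  # Regexp.union escapes each string, so the backslash in "domain\1esrt"
  # is matched literally. Longer entries go first so that a shorter entry
  # that is a prefix of a longer one cannot win the alternation.
  pattern = Regexp.union(entries.sort_by { |entry| -entry.length })

  changed = []
  Dir.glob(File.join(dir, '**', '*')).each do |path|
    next unless File.file?(path)

    text = File.read(path)
    cleaned = text.gsub(pattern, '')
    next if cleaned == text            # nothing to delete in this file

    File.write(path, cleaned)
    changed << path
  end
  changed
end
```

Note that this removes an entry such as yrtfj even when it appears inside a longer word; anchoring the pattern with word boundaries would be one way to tighten that.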

Ruby - Extra punctuation in file when using regex and csv class to write to a file

I'm using regex to grab parameters from an html file.
I've tested the regexp and it seems to be fine- it appears that the csv conversion is what's causing the issue, but I'm not sure.
Here is what I have:
mechanics_file= File.read(filename)
mechanics= mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/)
id_file= File.read(filename)
id=id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/)
puts id.zip(mechanics)
CSV.open('csvfile.csv', 'w') do |csv|
id.zip(mechanics) { |row| csv << row }
end
The puts output looks like this:
2073
Acting
2689
Action / Movement Programming
But the contents of the csv look like this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
How do I get rid of all of the extra quotes and brackets? Am I doing something wrong in the process of writing to a csv?
This is my first project in ruby so I would appreciate a child-friendly explanation :) Thanks in advance!
String#scan returns an Array of Arrays (bold emphasis mine):
scan(pattern) → array
Both forms iterate through str, matching the pattern (which may be a Regexp or a String). For each match, a result is generated and either added to the result array or passed to the block. If the pattern contains no groups, each individual result consists of the matched string, $&. If the pattern contains groups, each individual result is itself an array containing one entry per group.
a = "cruel world"
# […]
a.scan(/(...)/) #=> [["cru"], ["el "], ["wor"]]
So, id looks like this:
id == [['2073'], ['2689']]
and mechanics looks like this:
mechanics == [['Acting'], ['Action / Movement Programming']]
id.zip(mechanics) then looks like this:
id.zip(mechanics) == [[['2073'], ['Acting']], [['2689'], ['Action / Movement Programming']]]
Which means that in your loop, each row looks like this:
row == [['2073'], ['Acting']]
row == [['2689'], ['Action / Movement Programming']]
CSV#<< expects an Array of Strings, or things that can be converted to Strings as an argument. You are passing it an Array of Arrays, which it will happily convert to an Array of Strings for you by calling Array#to_s on each element, and that looks like this:
[['2073'], ['Acting']].map(&:to_s) == [ '["2073"]', '["Acting"]' ]
[['2689'], ['Action / Movement Programming']].map(&:to_s) == [ '["2689"]', '["Action / Movement Programming"]' ]
Lastly, " is the quote character in CSV, and a literal " inside a field needs to be escaped by doubling it, so what actually gets written to the CSV file is this:
"[""2073""]","[""Acting""]"
"[""2689""]","[""Action / Movement Programming""]"
The simplest way to correct this, would be to flatten the return values of the scans (and maybe also convert the IDs to Integers, assuming that they are, in fact, Integers):
mechanics_file = File.read(filename)
mechanics = mechanics_file.scan(/(?<=70%">)(.*)(?=<\/td)/).flatten
id_file = File.read(filename)
id = id_file.scan(/(?<="propertyids\[]" value=")(.*)(?=")/).flatten.map(&:to_i)
CSV.open('csvfile.csv', 'w') do |csv|
id.zip(mechanics) { |row| csv << row }
end
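An alternative to flatten, sketched under assumptions: as the quoted documentation says, a pattern with no groups makes scan return plain strings, so dropping the capturing parentheses gives flat arrays directly. The HTML below is invented to resemble the question's input, and CSV.generate is used instead of CSV.open only so the result can be inspected as a string.

```ruby
require 'csv'

# Made-up sample markup; the real file's structure is an assumption.
html = <<~HTML
  <input name="propertyids[]" value="2073"><td width="70%">Acting</td>
  <input name="propertyids[]" value="2689"><td width="70%">Action / Movement Programming</td>
HTML

# No capture groups, so scan returns an Array of Strings.
ids       = html.scan(/(?<="propertyids\[\]" value=")\d+(?=")/)
mechanics = html.scan(/(?<=70%">).*?(?=<\/td)/)

p ids        # prints ["2073", "2689"]
p mechanics  # prints ["Acting", "Action / Movement Programming"]

# Each row is now an Array of Strings, which is what CSV#<< expects.
output = CSV.generate do |csv|
  ids.zip(mechanics) { |row| csv << row }
end
puts output
```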
Another suggestion would be to forgo the Regexps completely and use an HTML parser to parse the HTML.

How to get text between two strings in ruby?

I have a text file that contains this text:
What's New in this Version
==========================
-This is the text I want to get
-It can have 1 or many lines
-These equal signs are repeated throughout the file to separate sections
Primary Category
================
I just want to get everything between ========================== and Primary Category and store that block of text in a variable. I thought the following match method would work but it gives me, NoMethodError: undefined method `match'
f = File.open(metadataPath, "r")
line = f.readlines
whatsNew = f.match(/==========================(.*)Primary Category/m).strip
Any ideas? Thanks in advance.
f is a File object, which has no match method (hence the NoMethodError); you want to match on the text in the file, which you read into line. Rather than reading the text into an array of lines (which is hard to run a regex on), I prefer to just read it into one string:
contents = File.open(metadataPath) { |f| f.read }
contents.match(/==========================(.*)Primary Category/m)[1].strip
The last line produces your desired output:
"-This is the text I want to get \n-It can have 1 or many lines\n-These equal signs are repeated throughout the file to separate sections"
contents = File.read(metadataPath)
contents =~ /==========================(.*)Primary Category/m
whatsNew = $1
You may want to consider refining the .* though, as it is greedy; .*? would stop at the first "Primary Category".
Your problem is that readlines gives you an array of strings (one for each line), but the regular expression you're using needs a single string. You could read the file as one string:
contents = File.read(metadataPath)
puts contents[/^=+(.*?)Primary Category/m]
# => ==========================
# => -This is the text I want to get
# => -It can have 1 or many lines
# => -These equal signs are repeated throughout the file to separate sections
# =>
# => Primary Category
or you could join the lines into a single string before applying the regular expression:
lines = File.readlines(metadataPath)
puts lines.join[/^=+(.*?)Primary Category/m]
# => ==========================
# => -This is the text I want to get
# => -It can have 1 or many lines
# => -These equal signs are repeated throughout the file to separate sections
# =>
# => Primary Category
The approach I'd take is read in the lines, find out which line numbers are a series of equal signs (using Array#find_index), and group the lines into chunks from the line after the equal signs to the line before (or two lines before) the next lot of equal signs (probably using Enumerable#each_cons(2) and map). That way I don't have to modify much if the section headings change.
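A rough sketch of that line-based approach, assuming every section title sits on the line directly above a row made only of equal signs. It uses each_index.select rather than find_index so that all underline rows are found at once; the method name sections is made up for this example.

```ruby
# Splits text into { title => body } pairs, where each title is the line
# directly above a row consisting only of '=' characters.
def sections(text)
  lines = text.lines(chomp: true)

  # Indices of the underline rows.
  rules = lines.each_index.select { |i| lines[i].match?(/\A=+\z/) }

  result = {}
  # The extra sentinel lets the last section run to the end of the text.
  (rules + [lines.size + 1]).each_cons(2) do |rule, next_rule|
    next if rule.zero?                 # an underline with no title above it

    title = lines[rule - 1]
    # Body: from the line after the underline up to (not including)
    # the next section's title line.
    body = lines[(rule + 1)...(next_rule - 1)] || []
    result[title] = body.join("\n").strip
  end
  result
end
```

With the file from the question, sections(File.read(metadataPath))["What's New in this Version"] would hold the three lines between the two headings.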

Ruby MatchData class is repeating captures, instead of including additional captures as it "should"

Ruby 1.9.1, OSX 10.5.8
I'm trying to write a simple app that parses a bunch of Java-based HTML template files and replaces each period (.) with an underscore when it appears within a specific tag. I use Ruby all the time for these types of utility apps, and thought it would be no problem to whip something up using Ruby's regex support. So I create a Regexp.new... object, open a file, and read it in line by line, matching each line against the pattern. If I get a match, I create a new string using replaceString = currentMatch.gsub(/\./, '_'), then create another pattern that matches the original text literally with newReplaceRegex = Regexp.escape(currentMatch), and finally replace back into the current line with line.gsub(newReplaceRegex, replaceString). Code below, of course, but first...
The problem I'm having is that when accessing the indexes within the returned MatchData object, I'm getting the first result twice, and it's missing the second sub string it should otherwise be finding. More strange, is that when testing this same pattern and same test text using rubular.com, it works as expected. See results here
My pattern:
(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+.)+(?:[a-zA-Z0-9]+)(?:>))
Test text:
<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText
Here's the relevant code:
tagRegex = Regexp.new('(<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>))+')
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each{|htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
tagMatch = tagRegex.match(htmlLine)
if(tagMatch)
matchesArray = tagMatch.to_a
firstMatch = matchesArray[0]
secondMatch = matchesArray[1]
puts "First match: #{firstMatch} and second match #{secondMatch}"
tagMatch.captures.each {|lineMatchCapture|
puts "Current capture for tagMatches: #{lineMatchCapture} of total match count #{matchesArray.size}"
#create a new regex using the match results; make sure to use auto escape method
originalPatternString = Regexp.escape(lineMatchCapture)
replacementRegex = Regexp.new(originalPatternString)
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = lineMatchCapture.gsub(/\./, '_')
#replace original match with underscore replaced copy within line
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
}
end
}
I would think that I should get the first tag in matchData[0] and the second tag in matchData[1]; or, since I don't know how many matches I'll get within any given line, what I'm really doing is matchData.to_a.each. In this case, matchData has two entries, but they're both the first tag match,
which is: <WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>
So, what the heck am I doing wrong, why does rubular test give me the expected results?
You want to use String#scan instead of Regexp#match:
tag_regex = /<(?:WEBOBJECT|webobject) (?:NAME|name)=(?:[a-zA-Z0-9]+\.)+(?:[a-zA-Z0-9]+)(?:>)/
lines = "<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>moreNonMatchingText\
<WEBOBJECT NAME=admin.SecondLineMatch>AndEvenMoreNonMatchingText"
lines.scan(tag_regex)
# => ["<WEBOBJECT NAME=admin.normalMode.someOtherPatternWeDontWant.moreThatWeDontWant>", "<WEBOBJECT NAME=admin.SecondLineMatch>"]
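As for why match seemed to return the first tag twice: MatchData#[0] is the whole match and MatchData#[1] is the first capture group, and match stops after the first match in the string. With the whole pattern wrapped in one group, both entries are therefore the same tag. A small sketch follows; the shortened test line and the variable names are made up, and the block form of gsub is shown as one possible way to do the period replacement in a single step.

```ruby
tag_regex = /(<webobject name=(?:\w+\.)+\w+>)/i
line = '<WEBOBJECT NAME=admin.normalMode>text<WEBOBJECT NAME=admin.SecondLineMatch>more'

# match finds only the first occurrence; index 0 is the full match and
# index 1 is capture group 1, which here spans the same text.
m = tag_regex.match(line)
whole  = m[0]
group1 = m[1]
puts whole == group1   # prints true

# scan keeps going; with one group it returns one-element arrays.
all_tags = line.scan(tag_regex).flatten

# gsub with a block visits every match and substitutes the block's result.
fixed = line.gsub(tag_regex) { |tag| tag.tr('.', '_') }
puts fixed
# prints <WEBOBJECT NAME=admin_normalMode>text<WEBOBJECT NAME=admin_SecondLineMatch>more
```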
A few recommendations for next ruby questions:
newlines and spaces are your friends, you don't lose points for using more lines in your code ;-)
use do-end on blocks instead of {}, improves readability a lot
declare variables in snake case (hello_world) instead of camel case (helloWorld)
Hope this helps
I ended up using the String.scan approach, the only tricky point there was figuring out that this returns an array of arrays, not a MatchData object, so there was some initial confusion on my part, mostly due to my ruby green-ness, but it's working as expected now. Also, I trimmed the regex per Trevoke's suggestion. But snake case? Never...;-) Anyway, here goes:
tagRegex = /(<(?:webobject) (?:name)=(?:\w+\.)+(?:\w+)(?:>))/i
testFile = File.open('RegexTestingCompFix.txt', "r+")
lineCount=0
testFile.each do |htmlLine|
lineCount += 1
puts ("Current line: #{htmlLine} at line num: #{lineCount}")
oldMatches = htmlLine.scan(tagRegex) #oldMatches thusly named due to not explicitly using Regexp or MatchData, as in "the old way..."
if(oldMatches.size > 0)
oldMatches.each_index do |index|
arrayMatch = oldMatches[index]
aMatch = arrayMatch[0]
#create a new regex using the match results; make sure to use auto escape method
replacementRegex = Regexp.new(Regexp.escape(aMatch))
#replace any periods with underscores in a copy of lineMatchCapture
periodToUnderscoreCorrection = aMatch.gsub(/\./, '_')
#replace original match with underscore replaced copy within line, matching against the new escaped literal regex
htmlLine.gsub!(replacementRegex, periodToUnderscoreCorrection)
puts "The modified htmlLine is now: #{htmlLine}"
end # I kind of still prefer the brackets...;-)
end
end
Now, why does MatchData work the way it does? Its behavior seems like a bug really, and it's certainly not very useful in general if you can't get it to provide a simple means of accessing all the matches. Just my $.02
Small bits:
This regexp helps you get "normalMode" .. But not "secondLineMatch":
<webobject name=\w+\.((?:\w+)).+> (with option 'i', for "case insensitive")
This regexp helps you get "secondLineMatch" ... But not "normalMode":
<webobject name=\w+\.((?:\w+))> (with option 'i', for "case insensitive").
I'm not really good at regexps but I'll keep toiling at it.. :)
And I don't know if this helps you at all, but here's a way to get both:
<webobject name=admin.(\w+) (with option 'i').
