ruby regex to remove extra \n

I have a malformed .csv file; the problem is caused by some extra \n characters, e.g.:
Name,Comment
"Peter","Good morning"
"Paul","How are you
"
"Mary","Fine"
The 2nd row ends with an unwanted, extra \n.
How can I remove all trailing \ns which are not followed by a double-quote " (assume the whole file is read into a string already)?

Don't read the whole thing into a string; use the standard CSV parser that ships with Ruby 1.9 to read it. If you have the data in, say, pancakes.csv, then:
require 'csv'
data = CSV.open('pancakes.csv').map { |r| r.map(&:strip) }
# or
data = CSV.open('pancakes.csv').map { |r| r.map(&:chomp) }
Then you'll have this in data:
[
["Name", "Comment"],
["Peter", "Good morning"],
["Paul", "How are you"],
["Mary", "Fine"]
]
So you can get your data all clean and nicely parsed quite simply. And if you just need to clean up the CSV for some other program that can't handle embedded newlines, then you can use CSV to write it back out again.
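As a minimal sketch of the "write it back out" step, assuming the file content is already available as a string: the helper name clean_csv is made up for this illustration, and force_quotes is used only so the output keeps the quoting style of the sample input.

```ruby
require 'csv'

# Hypothetical helper: parse CSV text, strip stray whitespace (including
# newlines) from every field, and return the cleaned text as CSV again.
def clean_csv(text)
  rows = CSV.parse(text).map { |row| row.map { |field| field&.strip } }
  CSV.generate(force_quotes: true) { |out| rows.each { |row| out << row } }
end

# The sample from the question: the second record has a newline
# just before its closing quote.
raw = "Name,Comment\n\"Peter\",\"Good morning\"\n\"Paul\",\"How are you\n\"\n\"Mary\",\"Fine\"\n"
puts clean_csv(raw)
```

The safe-navigation call (field&.strip) leaves empty cells, which the parser returns as nil, untouched.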

You don't need a Regexp for that. It's basically any double-quote on its own line:
csv_string.gsub("\n\"\n", "\"\n")
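Applied to the sample from the question, that replacement can be sketched as follows. The method name drop_stray_newlines is only for illustration, and the sketch assumes Unix line endings (\n rather than \r\n) and that a quote alone on a line never occurs legitimately.

```ruby
# Hypothetical wrapper around the plain-string gsub shown above:
# a double-quote alone on its own line is pulled back up onto the
# previous line.
def drop_stray_newlines(csv_string)
  csv_string.gsub("\n\"\n", "\"\n")
end

csv_string = "Name,Comment\n\"Peter\",\"Good morning\"\n\"Paul\",\"How are you\n\"\n\"Mary\",\"Fine\"\n"
puts drop_stray_newlines(csv_string)
```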

Why don't you just add a trailing double quote for lines which don't end in a double quote, and remove empty lines (lines that only have a double quote)?

Related

Ruby—nested gsubs?

I'm parsing a txt file (an old hymns book). I want to do the following:
Get the chorus
Parse/sanitize each line of that chorus
I've tried this code:
chorus_regex = /([^0-9]+\n)+/
puts hymn.gsub(chorus_regex) {|match| match.gsub(/^([^0-9]+\n)/, " \1")}
But the second gsub is only affecting the first line. I think it's because the \1 might be applied to the first regex, not the second.
TL;DR
How do you write nested gsubs, so that you can grab blocks of txt, do a gsub on those blocks, and replace the old blocks with the results?
Edit
I simplified the regexes, so the question is focused on how to nest regex gsubs, and not distracted by complicated regex or badly encoded chars.

To nest gsubs, use the block form for both gsubs; inside a block you can use the Perl-style globals ($1 for the first group, etc.).
chorus_regex = /([^0-9]+\n)+/
puts hymn.gsub(chorus_regex) {|match|
match.gsub(/^([^0-9]+\n)/) { |line|
" #{$1}"
}
}
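Here is a self-contained sketch of the nested block form on made-up sample text. Two details differ from the code above and are assumptions of this sketch, not part of the original answer: the character class excludes \n (since [^0-9] also matches newlines, a whole multi-line block would otherwise be consumed as a single "line"), and the indent is two spaces. Note also that " \1" inside a double-quoted Ruby string is an escape for the character with code 1, not a backreference, which is another reason to prefer the block form with $1.

```ruby
# Hypothetical helper: indent every line of each block of digit-free lines.
def indent_blocks(hymn)
  block_regex = /(?:[^0-9\n]+\n)+/          # one or more consecutive digit-free lines
  hymn.gsub(block_regex) do |block|
    # Inner gsub runs on the matched block only; $1 refers to the inner match.
    block.gsub(/^([^0-9\n]+\n)/) { "  #{$1}" }
  end
end

hymn = "1\nAmazing grace\nHow sweet the sound\n2\nI once was lost\n"
puts indent_blocks(hymn)
```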

Parse CSV file with headers when the headers are part way down the page

I have a CSV file that, as a spreadsheet, looks like this:
I want to parse the spreadsheet with the headers at row 19. Those headers won't always start at row 19, so my question is: is there a simple way to parse this spreadsheet and specify which row holds the headers, say by using the "Date" string to identify the header row?
Right now, I'm doing this:
CSV.foreach(params['logbook'].tempfile, headers: true) do |row|
Flight.create(row.to_hash)
end
but obviously that won't work because it doesn't get the right headers.
I feel like there should be a simple solution to this since it's pretty common to have CSV files in this format.

Let's first create the csv file that would be produced from the spreadsheet.
csv =<<-_
N211E,C172,2004,Cessna,172R,airplane,airplane
C-GPGT,C172,1976,Cessna,172M,airplane,airplane
N17AV,P28A,1983,Piper,PA-28-181,airplane,airplane
N4508X,P28A,1975,Piper,PA-28-181,airplane,airplane
,,,,,,
Flights Table,,,,,,
Date,AircraftID,From,To,Route,TimeOut,TimeIn
2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
2017-07-27,N17AV,W41,KHPN,,21:45,23:53
_
FName = 'test.csv'
File.write(FName, csv)
#=> 395
We only want the part of the string that begins with "Date,". The easiest option is probably to first extract the relevant text. If the file is not humongous, we can slurp it into a string and then remove the unwanted bit.
str = File.read(FName).gsub(/\A.+?(?=^Date,)/m, '')
#=> "Date,AircraftID,From,To,Route,TimeOut,TimeIn\n2017-07-27,N17AV,
# KHPN,KHPN,KHPN KHPN,17:26,18:08\n2017-07-27,N17AV,KHSE,KFFA,,16:29,
# 17:25\n2017-07-27,N17AV,W41,KHPN,,21:45,23:53\n"
The regular expression that is gsub's first argument could be written in free-spacing mode, which makes it self-documenting:
/
\A # match the beginning of the string
.+? # match one or more characters, lazily
(?=^Date,) # match "Date," at the beginning of a line in a positive lookahead
/mx # multi-line and free-spacing regex definition modes
Now that we have the part of the file we want in the string str, we can use CSV::parse to create the CSV::Table object:
csv_tbl = CSV.parse(str, headers: true)
#=> #<CSV::Table mode:col_or_row row_count:4>
The option :headers => true is documented in CSV::new.
Here are a couple of examples of how csv_tbl can be used.
csv_tbl.each { |row| p row }
#=> #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHPN"\
# "To":"KHPN" "Route":"KHPN KHPN" "TimeOut":"17:26" "TimeIn":"18:08">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"KHSE"\
# "To":"KFFA" "Route":nil "TimeOut":"16:29" "TimeIn":"17:25">
# #<CSV::Row "Date":"2017-07-27" "AircraftID":"N17AV" "From":"W41"\
# "To":"KHPN" "Route":nil "TimeOut":"21:45" "TimeIn":"23:53">
(I've used the character '\' to signify that the string continues on the following line, so that readers would not have to scroll horizontally to read the lines.)
csv_tbl.each { |row| p row["From"] }
# "KHPN"
# "KHSE"
# "W41"
Readers who want to know more about how Ruby's CSV class is used may wish to read Darko Gjorgjievski's piece, "A Guide to the Ruby CSV Library, Part 1 and Part 2".

You can use the smarter_csv gem for this. Parse the file once to determine how many rows you need to skip to get to the header row you want, and then use the skip_lines option:
header_offset = <code to determine number of lines above the header>
SmarterCSV.process(params['logbook'].tempfile, skip_lines: header_offset)
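One way the header offset could be determined is sketched below using only the standard library. The helper name parse_from_header and the sample text are made up for this illustration, and the sketch assumes no quoted field above the header contains an embedded newline (otherwise counting physical lines would be off). The returned offset is the number of lines above the header, which is what header_offset stands for above.

```ruby
require 'csv'

# Hypothetical helper: find the first line starting with `marker`, then
# parse everything from that line onward with it as the header row.
# Returns [number_of_lines_skipped, CSV::Table].
def parse_from_header(text, marker)
  lines = text.lines
  header_offset = lines.index { |line| line.start_with?(marker) }
  raise ArgumentError, "no line starts with #{marker.inspect}" if header_offset.nil?

  [header_offset, CSV.parse(lines.drop(header_offset).join, headers: true)]
end

text = <<~TEXT
  N211E,C172,2004,Cessna,172R,airplane,airplane
  ,,,,,,
  Flights Table,,,,,,
  Date,AircraftID,From,To,Route,TimeOut,TimeIn
  2017-07-27,N17AV,KHPN,KHPN,KHPN KHPN,17:26,18:08
  2017-07-27,N17AV,KHSE,KFFA,,16:29,17:25
TEXT

offset, table = parse_from_header(text, 'Date,')
puts offset
table.each { |row| puts row['From'] }
```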

From this format, I think the easiest way is to detect an empty line that comes before the header line. That would also work under changes to the header text. In terms of CSV, that would mean a whole line that has only empty cell items.

Delete character in an XML file using Ruby

I am working with Ruby, and I want to delete all the \ characters from my XML file.
Here is my XML file:
<w:numId w:val=\"2\"/></w:numPr></w:pPr><w:bookmarkStart w:id=\"0\" w:name=\"__DdeLink__0_226207805\"/><w:bookmarkEnd w:id=\"0\"/><w:r><w:rPr></w:rPr><w:t>Serve high quality food</w:t></w:r></w:p>, <w:p><w:pPr><w:pStyle w:val=\"style17\"/><w:numPr><w:ilvl w:val=\"0\"/><w:numId w:val=\"2\"/></w:numPr></w:pPr><w:bookmarkStart w:id=\"0\" w:name=\"__DdeLink__0_226207805\"/><w:bookmarkEnd w:id=\"0\"/>

There's actually no backslash character (\) in your file. The backslash in your example simply escapes the following double-quote and prevents it from terminating the string early, which would otherwise cause a syntax error.
What you see when you print that string in IRB is actually not the backslash as is, but the backslash in combination with the following double-quote as an indication that the double-quote is escaped. The idea is kind of hard to grasp when you encounter it the first time. Have a look at "Escape sequences".
In short, there is no backslash in your file, so you can't remove it.
Let me explain with an example:
> text = "This is sample text for escape character\""
#=> "This is sample text for escape character\""
Is equivalent to:
> text = 'This is sample text for escape character"'
#=> "This is sample text for escape character\""
There is no backslash (\) to remove; if you want the \" to disappear from the output, remove the double-quote (") itself:
> text.tr!('"', '')
#=> "This is sample text for escape character"
I hope this makes it clear.
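The difference between an escaped quote and a real backslash can be checked directly. This is a small sketch with made-up variable names: a double-quoted literal treats \" as an escape for a single " character, while a single-quoted literal keeps the backslash as a real character (which is the situation where a file truly contains backslashes).

```ruby
escaped = "w:val=\"2\""   # double-quoted literal: \" stands for one " character
literal = 'w:val=\"2\"'   # single-quoted literal: each backslash is a real character

puts escaped                    # shows w:val="2"
puts literal                    # shows w:val=\"2\"

puts escaped.length             # 9
puts literal.length             # 11
puts escaped.include?('\\')     # false: no backslash in the string
puts literal.include?('\\')     # true

# Removing real backslash-quote pairs turns the second string into the first.
puts literal.gsub('\"', '"') == escaped   # true
```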

Thank you guys for your answers. Here is what I did, and it worked as I wanted:
text = ''
File.open("#{temp_dir}/plan_report_template/word/document.xml").each { |line|
text << line
}
open("#{temp_dir}/plan_report_template/word/document.xml", "w") { |file| file.write(text.gsub('\"', '"')) }

Escaping Strings For Ruby SQLite Insert

I'm creating a Ruby script to import a tab-delimited text file of about 150k lines into SQLite. Here it is so far:
require 'sqlite3'
file = File.new("/Users/michael/catalog.txt")
string = []
# Escape single quotes, remove newline, split on tabs,
# wrap each item in quotes, and join with commas
def prepare_for_insert(s)
s.gsub(/'/,"\\\\'").chomp.split(/\t/).map {|str| "'#{str}'"}.join(", ")
end
file.each_line do |line|
string << prepare_for_insert(line)
end
database = SQLite3::Database.new("/Users/michael/catalog.db")
# Insert each string into the database
string.each do |str|
database.execute( "INSERT INTO CATALOG VALUES (#{str})")
end
The script errors out on the first line containing a single quote in spite of the gsub to escape single quotes in my prepare_for_insert method:
/Users/michael/.rvm/gems/ruby-1.9.3-p0/gems/sqlite3-1.3.5/lib/sqlite3/database.rb:91:
in `initialize': near "s": syntax error (SQLite3::SQLException)
It's erroring out on line 15. If I inspect that line with puts string[14], I can see where it's showing the error near "s". It looks like this: 'Touch the Top of the World: A Blind Man\'s Journey to Climb Farther Than the Eye Can See'
Looks like the single quote is escaped, so why am I still getting the error?

Don't do it like that at all; string interpolation and SQL tend to be a bad combination. Use a prepared statement instead and let the driver deal with quoting and escaping:
# Ditch the gsub in prepare_for_insert and...
db = SQLite3::Database.new('/Users/michael/catalog.db')
ins = db.prepare('insert into catalog (column_name) values (?)')
string.each { |s| ins.execute(s) }
You should replace column_name with the real column name of course; you don't have to specify the column names in an INSERT but you should always do it anyway. If you need to insert more columns then add more placeholders and arguments to ins.execute.
Using prepare and execute should be faster, safer, easier, and it won't make you feel like you're writing PHP in 1999.
Also, you should use the standard CSV parser to parse your tab-separated files. XSV formats aren't much fun to deal with (they're downright evil, in fact), and you have better things to do with your time than deal with their nonsense and edge cases.
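A rough sketch of how the two suggestions might fit together for more than one column: parse the tab-separated text with the standard CSV library and build an INSERT with one placeholder per column. The helper names, the column names (title, kind), and the sample row are assumptions for illustration only; the database calls are left as comments because they need the sqlite3 gem and a real table.

```ruby
require 'csv'

# Hypothetical helper: split tab-separated text into rows of fields.
# liberal_parsing tolerates stray double quotes inside unquoted fields.
def parse_tsv(text)
  CSV.parse(text, col_sep: "\t", liberal_parsing: true)
end

# Hypothetical helper: build an INSERT with one "?" placeholder per column.
# Table and column names come from the program, never from the data.
def insert_sql(table, columns)
  placeholders = (['?'] * columns.size).join(', ')
  "insert into #{table} (#{columns.join(', ')}) values (#{placeholders})"
end

tsv = "Touch the Top of the World: A Blind Man's Journey\tBook\n"
rows = parse_tsv(tsv)
sql = insert_sql('catalog', %w[title kind])
puts sql
p rows

# With the sqlite3 gem, the rows would then go through the prepared statement:
#   ins = db.prepare(sql)
#   rows.each { |row| ins.execute(*row) }
```

The single quote in the title needs no escaping here, because the value is bound to a placeholder and never spliced into the SQL text.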

Ruby: How can I process a CSV file with "bad commas"?

I need to process a CSV file from FedEx.com containing shipping history. Unfortunately FedEx doesn't seem to actually test its CSV files as it doesn't quote strings that have commas in them.
For instance, a company name might be "Dog Widgets, Inc." but the CSV doesn't quote that string, so any CSV parser thinks that comma before "Inc." is the start of a new field.
Is there any way I can reliably parse those rows using Ruby?
The only differentiating characteristic that I can find is that the commas that are part of a string have a space after them. Commas that separate fields have no spaces. No clue how that helps me parse this, but it is something I noticed.

You can split on commas using a negative lookahead, so that a comma followed by a space or tab is not treated as a separator:
>> "foo,bar,baz,pop, blah,foobar".split(/,(?![ \t])/)
=> ["foo", "bar", "baz", "pop, blah", "foobar"]

Well, here's an idea: You could replace each instance of comma-followed-by-a-space with a unique character, then parse the CSV as usual, then go through the resulting rows and reverse the replace.
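A sketch of that replace, parse, restore idea follows. The placeholder character (the ASCII unit separator), the helper name, and the sample row are assumptions of this sketch; it only works if the placeholder never occurs in the real data and if every in-field comma really is followed by a space.

```ruby
require 'csv'

# Assumed never to appear in the data (ASCII unit separator).
COMMA_PLACEHOLDER = "\u001F"

# Hypothetical helper: hide comma-plus-space pairs, parse as normal CSV,
# then put the commas back into each field.
def parse_with_bad_commas(text)
  protected_text = text.gsub(', ', "#{COMMA_PLACEHOLDER} ")
  CSV.parse(protected_text).map do |row|
    row.map { |field| field&.gsub(COMMA_PLACEHOLDER, ',') }
  end
end

p parse_with_bad_commas("123,Dog Widgets, Inc.,Memphis,TN\n")
```

Going through the CSV parser, rather than a bare split, keeps any fields that are properly quoted working as usual.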

Perhaps something along these lines: use gsub to change the ', ' to something else,
ruby-1.9.2-p0 > "foo,bar,baz,pop, blah,foobar".gsub(/,\ /,'| ').split(',')
[
[0] "foo",
[1] "bar",
[2] "baz",
[3] "pop| blah",
[4] "foobar"
]
and then replace the | with a comma again afterwards.

If you are lucky enough to have only one field like that, you can parse the leading fields off the start and the trailing fields off the end, and assume whatever is left is the offending field. In Python (no habla Ruby) this would look something like:
fields = line.split(',') # doesn't work if some fields are quoted
fields = fields[:5] + [','.join(fields[5:-3])] + fields[-3:]
Whatever you do, you should at a minimum be able to determine the number of offending commas, and that should give you something (a sanity check if nothing else).
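A Ruby rendering of the same idea might look like the sketch below. The method name, the lead/trail parameters, and the sample line are made up for illustration, and like the Python version it does not handle quoted fields.

```ruby
# Hypothetical helper: keep the first `lead` and last `trail` fields as they
# are and glue everything in between back together as one field.
def rejoin_fields(line, lead, trail)
  fields = line.chomp.split(',', -1)   # -1 keeps trailing empty fields
  raise ArgumentError, 'too few fields' if fields.size < lead + trail + 1

  middle = fields[lead...(fields.size - trail)]
  fields.first(lead) + [middle.join(',')] + fields.last(trail)
end

line = "123,2011-01-05,Dog Widgets, Inc.,Memphis,TN"
p rejoin_fields(line, 2, 2)

# Sanity check: number of extra commas that were folded into the middle field.
puts line.split(',', -1).size - 5
```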
