Skip rows before the header in CSV file [duplicate] - ruby

I tried searching but couldn't find a question regarding my problem. Let's say I have a CSV file that looks something like this:
Metadata line 1
Metadata line 2
Metadata line 3
Metadata line 4
foo,bar,baz
apple,orange,banana
cashew,almond,walnut
The line foo,bar,baz is the header, and the following lines are the corresponding data. When I write my ruby code like so:
CSV.foreach("filename.csv",:headers=>true) do |row|
puts "#{row}"
end
It clearly breaks. What's the best way to skip the lines before the header? Currently I'm thinking I could do something like:
Find the first row with commas and get line number
Extract that line as an array
Pass that array to :headers
But this feels cumbersome - if I know exactly what line the header is, what's the best way to jump to that line and ignore everything previously? Is this possible? If this is a question that has been asked before, I will happily devour those answers, perhaps my search-fu just isn't good enough.
Thank you so much!

There is a skip_lines option to CSV. Not exactly clear if it will skip header lines or just rows, but worth a shot.
:skip_lines - When set to an object responding to match, every line
matching it is considered a comment and ignored during parsing. When
set to a String, it is first converted to a Regexp. When set to nil no
line is considered a comment. If the passed object does not respond to
match, ArgumentError is thrown.
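A minimal sketch of how skip_lines could be applied to the sample from the question. It assumes that every line to be ignored starts with the word "Metadata"; the helper name parse_skipping_metadata and the pattern are made up for illustration, so adjust the pattern to whatever distinguishes the junk lines in a real file.

```ruby
require 'csv'

# Hypothetical helper: parse CSV text, ignoring every line that matches
# skip_pattern. The first line that is not skipped becomes the header.
def parse_skipping_metadata(text, skip_pattern = /\AMetadata/)
  CSV.parse(text, headers: true, skip_lines: skip_pattern)
end

input = <<~TEXT
  Metadata line 1
  Metadata line 2
  Metadata line 3
  Metadata line 4
  foo,bar,baz
  apple,orange,banana
  cashew,almond,walnut
TEXT

table = parse_skipping_metadata(input)
table.each { |row| p row.to_h }
```

Because the pattern is tested against every line, a data row that happens to match it would be dropped as well, so this only fits files where the metadata lines are clearly distinguishable.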

If you know how many metadata lines there are, you can just eat them before creating the CSV object.
You could of course also do something useful with them, but that's up to you!
require 'csv'
3.times { DATA.readline }
csv = CSV.new(DATA, headers: true, return_headers: false)
csv.read.each do |row|
p row
end
# => #<CSV::Row "header1":"1" "header2":"2">
# => #<CSV::Row "header1":"3" "header2":"4">
# => #<CSV::Row "header1":"5" "header2":"6">
p csv.headers
# => ["header1", "header2"]
__END__
# I know
# there are 3 lines
# here, so I can skip them.
header1,header2
1,2
3,4
5,6

You can do something like:
require 'csv'
while (header = DATA.readline) !~ /,.*,/
end
csv = CSV.new(DATA.read, headers: header)
csv.each do |row|
p row
end
p csv.headers
__END__
Metadata line 1
Metadata line 2
Metadata line 3
Metadata line 4
foo,bar,baz
apple,orange,banana
cashew,almond,walnut
One warning: Nick's third metadata line (# here, so I can skip them.) contains one comma. So your rule "find the first row with commas" could pick the wrong line. You can use the regex /,.*,/ instead, but then the header must contain at least two commas to be detected as the header.
In other words: every line before the header may contain at most one comma, and the real header line must contain more than one.
Remark 2: DATA is a special Ruby construct that can be replaced with a file handle (e.g. the f in File.open(filename){|f| ...}).
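As a rough sketch of that last remark, the same idea can be applied to a real file handle. The method name rows_after_metadata and its header_pattern parameter are invented for this example, and the default pattern is the "at least two commas" rule from above, which is itself an assumption about the data.

```ruby
require 'csv'
require 'tempfile'

# Hypothetical helper: treat the first line matching header_pattern as the
# header, ignore everything before it, and return the data rows as hashes.
def rows_after_metadata(path, header_pattern = /,.*,/)
  File.open(path) do |f|
    header = f.each_line.find { |line| line.match?(header_pattern) }
    raise ArgumentError, "no header line found in #{path}" unless header

    # The file position is now just past the header line, so the rest of
    # the file holds only data rows.
    CSV.parse(f.read, headers: header.chomp).map(&:to_h)
  end
end

# Usage with a throwaway file holding a shortened version of the sample data.
Tempfile.create('sample') do |tmp|
  tmp.write("Metadata line 1\nMetadata line 2\nfoo,bar,baz\napple,orange,banana\n")
  tmp.flush
  p rows_after_metadata(tmp.path)
end
```

Reading the remainder with f.read keeps the sketch close to the answer's code; for very large files it may be preferable to hand the open file to CSV.new and iterate row by row.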

Related

Alternative code to read and process array by newline in Ruby

My code is supposed to read a file on the server, store its content in an Array, then read the array elements (eventually each element is a line) and split each line into 7 parts by (:)
I wrote this code and it works 100% fine.
lines = File.readlines('/etc/passwd')
lines.each do |line|
line = line.chomp! #I removed the \n
line_arr = line.split(/:/)
puts line_arr.inspect
puts "*************"
end
I just want to know if there is a shortcut to do this since each element of the array ends with \n.
Maybe I am a bit confused between array elements that end with \n and a string that contains \n.
the content of the file looks like this
root:x:0:0:root:/root:/bin/bash
daemon:x:1:1:daemon:/usr/sbin:/bin/sh
bin:x:2:2:bin:/bin:/bin/sh
sys:x:3:3:sys:/dev:/bin/sh
sync:x:4:65534:sync:/bin:/bin/sync
games:x:5:60:games:/usr/games:/bin/sh
man:x:6:12:man:/var/cache/man:/bin/sh
lp:x:7:7:lp:/var/spool/lpd:/bin/sh
mail:x:8:8:mail:/var/mail:/bin/sh
news:x:9:9:news:/var/spool/news:/bin/sh
uucp:x:10:10:uucp:/var/spool/uucp:/bin/sh
As for the output, there's no specific format, because I am going to use this part and extend my code later. As long as I can access those 7 parts that I extracted from the line_arr, I should be fine.
thank you
require 'etc'
[].tap {|ary| Etc.passwd {|u|
ary << [u.name, u.passwd, u.uid, u.gid, u.gecos, u.dir, u.shell, u.change,
u.uclass, u.expire]
}}
Rule of thumb: never try to reimplement behavior that someone else has already written for you. Unless you are really, really, really, REALLY smart.
Actually, now that you have edited your question, I don't even see why you need those arrays in the first place and cannot just use the Etc.passwd iterator and Struct::Passwd directly.
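If you do want to stay with plain line splitting, the shortcut the question asks about might look like the sketch below. The method name split_passwd_lines is made up for illustration, and the two sample lines are taken from the question.

```ruby
# Split passwd-style lines into their colon-separated fields.
# chomp returns a new string and is safe on a last line without "\n"
# (chomp! returns nil when there is nothing to remove).
# The -1 limit keeps empty trailing fields instead of dropping them.
def split_passwd_lines(lines)
  lines.map { |line| line.chomp.split(':', -1) }
end

sample = [
  "root:x:0:0:root:/root:/bin/bash\n",
  "daemon:x:1:1:daemon:/usr/sbin:/bin/sh"
]
split_passwd_lines(sample).each { |fields| p fields }
```

On Ruby 2.4 or later, File.readlines(path, chomp: true) should hand back the lines with the terminator already removed, which makes the explicit chomp unnecessary.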

How to print from specific column range?

I want to grab only the first line of columns 46 to 245 of source.txt and write it to output.txt
source_file.each { |line|
File.open(output_file,"a+") { |f|
f.print ???
}
Bonus: I also need to keep a count of the number of characters in this range, as some may be whitespace. i.e. 38 characters and the rest whitespace.
Example:
source_file: (first line only, columns 45 to 245): 13287912721981239854 + 180 blank columns
output_file: 13287912721981239854
count = 20 characters
Update: appending [46..245].delete(' ').size gives me the desired count.
If I am understanding what you are asking correctly, there's no reason to grab the whole file when you only want the first line. If this isn't what you're asking for, then you need to specify what you're trying to pull out of the source file more clearly.
This should grab the data you need:
output_line = source_file.gets[45..244]
If you write:
source_file.each { |line|
File.open(output_file,"a+") { |f|
f.print ???
}
}
You will open, then close, your output file for each line read from the input file. That is the wrong way to do it, even if you only want to read one line of input.
Instead try something like one of these:
File.open(output_file, 'a') do |fo|
File.open('path/to/input_file') do |fi|
fo.puts fi.readline[46..245]
end
end
This uses IO#readline, which reads a single line from the file. When the blocks finish, both the input and output files are closed automatically. Also, it opens the output file as 'a', which is append-only mode. 'a+' is wrong unless you intend to append and read, which is rarely done. From the documentation:
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing
Or:
File.open(output_file, 'a') do |fo|
File.foreach('path/to/input_file') do |li|
fo.puts li[46..245]
break
end
end
foreach is used most often when we're reading a file line-by-line. It's the mainstay for reading files in a scalable manner. It wants to loop over the file inside the block, which is why break is there, to break out of that loop.
Or:
File.foreach('path/to/input_file') do |li|
File.write(output_file, li[46..245], -1, :mode => 'a')
break
end
File.write is useful when you have a blob of text or binary, and want to write it in one chunk, then move on. The -1 tells Ruby to move to the end of the file. :mode => 'a' overrides the default mode which would normally truncate an existing file.
Maybe this will do the job:
line = f.readline
columns = line.split
File.open("output.txt", "w") do |out|
columns[46, (245 - 46 + 1)].each do |column|
out.puts column
end
end
break # only process first line
I have used 245 - 46 + 1 to indicate the number of columns we are interested in. I have also assumed that columns are separated by whitespace. If that is not the case, you will need to pass the appropriate delimiter to split.
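To make the index arithmetic explicit, here is a small sketch that treats "columns" as 1-based character positions, which is an assumption about what the question means. The helper name column_slice and the sample line are invented for illustration.

```ruby
# Return the characters in 1-based columns first_col..last_col of line.
# String indices are 0-based, hence the "- 1" on both ends.
def column_slice(line, first_col, last_col)
  line.chomp[(first_col - 1)..(last_col - 1)].to_s
end

# Hypothetical first line: 45 filler columns, a 20-digit value, then blanks.
line = ('x' * 45) + '13287912721981239854' + (' ' * 180) + "\n"

field = column_slice(line, 46, 245)
count = field.delete(' ').size   # characters in the range that are not spaces

puts field.rstrip
puts "count = #{count} characters"
```

The to_s call covers lines that are shorter than first_col, where the slice would otherwise be nil.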

Why do I have a trailing column when reading a CSV file?

I have a CSV file with the following structure:
"customer_id";"customer_name";"quantity";
"id1234";"Henry";"15";
Parsing with Ruby's standard CSV lib:
csv_data = CSV.read(pathtofile,{
:headers => :first_row,
:col_sep => ";",
:quote_char => '"',
:row_sep => "\r\n" #setting it to "\r" or "\n" results in MalformedCSVError
})
puts csv_data.headers.count #4
I don't understand why the parsing seems to result in four columns although the file only contains three. Is this not the right approach to parse the file?
The ; at the end of each row is implying another field, even though there is no value.
I would either remove the trailing ;'s or just ignore the fourth field when it is parsed.
The trailing ; is the culprit.
You can preprocess the file, stripping the trailing ;, but that incurs unnecessary overhead.
You can post-process the returned array of data from CSV using something like this:
csv_data = CSV.read(...).each(&:pop)
That will iterate over the sub-arrays, removing the last element in each (each rather than map, because pop returns the removed element, not the row). The problem is that read isn't scalable, so you might want to rethink using it and instead use CSV.foreach to read the file line by line and pop the last value from each row as it is returned to you.
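A minimal sketch of the row-by-row variant, run here against an in-memory copy of the question's sample so it is self-contained. The helper name parse_without_trailing_field is made up; for a real file, CSV.foreach(path, col_sep: ';') yields rows one at a time in the same way.

```ruby
require 'csv'

# Hypothetical helper: parse semicolon-separated text and drop the empty
# field that the trailing ";" produces at the end of every row.
def parse_without_trailing_field(text)
  CSV.parse(text, col_sep: ';').map do |row|
    # Only drop the last field when it is really empty (parsed as nil).
    row.last.nil? ? row[0...-1] : row
  end
end

sample = <<~ROWS
  "customer_id";"customer_name";"quantity";
  "id1234";"Henry";"15";
ROWS

parse_without_trailing_field(sample).each { |row| p row }
```

Checking for nil before dropping the field avoids losing real data in rows that do not end with the separator.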

How can I further process the line of data that causes the Ruby FasterCSV library to throw a MalformedCSVError?

The incoming data file(s) contain malformed CSV data such as non-escaped quotes, as well as (valid) CSV data such as fields containing new lines. If a CSV format error is detected I would like to use an alternative routine on that data.
With the following sample code (abbreviated for simplicity)
FasterCSV.open( file ){|csv|
row = true
while row
begin
row = csv.shift
break unless row
# Do things with the good rows here...
rescue FasterCSV::MalformedCSVError => e
# Do things with the bad rows here...
next
end
end
}
The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?
require 'csv' #CSV in ruby 1.9.2 is identical to FasterCSV
# File.open('test.txt','r').each do |line|
DATA.each do |line|
begin
CSV.parse(line) do |row|
p row #handle row
end
rescue CSV::MalformedCSVError => er
puts er.message
puts "This one: #{line}"
# and continue
end
end
# Output:
# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]
__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid
Just feed the file line by line to FasterCSV and rescue the error.
This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:
class MyFile < File
attr_accessor :last_gets
def gets(*args)
line = super
# Buffer the raw text read since the last reset; nil (EOF) is skipped.
(@last_gets ||= +'') << line if line
line
end
end
# then...
file = MyFile.open(filename, 'r')
csv = FasterCSV.new file
row = true
while row
begin
break unless row = csv.shift
# do things with the good row here...
rescue FasterCSV::MalformedCSVError => e
bad_row = file.last_gets
# do something with bad_row here...
next
ensure
file.last_gets = '' # nuke the @last_gets "buffer"
end
end
Kinda neat, right? BUT! there are caveats, of course:
I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This might or might not fail if your CSV file contains newline characters inside quoted fields. The reason for this is described in the source: basically, if a quoted value contains a newline then shift has to do additional gets calls to get the entire line. There could be a clever way around this limitation but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields then this shouldn't be a worry for you, though.
Your other option would be to read the file using File.gets and pass each line in turn to FasterCSV#parse_line but I'm pretty sure in so doing you'd squander any performance advantage gained from using FasterCSV.
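That last option could be sketched roughly as follows. It assumes the modern CSV standard library rather than FasterCSV, and the method name partition_csv_lines is invented for this example. As noted above, feeding physical lines one at a time means a valid quoted field that spans several lines will be reported as bad.

```ruby
require 'csv'

# Hypothetical helper: parse each physical line on its own and separate
# the rows that parse from the raw lines that do not.
def partition_csv_lines(lines)
  good = []
  bad = []
  lines.each do |line|
    begin
      row = CSV.parse_line(line)
      good << row if row          # parse_line returns nil for an empty string
    rescue CSV::MalformedCSVError
      bad << line                 # keep the raw text for the alternative routine
    end
  end
  [good, bad]
end

good, bad = partition_csv_lines(["1,aaa,valid\n", "2,\"bbb,invalid\n", "3,ccc,valid\n"])
p good
p bad
```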
I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,
class MyFile < File
def gets(*args)
line = super
if line != nil
line.gsub!('\\"','""') # fix the \" that would otherwise cause a parse error
end
line
end
end
infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
# process each row here
end
This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.
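The substitution itself can be tried in isolation. This is only a sketch on a single made-up line, and the helper name normalize_escaped_quotes is hypothetical.

```ruby
require 'csv'

# Rewrite backslash-escaped quotes (\") as doubled quotes ("") so that
# the standard parser accepts the line.
def normalize_escaped_quotes(line)
  line.gsub('\\"', '""')
end

raw = '1,"say \\"hi\\"",3'   # the text: 1,"say \"hi\"",3
row = CSV.parse_line(normalize_escaped_quotes(raw))
p row
```

A blanket substitution like this would also rewrite a field that legitimately ends in a backslash right before its closing quote, so it is only safe when the data is known not to contain that case.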

What's the best way to parse a tab-delimited file in Ruby?

What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
require 'csv'
line = "boogie\ttime\tis \"now\""
begin
line = CSV.parse_line(line, col_sep: "\t")
puts "parsed correctly"
rescue CSV::MalformedCSVError
puts "failed to parse line"
end
begin
line = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
puts "failed to parse line with random quote char"
end
#Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library you could use a quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler methodology like the StrictTsv class shown below to get the same effect without having to worry about field quotations.
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
attr_reader :filepath
def initialize(filepath)
@filepath = filepath
end
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
fields = Hash[headers.zip(line.split("\t"))]
yield fields
end
end
end
end
# Example Usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.chomp.split("\t", -1) (note chomp rather than rstrip, which would also strip trailing tabs, and -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
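The plain line-by-line split for the second kind of file can be sketched like this. The helper name split_tsv_line is made up, and the sketch assumes the strict format described above, with no quoting and no tabs or newlines inside fields.

```ruby
# Split one line of strict (IANA-style) TSV into its fields.
# chomp removes only the line terminator, so trailing tabs survive,
# and the -1 limit makes split keep the empty trailing fields.
def split_tsv_line(line)
  line.chomp.split("\t", -1)
end

fields = split_tsv_line("a\t\tc\t\t\n")
p fields
```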
I like mmmries's answer. HOWEVER, I hate the way that Ruby strips any empty values off the end of a split. It isn't stripping off the newline at the end of the lines, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
def parse
open(filepath) do |f|
headers = f.gets.strip.split("\t")
f.each do |line|
myline=line
while myline.scan(/\t/).count != headers.count-1
myline+=f.gets
end
fields = Hash[headers.zip(myline.chomp.split("\t",headers.count))]
yield fields
end
end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).
