Here is what my CSV looks like: http://tinypic.com/r/kuwk6/5
And here is my code:
require 'csv'

File.open("/Users/Katie/Downloads/File_Name.csv", encoding: "ISO-8859-1").each_line do |line|
  line.chomp!
  CSV.parse(line, col_sep: "\t") do |row|
    unless row[4].nil?
      puts row[4].split("&Wt.srch=1")[0]
    end
  end
end
I had issues with special characters, which is why the encoding is in there. Because I'm on a Mac, Excel does something weird to the line endings when I open a CSV, so I put in the line.chomp!. The file is technically tab-delimited, which is why I set col_sep to tabs.
Basically, I want the URL to be split at "&Wt.srch=1", but I only want the first part of the string after the split, which is why I added the [0].
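For example, with a made-up URL of the shape I'm dealing with:

"https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah".split("&Wt.srch=1")
#=> ["https://www.exampl.com?f1=1", "&utm=2&utm2=blah"]

so [0] should give just the part before "&Wt.srch=1".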
When I run the code without the "unless" line, it raises: in `block (2 levels) in <main>': undefined method `split' for nil:NilClass (NoMethodError)
This makes me think it sees the column as empty, when in fact it's not. Of course, when I put the "unless" line back in, the script runs just fine, but it doesn't actually split the URL string.
Sorry if this is a really basic / easy problem... Thanks in advance for your help!
You don't need CSV.parse to do this.
With tabs:
File:
c1 c2 c3 c4 c5
Hello Alpha Example More https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Thanks Bravo Example some https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Blah Charlie Example stuff https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Script:
# returns each line of the CSV file as a string
File.open("/Users/Katie/Downloads/File_Name.csv").each_line do |line|
  # splits the line at the tab character into a row Array
  row = line.chomp.split("\t")
  unless row[4].nil?
    puts row[4].split("&Wt.srch=1")[0]
  end
end
Output:
c5
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
With Commas:
File:
c1,c2,c3,c4,c5
Hello,Alpha,Example,More,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Thanks,Bravo,Example,some,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Blah,Charlie,Example,stuff,https://www.exampl.com?f1=1&Wt.srch=1&utm=2&utm2=blah
Script:
# returns each line of the CSV file as a string
File.open("/Users/Katie/Downloads/File_Name.csv").each_line do |line|
  # splits the line at the comma into a row Array
  row = line.chomp.split(",")
  unless row[4].nil?
    puts row[4].split("&Wt.srch=1")[0]
  end
end
Output:
c5
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
https://www.exampl.com?f1=1
Script to handle the use of encoding with "ISO-8859-1":
File.open("/Users/Katie/Downloads/File_Name.csv", encoding: "ISO-8859-1").each_line do |line|
  # splits the line on whitespace into a row Array, dropping any empty fields
  row = line.chomp.split(" ").delete_if { |r| r.strip.empty? }
  unless row[4].nil?
    puts row[4].split("&Wt.srch=1")[0]
  end
end
The way you have it set up, you are looping through the lines and then splitting each line into individual strings with CSV.parse, so row can end up holding just a single "cell" rather than the array of cells you expect.
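For instance, if a line actually uses commas rather than tabs, col_sep: "\t" leaves the whole line in row[0], so row[4] is nil (a quick sketch with a made-up line):

require 'csv'

CSV.parse("a,b,c,d,e", col_sep: "\t") { |row| p row }
#=> ["a,b,c,d,e"]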
I am opening a CSV file and then converting it to JSON. This is all working fine, except that the JSON data has \n characters in the strings. As far as I can tell from printing it and trying to chomp it, these are not part of the last element. When I print the row, it does have \n in it.
require 'csv'
require 'json'

def csv_to_json(tmpfile)
  json_array = Array.new
  CSV.foreach(tmpfile) do |row|
    print row[row.length - 1]
    if row[row.length - 1].chomp! == nil
      print row
    end
    json_array.push(row)
  end
  return json_array.to_json
end
The JSON then looks like this when it is returned
["field11,field12\n",
"field21,field22\n"]
How can I remove these new line characters?
EDIT:
These are CSV::Row objects and do not support string operations like chomp or strip
tmpfile is in the format
field11,field21
field21,field22
Set the row_sep to nil.
json_array.push(row.to_s(row_sep: nil))
or
json_array.push(row.to_csv(row_sep: nil))
As a comment pointed out, CSV::Row#to_s is an alias for CSV::Row#to_csv, which automatically adds a row separator after each row. To get around this, you can set row_sep to nil and it will not add \n at the end of each row.
Hope that helps.
The simplest way:
File.read(tmpfile).split("\n")
By the way, if you want to remove the newline from a string, you could use the String#strip method.
CSV.foreach(tmpfile) do |row|
  # here row should be an array
  p row
end
CSV.foreach(tmpfile) do |row|
  print row[row.length - 1]
  if row[row.length - 1].chomp! == nil
    print row
  end
  row.map { |cell| cell.strip! }
  json_array.push(row)
end
The row doesn't support stripping, but the cells do.
I was able to get it to work using a map! after the fact
json_array.map! { |row| row.to_s.chomp }
You could also do the to_s.chomp inside of the loop (see the sketch below). That wasn't an option for me because I needed the regular row objects to do some calculations before returning the JSON.
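A minimal sketch of that in-loop variant, assuming (per the edit above) that the rows are CSV::Row objects, which is what CSV.foreach yields when headers: true is passed:

require 'csv'
require 'json'

def csv_to_json(tmpfile)
  json_array = []
  # headers: true is an assumption here; it makes foreach yield CSV::Row objects
  CSV.foreach(tmpfile, headers: true) do |row|
    # chomp (not chomp!) so a row without a trailing "\n" isn't turned into nil
    json_array.push(row.to_s.chomp)
  end
  json_array.to_json
end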
I want to grab columns 46 to 245 of only the first line of source.txt and write them to output.txt.
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
Bonus: I also need to keep a count of the number of characters in this range, as some may be whitespace. i.e. 38 characters and the rest whitespace.
Example:
source_file: (first line only, columns 45 to 245): 13287912721981239854 + 180 blank columns
output_file: 13287912721981239854
count = 20 characters
Update: appending [46..245].delete(' ').size gives me the desired count.
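For reference, a minimal sketch of that count, where first_line stands for the line read from the source file:

first_line = File.open("source.txt", &:readline)
count = first_line[46..245].to_s.delete(' ').size # characters in the slice, ignoring spaces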
If I am understanding what you are asking correctly, there's no reason to grab the whole file when you only want the first line. If this isn't what you're asking for, then you need to specify what you're trying to pull out of the source file more clearly.
This should grab the data you need:
output_line = source_file.gets[45..244]
If you write:
source_file.each { |line|
  File.open(output_file, "a+") { |f|
    f.print ???
  }
}
You will open, then close, your output file for each line read from the input file. That is the wrong way to do it, even if you only want to read one line of input.
Instead try something like one of these:
File.open(output_file, 'a') do |fo|
  File.open('path/to/input_file') do |fi|
    fo.puts fi.readline[46..245]
  end
end
This uses IO#readline, which reads a single line from the file. The block falls through afterwards, causing both the input and output files to be closed automatically. Also, it opens the output file with 'a', which is append-only. 'a+' is wrong unless you intend to append and read, which is rarely done. From the documentation:
"a+" Read-write, starts at end of file if file exists,
otherwise creates a new file for reading and
writing
Or:
File.open(output_file, 'a') do |fo|
  File.foreach('path/to/input_file') do |li|
    fo.puts li[46..245]
    break
  end
end
foreach is used most often when we're reading a file line by line; it's the mainstay for reading files in a scalable manner. It loops over the entire file inside the block, which is why break is there: to stop after the first line.
Or:
File.foreach('path/to/input_file') do |li|
  File.write(output_file, li[46..245], -1, :mode => 'a')
  break
end
File.write is useful when you have a blob of text or binary, and want to write it in one chunk, then move on. The -1 tells Ruby to move to the end of the file. :mode => 'a' overrides the default mode which would normally truncate an existing file.
Maybe this will do the job:
File.open("source.txt") do |f|
  line = f.readline # only process the first line
  columns = line.split
  File.open("output.txt", "w") do |out|
    columns[46, (245 - 46 + 1)].each do |column|
      out.puts column
    end
  end
end
I have used 245 - 46 + 1 to indicate that this is the number of columns we are interested in. I have also assumed that columns are separated by whitespace. If that is not the case, you will need to change the delimiter passed to split.
The incoming data file(s) contain malformed CSV data, such as non-escaped quotes, as well as (valid) CSV data, such as fields containing newlines. If a CSV format error is detected, I would like to use an alternative routine on that data.
With the following sample code (abbreviated for simplicity)
FasterCSV.open(file) { |csv|
  row = true
  while row
    begin
      row = csv.shift
      break unless row
      # Do things with the good rows here...
    rescue FasterCSV::MalformedCSVError => e
      # Do things with the bad rows here...
      next
    end
  end
}
The MalformedCSVError is caused in the csv.shift method. How can I access the data that caused the error from the rescue clause?
require 'csv' # CSV in Ruby 1.9.2 is identical to FasterCSV

# File.open('test.txt', 'r').each do |line|
DATA.each do |line|
  begin
    CSV.parse(line) do |row|
      p row # handle row
    end
  rescue CSV::MalformedCSVError => er
    puts er.message
    puts "This one: #{line}"
    # and continue
  end
end
# Output:
# Unclosed quoted field on line 1.
# This one: 1,"aaa
# Illegal quoting on line 1.
# This one: aaa",valid
# Unclosed quoted field on line 1.
# This one: 2,"bbb
# ["bbb", "invalid"]
# ["3", "ccc", "valid"]
__END__
1,"aaa
aaa",valid
2,"bbb
bbb,invalid
3,ccc,valid
Just feed the file line by line to FasterCSV and rescue the error.
This is going to be really difficult. Some things that make FasterCSV, well, faster, make this particularly hard. Here's my best suggestion: FasterCSV can wrap an IO object. What you could do, then, is to make your own subclass of File (itself a subclass of IO) that "holds onto" the result of the last gets. Then when FasterCSV raises an exception you can ask your special File object for the last line. Something like this:
class MyFile < File
  attr_accessor :last_gets

  def initialize(*args)
    super
    @last_gets = ''
  end

  def gets(*args)
    line = super
    @last_gets << $/ << line if line
    line
  end
end
# then...
file = MyFile.open(filename, 'r')
csv = FasterCSV.new file
row = true
while row
  begin
    break unless row = csv.shift
    # do things with the good row here...
  rescue FasterCSV::MalformedCSVError => e
    bad_row = file.last_gets
    # do something with bad_row here...
    next
  ensure
    file.last_gets = '' # nuke the @last_gets "buffer"
  end
end
Kinda neat, right? BUT! there are caveats, of course:
I'm not sure how much of a performance hit you take when you add an extra step to every gets call. It might be an issue if you need to parse multi-million-line files in a timely fashion.
This might fail if your CSV file contains newline characters inside quoted fields. The reason is described in the source: basically, if a quoted value contains a newline, then shift has to make additional gets calls to read the entire row. There could be a clever way around this limitation, but it's not coming to me right now. If you're sure your file doesn't have any newline characters within quoted fields, then this shouldn't be a worry for you.
Your other option would be to read the file line by line with IO#gets and pass each line in turn to FasterCSV.parse_line, but I'm pretty sure that in doing so you'd squander any performance advantage gained from using FasterCSV.
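A rough sketch of that line-by-line alternative (filename here is a placeholder, and this assumes no newlines inside quoted fields, since each physical line is parsed on its own):

File.foreach(filename) do |line|
  begin
    row = FasterCSV.parse_line(line)
    # do things with the good row here...
  rescue FasterCSV::MalformedCSVError => e
    # do things with the bad line here...
  end
end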
I used Jordan's file subclassing approach to fix the problem with my input data before CSV ever tries to parse it. In my case, I had a file that used \" to escape quotes, instead of the "" that CSV expects. Hence,
class MyFile < File
  def gets(*args)
    line = super
    if line != nil
      line.gsub!('\\"', '""') # fix the \" that would otherwise cause a parse error
    end
    line
  end
end

infile = MyFile.open(filename)
incsv = CSV.new(infile)
while row = incsv.shift
  # process each row here
end
This allowed me to parse the non-standard CSV file. Ruby's CSV implementation is very strict and often has trouble with the many variants of the CSV format.
So I've written some code in Ruby to split a text file into individual lines and then group those lines based on a delimiter character. This output is written to an array, which is passed to a method that spits out HTML into a text file. I started running into problems when I tried to use gsub in different methods to replace placeholders in an HTML text file with values from the record array: Ruby kept telling me that I was passing in nil values. After trying to debug that part of the program for several hours, I decided to look elsewhere, and I think I'm on to something. A modified version of the program is posted below.
Here is a sample of the input text file:
26188
WHL
1
Delco
B-7101
A-63
208-220/440
3
285 w/o pallet
1495.00
C:/img_converted/26188B.jpg
EDM Machine Part 2 of 3
AC Motor, 3/4 Hp, Frame 182, 1160 RPM
|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|
Here is a snippet of the code that I've been testing with:
# function to import a file as a string
def file_as_string(filename)
  data = ''
  f = File.open(filename, "r")
  f.each_line do |line|
    data += line
  end
  return data
end
Dir.glob("single_listing.jma") do |filename|
  content = file_as_string(filename)
  content = content.gsub(/\t/, "\n")
  database_array = Array.new
  database_array = content.split("|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|")
  for i in database_array do
    record = Array.new
    record = i.split("\n")
    puts record[0]
    puts record[0].class
  end
end
When that code is run, I get this output:
john#starfire:~/code/ruby/idealm_db_parser$ ruby putsarray.rb
26188
String
nil
NilClass
...which means that record[0] is apparently a String on one pass and nil on the next. Why is this?
Your database_array has more elements than you think.
Your end-of-stanza marker, |--|--|...|--|, has a newline after it. So, file_as_string returns something like this:
"26188\nWHL...|--|--|\n"
and is then split() on end-of-stanza into something like this:
["26188\nWHL...1160 RPM\n", "\n"] # <---- Note the last element here!
You then split each again, but "\n".split("\n") gives an empty array, the first element of which comes back as nil.
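A quick irb check shows the same thing:

"\n".split("\n") #=> []
[][0]            #=> nil
[][0].class      #=> NilClass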
What's the best (most efficient) way to parse a tab-delimited file in Ruby?
The Ruby CSV library lets you specify the field delimiter. Ruby 1.9 uses FasterCSV. Something like this would work:
require "csv"
parsed_file = CSV.read("path-to-file.csv", col_sep: "\t")
The rules for TSV are actually a bit different from CSV. The main difference is that CSV has provisions for sticking a comma inside a field and then using quotation characters and escaping quotes inside a field. I wrote a quick example to show how the simple response fails:
require 'csv'

line = "boogie\ttime\tis \"now\""

begin
  parsed = CSV.parse_line(line, col_sep: "\t")
  puts "parsed correctly"
rescue CSV::MalformedCSVError
  puts "failed to parse line"
end

begin
  parsed = CSV.parse_line(line, col_sep: "\t", quote_char: "Ƃ")
  puts "parsed correctly with random quote char"
rescue CSV::MalformedCSVError
  puts "failed to parse line with random quote char"
end
#Output:
# failed to parse line
# parsed correctly with random quote char
If you want to use the CSV library, you could use a random quote character that you don't expect to see in your file (the example shows this), but you could also use a simpler approach like the StrictTsv class shown below to get the same effect without having to worry about field quoting.
# The main parse method is mostly borrowed from a tweet by @JEG2
class StrictTsv
  attr_reader :filepath

  def initialize(filepath)
    @filepath = filepath
  end

  def parse
    open(filepath) do |f|
      headers = f.gets.strip.split("\t")
      f.each do |line|
        fields = Hash[headers.zip(line.split("\t"))]
        yield fields
      end
    end
  end
end
# Example usage
tsv = StrictTsv.new("your_file.tsv")
tsv.parse do |row|
  puts row['named field']
end
The choice of using the CSV library or something more strict just depends on who is sending you the file and whether they are expecting to adhere to the strict TSV standard.
Details about the TSV standard can be found at http://en.wikipedia.org/wiki/Tab-separated_values
There are actually two different kinds of TSV files.
TSV files that are actually CSV files with a delimiter set to Tab. This is something you'll get when you e.g. save an Excel spreadsheet as "UTF-16 Unicode Text". Such files use CSV quoting rules, which means that fields may contain tabs and newlines, as long as they are quoted, and literal double quotes are written twice. The easiest way to parse everything correctly is to use the csv gem:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t")
TSV files conforming to the IANA standard. Tabs and newlines are not allowed as field values, and there is no quoting whatsoever. This is something you will get when you e.g. select a whole Excel spreadsheet and paste it into a text file (beware: it will get messed up if some cells do contain tabs or newlines). Such TSV files can be easily parsed line-by-line with a simple line.rstrip.split("\t", -1) (note -1, which prevents split from removing empty trailing fields). If you want to use the csv gem, simply set quote_char to nil:
require 'csv'
parsed = CSV.read("file.tsv", col_sep: "\t", quote_char: nil)
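For the plain line-by-line approach mentioned above, a minimal sketch (file.tsv is just the example file name used here):

File.foreach("file.tsv") do |line|
  fields = line.rstrip.split("\t", -1) # -1 keeps empty trailing fields
  p fields
end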
I like mmmries' answer. HOWEVER, I hate the way that Ruby strips any empty values off the end of a split. It isn't stripping the newline at the end of the lines, either.
Also, I had a file with potential newlines within a field. So, I rewrote his 'parse' as follows:
def parse
  open(filepath) do |f|
    headers = f.gets.strip.split("\t")
    f.each do |line|
      myline = line
      while myline.scan(/\t/).count != headers.count - 1
        myline += f.gets
      end
      fields = Hash[headers.zip(myline.chomp.split("\t", headers.count))]
      yield fields
    end
  end
end
This concatenates any lines as necessary to get a full line of data, and always returns the full set of data (without potential nil entries at the end).