Ruby CSV: How to skip the first two lines of a file?

I have a file where the first line is a useless line, and the 2nd is a header. The problem is that when I'm looping through the file, it counts those as rows. Is there a way to use foreach with options to skip 2 lines? I know there's a read method on CSV, but that loads the data into RAM, and if the file is too big I don't think it'll scale well.
However, if there is no other option I will consider it. This is what I have so far:
CSV.foreach(filename, col_sep: "\t") do |row|
  until listings.size == limit
    listing_class = 'Sale'
    address = row[7]
    unit = row[8]
    price = row[2]
    url = row[0]
    listings << {listing_class: listing_class, address: address, unit: unit, url: url, price: price}
  end
end

I didn't benchmark, but try this:
CSV.to_enum(:foreach, filename, col_sep: "\t").drop(2).each do |row|
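A fuller sketch of that, reusing the loop body from the question; note that Enumerable#drop builds an array of all remaining rows, so for a file that is too big for RAM the lazy form below keeps it streaming:
listings = []
CSV.to_enum(:foreach, filename, col_sep: "\t").lazy.drop(2).each do |row|
  break if listings.size == limit
  listings << { listing_class: 'Sale', address: row[7], unit: row[8],
                url: row[0], price: row[2] }
end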

Use a counter variable: initialize it to 0 and increment it at every line; if it's smaller than 2, skip to the next row.
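A minimal sketch of that (filename and col_sep taken from the question):
line_number = 0
CSV.foreach(filename, col_sep: "\t") do |row|
  if line_number < 2   # first two lines: the junk line and the header
    line_number += 1
    next
  end
  # ... process row as usual ...
end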

You can also use #read or #readlines, like so:
CSV.readlines(filename, col_sep: "\t")[2..-1].each do |row|
#readlines is an alias for #read, so it does not matter which you use. It parses the CSV into an Array of Arrays, and [2..-1] means rows 3 through the end.
Both this and #Nakilon's answer are probably better and definitely cleaner than using a counter.
As always, Ruby's classes are well documented, and reading the docs can be much more beneficial than just waiting for someone to hand you an answer.

Related

Ruby script which can replace a string in a binary file with a different, but same-length, string?

I would like to write a Ruby script (repl.rb) which can replace a string in a binary file (the string is defined by a regex) with a different, but same-length, string.
It works like a filter and outputs to STDOUT, which can be redirected (ruby repl.rb data.bin > data2.bin); the regex and replacement can be hardcoded. My approach is:
#!/usr/bin/ruby
fn = ARGV[0]
regex = /\-\-[0-9a-z]{32,32}\-\-/
replacement = "--0ca2765b4fd186d6fc7c0ce385f0e9d9--"
blk_size = 1024

File.open(fn, "rb") {|f|
  while not f.eof?
    data = f.read(blk_size)
    data.gsub!(regex, replacement)
    print data
  end
}
My problem is that the string's position in the file can interfere with the block size used to read the binary file. For example, when blk_size=1024 and the first occurrence of the string begins at byte position 1000, the match is split across two reads, so I will not find it in the data variable. The same can happen on any later read cycle. Should I process the whole file twice with different block sizes to avoid this worst-case scenario, or is there another approach?
I would posit that a tool like sed might be a better choice for this. That said, here's an idea: Read block 1 and block 2 and join them into a single string, then perform the replacement on the combined string. Split them apart again and print block 1. Then read block 3 and join block 2 and 3 and perform the replacement as above. Split them again and print block 2. Repeat until the end of the file. I haven't tested it, but it ought to look something like this:
File.open(fn, "rb") do |f|
  last_block, this_block = nil, nil
  until f.eof?
    last_block, this_block = this_block, f.read(blk_size)
    # Join the previous block with the new one so a match straddling the
    # boundary is visible to gsub; the replacement is the same length, so
    # positions are preserved.
    data = "#{last_block}#{this_block}".gsub(regex, replacement)
    # Print only the previous block's worth and carry the rest into the next pass.
    last_block, this_block = data.slice!(0, last_block.to_s.length), data
    print last_block
  end
  print this_block
end
There's probably a nontrivial performance penalty for doing it this way, but it could be acceptable depending on your use case.
Maybe a cheeky
f.pos = f.pos - replacement.size
at the end of the while loop, just before reading the next chunk.

How to print elements with the same XPath under the same column (same header)

I'm trying to parse an XML file with REXML on Ruby.
What I want is to print all values, with the corresponding element name as a header. The issue I have
is that some nodes have repeated child elements that share the same XPath, so I want those
elements printed in the same column. For the small sample below, the output I'm looking for
from the elements of Node_XX would be:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to get the repeated
elements printed in the same column.
Thanks in advance for any help.
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML

xmldoc = Document.new File.new("input.xml")

arr_H_Xpath  = [] # Array to store each XPath only once (no repeated XPaths)
arr_H_Values = [] # Array for headers (each child element's name)
arr_Values   = [] # Values of each child element

xmldoc.elements.each("//Node_XYZ") {|element|
  element.each_recursive do |child|
    # Check that the element has text and its XPath is not already stored in arr_H_Xpath.
    if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/,""))
      arr_H_Xpath  << child.xpath.gsub(/\[.\]/,"")    # Remove the [..] from repeated XPaths
      arr_H_Values << child.xpath.gsub(/\/\w.*\//,"") # Get only the child element's name to use as a header
      arr_Values   << child.text
    end
    print arr_H_Values.join("|") + "|"
    arr_H_Values.clear
  end
  puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
  <NodeX>
    <Node_XX>
      <RepCD_valm>
        <RepVal>MTSJ</RepVal>
      </RepCD_valm>
      <RepCD_yur>
        <Yur>
          <YurVal>AB01-J</YurVal>
        </Yur>
      </RepCD_yur>
      <CodesDif>
        <CD_Ranges>
          <CD_val>45</CD_val>
          <HJY_val>01</HJY_val>
          <CD_Sub>
            <CD_SubA>87</CD_SubA>
            <CD_SubB>12</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
      <CodesDif>
        <CD_Ranges>
          <CD_val>34</CD_val>
          <HJY_val>11</HJY_val>
          <CD_Sub>
            <CD_SubA>43</CD_SubA>
            <CD_SubB>62</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
    </Node_XX>
    <Node_XY>
      ....
      ....
      ....
    </Node_XY>
  </NodeX>
</TopNode>
Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key to represent the additional row(s) of data: a Multimap (e.g. the multimap gem). It is like a hash that can hold multiple values per key.
With the multimap, you can store the elements as key-value pairs:
require 'multimap'   # the Multimap class used below comes from the multimap gem
require 'nokogiri'   # assumed here, since #xpath and #inner_text are Nokogiri-style calls

doc  = Nokogiri::XML(File.read('input.xml'))
data = Multimap.new
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
require 'csv'

out = CSV.generate(col_sep: "|") do |csv|
  columns = data.keys.to_a.uniq
  csv << columns
  until data.values.empty?
    csv << columns.map { |col| data[col].shift }
  end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you wanted to create an output file directly, use CSV.open instead. See the CSV class for more information. I added the col_sep option to delimit with a pipe character instead of the default of a comma.
Getting the list of columns would just be the keys if data were a plain hash. But since it is a Multimap, which repeats key names, I have to call .to_a.uniq on it. Then I add them to the output using csv << columns.
To create the second row (and any subsequent rows), we take the first remaining value for each key of data. That's what data[col].shift does: it removes and returns the first value from each key's list of values. The loop keeps going as long as there are more values (more rows).
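For reference, a tiny sketch of the Multimap behaviour relied on above (assuming the multimap gem):
require 'multimap'

m = Multimap.new
m["CD_val"] = "45"    # the []= writer appends rather than overwrites
m["CD_val"] = "34"
m["CD_val"]           #=> ["45", "34"]
m.keys.to_a.uniq      #=> ["CD_val"]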

Finding maximum CSV field sizes for a large file using Ruby

I'm trying to determine maximum field sizes for a large CSV file (~5GB) with over 300 fields that I want to dump into a MySQL table. The CSV file schema I have for the file gives incorrect maximum field lengths, so I run into errors on the table import. I'm running Ruby 2.0 on Windows.
I'm using an array to store the maximum field lengths according to the field's index (or column) location, i.e. ignoring the fields actual name in the header. I've tried fancier things like using hashes, inject, and zip, etc, but it seems that a simple array works fastest here.
field_lengths[0] = Max length of first field
field_lengths[1] = Max length of second field
etc.
The file is too large to slurp at once or parse column-wise using CSV. So, I open the CSV file and use CSV.foreach to parse each line (ignoring the header with the :headers => true option). For each row, I loop through the parsed array of field values and compare each field's length with the current maximum stored in the field_lengths array. I realize there are much easier ways to do this with smaller files, and this method works OK for larger files, but I still haven't been able to make it to the end of my particular file.
To get around not being able to finish the file, I currently define a number of lines to read including the header (=n), and break once I've reached the n-th line. In the example below, I read 101 lines from the CSV file. (1 header row + 100 actual data rows). I'm not sure how many total rows are in the file since the process hasn't finished.
require 'csv'
require 'pp'

data_file = 'big_file.csv'

# We're only reading the first 101 lines in this example
n = 101

field_lengths = []

File.open(data_file) do |f|
  CSV.foreach(f, :headers => true, :header_converters => :symbol) do |csv_row|
    break if $. > n
    csv_row.fields.each_with_index do |a, i|
      field_lengths[i] ||= a.to_s.length
      field_lengths[i] = a.to_s.length if field_lengths[i] < a.to_s.length
    end
  end
end

pp field_lengths
IO#read can read a certain number of bytes, but if I parse the file by bytes the records may get split. Does anyone have alternate suggestions for parsing the CSV file, by splitting it up into smaller files? O'Reilly's Ruby Cookbook (Lucas Carlson & Leonard Richardson, 2006, 1st ed), suggests breaking a large file into chunks (as below), but I'm not sure how to extend it to this example, particularly dealing with the line breaks etc.
class File
  def each_chunk(chunk_size = 1024)
    yield read(chunk_size) until eof?
  end
end

open("bigfile.txt") do |f|
  f.each_chunk(15) {|chunk| puts chunk}
end
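One rough way to make the each_chunk idea line-safe (a sketch, not something the answers cover): hold back the trailing partial line of each chunk and prepend it to the next one. Note this still would not cope with quoted CSV fields that contain embedded newlines.
leftover = ""
File.open("bigfile.txt", "rb") do |f|
  f.each_chunk(1024 * 1024) do |chunk|
    chunk = leftover + chunk
    lines = chunk.split("\n", -1)
    leftover = lines.pop || ""   # possibly incomplete last line
    lines.each { |line| puts line }
  end
end
puts leftover unless leftover.empty?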
You're using CSV.foreach wrong; it takes a filename string, not an IO:
require 'csv'

field_lengths = {}

CSV.foreach(data_file, :headers => true, :header_converters => :symbol) do |csv_row|
  csv_row.each do |k, v|
    field_lengths[k] = [field_lengths[k] || 0, v.to_s.length].max   # v.to_s so empty (nil) fields don't raise
  end
end

pp field_lengths
Given a CSV::Table in a variable called csv, you can change it to be "by column" and use collect or map... or inject. There are lots of approaches.
e.g.
csv.by_col!
field_lengths = csv.map { |col|
  col.map { |r|
    r.is_a?(String) ? r.to_s.length : r.map { |v| v.to_s.length }.max
  }.max
}
The first map iterates through the columns.
The second map iterates through each column pair: the header (a String) followed by the array of row values.
So the third map iterates through the row values, converting each value to a string and returning its length.
The first max is the maximum value string length.
The second max is the maximum of the header's length and the maximum value length.
At the end, you may wish to return the csv to col_or_row
e.g.
csv.by_col_or_row!
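For reference, the csv table assumed above could come from CSV.read with the question's data_file, although that slurps the whole file into memory and so only suits files far smaller than the 5GB one here:
require 'csv'

csv = CSV.read(data_file, headers: true)   # reads the entire file into a CSV::Table
csv.by_col!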

Extract a single-line string containing "foo: XXXX"

I have a file with one or more key:value lines, and I want to pull a key:value out if key=foo. How can I do this?
I can get as far as this:
if File.exist?('/file_name')
  content = open('/file_name').grep(/foo:??/)
I am unsure about the grep portion, and also once I get the content, how do I extract the value?
People like to slurp the files into memory, which, if the file will always be small, is a reasonable solution. However, slurping isn't scalable, and the practice can lead to excessive CPU and I/O waits as content is read.
Instead, because you could have multiple hits in a file, and you're comparing the content line-by-line, read it line-by-line. Line I/O is very fast and avoids the scalability problems. Ruby's File.foreach is the way to go:
File.foreach('path/to/file') do |li|
  puts $1 if li[/foo:\s*(\w+)/]
end
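For instance, with a hypothetical input file containing:
foo: 1234
bar: 42
the loop above would print just 1234.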
Because there are no samples of actual key/value pairs, we're shooting in the dark for valid regex patterns, but this is the basis for how I'd solve the problem.
Try this:
IO.readlines('key_values.txt').find_all{|line| line.match('key1')}
I would recommend reading the file into an array and selecting only the lines you need:
regex = /\A\s?key\s?:/
results = File.readlines('file').inject([]) do |f, l|
  l =~ regex ? f << "key = %s" % l.sub(regex, '') : f
end
This will detect lines starting with key: and add them to results in the form key = value,
where value is the portion after the key: prefix.
so if you have a file like this:
key:1
foo
key:2
bar
key:3
you'll get results like this:
key = 1
key = 2
key = 3
makes sense?
value = File.open('/file_name').read.match("key:(.*)").captures[0] rescue nil
File.read('file_name')[/foo: (.*)/, 1]
#=> XXXX

Problem with initializing a hash in Ruby

I have a text file from which I want to create a Hash for faster access. My text file has the format (space-delimited)
author title date popularity
I want to create a hash in which author is the key and the remaining fields are the value, as an array.
created_hash["briggs"] = ["Manup", "Jun,2007", 10]
Thanks in advance.
require 'date'

created_hash = File.foreach('test.txt', mode: 'rt', encoding: 'UTF-8').
  reduce({}) { |hsh, l|
    name, title, date, pop = l.split
    hsh.tap { |hsh| hsh[name] = [title, Date.parse(date), pop.to_i] }
  }
I threw some type conversion code in there, just for fun. If you don't want that, the loop body becomes even simpler:
k, *v = l.split
hsh.tap {|hsh| hsh[k] = v }
You can also use readlines instead of foreach. Note that IO#readlines reads the entire file into an array first. So, you need enough memory to hold both the entire array and the entire hash. (Of course, the array will be eligible for garbage collection as soon as the loop finishes.)
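A quick sketch of that readlines variant (not from the original answer; same splitting as above, without the type conversion):
created_hash = File.readlines('test.txt').each_with_object({}) do |line, hsh|
  key, *values = line.split
  hsh[key] = values
end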
Just loop through each line of the file, use the first space-delimited item as the hash key and the rest as the hash value. Pretty much exactly as you described.
created_hash = {}
# file_contents is assumed to be the file read into a string, e.g. File.read('test.txt')
file_contents.each_line do |line|
  data = line.split(' ')
  created_hash[data[0]] = data.drop(1)
end
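With an input line like briggs Manup Jun,2007 10 (matching the hash the question asks for), this version stores every field as a string:
created_hash["briggs"]   #=> ["Manup", "Jun,2007", "10"]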
