Ruby CSV gem generating random quotation marks - ruby

I'm trying to generate a CSV file from a SQL query result.
99% of the time it works fine, but in some lines (rows) of the CSV file it generates a quotation mark at the start and the end of the row.
The problem pictured: [screenshot omitted]
I've already checked the content of the SQL cells and it is OK, so I think the problem happens when generating the file.
Here is how the file is being generated:
# load query result
dataset = DB["select
                id
                ,action
              from
                some_table"]

# generate csv file
CSV.open("#{table}.csv", "wb", :write_headers => true, :headers => ["id_cliente|""acao"]) do |csv|
  dataset.each do |dbrow|
    csv << [
      "#{dbrow[:id_cliente]}" + "|" + "#{dbrow[:acao]}"
    ]
  end
end
new_object = $bucket_response.objects.build("#{table}.csv")
new_object.content = open("#{table}.csv")
new_object.acl = :public_read
new_object.save
Is there any way to solve this or improve the generating process?

You must specify the separator option instead of joining the values into a single string yourself:
CSV.open("#{table}.csv", "wb", col_sep: '|', ..., headers: ['id_cliente', 'acao'])
  ...
  csv << [dbrow[:id_cliente], dbrow[:acao]]
  ...
For more info, check the CSV and CSV::Row docs.
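A fuller sketch of the corrected write, reusing the dataset, table and field names from the question (the :write_headers behaviour is assumed to still be wanted):
require 'csv'

CSV.open("#{table}.csv", "wb",
         col_sep: '|',
         write_headers: true,
         headers: ['id_cliente', 'acao']) do |csv|
  dataset.each do |dbrow|
    # pass each field separately: CSV inserts the separator itself and only
    # quotes a field when it actually needs quoting
    csv << [dbrow[:id_cliente], dbrow[:acao]]
  end
end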

Related

Use the smarter_csv gem and process the CSV in chunks - I need to delete rows from a large CSV (2GB) by comparing the key/values with another CSV (1GB)

The following is the code I have used. I am not able to delete the rows from Main.csv when the value of the "name" column in Main.csv equals the value of the "name" column in Sub.csv. Please help me with this. I know I am missing something. Thanks in advance.
require 'rubygems'
require 'smarter_csv'

main_csv = SmarterCSV.process('Main.csv', {:chunk_size => 100}) do |chunk|
  short_csv = SmarterCSV.process('Sub.csv', {:chunk_size => 100}) do |smaller_chunk|
    chunk.each do |each_ch|
      smaller_chunk.each do |small_each_ch|
        each_ch.delete_if{|k,v| v == small_each_ch[:name]}
      end
    end
  end
end
It's a bit of a non-standard scenario for smarter_csv..
Sub.csv has 2000 rows, whereas Main.csv has around 1 million rows.
If all you need to decide is whether the name appears in both files, then you can do this (a sketch follows below):
1) read the Sub.csv file first, and just store the values of name in an array sub_names
2) open an output file for result.csv
3) read the Main.csv file with processing in chunks, and write the data for each row to result.csv if the name does not appear in the array sub_names
4) close the output file - et voilà!
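A minimal sketch of those four steps, assuming both files have a name column (so SmarterCSV yields a :name key) and that simply re-emitting each kept row's values is acceptable output:
require 'csv'
require 'set'
require 'smarter_csv'

# 1) collect the names from the small file; Sub.csv (2000 rows) fits in memory
sub_names = SmarterCSV.process('Sub.csv').map { |row| row[:name] }.to_set

# 2) open the output file
CSV.open('result.csv', 'w') do |out|
  # 3) stream Main.csv in chunks and keep only rows whose :name is not in sub_names
  SmarterCSV.process('Main.csv', :chunk_size => 1000) do |chunk|
    chunk.each do |row|
      out << row.values unless sub_names.include?(row[:name])
    end
  end
end
# 4) the block form of CSV.open closes result.csv automatically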

Ruby CSV: How to skip the first two lines of file?

I have a file where the first line is a useless line and the second is a header. The problem is that when I'm looping through the file, it counts those as rows. Is there a way to use foreach with options to skip 2 lines? I know there's a read method on CSV, but that loads the data into RAM, and if the file is too big I don't think it'll scale well.
However, if there is no other option I will consider it. This is what I have so far:
CSV.foreach(filename, col_sep: "\t") do |row|
  until listings.size == limit
    listing_class = 'Sale'
    address = row[7]
    unit = row[8]
    price = row[2]
    url = row[0]
    listings << {listing_class: listing_class, address: address, unit: unit, url: url, price: price}
  end
end
I didn't benchmark, but try this:
CSV.to_enum(:foreach, filename, col_sep: "\t").drop(2).each do |row|
  # ... process row ...
end
Use a counter variable: initialize it to 0 and increment it at every line; while it is smaller than 2, skip to the next row.
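A minimal sketch of that counter approach, assuming the same filename and tab separator as in the question:
line_number = 0
CSV.foreach(filename, col_sep: "\t") do |row|
  line_number += 1
  next if line_number <= 2  # skip the useless first line and the header
  # ... process row ...
end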
You can also use #read or #readlines like so:
CSV.readlines(filename, col_sep: "\t")[2..-1].each do |row|
#readlines is an alias for #read, so it does not matter which you use; it splits the CSV into an Array of Arrays, and [2..-1] means use rows 3 through the end.
Both this and #Nakilon's answer are probably better and definitely cleaner than using a counter.
As always Ruby classes are well documented and reading the Docs can be much more beneficial than just waiting for someone to hand you an answer.

How to print elements with the same XPath under the same column (same header)

I'm trying to parse an XML file with REXML in Ruby.
What I want is to print all values, with the corresponding element name as a header. The issue I have is that some nodes have child elements that appear repeated and share the same XPath, so I want those elements printed in the same column. For the small sample below, the desired output for the elements of Node_XX would be:
Output I'm looking for:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
What I have so far is the code below, but I don't know how to make the repeated elements print in the same column.
Thanks in advance for any help.
Code I have so far:
Code I have so far:
#!/usr/bin/env ruby
require 'rexml/document'
include REXML

xmldoc = Document.new File.new("input.xml")

arr_H_Xpath = []  # Array to store each XPath only once (no repeated XPaths)
arr_H_Values = [] # Array for headers (each child element's name)
arr_Values = []   # Values of each child element.

xmldoc.elements.each("//Node_XYZ") {|element|
  element.each_recursive do |child|
    # Check if the element has text and its XPath is not yet stored in arr_H_Xpath.
    if (child.has_text? && child.text =~ /^[[:alnum:]]/) && !arr_H_Xpath.include?(child.xpath.gsub(/\[.\]/,""))
      arr_H_Xpath << child.xpath.gsub(/\[.\]/,"")     # Remove the [..] from repeated XPaths
      arr_H_Values << child.xpath.gsub(/\/\w.*\//,"") # Get only the name of the child element to use as a header
      arr_Values << child.text
    end
    print arr_H_Values + "|"
    arr_H_Values.clear
  end
  puts arr_Values.join("|")
}
The input.xml is:
<TopNode>
  <NodeX>
    <Node_XX>
      <RepCD_valm>
        <RepVal>MTSJ</RepVal>
      </RepCD_valm>
      <RepCD_yur>
        <Yur>
          <YurVal>AB01-J</YurVal>
        </Yur>
      </RepCD_yur>
      <CodesDif>
        <CD_Ranges>
          <CD_val>45</CD_val>
          <HJY_val>01</HJY_val>
          <CD_Sub>
            <CD_SubA>87</CD_SubA>
            <CD_SubB>12</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
      <CodesDif>
        <CD_Ranges>
          <CD_val>34</CD_val>
          <HJY_val>11</HJY_val>
          <CD_Sub>
            <CD_SubA>43</CD_SubA>
            <CD_SubB>62</CD_SubB>
          </CD_Sub>
        </CD_Ranges>
      </CodesDif>
    </Node_XX>
    <Node_XY>
      ....
      ....
      ....
    </Node_XY>
  </NodeX>
</TopNode>
Here's one way to solve your problem. It is probably a little unusual, but I was experimenting. :)
First, I chose a data structure that can store the headers as keys and multiple values per key to represent the additional row(s) of data: a Multimap (from the multimap gem). It is like a hash that can hold several values under one key.
With the multimap, you can store the elements as key-value pairs:
require 'multimap'  # from the multimap gem
require 'nokogiri'  # this snippet assumes a Nokogiri-parsed document rather than REXML

doc = Nokogiri::XML(File.read("input.xml"))
data = Multimap.new
doc.xpath('//RepVal|//YurVal|//CD_val|//HJY_val|//CD_SubA|//CD_SubB').each do |elem|
  data[elem.name] = elem.inner_text
end
The content of data is:
{"RepVal"=>["MTSJ"],
"YurVal"=>["AB01-J"],
"CD_val"=>["45", "34"],
"HJY_val"=>["01", "11"],
"CD_SubA"=>["87", "43"],
"CD_SubB"=>["12", "62"]}
As you can see, this was a simple way to collect all the information you need to create your table. Now it is just a matter of transforming it to your pipe-delimited format. For this, or any delimited format, I recommend using CSV:
require 'csv'

out = CSV.generate({col_sep: "|"}) do |csv|
  columns = data.keys.to_a.uniq
  csv << columns
  while !data.values.empty? do
    csv << columns.map { |col| data[col].shift }
  end
end
The output is:
RepVal|YurVal|CD_val|HJY_val|CD_SubA|CD_SubB
MTSJ|AB01-J|45|01|87|12
||34|11|43|62
Explanation:
CSV.generate creates a string. If you wanted to create an output file directly, use CSV.open instead; see the CSV class for more information. I added the col_sep option to delimit with a pipe character instead of the default comma.
Getting the list of columns would just be the keys if data were a hash, but since it is a Multimap, which repeats key names, I have to call .to_a.uniq on it. Then I add them to the output using csv << columns.
To create the second row (and any subsequent rows), we slice down and take the first value for each key of data. That's what data[col].shift does: it removes the first value from each key's list of values in data. The loop keeps going as long as there are more values left (more rows).

Finding maximum CSV field sizes for large file using Ruby

I'm trying to determine maximum field sizes for a large CSV file (~5GB) with over 300 fields that I want to dump into a MySQL table. The CSV file schema I have for the file gives incorrect maximum field lengths, so I run into errors on the table import. I'm running Ruby 2.0 on Windows.
I'm using an array to store the maximum field lengths according to the field's index (or column) location, i.e. ignoring the fields actual name in the header. I've tried fancier things like using hashes, inject, and zip, etc, but it seems that a simple array works fastest here.
field_lengths[0] = Max length of first field
field_lengths[1] = Max length of second field
etc.
The file is too large to slurp at once or parse column-wise using CSV, so I open the CSV file and use CSV#foreach to parse each line (ignoring the header with the :headers => true option). For each row, I loop through the parsed array of field values and compare each field's length with the current maximum stored in the field_lengths array. I realize there are much easier ways to do this with smaller files; this method works for larger files too, but I still haven't been able to make it to the end of my particular file.
To get around not being able to finish the file, I currently define a number of lines to read, including the header (= n), and break once I've reached the n-th line. In the example below I read 101 lines from the CSV file (1 header row + 100 actual data rows). I'm not sure how many total rows are in the file since the process hasn't finished.
require 'csv'
require 'pp'

data_file = 'big_file.csv'

# We're only reading the first 101 lines in this example
n = 101

field_lengths = []

File.open(data_file) do |f|
  CSV.foreach(f, :headers => true, :header_converters => :symbol) do |csv_row|
    break if $. > n
    csv_row.fields.each_with_index do |a,i|
      field_lengths[i] ||= a.to_s.length
      field_lengths[i] = a.to_s.length if field_lengths[i] < a.to_s.length
    end
  end
end

pp field_lengths
IO#read can read a certain number of bytes, but if I parse the file by bytes the records may get split. Does anyone have alternate suggestions for parsing the CSV file, by splitting it up into smaller files? O'Reilly's Ruby Cookbook (Lucas Carlson & Leonard Richardson, 2006, 1st ed), suggests breaking a large file into chunks (as below), but I'm not sure how to extend it to this example, particularly dealing with the line breaks etc.
class File
  def each_chunk(chunk_size = 1024)
    yield read(chunk_size) until eof?
  end
end

open("bigfile.txt") do |f|
  f.each_chunk(15) {|chunk| puts chunk}
end
You're using CSV.foreach wrong; it takes a filename string, not an open File object:
field_lengths = {}

CSV.foreach(data_file, :headers => true, :header_converters => :symbol) do |csv_row|
  csv_row.each do |k, v|
    # v.to_s guards against nil fields
    field_lengths[k] = [field_lengths[k] || 0, v.to_s.length].max
  end
end

pp field_lengths
Given a CSV::Table in a variable called csv, you can change it to be "by column" and use collect or map...or inject. There's lots of approaches.
e.g.
csv.by_col!
field_lengths = csv.map{|col| col.map{|r| r.is_a?(String) ? r.to_s.length : r.map{|v| v.to_s.length}.max }.max}
The first map iterates through the columns.
The second map iterates over each column's pair of header (a String) and array of row values.
The third map iterates through the row values, converting each value to a string and returning its length.
The first max is the maximum value-string length; the second max is the maximum of the header's length and that maximum value length.
At the end, you may wish to return the csv to col_or_row mode,
e.g.
csv.by_col_or_row!
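A usage sketch of that column-wise approach (hypothetical file name; it assumes the table is small enough to be read fully into memory, which is the real constraint with the 5GB file from the question):
require 'csv'

csv = CSV.read('small_sample.csv', headers: true)  # returns a CSV::Table

csv.by_col!
# in by-column mode each element is [header, array_of_values]
field_lengths = csv.map do |header, values|
  [header.to_s.length, *values.map { |v| v.to_s.length }].max
end
csv.by_col_or_row!

p field_lengths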

How to store number in text format in csv file using Ruby CSV?

Even though I am inserting the value as a string into the CSV, it gets stored as a number, e.g. "01" gets stored as 1.
I am using CSV writer:
@out = File.open("#{File.expand_path("CSV")}/#{file_name}.csv", "w")
CSV::Writer.generate(@out) do |csv|
  csv << ["01", "02", "test"]
end
@out.close
This generates a CSV with the given values, but when we open the CSV using Excel, "01" is not stored as text; it gets stored as a number.
Thanks
You have to surround the value with double quotations like "..." in order to get it stored as a string.
Use string formatting:
my_int = 1
p "%02d" % my_int
Start here for Ruby 1.9.2: http://www.ruby-doc.org/core/classes/String.html
You will see that for the full set of format directives you need to dig into Kernel::sprintf.
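For instance, a minimal sketch of applying the format when building the row (hypothetical file name; note that Excel may still display the cell as a number when it opens the CSV, since the formatting only controls what is written to the file):
require 'csv'

CSV.open('out.csv', 'w') do |csv|
  csv << ["%02d" % 1, "%02d" % 2, 'test']  # writes the line: 01,02,test
end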
