CSV parsing, newline/linebreak issues

I'm trying to create a parser for multiple CSV files that will eventually output to another CSV file in Excel-compatible format. The CSV files are exported by a commercial tool that takes a firewall configuration and gives us a report of any issues it finds.
So far I have figured out how to read a directory of files in, look for certain values, determine the type of device I have and then spit it out to screen or to a CSV, but only if each line has single cell entries. If the source IP 'cell' (or any other) contains more than one IP, separated by a newline, the output breaks on that newline and pushes the remainder onto the next line.
The code I have so far is:
require 'csv'
require 'pp'

nipperfiles = Dir.glob(ARGV[0] + '/*.csv')

def allcsv(nipperfiles)
  filearray = []
  nipperfiles.each do |csv|
    filearray << csv
  end
  filearray
end

def devicetype(filelist)
  filelist.each do |f|
    CSV.foreach(f, :headers => true, :force_quotes => true) do |row|
      if row["Table"] =~ /audit device list/ && row["Device"] =~ /Cisco/
        return "Cisco"
      elsif row["Table"] =~ /audit device list/ && row["Device"] =~ /Dell/
        return "Sonicwall"
      elsif row["Table"] =~ /audit device list/ && row["Device"] =~ /Juniper/
        return "Juniper"
      end
    end
  end
end

def adminservices(device, filelist)
  administrative = []
  filelist.each do |f|
    CSV.foreach(f, :headers => true, :col_sep => ",", :force_quotes => true, :encoding => Encoding::UTF_8) do |row|
      if row["Table"] =~ /administrative service rule/
        if row["Dst Port"] != "Any" and row["Service"] != "[Host] Any"
          if device == "Cisco"
            administrative << row["Table"] + ',' + row["Rule"] + ',' + row["Protocol"] + ',' + row["Source"] + ',' + row["Destination"] + ',' + row["Dst Port"]
          elsif device == "Sonicwall"
            administrative << row["Table"] + ',' + row["Rule"] + ',' + row["Source"] + ',' + row["Destination"] + ',' + row["Service"]
          elsif device == "Juniper"
            administrative << row["Table"] + ',' + row["Rule"] + ',' + row["Source"] + ',' + row["Destination"] + ',' + row["Service"]
          end
        end
      end
    end
  end
  administrative
end

def writecsv(admin)
  finalcsv = File.new("randomstorm.csv", "w+")
  finalcsv.puts("Administrative Services Table:\n", admin, "\r\n")
  finalcsv.close
end

filelist = allcsv(nipperfiles)
device = devicetype(filelist)
adminservices(device, filelist)
admin = adminservices(device, filelist)
writecsv(admin)
Is there a way to get it to ignore the newlines that are inside cells, or is my code complete balls and needs to be started again?
I have tried writing a CSV file with the CSV library, but the results are the same and I figured this code was slightly clearer for demonstrating the issue.
I can sanitise an input file if it would help.

Newlines are fine inside fields as long as the fields are quoted:
CSV.parse("1,\"2\n\n\",3")
=> [["1", "2\n\n", "3"]]
Try generating the output with the CSV library itself (to a string or a file, as in the documentation); it quotes any field that contains a newline:
def writecsv(admin)
  csv_string = CSV.generate do |csv|
    admin.each { |row| csv << row }
  end
  finalcsv = File.new("randomstorm.csv", "w+")
  finalcsv.puts("Administrative Services Table:\n", csv_string, "\r\n")
  finalcsv.close
end
Also ensure you are writing your fields as an array inside of adminservices():
administrative << [row["Table"], row["Rule"], row["Protocol"], row["Source"], row["Destination"], row["Dst Port"]]
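As a quick check of the quoting behaviour, here is a minimal sketch that writes one row with a multi-line "Source" cell and reads it back. The row values are made up for illustration; only CSV.generate and CSV.parse from the standard csv library are used.

```ruby
require 'csv'

# One illustrative row; the third field holds two IPs separated by a newline.
rows = [
  ["administrative service rule", "1", "10.0.0.1\n10.0.0.2", "Any", "22"]
]

# CSV.generate quotes the multi-line field automatically.
csv_string = CSV.generate do |csv|
  rows.each { |row| csv << row }
end

# Parsing the generated text gives back the original row, newline included.
parsed = CSV.parse(csv_string)

puts csv_string
puts parsed == rows
```

Because each row is an array rather than a comma-joined string, the library can decide per field whether quoting is needed.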

Related

Read complete CSV cell data for complicated strings

I am using Ruby to merge CSV files that might contain different headers.
My problem is that some of the values in the CSV files are quite complicated, and data gets lost in the merge process.
For example, the original value "[cell([""A"",""B""]),""X""+cell([""A"",""C""])+""W""].join(""_"")" is written as "[cell([""A"",v1,""B""]),
and as a result I get CSV::MalformedCSVError when trying to read the merged file.
How can I read and write the exact content of each CSV cell?
My code and a running example:
def join_multiple_csv(csv_path_array)
  f = CSV.parse(File.read(csv_path_array[0]), :headers => true, :quote_char => "'")
  f_h = {}
  f.headers.each {|header| f_h[header] = f[header]}
  n_rows = f.size
  csv_path_array.shift(1)
  csv_path_array.each do |csv_file|
    curr_csv = CSV.parse(File.read(csv_file), :headers => true, :quote_char => "'")
    curr_h = {}
    curr_csv.headers.each {|header| curr_h[header] = curr_csv[header]}
    new_headers = curr_csv.headers - f_h.keys
    exist_headers = curr_csv.headers - new_headers
    new_headers.each { |new_header|
      f_h[new_header] = Array.new(n_rows) + curr_csv[new_header]
    }
    exist_headers.each {|exist_header|
      f_h[exist_header] = f_h[exist_header] + curr_csv[exist_header]
    }
    n_rows = n_rows + curr_csv.size
  end
  csv_headers = f_h.keys.map {|string| string}
  output = csv_headers.join(",") + "\n"
  (0..n_rows-1).each do |i|
    row = ''
    f_h.each_key do |header|
      if f_h[header][i].nil?
        row.concat(f_h[header][i].to_s + ",")
      else
        row.concat(f_h[header][i].to_s + ",")
      end
    end
    output.concat(row + "\n")
  end
  return output
end

csv_files = ['f1.csv', 'f2.csv']
outputs = join_multiple_csv(csv_files)
f = CSV.new(outputs)
row = f.readline
while row do
  row = f.readline
end
running example:
f1.csv
H1,H3,H4
v1,v2,v3
f2.csv
H2,H3,H4
v1,v3,"[cell([""A"",""B""]),""X""+cell([""A"",""C""])+""W""].join(""_"")"
expected output:
H1,H2,H3,H4
v1,,v2,v3
,v1,v3,"[cell([""A"",""B""]),""X""+cell([""A"",""C""])+""W""].join(""_"")"
output:
H1,H3,H4,H2,
v1,v2,v3,,,
,v3,"[cell([""A"",v1,""B""]),
,,,,,
,,,,,
Any idea what I can do?

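Sorry, I answered in a rush.
I ran your program and found that the quote character setting is the cause: with :quote_char => "'", the double quotes in the data are treated as ordinary characters, so the cell value is split on every comma inside it. Changing the quote character to a double quote worked for me: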
f = CSV.parse(File.read(csv_path_array[0]), :headers => true, :quote_char => '"')
curr_csv = CSV.parse(File.read(csv_file), :headers => true, :quote_char => '"')
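To see the effect of quote_char in isolation, here is a small sketch that parses the problematic row from f2.csv both ways. It also passes the parsed row through CSV.generate_line, which is an addition of mine rather than part of the answer: the question builds its output with join(","), which drops the quoting again, so writing through the library may be needed as well.

```ruby
require 'csv'

# The data row from f2.csv, copied from the question.
line = 'v1,v3,"[cell([""A"",""B""]),""X""+cell([""A"",""C""])+""W""].join(""_"")"'

# With a single-quote quote_char, double quotes are ordinary characters,
# so every comma inside the third cell acts as a field separator.
single = CSV.parse_line(line, quote_char: "'")

# With a double-quote quote_char (the library default), the cell stays whole.
double = CSV.parse_line(line, quote_char: '"')

# Writing the parsed row back through the library re-applies the quoting.
rewritten = CSV.generate_line(double)

puts single.size
puts double.size
puts double[2]
puts rewritten
```

The first parse yields six fragments, the second yields the three intended cells, and the regenerated line can be parsed again without a CSV::MalformedCSVError.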

How can I combine two commands in Ruby?

I'm trying to combine these two commands into a single Ruby script, but it doesn't work:
ruby -a -ne 'print $F[0].gsub(/=(.*?)&/," \"\\1\" and ")' prueban > prueban2
ruby -a -F';' -ne 'puts $F[0].sub("less"," <")' prueban2 > prueban3
This is my script:
File.open("text.txt", "r") do |fi|
  fi.readlines.each do |line|
    parts = line.chomp.split(';')
    fx= puts parts[0].gsub(/=(.*?)&/," \"\\1\" and ")
  end
  fx.readlines.each do |line|
    parts = line.chomp.split(';')
    fx= puts parts[0].gsub("less"," <")
  end
end
This is my input file:
pricegreater=2&priceless=4&seleccionequal=pet&
And this is my expected output:
pricegreater "2" and price < "4" and seleccionequal "pet" and
I don't know what I'm doing wrong. Please help.

Here's a reworked version of the core function to show how to do it in a more Ruby-esque way:
# Define a lookup table of all substitutions
REWRITE = {
  'greater' => '>',
  'less' => '<',
  'equal' => '='
}

# Use the lookup table to create a regular expression that matches them all
REWRITE_RX = Regexp.new(Regexp.union(REWRITE.keys).to_s + '\z')

def rewrite(input)
  # Split up each main part of the input on &
  input.split('&').map do |pair|
    # Carve up each part into a var and value on =
    var, value = pair.split('=')
    # Replace terms found in the lookup table
    var.sub!(REWRITE_RX) do |m|
      ' ' + REWRITE[m]
    end
    # Combine these to get the result
    [ var, value ].join(' ')
  end.join(' and ')
end
Put into action you get this:
rewrite("pricegreater=2&priceless=4&seleccionequal=pet&")
# => "price > 2 and price < 4 and seleccion = pet"

I solved it with this:
File.open("text.txt", "r") do |fi|
  fi.readlines.each do |line|
    parts = line.chomp.split(';')
    f = File.open('text2.txt', 'w')
    old_out = $stdout
    $stdout = f
    puts parts[0].gsub(/=(.*?)&/," \"\\1\" and ")
    f.close
    $stdout = old_out
  end
end
File.open("text2.txt", "r") do |fi|
  fi.readlines.each do |line|
    parts = line.chomp.split(';')
    f = File.open('text3.txt', 'w')
    old_out = $stdout
    $stdout = f
    puts parts[0].sub("less"," <")
    f.close
    $stdout = old_out
  end
end
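Since both steps are plain string substitutions, they can also be chained on each line without an intermediate file or swapping $stdout. This is only a sketch: rewrite_line is a hypothetical helper name, and the commented-out file handling assumes the file names used in the post.

```ruby
# Hypothetical helper: applies both substitutions to one input line.
def rewrite_line(line)
  # Take the first ';'-separated field, as the -F';' one-liner does.
  first_field = line.chomp.split(';')[0].to_s
  first_field
    .gsub(/=(.*?)&/, " \"\\1\" and ")
    .sub("less", " <")
end

# Assumed usage with the file names from the post:
# File.open("text3.txt", "w") do |out|
#   File.foreach("text.txt") { |line| out.puts rewrite_line(line) }
# end

puts rewrite_line("pricegreater=2&priceless=4&seleccionequal=pet&")
```

Opening the output file once, outside the loop, also avoids overwriting it for every input line.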

How to make sure REXML::Formatters::Pretty uses \t instead of white-space for indentation

It seems to me that there's no way of making sure REXML::Formatters::Pretty can use \t instead of white-space for the indentation strategy in the XML Tree. The only thing I can do is to define how many white spaces are used per indentation level.
Am I wrong?

Not sure why REXML library does not provide you with this option since it could definitely support it internally but you can just roll your own formatter:
module REXML
  module Formatters
    class Prettier < Pretty
      attr_accessor :style

      def initialize(indentation = 2, indent_style = " ", ie_hack = false)
        @style = indent_style
        super(indentation, ie_hack)
      end

      protected

      def write_element(node, output)
        output << style*@level
        output << "<#{node.expanded_name}"
        node.attributes.each_attribute do |attr|
          output << " "
          attr.write( output )
        end unless node.attributes.empty?
        if node.children.empty?
          if @ie_hack
            output << " "
          end
          output << "/"
        else
          output << ">"
          # If compact and all children are text, and if the formatted output
          # is less than the specified width, then try to print everything on
          # one line
          skip = false
          if compact
            if node.children.inject(true) {|s,c| s & c.kind_of?(Text)}
              string = ""
              old_level = @level
              @level = 0
              node.children.each { |child| write( child, string ) }
              @level = old_level
              if string.length < @width
                output << string
                skip = true
              end
            end
          end
          unless skip
            output << "\n"
            @level += @indentation
            node.children.each { |child|
              next if child.kind_of?(Text) and child.to_s.strip.length == 0
              write( child, output )
              output << "\n"
            }
            @level -= @indentation
            output << style*@level
          end
          output << "</#{node.expanded_name}"
        end
        output << ">"
      end

      def write_text( node, output )
        s = node.to_s()
        s.gsub!(/\s/,' ')
        s.squeeze!(" ")
        s = wrap(s, @width - @level)
        s = indent_text(s, @level, style, true)
        output << (style*@level + s)
      end

      def write_comment( node, output)
        output << style * @level
        Default.instance_method(:write_comment).bind(self).call(node,output)
      end

      def write_cdata( node, output)
        output << style * @level
        Default.instance_method(:write_cdata).bind(self).call(node,output)
      end
    end
  end
end
Now you can specify your own indentation level and an indent style, e.g.:
require "rexml/document"
include REXML
string = <<EOF
<mydoc>
<someelement attribute="nanoo">Text, text, text</someelement>
</mydoc>
EOF
doc = Document.new string
f = Formatters::Prettier.new(2,"h")
f.write(doc,$stdout)
#<mydoc>
#hh<someelement attribute='nanoo'>
#hhhhText, text, text
#hh</someelement>
#</mydoc>
I used "h" to show how the indentation works, as \t will not show up visibly in $stdout, but in your case this would be
f = Formatters::Prettier.new(1,"\t")
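If subclassing the formatter feels like too much, a different and cruder approach is to let the stock Pretty formatter indent with one space per level and then turn the leading spaces into tabs. This is a swapped-in technique, not what the answer above does, and it assumes that no text content relies on its own leading spaces at the start of a line.

```ruby
require 'rexml/document'

xml = '<mydoc><someelement attribute="nanoo">Text, text, text</someelement></mydoc>'
doc = REXML::Document.new(xml)

# Pretty-print into a string with one space per indentation level.
spaced = +""
REXML::Formatters::Pretty.new(1).write(doc, spaced)

# Replace every leading space with a tab (one space equals one level here).
tabbed = spaced.gsub(/^ +/) { |spaces| "\t" * spaces.length }

puts tabbed
```

The custom formatter remains the more robust option, since it never has to guess which spaces are indentation.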

What is the syntax for array.select?

I'm trying to use Array.select to separate out, and then delete, strings from a database that contain unwanted items. I get no errors but this does not seem to be working as hoped.
The relevant code is the last part:
totaltext = []
masterfacs = ''
nilfacs = ''
roomfacs_hash = {'lcd' => lcd2, 'wifi'=> wifi2, 'wired' => wired2, 'ac' => ac2}
roomfacs_hash.each do |fac, fac_array|
  if roomfacs.include? (fac)
    totaltext = (totaltext + fac_array)
    masterfacs = (masterfacs + fac + ' ')
  else
    nilfacs = (nilfacs + fac + ' ')
  end
end
finaltext = Array.new
text_to_delete = totaltext2.select {|sentences| sentences =~ /#{nilfacs}/i}
finaltext = totaltext2.delete (text_to_delete)
puts finaltext

It's probably not working because delete isn't a chainable method: on success it returns the object you deleted, and nil if it was not found, never the modified array. To simplify your code, just use reject (this assumes nilfacs is an array of facility names rather than one space-separated string):
finaltext = totaltext.reject{|sentence| nilfacs.any?{|fac| sentence =~ /#{fac}/i } }
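Here is a self-contained sketch of both points, using made-up sentences and assuming nilfacs holds an array of facility names. The word-boundary anchors (\b) and Regexp.escape are my additions, so that a short name like "ac" does not match inside unrelated words.

```ruby
# Illustrative data; the real sentences come from the database.
totaltext = [
  "The room has a large LCD screen.",
  "Free wifi is available.",
  "Wired internet is provided at every desk.",
  "The AC keeps the room cool."
]
nilfacs = ["wired", "ac"]

# Array#delete returns the removed object (or nil), not the remaining array.
letters = %w[a b c]
returned = letters.delete("b") # returned is "b"; letters is now ["a", "c"]

# reject returns a new array without the sentences that mention a missing facility.
finaltext = totaltext.reject do |sentence|
  nilfacs.any? { |fac| sentence =~ /\b#{Regexp.escape(fac)}\b/i }
end

puts finaltext
```

reject leaves totaltext untouched; reject! or delete_if would modify it in place instead.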

Better way to parse "Description (tag)" to "Description, tag"

I have a text file with many thousands of lines like this, which are category descriptions with the keyword enclosed in parentheses:
Chemicals (chem)
Electrical (elec)
I need to convert these lines to comma separated values like so:
Chemicals, chem
Electrical, elec
What I am using is this:
lines = line.gsub!('(', ',').gsub!(')', '').split(',')
I would like to know if there is a better way to do this.
For posterity, this is the full code (based on the answers):
require 'rubygems'
require 'csv'
csvfile = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
  f.readlines.each do |line|
    (desc, cat) = line.split('(')
    desc.strip!
    cat.strip!
    csvfile << [desc, cat[0,cat.length-1]]
  end
end

Try something like this:
line.sub!(/ \((\w+)\)$/, ', \1')
The \1 is replaced with the first capture group of the given regexp (in this case always the category keyword), so " (chem)" becomes ", chem".
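A one-line check of the backreference, using a sample line with its trailing newline (in Ruby, $ also matches just before a line break, so the newline is left in place):

```ruby
line = "Chemicals (chem)\n"

# ' \((\w+)\)$' captures the keyword; ', \1' writes it back after a comma.
converted = line.sub(/ \((\w+)\)$/, ', \1')

puts converted
```

Note that sub returns a new string, while sub! changes the receiver and returns nil when nothing matched.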
Let's create an example using a text file:
lines = []
File.open('categories.txt', 'r') do |file|
  while line = file.gets
    lines << line.sub(/ \((\w+)\)$/, ', \1')
  end
end
Based on the question updates I can propose this:
require 'csv'
csv_file = CSV.open('output.csv', 'w')
File.open('c:/categories.txt') do |f|
  # scan returns an array of matches ([[description, tag]]); flatten it into one CSV row
  f.each_line {|c| csv_file << c.scan(/^(.+) \((\w+)\)$/).flatten}
end
csv_file.close

Starting with Ruby 1.9, you can do it in one method call:
str = "Chemicals (chem)\n"
mapping = { ' (' => ', ',
            ')' => '' }
str.gsub(/ \(|\)/, mapping) #=> "Chemicals, chem\n"

In Ruby, a cleaner, more efficient way to do it would be:
# split(' ', 2) returns a 2-element array: all characters up to the first space,
# and all characters after it. Multiple assignment then puts each element in its
# own local variable. Note that this assumes a one-word description.
description, tag = line.split(' ', 2)
tag = tag[1, (tag.length - 1) - 1] # extract the inside characters (not first or last) of the string
new_line = description << ", " << tag # rejoin the parts into a new string
This should be computationally faster (if you have a lot of rows) because it uses direct string operations instead of regular expressions, though it is worth benchmarking.
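One caveat: splitting on the first space breaks for descriptions that contain spaces. A variant that still uses only string operations is to split on the last " (" with rpartition. This is a sketch; split_description is a hypothetical helper name, and it assumes every line ends with a parenthesised keyword.

```ruby
# Hypothetical helper: splits "Description (tag)" on the last " (" without a regexp.
def split_description(line)
  description, _separator, rest = line.chomp.rpartition(' (')
  # rest is "tag)"; drop the closing parenthesis.
  [description, rest.chomp(')')]
end

# Splitting on the first space cuts a multi-word description short.
first_space = "Dyes & Intermediates (dyes)".split(' ', 2)

p first_space
p split_description("Dyes & Intermediates (dyes)\n")
p split_description("Chemicals (chem)")
```

If a line has no " (" at all, rpartition returns an empty description, so real input may need a guard for that case.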

No need to manipulate the string. Just grab the data and output it to the CSV file.
Assuming that you have something like this in the data:
Chemicals (chem)
Electrical (elec)
Dyes & Intermediates (dyes)
This should work:
File.open('categories.txt', 'r') do |file|
  file.each_line do |line|
    csvfile << line.match(/^(.+)\s\((.+)\)$/) { |m| [m[1], m[2]] }
  end
end
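The same idea as a self-contained sketch that can be run without any files: it uses CSV.generate on in-memory sample lines instead of the csvfile handle from the question, and it skips lines that do not match, which is my addition (appending nil to a CSV would raise an error).

```ruby
require 'csv'

# Sample input lines, including a multi-word description.
lines = [
  "Chemicals (chem)\n",
  "Electrical (elec)\n",
  "Dyes & Intermediates (dyes)\n"
]

csv_text = CSV.generate do |csv|
  lines.each do |line|
    m = line.match(/^(.+)\s\((.+)\)$/)
    next unless m # skip lines that do not fit the "Description (tag)" pattern
    csv << [m[1], m[2]]
  end
end

puts csv_text
```

Letting the CSV library build each row also means a description that happens to contain a comma gets quoted correctly.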

Benchmarks relevant to the discussion in @hundredwatt's answer:
require 'benchmark'
line = "Chemicals (chem)"
# @hundredwatt
puts Benchmark.measure {
  100000.times do
    description, tag = line.split(' ', 2)
    tag = tag[1, (tag.length - 1) - 1]
    new_line = description << ", " << tag
  end
} # => 0.18
# NeX
puts Benchmark.measure {
  100000.times do
    line.sub!(/ \((\w+)\)$/, ', \1')
  end
} # => 0.08
# steenslag
mapping = { ' (' => ', ',
            ')' => '' }
puts Benchmark.measure {
  100000.times do
    line.gsub(/ \(|\)/, mapping)
  end
} # => 0.08

I know nothing about Ruby, but it is easy in PHP:
preg_match('~(.+) \((.+)\)~', 'Chemicals (chem)', $m);
$result = $m[1].', '.$m[2];
