I have to work with a CSV file that isn't directly usable for my needs (generating a simple chart), so I need to massage it into something "cleaner". I'm just learning to parse files with Ruby, I'm running into issues, and I'm unsure whether my overall strategy is even correct. My main problem is reading data that is offset from where I have (or haven't) found matches: after I find a line that meets my criteria, I need to read the line 2 lines after it and manipulate some of it (move something from the last column to the second).
Here's the original csv file:
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
component
quantity header,design1,design2,design3,Ref,Units
quantity type,#,#,#,ref#,unit value
Desired output:
Component Header,Quantity type Header,Units Header,design1 header,design2 header,design3 header,Ref header
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
component,quantity type,unit value,#,#,#,n/a
My ruby script at the moment:
require 'csv'

f = File.new("sp.csv")
o = CSV.open('output.csv', 'w')

f.each_line do |l|      # iterate through each line
  data = l.split
  if l !~ /,/           # if the line does not contain a comma it is a component
    o << [data, f.gets] # start writing data; f.gets skips the next line, but I need to skip 2 and split that line to manipulate columns
  else
    o << ['comma']      # just me testing that I can find lines with commas
  end
end
f.gets skips the next line, but the documentation doesn't make clear to me how to skip 2. After that I THINK I can split that line on commas and manipulate the row data with array[column]. Aside from this offset issue, I'm also unsure whether my general approach is a good strategy.
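For what it's worth, I suspect I can simply call f.gets twice, since each call consumes one more line from the file. An untested sketch of what I mean:

f.each_line do |l|
  if l !~ /,/                # a component line has no comma
    f.gets                   # consume the quantity header line
    vals = f.gets.split(',') # this is the line 2 past the match
    o << [l.strip, vals[0], vals[-1].strip, *vals[1..-3]]
  end
end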
EDIT
Here are some lines from the real file. I'll work through the answers provided and see if I can make it all work. My idea has been to read and write line by line, vs. converting the whole file to an array and then reading and writing; my thinking is that when these files get big, and they do, it'll take less memory doing it line by line.
THANKS for the help, I'll work through answers and get back to you.
DCB
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,82.915,69.226,78.35,78.383,86.6,85.763,N/A,Celsius
RCB
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,76.557,68.779,74.705,74.739,80.22,79.397,N/A,Celsius
Antenna
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,69.988,65.045,69.203,69.238,73.567,72.777,N/A,Celsius
PCBA_fiberTray
Result Quantity,BL::BL,BL::BL_DCB-noHeat,DC1::DC1,DC2::DC2,noHS::noHS,20mmHS::20mmHS,Reference,Units
Avg Temperature,66.651,65.904,66.513,66.551,72.516,70.47,N/A,Celsius
EDIT 2
Using some of the regexps from the answers below, I developed a line-by-line strategy to parse through this. I'll post it as an answer for completeness.
Thanks for helping out and exposing me to methods for developing a solution.
How about slicing it into groups of 3 lines? (o here is the same CSV writer opened in your script.)

require 'csv'

o = CSV.open('output.csv', 'w')
File.read("sp.csv").split("\n").each_slice(3) do |slice|
  o << [slice[0], *slice[2].split(',')]
end
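If you also want the column shuffle from your desired output (quantity type first, then units, with the Ref column dropped), here's a variation on the same idea; the column positions are assumptions based on your sample:

require 'csv'

CSV.open('output.csv', 'w') do |o|
  File.read('sp.csv').split("\n").each_slice(3) do |slice|
    vals = slice[2].split(',')
    # component, quantity type, units, then the design values
    o << [slice[0], vals[0], vals[-1], *vals[1..-3]]
  end
end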
I created a CSV file based on the sample, called "test.csv".
Starting with this code:
data = File.readlines('test.csv').slice_before(/^component/)
I get an enumerator back. If I look at the data that enumerator will return I get:
pp data.to_a
[["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"],
["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"],
["component\n",
"quantity header,design1,design2,design3,Ref,Units\n",
"quantity type,#,#,#,ref#,unit value\n"]]
That's an array of arrays, broken into sub-arrays on the "component" line. I suspect the values aren't reflecting reality, but without a more accurate sample... well, GIGO.
If the "component" line isn't really a bunch of repeating "component" lines, and doesn't have any commas, you can use this instead:
data = File.readlines('test.csv').slice_before(/\A[^,]+\Z/)
or:
data = File.readlines('test.csv').slice_before(/^[^,]+$/)
The result will be the same with the current samples.
If you need a more complex regex you can substitute that, for instance:
/^(?:#{ Regexp.union(%w[component1 component2]).source })$/i
Which returns a pattern that will find any words in the %w[] array:
/^(?:component1|component2)$/i
From there we can walk the data array and clean out all the extraneous headers using:
data.map{ |a| a[2..-1] }.flatten
Which returns something like:
[
"quantity type,#,#,#,ref#,unit value\n",
"quantity type,#,#,#,ref#,unit value\n",
"quantity type,#,#,#,ref#,unit value\n"
]
That can be iterated and passed to CSV to be parsed into arrays if needed:
data.map{ |a| a[2..-1].map{ |r| CSV.parse(r) }.flatten }
[
["quantity type", "#", "#", "#", "ref#", "unit value"],
["quantity type", "#", "#", "#", "ref#", "unit value"],
["quantity type", "#", "#", "#", "ref#", "unit value"]
]
That's all background to get you thinking how you can tear apart the CSV data.
Using this code:
data.flat_map { |ary|
  component = ary[0].strip
  ary[2..-1].map{ |a|
    data = CSV.parse(a).flatten
    [
      component,
      data.shift,
      data.pop,
      *data[0..-2]
    ]
  }
}
Returns:
[
["component", "quantity type", "unit value", "#", "#", "#"],
["component", "quantity type", "unit value", "#", "#", "#"],
["component", "quantity type", "unit value", "#", "#", "#"]
]
The only thing left to do is create the header you want to use, and pass the returned data back into CSV to let it generate the output file. You should be able to get there from here using the CSV documentation.
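For instance, a minimal sketch of that final step; the header names are placeholders taken from the desired output in the question:

require 'csv'

# `rows` is the array returned by the flat_map above
header = ['Component', 'Quantity type', 'Units', 'design1', 'design2', 'design3']
CSV.open('output.csv', 'w') do |csv|
  csv << header
  rows.each { |row| csv << row }
end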
Edit:
Based on the actual data, here's a version of the code with a minor tweak, along with its output:
require 'csv'
require 'pp'

data = File.readlines('test.csv').slice_before(/^[^,]+$/)

pp data.flat_map { |ary|
  component = ary[0].strip
  ary[2..-1].map{ |a|
    record = CSV.parse(a).flatten
    [
      component,
      record.shift,
      record.pop,
      *record[0..-2]
    ]
  }
}
Which looks like:
[["DCB",
"Avg Temperature",
"Celsius",
"82.915",
"69.226",
"78.35",
"78.383",
"86.6",
"85.763"],
["RCB",
"Avg Temperature",
"Celsius",
"76.557",
"68.779",
"74.705",
"74.739",
"80.22",
"79.397"],
["Antenna",
"Avg Temperature",
"Celsius",
"69.988",
"65.045",
"69.203",
"69.238",
"73.567",
"72.777"],
["PCBA_fiberTray",
"Avg Temperature",
"Celsius",
"66.651",
"65.904",
"66.513",
"66.551",
"72.516",
"70.47"]]
Here's the code I'm using, which creates a CSV file with everything manipulated. Thanks to those who contributed some help.
require 'csv'

file_in = File.new('sp1.csv')
file_out = CSV.open('output.csv', 'w')
header = []
row = []

file_in.each_line do |line|
  case line
  when /^[^,]+$/                            # found a component (a line with no commas)
    comp_header = file_in.gets.split(',')   # the header follows the component; split it into an array
    if header.empty?                        # only write the header row once
      header.push("Component", comp_header[0], comp_header[-1].strip)
      comp_header[1..-3].each do |h|        # the design columns (excluding Reference and Units)
        header.push(h)
      end
      file_out << header
    end
    @comp = line.to_s.strip
    next
  when /,/                                  # a data row (has commas)
    puts @comp
    vals = line.split(',')                  # split it up into a vals array
    row.push(@comp, vals[0], vals[-1].strip)  # add component, quantity and unit to the row array
    vals[1..-3].each do |v|                 # the values (excluding quantity, units, reference info)
      row.push(v)
    end
  end
  file_out << row                           # write the current row to the csv file
  row = []                                  # reset the row array to move on to the next component set
end
Related
I have a hash like this:
v_cp={"29000"=>["Quimper"],
"29100"=>["Douarnenez",
"Kerlaz",
"Le Juch",
"Pouldergat",
"Poullan-sur-Mer"],
"29120"=>["Combrit",
"Plomeur",
"Saint Jean Trolimon","Pont-L\'Abbe","Tremeoc"],
"29140"=>["Melgven","Rosporden","Tourch"]
I would like to print the keys in a table layout on screen, several keys per row.
I use:
v_cp.each_key { |k| puts "*" + k + "*" }
But of course that prints each key on its own line:
*29000*
*29100*
*29120*
*29140*
which is not what I'm aiming for...
I thought of sprintf or printf but I'm really lost here...
Any help? Thanks
If the length of each key is fixed you can just slice keys into subgroups and print them out:
v_cp.keys.each_slice(5) { |a| puts a.join(' ') }
If the length can vary, you should also ljust strings:
str_length = 6
v_cp.keys.each_slice(5) do |a|
  puts a.map { |e| e.ljust(str_length, ' ') }.join(' ')
end
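With the four keys in the question's hash, both snippets print a single row; the padded version gives:

29000  29100  29120  29140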
You can use #print instead of #puts to put line feeds exactly where you want them. Unlike #puts, which automatically adds a new line every time it's called, #print prints out only the string that is passed to it, so you have to specifically print a new line character to get a new line.
For example, to get five keys that are the same size on each row, as in your first image:
example = {
  29000 => ['Bonjour'],
  29100 => ['Ça va?'],
  29200 => ['Hello'],
  29300 => ['Doing ok?'],
  29400 => ['Some text'],
  29500 => ['Something else'],
  29600 => ['More stuff'],
  29700 => ['This'],
  29800 => ['That'],
  29900 => ['The other']
}
example.keys.each_with_index do |key, index|
  print key.to_s
  print ((index + 1) % 5).zero? ? "\n" : '  '
end

# Result:
=begin
29000  29100  29200  29300  29400
29500  29600  29700  29800  29900
=end
(I liked two spaces better than one.)
If the length varies, you can use #ljust to pad smaller strings with trailing spaces, as Fizvlad mentions.
Consider preferring #print over #puts when outputting anything more complex than a simple string. You can often do with one call to #print what would take multiple calls to #puts, so overall #print is more efficient.
I want to change CSV file content:
itemId,url,name,type
1|urlA|nameA|typeA
2|urlB|nameB|typeB
3|urlC,urlD|nameC|typeC
4|urlE|nameE|typeE
into an array:
[itemId,url,name,type]
[1,urlA,nameA,typeA]
[2,urlB,nameB,typeB]
[**3**,**urlC**,nameC,typeC]
[**3**,**urlD**,nameC,typeC]
[4,urlE,nameE,typeE]
Could anybody teach me how to do it?
Ultimately, I'm going to download the files (.jpg) at those URLs.
The header row has a different separator than the data. That's a problem. You need to change the header row to use | instead of ,. Then:
require 'csv'
require 'pp'

array = Array.new
CSV.foreach("test.csv", col_sep: '|', headers: true) do |row|
  if row['url'][/,/]
    row['url'].split(',').each do |url|
      row['url'] = url
      array.push row.to_h.values
    end
  else
    array.push row.to_h.values
  end
end
pp array
=> [["1", "urlA", "nameA", "typeA"],
["2", "urlB", "nameB", "typeB"],
["3", "urlC", "nameC", "typeC"],
["3", "urlD", "nameC", "typeC"],
["4", "urlE", "nameE", "typeE"]]
You'll need to test the fifth column to see how the line should be parsed. If you see a fifth element (row[4]), output the line twice, replacing the url column:
array = Array.new
CSV.foreach("test.csv") do |row|
  if row[4]
    array << [row[0..1], row[3..4]].flatten
    array << [[row[0]], row[2..4]].flatten
  else
    array << row
  end
end
p array
In your example you had asterisks but I'm assuming that was just to emphasise the lines for which you want special handling. If you do want asterisks, you can modify the two array shovel commands appropriately.
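For example, a sketch of the modified shovel commands, in case you really did want the asterisks:

array << ["*#{row[0]}*", "*#{row[1]}*", *row[3..4]]
array << ["*#{row[0]}*", "*#{row[2]}*", *row[3..4]]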
Below is the input file that I want to store in a hash table, sort by name, and output in the format shown below.
Input File
Name=Ashok, Email=ashok85@gmail.com, Country=India, Comments=9898984512
Email=raju@hotmail.com, Country=Sri Lanka, Name=Raju
Country=India, Comments=45535878, Email=vijay@gmail.com, Name=Vijay
Name=Ashok, Country=India, Email=ashok37@live.com, Comments=8898788987
Output File (Sorted by Name)
Name       Email               Country     Comments
-------------------------------------------------------
Ashok      ashok37@live.com    India       8898788987
Ashok      ashok85@gmail.com   India       9898984512
Raju       raju@hotmail.com    Sri Lanka
Vijay      vijay@gmail.com     India       45535878
So far, I have read the data from the file and stored each line in an array, but I am stuck on building the hash[key] => value pairs.
file_data = {}
File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    line_data = line.split('=')
    file_data[line_data[0]] = line_data[1]
  end
end
puts file_data
Given that each line of your input file is a set of key=value strings separated by commas, you need to split each line first on the comma, then on the equals sign. Here is a corrected version of the code:
# Need to collect the parsed data from each line into an array
array_of_file_data = []

File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    # create a hash to collect data from each line
    file_data = {}
    # first split on the comma
    pairs = line.chomp.split(", ")
    pairs.each do |p|
      # split on = to separate out key and value
      key_value = p.split('=')
      file_data[key_value[0]] = key_value[1]
    end
    array_of_file_data << file_data
  end
end
puts array_of_file_data
The above code will print:
{"Name"=>"Ashok", "Email"=>"ashok85#gmail.com", "Country"=>"India", "Comments"=>"9898984512"}
{"Email"=>"raju#hotmail.com", "Country"=>"Sri Lanka", "Name"=>"Raju"}
{"Country"=>"India", "Comments"=>"45535878", "Email"=>"vijay#gmail.com", "Name"=>"Vijay"}
{"Name"=>"Ashok", "Country"=>"India", "Email"=>"ashok37#live.com", "Comments"=>"8898788987"}
A more complete version of the program is given below.
hash_array = []

# Parse the lines and store them in an array of hashes
File.open("sample.txt", "r") do |f|
  f.each_line do |line|
    # Splits are done around , and = preceded or followed
    # by any number of white spaces
    splits = line.chomp.split(/\s*,\s*/).map { |p| p.split(/\s*=\s*/) }
    # to_h converts an array of two-element [key, value] arrays into a hash
    hash_array << splits.to_h
  end
end

# Sort the array of hashes by name
hash_array = hash_array.sort { |i, j| i["Name"] <=> j["Name"] }

# Print the output; more tricks are needed to get it better formatted
header = ["Name", "Email", "Country", "Comments"]
puts header.join(" ")
hash_array.each do |h|
  puts h.values_at(*header).join(" ")
end
The above program outputs:
Name Email Country Comments
Ashok ashok85@gmail.com India 9898984512
Ashok ashok37@live.com India 8898788987
Raju raju@hotmail.com Sri Lanka
Vijay vijay@gmail.com India 45535878
You may want to refer to "Padding printed output of tabular data" for better-formatted tabular output.
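For example, a rough pass with #ljust, reusing header and hash_array from above (the column widths are arbitrary guesses):

widths = { "Name" => 8, "Email" => 20, "Country" => 12, "Comments" => 12 }
puts header.map { |c| c.ljust(widths[c]) }.join
hash_array.each do |h|
  puts header.map { |c| (h[c] || "").ljust(widths[c]) }.join
end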
I have code that works but am soliciting suggestions for improvement.
I have a file containing ruby hashes:
{"dat"=>"2013-09-01T20:40:00-07:00", "sca"=>"5", "del"=>"755", "dir"=>"S"}
{"dat"=>"2013-09-01T21:00:00-07:00", "sca"=>"5", "del"=>"459", "dir"=>"S"}
that I want to convert to JSON that is both valid and human-readable. This code is compact...
#!/usr/bin/env ruby
# expected input: file of hashes, one per line
# output: properly formatted json array
require 'json'

json_array = []
while input = ARGF.gets
  input.each_line do |line|
    json_array.push(eval(line))
  end
end
print json_array
puts
..but its output is a single line of Ruby inspect notation rather than JSON, and is not easily human-readable:
[{"dat"=>"2013-09-01T20:40:00-07:00", "sca"=>"5", "del"=>"755", "dir"=>"S"}, {"dat"=>"2013-09-01T21:00:00-07:00", "sca"=>"5", "del"=>"459", "dir"=>"S"}]
Substituting
puts JSON.pretty_generate(json_array)
for the two output lines above produces valid JSON that is human-readable, but verbose:
[
{
"dat": "2013-09-01T20:40:00-07:00",
"sca": "5",
"del": "755",
"dir": "S"
},
(more lines...)
Better from a human-readability standpoint would be to have a "record" on each line:
[
{"dat":"2013-09-01T20:40:00-07:00","sca":"5","del":"755","dir":"S"},
{"dat":"2013-09-01T21:00:00-07:00","sca":"5","del":"459","dir":"S"}
]
But in order to avoid the trailing comma issue [apparently a common problem - see http://trailingcomma.com/ ] I have resorted to an ugly loop with special casing. While it accomplishes the goal, I'm not happy about it and I feel like there must be a simpler way:
#!/usr/bin/env ruby
# expected input: file of hashes, one per line
# output: properly formatted json array
require 'json'

prevHash = ""
currHash = ""

puts "["
while input = ARGF.gets
  # In order to prevent a dangling comma on the last element of the output
  # JSON array, this counter-intuitive loop always outputs the previous,
  # not the current, array element, with a trailing comma.
  input.each_line do |currLine|
    currHash = eval(currLine)  # convert string to hash
    if (prevHash != "")        # if not the first time through
      puts " " + prevHash.to_json + ","
    end
    prevHash = currHash
  end
end
# Then, finally, add the last array element *without* the troublesome trailing comma
puts " " + currHash.to_json
puts "]"
Suggestions welcome, particularly those that show me the artful one-liner that I missed.
JSON.pretty_generate accepts an optional hash parameter where you can configure the generator.
A state hash can have the following keys:
indent: a string used to indent levels (default: ''),
space: a string that is put after a : or , delimiter (default: ''),
space_before: a string that is put before a : pair delimiter (default: ''),
object_nl: a string that is put at the end of a JSON object (default: ''),
array_nl: a string that is put at the end of a JSON array (default: ''),
allow_nan: true if NaN, Infinity, and -Infinity should be generated; otherwise an exception is thrown when these values are encountered (defaults to false),
max_nesting: the maximum depth of nesting allowed in the data structures from which JSON is to be generated; disable depth checking with :max_nesting => false (defaults to 19).
Playing around with that, the closest I could get to your requirement is:
JSON.pretty_generate(hash, {object_nl: '', indent: ' '})
which renders to
[
{ "dat": "2013-09-01T20:40:00-07:00", "sca": "5", "del": "755", "dir": "S"},
{ "dat": "2013-09-01T21:00:00-07:00", "sca": "5", "del": "459", "dir": "S"}
]
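Not quite a one-liner, but close: you can sidestep the trailing comma entirely by building the record strings first and joining them with ",\n". A sketch along the lines of the original script (eval is kept from the question because the input lines are Ruby hash literals, not JSON; indentation to taste):

require 'json'

records = ARGF.each_line.map { |line| eval(line).to_json }
puts "[\n  #{records.join(",\n  ")}\n]"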
Okay, so I'm building something that takes a text file, breaks it up into multiple sections that are further divided into entries, and then puts <a> tags around part of each entry. I have an instance variable, @section_name, that I need when making the link. The problem is, @section_name seems to lose its value if I look at it wrong. Some code:
def find_entries
  @sections.each do |section|
    @entries = section.to_s.shatter(/(some) RegEx/)
    @section_name = $1.to_s
    puts @section_name
    add_links
  end
end

def add_links
  puts "looking for #{@section_name} in #{@section_hash}"
  section_link = @section_hash.fetch(@section_name)
end
If I comment out the call to add_links, it spits out the names of all the sections, but if I include it, I just get:
looking for in {"contents" => "of", "the" => "hash"}
Any help is much appreciated!
$1 is a global variable that can be used in later code; $n contains the n-th (...) capture of the last match:
"foobar".sub(/foo(.*)/, '\1\1')
puts "The matching word was #{$1}" #=> The matching word was bar
"123 456 789" =~ /(\d\d)(\d)/
p [$1, $2] #=> ["12", "3"]
So I think the @entries = section.to_s.shatter(/(some) RegEx/) line is not matching properly; thus your first capture group contains nothing, and $1 comes back nil.
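A quick illustration of how a failed match can silently wipe out $1 between where you set it and where you read it:

"abc" =~ /(b)/
puts $1.inspect  #=> "b"
"abc" =~ /(z)/   # a failed match resets $~, and with it $1
puts $1.inspect  #=> nil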