Ruby: script to convert structured text to a CSV

I have a text file with structured text which I wish to convert to a csv file.
The file looks something like:
name: Seamus
address: 123 Strand Avenue

name: Seana
address: 126 Strand Avenue
I would like it to look like:
|name | address
______________________________
|Seamus | 123 Strand Avenue
______________________________
|Seana | 126 Strand Avenue
So I understand that I need to do something like:
create a csv file
create the column names
read the text file
for each row of the text file starting with 'name' I assign the following text to the 'name' column; for each row starting with 'address' I assign the value to the 'address' column, etc.
But I don't know how to do so.
I would appreciate any pointers people could provide.

The solution starts by identifying how to parse the text file. In this specific case what separates the "records" in the text file is an empty line.
First step would be importing the file contents:
string_content = File.read("path/to/my_file.txt")
# => "name: Seamus\naddress: 123 Strand Avenue\n\nname: Seana\naddress: 126 Strand Avenue\n"
Then you would need to separate the records. As you can see when parsing the file, the empty line is a line that contains only \n, so the \n from the line above plus the one on the empty line make \n\n. That is what you need to look for to separate the records:
string_records = string_content.split("\n\n")
# => ["name: Seamus\naddress: 123 Strand Avenue", "name: Seana\naddress: 126 Strand Avenue\n"]
And then, once you have the strings with the records, it is just a matter of splitting by \n again to separate the fields:
records_by_field = string_records.map do |string_record|
  string_record.split("\n")
end
# => [["name: Seamus", "address: 123 Strand Avenue"], ["name: Seana", "address: 126 Strand Avenue"]]
Once that is separated, you need to split each field by : to separate the field name from the value:
data = records_by_field.map do |record|
  record.each_with_object({}) do |field, new_record|
    field_name, field_value = field.split(":")
    new_record[field_name] = field_value.strip # don't forget to get rid of the initial space with String#strip
  end
end
# => [{"name"=>"Seamus", "address"=>"123 Strand Avenue"}, {"name"=>"Seana", "address"=>"126 Strand Avenue"}]
And there you have it! An array of hashes with the correct key-value pairs.
Now from that you can create a CSV or just use it to give it any other format you may want.
To resolve your specific CSV question:
require 'csv'
# first you need to get your column headers, which will be the keys of any of the hashes, the first will do
column_names = data.first.keys
CSV.open("output_file.csv", "wb") do |csv|
# first we add the headers
csv << column_names
# for each data row we create an array with values ordered as the column_names
data.each do |data_hash|
csv << [data_hash[column_names[0]], data_hash[column_names[1]]]
end
end
That will create an output_file.csv in the same directory where you run your ruby script.
And that's it!
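If you want to sanity-check the result, you can read the file straight back; a quick check, reusing the output_file.csv written above:
require 'csv'
# Read the file we just wrote, treating the first row as headers
CSV.read("output_file.csv", headers: true).each do |row|
  puts "#{row['name']} lives at #{row['address']}"
end
#=> Seamus lives at 123 Strand Avenue
#   Seana lives at 126 Strand Avenue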

Let's construct the file.
str = <<~END
name: Seamus
address: 123 Strand Avenue

name: Seana
address: 126 Strand Avenue


address: 221B Baker Street
name: Sherlock
END
Notice that I've added a third record that has the order of the "name" and "address" lines reversed, and it is preceded by an extra blank line.
in_file = 'temp.txt'
File.write(in_file, str)
#=> 124
The first step is to obtain the headers for the CSV file:
headers = []
f = File.open(in_file)
loop do
  header = f.gets[/[^:]+(?=:)/]
  break if header.nil?
  headers << header
end
f.close
headers
#=> ["name", "address"]
Notice that the number of headers (two in the example) is arbitrary.
See IO::gets. The regular expression reads, "match one or more characters other than a colon" immediately followed by a colon ((?=:) being a positive lookahead).
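A quick irb check of that regex against individual lines from the file above:
"name: Seamus"[/[^:]+(?=:)/]               #=> "name"
"address: 123 Strand Avenue"[/[^:]+(?=:)/] #=> "address"
"\n"[/[^:]+(?=:)/]                         #=> nil, which is what breaks the loop at the first blank line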
If in_file is not exceedingly large it's easiest to first read that file into an array of hashes. The first step is to read the file into a string and then split the string on contiguous lines that contain nothing other than newlines and spaces:
arr = File.read(in_file).chomp.split(/\n\s*\n/)
#=> ["name: Seamus\naddress: 123 Strand Avenue",
# "name: Seana\naddress: 126 Strand Avenue",
# "address: 221B Baker Street\nname: Sherlock"]
We can now convert each element of this array to a hash:
arr = File.read(in_file).split(/\n\s*\n/).
  map do |s|
    s.split("\n").
      each_with_object({}) do |p, h|
        key, value = p.split(/: +/)
        h[key] = value
      end
  end
#=> [{"name"=>"Seamus", "address"=>"123 Strand Avenue"},
# {"name"=>"Seana", "address"=>"126 Strand Avenue"},
# {"address"=>"221B Baker Street", "name"=>"Sherlock"}]
We are now ready to construct the CSV file:
out_file = 'temp.csv'
require 'csv'
CSV.open(out_file, 'w') do |csv|
  csv << headers
  arr.each { |h| csv << h.values_at(*headers) }
end
Let's see what was written:
puts File.read(out_file)
name,address
Seamus,123 Strand Avenue
Seana,126 Strand Avenue
Sherlock,221B Baker Street
See CSV::open and Hash#values_at.
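Hash#values_at is what keeps the reversed Sherlock record lined up with the headers, since the values are pulled out by key rather than by position:
h = { "address"=>"221B Baker Street", "name"=>"Sherlock" }
h.values_at("name", "address")
#=> ["Sherlock", "221B Baker Street"]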
This is not the format specified in the question. In fact, a file with that format would not be a valid CSV file, because there is no consistent column separator: the pipe is padded with varying amounts of whitespace from line to line (compare '|name | address' with '|Seamus | 123 Strand Avenue'). Moreover, even if the separators were consistent, the pipe at the beginning of each line would become the first character of the name.
We could change the column separator to a pipe (rather than a comma, the default) by writing CSV.open(out_file, 'w', col_sep: '|'). A common mistake in constructing CSV files is to surround the column separator with one or more spaces. That invariably leads to boo-boos.
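For example, reusing headers and arr from above (temp_pipe.csv is just an illustrative file name):
CSV.open('temp_pipe.csv', 'w', col_sep: '|') do |csv|
  csv << headers
  arr.each { |h| csv << h.values_at(*headers) }
end
puts File.read('temp_pipe.csv')
# name|address
# Seamus|123 Strand Avenue
# Seana|126 Strand Avenue
# Sherlock|221B Baker Street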

Related

Remove whitespace from CSV row

I need to remove or strip whitespace from CSV rows.
test.csv
60 500, # want 60500
8 100, # want 8100
5 400, # want 5400
480, # want 480
remove_space.rb
require 'csv'
CSV.foreach('test.csv') do |row|
  row = row[0]
  row = row.strip
  row = row.gsub(" ", "")
  puts row
end
I don't understand why it doesn't work. The result is the same as test.csv.
Any idea?
Your test.csv file contains narrow no-break space (U+202F) Unicode characters. This is not the character that String#strip or gsub(" ", "") remove; a regular space character is U+0020.
You can see the different possible unicode spaces here: http://jkorpela.fi/chars/spaces.html
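A quick irb check makes the difference visible (the \u202f escape below stands in for the character that is actually in your file):
s = "60\u202f500"         # narrow no-break space between "60" and "500"
s.gsub(" ", "").length    #=> 6, nothing removed because U+202F is not U+0020
s.strip.length            #=> 6, strip only trims ASCII whitespace at the ends
s.gsub(/[[:space:]]/, "") #=> "60500"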
Here is a more generic script - using a POSIX bracket group - to remove all "space-like" characters:
require 'csv'
CSV.foreach('test.csv') do |row|
  row = row[0]
  row = row.gsub(/[[:space:]]/, "")
  puts row
end
The thing is those are not normal spaces but rather Narrow No-Break Spaces:
require 'csv'
CSV.foreach('/tmp/test.csv') do |row|
  puts row[0].delete "\u202f"
end
#⇒ 60500
# 8100
# 5400
# 480
You can strip out all the spaces, including the Unicode ones, by using the \p{Space} matcher.
require 'csv'
CSV.foreach('/tmp/test.csv') do |row|
  puts row[0].gsub /\p{Space}/, ''
end

Ruby data formatting

I'm reading a log file and trying to organize the data in the format below. I want to use NAME (i.e. USOLA51, USOLA10, ...) as hash keys and create corresponding arrays for LIST and DETAILS.
I've created the hash, but I'm not sure how to extract the corresponding/associated array values.
Expected Output
NAME       LIST        DETAILS
USOLA51    ICC_ONUS    .035400391
           PA_ONUS     .039800391
           PA_ONUS     .000610352
USOLA10    PAL         52.7266846
           CFG_ONUS    15.9489746
likewise for the other values
Log file:
--- data details ----
USOLA51
ONUS size
------------------------------ ----------
ICC_ONUS .035400391
PA_ONUS .039800391
PE_ONUS .000610352
=========================================
---- data details ----
USOLA10
ONUS size
------------------------------ ----------
PAL 52.7266846
CFG_ONUS 15.9489746
=========================================
---- data details ----
USOLA55
ONUS size
------------------------------ ----------
PA_ONUS 47.4707031
PAL 3.956604
ICC_ONUS .020385742
PE_ONUS .000610352
=========================================
---- data details ----
USOLA56
ONUS size
------------------------------ ----------
=========================================
what I've tried
unique = Array.new
owner = Array.new
db = Array.new
File.read("mydb_size.log").each_line do |line|
next if line =~ /---- data details ----|^ONUS|---|=======/
unique << line.strip if line =~ /^U.*\d/
end
hash = Hash[unique.collect { |item| [item, ""] } ]
puts hash
Current O/p
{"USOLA51"=>"", "USOLA10"=>"", "USOLA55"=>"", "USOLA56"=>""}
Any help to move forward would be really helpful here. Thanks!!
While your log file isn't CSV, I find the csv library useful in a lot of non-CSV parsing. You can use it to parse your log file by skipping blank lines and any line starting with ---, ===, or ONUS. Your column separator is a whitespace character:
csv = CSV.read("./example.log", skip_lines: /\A(---|===|ONUS)/,
skip_blanks: true, col_sep: " ")
Then, some lines only have one element in the parsed array; those are your header lines. So we can split the csv array into groups based on when we only have one element, and create a hash from the result:
output_hash = csv.slice_before { |row| row.length == 1 }.
  each_with_object({}) do |((name), *rows), hash|
    hash[name] = rows.to_h
  end
Now, it's a little hard to tell if you wanted the hash output as the text you showed, or if you just wanted the hash. If you want the text output, we'll first need to see how much room each column needs to be displayed:
name_length = output_hash.keys.max_by(&:length).length
list_length = output_hash.values.flat_map(&:keys).max_by(&:length).length
detail_length = output_hash.values.flat_map(&:values).max_by(&:length).length
format = "%-#{name_length}s %-#{list_length}s %-#{detail_length}s"
and then we can output the header row and all the values in output_hash, but only if they have any values:
puts("#{format}\n\n" % ["NAME", "LIST", "DETAILS"])
output_hash.reject { |name, values| values.empty? }.each do |name, values|
  list, detail = values.first
  puts(format % [name, list, detail])
  values.drop(1).each do |list, detail|
    puts(format % ['', list, detail])
  end
  puts
end
and the result:
NAME    LIST     DETAILS

USOLA51 ICC_ONUS .035400391
        PA_ONUS  .039800391
        PE_ONUS  .000610352

USOLA10 PAL      52.7266846
        CFG_ONUS 15.9489746

USOLA55 PA_ONUS  47.4707031
        PAL      3.956604
        ICC_ONUS .020385742
        PE_ONUS  .000610352
It's a little hard to explain (for me) what slice_before does. But it takes an array (or other enumerable) and creates groups or chunks of its elements, where the first element of each group matches the parameter or makes the block return true. For instance, if we had a smaller array:
array = ["slice here", 1, 2, "slice here", 3, 4]
array.slice_before { |el| el == "slice here" }.entries
# => [["slice here", 1, 2], ["slice here", 3, 4]]
We told slice_before that we want each group to begin with an element that equals "slice here", so two groups are returned: the first element in each is "slice here", and the remaining elements are all the elements in the array until the next time it sees "slice here".
So then, we can take that result and we call each_with_object on it, passing an empty hash to start out with. With each_with_object, the first parameter is going to be the element of the array (from each) and the second is going to be the object you passed. What happens when the block parameters look like |((name), *rows), hash| is that first parameter (the element of the array) gets deconstructed into the first element of the array and the remaining elements:
# the array here is what gets passed to `each_with_object` for the first iteration as the first parameter
name, *rows = [["USOLA51"], ["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
name # => ["USOLA51"]
rows # => [["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
So then, we deconstruct that first element again, just so we don't have it in an array:
name, * = name # the `, *` isn't needed in the block parameters, but is needed when you run these examples in irb
name # => "USOLA51"
For the max_by(&:length).length, all we're doing is finding the longest element in the array (returned by either keys or values) and getting the length of it:
output_hash = {"USOLA51"=>{"ICC_ONUS"=>".035400391", "PA_ONUS"=>".039800391", "PE_ONUS"=>".000610352"}, "USOLA10"=>{"PAL"=>"52.7266846", "CFG_ONUS"=>"15.9489746"}, "USOLA55"=>{"PA_ONUS"=>"47.4707031", "PAL"=>"3.956604", "ICC_ONUS"=>".020385742", "PE_ONUS"=>".000610352"}, "USOLA56"=>{}}
output_hash.values.flat_map(&:keys)
# => ["ICC_ONUS", "PA_ONUS", "PE_ONUS", "PAL", "CFG_ONUS", "PA_ONUS", "PAL", "ICC_ONUS", "PE_ONUS"]
output_hash.values.map(&:length) # => [8, 7, 7, 3, 8, 7, 3, 8, 7]
output_hash.values.flat_map(&:keys).max_by(&:length) # => "ICC_ONUS"
output_hash.values.flat_map(&:keys).max_by(&:length).length # => 8
It's been a long time since I've worked with Ruby, so I've probably forgotten a lot of the shortcuts and syntactic sugar, but this file seems to be easily parseable without great effort.
A simple line-by-line comparison of expected values is enough. The first step is to strip surrounding whitespace and ignore blank lines or lines that start with = or -. Next, if there is only one value on a line, it is the title, and the line after it contains the column names, which can be ignored for your desired output. If neither a title nor the column names are encountered, save the following key/value pairs as Ruby key/value pairs. During this operation, also check for the longest occurring string and adjust the column padding, so that you can generate the table-like output with padding afterwards.
# Set up the loop
merged = []
current = -1
awaiting_headers = false
columns = ['NAME', 'LIST', 'DETAILS']
# Keep track of the max column length
columns_pad = columns.map { |c| c.length }
str.each_line do |line|
  # Remove surrounding whitespaces,
  # ignore empty or = - lines
  line.strip!
  next if line.empty?
  next if ['-', '='].include? line[0]

  # Get the values of this line
  parts = line.split ' '

  # We're not awaiting the headers and
  # there is just one value, must be the title
  if not awaiting_headers and parts.size == 1
    # If this string is longer than the current maximum
    columns_pad[0] = line.length if line.length > columns_pad[0]
    # Create a hash for this item
    merged[current += 1] = { name: line, data: {} }
    # Next must be the headers
    awaiting_headers = true
    next
  end

  # Headers encountered
  if awaiting_headers
    # Just skip it from here
    awaiting_headers = false
    next
  end

  # Take 2 parts of each (should be always only those two)
  # and treat them as key/value
  parts.each_cons(2) do |key, value|
    # Make it a ruby key/value pair
    merged[current][:data][key] = value
    # Check if LIST or DETAILS column length needs to be raised
    columns_pad[1] = key.length if key.length > columns_pad[1]
    columns_pad[2] = value.length if value.length > columns_pad[2]
  end
end
# Adding three spaces between columns
columns_pad.map! { |c| c + 3}
# Writing the headers
result = columns.map.with_index { |c, i| c.ljust(columns_pad[i]) }.join + "\n"
merged.each do |item|
  # Remove the next line if you want to include empty data
  next if item[:data].empty?
  result += "\n"
  result += item[:name].ljust(columns_pad[0])
  # For the first value in data, we don't need extra padding or a line break
  padding = ""
  item[:data].each do |key, value|
    result += padding
    result += key.ljust(columns_pad[1])
    result += value.ljust(columns_pad[2])
    # Set the padding to include a line break and fill up the NAME column with spaces
    padding = "\n" + "".ljust(columns_pad[0])
  end
  result += "\n"
end
puts result
Which will result in
NAME      LIST       DETAILS

USOLA51   ICC_ONUS   .035400391
          PA_ONUS    .039800391
          PE_ONUS    .000610352

USOLA10   PAL        52.7266846
          CFG_ONUS   15.9489746

USOLA55   PA_ONUS    47.4707031
          PAL        3.956604
          ICC_ONUS   .020385742
          PE_ONUS    .000610352

How to separate two data in one cell on csv by ruby

I want to change CSV file content:
itemId,url,name,type
1|urlA|nameA|typeA
2|urlB|nameB|typeB
3|urlC,urlD|nameC|typeC
4|urlE|nameE|typeE
into an array:
[itemId,url,name,type]
[1,urlA,nameA,typeA]
[2,urlB,nameB,typeB]
[**3**,**urlC**,nameC,typeC]
[**3**,**urlD**,nameC,typeC]
[4,urlE,nameE,typeE]
Could anybody teach me how to do it?
Eventually, I'm going to download the URL files (.jpg).
The header row has a different separator than the data. That's a problem. You need to change the header row to use | instead of ,. Then:
require 'csv'
require 'pp'
array = Array.new
CSV.foreach("test.csv", col_sep: '|', headers: true) do |row|
if row['url'][/,/]
row['url'].split(',').each do |url|
row['url'] = url
array.push row.to_h.values
end
else
array.push row.to_h.values
end
end
pp array
=> [["1", "urlA", "nameA", "typeA"],
["2", "urlB", "nameB", "typeB"],
["3", "urlC", "nameC", "typeC"],
["3", "urlD", "nameC", "typeC"],
["4", "urlE", "nameE", "typeE"]]
You'll need to test the fifth column to see how the line should be parsed. If you see a fifth element (row[4]), output the line twice, replacing the url column:
array = Array.new
CSV.foreach("test.csv") do |row|
if row[4]
array << [row[0..1], row[3..4]].flatten
array << [[row[0]], row[2..4]].flatten
else
array << row
end
end
p array
In your example you had asterisks but I'm assuming that was just to emphasise the lines for which you want special handling. If you do want asterisks, you can modify the two array shovel commands appropriately.
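For example, inside the row[4] branch, the two shovel lines could become something like this (a sketch that bolds the itemId and url, as in your expected output):
array << ["**#{row[0]}**", "**#{row[1]}**", *row[3..4]]
array << ["**#{row[0]}**", "**#{row[2]}**", *row[3..4]]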

Process a file using hash

Below is the input file that I want to store into a hash table, sort it and output in the format shown below.
Input File
Name=Ashok, Email=ashok85#gmail.com, Country=India, Comments=9898984512
Email=raju#hotmail.com, Country=Sri Lanka, Name=Raju
Country=India, Comments=45535878, Email=vijay#gmail.com, Name=Vijay
Name=Ashok, Country=India, Email=ashok37#live.com, Comments=8898788987
Output File (Sorted by Name)
Name     Email               Country     Comments
-------------------------------------------------------
Ashok    ashok37#live.com    India       8898788987
Ashok    ashok85#gmail.com   India       9898984512
Raju     raju#hotmail.com    Sri Lanka
Vijay    vijay#gmail.com     India       45535878
So far, I have read the data from the file and stored every line into an array, but I am stuck at hash[key]=>value
file_data = {}
File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    line_data = line.split('=')
    file_data[line_data[0]] = line_data[1]
  end
end
puts file_data
Given that each line in your input file has a pattern of key=value strings separated by commas, you need to split the line first around the comma, and then around the equals sign. Here is a corrected version of the code:
# Need to collect parsed data from each line into an array
array_of_file_data = []
File.open('input.txt', 'r') do |file|
  file.each_line do |line|
    # create a hash to collect data from each line
    file_data = {}
    # First split by comma
    pairs = line.chomp.split(", ")
    pairs.each do |p|
      # Split by = to separate out key and value
      key_value = p.split('=')
      file_data[key_value[0]] = key_value[1]
    end
    array_of_file_data << file_data
  end
end
puts array_of_file_data
The above code will print:
{"Name"=>"Ashok", "Email"=>"ashok85#gmail.com", "Country"=>"India", "Comments"=>"9898984512"}
{"Email"=>"raju#hotmail.com", "Country"=>"Sri Lanka", "Name"=>"Raju"}
{"Country"=>"India", "Comments"=>"45535878", "Email"=>"vijay#gmail.com", "Name"=>"Vijay"}
{"Name"=>"Ashok", "Country"=>"India", "Email"=>"ashok37#live.com", "Comments"=>"8898788987"}
A more complete version of the program is given below.
hash_array = []
# Parse the lines and store it in hash array
File.open("sample.txt", "r") do |f|
f.each_line do |line|
# Splits are done around , and = preceded or followed
# by any number of white spaces
splits = line.chomp.split(/\s*,\s*/).map{|p| p.split(/\s*=\s*/)}
# to_h can be used to convert an array with even number of elements
# into a hash, by treating it as an array of key-value pairs
hash_array << splits.to_h
end
end
# Sort the array of hashes
hash_array = hash_array.sort {|i, j| i["Name"] <=> j["Name"]}
# Print the output, more tricks needed to get it better formatted
header = ["Name", "Email", "Country", "Comments"]
puts header.join(" ")
hash_array.each do |h|
  puts h.values_at(*header).join(" ")
end
The above program outputs:
Name Email Country Comments
Ashok ashok85#gmail.com India 9898984512
Ashok ashok37#live.com India 8898788987
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
You may want to refer to "Padding printed output of tabular data" to get better-formatted tabular output.
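As a rough sketch of that padding idea, reusing hash_array and header from above (the column widths here are hand-picked for this data):
widths = [8, 20, 12, 12]
puts header.zip(widths).map { |name, w| name.ljust(w) }.join
puts '-' * widths.sum
hash_array.each do |h|
  # Missing fields (e.g. Raju's Comments) become empty strings
  puts header.zip(widths).map { |key, w| (h[key] || '').ljust(w) }.join
end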

Extract names from File using Ruby and Grep

I have a file with the following data:
other data
user1=name1
user2=name2
user3=name3
other data
To extract the names I do the following:
names = File.open('resource.cfg', 'r') do |f|
  f.grep(/[a-z][a-z][0-9]/)
end
which returns the following array
user1=name1
user2=name2
user3=name3
but I really want only the name part
name1
name2
name3
Right now I'm doing this after the file step:
names = names.map do |name|
  name[7..9]
end
Is there a better way to do this within the file step?
You could do it like this, using String#scan with a regex:
Code
File.read(FNAME).scan(/(?<==)[A-Za-z]+\d+$/)
Explanation
Let's start by constructing a file:
FNAME = "my_file"
lines =<<_
other data
user1=name1
user2=name2
user3=name3
other data
_
File.write(FNAME,lines)
We can confirm the file contents:
puts File.read(FNAME)
other data
user1=name1
user2=name2
user3=name3
other data
Now run the code:
File.read(FNAME).scan(/(?<==)[A-Za-z]+\d+$/)
#=> ["name1", "name2", "name3"]
A word about the regex I used.
(?<=...)
is called a "positive lookbehind". Whatever is inserted in place of the dots must immediately precede the match, but is not part of the match (and for that reason is sometimes referred to as as "zero-length" group). We want the match to follow an equals sign, so the "positive lookbehind" is as follows:
(?<==)
This is followed by one or more letters, then one or more digits, then an end-of-line, which comprise the pattern to be matched. You could of course change this if you have different requirements, such as names being lowercase or beginning with a capital letter, a specified number of digits, and so on.
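An ordinary capture group works just as well as the lookbehind here, if you find it easier to read; both return the same names for this file:
File.read(FNAME).scan(/=([A-Za-z]+\d+)$/).flatten
#=> ["name1", "name2", "name3"]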
Is your code working as you have posted it?
names = File.open('resource.cfg', 'r') { |f| f.grep(/[a-z][a-z][0-9]/) }
names = names.map { |name| name[7..9] }
=> ["ame", "ame", "ame"]
You could make it into a neat little one-liner by writing it as such:
names = File.readlines('resource.cfg').grep(/=(\w*)/) { |x| x.split('=')[1].chomp }
You can do it all in a single step:
names = File.open('resource.cfg', 'r') do |f|
  f.grep(/[a-z][a-z][0-9]/).map { |x| x.split('=')[1] }
end
