How to create a hash by reading a text file in Ruby? - ruby

I want my program to read each word from the text file and then, match them with numbers by row.
For example; text file is
my name is donald knuth0
and the program should run like :
"my" => "1" , "name" => "2" , "donald" => "3" , "knuth" => "4"

You could do it by lines or by words. If you want to index the words by line numbers, you could probably take a little from both options.
# Option 1: read the file as an Array of lines, removing the trailing newline from each line
lines = File.readlines("hello.txt").map(&:chomp)
# Option 2: read the file as an Array of words; split with no arguments splits on any run of whitespace
list_of_words = File.read("hello.txt").split
# Generates a list of line numbers, starting with 1
line_numbers = (1..lines.length)
# "Zips" up the lines with their numbers: [["line 1", 1], ["line 2", 2]]
pairs = lines.zip(line_numbers)
# Converts the pairs to a Hash
hash = Hash[pairs]
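For the word-to-position mapping the question asks for, the two options can be combined into a minimal sketch. The sample text is made up for the example, and a temporary file stands in for hello.txt. Repeated words keep only their last position, because hash keys are unique.

```ruby
require 'tempfile'

# Hypothetical sample input; a temp file stands in for the real text file.
file = Tempfile.new('words')
file.write("my name\ndonald knuth\n")
file.close

# split with no arguments splits on any run of whitespace, newlines included.
words = File.read(file.path).split

# with_index(1) numbers the words starting from 1 instead of 0.
word_index = words.each.with_index(1).to_h

file.unlink
```

If string values such as "1" are wanted, as in the question, call to_s on each index before building the hash.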

Related

Ruby: script to convert structured text to a csv

I have a text file with structured text which I wish to convert to a csv file.
The file looks something like:
name: Seamus
address: 123 Strand Avenue

name: Seana
address: 126 Strand Avenue
I would like it to look like:
|name | address
______________________________
|Seamus | 123 Strand Avenue
______________________________
|Seana | 126 Strand Avenue
So I understand that I need to do something like;
create a csv file
create the column names
read the text file
for each row of the text file starting with 'name', assign the following text to the 'name' column; for each row starting with 'address', assign the value to the 'address' column, etc.
But I don't know how to do so.
I would appreciate any pointers people could provide.
The solution starts by identifying how to parse the text file. In this specific case what separates the "records" in the text file is an empty line.
First step would be importing the file contents:
string_content = File.read("path/to/my_file.txt")
# => "name: Seamus\naddress: 123 Strand Avenue\n\nname: Seana\naddress: 126 Strand Avenue\n"
Then you would need to separate the records. As you can see when parsing the file the empty line is a line that only contains \n, so the \n from the line above plus the one on the empty line make \n\n. That is what you need to look for to separate the records:
string_records = string_content.split("\n\n")
# => ["name: Seamus\naddress: 123 Strand Avenue", "name: Seana\naddress: 126 Strand Avenue\n"]
And then once you have the strings with the records is just a matter of splitting by \n again to separate the fields:
records_by_field = string_records.map do |string_record|
string_record.split("\n")
end
# => [["name: Seamus", "address: 123 Strand Avenue"], ["name: Seana", "address: 126 Strand Avenue"]]
Once that is separated you need to split the records by : to separate field_name and value:
data = records_by_field.map do |record|
record.each_with_object({}) do |field, new_record|
field_name, field_value = field.split(":", 2) # the limit of 2 keeps any later colons inside the value
new_record[field_name] = field_value.strip # don't forget to get rid of the initial space with String#strip
end
end
# => [{"name"=>"Seamus", "address"=>"123 Strand Avenue"}, {"name"=>"Seana", "address"=>"126 Strand Avenue"}]
And there you have it! An array of hashes with the correct key-value pairs.
Now from that you can create a CSV or just use it to give it any other format you may want.
To resolve your specific CSV question:
require 'csv'
# first you need to get your column headers, which will be the keys of any of the hashes, the first will do
column_names = data.first.keys
CSV.open("output_file.csv", "wb") do |csv|
# first we add the headers
csv << column_names
# for each data row we create an array with values ordered as the column_names
data.each do |data_hash|
csv << data_hash.values_at(*column_names)
end
end
That will create an output_file.csv in the same directory where you run your ruby script.
And that's it!
Let's construct the file.
str =<<~END
name: Seamus
address: 123 Strand Avenue

name: Seana
address: 126 Strand Avenue


address: 221B Baker Street
name: Sherlock
END
Notice that I've added a third record that has the order of the "name" and "address" lines reversed, and it is preceded by an extra blank line.
in_file = 'temp.txt'
File.write(in_file, str)
#=> 124
The first step is to obtain the headers for the CSV file:
headers = []
f = File.open(in_file)
loop do
header = f.gets.to_s[/[^:]+(?=:)/] # to_s guards against nil when gets reaches the end of the file
break if header.nil?
headers << header
end
f.close
headers
#=> ["name", "address"]
Notice that the number of headers (two in the example) is arbitrary.
See IO#gets. The regular expression reads, "match one or more characters other than a colon, immediately followed by a colon", (?=:) being a positive lookahead.
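To see the lookahead at work outside the loop, here is a small sketch; the constant name HEADER is made up for the example. The colon must be present for a match, but it is not part of the returned text, and a blank line yields nil, which is what ends the loop above.

```ruby
# One or more non-colon characters, followed by a colon that is not consumed.
HEADER = /[^:]+(?=:)/

name_header = "name: Seamus\n"[HEADER]   # the text before the first colon
blank_line  = "\n"[HEADER]               # no colon, so nothing matches
```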
If in_file is not exceedingly large it's easiest to first read that file into an array of hashes. The first step is to read the file into a string and then split the string on contiguous lines that contain nothing other than newlines and spaces:
arr = File.read(in_file).chomp.split(/\n\s*\n/)
#=> ["name: Seamus\naddress: 123 Strand Avenue",
# "name: Seana\naddress: 126 Strand Avenue",
# "address: 221B Baker Street\nname: Sherlock"]
We can now convert each element of this array to a hash:
arr = File.read(in_file).split(/\n\s*\n/).
map do |s|
s.split("\n").
each_with_object({}) do |p,h|
key, value = p.split(/: +/)
h[key] = value
end
end
#=> [{"name"=>"Seamus", "address"=>"123 Strand Avenue"},
# {"name"=>"Seana", "address"=>"126 Strand Avenue"},
# {"address"=>"221B Baker Street", "name"=>"Sherlock"}]
We are now ready to construct the CSV file:
out_file = 'temp.csv'
require 'csv'
CSV.open(out_file, 'w') do |csv|
csv << headers
arr.each { |h| csv << h.values_at(*headers) }
end
Let's see what was written:
puts File.read(out_file)
name,address
Seamus,123 Strand Avenue
Seana,126 Strand Avenue
Sherlock,221B Baker Street
See CSV::open and Hash#values_at.
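As a small illustration of why values_at matters here, the sketch below uses the third record, whose keys are in the opposite order to the headers. values_at returns the values in the order of the keys it is given, not in the hash's own order, and gives nil for a missing key.

```ruby
headers = ["name", "address"]

# Keys are stored address-first, as in the third record of the file.
record = { "address" => "221B Baker Street", "name" => "Sherlock" }
row = record.values_at(*headers)

# A record with a missing field produces nil, which CSV writes as an empty column.
partial_row = { "name" => "Seana" }.values_at(*headers)
```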
This is not the format specified in the question. In fact, a file with that format would not be a valid CSV file, because there is no consistent column separator: the values are padded to line up, so the run of spaces around the pipe between '|name' and 'address' differs from the one between '|Seamus' and '123 Strand Avenue'. Moreover, even if the separators were the same, the pipe at the beginning of each line would become the first character of the name.
We could change the column separator to a pipe (rather than a comma, the default) by writing CSV.open(out_file, 'w', col_sep: '|'). A common mistake in constructing CSV files is to surround the column separator with one or more spaces. That invariably leads to boo-boos.
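A minimal sketch of the pipe-separated variant, using CSV.generate to build the text in memory rather than writing a file; the rows are taken from the example data above.

```ruby
require 'csv'

rows = [["name", "address"], ["Seamus", "123 Strand Avenue"]]

# col_sep replaces the default comma; no spaces are added around it.
piped = CSV.generate(col_sep: '|') do |csv|
  rows.each { |row| csv << row }
end
```

The same col_sep option has to be passed when reading the file back, otherwise each line is parsed as a single column.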

Ruby data formatting

I'm reading a log file and trying to organize the data in the format below. I want to use each NAME (e.g. USOLA51, USOLA10) as a hash key and build the corresponding arrays for LIST and DETAILS.
I've created the hash too but not sure how to take/extract the corresponding/associated array values.
Expected Output
NAME     LIST      DETAILS
USOLA51  ICC_ONUS  .035400391
         PA_ONUS   .039800391
         PA_ONUS   .000610352
USOLA10  PAL       52.7266846
         CFG_ONUS  15.9489746
likewise for the other values
Log file:
--- data details ----
USOLA51
ONUS size
------------------------------ ----------
ICC_ONUS .035400391
PA_ONUS .039800391
PE_ONUS .000610352
=========================================
---- data details ----
USOLA10
ONUS size
------------------------------ ----------
PAL 52.7266846
CFG_ONUS 15.9489746
=========================================
---- data details ----
USOLA55
ONUS size
------------------------------ ----------
PA_ONUS 47.4707031
PAL 3.956604
ICC_ONUS .020385742
PE_ONUS .000610352
=========================================
---- data details ----
USOLA56
ONUS size
------------------------------ ----------
=========================================
what I've tried
unique = Array.new
owner = Array.new
db = Array.new
File.read("mydb_size.log").each_line do |line|
next if line =~ /---- data details ----|^ONUS|---|=======/
unique << line.strip if line =~ /^U.*\d/
end
hash = Hash[unique.collect { |item| [item, ""] } ]
puts hash
Current O/p
{"USOLA51"=>"", "USOLA10"=>"", "USOLA55"=>"", "USOLA56"=>""}
Any help to move forward would be really appreciated. Thanks!
While your log file isn't CSV, I find the csv library useful in a lot of non-csv parsing. You can use it to parse your log file, by skipping blank lines, and any line starting with ---, ===, or ONUS. Your column separator is a white space character:
require 'csv'
csv = CSV.read("./example.log", skip_lines: /\A(---|===|ONUS)/,
               skip_blanks: true, col_sep: " ")
Then, some lines only have 1 element in the array parsed out, those are your header lines. So we can split the csv array into groups based on when we only have 1 element, and create a hash from the result:
output_hash = csv.slice_before { |row| row.length == 1 }.
each_with_object({}) do |((name), *rows), hash|
hash[name] = rows.to_h
end
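The grouping step can be tried without a log file. The sketch below uses a hand-written array of rows standing in for what CSV.read would return (an assumption for the example), including a name with no data rows.

```ruby
# Stand-in for the parsed rows: one-element rows are names, two-element rows are data.
rows = [
  ["USOLA51"],
  ["ICC_ONUS", ".035400391"],
  ["PA_ONUS", ".039800391"],
  ["USOLA56"]
]

# Start a new group at every one-element row, then turn each group into name => {list => detail}.
output_hash = rows.slice_before { |row| row.length == 1 }
                  .each_with_object({}) do |((name), *data_rows), hash|
  hash[name] = data_rows.to_h
end
```

A name that is followed directly by the next name, like USOLA56 here, ends up with an empty hash.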
Now, it's a little hard to tell if you wanted the hash output as the text you showed, or if you just wanted the hash. If you want the text output, we'll first need to see how much room each column needs to be displayed:
name_length = output_hash.keys.max_by(&:length).length
list_length = output_hash.values.flat_map(&:keys).max_by(&:length).length
detail_length = output_hash.values.flat_map(&:values).max_by(&:length).length
format = "%-#{name_length}s %-#{list_length}s %-#{detail_length}s"
and then we can output the header row and all the values in output_hash, but only if they have any values:
puts("#{format}\n\n" % ["NAME", "LIST", "DETAILS"])
output_hash.reject { |name, values| values.empty? }.each do |name, values|
list, detail = values.first
puts(format % [name, list, detail])
values.drop(1).each do |list, detail|
puts(format % ['', list, detail])
end
puts
end
and the result:
NAME    LIST     DETAILS

USOLA51 ICC_ONUS .035400391
        PA_ONUS  .039800391
        PE_ONUS  .000610352

USOLA10 PAL      52.7266846
        CFG_ONUS 15.9489746

USOLA55 PA_ONUS  47.4707031
        PAL      3.956604
        ICC_ONUS .020385742
        PE_ONUS  .000610352
It's a little hard (for me) to explain what slice_before does. It takes an array (or other enumerable) and creates groups or chunks of its elements, where the first element of each group matches the parameter or makes the block return true. For instance, if we had a smaller array:
array = ["slice here", 1, 2, "slice here", 3, 4]
array.slice_before { |el| el == "slice here" }.entries
# => [["slice here", 1, 2], ["slice here", 3, 4]]
We told slice_before, we want each group to begin with the element that equals "slice here", so we have 2 groups returned, the first element in each is "slice here" and the remaining elements are all the elements in the array until the next time it saw "slice here".
So then, we can take that result and we call each_with_object on it, passing an empty hash to start out with. With each_with_object, the first parameter is going to be the element of the array (from each) and the second is going to be the object you passed. What happens when the block parameters look like |((name), *rows), hash| is that first parameter (the element of the array) gets deconstructed into the first element of the array and the remaining elements:
# the array here is what gets passed to `each_with_object` for the first iteration as the first parameter
name, *rows = [["USOLA51"], ["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
name # => ["USOLA51"]
rows # => [["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
So then, we deconstruct that first element again, just so we don't have it in an array:
name, * = name # the `, *` isn't needed in the block parameters, but is needed when you run these examples in irb
name # => "USOLA51"
For the max_by(&:length).length, all we're doing is finding the longest element in the array (returned by either keys or values) and getting the length of it:
output_hash = {"USOLA51"=>{"ICC_ONUS"=>".035400391", "PA_ONUS"=>".039800391", "PE_ONUS"=>".000610352"}, "USOLA10"=>{"PAL"=>"52.7266846", "CFG_ONUS"=>"15.9489746"}, "USOLA55"=>{"PA_ONUS"=>"47.4707031", "PAL"=>"3.956604", "ICC_ONUS"=>".020385742", "PE_ONUS"=>".000610352"}, "USOLA56"=>{}}
output_hash.values.flat_map(&:keys)
# => ["ICC_ONUS", "PA_ONUS", "PE_ONUS", "PAL", "CFG_ONUS", "PA_ONUS", "PAL", "ICC_ONUS", "PE_ONUS"]
output_hash.values.flat_map(&:keys).map(&:length) # => [8, 7, 7, 3, 8, 7, 3, 8, 7]
output_hash.values.flat_map(&:keys).max_by(&:length) # => "ICC_ONUS"
output_hash.values.flat_map(&:keys).max_by(&:length).length # => 8
It's been a long time since I've worked with Ruby, so I've probably forgotten a lot of the shortcuts and syntactic sugar, but this file seems to be easy to parse without much effort.
A simple line-by-line comparison against the expected values is enough. The first step is to remove all surrounding whitespace and to ignore blank lines and lines that start with = or -. Next, if there is only one value on the line, it is the title; the line after it holds the column names, which can be ignored for your desired output. If either the title or the column names are encountered, move on to the next line, and save the key/value pairs that follow as Ruby key/value pairs. During this pass, also track the longest string seen in each column and adjust the column padding, so that you can generate the table-like output afterwards.
# Set up the loop
merged = []
current = -1
awaiting_headers = false
columns = ['NAME', 'LIST', 'DETAILS']
# Keep track of the max column length
columns_pad = columns.map { |c| c.length }
# str holds the contents of the log file, e.g. str = File.read("mydb_size.log")
str.each_line do |line|
# Remove surrounding whitespaces,
# ignore empty or = - lines
line.strip!
next if line.empty?
next if ['-','='].include? line[0]
# Get the values of this line
parts = line.split ' '
# We're not awaiting the headers and
# there is just one value, must be the title
if not awaiting_headers and parts.size == 1
# If this string is longer than the current maximum
columns_pad[0] = line.length if line.length > columns_pad[0]
# Create a hash for this item
merged[current += 1] = {name: line, data: {}}
# Next must be the headers
awaiting_headers = true
next
end
# Headers encountered
if awaiting_headers
# Just skip it from here
awaiting_headers = false
next
end
# Take 2 parts of each (should be always only those two)
# and treat them as key/value
parts.each_cons(2) do |key, value|
# Make it a ruby key/value pair
merged[current][:data][key] = value
# Check if LIST or DETAILS column length needs to be raised
columns_pad[1] = key.length if key.length > columns_pad[1]
columns_pad[2] = value.length if value.length > columns_pad[2]
end
end
# Adding three spaces between columns
columns_pad.map! { |c| c + 3}
# Writing the headers
result = columns.map.with_index { |c, i| c.ljust(columns_pad[i]) }.join + "\n"
merged.each do |item|
# Remove the next line if you want to include empty data
next if item[:data].empty?
result += "\n"
result += item[:name].ljust(columns_pad[0])
# For the first value in data, we don't need extra padding or a line break
padding = ""
item[:data].each do |key, value|
result += padding
result += key.ljust(columns_pad[1])
result += value.ljust(columns_pad[2])
# Set the padding to include a line break and fill up the NAME column with spaces
padding = "\n" + "".ljust(columns_pad[0])
end
result += "\n"
end
puts result
Which will result in
NAME      LIST       DETAILS

USOLA51   ICC_ONUS   .035400391
          PA_ONUS    .039800391
          PE_ONUS    .000610352

USOLA10   PAL        52.7266846
          CFG_ONUS   15.9489746

USOLA55   PA_ONUS    47.4707031
          PAL        3.956604
          ICC_ONUS   .020385742
          PE_ONUS    .000610352

How to parse a text by line and store some values in an array

I want to parse the following text and find lines that start with '+' or '-':
--- a/product.json
+++ b/product.json
@@ -1,4 +1,4 @@
{
- "name": "Coca Cola",
- "barcode": "41324134132"
+ "name": "Sprite",
+ "barcode": "41324134131"
}
\ No newline at end of file
When I find such a line, I want to store the attribute name. I.e., for:
- "name": "Coca Cola",
I want to store name in minus_array.
You want to iterate over the lines, and find lines that begin with either - or + followed by whitespace:
text = %[
--- a/product.json
+++ b/product.json
@@ -1,4 +1,4 @@
{
- "name": "Coca Cola",
- "barcode": "41324134132"
+ "name": "Sprite",
+ "barcode": "41324134131"
}
\ No newline at end of file
]
text.lines.select{ |l| l.lstrip[/^[+-]\s/] }.map{ |s| s.split[1] }
# => ["\"name\":", "\"barcode\":", "\"name\":", "\"barcode\":"]
lines splits a string on line-ends, returning the entire line, including the trailing line-end character.
lstrip removes whitespace at the start of the line. This is to normalize lines allowing the regex pattern to be a bit more simple.
l.lstrip[/^[+-]\s/] is a bit of Ruby String sleight of hand that basically says to apply the pattern to the string and return the matching text. If nothing in the string matches, nil is returned, which acts as false as far as select is concerned. If something in the string matches the pattern, [] returns the matched text, which acts as a true value for select, which then passes on that string.
map iterates over all elements that select passes to it and transforms each element by splitting it on whitespace, which is the default behavior of split. [1] returns the second element of the resulting array.
Here's an alternate path to the same place:
ary = []
text.lines.each do |l|
i = l.strip
ary << i if i[/^\{$/] .. i[/^}$/]
end
ary[1..-2].map{ |s| s.split[1] } # => ["\"name\":", "\"barcode\":", "\"name\":", "\"barcode\":"]
That'll get you started. How to remove duplicates, strip the leading/trailing double-quotes and colon is your task.
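One possible way to finish that task is sketched below. It assumes each changed line has the shape sign, whitespace, quoted attribute name, colon, as in the sample diff; the array names minus_array and plus_array follow the question, and the regular expression is only one of several that would work.

```ruby
diff = <<~DIFF
  --- a/product.json
  +++ b/product.json
  @@ -1,4 +1,4 @@
  {
  - "name": "Coca Cola",
  - "barcode": "41324134132"
  + "name": "Sprite",
  + "barcode": "41324134131"
  }
DIFF

minus_array = []
plus_array = []

diff.each_line do |line|
  # Capture the sign and the text between the first pair of double quotes before a colon.
  match = line.match(/\A([+-])\s*"([^"]+)"\s*:/)
  next unless match

  (match[1] == '-' ? minus_array : plus_array) << match[2]
end
```

The --- and +++ header lines are skipped because the sign is not followed by a quoted name. Call uniq on either array if duplicates need to be removed.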
text.split(/\n/).select { |l| l =~ /^\+./ }
If you're reading from a file (File.foreach closes the file for you once the lines have been read):
File.foreach('your_file.txt').select { |l| l =~ /^\+./ }
Use group_by to group according to the first character:
groups = text.lines.group_by { |l| l[0] }
groups['-']
# => ["--- a/product.json\n", "- \"name\": \"Coca Cola\",\n", "- \"barcode\": \"41324134132\"\n"]
groups['+']
# => ["+++ b/product.json\n", "+ \"name\": \"Sprite\",\n", "+ \"barcode\": \"41324134131\"\n"]
File.readlines("file.txt").each do |line|
  # start_with? accepts several prefixes and returns true if any of them matches
  if line.start_with?('+ ', '- ')
    words = line.split(":")
    key = words[0][/".*"/]
    val = words[1][/".*"/]
    # You can then do what you will with the name and value here
    # For example: minus_array << key if line.start_with?('-')
  end
end
I'm not entirely sure of the constraints you have with this, so I can't give a more specific answer. Basically, you can iterate over the lines of a file with File.readlines('file').each { }. Then we check whether the line starts with + or -, and grab the name and value accordingly. I put a space in the start_with? prefixes because otherwise they would also match the top two lines of your example.
Hopefully that's what you were looking for!

How do I search for lines that start with a word from an array?

I have two arrays. The first array is huge with thousands of elements. The second array contains a list of thirty or so words. I want to select lines from the first array that start with a word from the second array.
I'm thinking Regex, but I'm not quite sure how to accomplish it using that.
A sample from the first_array:
array[0] = [ 'jsmith88:*:4185:208:jsmith113:/students/jsmith88:/usr/bin/bash' ]
A sample from the second_array:
array[5] = [ 'jsmith88' ]
You could try using the select method from the Array class, like this:
# lines is the first array, word_list is the second array
words = word_list.join '|'
result = lines.select { |line| line =~ /^(#{words})/ }
result should contain every line that starts with a word from the second array.
Like @Sabuj Hassan explained below, ^ means the start of the line. The | character means OR.
Edit: Using Regexp.union as suggested below by @oro2k:
words = Regexp.union word_list
result = lines.select { |line| line =~ /^(#{words})/ }
This assumes your words do not contain any special regex characters. Join the words with a pipe (|):
words = [ 'jsmith88', 'alex' ]
word_list = words.join("|")
Now use the joined string in a regex for each line from the other array:
lines = [ 'jsmith88:*:4185:208:jsmith113:/students/jsmith88:/usr/bin/bash' ]
if(lines[0] =~ /^(#{word_list})/)
print "ok"
end
Here ^ means the start of the line, and the parentheses (..) hold the words as an OR condition.
Try this:
lines = [ 'jsmith88:*:4185:208:jsmith113:/students/jsmith88:/usr/bin/bash' ]
words = [ 'jsmith88' ]
lines.each_with_object([]){|line, array_obj| array_obj << line if words.include?(line.scan(/\b\w+\b/)[0])}
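A variation on the approaches above, sketched under the assumption that a "word" must end at a word boundary, so that jsmith88 does not also select a longer login such as jsmith889. The second and third sample lines are made up for the example. Regexp.union escapes any special characters in the words.

```ruby
lines = [
  'jsmith88:*:4185:208:jsmith113:/students/jsmith88:/usr/bin/bash',
  'jsmith889:*:4186:208:other:/students/jsmith889:/usr/bin/bash', # hypothetical
  'bob:*:4187:208:bob:/students/bob:/usr/bin/bash'                # hypothetical
]
word_list = ['jsmith88', 'alex']

# \A anchors at the start of the string; \b requires the word to end right there.
starts_with_word = /\A#{Regexp.union(word_list)}\b/

result = lines.select { |line| line.match?(starts_with_word) }
```

Building the pattern once, outside the select block, matters when the first array has thousands of elements.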

remove special character from .split string ruby

I read file using ruby and use .split to split line.
Example.txt
1
2
3
line1,line2,line3= @line.to_s.split("\n",3)
#actual
line1 => ["1
line2 => ", "2
line3 => ", "3"]
#what I expect
line1=1
line2=2
line3=3
how can i get what i expected?
Edit: it's just one newline; I couldn't enter a single newline in my question. To be more specific:
Example.txt
first_line\nsecond_line\nthird_line
File.open('Example.txt', 'r') do |f1|
@line = f1.readlines
f1.close
end
line1,line2,line3= @line.to_s.split("\n",3)
#actual
line1 => ["first_line
line2 => ", "second_line
line3 => ", "third_line"]
#what I expect
line1=first_line
line2=second_line
line3=third_line
You can't split using '\n', if you're trying to use line-ends. You MUST use "\n".
Strings using '\n' do not interpret \n as a line-ending, instead, they treat it as a literal backslash followed by "n":
'\n' # => "\\n"
"\n" # => "\n"
The question isn't at all clear, nor is the input file example clear given the little example code presented, however, guessing at what you want from the desired result...
If the input is a file called 'example.txt' looking like:
1
2
3
You can read it numerous ways:
File.read('example.txt').split("\n")
# => ["1", "2", "3"]
Or:
File.readlines('example.txt').map(&:chomp)
# => ["1", "2", "3"]
Either of those works; however, reading files into memory like this sets a very bad precedent. It's called "slurping", and it can crash your code or slow it, and the machine it's running on, to a crawl if the file is larger than the available memory. Even if the file fits into memory, loading a huge file can cause pauses as memory is allocated and reallocated. So, don't do that.
Instead, read the file line-by-line and process it that way if at all possible:
File.foreach('example.txt') do |line|
puts line
end
# >> 1
# >> 2
# >> 3
Don't do this:
File.open('Example.txt', 'r') do |f1|
@line = f1.readlines
f1.close
end
Ruby will automatically close a file opened like this:
File.open('Example.txt', 'r') do |f1|
...
end
There is no need to use close inside the block.
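Putting these points together, here is a sketch of one way to get the three variables from the question without slurping the file and without an explicit close. A temporary file with made-up contents stands in for Example.txt.

```ruby
require 'tempfile'

# Hypothetical stand-in for Example.txt.
file = Tempfile.new('example')
file.write("first_line\nsecond_line\nthird_line\n")
file.close

# Without a block, File.foreach returns an enumerator; first(3) stops reading after three lines.
line1, line2, line3 = File.foreach(file.path).first(3).map(&:chomp)

file.unlink
```

If the file has fewer than three lines, the remaining variables are nil.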
Depends a bit on what exactly you're expecting (and what @line is). If you are looking for numbers you can use:
line1, line2, line3 = @line.to_s.scan(/\d/)
If you want other characters you can use some other regular expression.
