I'm reading a log file and trying to organize the data in the format below. I want to use each NAME (e.g. USOLA51, USOLA10) as a hash key and build the corresponding arrays for LIST and DETAILS.
I've created the hash, but I'm not sure how to extract the associated array values.
Expected Output
NAME    LIST     DETAILS
USOLA51 ICC_ONUS .035400391
        PA_ONUS  .039800391
        PE_ONUS  .000610352
USOLA10 PAL      52.7266846
        CFG_ONUS 15.9489746
likewise for the other values
Log file:
--- data details ----
USOLA51
ONUS size
------------------------------ ----------
ICC_ONUS .035400391
PA_ONUS .039800391
PE_ONUS .000610352
=========================================
---- data details ----
USOLA10
ONUS size
------------------------------ ----------
PAL 52.7266846
CFG_ONUS 15.9489746
=========================================
---- data details ----
USOLA55
ONUS size
------------------------------ ----------
PA_ONUS 47.4707031
PAL 3.956604
ICC_ONUS .020385742
PE_ONUS .000610352
=========================================
---- data details ----
USOLA56
ONUS size
------------------------------ ----------
=========================================
what I've tried
unique = Array.new
owner = Array.new
db = Array.new
File.read("mydb_size.log").each_line do |line|
next if line =~ /---- data details ----|^ONUS|---|=======/
unique << line.strip if line =~ /^U.*\d/
end
hash = Hash[unique.collect { |item| [item, ""] } ]
puts hash
Current output
{"USOLA51"=>"", "USOLA10"=>"", "USOLA55"=>"", "USOLA56"=>""}
Any help to move forward would be really appreciated. Thanks!
While your log file isn't CSV, I find the csv library useful in a lot of non-csv parsing. You can use it to parse your log file, by skipping blank lines, and any line starting with ---, ===, or ONUS. Your column separator is a white space character:
require 'csv'
csv = CSV.read("./example.log", skip_lines: /\A(---|===|ONUS)/,
skip_blanks: true, col_sep: " ")
Then, some lines only have 1 element in the array parsed out, those are your header lines. So we can split the csv array into groups based on when we only have 1 element, and create a hash from the result:
output_hash = csv.slice_before { |row| row.length == 1 }.
each_with_object({}) do |((name), *rows), hash|
hash[name] = rows.to_h
end
Now, it's a little hard to tell if you wanted the hash output as the text you showed, or if you just wanted the hash. If you want the text output, we'll first need to see how much room each column needs to be displayed:
name_length = output_hash.keys.max_by(&:length).length
list_length = output_hash.values.flat_map(&:keys).max_by(&:length).length
detail_length = output_hash.values.flat_map(&:values).max_by(&:length).length
format = "%-#{name_length}s %-#{list_length}s %-#{detail_length}s"
and then we can output the header row and all the values in output_hash, but only if they have any values:
puts("#{format}\n\n" % ["NAME", "LIST", "DETAILS"])
output_hash.reject { |name, values| values.empty? }.each do |name, values|
list, detail = values.first
puts(format % [name, list, detail])
values.drop(1).each do |list, detail|
puts(format % ['', list, detail])
end
puts
end
and the result:
NAME    LIST     DETAILS

USOLA51 ICC_ONUS .035400391
        PA_ONUS  .039800391
        PE_ONUS  .000610352

USOLA10 PAL      52.7266846
        CFG_ONUS 15.9489746

USOLA55 PA_ONUS  47.4707031
        PAL      3.956604
        ICC_ONUS .020385742
        PE_ONUS  .000610352
It's a little hard (for me) to explain what slice_before does. It takes an array (or other enumerable) and creates groups or chunks of its elements, where the first element of each group matches the parameter or makes the block return true. For instance, if we had a smaller array:
array = ["slice here", 1, 2, "slice here", 3, 4]
array.slice_before { |el| el == "slice here" }.entries
# => [["slice here", 1, 2], ["slice here", 3, 4]]
We told slice_before, we want each group to begin with the element that equals "slice here", so we have 2 groups returned, the first element in each is "slice here" and the remaining elements are all the elements in the array until the next time it saw "slice here".
So then, we can take that result and we call each_with_object on it, passing an empty hash to start out with. With each_with_object, the first parameter is going to be the element of the array (from each) and the second is going to be the object you passed. What happens when the block parameters look like |((name), *rows), hash| is that first parameter (the element of the array) gets deconstructed into the first element of the array and the remaining elements:
# the array here is what gets passed to `each_with_object` for the first iteration as the first parameter
name, *rows = [["USOLA51"], ["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
name # => ["USOLA51"]
rows # => [["ICC_ONUS", ".035400391"], ["PA_ONUS", ".039800391"], ["PE_ONUS", ".000610352"]]
So then, we deconstruct that first element again, just so we don't have it in an array:
name, * = name # the `, *` isn't needed in the block parameters, but is needed when you run these examples in irb
name # => "USOLA51"
For the max_by(&:length).length, all we're doing is finding the longest element in the array (returned by either keys or values) and getting the length of it:
output_hash = {"USOLA51"=>{"ICC_ONUS"=>".035400391", "PA_ONUS"=>".039800391", "PE_ONUS"=>".000610352"}, "USOLA10"=>{"PAL"=>"52.7266846", "CFG_ONUS"=>"15.9489746"}, "USOLA55"=>{"PA_ONUS"=>"47.4707031", "PAL"=>"3.956604", "ICC_ONUS"=>".020385742", "PE_ONUS"=>".000610352"}, "USOLA56"=>{}}
output_hash.values.flat_map(&:keys)
# => ["ICC_ONUS", "PA_ONUS", "PE_ONUS", "PAL", "CFG_ONUS", "PA_ONUS", "PAL", "ICC_ONUS", "PE_ONUS"]
output_hash.values.flat_map(&:keys).map(&:length) # => [8, 7, 7, 3, 8, 7, 3, 8, 7]
output_hash.values.flat_map(&:keys).max_by(&:length) # => "ICC_ONUS"
output_hash.values.flat_map(&:keys).max_by(&:length).length # => 8
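For comparison, the same grouping can be sketched without the csv library, using only String#split and slice_before. This is a minimal sketch rather than the approach above: the parse_log name and the shortened SAMPLE_LOG are assumptions made for illustration, and it assumes that no NAME line starts with ---, === or is exactly ONUS.

```ruby
# Shortened sample of the log, embedded so the sketch runs on its own.
SAMPLE_LOG = <<~LOG
  ---- data details ----
  USOLA51
  ONUS                           size
  ------------------------------ ----------
  ICC_ONUS                       .035400391
  PA_ONUS                        .039800391
  =========================================
  ---- data details ----
  USOLA56
  ONUS                           size
  ------------------------------ ----------
  =========================================
LOG

# Hypothetical helper: returns { name => { list => details } }.
def parse_log(text)
  # Split every line on whitespace and drop blank, separator and column-header lines.
  rows = text.each_line.map(&:split).reject { |parts|
    parts.empty? || parts.first.match?(/\A(---|===|ONUS\z)/)
  }
  # A row with a single element is a NAME line and starts a new group.
  rows.slice_before { |parts| parts.length == 1 }
      .each_with_object({}) { |((name), *pairs), hash| hash[name] = pairs.to_h }
end

p parse_log(SAMPLE_LOG)
```

Names with no data rows, such as USOLA56 here, end up with an empty hash, just as in the csv-based version.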
It's been a long time since I've worked with Ruby, so I've probably forgotten a lot of the shortcuts and syntactic sugar, but this file seems easy to parse without much effort.
A simple line-by-line comparison of expected values will be enough. First step is to remove all surrounding whitespaces, ignore blank lines, or lines that start with = or -. Next if there is only one value, it is the title, the next line consists of the column names, which can be ignored for your desired output. If either title or column names are encountered, move on to the next line and save the following key/value pairs as ruby key/value pairs. During this operation also check for the longest occurring string and adjust the column padding, so that you can generate the table-like output afterwards with padding.
# Set up the loop
merged = []
current = -1
awaiting_headers = false
columns = ['NAME', 'LIST', 'DETAILS']
# Keep track of the max column length
columns_pad = columns.map { |c| c.length }
str.each_line do |line|
# Remove surrounding whitespaces,
# ignore empty or = - lines
line.strip!
next if line.empty?
next if ['-','='].include? line[0]
# Get the values of this line
parts = line.split ' '
# We're not awaiting the headers and
# there is just one value, must be the title
if not awaiting_headers and parts.size == 1
# If this string is longer than the current maximum
columns_pad[0] = line.length if line.length > columns_pad[0]
# Create a hash for this item
merged[current += 1] = {name: line, data: {}}
# Next must be the headers
awaiting_headers = true
next
end
# Headers encountered
if awaiting_headers
# Just skip it from here
awaiting_headers = false
next
end
# Take 2 parts of each (should be always only those two)
# and treat them as key/value
parts.each_cons(2) do |key, value|
# Make it a ruby key/value pair
merged[current][:data][key] = value
# Check if LIST or DETAILS column length needs to be raised
columns_pad[1] = key.length if key.length > columns_pad[1]
columns_pad[2] = value.length if value.length > columns_pad[2]
end
end
# Adding three spaces between columns
columns_pad.map! { |c| c + 3}
# Writing the headers
result = columns.map.with_index { |c, i| c.ljust(columns_pad[i]) }.join + "\n"
merged.each do |item|
# Remove the next line if you want to include empty data
next if item[:data].empty?
result += "\n"
result += item[:name].ljust(columns_pad[0])
# For the first value in data, we don't need extra padding or a line break
padding = ""
item[:data].each do |key, value|
result += padding
result += key.ljust(columns_pad[1])
result += value.ljust(columns_pad[2])
# Set the padding to include a line break and fill up the NAME column with spaces
padding = "\n" + "".ljust(columns_pad[0])
end
result += "\n"
end
puts result
Which will result in
NAME      LIST       DETAILS

USOLA51   ICC_ONUS   .035400391
          PA_ONUS    .039800391
          PE_ONUS    .000610352

USOLA10   PAL        52.7266846
          CFG_ONUS   15.9489746

USOLA55   PA_ONUS    47.4707031
          PAL        3.956604
          ICC_ONUS   .020385742
          PE_ONUS    .000610352
I've got an array of strings which contains 2 elements: first is a sequence of characters and the second will be a long string of comma-separated words, in alphabetical order that represents dictionary of some arbitrary length. Now I want to check if from the list of words of the second element I can create the word of the first element in the array.
Below example will better illustrate what I mean:
string_array = ["baseball", "a,all,b,ball,bas,base,cat,code,d,e,quit,z"]
Output: base,ball
So what I did is a simple method:
def word_split(string_array)
splitted_arr = string_array[1].split(',')
strArr.include?(splitted_arr)
end
but it only gives me false. How can I compare those strings in the array?
I made a method that takes two arguments: a word and an array of words. You can easily make a wrapper function that operates on the data structure you proposed, if you want.
def find_sequence(target, words)
return [] if target.empty?
words.each do |word|
if target.start_with?(word)
# We found a good candidate for the first word, so let's recurse
# and see if we can find the rest of the words.
remainder = target.sub(word, '') # remove word from start of target
seq = find_sequence(remainder, words)
return [word] + seq if seq
end
end
nil
end
# Example 1: your example
s = ["baseball", "a,all,b,ball,bas,base,cat,code,d,e,quit,z"]
p find_sequence(s[0], s[1].split(",")) # ["bas", "e", "b", "all"]
# Example 2: no solution
p find_sequence("foobar", ["foo", "cat"]) # nil
# Example 3: backtracking
p find_sequence("abcde", ["abcd","abc","ab","a","d","bcde"]) # ["a", "bcde"]
By the way, this is an example of a depth first search.
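The wrapper mentioned above could look something like the following. This is only a sketch: word_split is the name from the question, trying longer words first is my own assumption (it is what makes the search return base,ball rather than bas,e,b,all), and find_sequence is repeated in a compact form so the example runs on its own.

```ruby
# Depth-first search: returns an array of words that concatenate to target,
# or nil if no such sequence exists.
def find_sequence(target, words)
  return [] if target.empty?
  words.each do |word|
    next unless target.start_with?(word)
    rest = find_sequence(target[word.length..-1], words)
    return [word] + rest if rest
  end
  nil
end

# Hypothetical wrapper for the two-element array used in the question.
def word_split(string_array)
  target, dictionary = string_array
  # Assumption: prefer longer words, so "base" is tried before "bas" and "b".
  words = dictionary.split(',').reject(&:empty?).sort_by { |w| -w.length }
  sequence = find_sequence(target, words)
  sequence && sequence.join(',')
end

puts word_split(["baseball", "a,all,b,ball,bas,base,cat,code,d,e,quit,z"])
# prints: base,ball
```

Empty dictionary entries are rejected because an empty word would match every target and recurse forever.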
I want to remove duplicate lines from a text, for example:
1.aabba
2.abaab
3.aabba
4.aabba
After running :
1.aabba
2.abaab
Tried so far :
lines = File.readlines("input.txt")
lines = File.read('/path/to/file')
lines.split("\n").uniq.join("\n")
Let's construct a file.
fname = 't'
IO.write fname, <<~END
dog
cat
dog
pig
cat
END
#=> 20
See IO::write. First let's suppose you simply want to read the unique lines into an array.
If, as here, the file is not excessively large, you can write:
arr = IO.readlines(fname, chomp: true).uniq
#=> ["dog", "cat", "pig"]
See IO::readlines. chomp: true removes the newline character at the end of each line.
If you wish to then write that array to another file:
fname_out = 'tt'
IO.write(fname_out, arr.join("\n") << "\n")
#=> 12
or
File.open(fname_out, 'w') do |f|
arr.each { |line| f.puts line }
end
If you wish to overwrite fname, write to a new file, delete the existing file and then rename the new file fname.
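The overwrite steps just described might be sketched as follows. The dedupe_in_place helper and the ".tmp" suffix are assumptions made for illustration, and the sketch assumes the file is small enough to read into memory.

```ruby
require 'tmpdir'

# Hypothetical helper: rewrite the file at path so each distinct line
# appears once, keeping the first occurrence of every line.
def dedupe_in_place(path)
  tmp_path = "#{path}.tmp" # assumed scratch name next to the original
  File.open(tmp_path, 'w') do |out|
    File.readlines(path, chomp: true).uniq.each { |line| out.puts(line) }
  end
  File.delete(path)           # delete the existing file ...
  File.rename(tmp_path, path) # ... then rename the new file
end

# Demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  path = File.join(dir, 't')
  File.write(path, "dog\ncat\ndog\npig\ncat\n")
  dedupe_in_place(path)
  print File.read(path)
end
```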
If the file is so large it cannot be held in memory and there are many duplicate lines, you might be able to do the following.
require 'set'
st = IO.foreach(fname, chomp: true).with_object(Set.new) do |line, st|
st.add(line)
end
#=> #<Set: {"dog", "cat", "pig"}>
See IO::foreach.
If you wish to simply write the contents of this set to file, you can execute:
File.open(fname_out, 'w') do |f|
st.each { |s| f.puts(s) }
end
If instead you need to convert the set to an array:
st.to_a
#=> ["dog", "cat", "pig"]
This assumes you have enough memory to hold both st and st.to_a. If not, you could write:
st.size.times.with_object([]) do |_,a|
s = st.first
a << s
st.delete(s)
end
#=> ["dog", "cat", "pig"]
If you don't have enough memory to even hold st you will need to read your file (line-by-line) into a database and then use database operations.
If you wish to write the file with the duplicates skipped, and the file is very large, you may do the following, albeit with the infinitesimal risk of including one or more duplicates (see the comments).
require 'set'
line_map = IO.foreach(fname, chomp: true).with_object({}) do |line,h|
hsh = line.hash
h[hsh] = $. unless h.key?(hsh)
end
#=> {3393575068349183629=>1, -4358860729541388342=>2,
# -176447925574512206=>4}
$. is the number (base 1) of the line just read. See String#hash. Since the number of distinct values returned by this method is finite and the number of possible strings is infinite, there is the possibility that two distinct strings could have the same hash value.
Then (assuming line_map is not empty):
lines_to_keep = line_map.values
File.open(fname_out, 'w') do |fout|
IO.foreach(fname, chomp: true) do |line|
if lines_to_keep.first == $.
fout.puts(line)
lines_to_keep.shift
end
end
end
Let's see what we've written:
puts File.read(fname_out)
dog
cat
pig
See File::open.
Incidentally, for IO class methods m (including read, write, readlines and foreach), you may see IO.m... written File.m.... That's permissible because File is a subclass of IO and therefore inherits the latter's methods. That does not apply to my use of File::open, as IO::open is a different method.
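If the collision risk of String#hash is a concern, one variation is to remember a cryptographic digest of each line instead. This is a swapped-in technique, not part of the method above: the write_unique_lines name is hypothetical, and the sketch assumes that one 32-byte SHA-256 digest per distinct line fits in memory.

```ruby
require 'digest'
require 'set'
require 'tmpdir'

# Hypothetical helper: copy src to dst in a single pass, skipping any line
# whose SHA-256 digest has already been seen.
def write_unique_lines(src, dst)
  seen = Set.new
  File.open(dst, 'w') do |out|
    File.foreach(src, chomp: true) do |line|
      # Set#add? returns nil when the digest is already in the set.
      out.puts(line) if seen.add?(Digest::SHA256.digest(line))
    end
  end
end

# Demonstration in a throwaway directory.
Dir.mktmpdir do |dir|
  src = File.join(dir, 't')
  dst = File.join(dir, 'tt')
  File.write(src, "dog\ncat\ndog\npig\ncat\n")
  write_unique_lines(src, dst)
  print File.read(dst)
end
```

Collisions remain possible in principle, since digests are also finite, but the file is read only once and no line numbers need to be tracked.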
Set only stores unique elements, so:
require 'set'
s = Set.new
while line = gets
s << line.strip
end
s.each { |unique_elt| puts unique_elt }
You can run this with any input file using < input.txt on the command-line rather than hardwiring the file name into your program.
Note that Set is based on Hash, and the documentation states "Hashes enumerate their values in the order that the corresponding keys were inserted", so this will preserve the order of entry.
You can continue your idea with uniq.
uniq compares the results of the block and removes duplicates.
For example you have input.txt with this content:
1.aabba
2.abaab
3.aabba
4.aabba
puts File.readlines('input.txt', chomp: true).
uniq { |line| line.sub(/\A\d+\./, '') }.
join("\n")
# will print
# 1.aabba
# 2.abaab
Here String#sub deletes the list numbers, but you can use other methods, for example line[2..-1].
I'm new to Ruby.
Here is the script. I would like to use selector on line 10 instead of fields[0], etc.
How can I do that?
For this example, the data is embedded in the script.
Don't hesitate to correct me if I'm doing something wrong when opening or writing a file, or anything else; I'd like to learn.
#!/usr/bin/ruby
filename = "/tmp/log.csv"
selector = [0, 3, 5, 7]
out = File.open(filename + ".rb.txt", "w")
DATA.each_line do |line|
fields = line.split("|")
columns = fields[0], fields[3], fields[5], fields[7]
puts columns.join("|")
out.puts(columns.join("|"))
end
out.close
__END__
20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK
20180704150931|api|456456546|30130|234|VC3|VC300179201139.ts|172.30.16.138|192271838|VERB|05540000|23404|414|server01|OK
20180704150931|api|465456786|30154|443|BAD|BAD004416550.ts|172.30.16.50|280212202|VERB|04740000|44301|18|server01|OK
20180704150931|api|5437863735|30157|383|VSS|VSS0011062009.ts|172.30.16.66|312727922|VERB|05700000|38303|381|server01|OK
20180704150931|api|3453432|30215|223|VAE|VAE00TF548197.ts|172.30.16.74|114127126|VERB|05060000|22305|35|server01|OK
20180704150931|api|312121|30044|487|BOV|BOVVAE00549424.ts|172.30.16.58|69139448|VERB|05300000|48708|131|server01|OK
20180704150931|rtsp|453432123|30127|203|GZD|GZD0900032066.ts|172.30.16.58|83164150|VERB|05460000|20303|793|server01|OK
20180704150932|api|12345348|30154|465|TYH|TYH0011224259.ts|172.30.16.50|279556843|VERB|04900000|46503|241|server01|OK
20180704150932|api|4343212312|30154|326|VAE|VAE00TF548637.ts|172.30.16.3|28966797|VERB|04740000|32601|969|server01|OK
20180704150932|api|312175665|64530|305|TTT|TTT000000011852.ts|172.30.16.98|47868183|VERB|04740000|30501|275|server01|OK
You can get fields at specific indices using Ruby's splat operator (search for 'splat') and Array#values_at like so:
columns = fields.values_at(*selector)
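As a small self-contained illustration, here is values_at applied to the first data line from the question. The SELECTOR constant name is my own choice for this sketch.

```ruby
SELECTOR = [0, 3, 5, 7].freeze

line = "20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK"

fields = line.split("|")
# The splat turns the array into separate arguments: values_at(0, 3, 5, 7).
columns = fields.values_at(*SELECTOR)

puts columns.join("|")
# prints: 20180704150930|30193|KLM|172.30.16.34
```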
A couple of coding style suggestions:
1. You may want to make selector a constant, since it's unlikely that you'll want to mutate it further down in your code base.
2. The File.open, out.puts and out.close calls can all be condensed into a CSV.open block:
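require 'csv'
CSV.open(filename, 'wb') do |csv|
# Each row appended to csv must be an array of fields
DATA.each_line do |line|
csv << line.chomp.split('|').values_at(*selector)
end
end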
You can also specify a custom delimiter (pipe | in your case) as noted in this answer like so:
...
CSV.open(filename, 'wb', col_sep: '|') do |csv|
...
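Putting the pieces together, a runnable sketch might look like this. The write_selected helper, the COLUMN_SELECTOR constant and the output file name are assumptions made for illustration; the two sample lines are taken from the question.

```ruby
require 'csv'
require 'tmpdir'

COLUMN_SELECTOR = [0, 3, 5, 7].freeze

# Hypothetical helper: write the selected fields of each line to path,
# using | as the column separator, and return what was written.
def write_selected(lines, path, selector)
  CSV.open(path, 'w', col_sep: '|') do |csv|
    lines.each { |line| csv << line.chomp.split('|').values_at(*selector) }
  end
  File.read(path)
end

sample = [
  "20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK",
  "20180704150931|api|456456546|30130|234|VC3|VC300179201139.ts|172.30.16.138|192271838|VERB|05540000|23404|414|server01|OK"
]

Dir.mktmpdir do |dir|
  print write_selected(sample, File.join(dir, 'log.csv.rb.txt'), COLUMN_SELECTOR)
end
```

The block form of CSV.open closes the file when the block finishes, so no explicit close call is needed.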
Let's begin with a more manageable example. First note that if your string is held by the variable data, each line of the string contains the same number (14) of vertical bars ('|'). Let's reduce that to the first 4 lines of data, with each line terminated immediately before the 6th vertical bar:
str = data.each_line.map { |line| line.split("|").first(6).join("|") }.first(4).join("\n")
puts str
20180704150930|rtsp|645645643|30193|211|KLM
20180704150931|api|456456546|30130|234|VC3
20180704150931|api|465456786|30154|443|BAD
20180704150931|api|5437863735|30157|383|VSS
We need to also modify selector (arbitrarily):
selector = [0, 3, 4]
Now on to answering the question.
There is no need to divide the string into lines, split each line on the vertical bars, select the elements of interest from the resulting array, join the latter with a vertical bar and then lastly join the whole shootin' match with a newline (whew!). Instead, simply use String#gsub to remove all unwanted characters from the string.
terms_per_row = str.each_line.first.count('|') + 1
#=> 6
r = /
(?:^|\|) # match the beginning of a line or a vertical bar in a non-capture group
[^|\n]+ # match one or more characters other than a vertical bar or newline
/x # free-spacing regex definition mode
line_idx = -1
new_str = str.gsub(r) do |s|
line_idx += 1
selector.include?(line_idx % terms_per_row) ? s : ''
end
puts new_str
20180704150930|30193|211
20180704150931|30130|234
20180704150931|30154|443
20180704150931|30157|383
Lastly, we write new_str to file:
File.write(fname, new_str)
Below is the input file that I want to store into a hash table, sort it and output in the format shown below.
Input File
Name=Ashok, Email=ashok85#gmail.com, Country=India, Comments=9898984512
Email=raju#hotmail.com, Country=Sri Lanka, Name=Raju
Country=India, Comments=45535878, Email=vijay#gmail.com, Name=Vijay
Name=Ashok, Country=India, Email=ashok37#live.com, Comments=8898788987
Output File (Sorted by Name)
Name      Email                 Country      Comments
-------------------------------------------------------
Ashok     ashok37#live.com      India        8898788987
Ashok     ashok85#gmail.com     India        9898984512
Raju      raju#hotmail.com      Sri Lanka
Vijay     vijay#gmail.com       India        45535878
So far, I have read the data from the file and stored every line into an array, but I am stuck at hash[key]=>value
file_data = {}
File.open('input.txt', 'r') do |file|
file.each_line do |line|
line_data = line.split('=')
file_data[line_data[0]] = line_data[1]
end
end
puts file_data
Given that each line in your input file consists of key=value strings separated by commas, you need to split the line on commas first, and then on the equals sign. Here is a corrected version of the code:
# Need to collect parsed data from each line into an array
array_of_file_data = []
File.open('input.txt', 'r') do |file|
file.each_line do |line|
#create a hash to collect data from each line
file_data = {}
# First split by comma
pairs = line.chomp.split(", ")
pairs.each do |p|
#Split by = to separate out key and value
key_value = p.split('=')
file_data[key_value[0]] = key_value[1]
end
array_of_file_data << file_data
end
end
puts array_of_file_data
Above code will print:
{"Name"=>"Ashok", "Email"=>"ashok85#gmail.com", "Country"=>"India", "Comments"=>"9898984512"}
{"Email"=>"raju#hotmail.com", "Country"=>"Sri Lanka", "Name"=>"Raju"}
{"Country"=>"India", "Comments"=>"45535878", "Email"=>"vijay#gmail.com", "Name"=>"Vijay"}
{"Name"=>"Ashok", "Country"=>"India", "Email"=>"ashok37#live.com", "Comments"=>"8898788987"}
A more complete version of the program is given below.
hash_array = []
# Parse the lines and store it in hash array
File.open("sample.txt", "r") do |f|
f.each_line do |line|
# Splits are done around , and = preceded or followed
# by any number of white spaces
splits = line.chomp.split(/\s*,\s*/).map{|p| p.split(/\s*=\s*/)}
# to_h converts an array of [key, value] pairs
# into a hash
hash_array << splits.to_h
end
end
# Sort the array of hashes
hash_array = hash_array.sort {|i, j| i["Name"] <=> j["Name"]}
# Print the output, more tricks needed to get it better formatted
header = ["Name", "Email", "Country", "Comments"]
puts header.join(" ")
hash_array.each do |h|
puts h.values_at(*header).join(" ")
end
Above program outputs:
Name Email Country Comments
Ashok ashok85#gmail.com India 9898984512
Ashok ashok37#live.com India 8898788987
Raju raju#hotmail.com Sri Lanka
Vijay vijay#gmail.com India 45535878
You may want to refer to Padding printed output of tabular data to have better formatted tabular output
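One way to get the padded layout is to compute each column's width first and then left-justify every cell. The following is a sketch only: the format_table helper, the two-space gap and the secondary sort on Email (used so the two Ashok rows come out in a fixed order) are assumptions made for illustration.

```ruby
HEADER = %w[Name Email Country Comments].freeze

# Hypothetical helper: returns the table as an array of padded lines.
def format_table(rows, header = HEADER)
  # Width of each column: the longest of the header and every cell under it.
  widths = header.map do |column|
    ([column] + rows.map { |row| row[column].to_s }).map(&:length).max
  end
  format_row = lambda do |cells|
    cells.each_with_index.map { |cell, i| cell.to_s.ljust(widths[i]) }.join('  ').rstrip
  end
  # Sort by Name, then by Email to break ties.
  sorted = rows.sort_by { |row| [row['Name'].to_s, row['Email'].to_s] }
  separator = '-' * (widths.sum + 2 * (header.size - 1))
  [format_row.call(header), separator] +
    sorted.map { |row| format_row.call(row.values_at(*header)) }
end

rows = [
  { "Name" => "Ashok", "Email" => "ashok85#gmail.com", "Country" => "India", "Comments" => "9898984512" },
  { "Email" => "raju#hotmail.com", "Country" => "Sri Lanka", "Name" => "Raju" },
  { "Country" => "India", "Comments" => "45535878", "Email" => "vijay#gmail.com", "Name" => "Vijay" },
  { "Name" => "Ashok", "Country" => "India", "Email" => "ashok37#live.com", "Comments" => "8898788987" }
]

puts format_table(rows)
```

Missing values, such as Raju's Comments, are rendered as empty cells because nil.to_s is an empty string.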
I'd like to count the number of times a set of words appear in each paragraph in a text file. I am able to count the number of times a set of words appears in an entire text.
It has been suggested to me that my code is really buggy, so I'll just ask what I would like to do, and if you want, you can look at the code I have at the bottom.
So, given that "frequency_count.txt" has the words "apple pear grape melon kiwi" in it, I want to know how often "apple" shows up in each paragraph of a separate file "test_essay.txt", how often pear shows up, etc., and then for these numbers to be printed out in a series of lines of numbers, each corresponding to a paragraph.
For instance:
apple, pear, grape, melon, kiwi
3,5,2,7,8
2,3,1,6,7
5,6,8,2,3
Where each line corresponds to one of the paragraphs.
I am very, very new to Ruby, so thank you for your patience.
output_file = '/Users/yirenlu/Quora-Personal-Analytics/weka_input6.csv'
o = File.open(output_file, "r+")
common_words = '/Users/yirenlu/Quora-Personal-Analytics/frequency_count.txt'
c = File.open(common_words, "r")
c.each_line{|$line1|
words1 = $line1.split
words1.each{|w1|
the_file = '/Users/yirenlu/Quora-Personal-Analytics/test_essay.txt'
f = File.open(the_file, "r")
rows = File.readlines("/Users/yirenlu/Quora-Personal-Analytics/test_essay.txt")
text = rows.join
paragraph = text.split(/\n\n/)
paragraph.each{|p|
h = Hash.new
puts "this is each paragraph"
p.each_line{|line|
puts "this is each line"
words = line.split
words.each{|w|
if w1 == w
if h.has_key?(w)
h[w1] = h[w1] + 1
else
h[w1] = 1
end
$x = h[w1]
end
}
}
o.print "#{$x},"
}
}
o.print "\n"
o.print "#{$line1}"
}
If you're used to PHP or Perl you may be under the impression that a variable like $line1 is local, but in Ruby it is a global. Use of globals is highly discouraged, and the number of cases where they are strictly required is very small. In most cases you can just omit the $ and use a properly scoped local variable.
This example also suffers from nearly unreadable indentation, though perhaps that was an artifact of the cut-and-paste procedure.
Generally what you want for counters is to create a hash with a default of zero, then add to that as required:
# Create a hash where the default values for each key is 0
counter = Hash.new(0)
# Add to the counters where required
counter['foo'] += 1
counter['bar'] += 2
puts counter['foo']
# => 1
puts counter['baz']
# => 0
You basically have what you need, but everything is all muddled and just needs to be organized better.
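As a rough sketch of how that reorganization could look, here is the counter idea applied once per paragraph. The paragraph_counts name and the sample text are assumptions made for illustration, and the comparison is case-insensitive by my own choice.

```ruby
# Hypothetical helper: for each paragraph (separated by blank lines),
# return how often each of the given words occurs.
def paragraph_counts(text, words)
  text.split(/\n{2,}/).map do |paragraph|
    counter = Hash.new(0) # every missing key counts as 0
    paragraph.scan(/\w+/) { |word| counter[word.downcase] += 1 }
    words.map { |word| counter[word] }
  end
end

words = %w[apple pear grape melon kiwi]
essay = "apple pear apple\nkiwi\n\npear Pear melon\n"

puts words.join(", ")
paragraph_counts(essay, words).each { |row| puts row.join(",") }
# prints:
# apple, pear, grape, melon, kiwi
# 2,1,0,0,1
# 0,2,0,1,0
```

Only local variables are used, and each paragraph gets a fresh counter, so counts never leak from one paragraph into the next.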
Here are two one-liners to calculate frequencies of words in a string.
The first one is a bit easier to understand, but it's less efficient:
txt.scan(/\w+/).group_by{|word| word.downcase}.map{|k,v| [k, v.size]}
# => [['word1', 1], ['word2', 5], ...]
The second solution is:
txt.scan(/\w+/).inject(Hash.new(0)) { |hash, w| hash[w.downcase] += 1; hash}
# => {'word1' => 1, 'word2' => 5, ...}
This could be shorter and easier to read if you use:
1. The CSV library.
2. A more functional approach using map and blocks.
require 'csv'
common_words = %w(apple pear grape melon kiwi)
text = File.read("test_essay.txt")
def word_frequency(words, text)
words.map { |word| text.scan(/\b#{word}\b/).length }
end
CSV.open("file.csv", "wb") do |csv|
paragraphs = text.split(/\n\n/)
paragraphs.each do |para|
csv << word_frequency(common_words, para)
end
end
Note this is currently case-sensitive but it's a minor adjustment if you want case-insensitivity.
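The adjustment could be as small as adding the i flag to the regular expression. In this sketch the Regexp.escape call is an extra precaution of my own, so that words containing regex metacharacters are matched literally.

```ruby
# Case-insensitive variant of word_frequency.
def word_frequency(words, text)
  words.map { |word| text.scan(/\b#{Regexp.escape(word)}\b/i).length }
end

p word_frequency(%w[apple pear], "Apple apple pineapple PEAR")
# prints: [2, 1]
```

The \b anchors still keep "pineapple" from counting as "apple".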
Here's an alternate answer, which has been tweaked for conciseness (though it's not as easy to read as my other answer).
require 'csv'
words = %w(apple pear grape melon kiwi)
text = File.read("test_essay.txt")
CSV.open("file.csv", "wb") do |csv|
text.split(/\n\n/).map {|p| csv << words.map {|w| p.scan(/\b#{w}\b/).length}}
end
I prefer the slightly longer but more self-documenting code, but it's fun to see how small it can get.
What about this:
# Create an array of regexes to be used in `scan' in the loop.
# `\b' makes sure that `barfoobar' does not match `bar' or `foo'.
p word_list = File.open("frequency_count.txt"){|io| io.read.scan(/\w+/)}.map{|w| /\b#{w}\b/}
File.open("test_essay.txt") do |io|
loop do
# Add lines to `paragraph' as long as there is a continuous line
paragraph = ""
# A `l.chomp.empty?' becomes true at paragraph border
while l = io.gets and !l.chomp.empty?
paragraph << l
end
p word_list.map{|re| paragraph.scan(re).length}
# The end of file has been reached when `l == nil'
break unless l
end
end
To count how many times one word appears in a text:
text = "word aaa word word word bbb ccc ccc"
text.scan(/\w+/).count("word") # => 4
To count a set of words:
text = "word aaa word word word bbb ccc ccc"
wlist = text.scan(/\w+/)
wset = ["word", "ccc"]
result = {}
wset.each {|word| result[word] = wlist.count(word) }
result # => {"word" => 4, "ccc" => 2}
result["ccc"] # => 2
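On Ruby 2.7 or later, Enumerable#tally can do the counting in one pass over the word list. This is a sketch under that version assumption; the extra "zzz" entry is only there to show how a word that never occurs is handled.

```ruby
text = "word aaa word word word bbb ccc ccc"
wset = ["word", "ccc", "zzz"]

# tally builds a hash of word => number of occurrences.
counts = text.scan(/\w+/).tally
# Look up each wanted word, defaulting to 0 when it never occurs.
result = wset.to_h { |word| [word, counts.fetch(word, 0)] }

puts result["word"] # prints 4
puts result["ccc"]  # prints 2
puts result["zzz"]  # prints 0
```

Compared with calling count once per word, this scans the word list a single time no matter how many words are in wset.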