I want to remove duplicate lines from a text file, for example:
1.aabba
2.abaab
3.aabba
4.aabba
After running:
1.aabba
2.abaab
Tried so far:
lines = File.readlines("input.txt")
lines = File.read('/path/to/file')
lines.split("\n").uniq.join("\n")
Let's construct a file.
fname = 't'
IO.write fname, <<~END
dog
cat
dog
pig
cat
END
#=> 20
See IO::write. First let's suppose you simply want to read the unique lines into an array.
If, as here, the file is not excessively large, you can write:
arr = IO.readlines(fname, chomp: true).uniq
#=> ["dog", "cat", "pig"]
See IO::readlines. chomp: true removes the newline character at the end of each line.
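To see the effect, here is the same file read with and without the option (using the file built above):
IO.readlines(fname)               #=> ["dog\n", "cat\n", "dog\n", "pig\n", "cat\n"]
IO.readlines(fname, chomp: true)  #=> ["dog", "cat", "dog", "pig", "cat"]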
If you wish to then write that array to another file:
fname_out = 'tt'
IO.write(fname_out, arr.join("\n") << "\n")
#=> 12
or
File.open(fname_out, 'w') do |f|
  arr.each { |line| f.puts line }
end
If you wish to overwrite fname, write to a new file, delete the existing file, and then rename the new file to fname.
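A minimal sketch of that overwrite-by-rename procedure (the temporary file name 'tmp' is an arbitrary choice here):
tmp = fname + '.tmp'
File.open(tmp, 'w') { |f| arr.each { |line| f.puts line } }  # write the new file
File.delete(fname)                                           # delete the original
File.rename(tmp, fname)                                      # rename the new file to fname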
If the file is so large it cannot be held in memory and there are many duplicate lines, you might be able to do the following.
require 'set'
st = IO.foreach(fname, chomp: true).with_object(Set.new) do |line, st|
  st.add(line)
end
#=> #<Set: {"dog", "cat", "pig"}>
See IO::foreach.
If you wish to simply write the contents of this set to file, you can execute:
File.open(fname_out, 'w') do |f|
  st.each { |s| f.puts(s) }
end
If instead you need to convert the set to an array:
st.to_a
#=> ["dog", "cat", "pig"]
This assumes you have enough memory to hold both st and st.to_a. If not, you could write:
st.size.times.with_object([]) do |_,a|
  s = st.first
  a << s
  st.delete(s)
end
#=> ["dog", "cat", "pig"]
If you don't have enough memory to even hold st you will need to read your file (line-by-line) into a database and then use database operations.
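For example, one possible sketch using SQLite (this assumes the sqlite3 gem; the database and table names are arbitrary, and the PRIMARY KEY constraint makes the database discard duplicates for us):
require 'sqlite3'

db = SQLite3::Database.new('dedup.db')
db.execute('CREATE TABLE IF NOT EXISTS lines (line TEXT PRIMARY KEY)')
IO.foreach(fname, chomp: true) do |line|
  db.execute('INSERT OR IGNORE INTO lines (line) VALUES (?)', [line])  # duplicates are ignored
end
db.execute('SELECT line FROM lines ORDER BY rowid') { |row| puts row[0] }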
If you wish to write the file with the duplicates skipped, and the file is very large, you may do the following, albeit with the infinitesimal risk of including one or more duplicates (see the comments).
require 'set'
line_map = IO.foreach(fname, chomp: true).with_object({}) do |line,h|
  hsh = line.hash
  h[hsh] = $. unless h.key?(hsh)
end
#=> {3393575068349183629=>1, -4358860729541388342=>2,
# -176447925574512206=>4}
$. is the number (base 1) of the line just read. See String#hash. Since the number of distinct values returned by this method is finite and the number of possible strings is infinite, there is the possibility that two distinct strings could have the same hash value.
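For illustration (the integers vary from one Ruby process to the next because the hash is seeded randomly):
"dog".hash                #=> an integer such as 3393575068349183629
"dog".hash == "cat".hash  #=> false (almost certainly, but not guaranteed)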
Then (assuming line_map is not empty):
lines_to_keep = line_map.values
File.open(fname_out, 'w') do |fout|
  IO.foreach(fname, chomp: true) do |line|
    if lines_to_keep.first == $.
      fout.puts(line)
      lines_to_keep.shift
    end
  end
end
Let's see what we've written:
puts File.read(fname_out)
dog
cat
pig
See File::open.
Incidentally, for IO class methods m (including read, write, readlines and foreach), you may see IO.m... written File.m.... That's permissible because File is a subclass of IO and therefore inherits the latter's methods. That does not apply to my use of File::open, as IO::open is a different method.
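For example, these two calls are equivalent:
IO.readlines(fname, chomp: true)    #=> ["dog", "cat", "dog", "pig", "cat"]
File.readlines(fname, chomp: true)  #=> ["dog", "cat", "dog", "pig", "cat"]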
Set only stores unique elements, so:
require 'set'
s = Set.new
while line = gets
  s << line.strip
end
s.each { |unique_elt| puts unique_elt }
You can run this with any input file using < input.txt on the command-line rather than hardwiring the file name into your program.
Note that Set is based on Hash, and the documentation states "Hashes enumerate their values in the order that the corresponding keys were inserted", so this will preserve the order of entry.
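A quick check that insertion order is preserved, using simplified lines:
require 'set'
Set.new(%w[aabba abaab aabba aabba]).to_a  #=> ["aabba", "abaab"]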
You can continue your idea with uniq.
uniq compares the result of the block and deletes duplicates.
For example you have input.txt with this content:
1.aabba
2.abaab
3.aabba
4.aabba
puts File.readlines('input.txt', chomp: true).
  uniq { |line| line.sub(/\A\d+\./, '') }.
  join("\n")
# will print
# 1.aabba
# 2.abaab
Here String#sub deletes the list numbers, but you could use other methods, for example line[2..-1].
I'm new to Ruby.
Here is the script. I would like to use the selector on line 10 instead of fields[0] etc...
How can I do that?
For the example, the data are embedded.
Don't hesitate to correct me if I'm doing something wrong when opening or writing a file, or anything else; I like to learn.
#!/usr/bin/ruby
filename = "/tmp/log.csv"
selector = [0, 3, 5, 7]
out = File.open(filename + ".rb.txt", "w")
DATA.each_line do |line|
  fields = line.split("|")
  columns = fields[0], fields[3], fields[5], fields[7]
  puts columns.join("|")
  out.puts(columns.join("|"))
end
out.close
__END__
20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK
20180704150931|api|456456546|30130|234|VC3|VC300179201139.ts|172.30.16.138|192271838|VERB|05540000|23404|414|server01|OK
20180704150931|api|465456786|30154|443|BAD|BAD004416550.ts|172.30.16.50|280212202|VERB|04740000|44301|18|server01|OK
20180704150931|api|5437863735|30157|383|VSS|VSS0011062009.ts|172.30.16.66|312727922|VERB|05700000|38303|381|server01|OK
20180704150931|api|3453432|30215|223|VAE|VAE00TF548197.ts|172.30.16.74|114127126|VERB|05060000|22305|35|server01|OK
20180704150931|api|312121|30044|487|BOV|BOVVAE00549424.ts|172.30.16.58|69139448|VERB|05300000|48708|131|server01|OK
20180704150931|rtsp|453432123|30127|203|GZD|GZD0900032066.ts|172.30.16.58|83164150|VERB|05460000|20303|793|server01|OK
20180704150932|api|12345348|30154|465|TYH|TYH0011224259.ts|172.30.16.50|279556843|VERB|04900000|46503|241|server01|OK
20180704150932|api|4343212312|30154|326|VAE|VAE00TF548637.ts|172.30.16.3|28966797|VERB|04740000|32601|969|server01|OK
20180704150932|api|312175665|64530|305|TTT|TTT000000011852.ts|172.30.16.98|47868183|VERB|04740000|30501|275|server01|OK
You can get fields at specific indices using Ruby's splat operator (search for 'splat') and Array#values_at, like so:
columns = fields.values_at(*selector)
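A quick check against the first DATA line, with selector = [0, 3, 5, 7]:
line = "20180704150930|rtsp|645645643|30193|211|KLM|KLM00SD624817.ts|172.30.16.34|127299264|VERB|01780000|21103|277|server01|OK"
line.split("|").values_at(*[0, 3, 5, 7])
#=> ["20180704150930", "30193", "KLM", "172.30.16.34"]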
A couple of coding style suggestions:
1. You may want to make selector a constant, since it's unlikely that you'll want to mutate it further down in your code base.
2. The out, out.puts and out.close steps, together with the loop over DATA, can all be condensed into a CSV.open block:
require 'csv'

CSV.open(filename, 'wb') do |csv|
  DATA.each_line do |line|
    csv << line.split("|").values_at(*selector)
  end
end
You can also specify a custom delimiter (pipe | in your case) as noted in this answer like so:
...
CSV.open(filename, 'wb', col_sep: '|') do |csv|
...
Let's begin with a more manageable example. First note that, if your string is held by the variable data, each line of the string contains the same number (14) of vertical bars ('|'). Let's reduce that to the first 4 lines of data, with each line truncated immediately before the 6th vertical bar:
str = data.each_line.map { |line| line.split("|").first(6).join("|") }.first(4).join("\n")
puts str
20180704150930|rtsp|645645643|30193|211|KLM
20180704150931|api|456456546|30130|234|VC3
20180704150931|api|465456786|30154|443|BAD
20180704150931|api|5437863735|30157|383|VSS
We also need to modify selector (arbitrarily):
selector = [0, 3, 4]
Now on to answering the question.
There is no need to divide the string into lines, split each line on the vertical bars, select the elements of interest from the resulting array, join the latter with a vertical bar and then lastly join the whole shootin' match with a newline (whew!). Instead, simply use String#gsub to remove all unwanted characters from the string.
terms_per_row = str.each_line.first.count('|') + 1
#=> 6
r = /
  (?:^|\|) # match the beginning of a line or a vertical bar in a non-capture group
  [^|\n]+  # match one or more characters other than a vertical bar or newline
/x         # free-spacing regex definition mode
line_idx = -1
new_str = str.gsub(r) do |s|
  line_idx += 1
  selector.include?(line_idx % terms_per_row) ? s : ''
end
puts new_str
20180704150930|30193|211
20180704150931|30130|234
20180704150931|30154|443
20180704150931|30157|383
Lastly, we write new_str to file:
File.write(fname, new_str)
Say that we want to count the number of words in a document. I know we can do the following:
text.each_line(){ |line| totalWords = totalWords + line.split.size }
Say that I just want to add some exceptions, such that I don't count the following as words:
(1) numbers
(2) standalone letters
(3) email addresses
How can we do that?
Thanks.
You can wrap this up pretty neatly:
text.each_line do |line|
  total_words += line.split.reject do |word|
    word.match(/\A(\d+|\w|\S*@\S+\.\S+)\z/)
  end.length
end
Roughly speaking that defines an approximate email address.
Remember Ruby strongly encourages the use of variables with names like total_words and not totalWords.
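A quick check with a made-up line (the sentence below is just an example, not from the original question):
line = "Contact me at jay@example.com or call 555 1234 x"
line.split.reject { |word| word.match(/\A(\d+|\w|\S*@\S+\.\S+)\z/) }
#=> ["Contact", "me", "at", "or", "call"]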
Assuming you can represent all the exceptions in a single regular expression regex_variable, you could do:
text.each_line { |line| totalWords = totalWords + line.split.count { |wrd| wrd !~ regex_variable } }
Your regular expression could look something like:
regex_variable = /\d.|^[a-z]{1}$|\A([^@\s]+)@((?:[-a-z0-9]+\.)+[a-z]{2,})\Z/i
I don't claim to be a regex expert, so you may want to double check that, particularly the email validation part
In addition to the other answers, a little gem hunting came up with this:
WordsCounted Gem
Get the following data from any string or readable file:
Word count
Unique word count
Word density
Character count
Average characters per word
A hash map of words and the number of times they occur
A hash map of words and their lengths
The longest word(s) and its length
The most occurring word(s) and its number of occurrences.
Count individual strings for occurrences.
A flexible way to exclude words (or anything) from the count. You can pass a string, a regexp, an array, or a lambda.
Customisable criteria. Pass your own regexp rules to split strings if you prefer. The default regexp has two features:
Filters special characters but respects hyphens and apostrophes.
Plays nicely with diacritics (UTF and unicode characters): "São Paulo" is treated as ["São", "Paulo"] and not ["S", "", "o", "Paulo"].
Opens and reads files. Pass in a file path or a url instead of a string.
Have you ever started answering a question and found yourself wandering, exploring interesting, but tangential issues, or concepts you didn't fully understand? That's what happened to me here. Perhaps some of the ideas might prove useful in other settings, if not for the problem at hand.
For readability, we might define some helpers in the class String, but to avoid contamination, I'll use Refinements.
Code
module StringHelpers
  refine String do
    def count_words
      remove_punctuation.split.count { |w|
        !(w.is_number? || w.size == 1 || w.is_email_address?) }
    end

    def remove_punctuation
      gsub(/[.!?,;:)](?:\s|$)|(?:^|\s)\(|\-|\n/, ' ')
    end

    def is_number?
      self =~ /\A-?\d+(?:\.\d+)?\z/
    end

    def is_email_address?
      include?('@') # for testing only
    end
  end
end
module CountWords
  using StringHelpers

  def self.count_words_in_file(fname)
    IO.foreach(fname).reduce(0) { |t,l| t + l.count_words }
  end
end
Note that using must be in a module (possibly a class). It does not work in main, presumably because that would make the methods available in self.class #=> Object, which would defeat the purpose of Refinements. (Readers: please correct me if I'm wrong about the reason using must be in a module.)
Example
Let's first informally check that the helpers are working correctly:
module CheckHelpers
  using StringHelpers

  s = "You can reach my dog, a 10-year-old golden, at fido@dogs.org."
  p s = s.remove_punctuation
  #=> "You can reach my dog a 10 year old golden at fido@dogs.org."
  p words = s.split
  #=> ["You", "can", "reach", "my", "dog", "a", "10",
  #    "year", "old", "golden", "at", "fido@dogs.org."]
  p '123'.is_number?  #=> 0
  p '-123'.is_number? #=> 0
  p '1.23'.is_number? #=> 0
  p '123.'.is_number? #=> nil
  p "fido@dogs.org".is_email_address?    #=> true
  p "fido(at)dogs.org".is_email_address? #=> false
  p s.count_words #=> 9 ('a', '10' and "fido@dogs.org" excluded)

  s = "My cat, who has 4 lives remaining, is at abbie(at)felines.org."
  p s = s.remove_punctuation
  p s.count_words
end
All looks OK. Next, I'll put some text in a file:
FName = "pets"
text =<<_
My cat, who has 4 lives remaining, is at abbie(at)felines.org.
You can reach my dog, a 10-year-old golden, at fido@dogs.org.
_
File.write(FName, text)
#=> 125
and confirm the file contents:
File.read(FName)
#=> "My cat, who has 4 lives remaining, is at abbie(at)felines.org.\n
# You can reach my dog, a 10-year-old golden, at fido#dogs.org.\n"
Now, count the words:
CountWords.count_words_in_file(FName)
#=> 18 (9 in each line)
Note that there is at least one problem with the removal of punctuation. It has to do with the hyphen. Any idea what that might be?
Something like...?
def is_countable(word)
  return false if word.size < 2
  return false if word =~ /^[0-9]+$/
  return false if is_an_email_address(word) # you need a gem for this...
  return true
end

wordCount = text.split.inject(0) { |count, word| is_countable(word) ? count + 1 : count }
Or, since I may be jumping to conclusions in assuming you can just split your entire text into an array with split(), you might instead need:
wordCount = 0
text.each_line do |line|
  line.split.each { |word| wordCount += 1 if is_countable(word) }
end
I'd like to search through a txt file for a particular word. If I find that word, I'd like to retrieve the word that immediately follows it in the file. If my text file contained:
"My name is Jay and I want to go to the store"
I'd be searching for the word "want", and would want to add the word "to" to my array. I'll be looking through a very big text file, so any notes on performance would be great too.
The most literal way to read that might look like this:
a = []
str = "My name is Jack and I want to go to the store"
str.scan(/\w+/).each_cons(2) {|x, y| a << y if x == 'to'}
a
#=> ["go", "the"]
To read the file into a string use File.read.
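Putting the two together, a sketch (the file name is a placeholder):
a = []
File.read("input.txt").scan(/\w+/).each_cons(2) { |x, y| a << y if x == 'want' }
a  #=> the words that immediately follow "want"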
This is one way:
Code
def find_next(fname, word)
  enum = IO.foreach(fname)
  loop do
    e = (enum.next).scan(/\w+/)
    ndx = e.index(word)
    if ndx
      return e[ndx+1] if ndx < e.size-1
      loop do
        e = enum.next
        break if e =~ /\w+/
      end
      return e[/\w+/]
    end
  end
  nil
end
Example
text =<<_
It was the best of times, it was the worst of times,
it was the age of wisdom, it was the age of foolishness,
. . . . .
it was the epoch of belief, it was the epoch of incredulity,
it was the season of light, it was the season of darkness,
it was the spring of hope, it was the winter of despair…
_
FName = "two_cities"
File.write(FName, text)
find_next(FName, "worst")
# of
find_next(FName, "wisdom")
# it
find_next(FName, "foolishness")
# it
find_next(FName, "dispair")
#=> nil
find_next(FName, "magpie")
#=> nil
Shorter, but less efficient, and problematic with large files:
File.read(FName)[/(?<=\b#{word}\b)\W+(\w+)/,1]
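For example, with the file written above:
word = "worst"
File.read(FName)[/(?<=\b#{word}\b)\W+(\w+)/, 1]  #=> "of"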
This is probably not the fastest way to do it, but something along these lines should work:
filename = "/path/to/filename"
target_word = "weasel"
next_word = ""
File.open(filename).each_line do |line|
  line.split.each_with_index do |word, index|
    if word == target_word
      next_word = line.split[index + 1]
    end
  end
end
Given a File, String, or StringIO stored in file:
pattern, match = 'want', nil
catch :found do
  file.each_line do |line|
    line.split.each_cons(2) do |words|
      if words[0] == pattern
        match = words.pop
        throw :found
      end
    end
  end
end
match
#=> "to"
Note that this answer will find at most one match per file for speed, and linewise operation will save memory. If you want to find multiple matches per file, or find matches across line breaks, then this other answer is probably the way to go. YMMV.
This is the fastest I could come up with, assuming your file is loaded in a string:
word = 'want'
array = []
string.scan(/\b#{word}\b\s(\w+)/) do
  array << $1
end
This will find ALL words that follow your particular word. So for example:
word = 'want'
string = 'My name is Jay and I want to go and I want a candy'
array = []
string.scan(/\b#{word}\b\s(\w+)/) do
  array << $1
end
p array #=> ["to", "a"]
Testing this on my machine where I duplicated this string 500,000 times, I was able to reach 0.6 seconds execution time. I've also tried other approaches like splitting the string etc. but this was the fastest solution:
require 'benchmark'
Benchmark.bm do |bm|
  bm.report do
    word = 'want'
    string = 'My name is Jay and I want to go and I want a candy' * 500_000
    array = []
    string.scan(/\b#{word}\b\s(\w+)/) do
      array << $1
    end
  end
end
Okay, so I'm building something that takes a text file and breaks it up into multiple sections that are further divided into entries, and then puts <a> tags around part of each entry. I have an instance variable, @section_name, that I need to use in making the link. The problem is, @section_name seems to lose its value if I look at it wrong. Some code:
def find_entries
  @sections.each do |section|
    @entries = section.to_s.shatter(/(some) RegEx/)
    @section_name = $1.to_s
    puts @section_name
    add_links
  end
end

def add_links
  puts "looking for #{@section_name} in #{@section_hash}"
  section_link = @section_hash.fetch(@section_name)
end
If I comment out the call to add_links, it spits out the names of all the sections, but if I include it, I just get:
looking for in {"contents" => "of", "the" => "hash"}
Any help is much appreciated!
$1 is a global variable which can be used in later code. $n contains the n-th (...) capture of the last match:
"foobar".sub(/foo(.*)/, '\1\1')
puts "The matching word was #{$1}" #=> The matching word was bar
"123 456 789" =~ /(\d\d)(\d)/
p [$1, $2] #=> ["12", "3"]
So I think the @entries = section.to_s.shatter(/(some) RegEx/) line is not matching properly; thus your first matched group contains nothing, so $1 is nil.
I'd like to count the number of times a set of words appear in each paragraph in a text file. I am able to count the number of times a set of words appears in an entire text.
It has been suggested to me that my code is really buggy, so I'll just ask what I would like to do, and if you want, you can look at the code I have at the bottom.
So, given that "frequency_count.txt" has the words "apple pear grape melon kiwi" in it, I want to know how often "apple" shows up in each paragraph of a separate file "test_essay.txt", how often pear shows up, etc., and then for these numbers to be printed out in a series of lines of numbers, each corresponding to a paragraph.
For instance:
apple, pear, grape, melon, kiwi
3,5,2,7,8
2,3,1,6,7
5,6,8,2,3
Where each line corresponds to one of the paragraphs.
I am very, very new to Ruby, so thank you for your patience.
output_file = '/Users/yirenlu/Quora-Personal-Analytics/weka_input6.csv'
o = File.open(output_file, "r+")
common_words = '/Users/yirenlu/Quora-Personal-Analytics/frequency_count.txt'
c = File.open(common_words, "r")
c.each_line{|$line1|
words1 = $line1.split
words1.each{|w1|
the_file = '/Users/yirenlu/Quora-Personal-Analytics/test_essay.txt'
f = File.open(the_file, "r")
rows = File.readlines("/Users/yirenlu/Quora-Personal-Analytics/test_essay.txt")
text = rows.join
paragraph = text.split(/\n\n/)
paragraph.each{|p|
h = Hash.new
puts "this is each paragraph"
p.each_line{|line|
puts "this is each line"
words = line.split
words.each{|w|
if w1 == w
if h.has_key?(w)
h[w1] = h[w1] + 1
else
h[w1] = 1
end
$x = h[w1]
end
}
}
o.print "#{$x},"
}
}
o.print "\n"
o.print "#{$line1}"
}
If you're used to PHP or Perl, you may be under the impression that a variable like $line1 is local, but it is a global. Use of globals is highly discouraged, and the number of cases where they are strictly required is very small. In most cases you can simply omit the $ and use properly scoped variables instead.
This example also suffers from nearly unreadable indentation, though perhaps that was an artifact of the cut-and-paste procedure.
Generally what you want for counters is to create a hash with a default of zero, then add to that as required:
# Create a hash where the default values for each key is 0
counter = Hash.new(0)
# Add to the counters where required
counter['foo'] += 1
counter['bar'] += 2
puts counter['foo']
# => 1
puts counter['baz']
# => 0
You basically have what you need, but everything is all muddled and just needs to be organized better.
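To make that concrete, here is one way the pieces might fit together, assuming (as in your code) that the word list is fixed and paragraphs are separated by blank lines; the file names are placeholders:
words = %w[apple pear grape melon kiwi]
File.read("test_essay.txt").split(/\n\n/).each do |para|
  counter = Hash.new(0)                                              # zero-default counter
  para.scan(/\w+/).each { |w| counter[w] += 1 if words.include?(w) }
  puts words.map { |w| counter[w] }.join(",")                        # e.g. "3,5,2,7,8"
end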
Here are two one-liners to calculate frequencies of words in a string.
The first one is a bit easier to understand, but it's less efficient:
txt.scan(/\w+/).group_by{|word| word.downcase}.map{|k,v| [k, v.size]}
# => [['word1', 1], ['word2', 5], ...]
The second solution is:
txt.scan(/\w+/).inject(Hash.new(0)) { |hash, w| hash[w.downcase] += 1; hash}
# => {'word1' => 1, 'word2' => 5, ...}
This could be shorter and easier to read if you use:
The CSV library.
A more functional approach using map and blocks.
require 'csv'
common_words = %w(apple pear grape melon kiwi)
text = File.open("test_essay.txt").read
def word_frequency(words, text)
  words.map { |word| text.scan(/\b#{word}\b/).length }
end

CSV.open("file.csv", "wb") do |csv|
  paragraphs = text.split(/\n\n/)
  paragraphs.each do |para|
    csv << word_frequency(common_words, para)
  end
end
Note this is currently case-sensitive but it's a minor adjustment if you want case-insensitivity.
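For instance, one minimal way to get case-insensitive counts is to add the /i flag to the regex (a sketch of the adjustment, not part of the answer above):
def word_frequency(words, text)
  words.map { |word| text.scan(/\b#{word}\b/i).length }  # /i ignores case
end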
Here's an alternate answer, which has been tweaked for conciseness (though it's not as easy to read as my other answer).
require 'csv'
words = %w(apple pear grape melon kiwi)
text = File.open("test_essay.txt").read
CSV.open("file.csv", "wb") do |csv|
text.split(/\n\n/).map {|p| csv << words.map {|w| p.scan(/\b#{w}\b/).length}}
end
I prefer the slightly longer but more self-documenting code, but it's fun to see how small it can get.
What about this:
# Create an array of regexes to be used in `scan' in the loop.
# `\b' makes sure that `barfoobar' does not match `bar' or `foo'.
p word_list = File.open("frequency_count.txt"){|io| io.read.scan(/\w+/)}.map{|w| /\b#{w}\b/}
File.open("test_essay.txt") do |io|
loop do
# Add lines to `paragraph' as long as there is a continuous line
paragraph = ""
# A `l.chomp.empty?' becomes true at paragraph border
while l = io.gets and !l.chomp.empty?
paragraph << l
end
p word_list.map{|re| paragraph.scan(re).length}
# The end of file has been reached when `l == nil'
break unless l
end
end
To count how many times one word appears in a text:
text = "word aaa word word word bbb ccc ccc"
text.scan(/\w+/).count("word") # => 4
To count a set of words:
text = "word aaa word word word bbb ccc ccc"
wlist = text.scan(/\w+/)
wset = ["word", "ccc"]
result = {}
wset.each {|word| result[word] = wlist.count(word) }
result # => {"word" => 4, "ccc" => 2}
result["ccc"] # => 2