Check for duplicates in a file of over 2GB - ruby

I have a large list of words that has been compiled from various sources. Having come from so many unrelated sources, I imagine there are some duplicates. Even inside some of the original files, there are duplicates. I've created a script to sort them out, but the file has gotten so ungainly at this point that I run out of memory when trying to parse it. The source is below. I'm running Windows 8, 64-bit, with Ruby 1.9.3-p327.
#!/usr/bin/env ruby

words = []

File.foreach("wordlist.txt") do |line|
  words << line.chomp   # strip the trailing newline so it isn't doubled on output
end

words.uniq!
words.sort!

File.open("wordlist.txt", "w") do |word_file|
  words.each do |word|
    word_file.puts word
    puts "Wrote to file: #{word}"
  end
end

There are a number of different ways of removing duplicates, and you don't need to do this in Ruby at all. If the words fit in memory, you can keep a set of the words you've already seen and skip writing any word that's in it. If the set is too big, you can always sort the file outside of Ruby with the sort command (look into the -T switch to point it at a temporary directory instead of memory) and pipe the output through uniq (or uniq -c if you also want a count of how often each word appeared).
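If the words do fit in memory, a minimal sketch of that set-based approach might look like this (writing to a separate file, wordlist_unique.txt, is my own choice so the original isn't overwritten while it's still being read):

require 'set'

seen = Set.new

File.open("wordlist_unique.txt", "w") do |out|
  File.foreach("wordlist.txt") do |line|
    word = line.chomp
    next if seen.include?(word)   # already written once, skip it
    seen.add(word)
    out.puts word
  end
end

This keeps the first occurrence of each word in its original order; for sorted output of a file this size, the external sort | uniq route is the safer bet.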

Is special variable $. gone from Ruby?

When processing a file, I used to use the special variable $. to get the last line number being read. For instance, the following program
require 'csv'
IFS = ';'
CSV_OPTIONS = { col_sep: IFS, external_encoding: Encoding::ISO_8859_1, internal_encoding: Encoding::UTF_8 }
CSV.new($stdin, CSV_OPTIONS).each do |row|
  puts "::::line #{$.} row=#{row}"
end
is supposed to dump a CSV file (where the fields are delimited by semicolons instead of commas, as is the case in our project) and prefix each output line with the line number.
After updating Ruby to
ruby 2.6.3p62 (2019-04-16 revision 67580) [x86_64-cygwin]
the lines are still dumped, but the line number is always displayed as zero.
What strikes me is that this Ruby wiki on special Ruby variables, while still having $. in its list, doesn't have a description for this variable anymore. So I wonder: is this variable gone, or was it never supposed to work with the CSV class and just worked for me by accident in earlier versions?
I'm not sure why $. isn't working for you, but it's also not the best solution here. When it works, $. gives you the number of lines read from input, but since quoted fields in a CSV file can span multiple lines the number you get from $. won't always be the number of rows that have been read.
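Here's a contrived illustration of that, with a made-up two-row input where one quoted field spans two physical lines:

require 'csv'

data = "a;b\n\"multi\nline\";c\n"   # 2 rows, but 3 physical lines

CSV.new(data, col_sep: ";").each do |row|
  p row
end
# => ["a", "b"]
# => ["multi\nline", "c"]
# A line-based counter like $. would be at 3 after the second row, not 2.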
One good alternative is each_with_index:
CSV.new($stdin, CSV_OPTIONS).each_with_index do |row, i|
  puts "::::row #{i} row=#{row}"
end
Another alternative is CSV#lineno:
lineno()
The line number of the last row read from this file. Fields with nested line-end characters will not affect this count.
You would use it like this:
csv = CSV.new($stdin, CSV_OPTIONS)
csv.each do |row|
  puts "::::row #{csv.lineno} row=#{row}"
end
Note that each_with_index will start counting at 0, whereas lineno starts at 1.
You can see both approaches in action on repl.it: https://repl.it/#jrunning/LoudBlushingCharactercode

ruby match string or space/tab at the beginning of a line and insert uniq lines to a file

This is my code:
File.open(file_name) do |file|
  file.each_line do |line|;
    if line =~ (/SAPK/) || (line =~ /^\t/ and tabs == true) || (line =~ /^ / and spaces == true)
      file = File.open("./1.log", "a"); puts "found a line #{line}"; file.write("#{line}".lstrip!)
    end
  end
end
File.open("./2.log", "a") { |file| file.puts File.readlines("./1.log").uniq }
I want to insert every line that matches a specific string, or that starts with a tab or a space, into 1.log; those lines have a space/tab at the beginning, so I strip it off.
Then I want to get the unique lines from 1.log and write them to 2.log.
It would be great if someone could go over the code and tell me if anything is incorrect.
When using files in Ruby, what is the difference between the w+ and a modes?
I know:
w+ - Create an empty file for both reading and writing.
a - Append to a file.The file is created if it does not exist.
But both options seem to append to the file. I thought w+ would behave like > rather than >>, so I guess w+ is also like >>?
Thanks!
There's a lot of confusion in this code, and it isn't helped by your habit of jamming things onto a single line for no reason. Do try to keep your code clean so the functionality is obvious. There are also a lot of quirky anti-patterns, like stringifying strings and testing booleans against booleans, which you should really avoid.
One thing you'll want to do is employ Tempfile for those situations where you need an intermediate file.
Here's a reworked version that's cleaned up:
require 'tempfile'

Tempfile.open do |temp|
  File.open(file_name) do |input|
    input.each_line do |line|
      if line.match(/SAPK/) || (line.match(/^\t/) and tabs) || (line.match(/^ /) and spaces)
        puts "found a line #{line}"
        temp.write(line.lstrip)   # lstrip, not lstrip!, which returns nil when there's nothing to strip
      end
    end
  end

  File.open("./2.log", "w+") do |file|
    # Rewind the temporary file so we can read the data back
    temp.rewind
    file.puts temp.readlines.uniq
  end
end
As for a versus w+: they aren't interchangeable. a opens the file for appending, so existing content is kept and new data goes on the end, like >> in the shell, while w+ truncates the file and opens it for both reading and writing, like >. Here w+ is fine because 2.log is rebuilt from scratch on each run; if you wanted to accumulate results across runs you'd use a. Pick the mode that matches your intent and use it consistently, or your code will be confusing.
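A quick way to see the difference, using a throwaway demo.txt of my choosing:

File.write("demo.txt", "first\n")

File.open("demo.txt", "a")  { |f| f.puts "second" }   # append: keeps "first"
puts File.read("demo.txt")   # => "first\nsecond\n"

File.open("demo.txt", "w+") { |f| f.puts "third" }    # truncate, then read/write
puts File.read("demo.txt")   # => "third\n"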
My criticism of things like x == true is that something so narrowly specific usually means x could take on a multitude of values and true is just one particular case we're handling, which implies we should also be worrying about false and many other values. It's a red herring and will only invite questions.

Working with large amounts of data with multiple duplicates in ruby

I have a large text file which I want to process with a Ruby script and store in a separate file. My problem is that the resulting file will consist of hundreds of millions of lines, the vast majority of which are duplicates. I would like to eliminate the duplicates before writing them to disk.
I have tried processing them and putting the lines in a set to eliminate the duplicates before writing them to the output file, but eventually I ran out of memory and the script crashed.
Is there a way to solve my problem efficiently in ruby?
Create a file called uniq.rb with this code:
require 'digest'
hashes = {}
STDIN.each do |line|
  line.chomp!
  md5 = Digest::MD5.digest(line)
  next if hashes.include?(md5)
  hashes[md5] = true
  puts line
end
then run it from the command line:
ruby uniq.rb < input.txt > output.txt
The main idea is that you don't have to keep the entire line in memory, just a 16-byte MD5 digest (plus a true value) per unique line.
If order doesn't matter, you can pipe through the Unix sort and uniq commands instead:
ruby process.rb | sort | uniq > out.txt

How do I print the line number of the file I am working with via ARGV?

I'm currently opening a file taken at runtime via ARGV:
File.open(ARGV[0]) do |f|
  f.each_line do |line|
Once a match is found I print output to the user.
if line.match(/(strcpy)/i)
  puts "[!] strcpy does not check for buffer overflows when copying to destination."
  puts "[!] Consider using strncpy or strlcpy (warning, strncpy is easily misused)."
  puts " #{line}"
end
I want to know how to print out the line number for the matching line in the (ARGV[0]) file.
Using print __LINE__ shows the line number from the Ruby script. I've tried many different variations of print __LINE__ with different string interpolations of #{line} with no success. Is there a way I can print out the line number from the file?
When Ruby's IO class opens a file, it sets the $. global variable to 0. For each line that is read, that variable is incremented, so to know which line has just been read, simply use $..
Look in the English module for $. or $INPUT_LINE_NUMBER.
We can also use the lineno method that is part of the IO class. I find that a bit more convoluted, because we need an IO stream object to tack it onto, whereas $. always works.
I'd write the loop more simply:
File.foreach(ARGV[0]) do |line|
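Putting the pieces together, a sketch of that loop with the strcpy check from the question and $. supplying the line number might look like:

File.foreach(ARGV[0]) do |line|
  if line.match(/(strcpy)/i)
    puts "[!] strcpy does not check for buffer overflows when copying to destination."
    puts "[!] Consider using strncpy or strlcpy (warning, strncpy is easily misused)."
    puts "    line #{$.}: #{line}"   # $. holds the number of the line just read
  end
end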
Something to think about: if you're on a *nix system, you can use the OS's built-in grep or fgrep tool to greatly speed up your processing. The grep family of tools is highly optimized for exactly this kind of job; it can find all occurrences or only the first, work with regular expressions or fixed strings, and can easily be called using Ruby's %x or backtick operators.
puts `grep -inm1 abacus /usr/share/dict/words`
Which outputs:
34:abacus
-inm1 combines three switches: -i ("ignore character case"), -n ("output line numbers") and -m1 ("stop after the first occurrence").
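If you want that line number back in Ruby rather than just echoed to the console, one small sketch is to capture the grep output and split on the first colon:

output = `grep -inm1 abacus /usr/share/dict/words`.chomp

unless output.empty?
  line_number, text = output.split(":", 2)
  puts "line #{line_number}: #{text}"
end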

How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
This file in particular has 642,868 lines:
$ wc -l nginx.log /code/src/myimport
642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside the f.each_line from above is simply matching a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.
I just did a test on a 600,000 line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but the line parsing. Can you paste your parse code also?
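For reference, a rough recreation of that kind of timing test might look like the following (the file name and contents here are made up purely for the benchmark):

require 'benchmark'

# Build a throwaway ~600,000-line file.
File.open("timing_test.log", "w") do |f|
  600_000.times { |i| f.puts "line #{i}" }
end

elapsed = Benchmark.realtime do
  File.foreach("timing_test.log") { |line| line }
end

puts "iterated 600,000 lines in %.2fs" % elapsed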
This blog post includes several approaches to parsing large log files; maybe that's an inspiration. Also have a look at the file-tail gem.
If you are using bash (or similar) you might be able to optimize like this:
In input.rb:
while x = gets
  # Parse
end
then in bash:
cat nginx.log | ruby -n input.rb
The -n flag tells Ruby to wrap an implicit 'while gets ... end' loop around your script, with each line available in $_, which might cause it to do something special to optimize.
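Since -n supplies the loop itself, input.rb can be trimmed down to just the per-line work; a sketch (the GET check is only a placeholder for whatever parsing you actually do):

# input.rb -- run as: ruby -n input.rb < nginx.log
# With -n, Ruby wraps this body in `while gets ... end`,
# and each line is available in $_.
print $_ if $_.include?("GET")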
You might also want to look into a prewritten solution to the problem, as that will be faster.
