Working with large amounts of data with many duplicates in Ruby

I have a large text file which I want to process with a Ruby script and store in a separate file. My problem is that the resulting file will consist of hundreds of millions of lines, where the vast majority are duplicates. I would like to eliminate the duplicates before writing them to disk.
I have tried processing the lines and putting them in a set to eliminate the duplicates before writing them to the output file, but eventually I ran out of memory and the script crashed.
Is there a way to solve my problem efficiently in Ruby?

Create a file called uniq.rb with this code:
require 'digest'

# Remember each line by its 16-byte MD5 digest rather than by the full text.
hashes = {}

STDIN.each do |line|
  line.chomp!
  md5 = Digest::MD5.digest(line)
  next if hashes.include?(md5)   # already seen this line, skip it
  hashes[md5] = true
  puts line
end
Then run it from the command line:
ruby uniq.rb < input.txt > output.txt
The main idea is that you don't have to keep the entire line in memory; storing just its 16-byte MD5 digest (plus a true value) is enough to track which lines you have already seen.

If order doesn't matter, you can use the Unix sort and uniq commands:
ruby process.rb | sort | uniq > out.txt

Related

Merging CSVs into one produces an unexpectedly large file

I have 600 CSV files of roughly 1 MB each, for a total of roughly 600 MB. I want to put all of them into a SQLite3 database. So my first step would be to merge them into one big CSV (of about 600 MB, right?) before importing it into the database.
However, when I run the following bash commands (to merge all files while keeping one header):
cat file-chunk0001.csv | head -n1 > file.csv
for f in *.csv; do cat "`pwd`/$f" | tail -n +2 >> file.csv; done
The resulting file.csv has a size of 38 GB, at which point the process stops because there is no space left on the device.
So my question is: why would the merged file be more than 50 times bigger than expected? And what can I do to put the data into a SQLite3 database at a reasonable size?
I guess my first question is: if you know how to write a for loop, why do you need to merge all the files into a single CSV file at all? Can't you just load them one after the other?
But your immediate problem is an infinite loop: your wildcard (*.csv) includes the file you're writing to. Put your output file in a different directory, or make sure your file glob does not include the output file (for f in file-*.csv, perhaps).
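If you do go the load-them-one-by-one route, here is a minimal sketch in Ruby using the csv standard library and the sqlite3 gem. The database name, table layout, column names and file glob are assumptions for illustration only; adapt them to your actual data.

require 'csv'
require 'sqlite3'

# Hypothetical schema: two text columns; adjust to match your CSV headers.
db = SQLite3::Database.new('data.db')
db.execute('CREATE TABLE IF NOT EXISTS rows (name TEXT, value TEXT)')

Dir.glob('file-chunk*.csv').sort.each do |path|
  db.transaction do                       # one transaction per file keeps inserts fast
    CSV.foreach(path, headers: true) do |row|
      db.execute('INSERT INTO rows (name, value) VALUES (?, ?)',
                 [row['name'], row['value']])
    end
  end
end

Loading file by file this way also avoids creating the huge intermediate CSV entirely.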

Difference between cat file_name | sort, sort < file_name, sort file_name in bash

Although they do give the same results, I wonder if there is some difference between them and which is the most appropriate way to sort something contained in a file.
Another thing that intrigues me is the use of delimiters: I noticed that the sort filter only works if you separate the strings with a newline. Is there any way to do this without having to write each new string on a separate line?
The sort(1) command reads lines of text, analyzes and sorts them, and writes out the result. The command is intended to read lines, and lines in Unix/Linux are terminated by a newline.
The command takes its first non-option argument as the file to read; if none is given, it reads standard input. So:
sort file_name
is a command line with such an argument. The other two forms, "... | sort" and "sort < ...", do not name the file for sort(1) directly; they feed it through its standard input instead. As far as sort(1) is concerned, the effect is the same.
As for "ways to do this without having to write the new strings in a separate line":
Ultimately, no. But if you want, you can feed sort through another filter (a program) that reads the non-newline-separated file and produces lines to pass to sort. If such a program exists and is named "myparse", you can do:
myparse non-linefeed-separated-file | sort
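Such a filter is easy to write yourself. Here is a minimal sketch, assuming the items in the file are separated by commas (the name myparse.rb is just for illustration):

#!/usr/bin/env ruby
# myparse.rb -- read comma-separated items from stdin and emit one per line,
# so that sort(1) can then sort them.
STDIN.read.split(',').each do |item|
  puts item.strip
end

which you would run as: ruby myparse.rb < non-linefeed-separated-file | sort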
The solution using cat involves creating a second process unnecessarily. This could be a performance issue if you perform many such operations in a loop.
With input redirection, the shell itself sets up the association between the file and standard input; if the file does not exist, it is the shell that complains about the missing file.
When you pass the file name as an explicit argument, the sort process has to open the file itself and report an error if there is an accessibility problem with it.

Ruby: print the last line of each file given via stdin redirection

I want to run Ruby like this:
cat *.txt | my_print_last_line.rb
and have the last line of each file printed (I might do something more interesting once I can get this first print working).
The ARGF class seems promising for solving this.
I tried this:
ARGF.each do |line|
  last_line_of_file = line[last]
  puts last_line_of_file
end
However, ARGF.each seems to iterate over one long stream of all the files concatenated together, rather than over the individual files.
This has to do with how cat handles multiple files. It just concatenates them:
$ echo 'foo' > foo
$ echo 'bar' > bar
$ cat *
foo
bar
This means that after the cat command there is no way of telling where the individual files started and ended. If you invoked your script as ruby my_print_last_line.rb *.txt instead, you would still have the same problem, because ARGF also simply concatenates all the files.
As an alternative, you could take the filenames as parameters to the ruby script to open and read the files directly in ruby:
# Read each file named on the command line and print its last line.
ARGV.each do |file|
  puts IO.readlines(file).last
end
Then invoke your script like this:
$ ruby my_print_last_line.rb *.txt
If you are reading really large files, you may gain a significant performance benefit from a seek-based approach, as described by @DonaldScottWilde in this answer to a similar question.
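A minimal sketch of that seek-based idea (it assumes the last line fits inside the chunk read from the end of the file; chunk_size is just an illustrative default):

# Read only a small chunk from the end of the file instead of every line.
def last_line(path, chunk_size = 4096)
  File.open(path, 'rb') do |f|
    size = f.size
    return nil if size.zero?
    f.seek([size - chunk_size, 0].max)
    tail = f.read
    # Ignore a trailing newline, then keep what follows the last remaining newline.
    tail.sub(/\n\z/, '').split("\n").last
  end
end

ARGV.each { |file| puts last_line(file) }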

Check for duplicates in a file of over 2GB

I have a large list of words that has been compiled from various sources. Having come from so many unrelated sources, I imagine there are some duplicates. Even inside some of the original files, there are duplicates. I've created a script to sort them out, but the file has gotten so ungainly at this point that I run out of memory when trying to parse it. The source is below. I'm running Windows 8, 64-bit, with Ruby 1.9.3-p327.
#!/usr/bin/env ruby
words = []
File.foreach( "wordlist.txt" ) do |line|
  words << line
end
words.uniq!()
words = words.sort()
wordFile = File.open( "wordlist.txt", "w" )
words.each do |word|
  wordFile << word + "\n"
  puts "Wrote to file: #{ word }"
end
There are a number of different ways to remove duplicates, and for one thing, you don't need to do it in Ruby at all. If the words fit in memory, you can keep a set of the words you've seen so far and skip printing any word that is already in it. If the set is too big, you can always sort the file outside of Ruby using the sort command (look into the -T switch to use a temporary directory instead of memory) and pipe the output into uniq -c.
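A minimal sketch of the in-memory set idea, streaming the input and writing to a separate output file (both file names here are illustrative):

require 'set'

# Write each word out the first time it is seen; never build a giant array.
seen = Set.new
File.open("wordlist_unique.txt", "w") do |out|
  File.foreach("wordlist.txt") do |line|
    word = line.chomp
    out.puts(word) if seen.add?(word)   # Set#add? returns nil if already present
  end
end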

How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
This file in particular has 642,868 lines:
$ wc -l ../nginx.log
  642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside the f.each_line block above simply matches a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.
I just did a test on a 600,000-line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but in the line parsing. Can you paste your parse code as well?
This blog post includes several approaches to parsing large log files; maybe that's an inspiration. Also have a look at the file-tail gem.
If you are using bash (or similar) you might be able to optimize like this:
In input.rb:
# ruby -n wraps this whole script in a `while gets ... end` loop,
# so each input line arrives here in $_.
# Parse $_ here
then in bash:
cat nginx.log | ruby -n input.rb
The -n flag tells Ruby to assume a 'while gets(); ... end' loop around your script, which might cause it to do something special to optimize.
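To see what -n buys you, these two hypothetical one-liners (the /404/ pattern is only an example) behave the same way:
ruby -n -e 'puts $_ if $_ =~ /404/' nginx.log
ruby -e 'while gets; puts $_ if $_ =~ /404/; end' nginx.log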
You might also want to look into a prewritten solution to the problem, as that will be faster.
