How to efficiently parse large text files in Ruby

I'm writing an import script that processes a file that has potentially hundreds of thousands of lines (log file). Using a very simple approach (below) took enough time and memory that I felt like it would take out my MBP at any moment, so I killed the process.
#...
File.open(file, 'r') do |f|
  f.each_line do |line|
    # do stuff here to line
  end
end
This file in particular has 642,868 lines:
$ wc -l nginx.log /code/src/myimport
642868 ../nginx.log
Does anyone know of a more efficient (memory/cpu) way to process each line in this file?
UPDATE
The code inside of the f.each_line from above is simply matching a regex against the line. If the match fails, I add the line to a @skipped array. If it passes, I format the matches into a hash (keyed by the "fields" of the match) and append it to a @results array.
# regex built in `def initialize` (not on each line iteration)
@regex = /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}) - (.{0})- \[([^\]]+?)\] "(GET|POST|PUT|DELETE) ([^\s]+?) (HTTP\/1\.1)" (\d+) (\d+) "-" "(.*)"/
#... loop lines
match = line.match(@regex)
if match.nil?
  @skipped << line
else
  @results << convert_to_hash(match)
end
I'm completely open to this being an inefficient process. I could make the code inside of convert_to_hash use a precomputed lambda instead of figuring out the computation each time. I guess I just assumed it was the line iteration itself that was the problem, not the per-line code.

I just did a test on a 600,000 line file and it iterated over the file in less than half a second. I'm guessing the slowness is not in the file looping but the line parsing. Can you paste your parse code also?

This blog post includes several approaches to parsing large log files. Maybe that's an inspiration. Also have a look at the file-tail gem.

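If you are using bash (or similar) you might be able to simplify it like this:
In input.rb:
# Parse the current line, which ruby -n places in $_
then in bash:
cat nginx.log | ruby -n input.rb
The -n flag tells ruby to assume a 'while gets(); ... end' loop around your script, so input.rb should contain only the per-line code. An explicit 'while x = gets' loop inside it would skip the first line, because the implicit loop has already read it. This mostly saves boilerplate; it is not guaranteed to be faster than each_line.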
You might also want to look into a prewritten solution to the problem, as that will be faster.
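If memory is the concern, one option is to stop collecting every parsed line in an array and hand each entry to a block instead. A minimal sketch, assuming a simplified pattern with named captures (LOG_PATTERN and each_log_entry are hypothetical names, not from the question) and a StringIO standing in for the log file:

```ruby
require 'stringio'

# Hypothetical, simplified version of the question's regex, using named captures.
LOG_PATTERN = /\A(?<ip>\d{1,3}(?:\.\d{1,3}){3}) - - \[(?<time>[^\]]+)\] "(?<method>GET|POST|PUT|DELETE) (?<path>\S+) HTTP\/1\.1" (?<status>\d+) (?<bytes>\d+)/

# Yields one hash per matching line and returns the number of skipped lines,
# so neither results nor skipped lines have to accumulate in memory.
def each_log_entry(io)
  skipped = 0
  io.each_line do |line|
    match = LOG_PATTERN.match(line)
    if match
      yield match.named_captures
    else
      skipped += 1
    end
  end
  skipped
end

# Stand-in for File.open("nginx.log"); the sample lines are made up.
sample = StringIO.new(<<~LOG)
  127.0.0.1 - - [10/Oct/2011:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326 "-" "curl/7.0"
  not a log line
LOG

entries = []
skipped = each_log_entry(sample) { |entry| entries << entry }
puts "parsed=#{entries.size} skipped=#{skipped}"
```

In a real import the block would write each entry to its destination right away rather than appending to entries, which is only done here to show the result.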

Related

ruby match string or space/tab at the beginning of a line and insert uniq lines to a file

This is my code:
File.open(file_name) do |file|
file.each_line do |line|;
if line =~ (/SAPK/) || (line =~ /^\t/ and tabs == true) || (line =~ /^ / and spaces == true)
file = File.open("./1.log", "a"); puts "found a line #{line}"; file.write("#{line}".lstrip!)
end
end
end
File.open("./2.log", "a") { |file| file.puts File.readlines("./1.log").uniq }
I want to write every line that matches a specific string, or that starts with a tab or a space, to the file 1.log. The matching lines may have leading spaces or tabs, so I strip them.
I then want to get the unique lines in 1.log and write them to 2.log.
It would be great if someone could go over the code and tell me if something is not correct.
When using files in Ruby, what is the difference between the w+ and a modes?
I know:
w+ - Create an empty file for both reading and writing.
a - Append to a file. The file is created if it does not exist.
But both options seem to append to the file. I thought w+ should behave like > instead of >>, so I guess w+ also behaves like >>?
Thanks !!

There's a lot of confusion in this code and it isn't helped by your habit of jamming things on a single line for no reason. Do try and keep your code clean, as the functionality should be obvious. There's also a lot of quirky anti-patterns like stringifying strings and testing booleans vs booleans which you should really avoid doing.
One thing you'll want to do is employ Tempfile for those situations where you need an intermediate file.
Here's a reworked version that's cleaned up:
require 'tempfile'

Tempfile.open do |temp|
  File.open(file_name) do |input|
    input.each_line do |line|
      if line.match(/SAPK/) || (line.match(/^\t/) and tabs) || (line.match(/^ /) and spaces)
        puts "found a line #{line}"
        # lstrip! returns nil when there is nothing to strip, so use lstrip
        temp.write(line.lstrip)
      end
    end
  end
  File.open("./2.log", "w+") do |file|
    # Rewind the temporary file to read data back
    temp.rewind
    # puts writes each array element as its own line; write would dump the inspected array
    file.puts(temp.readlines.uniq)
  end
end
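Note that a and w+ are not equivalent. w+ truncates an existing file to zero length when it is opened (like > in the shell) and allows both reading and writing; a keeps the existing contents and positions every write at the end of the file (like >>). Within a single open handle, successive writes follow one another in either mode, which can make w+ look like it appends. Pick the mode that matches your intent and use it consistently or your code will be confusing.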
My criticism over things like x == true is because something that narrowly specific usually means that x could take on a multitude of values and true is one particular case we're trying to handle, something that implies that we should be aware it might be false and many other things. It's a red herring and will only invite questions.
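To see how the two modes treat existing content, here is a small sketch that writes to a scratch file in a temporary directory. The file name modes.log and the method name demo_modes are made up for illustration:

```ruby
require 'tmpdir'

# Returns the file contents after an "a" open and after a "w+" open.
def demo_modes
  Dir.mktmpdir do |dir|
    path = File.join(dir, "modes.log")
    File.write(path, "old\n")

    # "a" keeps the existing content and writes at the end.
    File.open(path, "a") { |f| f.puts "appended" }
    after_append = File.read(path)

    # "w+" truncates the file when it is opened, then allows reads and writes.
    File.open(path, "w+") { |f| f.puts "fresh" }
    after_truncate = File.read(path)

    [after_append, after_truncate]
  end
end

after_append, after_truncate = demo_modes
puts after_append.inspect   # "old\nappended\n"
puts after_truncate.inspect # "fresh\n"
```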

Working with large amount of data with multiple duplicates in ruby

I have a large text file which I want to process with a ruby script and store in a separate file. My problem is that the resulting file will consist of hundreds of millions of lines where the vast majority of them are duplicates. I would like to eliminate the duplicates before writing them to disk.
I have tried processing them and putting the lines in a set to eliminate the duplicates before writing them to the output file, but eventually I ran out of memory and the script crashed.
Is there a way to solve my problem efficiently in ruby?

Create a file called uniq.rb with this code:
require 'digest'
hashes = {}
STDIN.each do |line|
  line.chomp!
  md5 = Digest::MD5.digest(line)
  next if hashes.include?(md5)
  hashes[md5] = true
  puts line
end
then run it from the command line:
ruby uniq.rb < input.txt > output.txt
The main idea is that you don't have to save the entire line in memory, but instead just a 16-byte MD5 hash (plus true value) to track the unique lines.
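The same idea can be sketched with a Set and explicit input/output objects, which makes it easy to try on a small sample. The method name write_unique_lines is hypothetical, and StringIO stands in for STDIN and STDOUT here. Keep in mind that two different lines with the same MD5 digest would be treated as duplicates, which is extremely unlikely for ordinary data but not impossible.

```ruby
require 'digest'
require 'set'
require 'stringio'

# Copies each distinct line from input to output, keeping the first occurrence.
# Only the 16-byte digests are held in memory, not the lines themselves.
def write_unique_lines(input, output)
  seen = Set.new
  input.each_line do |line|
    line = line.chomp
    # Set#add? returns nil when the digest is already in the set.
    output.puts(line) if seen.add?(Digest::MD5.digest(line))
  end
  seen.size
end

out = StringIO.new
count = write_unique_lines(StringIO.new("a\nb\na\nc\nb\n"), out)
puts out.string
puts "#{count} unique lines"
```

Passing STDIN and STDOUT instead of the StringIO objects gives the same command-line behaviour as uniq.rb.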

If order doesn't matter you can use Unix uniq command.
ruby process.rb | sort | uniq > out.txt

Ruby scan/gets until EOF

I want to scan unknown number of lines till all the lines are scanned. How do I do that in ruby?
For ex:
put returns between paragraphs
for linebreak add 2 spaces at end
_italic_ or **bold**
The input is not from a 'file' but through the STDIN.

Many ways to do that in ruby.
Most usually, you're gonna wanna process one line at a time, which you can do, for example, with
while line=gets
end
or
STDIN.each_line do |line|
end
or by running ruby with the -n switch, for example, which implies one of the above loops (the line is saved into $_ in each iteration, and you can add BEGIN{} and END{} blocks, just like in awk; this is really good for one-liners).
I wouldn't do STDIN.read, though, as that will read the whole file into memory at once (which may be bad, if the file is really big.)
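As a rough sketch of the line-at-a-time approach, the method below works on any IO-like object, so passing STDIN would read until EOF. The name summarize is made up, and a StringIO holding the question's sample input stands in for STDIN so the example can run on its own:

```ruby
require 'stringio'

# Reads io one line at a time until EOF, keeping only a running summary in memory.
def summarize(io)
  lines = 0
  longest = ""
  io.each_line do |line|
    line = line.chomp
    lines += 1
    longest = line if line.length > longest.length
  end
  { lines: lines, longest: longest }
end

# Stand-in for STDIN, using the sample input from the question.
input = StringIO.new(<<~TEXT)
  put returns between paragraphs
  for linebreak add 2 spaces at end
  _italic_ or **bold**
TEXT

summary = summarize(input)
puts "#{summary[:lines]} lines, longest: #{summary[:longest]}"
```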

Use IO#read (without length argument, it reads until EOF)
lines = STDIN.read
or use gets with nil as argument:
lines = gets(nil)
To denote EOF, type Ctrl + D (Unix) or Ctrl + Z (Windows).

How do I print the line number of the file I am working with via ARGV?

I'm currently opening a file taken at runtime via ARGV:
File.open(ARGV[0]) do |f|
f.each_line do |line|
Once a match is found I print output to the user.
if line.match(/(strcpy)/i)
puts "[!] strcpy does not check for buffer overflows when copying to destination."
puts "[!] Consider using strncpy or strlcpy (warning, strncpy is easily misused)."
puts " #{line}"
end
I want to know how to print out the line number for the matching line in the (ARGV[0]) file.
Using print __LINE__ shows the line number from the Ruby script. I've tried many different variations of print __LINE__ with different string interpolations of #{line} with no success. Is there a way I can print out the line number from the file?

When Ruby's IO class opens a file, it sets the $. global variable to 0. For each line that is read that variable is incremented. So, to know what line has been read simply use $..
Look in the English module for $. or $INPUT_LINE_NUMBER.
We can also use the lineno method that is part of the IO class. I find that a bit more convoluted because we need an IO stream object to tack that onto, while $. will work always.
I'd write the loop more simply:
File.foreach(ARGV[0]) do |line|
Something to think about is, if you're on a *nix system, you can use the OS's built-in grep or fgrep tool to greatly speed up your processing. The "grep" family of applications are highly optimized for doing what you want, and can find all occurrences, only the first, can use regular expressions or fixed strings, and can easily be called using Ruby's %x or backtick operators.
puts `grep -inm1 abacus /usr/share/dict/words`
Which outputs:
34:abacus
-inm1 means "ignore character-case", "output line numbers", "stop after the first occurrence"
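If you would rather not depend on the $. global, another option is to number the lines yourself with with_index. A small sketch, where find_calls is a hypothetical helper and the C snippet written to the temporary file is made-up sample data:

```ruby
require 'tempfile'

# Returns [line_number, text] pairs for every line in the file matching pattern.
def find_calls(path, pattern)
  hits = []
  # with_index(1) numbers the lines starting from 1.
  File.foreach(path).with_index(1) do |line, number|
    hits << [number, line.chomp] if line.match?(pattern)
  end
  hits
end

Tempfile.create(["sample", ".c"]) do |f|
  f.write("int main() {\n  strcpy(dst, src);\n  return 0;\n}\n")
  f.flush
  find_calls(f.path, /strcpy/i).each do |number, line|
    puts "[!] line #{number}: #{line.strip}"
  end
end
```

In the original script, ARGV[0] would take the place of f.path.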

Ruby grep, match and return

Is there any way to check if a value exists in a file without ALWAYS going through the entire file?
Currently I used:
if open('file.txt').grep(/value/).length > 0
  puts "match"
else
  puts "no match"
end
But it's not efficient as I only want to know whether it exists or not. I would really appreciate a solution with grep or a similar one-liner.
Please note the "ALWAYS" before down-voting my question.

If you want line-by-line comparison using a one-liner:
matches = open('file.txt') { |f| f.each_line.find { |line| line.include?("value") } }
puts matches ? "yes" : "naaw"

By definition, the only way you can tell if an arbitrary expression exists in a file is by going over the file and looking for it. If you're looking for the first instance, then on average you'll be scanning half the file until you find your expression when it's there. If the expression isn't there then you'll have to scan the entire file to figure that out.
You could implement that in a one-liner by scanning the file line-by-line. Use IO.foreach
If you do this often, then you can make the search super efficient by indexing the file first, e.g. by using Lucene. It's a trade-off: you still have to scan the file, but only once, since you save its content in a more search-friendly data structure. However, if you don't access a given file very frequently, it's probably not worth the overhead in implementation, maintenance and extra storage.
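A line-by-line scan that stops at the first hit could look roughly like this. The helper name first_match_line is made up, and the sample file is created only so the sketch can run on its own:

```ruby
require 'tempfile'

# Returns the 1-based number of the first line containing needle, or nil.
# Reading stops as soon as a match is found; File.open closes the file either way.
def first_match_line(path, needle)
  File.open(path) do |f|
    f.each_line do |line|
      return f.lineno if line.include?(needle)
    end
  end
  nil
end

Tempfile.create("search") do |f|
  f.write("alpha\nvalue here\nvalue again\n")
  f.flush
  puts first_match_line(f.path, "value").inspect
  puts first_match_line(f.path, "missing").inspect
end
```

When only a yes/no answer is needed, the result can be tested directly for nil.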

Here's a ruby one-liner that will work from the linux command line to perform a grep on a text file, and stop on first found.
ruby -ne '(puts "first found on line #{$.}"; break) if $_ =~ /regex here/' file.txt
-n gets each line in the file and feeds it to the global variable $_
$. is a global variable that stores the current line number
If you want to find all lines matching the regex, then:
ruby -ne 'puts "found on line #{$.}" if $_ =~ /regex here/' file.txt
