How do I quickly slice and dice large data files? - ruby

I'd like to slice and dice large data files, up to a gig, in a fairly quick and efficient manner. If I use something like Unix's cut, it's extremely fast, even in a Cygwin environment.
I've tried developing and benchmarking various Ruby scripts to process these files, and always end up with glacial results.
What would you do in Ruby to make this not so dog slow?

This question reminds me of Tim Bray's Wide Finder project. The fastest way he could read an Apache logfile using Ruby and figure out which articles have been fetched the most was with this script:
counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0..9].each do |key|
  puts "#{counts[key]}: #{key}"
end
It took this code 7½ seconds of CPU, 13½ seconds elapsed, to process a million and change records, a quarter-gig or so, on last year's 1.67 GHz PowerBook.

Why not combine them: use cut to do what it does best, and Ruby to provide the glue and value-add on top of cut's results? You can run shell commands by putting them in backticks, like this:
`cut -f1 somefile > foo.fil`  # cut needs a field/character list, e.g. -f1

# process each line of the output from cut
f = File.new("foo.fil")
f.each { |line|
  # glue/value-add logic for each line goes here
}
f.close
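Alternatively (a variant that is not in the original answer), you can skip the temporary file and stream cut's output directly with IO.popen; the field list -f1,3 and the file name are placeholders:

IO.popen(['cut', '-f1,3', 'somefile']) do |io|
  io.each_line do |line|
    # glue/value-add logic for each extracted line goes here
  end
end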

I'm guessing that your Ruby implementations are reading the entire file prior to processing. Unix's cut works by reading things one byte at a time and immediately dumping to an output file. There is of course some buffering involved, but not more than a few KB.
My suggestion: try doing the processing in-place with as little paging or backtracking as possible.
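As a minimal sketch of that idea (the file names and the tab-delimited field layout are assumptions), read line by line and write each result immediately, so only the current line is ever held in memory:

File.open('out.txt', 'w') do |out|
  File.foreach('big.dat') do |line|
    fields = line.chomp.split("\t")
    out.puts fields.values_at(0, 2).join("\t")  # roughly what `cut -f1,3` would do
  end
end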

I doubt the problem is that Ruby is reading the whole file into memory. Look at the memory and disk usage while the command is running to verify.
My guess is that the main reason is that cut is written in C and does only one thing, so it has probably been compiled down to the bare metal. It's probably not doing much more than making system calls.
The Ruby version, however, is doing many things at once, and calling a method is much slower in Ruby than calling a C function.
Remember, old age and treachery beat youth and skill in Unix: http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/

Related

Handle ARGV in Ruby without if...else block

In a blog post about unconditional programming Michael Feathers shows how limiting if statements can be used as a tool for reducing code complexity.
He uses a specific example to illustrate his point. Now, I've been thinking about other specific examples that could help me learn more about unconditional/ifless/forless programming.
For example in this cat clone there is an if..else block:
#!/usr/bin/env ruby
if ARGV.length > 0
  ARGV.each do |f|
    puts File.read(f)
  end
else
  puts STDIN.read
end
It turns out Ruby has ARGF, which makes this program much simpler:
#!/usr/bin/env ruby
puts ARGF.read
I'm wondering: if ARGF didn't exist, how could the above example be refactored so there is no if..else block?
Also interested in links to other illustrative specific examples.
Technically you can,
inputs = { ARGV => ARGV.map { |f| File.open(f) }, [] => [STDIN] }[ARGV]
inputs.map(&:read).map(&method(:puts))
Though that's code golf and too clever for its own good.
Still, how does it work?
It uses a hash to store two alternatives.
Map ARGV to an array of open files
Map [] to an array with STDIN, effectively overwriting the ARGV entry if it is empty
Access ARGV in the hash, which returns [STDIN] if it is empty
Read all open inputs and print them
Don't write that code though.
As mentioned in my answer to your other question, unconditional programming is not about avoiding if expressions at all costs but about striving for readable and intention revealing code. And sometimes that just means using an if expression.
You can't always get rid of a conditional (maybe with an insane number of classes) and Michael Feathers isn't advocating that. Instead it's sort of a backlash against overuse of conditionals. We've all seen nightmare code that's endless chains of nested if/elsif/else and so has he.
Moreover, people do routinely nest conditionals inside of conditionals. Some of the worst code I've ever seen is a cavernous nightmare of nested conditions with odd bits of work interspersed within them. I suppose that the real problem with control structures is that they are often mixed with the work. I'm sure there's some way that we can see this as a form of single responsibility violation.
Rather than slavishly try to eliminate the condition, you could simplify your code by first creating an array of IO objects from ARGV, and use STDIN if that list is empty.
io = ARGV.map { |f| File.new(f) }
io = [STDIN] if io.empty?
Then your code can do what it likes with io.
While this has strictly the same number of conditionals, it eliminates the if/else block and thus a branch: the code is linear. More importantly, since it separates gathering data from using it, you can put it in a function and reuse it further reducing complexity. Once it's in a function, we can take advantage of early return.
# I don't have a really good name for this, but it's a
# common enough idiom. Perl provides the same feature as <>
def arg_files
  return ARGV.map { |f| File.new(f) } unless ARGV.empty?
  return [STDIN]
end
Now that it's in a function, your code to cat all the files or stdin becomes very simple.
arg_files.each { |f| puts f.read }
First, although the principle is good, you have to consider other things that are more important, such as readability and perhaps speed of execution.
That said, you could monkey patch the String class to add a read method, put STDIN and the arguments in an array, and read the slice from index 0 up to ARGV.length - 1: when there are arguments, that stops just before STDIN; when there are none, the end index is -1, which means the whole array, i.e. just STDIN.
class String
  def read
    File.read self if File.exist? self
  end
end

puts [*ARGV, STDIN][0..ARGV.length-1].map { |a| a.read }
Before someone points out that I still use an if to check whether the file exists: you should have used two ifs in your example to check that as well, and if you don't, use a rescue to inform the user properly.
EDIT: if you do use the patch, read about the possible problems at these links:
http://blog.jayfields.com/2008/04/alternatives-for-redefining-methods.html
http://www.justinweiss.com/articles/3-ways-to-monkey-patch-without-making-a-mess/
Since the read method isn't part of String, the solutions using alias and super are not necessary. If you plan to use a module, here is how to do that:
module ReadString
  def read
    File.read self if File.exist? self
  end
end

class String
  include ReadString
end
EDIT: I just read about a safe way to monkey patch; for your documentation, see https://solidfoundationwebdev.com/blog/posts/writing-clean-monkey-patches-fixing-kaminari-1-0-0-argumenterror-comparison-of-fixnum-with-string-failed?utm_source=rubyweekly&utm_medium=email

Ruby - how to read first n lines from file into array

For some reason, I can't find any tutorial mentioning how to do this...
So, how do I read the first n lines from a file?
I've come up with:
while File.open('file.txt') and count <= 3 do |f|
...
count += 1
end
end
but it is not working and it also doesn't look very nice to me.
Just out of curiosity, I've tried things like:
File.open('file.txt').10.times do |f|
but that didn't really work either.
So, is there a simple way to read just the first n lines without having to load the whole file?
Thank you very much!
Here is a one-line solution:
lines = File.foreach('file.txt').first(10)
I was worried that it might not close the file promptly (it might only close it after the garbage collector finalizes the Enumerator returned by File.foreach). However, I checked with strace: if you call File.foreach without a block, it returns an Enumerator, and each time you call first on that Enumerator it opens the file, reads as much as it needs, and closes the file again. That's nice, because it means you can use the line of code above and Ruby will not keep the file open any longer than it needs to.
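A quick way to see that behaviour for yourself (assuming a local file.txt):

enum = File.foreach('file.txt')  # no block given, so this just returns an Enumerator
p enum.first(3)                  # opens the file, reads three lines, closes it again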
There are many ways you can approach this problem in Ruby. Here's one way:
File.open('Gemfile') do |f|
  lines = 10.times.map { f.readline }
end
Another approach, using File.foreach with an index:
File.foreach('file.txt').with_index do |line, i|
  break if i >= 10
  puts line
end
File inherits from IO, and IO mixes in Enumerable, whose methods include #first.
Passing an integer to first(n) returns the first n items of the enumerable collection. For a File object, each item is a line in the file.
File.open('filename.txt', 'r').first(10)
This returns an array of the lines including the \n line breaks.
You may want to #join them to create a single whole string.
File.open('filename.txt', 'r').first(10).join
You could try the following:
`head -n 10 file`.split("\n")
It's not really "pure Ruby", but that's rarely a requirement these days.

Recursive method performance in Ruby

I have the following recursive method written in Ruby, but I find that it runs too slowly. I am not sure whether this is the correct way to do it, so please suggest how to improve the performance of this code. The total file count, including subdirectories, is 4,535,347.
def start(directory)
  Dir.foreach(directory) do |file|
    next if file == '.' or file == '..'
    full_file_path = "#{directory}/#{file}"
    if File.directory?(full_file_path)
      start(full_file_path)
    elsif File.file?(full_file_path)
      extract(full_file_path)
    else
      raise "Unexpected input type: neither file nor folder"
    end
  end
end
With 4.5M entries, you might be better off working with a specialized lazy enumerator so as to only process the entries you actually need, rather than generating the entire listing, returning the whole set, and iterating through it in its entirety.
Here's the example from the docs:
class Enumerator::Lazy
  def filter_map
    Lazy.new(self) do |yielder, *values|
      result = yield *values
      yielder << result if result
    end
  end
end

(1..Float::INFINITY).lazy.filter_map { |i| i * i if i.even? }.first(5)
http://ruby-doc.org/core-2.1.1/Enumerator/Lazy.html
It's not a very good example, btw: the important part is Lazy.new() rather than the fact that Enumerator::Lazy gets monkey patched. Here's a much better example imho:
What's the best way to return an Enumerator::Lazy when your class doesn't define #each?
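For this particular problem, a lazy file enumerator might look roughly like the sketch below; the method name each_file_lazily and the directory path are made up, and extract is the method from the question:

def each_file_lazily(directory)
  Enumerator.new do |yielder|
    stack = [directory]
    until stack.empty?
      dir = stack.pop
      Dir.foreach(dir) do |entry|
        next if entry == '.' || entry == '..'
        path = File.join(dir, entry)
        if File.directory?(path)
          stack.push(path)
        else
          yielder << path
        end
      end
    end
  end.lazy
end

# Only as many entries as are actually requested get visited:
each_file_lazily('/some/dir').first(100).each { |path| extract(path) }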
Further reading on the topic:
http://patshaughnessy.net/2013/4/3/ruby-2-0-works-hard-so-you-can-be-lazy
Another option you might want to consider is computing the list across multiple threads.
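A rough sketch of that threaded variant (not from the answer): fan the file paths out to a few worker threads via a Queue. Here extract and directory are from the question, the worker count of 4 is arbitrary, and on MRI this only helps if extract is I/O-bound:

require 'thread'

queue   = Queue.new
workers = 4.times.map do
  Thread.new do
    while (path = queue.pop)  # nil is the stop signal
      extract(path)
    end
  end
end

Dir.glob(File.join(directory, '**', '*')) do |path|
  queue << path if File.file?(path)
end

workers.size.times { queue << nil }
workers.each(&:join)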
I don't think there's a way to speed up your start method much; it does the right thing, walking through your files and processing them as soon as it encounters them. You could probably simplify it with a single Dir.glob call, but it would still be slow. I suspect that this is not where most of the time is spent.
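A sketch of what that single-glob version might look like (note that glob treats dot files and symlinks a bit differently from Dir.foreach):

Dir.glob(File.join(directory, '**', '*')) do |path|
  extract(path) if File.file?(path)
end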
There may very well be a way to speed up your extract method, but it's impossible to know without seeing the code.
The other way to speed this up might be to split the processing across multiple processes. Since reading and writing are probably what is slowing you down, this would let the Ruby code execute while another process is waiting on I/O.
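A rough sketch of that idea, assuming the work splits cleanly by top-level subdirectory (start and directory are from the question, the worker count of 4 is arbitrary, and fork is Unix-only):

top_dirs = Dir.children(directory)
              .map { |d| File.join(directory, d) }
              .select { |p| File.directory?(p) }

top_dirs.each_slice((top_dirs.size / 4.0).ceil) do |slice|
  fork do
    slice.each { |d| start(d) }  # each child processes its own slice of directories
  end
end
Process.waitall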

Fastest way to read user input in Ruby

I have some time now, so I am doing some challenges from SPOJ in Ruby. One thing that bothers me is how to read user input faster.
For example, this problem: http://www.spoj.com/problems/TEST/
I have tried many solutions, all based on gets:
while ((i = STDIN.gets.to_i) != 42) do
  puts i
end
$stdin.each_line do |line|
  exit if line.strip! == "42"
  puts line
end
def input
  while (true)
    gets
    exit if ($_.chomp == "42")
    puts $_.chomp
  end
end
input
and other variations with gets. The best time I get is 0.01 s with a memory footprint of 7.2 MB. But looking at the best submissions using Ruby, the first 5 pages are all 0.00 s and 3.1 MB of memory used.
Any idea how I can get the input faster?
Also, all the tests there pass the test cases to the app via STDIN, some very large (hundreds of MB), and I suspect that gets is too slow for reading that kind of input (or chomp might be). Is some other way faster than gets?
I can get 0.0 times by keeping the code simple and knowing what's going to be a faster way to do things. I won't show my code, because people are supposed to figure out the problem on their own:
while ((i = STDIN.gets.to_i) != 42) do
  puts i
end
Ugh. Don't convert the retrieved string to an integer just to compare it to 42. Compare to '42' instead; remember, though, that you could be getting trailing line endings on the strings.
$stdin.each_line do |line|
  exit if line.strip! == "42"
  puts line
end
Again, don't strip; use a smarter comparison instead. Also, strip! can bite you: if nothing changed in the string, it returns nil instead of the string you're expecting. I'd use strip instead, because it is guaranteed to return the expected value.
def input
  while (true)
    gets
    exit if ($_.chomp == "42")
    puts $_.chomp
  end
end
input
Two chomp calls are costly. Chomp once, in place with $_.chomp!, and reuse the result.
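A minimal sketch of that change (chomp once, then reuse $_):

def input
  while gets
    $_.chomp!
    exit if $_ == "42"
    puts $_
  end
end
input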
Something to know: Ruby's regular expression engine is very fast, and an anchored regular expression pattern will outrun a plain in-string search.
Test your code variations using Ruby's Benchmark class and you can narrow down which differences help.
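For instance, here is a rough Benchmark sketch comparing ways to spot the terminating "42" line; the synthetic data and the anchored pattern are assumptions, not the answerer's code:

require 'benchmark'

lines = Array.new(500_000) { "#{rand(1000) + 100}\n" } << "42\n"

Benchmark.bm(12) do |x|
  x.report('to_i == 42')  { lines.each { |l| break if l.to_i == 42 } }
  x.report('chomp == 42') { lines.each { |l| break if l.chomp == "42" } }
  x.report('anchored re') { lines.each { |l| break if l =~ /\A42$/ } }
end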
I have one version that uses a loop, very similar to your middle solution, and another that's a single line of code. Both were 0.0 sec.

How can I handle large files in Ruby?

I'm pretty new to programming, so be gentle. I'm trying to extract ISBN numbers from a library database .dat file. I have written code that works, but it only searches through about half of the 180 MB file. How can I adjust it to search the whole file? Or how can I write a program that will split the .dat file into manageable chunks?
edit: Here's my code:
export = File.new("resultsfinal.txt","w+")
File.open("bibrec2.dat").each do |line|
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
You should try catching exceptions to check whether the problem is really in the read block or not.
For what it's worth, I have already written a script with roughly the same structure that searches a really big file of ~8 GB without problems.
export = File.new("resultsfinal.txt","w+")
File.open("bibrec2.dat").each do |line|
  begin
    line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
      export.puts x
    end
    line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
      export.puts x
    end
  rescue
    puts "Problem while adding the result"
  end
end
The main thing is to clean up and combine the regexes for a performance benefit. Also, you should always use block syntax with files to ensure the file descriptors get closed properly. File#each doesn't load the whole file into memory; it reads one line at a time:
File.open("resultsfinal.txt","w+") do |output|
File.open("bibrec2.dat").each do |line|
output.puts line.scan(/a[\dxX]{10}(?:[\dxX]{3}|\W)/)
end
end
file = File.new("bibrec2.dat", "r")
while (line = file.gets)
  line.scan(/[a]{1}[1234567890xX]{10}\W/) do |x|
    export.puts x
  end
  line.scan(/[a]{1}[1234567890xX]{13}/) do |x|
    export.puts x
  end
end
file.close
As to the performance issue, I can't see anything particularly worrying about the file size: 180MB shouldn't pose any problems. What happens to memory use when you're running your script?
I'm not sure, however, that your Regular Expressions are doing what you want. This, for example:
/[a]{1}[1234567890xX]{10}\W/
does (I think) this:
one "a". Do you really want to match for an "a"? "a" would suffice, rather than "[a]{1}", in that case.
exactly 10 of (digit or "x" or "X")
a single "non-word" character i.e. not a-z, A-Z, 0-9 or underscore
There are a couple of sample ISBN matchers here and here, although they seem to be matching something more like the format that we see on the back cover of a book and I'm guessing your input file has stripped out some of that formatting.
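For reference, here is a rough, unofficial sketch of matchers for bare (unhyphenated) ISBN-10 and ISBN-13 strings, not taken from those links; line stands for one line read from the file:

ISBN10 = /\b\d{9}[\dXx]\b/
ISBN13 = /\b97[89]\d{10}\b/
line.scan(Regexp.union(ISBN13, ISBN10))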
You could look into using File#truncate and IO#seek and employ a binary-search-style approach. Note that #truncate is destructive, so you should duplicate the file first (I know this is a hassle).
middle = File.new("my_huge_file.dat").size / 2

tmpfile = File.new("my_huge_file.dat", "r+").truncate(middle)
# run search algorithm on 'tmpfile'

File.open("my_huge_file.dat") do |huge_file|
  huge_file.seek(middle + 1)
  # run search algorithm from here
end
The code is highly untested, brittle, and incomplete, but I hope it gives you a platform to build off of.
If you are programming on a modern operating system and the computer has enough memory (say 512 MB), Ruby should have no problem reading the entire file into memory.
Things typically get iffy when you get to about a 2 GB working set on a typical 32-bit OS.
