Recursive method performance in Ruby - ruby

I have following recursive function written in Ruby, however I find that the method is running too slowly. I am unsure if this the correct way to do it, so please suggest how to improve the performance of this code. The total file count including the subdirectories is 4,535,347
def start(directory)
Dir.foreach(directory) do |file|
next if file == '.' or file == '..'
full_file_path = "#{directory}/#{file}"
if File.directory?(full_file_path)
start(full_file_path)
elsif File.file?(full_file_path)
extract(full_file_path)
else
raise "Unexpected input type neither file nor folder"
end
end

With 4.5M directories, you might be better off working with a specialized lazy enumerator so as to only process entries you actually need, rather than generating each and every one of those 4.5M lists, returning the entire set and iterating through it in entirety.
Here's the example from the docs:
class Enumerator::Lazy
def filter_map
Lazy.new(self) do |yielder, *values|
result = yield *values
yielder << result if result
end
end
end
(1..Float::INFINITY).lazy.filter_map{|i| i*i if i.even?}.first(5)
http://ruby-doc.org/core-2.1.1/Enumerator/Lazy.html
It's not a very good example, btw: the important part is Lazy.new() rather than the fact that Enumerator::Lazy gets monkey patched. Here's a much better example imho:
What's the best way to return an Enumerator::Lazy when your class doesn't define #each?
Further reading on the topic:
http://patshaughnessy.net/2013/4/3/ruby-2-0-works-hard-so-you-can-be-lazy
Another option you might want to consider is computing the list across multiple threads.

I don't think there's a way to speed up much your start method; it does the correct things of going through your files and processing them as soon as it encounters them. You can probably simplify it with a single Dir.glob do, but it will still be slow. I suspect that this is not were most of the time is spent.
There very well might be a way to speed up your extract method, impossible to know without the code.
The other way to speed this up might be to split the processing to multiple processes. Since reading & writing is probably what is slowing you down, this way would give you hope that the ruby code executes while another process is waiting for the IO.

Related

Handle ARGV in Ruby without if...else block

In a blog post about unconditional programming Michael Feathers shows how limiting if statements can be used as a tool for reducing code complexity.
He uses a specific example to illustrate his point. Now, I've been thinking about other specific examples that could help me learn more about unconditional/ifless/forless programming.
For example in this cat clone there is an if..else block:
#!/usr/bin/env ruby
if ARGV.length > 0
ARGV.each do |f|
puts File.read(f)
end
else
puts STDIN.read
end
It turns out ruby has ARGF which makes this program much simpler:
#!/usr/bin/env ruby
puts ARGF.read
I'm wondering if ARGF didn't exist how could the above example be refactored so there is no if..else block?
Also interested in links to other illustrative specific examples.
Technically you can,
inputs = { ARGV => ARGV.map { |f| File.open(f) }, [] => [STDIN] }[ARGV]
inputs.map(&:read).map(&method(:puts))
Though that's code golf and too clever for its own good.
Still, how does it work?
It uses a hash to store two alternatives.
Map ARGV to an array of open files
Map [] to an array with STDIN, effectively overwriting the ARGV entry if it is empty
Access ARGV in the hash, which returns [STDIN] if it is empty
Read all open inputs and print them
Don't write that code though.
As mentioned in my answer to your other question, unconditional programming is not about avoiding if expressions at all costs but about striving for readable and intention revealing code. And sometimes that just means using an if expression.
You can't always get rid of a conditional (maybe with an insane number of classes) and Michael Feathers isn't advocating that. Instead it's sort of a backlash against overuse of conditionals. We've all seen nightmare code that's endless chains of nested if/elsif/else and so has he.
Moreover, people do routinely nest conditionals inside of conditionals. Some of the worst code I've ever seen is a cavernous nightmare of nested conditions with odd bits of work interspersed within them. I suppose that the real problem with control structures is that they are often mixed with the work. I'm sure there's some way that we can see this as a form of single responsibility violation.
Rather than slavishly try to eliminate the condition, you could simplify your code by first creating an array of IO objects from ARGV, and use STDIN if that list is empty.
io = ARGV.map { |f| File.new(f) };
io = [STDIN] if !io.length;
Then your code can do what it likes with io.
While this has strictly the same number of conditionals, it eliminates the if/else block and thus a branch: the code is linear. More importantly, since it separates gathering data from using it, you can put it in a function and reuse it further reducing complexity. Once it's in a function, we can take advantage of early return.
# I don't have a really good name for this, but it's a
# common enough idiom. Perl provides the same feature as <>
def arg_files
return ARGV.map { |f| File.new(f) } if ARGV.length;
return [STDIN];
end
Now that it's in a function, your code to cat all the files or stdin becomes very simple.
arg_files.each { |f| puts f.read }
First, although the principle is good, you have to consider other things that are more importants such as readability and perhaps speed of execution.
That said, you could monkeypatch the String class to add a read method and put STDIN and the arguments in an array and start reading from the beginning until the end of the array minus 1, so stopping before STDIN if there are arguments and go on until -1 (the end) if there are no arguments.
class String
def read
File.read self if File.exist? self
end
end
puts [*ARGV, STDIN][0..ARGV.length-1].map{|a| a.read}
Before someone notices that I still use an if to check if a File exists, you should have used two if's in your example to check this also and if you don't, use a rescue to properly inform the user.
EDIT: if you would use the patch, read about the possible problems at these links
http://blog.jayfields.com/2008/04/alternatives-for-redefining-methods.html
http://www.justinweiss.com/articles/3-ways-to-monkey-patch-without-making-a-mess/
Since the read method isn't part of String the solutions using alias and super are not necessary, if you plan to use a Module, here is how to do that
module ReadString
def read
File.read self if File.exist? self
end
end
class String
include ReadString
end
EDIT: just read about a safe way to monkey patch, for your documentation see https://solidfoundationwebdev.com/blog/posts/writing-clean-monkey-patches-fixing-kaminari-1-0-0-argumenterror-comparison-of-fixnum-with-string-failed?utm_source=rubyweekly&utm_medium=email

Loop collapsing in Ruby - iterating through combinations of two ranges

I have a code that iterates through two ranges. Please see the example below:
(0..6).each do |wday|
(0..23).each do |hour|
p [wday,hour]
end
end
Although this seems very concise and readable, sometimes 5 lines can be too much. One might want to write a more vertically compact code.
(0..6).to_a.product((0..23).to_a).each do |wday,hour|
p [wday, hour]
end
Above was my try, but the code looks very artificial to me. Am I missing something? Does ruby have a preferred way for this type of loop collapsing? If not, are there other alternatives to this workaround?
The following is slightly cleaner version of your loop:
[*0..6].product([*0..23]).each do |wday,hour|
p [wday, hour]
end
This approach does have the disadvantage of expanding the ranges into memory.
I think my preferred way of "collapsing" loops though, especially if the specific loop structure occurs in multiple places, is to turn the nested loops into a method that takes a block and yields to it. E.g.
def for_each_hour_in_week
(0..6).each do |wday|
(0..23).each do |hour|
yield wday,hour
end
end
end
for_each_hour_in_week do |wday,hour|
p [wday,hour]
end
This keeps the deep nesting out of the way of your logic, and makes your intent clear.

Why are else statements discouraged in Ruby?

I was looking for a Ruby code quality tool the other day, and I came across the pelusa gem, which looks interesting. One of the things it checks for is the number of else statements used in a given Ruby file.
My question is, why are these bad? I understand that if/else statements often add a great deal of complexity (and I get that the goal is to reduce code complexity) but how can a method that checks two cases be written without an else?
To recap, I have two questions:
1) Is there a reason other than reducing code complexity that else statements could be avoided?
2) Here's a sample method from the app I'm working on that uses an else statement. How would you write this without one? The only option I could think of would be a ternary statement, but there's enough logic in here that I think a ternary statement would actually be more complex and harder to read.
def deliver_email_verification_instructions
if Rails.env.test? || Rails.env.development?
deliver_email_verification_instructions!
else
delay.deliver_email_verification_instructions!
end
end
If you wrote this with a ternary operator, it would be:
def deliver_email_verification_instructions
(Rails.env.test? || Rails.env.development?) ? deliver_email_verification_instructions! : delay.deliver_email_verification_instructions!
end
Is that right? If so, isn't that way harder to read? Doesn't an else statement help break this up? Is there another, better, else-less way to write this that I'm not thinking of?
I guess I'm looking for stylistic considerations here.
Let me begin by saying that there isn't really anything wrong with your code, and generally you should be aware that whatever a code quality tool tells you might be complete nonsense, because it lacks the context to evaluate what you are actually doing.
But back to the code. If there was a class that had exactly one method where the snippet
if Rails.env.test? || Rails.env.development?
# Do stuff
else
# Do other stuff
end
occurred, that would be completely fine (there are always different approaches to a given thing, but you need not worry about that, even if programmers will hate you for not arguing with them about it :D).
Now comes the tricky part. People are lazy as hell, and thusly code snippets like the one above are easy targets for copy/paste coding (this is why people will argue that one should avoid them in the first place, because if you expand a class later you are more likely to just copy and paste stuff than to actually refactor it).
Let's look at your code snippet as an example. I'm basically proposing the same thing as #Mik_Die, however his example is equally prone to be copy/pasted as yours. Therefore, would should be done (IMO) is this:
class Foo
def initialize
#target = (Rails.env.test? || Rails.env.development?) ? self : delay
end
def deliver_email_verification_instructions
#target.deliver_email_verification_instructions!
end
end
This might not be applicable to your app as is, but I hope you get the idea, which is: Don't repeat yourself. Ever. Every time you repeat yourself, not only are you making your code less maintainable, but as a result also more prone to errors in the future, because one or even 99/100 occurrences of whatever you've copied and pasted might be changed, but the one remaining occurrence is what causes the #disasterOfEpicProportions in the end :)
Another point that I've forgotten was brought up by #RayToal (thanks :), which is that if/else constructs are often used in combination with boolean input parameters, resulting in constructs such as this one (actual code from a project I have to maintain):
class String
def uc(only_first=false)
if only_first
capitalize
else
upcase
end
end
end
Let us ignore the obvious method naming and monkey patching issues here, and focus on the if/else construct, which is used to give the uc method two different behaviors depending on the parameter only_first. Such code is a violation of the Single Responsibility Principle, because your method is doing more than one thing, which is why you should've written two methods in the first place.
def deliver_email_verification_instructions
subj = (Rails.env.test? || Rails.env.development?) ? self : delay
subj.deliver_email_verification_instructions!
end

When to use blocks

I love Ruby blocks! The idea behind them is just very very neat and convenient.
I have just looked back over my code from the past week or so, which is basically every single ruby function I ever have written, and I have noticed that not a single one of them returns a value! Instead of returning values, I always use a block to pass the data back!
I have even caught myself contemplating writing a little status class which would allow me to write code like :
something.do_stuff do |status|
status.success do
# successful code
end
status.fail do
# fail code
puts status.error_message
end
end
Am I using blocks too much? Is there a time to use blocks and a time to use return values?
Are there any gotchas to be aware of? Will my huge use of blocks come and bite me sometime?
The whole thing would be more readable as:
if something.do_stuff
#successful code
else
#unsuccessful code
end
or to use a common rails idiom:
if #user.save
render :action=>:show
else
#user.errors.each{|attr,msg| logger.info "#{attr} - #{msg}" }
render :action=>:edit
end
IMHO, avoiding the return of a boolean value is overuse of code blocks.
A block makes sense if . . .
It allows code to use a resource without having to close that resource
open("fname") do |f|
# do stuff with the file
end #don't have to worry about closing the file
The calling code would have to do non-trivial computation with the result
In this case, you avoid adding the return value to calling scope. This also often makes sense with multiple return values.
something.do_stuff do |res1, res2|
if res1.foo? and res2.bar?
foo(res1)
elsif res2.bar?
bar(res2)
end
end #didn't add res1/res2 to the calling scope
Code must be called both before and after the yield
You see this in some of the rails helpers:
&lt% content_tag :div do %>
&lt%= content_tag :span "span content" %>
&lt% end -%>
And of course iterators are a great use case, as they're (considered by ruby-ists to be) prettier than for loops or list comprehensions.
Certainly not an exhaustive list, but I recommend that you don't just use blocks because you can.
This is what functional programming people call "continuation-passing style". It's a valid technique, though there are cases where it will tend to complicate things more than it's worth. It might be worth to relook some of the places where you're using it and see if that's the case in your code. But there's nothing inherently wrong with it.
I like this style. It's actually very Ruby-like, and often you'll see projects restructure their code to use this format instead of something less readable.
Returning values makes sense where returning values makes sense. If you have an Article object, you want article.title to return the title. But for this particular example of callbacks, it's stellar style, and it's good that you know how to use them. I suspect that many new to Ruby will never figure out how to do it well.

How do I quickly slice and dice large data files?

I'd like to slice and dice large datafiles, up to a gig, in a fairly quick and efficient manner. If I use something like UNIX's "CUT", it's extremely fast, even in a CYGWIN environment.
I've tried developing and benchmarking various Ruby scripts to process these files, and always end up with glacial results.
What would you do in Ruby to make this not so dog slow?
This question reminds me of Tim Bray's Wide Finder project. The fastest way he could read an Apache logfile using Ruby and figure out which articles have been fetched the most was with this script:
counts = {}
counts.default = 0
ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end
keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end
It took this code 7½ seconds of CPU, 13½ seconds elapsed, to process a million and change records, a quarter-gig or so, on last year’s 1.67Ghz PowerBook.
Why not combine them together - using cut to do what it does best and ruby to provide the glue/value add with the results from CUT? you can run shell scripts by putting them in backticks like this:
puts `cut somefile > foo.fil`
# process each line of the output from cut
f = File.new("foo.fil")
f.each{|line|
}
I'm guessing that your Ruby implementations are reading the entire file prior to processing. Unix's cut works by reading things one byte at a time and immediately dumping to an output file. There is of course some buffering involved, but not more than a few KB.
My suggestion: try doing the processing in-place with as little paging or backtracking as possible.
I doubt the problem is that ruby is reading the whole file in memory. Look at the memory and disk usage while running the command to verify.
I'd guess the main reason is because cut is written in C and is only doing one thing, so it has probably be compiled down to the very metal. It's probably not doing a lot more than calling the system calls.
However the ruby version is doing many things at once. Calling a method is much slower in ruby than C function calls.
Remember old age and trechery beat youth and skill in unix: http://ridiculousfish.com/blog/archives/2006/05/30/old-age-and-treachery/

Resources