How is Array#sort in Ruby so quick?

C# has BitArray and C has bit fields, but I couldn't find an equivalent in the Ruby core. Google showed me a BitField class that Peter Cooper has written for this purpose.
I've been reading Jon Bentley's Programming Pearls, and while trying out one of the first examples, which deals with a bitmap sort, I needed a type that is an array of bits. I used Peter's class:
class BitMapSort
  def self.sort(list, value_range_max)
    bitmap = BitField.new(value_range_max)
    list.each { |x| bitmap[x] = 1 }
    sorted_list = []
    0.upto(value_range_max - 1) { |number|
      sorted_list << number if bitmap[number] == 1
    }
    sorted_list
  end
end
Running this on a set of 1M unique numbers in the range [0, 10,000,000) produced some interesting results:
user system total real
bitmap 11.078000 0.015000 11.093000 ( 11.156250)
ruby-system-sort 0.531000 0.000000 0.531000 ( 0.531250)
quick-sort 21.562000 0.000000 21.562000 ( 21.625000)
Benchmark.bm(20) { |x|
  x.report("bitmap")           { ret = BitMapSort.sort(series, 10_000_000) }
  x.report("ruby-system-sort") { ret = series.sort }
  x.report("quick-sort")       { ret = QuickSort.sort(series, 0, series.length - 1) }
}
How is Ruby's default sort 22x faster than 1M BitField set calls plus one loop over a 10M-bit vector? Is there a more efficient bit field / array in Ruby? How does Ruby's default sort achieve this level of performance... is it jumping into C to get this done?

Array#sort is implemented in C, see rb_ary_sort in array.c
It also has some checks to compare Fixnums so sorting an array of Integers doesn't even need method lookups.
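As a rough illustration of that fast path (a sketch of my own, not from this answer): forcing an explicit comparison block makes every comparison go through a Ruby block call, whereas the bare sort can compare the Integers at the C level.
require 'benchmark'

series = Array.new(1_000_000) { rand(10_000_000) }

Benchmark.bm(16) do |x|
  x.report("sort")            { series.sort }                     # C-level Integer comparison
  x.report("sort with block") { series.sort { |a, b| a <=> b } }  # Ruby block call per comparison
end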

How does Ruby's default sort achieve this level of performance... is it jumping into C to get this done?
All of the core classes and methods in Ruby's default implementation are implemented in C.

The reason it's so much faster is most likely that Array#sort is implemented in C inside the Ruby interpreter itself.

I think the real problem here is you're doing 10M comparisons, 10M array fetches, 10M of a lot of things, whereas a properly optimized sort routine is doing far fewer operations since it is working with a fixed set of 1M items.
Basic operations such as sort are highly optimized in the Ruby VM and are difficult to beat with a pure-Ruby alternative.
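On the "more efficient bit field" question, one pure-Ruby variant worth sketching (my own illustration, not from these answers) is a byte-per-value presence map built on a preallocated String with String#setbyte / String#getbyte. It spends a byte per possible value instead of a bit, and it still has to scan all 10M slots, so it may or may not close the gap, but it avoids the per-bit Ruby arithmetic a BitField class has to do.
class ByteMapSort
  def self.sort(list, value_range_max)
    seen = "\0" * value_range_max            # one byte per possible value
    list.each { |x| seen.setbyte(x, 1) }     # mark each value as present
    sorted = []
    value_range_max.times { |n| sorted << n if seen.getbyte(n) == 1 }
    sorted
  end
end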

Jumping into C is exactly correct. Array and Hash both have C implementations of many methods in order to improve performance. Integer and Float literals also have some tricky code optimizations. When you convert them to a bitmap you lose these optimizations as well.
With a compiled language like C or Java it really makes sense to look for tricky optimization patterns; with interpreted languages the cost of interpreting each command makes this counterproductive.


Why are these two smaller loops so much slower than a single loop containing the same instructions?

Ruby v2.3.3p222
MacOS Catalina v10.15.3
Processor: 2.6GHz 6-Core Intel Core i7
I have the following performance benchmark script, which was designed to test the difference between one larger loop operation vs. two smaller loops:
require 'benchmark'

N = 10_000_000

def one_loop
  N.times do
    foo = 1+1
    bar = 2+2
  end
end

def two_loops
  N.times do
    foo = 1+1
  end
  N.times do
    bar = 2+2
  end
end

Benchmark.bmbm do |performance|
  performance.report("two smaller loops") { two_loops }
  performance.report("one large loop") { one_loop }
end
My hypothesis was that these two methods would execute in approximately the same amount of time, because (I thought) they were both executing the same number of instructions: the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations.
However, that doesn't appear to be what I observed. When I run the script, I get the following output:
Rehearsal -----------------------------------------------------
two smaller loops 0.840000 0.000000 0.840000 ( 0.838101)
one large loop 0.500000 0.010000 0.510000 ( 0.506283)
-------------------------------------------- total: 1.350000sec
user system total real
two smaller loops 0.850000 0.000000 0.850000 ( 0.863052)
one large loop 0.500000 0.000000 0.500000 ( 0.494525)
This is really disappointing, because I was hoping to convince my team that we wouldn't see any reduced performance by splitting up our 1 large loop of code into several more concise loops which each did one thing and did it well.
I thought it might be due to the order in which the reports are generated, but when I reverse the order of the two calls to performance.report, I get similarly disappointing results:
Rehearsal -----------------------------------------------------
one large loop 0.500000 0.010000 0.510000 ( 0.508246)
two smaller loops 0.850000 0.000000 0.850000 ( 0.852467)
-------------------------------------------- total: 1.360000sec
user system total real
one large loop 0.490000 0.000000 0.490000 ( 0.496130)
two smaller loops 0.830000 0.000000 0.830000 ( 0.831476)
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations
It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.
In this case, you are counting additions. And you are correct: within your model which counts only additions, both versions have the same amount of additions.
They do, however, not have the same amount of block activations.
Remember, Integer#times looks roughly like this:
class Integer
  def times
    return enum_for(__callee__) unless block_given?
    return self unless positive?
    i = -1
    yield i while (i += 1) < self
    self
  end
end
So, for each iteration of the loop, there is an activation (i.e. a yield) of the block passed to Integer#times.
If we add that as a new class of "operation", we have the following:
one_loop: 20,000,000 additions and 10,000,000 block activations
two_loops: 20,000,000 additions and 20,000,000 block activations
So, both methods have the same number of additions but two_loops has twice the number of block activations.
Which means, we also have to take into account the relative costs of additions versus block activations. Now, semantically, an addition is just a normal method call. And activating a block is sort-of similar to a method call.
So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:
one_loop: 30,000,000 "method call like operations"
two_loops: 40,000,000 "method call like operations"
In other words, we would expect two_loops to be 33% slower or one_loop to be 25% faster, depending on how you look at it.
However, we actually find that the difference is much bigger, so clearly we are missing something in our model.
What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the one you are using will be mapped directly to single CPU ADD instructions, and will not incur the overhead of a method call at all.
Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.
In fact, the relative complexity of block activations to machine-word integer additions is so big, that we can actually ignore the additions completely in our model:
one_loop: 10,000,000 block activations
two_loops: 20,000,000 block activations
This gives us a factor of 2:1, so we would expect two_loops to be 100% slower or one_loop to be 50% faster.
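A quick way to check that prediction (a sketch of my own, not part of the original answer) is to time empty blocks directly, so that block activations are essentially the only thing being measured:
require 'benchmark'

N = 10_000_000

Benchmark.bm(20) do |x|
  x.report("N empty blocks")     { N.times { } }
  x.report("2 * N empty blocks") { N.times { }; N.times { } }
end
If block activation really dominates, the ratio between the two rows should roughly match the ratio seen in the original benchmark.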
By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.
Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.
Even the most simple Ruby implementation will be able to optimize away the local variables completely: they are defined and initialized in only one place and never accessed again. They only exist within the scope of the block, for the duration of one activation, so even a very simple optimizer can see that they are completely useless and will rewrite the code to something roughly like this:
def one_loop
  N.times do
    1+1
    2+2
  end
end

def two_loops
  N.times do
    1+1
  end
  N.times do
    2+2
  end
end
Meaning that we can not only ignore the cost of the local variables because it is small compared to the other costs, but actually, the local variables don't even exist.
Likewise, a slightly smarter optimizer will recognize that the first addition in one_loop has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:
def one_loop
  N.times do
    2+2
  end
end

def two_loops
  N.times do
    1+1
  end
  N.times do
    2+2
  end
end
Furthermore, the same argument actually applies to the remaining addition. It has no side effect; all it does is get returned from the block, but Integer#times ignores the return value of the block. I haven't taken a look at the generated code, but I strongly suspect that even the most stupid optimizer can easily prove that your blocks are no-ops, and thus it will optimize the code to something roughly like this:
def one_loop
  N.times do
  end
end

def two_loops
  N.times do
  end
  N.times do
  end
end
Which means that one_loop has N iterations of the block, two_loops has 2 * N iterations, and thus should take roughly twice as long.
Now, we can see in your benchmarks that the numbers are not actually 2:1. They are 1.75:1 or roughly 7:4.
I can confirm those results on my machine, here with YARV 2.7.1 without JIT, I get almost exactly 7:4:
user system total real
two smaller loops 0.711479 0.000099 0.711578 ( 0.711680)
one large loop 0.401808 0.000059 0.401867 ( 0.401916)
However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:
user system total real
two smaller loops 0.587017 0.000279 0.587296 ( 0.587098)
one large loop 0.291713 0.000062 0.291775 ( 0.291779)
You will also notice that it is faster in general.
With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:
user system total real
two smaller loops 0.740000 0.010000 0.750000 ( 0.517670)
one large loop 0.260000 0.000000 0.260000 ( 0.263270)
This was with default options, here are the results with some more aggressive compiler flags:
user system total real
two smaller loops 0.370000 0.000000 0.370000 ( 0.362050)
one large loop 0.390000 0.010000 0.400000 ( 0.213861)
TruffleRuby 20.1.0 is again much faster than even JRuby:
user system total real
two smaller loops 0.009955 0.000039 0.009994 ( 0.010035)
one large loop 0.004759 0.000007 0.004766 ( 0.004742)
And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!
Actually, I am somewhat surprised that TruffleRuby could not prove that Integer#times with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:
def one_loop
end

def two_loops
end
And thus have no runtime difference between the two versions at all.
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
I would say all of the above.
The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.
And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.
The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.
In one of your comments, you asked:
so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?
We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same amount of operations.
Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
There are also some more problems with your benchmarking approach in general, which however are not the source of your confusion. Here are some of the problems with your benchmarks I have identified, although I am sure there are others:
Your benchmarks are written in a way that all the operations you are interested about are optimized away, leaving only operations you are not interested about.
Even if they weren't optimized away, their execution time is negligible compared to the execution time of the operations you don't care about.
Your benchmarks don't run long and often enough to give the optimizer a chance. For example, the default threshold for when a method is compiled is between 20 and 20000 calls depending on the Ruby implementation, compiler flags, etc. Both your methods are only called twice, once during rehearsal, once during the real thing. You would need to make sure they are called a lot more than 20000 times to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before they were compiled do not significantly affect the outcome.
I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:
JMH vs Caliper: reference thread
Especially the sub-thread and following discussion starting at the linked message.
While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.
There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.
You need, at least
Deep knowledge of computer organization in general.
Deep knowledge of all the specific hardware platforms that you are benchmarking on.
Deep knowledge of programming languages in general.
Deep knowledge of all the specific programming languages you are writing benchmark code in.
Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
Deep knowledge of the specific language implementations you are running your code on.
And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …
And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.
The benchmark library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.
Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as better-benchmark (no longer maintained), benchmark-ips (by Evan Phoenix, founder of Rubinius), or fruity (by Marc-André Lafortune, ruby-core team member).
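As a small illustration (a sketch of my own, not from the thread above; it requires gem install benchmark-ips), the original comparison could be rewritten with benchmark-ips, whose warmup phase at least gives a JIT a chance to compile the methods before measurement begins:
require 'benchmark/ips'

N = 10_000_000

def one_loop
  N.times do
    foo = 1+1
    bar = 2+2
  end
end

def two_loops
  N.times do
    foo = 1+1
  end
  N.times do
    bar = 2+2
  end
end

Benchmark.ips do |x|
  x.report("one large loop")    { one_loop }
  x.report("two smaller loops") { two_loops }
  x.compare!   # reports iterations per second and the relative factor
end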
This is 10 million iterations of the loop, and in each iteration two calculations are done; counting each iteration itself as an operation as well, that is a total of 30 million of what we'll call operations:
N.times do
  foo = 1+1
  bar = 2+2
end
This is 20 million iterations, and in each iteration one calculation is done; counting each iteration itself again, that is a total of 40 million of what we'll call operations:
N.times do
  foo = 1+1
end

N.times do
  bar = 2+2
end
30 < 40, therefore the first example is faster.

Ruby: Help improving hashing algorithm

I am still relatively new to Ruby as a language, but I know there are a lot of convenience methods built into the language. I am trying to generate a "hash" to check against in a low-level block-chain verifier, and I am wondering if there are any "convenience methods" that I could use to try to make this hashing algorithm more efficient. I think I can make this more efficient by utilizing Ruby's max integer size, but I'm not sure.
Below is the current code which takes in a string to hash, unpacks it into an array of UTF-8 values, does computationally intensive math to each one of those values, adds up all of those values after the math is done to them, takes that value modulo 65,536, and then returns the hex representation of that value.
def generate_hash(string)
  unpacked_string = string.unpack('U*')
  sum = 0
  unpacked_string.each do |x|
    sum += (x**2000) * ((x + 2)**21) - ((x + 5)**3)
  end
  new_val = sum % 65_536 # Gives a number from 0 to 65,535
  new_val.to_s(16)
end
On very large block-chains there is a very large performance hit which I am trying to get around. Any help would be great!
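One observation on the code itself (a sketch of my own, not taken from the answers below): since only sum % 65_536 is returned, each term can be reduced modulo 65_536 as it is computed. Integer#pow with a modulus argument (Ruby 2.5+) performs modular exponentiation without ever materialising the huge x**2000 intermediates, and modular arithmetic should leave the final digest unchanged; verify the equivalence on your own data before relying on it.
HASH_MOD = 65_536   # hypothetical constant name, same modulus as above

def generate_hash_mod(string)
  sum = 0
  string.unpack('U*').each do |x|
    # each factor is reduced mod 65_536 first, so intermediates stay small
    term = x.pow(2000, HASH_MOD) * (x + 2).pow(21, HASH_MOD) - (x + 5).pow(3, HASH_MOD)
    sum = (sum + term) % HASH_MOD
  end
  sum.to_s(16)
end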
First and foremost, it is extremely unlikely that you are going to create anything that is more efficient than simply using String#hash. This is a case of you trying to build a better mousetrap.
Honestly, your hashing algorithm is very inefficient. The entire point of a hash is to be a fast, low-overhead way of quickly getting a "unique" (as unique as possible) integer to represent any object to avoid comparing by values.
Using that as a premise, if you start doing any type of intense computation in a hash algorithm, it is already counter-productive. Once you start implementing modulo and pow functions, it is inefficient.
Usually best practice involves taking a value(s) of the object that can be represented as integers, and performing bit operations on them, typically with prime numbers to help reduce hash collisions.
def hash
  h = value1 ^ 393
  h += value2 ^ 17
  h
end
In your example, you are for some reason forcing the hash to the max value of a 16-bit unsigned integer, when typically 32 bits are used, although if you are comparing on the Ruby side this would be 31 bits due to how Ruby masks Fixnum values. Fixnum was deprecated on the Ruby side, as it should have been, but internally the same threshold exists between how a Bignum and a Fixnum are handled. The Integer class simply provides one interface on the Ruby side, as those two really should never have been exposed outside of the C code.
In your specific example using strings, I would simply symbolize them. This gives a quick and efficient way to determine whether two strings are equal with hardly any overhead, and comparing 2 symbols is the exact same as comparing 2 integers. There is a caveat to this method if you are comparing a vast number of strings: once a symbol is created, it is alive for the life of the program. Any additional string equal to it will return the same symbol, but you cannot reclaim the memory of the symbol (just a few bytes) for as long as the program runs. Not good if you are using this method to compare thousands and thousands of unique strings.
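A tiny illustration of that point (my own; the string content is made up): equal string contents always return the same interned Symbol, so the comparison is effectively an identity check.
a = "block-0001-payload".to_sym
b = "block-0001-payload".to_sym
a.equal?(b)   # => true, the very same Symbol object both times
a == b        # => true, about as cheap as comparing two integers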

Performance of Sets V.S. Arrays in Ruby

In Ruby, I am building a method which constructs and returns a (probably large) array which should contain no duplicate elements. Would I get better performance by using a set and then converting that to an array? Or would it be better to just call .uniq on the array I am using before I return it? Or what about using & to append items to the array instead of +=? And if I do use a set, would not having a <=> method on the object I am putting into the set have an effect on performance? (If you're not sure, do you know of a way to test this?)
The real answer is: write the most readable and maintainable code, and optimize it only after you've shown it is a bottleneck. If you can find an algorithm that runs in linear time, you won't have to optimize it. Here it's easy to find...
Not quite sure which methods you are suggesting, but using my fruity gem:
require 'fruity'
require 'set'

enum = 1000.times
compare do
  uniq { enum.each_with_object([]) { |x, array| array << x }.uniq }
  set  { enum.each_with_object(Set[]) { |x, set| set << x }.to_a }
  join { enum.inject([]) { |array, x| array | [x] } }
end
# set is faster than uniq by 10.0% ± 1.0%
# uniq is faster than join by 394x ± 10.0
Clearly, it makes no sense building intermediate arrays like in the third method. Otherwise, it's not going to make a big difference since you will be in O(n); that's the main thing.
BTW, Sets, uniq and Array#| all use eql? and hash on your objects, not <=>. These need to be defined in a sane manner, because the default is that objects are never eql? unless they have the same object_id (see this question).
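As a sketch of that contract (using a hypothetical Point class of my own, not from this answer), Set membership and Array#uniq only collapse duplicates once eql? and hash agree:
require 'set'

class Point
  attr_reader :x, :y

  def initialize(x, y)
    @x = x
    @y = y
  end

  def eql?(other)
    other.is_a?(Point) && x == other.x && y == other.y
  end
  alias_method :==, :eql?

  def hash
    [x, y].hash    # objects that are eql? must return the same hash
  end
end

Set[Point.new(1, 2), Point.new(1, 2)].size    # => 1, the duplicate is dropped
[Point.new(1, 2), Point.new(1, 2)].uniq.size  # => 1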
Have you tried using the Benchmark library? Tests are usually very easy to construct and will properly reflect how it works in your particular version of Ruby.

Is Scala functional programming slower than traditional coding?

In one of my first attempts to create functional code, I ran into a performance issue.
I started with a common task - multiply the elements of two arrays and sum up the results:
var first:Array[Float] ...
var second:Array[Float] ...
var sum=0f;
for (ix<-0 until first.length)
sum += first(ix) * second(ix);
Here is how I reformed the work:
sum = first.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
When I benchmarked the two approaches, the second method takes 40 times as long to complete!
Why does the second method take so much longer? How can I reform the work to be both speed efficient and use functional programming style?
The main reasons why these two examples are so different in speed are:
the faster one doesn't use any generics, so it doesn't face boxing/unboxing.
the faster one doesn't create temporary collections and, thus, avoids extra memory copies.
Let's consider the slower one by parts. First:
first.zip(second)
That creates a new array, an array of Tuple2. It will copy all elements from both arrays into Tuple2 objects, and then copy a reference to each of these objects into a third array. Now, notice that Tuple2 is parameterized, so it can't store Float directly. Instead, new instances of java.lang.Float are created for each number, the numbers are stored in them, and then a reference for each of them is stored into the Tuple2.
map{ case (a,b) => a*b }
Now a fourth array is created. To compute the values of these elements, it needs to read the reference to the tuple from the third array, read the reference to the java.lang.Float stored in them, read the numbers, multiply, create a new java.lang.Float to store the result, and then pass this reference back, which will be de-referenced again to be stored in the array (arrays are not type-erased).
We are not finished, though. Here's the next part:
reduceLeft(_+_)
That one is relatively harmless, except that it still does boxing/unboxing and java.lang.Float creation at each iteration, since reduceLeft receives a Function2, which is parameterized.
Scala 2.8 introduces a feature called specialization which will get rid of a lot of these boxing/unboxing. But let's consider alternative faster versions. We could, for instance, do map and reduceLeft in a single step:
sum = first.zip(second).foldLeft(0f) { case (a, (b, c)) => a + b * c }
We could use view (Scala 2.8) or projection (Scala 2.7) to avoid creating intermediary collections altogether:
sum = first.view.zip(second).map{ case (a,b) => a*b }.reduceLeft(_+_)
This last one doesn't save much, actually, so I think the non-strictness is being "lost" pretty fast (i.e., one of these methods is strict even in a view). There's also an alternative way of zipping that is non-strict (i.e., avoids some intermediary results) by default:
sum = (first,second).zipped.map{ case (a,b) => a*b }.reduceLeft(_+_)
This gives a much better result than the former. Better than the foldLeft one, though not by much. Unfortunately, we can't combine zipped with foldLeft because the former doesn't support the latter.
The last one is the fastest I could get. Faster than that, only with specialization. Now, Function2 happens to be specialized, but only for Int, Long and Double. The other primitives were left out, as specialization increases code size rather dramatically for each primitive. In my tests, though, Double is actually taking longer. That might be a result of it being twice the size, or it might be something I'm doing wrong.
So, in the end, the problem is a combination of factors, including producing intermediary copies of elements, and the way Java (JVM) handles primitives and generics. A similar code in Haskell using supercompilation would be equal to anything short of assembler. On the JVM, you have to be aware of the trade-offs and be prepared to optimize critical code.
I did some variations of this with Scala 2.8. The loop version is as you write it, but the functional version is slightly different:
(xs, ys).zipped map (_ * _) reduceLeft(_ + _)
I ran with Double instead of Float, because currently specialization only kicks in for Double. I then tested with arrays and vectors as the carrier type. Furthermore, I tested boxed variants which work on java.lang.Double instead of primitive Doubles, to measure the effect of primitive type boxing and unboxing. Here is what I got (running Java 1.6_10 server VM, Scala 2.8 RC1, 5 runs per test).
loopArray          461   437   436   437   435
reduceArray       6573  6544  6718  6828  6554
loopVector        5877  5773  5775  5791  5657
reduceVector      5064  4880  4844  4828  4926
loopArrayBoxed    2627  2551  2569  2537  2546
reduceArrayBoxed  4809  4434  4496  4434  4365
loopVectorBoxed   7577  7450  7456  7463  7432
reduceVectorBoxed 5116  4903  5006  4957  5122
The first thing to notice is that by far the biggest difference is between primitive array loops and primitive array functional reduce. It's about a factor of 15 instead of the 40 you have seen, which reflects improvements in Scala 2.8 over 2.7. Still, primitive array loops are the fastest of all tests whereas primitive array reduces are the slowest. The reason is that primitive Java arrays and generic operations are just not a very good fit. Accessing elements of primitive Java arrays from generic functions requires a lot of boxing/unboxing and sometimes even requires reflection. Future versions of Scala will specialize the Array class and then we should see some improvement. But right now that's what it is.
If you go from arrays to vectors, you notice several things. First, the reduce version is now faster than the imperative loop! This is because vector reduce can make use of efficient bulk operations. Second, vector reduce is faster than array reduce, which illustrates the inherent overhead that arrays of primitive types pose for generic higher-order functions.
If you eliminate the overhead of boxing/unboxing by working only with boxed java.lang.Double values, the picture changes. Now reduce over arrays is a bit less than 2 times slower than looping, instead of the 15 times difference before. That more closely approximates the inherent overhead of the three loops with intermediate data structures instead of the fused loop of the imperative version. Looping over vectors is now by far the slowest solution, whereas reducing over vectors is a little bit slower than reducing over arrays.
So the overall answer is: it depends. If you have tight loops over arrays of primitive values, nothing beats an imperative loop. And there's no problem writing the loops because they are neither longer nor less comprehensible than the functional versions. In all other situations, the FP solution looks competitive.
This is a microbenchmark, and it depends on how the compiler optimizes your code. You have 3 loops composed here:
zip . map . fold
Now, I'm fairly sure the Scala compiler cannot fuse those three loops into a single loop, and the underlying data type is strict, so each (.) corresponds to an intermediate array being created. The imperative/mutable solution would reuse the buffer each time, avoiding copies.
Now, an understanding of what composing those three functions means is key to understanding performance in a functional programming language -- and indeed, in Haskell, those three loops will be optimized into a single loop that reuses an underlying buffer -- but Scala cannot do that.
There are benefits to sticking to the combinator approach, however -- by distinguishing those three functions, it will be easier to parallelize the code (replace map with parMap etc). In fact, given the right array type, (such as a parallel array) a sufficiently smart compiler will be able to automatically parallelize your code, yielding more performance wins.
So, in summary:
naive translations may have unexpected copies and inefficiencies
clever FP compilers remove this overhead (but Scala can't yet)
sticking to the high level approach pays off if you want to retarget your code, e.g. to parallelize it
Don Stewart has a fine answer, but it might not be obvious how going from one loop to three creates a factor of 40 slowdown. I'll add to his answer that Scala compiles to JVM bytecodes, and not only does the Scala compiler not fuse the three loops into one, but it is almost certainly allocating all the intermediate arrays. Notoriously, implementations of the JVM are not designed to handle the allocation rates required by functional languages. Allocation is a significant cost in functional programs, and that is one reason the loop-fusion transformations that Don Stewart and his colleagues have implemented for Haskell are so powerful: they eliminate lots of allocations. When you don't have those transformations, plus you're using an expensive allocator such as is found on a typical JVM, that's where the big slowdown comes from.
Scala is a great vehicle for experimenting with the expressive power of an unusual mix of language ideas: classes, mixins, modules, functions, and so on. But it's a relatively young research language, and it runs on the JVM, so it's unreasonable to expect great performance except on the kind of code that JVMs are good at. If you want to experiment with the mix of language ideas that Scala offers, great—it's a really interesting design—but don't expect the same performance on pure functional code that you'd get with a mature compiler for a functional language, like GHC or MLton.
Is scala functional programming slower than traditional coding?
Not necessarily. Stuff to do with first-class functions, pattern matching, and currying need not be especially slow. But with Scala, more than with other implementations of other functional languages, you really have to watch out for allocations—they can be very expensive.
The Scala collections library is fully generic, and the operations provided are chosen for maximum capability, not maximum speed. So, yes, if you use a functional paradigm with Scala without paying attention (especially if you are using primitive data types), your code will take longer to run (in most cases) than if you use an imperative/iterative paradigm without paying attention.
That said, you can easily create non-generic functional operations that perform quickly for your desired task. In the case of working with pairs of floats, we might do the following:
class FastFloatOps(a: Array[Float]) {
  def fastMapOnto(f: Float => Float) = {
    var i = 0
    while (i < a.length) { a(i) = f(a(i)); i += 1 }
    this
  }

  def fastMapWith(b: Array[Float])(f: (Float, Float) => Float) = {
    val len = a.length min b.length
    val c = new Array[Float](len)
    var i = 0
    while (i < len) { c(i) = f(a(i), b(i)); i += 1 }
    c
  }

  def fastReduce(f: (Float, Float) => Float) = {
    if (a.length == 0) Float.NaN
    else {
      var r = a(0)
      var i = 1
      while (i < a.length) { r = f(r, a(i)); i += 1 }
      r
    }
  }
}
implicit def farray2fastfarray(a: Array[Float]) = new FastFloatOps(a)
and then these operations will be much faster. (Faster still if you use Double and 2.8.RC1, because then the functions (Double,Double)=>Double will be specialized, not generic; if you're using something earlier, you can create your own abstract class F { def f(a: Float) : Float } and then call with new F { def f(a: Float) = a*a } instead of (a: Float) => a*a.)
Anyway, the point is that it's not the functional style that makes functional coding in Scala slow, it's that the library is designed with maximum power/flexibility in mind, not maximum speed. This is sensible, since each person's speed requirements are typically subtly different, so it's hard to cover everyone supremely well. But if it's something you're doing more than just a little, you can write your own stuff where the performance penalty for a functional style is extremely small.
I am not an expert Scala programmer, so there is probably a more efficient method, but what about something like this. This can be tail call optimized, so performance should be OK.
def multiply_and_sum(l1: List[Int], l2: List[Int], sum: Int): Int = {
  if (l1 != Nil && l2 != Nil) {
    multiply_and_sum(l1.tail, l2.tail, sum + (l1.head * l2.head))
  }
  else {
    sum
  }
}
val first = Array(1,2,3,4,5)
val second = Array(6,7,8,9,10)
multiply_and_sum(first.toList, second.toList, 0) //Returns: 130
To answer the question in the title: Simple functional constructs may be slower than imperative on the JVM.
But, if we consider only simple constructs, then we might as well throw out all modern languages and stick with C or assembler. If you look at the programming language shootout, C always wins.
So why choose a modern language? Because it lets you express a cleaner design. Cleaner design leads to performance gains in the overall operation of the application, even if some low-level methods can be slower. One of my favorite examples is the performance of BuildR vs. Maven. BuildR is written in Ruby, an interpreted, slow language. Maven is written in Java. A build in BuildR is twice as fast as Maven. This is due mostly to the design of BuildR, which is lightweight compared with that of Maven.
Your functional solution is slow because it is generating unnecessary temporary data structures. Removing these is known as deforesting and it is easily done in strict functional languages by rolling your anonymous functions into a single anonymous function and using a single aggregator. For example, your solution written in F# using zip, map and reduce:
let dot xs ys = Array.zip xs ys |> Array.map (fun (x, y) -> x * y) |> Array.reduce ( * )
may be rewritten using fold2 so as to avoid all temporary data structures:
let dot xs ys = Array.fold2 (fun t x y -> t + x * y) 0.0 xs ys
This is a lot faster and the same transformation can be done in Scala and other strict functional languages. In F#, you can also define the fold2 as inline in order to have the higher-order function inlined with its functional argument whereupon you recover the optimal performance of the imperative loop.
Here is dbyrnes' solution with arrays (assuming arrays are to be used) and just iterating over the index:
def multiplyAndSum(l1: Array[Int], l2: Array[Int]): Int = {
  def productSum(idx: Int, sum: Int): Int =
    if (idx < l1.length)
      productSum(idx + 1, sum + (l1(idx) * l2(idx)))
    else
      sum

  if (l2.length == l1.length)
    productSum(0, 0)
  else
    error("lengths don't fit " + l1.length + " != " + l2.length)
}

val first = (1 to 500).map (_ * 1.1) toArray
val second = (11 to 510).map (_ * 1.2) toArray

def loopi(n: Int) = (1 to n).foreach(dummy => multiplyAndSum(first, second))
println(timed(loopi(100 * 1000)))
That needs about 1/40 of the time of the list approach. I don't have 2.8 installed, so you have to test @tailrec yourself. :)

What's the most efficient way to partition a large hash into N smaller hashes in Ruby?

The Problem
I'm working on a problem that involves sharding. As part of the problem I need to find the fastest way to partition a large Ruby hash (> 200,000 entries) into two or more pieces.
Are there any non O(n) approaches?
Is there a non-Ruby i.e. C/C++ implementation?
Please don't reply with examples using the trivial approach of converting the hash to an array and rebuilding N distinct hashes.
My concern is that Ruby is too slow to do this kind of work.
The initial approach
This was the first solution I tried. What was appealing about it was:
it didn't need to loop slavishly across the hash
it didn't need to manage a counter to allocate the members evenly among the shards.
it's short and neat looking
Ok, it isn't O(n) but it relies on methods in the standard library which I figured would be faster than writing my own Ruby code.
pivot = s.size / 2
slices = s.each_slice(pivot)
s1 = Hash[*slices.entries[0].flatten]
s2 = Hash[*slices.entries[1].flatten]
A better solution
Mark and Mike were kind enough to suggest approaches. I have to admit that Mark's approach felt wrong - it did exactly what I didn't want - it looped over all of the members of the hash and evaluated a conditional as it went - but since he'd taken the time to do the evaluation, I figured that I should try a similar approach and benchmark it. This is my adapted version of his approach (my keys aren't numbers so I can't take his approach verbatim):
def split_shard(s)
  shard1 = {}
  shard2 = {}

  t = Benchmark.measure do
    n = 0
    pivot = s.size / 2
    s.each_pair do |k, v|
      if n < pivot
        shard1[k] = v
      else
        shard2[k] = v
      end
      n += 1
    end
  end

  $b += t.real
  $e += s.size
  return shard1, shard2
end
The results
In both cases, a large number of hashes are split into shards. The total number of elements across all of the hashes in the test data set was 1,680,324.
My initial solution - which I assumed had to be faster because it uses methods in the standard library and minimises the amount of Ruby code (no loop, no conditional) - runs in just over 9s
Mark's approach runs in just over 5s
That's a significant win
Take away
Don't be fooled by 'intuition' - measure the performance of competing algorithms
Don't worry about Ruby's performance as a language - my initial concern was that doing ten million of these operations could take a significant amount of time in Ruby, but it doesn't really.
Thanks to Mark and Mike who both get points from me for their help.
Thanks!
I don't see how you can achieve this using an unmodified "vanilla" Hash - I'd expect that you'd need to get into the internals in order to make partitioning into some kind of bulk memory-copying operation. How good is your C?
I'd be more inclined to look into partitioning instead of creating a Hash in the first place, especially if the only reason for the 200K-item Hash existing in the first place is to be subdivided.
EDIT: After thinking about it at the gym...
The problem with finding some existing solution is that someone else needs to have (a) experienced the pain, (b) had the technical ability to address it and (c) felt community-friendly enough to have released it into the wild. Oh, and for your OS platform.
What about using a B-Tree instead of a Hash? Hold your data sorted by key and it can be traversed by memcpy(). B-Tree retrieval is O(log N), which isn't much of a hit against Hash most of the time.
I found something here which might help, and I'd expect there'd only be a little duck-typing wrapper needed to make it quack like a Hash.
Still gonna need those C/C++ skills, though. (Mine are hopelessly rusty).
This probably isn't fast enough for your needs (which sound like they'll require an extension in C), but perhaps you could use Hash#select?
I agree with Mike Woodhouse's idea. Is it possible for you to construct your shards at the same place where the original 200k-item hash is being constructed? If the items are coming out of a database, you could split your query into multiple disjoint queries, based either on some aspect of the key or by repeatedly using something like LIMIT 10000 to grab a chunk at a time.
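A rough sketch of that idea (my own; load_source is a purely illustrative stand-in for whatever produces the key/value pairs, such as a series of LIMIT-ed queries): fill the shards directly as the data arrives, so the single 200,000-entry hash never has to exist.
NUM_SHARDS = 2

def load_source
  200_000.times { |i| yield "key-#{i}", i }    # stand-in data source
end

shards = Array.new(NUM_SHARDS) { {} }
load_source do |key, value|
  shards[key.hash % NUM_SHARDS][key] = value   # route each pair to a shard
end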
Additional
Hi Chris, I just compared your approach to using Hash#select:
require 'benchmark'

s = {}
1.upto(200_000) { |i| s[i] = i }

Benchmark.bm do |x|
  x.report {
    pivot = s.size / 2
    slices = s.each_slice(pivot)
    s1 = Hash[*slices.entries[0].flatten]
    s2 = Hash[*slices.entries[1].flatten]
  }
  x.report {
    # the Hash#select approach described below
    s1 = s.select { |k, v| k < 100_001 }
    s2 = s.select { |k, v| k >= 100_001 }
  }
end
It looks like Hash#select is much faster, even though it goes through the entire large hash for each one of the sub-hashes:
# ruby test.rb
user system total real
0.560000 0.010000 0.570000 ( 0.571401)
0.320000 0.000000 0.320000 ( 0.323099)
Hope this helps.
