Ruby's `lazy` less performant than allocating huge lists? - ruby

I'm trying to process a large list of numbers:
require 'benchmark'
N = 999999
Benchmark.bm 10 do |bm|
bm.report 'Eager:' do
(0..N).select(&:even?).map{|x| x * x}.reduce(&:+)
end
bm.report 'Lazy:' do
(0..N).lazy.select(&:even?).map{|x| x * x}.reduce(&:+)
end
end;
To my understanding, the lazy version should be much faster, because the eager version needs to allocate two lists of half a million items each(one for select and one for map) while the lazy version is streaming everything.
However, when I run it, the lazy version takes more than twice as long as the eager one! (http://rextester.com/OTEX7399)
user system total real
Eager: 0.210000 0.010000 0.220000 ( 0.216572)
Lazy: 0.580000 0.000000 0.580000 ( 0.635091)
How can it be?

I'd say an Enumerator is a far slower middle man than memory is.
This was also reported a while back and Ruby core team member Yusuke Endoh said:
Enumerator::Lazy is not a silver bullet; it removes the overhead for
creating an intermediate array, but brings the drawback for calling
a block. Unfortunately, the latter is much bigger than the former.
Thus, in general, Lazy does bring performance drawback.
An analogy I just thought of: Imagine you're building some furniture for a friend.
Non-lazy: You build the whole thing, rent a truck, and drive it to your friend.
Lazy: You build a little piece and drive it to your friend with your car. You build the next little piece and drive it to your friend with your car. You build the next little piece and drive it to your friend with your car. And so on.
Yes, renting that truck is extra overhead, but it's nothing compared to driving over and over again with your car.
The real reason that being lazy can save time is that after the first few pieces your friend finds out that you slept with his wife so now you're no longer friends and he doesn't want your stupid furniture anymore and you're not building the remaining pieces at all.

Related

Why are these two smaller loops so much slower than a single loop containing the same instructions?

Ruby v2.3.3p222
MacOS Catalina v10.15.3
Processor: 2.6GHz 6-Core Intel Core i7
I have the following performance benchmark script, which was designed to test the difference between one larger loop operation vs. two smaller loops:
require 'benchmark'
N = 10_000_000
def one_loop
N.times do
foo = 1+1
bar = 2+2
end
end
def two_loops
N.times do
foo = 1+1
end
N.times do
bar = 2+2
end
end
Benchmark.bmbm do |performance|
performance.report("two smaller loops") { two_loops }
performance.report("one large loop") { one_loop }
end
My hypothesis was that these two methods would execute in approximately the same amount of time, because (I thought) they were both executing the same number of instructions: the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations.
However, that doesn't appear to be what I observed. When I run the script, I get the following output:
Rehearsal -----------------------------------------------------
two smaller loops 0.840000 0.000000 0.840000 ( 0.838101)
one large loop 0.500000 0.010000 0.510000 ( 0.506283)
-------------------------------------------- total: 1.350000sec
user system total real
two smaller loops 0.850000 0.000000 0.850000 ( 0.863052)
one large loop 0.500000 0.000000 0.500000 ( 0.494525)
This is really disappointing, because I was hoping to convince my team that we wouldn't see any reduced performance by splitting up our 1 large loop of code into several more concise loops which each did one thing and did it well.
I thought it might be due to the order in which the reports are generated, but when I reverse the order of the two calls to performance.report, I get similarly disappointing results:
Rehearsal -----------------------------------------------------
one large loop 0.500000 0.010000 0.510000 ( 0.508246)
two smaller loops 0.850000 0.000000 0.850000 ( 0.852467)
-------------------------------------------- total: 1.360000sec
user system total real
one large loop 0.490000 0.000000 0.490000 ( 0.496130)
two smaller loops 0.830000 0.000000 0.830000 ( 0.831476)
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
the larger loop was doing 2 * 10,000,000 operations, while each of the 2 smaller loops was doing 1 * 10,000,000 operations
It doesn't make sense to talk about "operations" without defining the machine model and cost model within which we model those "operations". Or, to put it simply: it doesn't make sense to count things until you are clear what you are counting.
In this case, you are counting additions. And you are correct: within your model which counts only additions, both versions have the same amount of additions.
They do, however, not have the same amount of block activations.
Remember, Integer#times looks roughly like this:
class Integer
def times
return enum_for(__callee__) unless block_given?
return self unless positive?
i = -1
yield i while (i += 1) < self
self
end
end
So, for each iteration of the loop, there is an activation (i.e. a yield) of the block passed to Integer#times.
If we add that as a new class of "operation", we have the following:
one_loop: 20,000,000 additions and 10,000,000 block activations
two_loops: 20,000,000 additions and 20,000,000 block activations
So, both methods have the same number of additions but two_loops has twice the number of block activations.
Which means, we also have to take into account the relative costs of additions versus block activations. Now, semantically, an addition is just a normal method call. And activating a block is sort-of similar to a method call.
So, we would expect that an addition and a block activation have roughly similar cost, which means our cost would be:
one_loop: 30,000,000 "method call like operations"
two_loops: 40,000,000 "method call like operations"
In other words, we would expect two_loops to be 33% slower or one_loop to be 25% faster, depending on how you look at it.
However, we actually find that the difference is much bigger, so clearly we are missing something in our model.
What we are missing are optimizations. Arithmetic operations on integers are extremely common and extremely performance-critical, and so all Ruby implementations go to great lengths to make them fast. In fact, on all Ruby implementations, simple additions such as the one you are using will be mapped directly to single CPU ADD instructions, and will not incur the overhead of a method call at all.
Block activations are also very important in Ruby, and thus they are also heavily optimized, but they are just fundamentally several orders of magnitude more complex than adding two machine-word integers.
In fact, the relative complexity of block activations to machine-word integer additions is so big, that we can actually ignore the additions completely in our model:
one_loop: 10,000,000 block activations
two_loops: 20,000,000 block activations
This gives us a factor of 2:1, so we would expect two_loops to be 100% slower or one_loop to be 50% faster.
By the way, I ignored another operation that is happening: the definition and initialization of the local variables. The argument is similar: that is an operation that is so fast that it is negligible compared to the block activation.
Actually, so far, we have only talked about the relative cost of those operations and how they mean that we can ignore the cost of the additions and the local variables. There is, however, an even stronger reason to ignore those: optimizations.
Even the most simple Ruby implementation will be able to optimize away the local variables completely: they are defined and initialized in only one place, and they are never accessed ever again. They only exist within the scope of the block, for the duration of one activation of the block, so even a very simple optimizer can see that they are completely useless, so even the most simple of optimizers will optimize the code to something roughly like this:
def one_loop
N.times do
1+1
2+2
end
end
def two_loops
N.times do
1+1
end
N.times do
2+2
end
end
Meaning that we can not only ignore the cost of the local variables because it is small compared to the other costs, but actually, the local variables don't even exist.
Likewise, a slightly smarter optimizer will recognize that the first addition in one_loop has no side-effects, is not returned, is not stored in a variable (or at least not in one that is used anywhere), and in general does not influence the outcome of the computation in any way, shape, or form, and thus will optimize the code to this:
def one_loop
N.times do
2+2
end
end
def two_loops
N.times do
1+1
end
N.times do
2+2
end
end
Furthermore, the same argument actually applies to the remaining addition. It has no side-effect, all it does is being returned from the block, but Integer#times ignores the return value of the block. I haven't taken a look at the generated code, but I strongly suspect that even the most stupid optimizer can easily prove that your blocks are no-ops, and thus it will optimize the code to something roughly like this:
def one_loop
N.times do
end
end
def two_loops
N.times do
end
N.times do
end
end
Which means that one_loop has N iterations of the block, two_loops has 2 * N iterations, and thus should take roughly twice as long.
Now, we can see in your benchmarks that the numbers are not actually 2:1. They are 1.75:1 or roughly 7:4.
I can confirm those results on my machine, here with YARV 2.7.1 without JIT, I get almost exactly 7:4:
user system total real
two smaller loops 0.711479 0.000099 0.711578 ( 0.711680)
one large loop 0.401808 0.000059 0.401867 ( 0.401916)
However, when I turn on the JIT, I get almost exactly the 2:1 we would expect:
user system total real
two smaller loops 0.587017 0.000279 0.587296 ( 0.587098)
one large loop 0.291713 0.000062 0.291775 ( 0.291779)
You will also notice that it is faster in general.
With JRuby 9.2.9.0, we get again slightly faster execution and almost 2:1:
user system total real
two smaller loops 0.740000 0.010000 0.750000 ( 0.517670)
one large loop 0.260000 0.000000 0.260000 ( 0.263270)
This was with default options, here are the results with some more aggressive compiler flags:
user system total real
two smaller loops 0.370000 0.000000 0.370000 ( 0.362050)
one large loop 0.390000 0.010000 0.400000 ( 0.213861)
TruffleRuby 20.1.0 is again much faster than even JRuby:
user system total real
two smaller loops 0.009955 0.000039 0.009994 ( 0.010035)
one large loop 0.004759 0.000007 0.004766 ( 0.004742)
And again, very close to 2:1. Also, even though we are only interested in the relative performance of the two methods, it is quite pleasing to see that TruffleRuby is 70x-100x faster than YARV on this benchmark!
Actually, I am somewhat surprised that TruffleRuby could not prove that Integer#times with an empty block body is a no-op. I would have expected it to be able to optimize the code like this:
def one_loop
end
def two_loops
end
And thus have no runtime difference between the two versions at all.
Am I missing something? Are the 2 smaller loops really doing a much greater amount of work than the single larger loop? Or did I somehow construct my benchmark script in a misleading or inaccurate way?
I would say all of the above.
The main problem is that you are measuring almost the exact opposite of what you are counting. You are counting only additions and ignoring block activations, and there is nothing wrong with that, IFF all you are interested in is the number of additions and nothing else.
And you are measuring only the cost of block activations and ignoring the cost of additions, which is also totally fine if that is what you are interested in.
The problem is that those two do not match up: you are not measuring what you are counting and you are not counting what you are measuring, so you simply cannot draw any conclusions from the results of your experiments towards your hypothesis.
In one of your comments, you asked:
so does this mean that each iteration of each loop counts as its own operation, in addition to whatever operation(s) happen inside the loop?
We can't tell you that. You need to define what operations you are interested in, and what operations you want to ignore. If you define "operation" to only mean "addition", then no, each iteration of the loop does not count as its own operation, and both of your examples have the exact same amount of operations.
Another problem is that your hypothesis "the number of additions is the same, therefore the execution time is the same" is invalid, because additions are not the only operations that take time. And even if you count other kinds of operations, then your hypothesis still assumes that every operation takes the same amount of time, which is also not true.
There are also some more problems with your benchmarking approach in general, which however are not the source of your confusion. Here are some of the problems with your benchmarks I have identified, although I am sure there others:
Your benchmarks are written in a way that all the operations you are interested about are optimized away, leaving only operations you are not interested about.
Even if they weren't optimized away, their execution time negligible compared to the execution time of the operations you don't care about.
Your benchmarks don't run long and often enough to give the optimizer a chance. For example, the default threshold for when a method is compiled is between 20 and 20000 calls depending on the Ruby implementation, compiler flags, etc. Both your methods are only called twice, once during rehearsal, once during the real thing. You would need to make sure they are called a lot more than 20000 times to ensure that a) they get compiled at all, and b) there are enough iterations after they have been compiled that the slower iterations before they were compiled do not significantly affect the outcome.
I always recommend that someone who wants to write a benchmark read and understand the following mailing list thread:
JMH vs Caliper: reference thread
Especially the sub-thread and following discussion starting at the linked message.
While this thread is about a specific benchmark tool for benchmarking Java code, anything that is discussed in the thread applies to all benchmarking on all modern high-performance language implementations.
There is a reason why benchmarks are written by benchmark engineers who have writing benchmarks as their full-time job: it takes a lot of knowledge and expertise to write benchmarks.
You need, at least
Deep knowledge of computer organization in general.
Deep knowledge of all the specific hardware platforms that you are benchmarking on.
Deep knowledge of programming languages in general.
Deep knowledge of al the specific programming languages you are writing benchmark code in.
Deep knowledge of language implementations in general, including but not limited to ahead-of-time compilers, JIT compilers, interpreters, VMs, garbage collectors, memory allocators, optimizers, inlining, loop unrolling, dead code elimination, constant folding, common subexpression elimination, tail call elimination, peephole optimizations, compile-time evaluation, Polymorphic Inline Caching, and lots more.
Deep knowledge of the specific language implementations you are running your code on.
And more, e.g. operating systems, scheduling, NUMA, multi-threading, SMT, CMT, …
And when you have all of that, you are presented with a bunch of numbers you need to know how to interpret, which requires deep knowledge of statistics.
The benchmark library in the Ruby stdlib is an example of many of those "sins". It was perfectly fine 25 years ago, when there was only one single Ruby implementation which was simply a dumb AST-walking interpreter with no optimizations whatsoever, and computers had one CPU with no out-of-order execution or speculation, and that was it. But for current times, where we have a multitude of aggressively optimizing Ruby implementations (most prominently TruffleRuby) and complex CPUs that perform their own optimizations as well, it just doesn't cut it anymore.
Unfortunately, there is no benchmark tool of comparable sophistication to what's available e.g. in the Java world, but there are at least some alternatives such as better-benchmark (no longer maintained), benchmark-ips (by Evan Phoenix, founder of Rubinius), or fruity (by Marc-André Lafortune, ruby-core team member).
This is 10 million iterations, and in each iteration two calculations are done, for a total of 30 million of what we'll call operations:
N.times do
foo = 1+1
bar = 2+2
end
This is 20 million iterations, and in each iteration one calculation is done, for a total of 40 million of what we'll call operations:
N.times do
foo = 1+1
end
N.times do
bar = 2+2
end
30 < 40, therefore the first example is faster.

polyfit on GPUArray is extremely slow [duplicate]

function w=oja(X, varargin)
% get the dimensionality
[m n] = size(X);
% random initial weights
w = randn(m,1);
options = struct( ...
'rate', .00005, ...
'niter', 5000, ...
'delta', .0001);
options = getopt(options, varargin);
success = 0;
% run through all input samples
for iter = 1:options.niter
y = w'*X;
for ii = 1:n
% y is a scalar, not a vector
w = w + options.rate*(y(ii)*X(:,ii) - y(ii)^2*w);
end
end
if (any(~isfinite(w)))
warning('Lost convergence; lower learning rate?');
end
end
size(X)= 400 153600
This code implements oja's rule and runs slow. I am not able to vectorize it any more. To make it run faster I wanted to do computations on the GPU, therefore I changed
X=gpuArray(X)
But the code instead ran slower. The computation used seems to be compatible with GPU. Please let me know my mistake.
Profile Code Output:
Complete details:
https://drive.google.com/file/d/0B16PrXUjs69zRjFhSHhOSTI5RzQ/view?usp=sharing
This is not a full answer on how to solve it, but more an explanation why GPUs does not speed up, but actually enormously slow down your code.
GPUs are fantastic to speed up code that is parallel, meaning that they can do A LOT of things at the same time (i.e. my GPU can do 30070 things at the same time, while a modern CPU cant go over 16). However, GPU processors are very slow! Nowadays a decent CPU has around 2~3Ghz speed while a modern GPU has 700Mhz. This means that a CPU is much faster than a GPU, but as GPUs can do lots of things at the same time they can win overall.
Once I saw it explained as: What do you prefer, A million dollar sports car or a scooter? A million dolar car or a thousand scooters? And what if your job is to deliver pizza? Hopefully you answered a thousand scooters for this last one (unless you are a scooter fan and you answered the scooters in all of them, but that's not the point). (source and good introduction to GPU)
Back to your code: your code is incredibly sequential. Every inner iteration depends in the previous one and the same with the outer iteration. You can not run 2 of these in parallel, as you need the result from one iteration to run the next one. This means that you will not get a pizza order until you have delivered the last one, thus what you want is to deliver 1 by 1, as fast as you can (so sports car is better!).
And actually, each of these 1 line equations is incredibly fast! If I run 50 of them in my computer I get 13.034 seconds on that line which is 1.69 microseconds per iteration (7680000 calls).
Thus your problem is not that your code is slow, is that you call it A LOT of times. The GPU will not accelerate this line of code, because it is already very fast, and we know that CPUs are faster than GPUs for these kind of things.
Thus, unfortunately, GPUs suck for sequential code and your code is very sequential, therefore you can not use GPUs to speed up. An HPC will neither help, because every loop iteration depends in the previous one (no parfor :( ).
So, as far I can say, you will need to deal with it.

Speed differences between proc, Proc.new, lambda, and stabby lambda

Procs and lambdas differ with respect to method scoping and the effect of the return keyword. I am rather interested in the performance differences between them. I wrote a test as shown below:
def time(&block)
start = Time.now
block.call
p "that took #{Time.now - start}"
end
def test(proc)
time{(0..10000000).each{|n| proc.call(n)}}
end
def test_block(&block)
time{(0..10000000).each{|n| block.call(n)}}
end
def method_test
time{(1..10000000).each{|n| my_method(n)}}
end
proc1 = Proc.new{|x| x*x}
proc2 = proc{|x| x*x}
lam1 = lambda{|x| x*x}
lam2 = ->x{x*x}
def my_method(x)
x*x
end
test(proc1)
test(proc2)
test(lam1)
test(lam2)
test_block{|x| x*x}
test(method(:my_method))
method_test
The result of this code is shown below.
"that took 0.988388739"
"that took 0.963193172"
"that took 0.943111226"
"that took 0.950506263"
"that took 0.960760843"
"that took 1.090146951"
"that took 0.644500627"
method(:my_method) is the slowest, which is because it checks a look up table for each iteration in the loop.
Similarly, another test as below:
def test2(&block)
time{(0..1000000).each{block.call}}
end
test2{Proc.new{|x| x*x}}
test2{proc{|x| x*x}}
test2{lambda{|x| x*x}}
test2{->(x){x*x}}
returns this result:
"that took 0.415290453"
"that took 0.378787963"
"that took 0.3888118"
"that took 0.391414639"
Proc.new is the slowest creation method, which is because we have the overhead of creating an entire object to wrap our proc.
I assert that the execution time of the procs and lambdas are the same as one another regardless of their creation method.
Why is normal method invocation so much faster than procs and lambdas (1/3 time reduction)?
Do people think this is likely to change with different block functions etc.?
Are there any other (performance based) reasons to chose between the different approaches?
So it seems you have three questions. The middle one is unclear to me, so I will address the other two:
Why is normal method invocation so much faster?
This is the easier of the questions.
First realize that the times involved here are for function call overhead. I did my own timings based on your code (but with an identity function instead of multiplication), and non-direct invocations took 49% longer. With one multiplication, non-direct invocations took only 43% longer. In other words, one reason why you're seeing a large disparity is that your function itself is doing almost nothing. Even a single multiplication makes 6% of the difference "vanish". In a method of any reasonable complexity, the method call overhead is usually going to be a relatively small percentage of the overall time.
Next, remember that a proc/block/lambda is essentially a chunk of code that is being carried around (though a block literal cannot be saved into a variable). This implies one more level of indirection than a method call...meaning that at the very least the CPU is going to have to traverse a pointer to something.
Also, remember that Ruby supports closures, and I'm betting there is some overhead in deciding which environment the indirect code should run in.
On my machine, running a C program that invokes a function directly has 10% less overhead than one that uses a pointer to a function. An interpreted language like Ruby, where closures are also involved, is definitely going to use more.
My measurements (Ruby 2.2) indicate a direct method invocation takes about as long as 6 multiplications, and an indirect invocation takes about as long as 10.
So while the overhead is nearly twice as large, remember that the overhead in both cases is often relatively small.
Are there any other (performance based) reasons to chose between the different approaches?
I'd say given the above data the answer is usually no: you're much better off using the construct that gives you the most maintainable code.
There are definitely good reasons to choose one over the other. To be honest, I'm surprised about the difference I see between lambdas and blocks (on my machine, lambdas have 20% less overhead). Lambdas create anonymous methods that include parameter list checking, so if anything I would expect it to be slightly slower, but my measurements put it ahead.
That the direct invocation is faster simply isn't surprising at all.
The place where this kind of thing makes a difference is in very frequently called code where the overhead adds up to be noticeable in wall-clock kinds of ways. In this case, it can make sense to do all manner of ugly optimizations to try to squeeze a bit more speed, all the way up to inlining code (dodge the function-call overhead altogether).
For example, say your application uses a database framework. You profile it and find a frequently-called, small (e.g. less than 20 multiplications worth of work) function that copies the data from the database result into data structures. Such a function might comprise the lion's share of the result processing, and simply inlining that function might shave off significant amounts of CPU time at the expense of some code clarity.
Simply inlining your "square function" into a long numeric calculation with a billion steps could save you dramatic amounts of time because the operation itself takes a lot less time than even a direct method invocation.
In most cases, though, you're better off with clean, clear code.

Ruby Benchmarking Accuracy - Branch Prediction at its finest?

So this morning I decided to play around with Benchmarking for the first time.
I was curious about the speed different between code with "do-end" block formatting vs. "{ }" formatting.
So I stored the Benchmark code in a Proc so I could call it multiple times consecutively:
n = 100_000_000
bmp = Proc.new do
Benchmark.bm do |x|
x.report {n.times {a = "1"}}
x.report {n.times do; a = "1"; end}
end
end
My results we're expected when I ran it once.
>> bmp.call
user system total real
1.840000 0.030000 1.870000 ( 1.874507)
1.860000 0.050000 1.910000 ( 1.926101)
=> true
But then ran it again.
>> bmp.call
user system total real
1.870000 0.050000 1.920000 ( 1.922810)
1.840000 0.000000 1.840000 ( 1.850615)
To me this looks like the exact opposite of what I'm expecting. I am familiar with the concept of Branch Prediction. Is this a classic example of Branch Prediction? If not, what? Is there anyway to prevent any inaccuracies like this (if this is even considered one)?
EDIT: I did run this code over 30 times, after some suggestions. Frequently it would alternate between the two results. The sample of the data is found here:
gist.github.com/TheLarkInn/5599676
First of all, your benchmark is utterly pointless. The difference between the do / end syntax and the { / } syntax is just that: syntax. There is no semantic difference. Ergo, there cannot possibly be any runtime performance difference whatsoever between the two. It's just not logically possible. You don't need to benchmark it.
The only performance difference that could exist, is that one takes longer to parse than the other. However, none of the two is harder to parse than the other. The only difference is precedence. Therefore, there very likely isn't any performance difference in parsing, either.
And even if there were a performance difference in parsing, your benchmark wouldn't show it. You are using a benchmark written in Ruby, but in order to run Ruby code, the Ruby execution engine has to parse it first, which means that parsing will already have happened, before your benchmark even starts. So, even if your benchmark weren't pointless, it would still be useless, since it cannot possibly measure the performance difference in parsing.
As to your question about Branch Prediction: there are no branches in your code, there is nothing to predict.
BTW: even if your benchmark was intended for a different purpose, it still wouldn't be measuring anything, since at least the more advanced Ruby implementations would recognize that your blocks are essentially no-ops and simply optimize them away. And even if they aren't optimized away, all they are measuring is memory allocator performance (allocating a couple hundred megabytes of tiny String objects), not the performance of blocks.
Just a quick primer on stats:
I'm not sure if two runs is enough to spot a trend. What if there was a difference in system load between the two test blocks the second time you ran it?
A rule of thumb for determining a statistical difference between two samples is that 30 or more data points will give you a statistically relevant result.
I'd run your tests at least that many times, store the results for the two versions independently, and then compare them internally to ensure they're consistent, before comparing the two sets to one-another.
It could be that your initial premise is incorrect :)

Aren't modern computers powerful enough to handle Strings without needing to use Symbols (in Ruby)

Every text I've read about Ruby symbols talks about the efficiency of symbols over strings. But, this isn't the 1970s. My computer can handle a little bit of extra garbage collection. Am I wrong? I have the latest and greatest Pentium dual core processor and 4 gigs of RAM. I think that should be enough to handle some Strings.
Your computer may well be able to handle "a little bit of extra garbage collection", but what about when that "little bit" takes place in an inner loop that runs millions of times? What about when it's running on an embedded system with limited memory?
There are a lot of places you can get away with using strings willy-nilly, but in some you can't. It all depends on the context.
It's true, you don't need tokens so very badly for memory reasons. Your computer could undoubtedly handle all kinds of gnarly string handling.
But, in addition to being faster, tokens have the added advantage (especially with context coloring) of screaming out visually: LOOK AT ME, I AM A KEY OF A KEY-VALUE PAIR. That's a good enough reason to use them for me.
There's other reasons too... and the performance gain on lots of them might be more than you realize, especially doing something like comparison.
When comparing two ruby symbols, the interpreter is just comparing two object addresses. When comparing two strings, the interpreter has to compare every character one at a time. That kind of computation can add up if you're doing a lot of this.
Symbols have their own performance problems though... they are never garbage collected.
It's worth reading this article:
http://www.randomhacks.net/articles/2007/01/20/13-ways-of-looking-at-a-ruby-symbol
It's nice that symbols are guaranteed unique--that can have some nice effects that you wouldn't get from String (such as their addresses are always exactly equal I believe).
Plus they have a different meaning and you would want to use them in different areas, but ruby isn't too strict about that kind of stuff anyway, so I can understand your question.
Here's the real reason for the difference: strings are never the same. Every instance of a string is a separate object, even if the content is identical. And most operations on strings will make new string objects. Consider the following:
a = 'zowie'
b = 'zowie'
a == b #=> true
On the surface, it'd be easy to claim that a and b are the same. Most common sense operations will work as you'd expect. But:
a.object_id #=> 2152589920 (when I ran this in irb)
b.object_id #=> 2152572980
a.equal?(b) #=> false
They look the same, but they're different objects. Ruby had to allocate memory twice, perform the String#initialize method twice, etc. They're taking up two separate spots in memory. And hey! It gets even more fun when you try to modify them:
a += '' #=> 'zowie'
a.object_id #=> 2151845240
Here we add nothing to a and leave the content exactly the same -- but Ruby doesn't know that. It still allocates a whole new String object, reassigns the variable a to it, and the old String object sits around waiting for eventual garbage collection. Oh, and the empty '' string also gets a temporary String object allocated just for the duration of that line of code. Try it and see:
''.object_id #=> 2152710260
''.object_id #=> 2152694840
''.object_id #=> 2152681980
Are these object allocations fast on your slick multi-Gigahertz processor? Sure they are. Will they chew up much of your 4 GB of RAM? No they won't. But do it a few million times over, and it starts to add up. Most applications use temporary strings all over the place, and your code's probably full of string literals inside your methods and loops. Each of those string literals and such will allocate a new String object, every single time that line of code gets run. The real problem isn't even the memory waste; it's the time wasted when garbage collection gets triggered too frequently and your application starts hanging.
In contrast, take a look at symbols:
a = :zowie
b = :zowie
a.object_id #=> 456488
b.object_id #=> 456488
a == b #=> true
a.equal?(b) #=> true
Once the symbol :zowie gets made, it'll never make another one. Every time you refer to a given symbol, you're referring to the same object. There's no time or memory wasted on new allocations. This can also be a downside if you go too crazy with them -- they're never garbage collected, so if you start creating countless symbols dynamically from user input, you're risking an endless memory leak. But for simple literals in your code, like constant values or hash keys, they're just about perfect.
Does that help? It's not about what your application does once. It's about what it does millions of times.
One less character to type. That's all the justification I need to use them over strings for hash keys, etc.

Resources