Why is golang slower than Scala? [closed]


Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
In this test, we can see that golang is sometimes much slower than Scala. In my opinion, since Go code is compiled directly to native, C/C++-compatible binary code while Scala code is compiled to JVM bytecode, golang should perform much better, especially in the computation-intensive algorithms the benchmark uses. Is my understanding incorrect?
http://benchmarksgame.alioth.debian.org/u64/chartvs.php?r=eNoljskRAEEIAlPCA48ozD%2Bb1dkX1UIhzELXeGcih5BqXeksDvbs8Vgi9HFr23iGiD82SgxJqRWkKNctgkMVUfwlHXnZWDkut%2BMK1nGawoYeDLlYQ8eLG1tvF91Dd8NVGm4sBfGaYo0Pok0rWQ%3D%3D&m=eNozMFFwSU1WMDIwNFYoNTNRyAMAIvoEBA%3D%3D&w=eNpLz%2FcvTk7MSQQADkoDKg%3D%3D

Here's what I think is going on in the four benchmarks where the Go solutions are slowest relative to the Scala solutions.
mandelbrot: the Scala implementation has its inner loop unrolled once. It may also be that the JVM can vectorise a calculation written this way, which I think the Go compiler doesn't yet do. This is good manual optimisation plus better JVM support for speeding up arithmetic.
regex-dna: the Scala implementation isn't doing what the benchmark requires. It's asked to "(one pattern at a time) match-replace the pattern in the redirect file, and record the sequence length", but it's just calculating the length and printing that. The Go version does the match-replace, so it is slower.
k-nucleotide: the Scala implementation has been optimised by using bit-twiddling to pack nucleotides into a long rather than using chars. It's a good optimisation that could also be applied to the Go code.
binary-trees: this tests GC performance by filling RAM. It's true that the Java GC is much faster than the Go GC, but the argument for this not being Go's top priority is that in real programs one can usually avoid GC by not producing garbage in the first place.
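The k-nucleotide trick can be sketched in Go. This is a minimal illustration, not the Scala entry's code: the 2-bit mapping (A=0, C=1, G=2, T=3) and the names `code`, `packKmer` and `countKmers` are my own assumptions. Each k-mer of up to 32 nucleotides becomes a single `uint64` key, so hashing and comparing k-mers works on one machine word instead of a string.

```go
package main

import "fmt"

// code maps a nucleotide letter to a 2-bit value: A=0, C=1, G=2, T=3
// (an illustrative mapping).
func code(b byte) (uint64, bool) {
	switch b {
	case 'A', 'a':
		return 0, true
	case 'C', 'c':
		return 1, true
	case 'G', 'g':
		return 2, true
	case 'T', 't':
		return 3, true
	}
	return 0, false
}

// packKmer packs up to 32 nucleotides into one uint64, two bits each.
func packKmer(s string) (uint64, error) {
	if len(s) > 32 {
		return 0, fmt.Errorf("k-mer of length %d does not fit in 64 bits", len(s))
	}
	var key uint64
	for i := 0; i < len(s); i++ {
		c, ok := code(s[i])
		if !ok {
			return 0, fmt.Errorf("invalid nucleotide %q", s[i])
		}
		key = key<<2 | c
	}
	return key, nil
}

// countKmers counts every k-mer in seq using packed keys. The mask keeps
// only the last k nucleotides as the window slides one position at a time.
// It assumes seq contains only A/C/G/T (any other byte is treated as A).
func countKmers(seq string, k int) map[uint64]int {
	counts := make(map[uint64]int)
	if k <= 0 || k > 32 || len(seq) < k {
		return counts
	}
	mask := uint64(1)<<(2*uint(k)) - 1 // for k == 32 the shift yields 0, so mask is all ones
	var key uint64
	for i := 0; i < len(seq); i++ {
		c, _ := code(seq[i])
		key = (key<<2 | c) & mask
		if i >= k-1 {
			counts[key]++
		}
	}
	return counts
}

func main() {
	key, _ := packKmer("GGT")
	fmt.Println(key) // 43 (binary 10 10 11)

	counts := countKmers("GGTATTTTAATT", 2)
	tt, _ := packKmer("TT")
	fmt.Println(counts[tt]) // 4
}
```

The rolling update touches only a shift, an OR and an AND per input byte, which is why this kind of packing tends to pay off in the benchmark's hashing-heavy inner loop.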

This chart is from the Programming Shootout. You should read the disclaimers on the Shootout page before taking the benchmarks as gospel. At best these benchmarks are only useful for indicating broad expectations of performance.
That said, the JVM has had a decade of well-funded optimization and, apart from startup time, provides excellent performance for running code. Go is still a young language, so the fact that it comes within spitting distance of a JVM language is impressive. If you enjoy programming in Go, you should not reject it over one benchmark.

This is discussed in the Go FAQ:
One of Go's design goals is to approach the performance of C for comparable programs, yet on some benchmarks it does quite poorly, including several in test/bench/shootout. The slowest depend on libraries for which versions of comparable performance are not available in Go. For instance, pidigits.go depends on a multi-precision math package, and the C versions, unlike Go's, use GMP (which is written in optimized assembler). Benchmarks that depend on regular expressions (regex-dna.go, for instance) are essentially comparing Go's native regexp package to mature, highly optimized regular expression libraries like PCRE.
Benchmark games are won by extensive tuning and the Go versions of most of the benchmarks need attention. If you measure comparable C and Go programs (reverse-complement.go is one example), you'll see the two languages are much closer in raw performance than this suite would indicate.
Still, there is room for improvement. The compilers are good but could be better, many libraries need major performance work, and the garbage collector isn't fast enough yet. (Even if it were, taking care not to generate unnecessary garbage can have a huge effect.)
As an aside, consider the 10x (!) speed difference between different versions of a benchmark within a single programming language: C gcc #7 is 8.3 times slower than C gcc #5, and Ada #3 is almost 10 times slower than Ada #5. These benchmarks give a rough idea of how languages compare, but the difference between Go and Scala is within one order of magnitude, which means any 'intrinsic' variation between the runtimes is likely to be dwarfed by differences in implementation: this post describes how they sped up a program 11x by allocating memory more carefully. Maybe the compiler/runtime should handle this kind of optimisation automatically (as the JVM does, to a certain extent), but I am not sure you can really conclude from these figures that 'Go is slower (resp. faster) than Scala' in the general case. Just my opinion though :)

Since you seem keen on looking at these biased benchmarks, let's take a real example from a real scenario rather than some Fibonacci implementations.
Take a look at these rankings from the web framework benchmarks. The tests were run using the native client where available and sometimes using OSS web frameworks, and several packages were tested for each language. The tests range from requests for raw strings to using an ORM to query a database.
Scala's performance is nowhere close to Go's: in all of the tests Scala ranked below Go. Having said this, benchmarks are far from reality, and I suggest you evaluate a language from a tools/features perspective, or simply by what would best solve your problem.

As Brad pointed out, these results are from one particular benchmark suite. This provides some information, but don't assume it's the whole picture. It would be helpful to know whether the source code is well enough written in each case to give the fastest speed, the least memory use, or some other target goal.
Perhaps we might compare with another website that ranks languages. Take a look at http://www.techempower.com/benchmarks/, where web service implementations are compared. Despite being a young language, Go is among the best in some of these benchmarks.
As in all benchmarks, it always depends what you strive for and how you measure it.

Related

Is optimization dangerous? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 9 years ago.
I've read somewhere in the comments section of one of the questions here on Stack Overflow that:
Always start coding unoptimized.
If it meets the requirements, then it's good;
else, code an optimized version.
Check whether the optimized code meets the requirements. If it does, keep it, but also keep the unoptimized version or paste it in as a comment.
If the optimized version doesn't meet the requirements, delete it and stick with the unoptimized one.
Is there a term for this kind of programming? Is this a good or bad programming practice?
Is optimization dangerous? The only reason I can think of is that it can create unnecessary complexity which can lead to errors. Is there anything else?
Is there a general rule to be followed about when one should optimize or not?

Optimising code takes developer time that could instead be used to add new features or polish the product. Since the end goal of development is not the code but the product built with it, time spent on optimisation should be balanced against the other ways that time could be used.
It's a waste when the effort is spent on code that doesn't end up in the product because the requirements changed. If optimisation is performed from the beginning, you may also spend lots of time optimising a part of the code that contributes only marginally to the application's overall running time.
Instead, you should probably wait until you have a clear vision of what the application is and where the bottlenecks are before spending much effort on optimisation. By then, you'll have a large suite of unit tests and use cases that let you optimise with confidence that you aren't breaking the application, and profiling will let you spend your effort only on the parts that are really worth optimising.
As always in engineering, optimisation is a tradeoff. If you care about your resources (time, money, ...), you should be sure it is going to pay off before doing it.
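One way to get the unit-test safety net this answer describes, and to act on the question's idea of keeping the unoptimized version, is to keep the simple code as a reference and test the optimized code against it. A minimal Go sketch; the sum-of-squares example and all function names are hypothetical:

```go
package main

import "fmt"

// sumSquaresSimple is the straightforward reference version:
// 1*1 + 2*2 + ... + n*n, obviously correct but O(n).
func sumSquaresSimple(n int) int {
	total := 0
	for i := 1; i <= n; i++ {
		total += i * i
	}
	return total
}

// sumSquaresFast is the "optimized" version using the closed form
// n(n+1)(2n+1)/6, which is O(1) but easier to get subtly wrong.
func sumSquaresFast(n int) int {
	if n <= 0 {
		return 0
	}
	return n * (n + 1) * (2*n + 1) / 6
}

// checkAgainstReference compares the optimized version with the reference
// for every n in [0, maxN] and reports the first mismatch.
func checkAgainstReference(maxN int) error {
	for n := 0; n <= maxN; n++ {
		if got, want := sumSquaresFast(n), sumSquaresSimple(n); got != want {
			return fmt.Errorf("n=%d: fast=%d, reference=%d", n, got, want)
		}
	}
	return nil
}

func main() {
	fmt.Println(sumSquaresFast(10))          // 385
	fmt.Println(checkAgainstReference(1000)) // <nil>
}
```

The reference version stays in the codebase as a test oracle rather than as a comment, so it keeps being compiled and checked.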

In general, optimized code is more complex and harder to get correct. It's also often counterproductive to optimize code early (simply because you may spend time optimizing something that doesn't provide any real improvement in overall performance).
So the guidance you're asking about really boils down to:
write code that is easy to write and easy to verify as correct
optimize that code when it makes sense to expend the effort
No matter how fast it runs, incorrect code is not optimized code.

Always profile before optimizing. If a small amount of code takes up the majority of the execution time, and you can prove this from your profiling results, consider the programming effort to write, test, reprofile, and maintain the optimized code, and to have someone else inherit the added complexity. Once you've done this, revert your code to how it was before you optimized it for speed at the expense of readability. Just don't do it. Seriously, unless over 90% of your execution time is spent in one function, it's not worth the effort.
Keep in mind that a 10x speedup on code that consumes 90% of your runtime will decrease your total runtime by a factor of only ~5. Even an infinite speedup of that slow function still only speeds up your entire program by a factor of 10. If you're hoping for more than an order of magnitude speed improvement (which is my threshold for even starting to think about optimizing), you will need to change how you approach the problem, and that kind of change means rethinking the architecture of the program. If you're lucky, it may be as simple as replacing your queue with a priority queue. Most likely you won't be lucky. Sorry the answer is bleak.

If optimization (by your compiler) breaks your code when you believe it should not, then either
your code is not following the language standard, or
your compiler is broken, and you should upgrade it.
Language standards are quite complex to understand (in particular because not everything is specified, and some things are explicitly left unspecified or implementation-specific). Read about undefined behavior.
In practice, compilers are tested very heavily, so you should first suspect your own code, and suspect the compiler only once you are sure your code is correct (and fully standard-conforming). In other words, compiler optimization bugs, where the generated code is wrong, are quite rare in practice.
Be sure to upgrade your compiler to a recent version. As of today (December 2013), for GCC that is 4.8.2; don't blame GCC if you are using a 4.4 or 3.6 compiler, as these ancient versions are no longer maintained!
In practice, enable all warnings and debugging info when developing your code (e.g. compile with at least gcc -Wall -g, perhaps even with -Wextra). When you are sure of the quality of your code, compile it with optimizations and warnings (e.g. gcc -Wall -g -O2) and test it a lot.
In practice, profile the execution of your tests and (when possible) focus your efforts on the hot code (the code taking most of the CPU time).

Premature optimization is the root of all evil... but sometimes you don't really have a choice. Take audio codec implementations on ARM devices: there you need to take advantage of the ARM DSP assembly extensions (like QADD, QSUB, QDADD, and QDSUB), which can only be expressed in C as several lines of code compiling to multiple instructions (highly inefficient). Compilers cannot do a good job there, so you will need to optimize the code by inlining assembly.
In that case you will probably write "non-optimized code" first, but with optimization in mind, so that when you add the optimizations you won't need to change your code much.
Another case in which you know you will need to optimize your code is writing signal processing functions (correlation, convolution, FFT) for embedded devices. There you will have to do algorithmic optimization (choosing the best method for the problem and the right approximation) as well as code optimization (using the pipeline properly, for example), and it helps to know you are going to optimize before you start (especially for the algorithmic part, which can be worked out on paper before coding and tested separately).

Performance of Google's Go?

So, has anyone used Google's Go? I was wondering how its mathematical performance (e.g. FLOPS) compares to other garbage-collected languages like Java or .NET.
Has anyone investigated this?

Theoretical performance: The theoretical performance of pure Go programs is somewhere between C/C++ and Java. This assumes an advanced optimizing compiler and it also assumes the programmer takes advantage of all features of the language (be it C, C++, Java or Go) and refactors the code to fit the programming language.
Practical performance (as of July 2011): The standard Go compiler (5g/6g/8g) is currently unable to generate efficient instruction streams for high-performance numerical codes, so the performance will be lower than C/C++ or Java. There are multiple reasons for this: each function call has an overhead of a couple of additional instructions (compared to C/C++ or Java), no function inlining, average-quality register allocation, average-quality garbage collector, limited ability to erase bound checks, no access to vector instructions from Go, compiler has no support for SSE2 on 32-bit x86 CPUs, etc.
Bottom line: As a rule of thumb, expect numerical codes implemented in pure Go and compiled by 5g/6g/8g to run about 2 times slower than C/C++ or Java. Expect performance to improve in the future.
Practical performance (September 2013): Compared to the Go of July 2011, Go 1.1.2 generates more efficient numerical code, but it still runs slightly slower than C/C++ and Java. The compiler now uses SSE2 instructions even on 32-bit x86 CPUs, which makes 32-bit numerical code run much faster, most likely thanks to better register allocation. The compiler now implements function inlining and escape analysis. The garbage collector has also been improved but is still less advanced than Java's. There is still no support for accessing vector instructions from Go.
Bottom line: The performance gap seems sufficiently small for Go to be an alternative to C/C++ and Java in numerical computing, unless the competing implementation is using vector instructions.
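The bounds-check overhead listed for 2011 has since been reduced by a bounds-check elimination pass in the gc compiler. A common idiom, sketched below under the assumption of a reasonably recent compiler (this is not from the answer), is to make the slice lengths provably equal before the loop so the compiler can prove every index is in range. Whether the checks are actually removed depends on the compiler version; as far as I know, building with `-gcflags=-d=ssa/check_bce/debug=1` reports the checks that remain.

```go
package main

import "fmt"

// dot returns the dot product of a and b, which must have equal length.
// The explicit check catches mismatched inputs; the reslice makes len(b)
// equal to len(a) by construction, which the compiler can use to prove
// that b[i] is in bounds for every i in range a and drop that check.
func dot(a, b []float64) float64 {
	if len(a) != len(b) {
		panic("dot: length mismatch")
	}
	b = b[:len(a)] // bounds-check-elimination hint
	sum := 0.0
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

func main() {
	fmt.Println(dot([]float64{1, 2, 3}, []float64{4, 5, 6})) // 32
}
```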

The Go math package is largely written in assembler for performance.
Benchmarks are often unreliable and are subject to interpretation. For example, Robert Hundt's paper Loop Recognition in C++/Java/Go/Scala looks flawed. The Go blog post on Profiling Go Programs dissects Hundt's claims.

You're actually asking several different questions. First of all, Go's math performance is going to be about as fast as anything else. Any language that compiles down to native code (which arguably includes even JIT languages like .NET) is going to perform extremely well at raw math: as fast as the machine can go. Simple math operations are very easy to compile into a zero-overhead form. This is the area where compiled (including JIT) languages have an advantage over interpreted ones.
The other question you asked was about garbage collection. This is, to a certain extent, a bit of a side issue if you're talking about heavy math. That's not to say that GC doesn't impact performance -- actually it impacts quite a bit. But the common solution for tight loops is to avoid or minimize GC sweeps. This is often quite simple if you're doing a tight loop -- you just re-use your old variables instead of constantly allocating and discarding them. This can speed your code by several orders of magnitude.
As for the GC implementations themselves: Go and .NET both use mark-and-sweep garbage collection. Microsoft has put a lot of focus and engineering into their GC engine, and I'm inclined to think it's quite good, all things considered. Go's GC engine is a work in progress, and while it doesn't feel any slower than .NET's, the Golang folks insist that it needs some work. The fact that Go's specification has no destructors goes a long way toward speeding things up, which may be why it doesn't seem that slow.
Finally, in my own anecdotal experience, I've found Go to be extremely fast. I've written very simple and easy programs that have stood up in my own benchmarks against highly-optimized C code from some long-standing and well-respected open source projects that pride themselves on performance.
The catch is that not all Go code is going to be efficient, just like not all C code is efficient. You've got to build it correctly, which often means doing things differently than what you're used to from other languages. The profiling blog post mentioned here several times is a good example of that.

Google did a study comparing Go to some other popular languages (C++, Java, Scala). They concluded it was not as strong performance-wise:
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
Quote from the Conclusion, about Go:
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reflects in both performance and binary sizes.

Is it possible to design a dynamic language without significant performance loss?

Is it possible to design something like Ruby or Clojure without significant performance loss in many situations compared with C/Java? Does hardware design play a role?
Edit: By significant I mean an order of magnitude, not just ten percent.
Edit: I suspect delnan is correct that I mean dynamic languages, so I changed the title.

Performance depends on many things. Of course the semantics of the language have to be preserved even when compiling it: you can't remove dynamic dispatch from Ruby; that would speed things up dramatically, but it would totally break 95% of all Ruby code in the world. Still, much of the performance depends on how smart the implementation is.
I assume that by "high-level" you mean "dynamic"? Haskell and OCaml are extremely high-level, yet they are compiled natively and can outperform C# or Java, and even C and C++ in some corner cases, especially if parallelism comes into play. And they certainly weren't designed with performance as the #1 goal. But compiler writers, especially those focused on functional languages, are a very clever bunch. If you or I started a high-level language, even using e.g. LLVM as the backend for native compilation, we wouldn't get anywhere near this performance.
Making dynamic languages run fast is harder: they delay many decisions (types, members of a class/an object, ...) from compile time to runtime, and while static code analysis can sometimes prove that such dynamic behaviour can't occur on lines n and m, you still have to carry an advanced runtime around and do quite a few things at runtime that a static language's compiler does at compile time. Even dynamic dispatch can be optimized with a smarter VM (inline caches, anyone?), but it's a lot of work, more than a small new-fangled language could manage.
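For readers who haven't met inline caches, here is a toy monomorphic inline cache in Go. Everything in it (`Class`, `Object`, `CallSite`) is a hypothetical model, not any real VM's data structures, and it omits cache invalidation when a method is redefined, which a real VM must handle. The call site remembers the last receiver class and the method found for it, so repeated calls on the same class skip the by-name lookup.

```go
package main

import "fmt"

// Class is a hypothetical runtime class whose methods are found by name,
// as in a dynamic language.
type Class struct {
	name    string
	methods map[string]func(self *Object) string
}

// Object is an instance carrying a pointer to its class.
type Object struct {
	class *Class
}

// CallSite is a monomorphic inline cache for one call site: it remembers
// the last receiver class and the method that was found for it.
type CallSite struct {
	selector     string
	cachedClass  *Class
	cachedFn     func(self *Object) string
	hits, misses int
}

// Invoke uses the cached method when the receiver's class matches the
// cached class, and falls back to the by-name lookup otherwise.
func (cs *CallSite) Invoke(obj *Object) string {
	if obj.class == cs.cachedClass {
		cs.hits++
		return cs.cachedFn(obj)
	}
	cs.misses++
	fn, ok := obj.class.methods[cs.selector]
	if !ok {
		panic(fmt.Sprintf("%s does not understand %s", obj.class.name, cs.selector))
	}
	cs.cachedClass, cs.cachedFn = obj.class, fn // re-bind the cache
	return fn(obj)
}

func main() {
	dog := &Class{name: "Dog", methods: map[string]func(*Object) string{
		"speak": func(*Object) string { return "woof" },
	}}
	site := &CallSite{selector: "speak"}
	for i := 0; i < 3; i++ {
		fmt.Println(site.Invoke(&Object{class: dog}))
	}
	fmt.Println(site.hits, site.misses) // 2 1
}
```

A call site that keeps seeing the same class pays one pointer comparison per call; one that sees many classes keeps missing, which is why real VMs add polymorphic caches on top of this.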
Also see Steve Yegge's Dynamic Languages Strike Back.
And of course, what is a significant performance loss? 100 times slower than C sounds like a lot, but as we all know, 80% of execution time is spent in 20% of the code, so the other 80% of the code won't have a notable impact on the perceived performance of the whole program. For the remaining 20%, you can always rewrite it in C or C++ and call it from the dynamic language. For many applications this suffices (for some, you don't even need to optimize). For the rest... well, if performance is that critical, you should probably write it in a language designed for performance.

Don't confuse the language design with the platform that it runs on.
For instance, Java is a high-level language. It runs on the JVM (as does Clojure - identified above, and JRuby - a Java version of Ruby). The JVM will perform byte-code analysis and optimise how the code runs (making use of escape analysis, just-in-time compilation etc.). So the platform has an effect on the performance that is largely independent of the language itself (see here for more info on Java performance and comparisons to C/C++)

Loss compared to what? If you need a garbage collector or closures then you need them, and you're going to pay the price regardless. If a language makes them easy for you to get at, that doesn't mean you have to use them when you don't need them.
If a language is interpreted instead of compiled, that's going to introduce an order of magnitude slowdown. But such a language may have compensating advantages, like ease of use, platform independence, and not having to compile. And, the programs you write in them may not run long enough for speed to be an issue.
There may be language implementations that introduce slowness for no good reason, but those don't have to be used.

You might want to look at what the DARPA HPCS initiative has come up with. There were 3 programming languages proposed: Sun's Fortress, IBM's X10 and Cray's Chapel. The latter two are still under development. Whether any of these meet your definition of high-level I don't know.
And yes, hardware design certainly does play a part. All 3 of these languages are targeted at supercomputers with very many processors and exhibit features appropriate to that domain.

It's certainly possible. For example, Objective-C is a dynamically-typed language that has performance comparable to C++ (although a wee bit slower, generally speaking, but still roughly equivalent).

Are dynamic languages slower than static languages? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Are dynamic languages slower than static languages because, for example, the run-time has to check types constantly?

No.
Dynamic languages are not slower than static languages. In fact, it is impossible for any language, dynamic or not, to be slower than another language (or faster, for that matter), simply because a language is just a bunch of abstract mathematical rules. You cannot execute a bunch of abstract mathematical rules, therefore they cannot ever be slow(er) or fast(er).
The statement that "dynamic languages are slower than static languages" is not only wrong, it doesn't even make sense. If English were a typed language, that statement wouldn't even typecheck.
In order for a language to even be able to run, it has to be implemented first. Now you can measure performance, but you aren't measuring the performance of the language, you are measuring the performance of the execution engine. Most languages have many different execution engines, with very different performance characteristics. For C, for example, the difference between the fastest and slowest implementations is a factor of 100000 or so!
Also, you cannot really measure the performance of an execution engine, either: you have to write some code to run on that execution engine first. But now you aren't measuring the performance of the execution engine, you are measuring the performance of the benchmark code. Which has very little to do with the performance of the execution engine and certainly nothing to do with the performance of the language.
In general, running well-designed code on well-designed high-performance execution engines will yield about the same performance, independent of whether the language is static or dynamic, procedural, object-oriented or functional, imperative or declarative, lazy or strict, pure or impure.
In fact, I would propose that the performance of a system is solely dependent on the amount of money that was spent making it fast, and completely independent of any particular typing discipline, programming paradigm or language.
Take for example Smalltalk, Lisp, Java and C++. All of them are, or have at one point been, the language of choice for high-performance code. All of them have huge amounts of engineering and research man-centuries expended on them to make them fast. All of them have highly-tuned proprietary commercial high-performance execution engines available. Given roughly the same problem, implemented by roughly comparable developers, they all perform roughly the same.
Two of those languages are dynamic, two are static. Java is interesting, because although it is a static language, most modern high-performance implementations are actually dynamic implementations. (In fact, several modern high-performance JVMs are actually either Smalltalk VMs in disguise, derived from Smalltalk VMs or written by Smalltalk VM companies.) Lisp is also interesting, because although it is a dynamic language, there are some (although not many) static high-performance implementations.
And we haven't even begun talking about the rest of the execution environment: modern mainstream operating systems, mainstream CPUs and mainstream hardware architectures are heavily biased towards static languages, to the point of being actively hostile for dynamic languages. Given that modern mainstream execution environments are pretty much of a worst-case scenario for dynamic languages, it is quite astonishing how well they actually perform and one can only imagine what the performance in a less hostile environment would look like.

All other things being equal, usually, yes.

First you must clarify whether you are considering
dynamic typing vs. static typing, or
statically compiled languages vs. interpreted languages vs. bytecode JIT.
Usually we mean
dynamic language = dynamic typing + interpreted at run-time, and
static language = static typing + statically compiled,
but this is not necessarily the case.
Type information can help the VM dispatch messages faster than without it, but the difference tends to disappear with VM optimizations that detect monomorphic call sites. See the "performance consideration" paragraph in this post about dynamic invocation.
The debate over compiled vs. interpreted vs. bytecode JIT is still open. Some argue that bytecode JIT results in faster execution than regular compilation because the compilation is more accurate thanks to the extra information collected at run-time. Read the Wikipedia entry about JIT for more insight. Interpreted languages are indeed slower than either form of compilation.
I won't argue further and start a heated discussion; I just wanted to point out that the gap between the two keeps getting smaller. Chances are that any performance problem you face will be caused not by the language and VM but by your design.
EDIT
If you want numbers, I suggest you look at The Computer Language Benchmarks Game. I found it insightful.

At the instruction level current implementations of dynamically typed languages are typically slower than current implementations of statically typed languages.
However, that does not necessarily mean that a program's implementation will be slower in a dynamic language: there are many documented cases of the same program being implemented in both a static and a dynamic language where the dynamic implementation turned out to be faster. For example, this study (PDF) gave the same problem to programmers in a variety of languages and compared the results. The mean runtimes of the Python and Perl implementations were shorter than those of the C++ and Java implementations.
There are several reasons for this:
1) the code can be implemented more quickly in a dynamic language, leaving more time for optimisation.
2) high level data structures (maps, sets etc) are a core part of most dynamic languages and so are more likely to be used. Since they are core to the language they tend to be highly optimised.
3) programmer skill is more important than language speed - an inexperienced programmer can write slow code in any language. In the study mentioned above there were several orders of magnitude difference between the fastest and slowest implementation in each of the languages.
4) in many problem domains, execution speed is dominated by I/O or some other factor external to the language.
5) Algorithm choice can dwarf language choice. In the book "More Programming Pearls", Jon Bentley implemented two algorithms for a problem: one was O(N^3) and implemented in optimised Fortran on a Cray-1; the other was O(N) and implemented in BASIC on a TRS-80 home micro (this was in the 1980s). The TRS-80 outperformed the Cray-1 for N > 5000.
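Point 5 can be illustrated with a classic problem. The choice of problem here is my own assumption (the answer doesn't name one): the maximum-sum contiguous subarray, solved by an O(N^3) brute force and by an O(N) single scan (Kadane's algorithm). The empty subarray counts, so the result is never negative.

```go
package main

import "fmt"

// maxSubarrayCubic tries every start i and end j and re-sums x[i..j]: O(N^3).
func maxSubarrayCubic(x []int) int {
	best := 0 // the empty subarray has sum 0
	for i := range x {
		for j := i; j < len(x); j++ {
			sum := 0
			for k := i; k <= j; k++ {
				sum += x[k]
			}
			if sum > best {
				best = sum
			}
		}
	}
	return best
}

// maxSubarrayLinear scans once, tracking the best sum of a subarray ending
// at the current position (Kadane's algorithm): O(N).
func maxSubarrayLinear(x []int) int {
	best, endingHere := 0, 0
	for _, v := range x {
		endingHere += v
		if endingHere < 0 {
			endingHere = 0 // a negative running sum never helps; start afresh
		}
		if endingHere > best {
			best = endingHere
		}
	}
	return best
}

func main() {
	x := []int{31, -41, 59, 26, -53, 58, 97, -93, -23, 84}
	fmt.Println(maxSubarrayCubic(x), maxSubarrayLinear(x)) // 187 187
}
```

For large enough N the linear version wins no matter how much faster the machine running the cubic one is, which is the point of the Cray-1 vs TRS-80 comparison.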

Dynamic language run-times only need to check the type occasionally.
But it is still, typically, slower.
There are people making good claims that such performance gaps are attackable, however; e.g. http://steve-yegge.blogspot.com/2008/05/dynamic-languages-strike-back.html

The most important factor to consider is the method dispatch algorithm. In static languages each method is typically allocated an index. The names we see in the source are not actually used at runtime; they are in the source for readability. Naturally, languages like Java keep them and make them available via reflection, but they are not used when a method is invoked. I will leave reflection and binding out of this discussion. This means that when a method is invoked, the runtime simply uses the offset to look up a table and make the call. A dynamic language, on the other hand, uses the name of the function to look up a map and then calls that function. A hashmap lookup is always going to be slower than an index lookup into an array.
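The two dispatch styles can be contrasted in a few lines of Go. This is a simplified sketch with hypothetical names: index-based dispatch through a table, as in the static case, versus by-name lookup in a hash map, as in the naive dynamic case. Real dynamic-language VMs usually avoid paying the map lookup on every call, for example with inline caches.

```go
package main

import "fmt"

// Method slots fixed "at compile time": callers use the index, not the name.
const (
	slotArea = iota
	slotPerimeter
)

type method func(w, h float64) float64

// rectVTable is an index-addressed method table, as in the static case.
var rectVTable = []method{
	slotArea:      func(w, h float64) float64 { return w * h },
	slotPerimeter: func(w, h float64) float64 { return 2 * (w + h) },
}

// rectMethods maps names to the same methods, as in the naive dynamic case.
var rectMethods = map[string]method{
	"area":      rectVTable[slotArea],
	"perimeter": rectVTable[slotPerimeter],
}

// callBySlot is a bounds-checked array load plus an indirect call.
func callBySlot(slot int, w, h float64) float64 {
	return rectVTable[slot](w, h)
}

// callByName must hash the name and probe the map on every call.
func callByName(name string, w, h float64) (float64, error) {
	m, ok := rectMethods[name]
	if !ok {
		return 0, fmt.Errorf("rectangle does not understand %q", name)
	}
	return m(w, h), nil
}

func main() {
	fmt.Println(callBySlot(slotArea, 3, 4)) // 12
	p, err := callByName("perimeter", 3, 4)
	fmt.Println(p, err) // 14 <nil>
}
```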

No, dynamic languages are not necessarily slower than static languages.
The PyPy and Psyco projects have been making a lot of progress on building JIT compilers for Python that do data-driven compilation; in other words, they will automatically compile versions of frequently called functions specialised for particular common argument values. Not just by type, like a C++ template, but by actual argument values: say an argument is usually zero, or None, then there will be a specifically compiled version of the function for that value.
This can lead to compiled code that is faster than you'd get out of a C++ compiler, and since it is doing this at runtime, it can discover optimisations specifically for the actual input data for this particular instance of the program.
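The idea of specialising on argument values can be mimicked, very loosely, in Go. This is only an analogy under my own assumptions: the names are hypothetical, and closures are not generated machine code, which is what PyPy or Psyco would actually produce. A variant of `pow` is built once for each exponent value actually seen, with its structure decided up front by repeated squaring, and is cached by that value.

```go
package main

import "fmt"

// powGeneric is the general version: a loop whose trip count depends on n.
func powGeneric(x float64, n int) float64 {
	r := 1.0
	for i := 0; i < n; i++ {
		r *= x
	}
	return r
}

// specializePow builds a variant of pow for one fixed exponent n, deciding
// the sequence of multiplications once (by repeated squaring) instead of on
// every call. Exponents n <= 0 give the constant 1, matching powGeneric.
func specializePow(n int) func(float64) float64 {
	switch {
	case n <= 0:
		return func(float64) float64 { return 1 }
	case n == 1:
		return func(x float64) float64 { return x }
	case n%2 == 0:
		half := specializePow(n / 2)
		return func(x float64) float64 { h := half(x); return h * h }
	default:
		rest := specializePow(n - 1)
		return func(x float64) float64 { return x * rest(x) }
	}
}

// specializer caches one specialized variant per observed exponent value.
type specializer struct {
	variants map[int]func(float64) float64
}

func (s *specializer) pow(x float64, n int) float64 {
	f, ok := s.variants[n]
	if !ok {
		f = specializePow(n) // first time this value is seen: "compile" it
		s.variants[n] = f
	}
	return f(x)
}

func main() {
	s := &specializer{variants: make(map[int]func(float64) float64)}
	fmt.Println(powGeneric(2, 10), s.pow(2, 10)) // 1024 1024
	fmt.Println(s.pow(3, 10), len(s.variants))   // 59049 1
}
```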

It's reasonable to assume so, as more things need to be computed at runtime.

Actually, it's difficult to say because many of the benchmarks used are not that representative. And with more sophisticated execution environments, like HotSpot JVM, differences are getting less and less relevant. Take a look at following article:
Java theory and practice: Dynamic compilation and performance measurement

What makes Ruby slow? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Ruby is slow at certain things. But what parts of it are the most problematic?
How much does the garbage collector affect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how ruby implements basic math?
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Has there been recent work that has significantly improved performance?

Ruby is slow. But what parts of it are the most problematic?
It does "late lookup" for methods, to allow for flexibility. This slows it down quite a bit. It also has to remember variable names per context to allow for eval, so its frames and method calls are slower. It also currently lacks a good JIT compiler, though MRI 1.9 has a bytecode compiler (which is better), and JRuby compiles it down to Java bytecode, which then (can) be compiled by the HotSpot JVM's JIT compiler, but it ends up being about the same speed as 1.9.
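Python resolves methods late in the same way, so here is a small Python sketch (not Ruby) of why a call site cannot simply be bound once to a fixed function: redefining a method on the class after an object exists changes what an existing call does. The class `Greeter` and function `shout` are hypothetical; in Ruby, reopening the class and redefining the method has the same effect.

```python
class Greeter:
    def greet(self):
        return "hello"

g = Greeter()
print(g.greet())        # hello

# Redefine the method on the class after the object already exists.
def shout(self):
    return "HELLO"

Greeter.greet = shout
print(g.greet())        # HELLO: the name is resolved again at call time
```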
How much does the garbage collector affect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
From some of the graphs at http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/, I'd say it takes about 10%, which is quite a bit. You can reduce that hit by increasing the malloc_limit in gc.c and recompiling.
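If you want a rough number for your own workload rather than reading it off graphs, one approach is to hook the collector's start and stop events and accumulate the time in between. The sketch below does this for CPython's cyclic collector via `gc.callbacks`, as an analogue only (it says nothing about Ruby's GC); `GCTimer` and `make_garbage` are hypothetical names, and the percentage for this synthetic workload is not comparable to the 10% figure above.

```python
import gc
import time

class GCTimer:
    """Accumulates time spent inside CPython's cyclic garbage collector."""
    def __init__(self):
        self.total = 0.0
        self.collections = 0
        self._start = None

    def __call__(self, phase, info):
        if phase == "start":
            self._start = time.perf_counter()
        elif phase == "stop" and self._start is not None:
            self.total += time.perf_counter() - self._start
            self.collections += 1
            self._start = None

def make_garbage(n):
    # Build many small reference cycles so the cyclic collector has work to do.
    for _ in range(n):
        a = []
        a.append(a)

timer = GCTimer()
gc.callbacks.append(timer)
try:
    t0 = time.perf_counter()
    make_garbage(100_000)
    elapsed = time.perf_counter() - t0
finally:
    gc.callbacks.remove(timer)

print(f"{timer.collections} collections, "
      f"{timer.total / elapsed:.1%} of run time in the cyclic GC")
```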
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how Ruby implements basic math?
Ruby 1.8 "didn't" implement basic math; it implemented Numeric classes, and you'd call methods like Fixnum#+ and Fixnum#/ once per operation, which was slow. Ruby 1.9 cheats a bit by inlining some of the basic math ops.
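A simplified Python model of the difference: `send_add` treats every `+` as a method looked up by name, while `add_with_fast_path` checks for two plain integers first and skips the lookup. Both names are hypothetical and this is not MRI's actual implementation; in particular, a real fast path must also confirm that Integer#+ has not been redefined, and this sketch ignores reflected operators such as `__radd__`.

```python
def send_add(a, b):
    # 1.8-style model: every "+" is a full method call, looked up by name.
    # (Ignores reflected operators like __radd__ for mixed types.)
    return getattr(type(a), "__add__")(a, b)

def add_with_fast_path(a, b):
    # 1.9-style model (simplified): if both operands are plain ints, do the
    # arithmetic directly and skip the method lookup entirely.
    if type(a) is int and type(b) is int:
        return a + b
    return send_add(a, b)

print(add_with_fast_path(2, 3), add_with_fast_path(1.5, 2.0))  # 5 3.5
```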
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Things like eval are hard to implement efficiently, though I'm sure much work can be done. The kicker for Ruby is that it has to accommodate somebody in another thread spontaneously changing the definition of a class, so it has to be very conservative.
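One common way for a runtime to stay correct while still avoiding a full by-name lookup on every call is a per-call-site cache guarded by a class version number that is bumped on every redefinition. The Python sketch below is a hypothetical model of that idea (`ToyClass` and `CallSite` are illustrative names, not MRI internals). The version check on every call is part of the "conservative" cost, and a real runtime would additionally have to make the check safe when another thread redefines the method concurrently, which this single-threaded sketch does not address.

```python
class ToyClass:
    """Methods live in a dict; `version` is bumped on every redefinition."""
    def __init__(self, methods):
        self.methods = dict(methods)
        self.version = 0

    def define(self, name, fn):
        self.methods[name] = fn
        self.version += 1   # invalidates every cached lookup for this class

class CallSite:
    """Caches one method lookup, guarded by the class version."""
    def __init__(self, name):
        self.name = name
        self.cached_version = None
        self.cached_fn = None
        self.misses = 0

    def call(self, klass, *args):
        if self.cached_version != klass.version:
            # Cache miss: do the slow by-name lookup and remember the result.
            self.cached_fn = klass.methods[self.name]
            self.cached_version = klass.version
            self.misses += 1
        return self.cached_fn(*args)

k = ToyClass({"double": lambda x: 2 * x})
site = CallSite("double")
print(site.call(k, 3), site.call(k, 4), site.misses)   # 6 8 1
k.define("double", lambda x: x + x + 1)   # someone redefines the method
print(site.call(k, 3), site.misses)       # 7 2
```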
Has there been recent work that has significantly improved performance?
1.9 gives roughly a 2x speedup. It's also more space-efficient. JRuby is constantly improving speed-wise [and probably spends less time in the GC than KRI]. Besides that, I'm not aware of much except a few little hobby things I've been working on. Note also that 1.9's strings are at times slower because of their encoding awareness.

Ruby is very good for delivering solutions quickly. Less so for delivering quick solutions. It depends what kind of problem you're trying to solve. I'm reminded of the discussions on the old CompuServe MSBASIC forum in the early 90s: when asked which was faster for Windows development, VB or C, the usual answer was "VB, by about 6 months".
In its MRI 1.8 form, Ruby is - relatively - slow to perform some types of computationally-intensive tasks. Pretty much any interpreted language suffers in that way in comparison to most mainstream compiled languages.
The reasons are several: some fairly easily addressable (the primitive garbage collection in 1.8, for example), some less so.
1.9 addresses some of the issues, although it's probably going to be some time before it becomes generally available. Some of the other implementations that target pre-existing runtimes, such as JRuby, IronRuby, and MagLev, have the potential to be significantly quicker.
Regarding mathematical performance, I wouldn't be surprised to see fairly slow throughput: it's part of the price you pay for arbitrary precision. Again, pick your problem. I've solved 70+ of the Project Euler problems in Ruby with almost no solution taking more than a minute to run. How fast do you need it to run, and how soon do you need it?
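To illustrate what arbitrary precision buys (and costs), here is a small Python sketch; Python integers are used as a stand-in for Ruby's automatic Fixnum-to-Bignum promotion. `wrap_int64` is a hypothetical helper that shows what a fixed-width 64-bit register would hold instead, and the memory-size comparison is a CPython detail.

```python
import math
import sys

def wrap_int64(n):
    """Reduce n to a signed 64-bit value, as fixed-width machine arithmetic would."""
    n &= (1 << 64) - 1
    return n - (1 << 64) if n >= (1 << 63) else n

exact = math.factorial(25)      # silently promoted to a big integer
wrapped = wrap_int64(exact)     # what a 64-bit register would hold instead

print(exact)                    # 15511210043330985984000000
print(wrapped == exact)         # False
# In CPython, bigger integers take more memory (and more work per operation).
print(sys.getsizeof(1) < sys.getsizeof(exact))  # True
```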

The most problematic part is "everyone".
Bonus points if that "everyone" didn't really use the language, ever.
Seriously, 1.9 is much faster and is now on par with Python, and JRuby is faster than Jython.
Garbage collectors are everywhere; for example, Java has one, and it's faster than C++ at dynamic memory handling. Ruby isn't well suited for number crunching, but few languages are, so if you have computationally intensive parts in your program, in any language, you'd better rewrite them in C. (Java is fast at math due to its primitive types, but it paid dearly for them: they're clearly the #1 ugliest part of the language.)
As for dynamic features: they aren't fast, but code without them in static languages can be even slower. For example, Java would use an XML config where Ruby would use a DSL, and it would likely be SLOWER, since XML parsing is costly.

Hmm - I worked on a project a few years ago where I scraped the bottom of the barrel with Ruby performance, and I'm not sure much has changed since. Right now it's caveat emptor: you have to know not to do certain things, and frankly, games/realtime applications would be one of them (since you mention OpenGL).
The culprit for killing interactive performance is the garbage collector - others here mention that Java and other environments have garbage collection too, but Ruby's has to stop the world to run. That is to say, it has to stop running your program, scan through every register and memory pointer from scratch, mark the memory that's still in use, and free the rest. The process can't be interrupted while this happens, and as you might have noticed, it can take hundreds of milliseconds.
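The mark-and-sweep process described above can be sketched as a toy collector in Python. Everything here (`Obj`, `mark`, `sweep`, `collect`) is hypothetical and heavily simplified: a real collector scans registers, stacks and globals for roots and walks actual memory, but the two phases and the stop-the-world nature are the same.

```python
class Obj:
    """A heap object that may point at other heap objects."""
    def __init__(self, name):
        self.name = name
        self.refs = []
        self.marked = False

def mark(roots):
    # Trace everything reachable from the roots (registers, stack, globals).
    stack = list(roots)
    while stack:
        obj = stack.pop()
        if not obj.marked:
            obj.marked = True
            stack.extend(obj.refs)

def sweep(heap):
    # Keep marked objects, drop the rest, and clear marks for the next cycle.
    survivors = []
    for obj in heap:
        if obj.marked:
            obj.marked = False
            survivors.append(obj)
    return survivors

def collect(heap, roots):
    # Stop-the-world: the program makes no progress until both phases finish.
    mark(roots)
    return sweep(heap)

a, b, c, d = (Obj(n) for n in "abcd")
a.refs.append(b)            # a -> b, reachable from the root
c.refs.append(d)
d.refs.append(c)            # c <-> d: a cycle nobody points at
heap = collect([a, b, c, d], roots=[a])
print([o.name for o in heap])   # ['a', 'b']
```

Note that the cost of the mark phase grows with the number of live objects and the sweep with the size of the heap, which is why creating fewer objects makes each pause shorter and less frequent.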
Its frequency and length of execution are proportional to the number of objects you create and destroy, but unless you disable it altogether, you have no control over it. In my experience, there were several unsatisfactory strategies for smoothing out my Ruby animation loop:
GC.disable / GC.enable around critical animation loops, and maybe an opportunistic GC.start to force it to run when it can't do any harm. (Because my target platform at the time was a 64MB Windows NT machine, this occasionally caused the system to run out of memory. But fundamentally it's a bad idea: unless you can pre-calculate how much memory you might need before doing this, you're risking memory exhaustion.)
Reduce the number of objects you create so the GC has less work to do (reduces the frequency / length of its execution)
Rewrite your animation loop in C (a cop-out, but the one I went with!)
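For comparison, the first strategy looks like this in Python, where `gc.disable()`, `gc.enable()` and `gc.collect()` play roughly the role of Ruby's GC.disable, GC.enable and GC.start. This is an analogue, not Ruby code: `gc_paused` and `render_frame` are hypothetical names, and CPython still frees most objects immediately through reference counting, so only the cycle collector is paused. The same memory-exhaustion risk applies if the paused section creates many reference cycles.

```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_paused(collect_after=True):
    """Temporarily disable CPython's cyclic garbage collector."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
        if collect_after:
            gc.collect()   # opportunistic collection at a "safe" moment

def render_frame(frame_no):
    # Hypothetical per-frame work that allocates short-lived objects.
    return [(frame_no, i) for i in range(100)]

with gc_paused():
    frames = [render_frame(n) for n in range(60)]

print(len(frames), gc.isenabled())   # 60 True
```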
These days I would probably also see if JRuby would work as an alternative runtime, as I believe it relies on Java's more sophisticated garbage collector.
The other major performance issue I found was basic I/O, when I tried to write a TFTP server in Ruby a while back (yeah, I pick all the best languages for my performance-critical projects; this was just an experiment). The absolute simplest, tightest loop to respond to one UDP packet with another, containing the next piece of a file, must have been about 20x slower than the stock C version. I suspect some improvements could have been made by using low-level IO (sysread etc.), but the slowness might simply be due to the lack of a low-level byte data type: every little read is copied out into a String. This is just speculation, though; I didn't take this project much further, but it warned me off relying on snappy I/O.
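The "copied out into a new string on every read" concern can be illustrated in Python, which does have a mutable byte type: `read()` allocates a fresh `bytes` object per chunk, while `readinto()` fills one preallocated `bytearray` in place. This is a Python analogue of the speculation above, not a measurement of Ruby; the function names, the 512-byte chunk size and the `io.BytesIO` stand-in for a socket or file are assumptions.

```python
import io

def checksum_copying(stream, chunk=512):
    total = 0
    while True:
        data = stream.read(chunk)      # allocates a new bytes object per read
        if not data:
            return total
        total += sum(data)

def checksum_reusing(stream, chunk=512):
    buf = bytearray(chunk)             # one buffer, reused for every read
    view = memoryview(buf)             # slicing a memoryview does not copy
    total = 0
    while True:
        n = stream.readinto(buf)       # fills the existing buffer in place
        if not n:
            return total
        total += sum(view[:n])

payload = bytes(range(256)) * 4
print(checksum_copying(io.BytesIO(payload)),
      checksum_reusing(io.BytesIO(payload)))   # 130560 130560
```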
The main recent speed increase, though I'm not fully up to date here, is that the virtual machine implementation was redone for 1.9, resulting in faster code execution. However, I don't think the GC has changed, and I'm pretty sure there's nothing new on the I/O front. But I'm not fully up to date on bleeding-edge Ruby, so someone else might want to chip in here.

I assume that you're asking, "what particular techniques in Ruby tend to be slow."
One is object instantiation. If you are doing large amounts of it, look at (reasonable) ways of reducing it, such as using the flyweight pattern, even if memory usage is not a problem. In one library that I reworked so it no longer created lots of very similar objects over and over again, I doubled the overall speed of the library.
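A minimal flyweight sketch in Python, assuming a text-rendering style workload (`Glyph` and `GlyphFactory` are hypothetical names): identical, immutable values are created once and then shared, so repeated requests cost a dictionary lookup instead of a new allocation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Glyph:
    """Immutable, so a single instance can safely be shared."""
    char: str
    font: str

class GlyphFactory:
    def __init__(self):
        self._pool = {}

    def get(self, char, font):
        key = (char, font)
        glyph = self._pool.get(key)
        if glyph is None:              # first request: create and remember it
            glyph = Glyph(char, font)
            self._pool[key] = glyph
        return glyph                   # later requests: reuse the same object

factory = GlyphFactory()
text = [factory.get(c, "serif") for c in "hello world"]
print(len(text), len(factory._pool))   # 11 8
```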

Steve Dekorte: "Writing a Mandelbrot set calculator in a high level language is like trying to run the Indy 500 in a bus."
http://www.dekorte.com/blog/blog.cgi?do=item&id=4047
I recommend learning various tools in order to use the right tool for the job. Matrix transformations can be done efficiently using a high-level API that wraps tight loops of arithmetic-intensive computation. See the RubyInline gem for an example of embedding C or C++ code in a Ruby script.
There is also the Io language, which is much slower than Ruby, but it efficiently renders movies at Pixar and outperforms raw C on vector arithmetic by using SIMD acceleration.
http://iolanguage.com
https://renderman.pixar.com/products/tools/it.html
http://iolanguage.com/scm/git/checkout/Io/docs/IoGuide.html#Primitives-Vector

Ruby 1.9.1 is about twice as fast as PHP, and a little bit faster than Perl, according to some benchmarks.
(Update: My source is this (screenshot). I don't know what his source is, though.)
Ruby is not slow. The old 1.8 is, but the current Ruby isn't.

Ruby is slow because it was designed to optimize the programmer's experience, not the program's execution time. Slowness is just a symptom of that design decision. If you would prefer performance to pleasure, you should probably use a different language. Ruby's not for everything.

IMO, dynamic languages are all slow in general. They do some things at runtime that static languages do at compile time:
syntax checking, interpreting, and things like type checking and conversion. This is inevitable, so Ruby is slower than C/C++/Java. Correct me if I am wrong.

