How can JVM implementations like Jython and JRuby beat their native counterparts?

How can JVM implementations like Jython and JRuby beat their native counterparts? - performance

I was watching this video here, where Robert Nicholson discusses P8, an implementation of PHP on the JVM. At some point he mentions that they aim to surpass native PHP in performance some time in the future.
He mentions JRuby and Jython, which started out slower than their native counterparts, but eventually surpassed them. Quercus, another PHP interpreter on the JVM claims to be 4x faster than mod_php and is also worth of note.
Does that mean that the general idea that the JVM is slower than C is wrong, or are there flaws in the original C implementations?

Does that mean that the general idea that the JVM is slower than C is wrong, or are there >flaws in the original C implementations?
A bit of both
The JVM has been around for a long time and has made significant progress in efficiency. The garbage collection, jitting, caching and other areas are more advanced than in 'reference' implementations such as PHP.
Anyone taking a look under the hood of PHP will understand why efficiency gains are easy to achieve.
I am personally doubtful that the JVM can outperform the CPython however ... but I could be wrong ... I am, this is down to the JVM GC being faster, and IronPython too. Performance improvements may be a non-reliance on the C call stack such as in stackless Python. The Jython site states
Jython is approximately as fast as CPython--sometimes faster, sometimes slower. Because >most JVMs--certainly the fastest ones--do long running, hot code will run faster overtime.
Which I can appricate as fact as the JVM will reach C performance levels as caches generate and so on basically negate the higher level aspects to the VM implementation code (a large part of which is written in C anyway)
In many interpreted languages such as PHP and Python are just bridges to equivalent C calls and dives into machine code. In the JVM, the Jitter performs a similar function by reducing the bytecode to machine-code equivalents. Eventually, the intermediate representations such as the high-level syntax and bytecode are usually reduced to C-speed or faster CPU operations anyway ... so it is all the same, just more intermediate steps which only affects the latency to full efficiency when loading new code. There comes a point in RAM where you say "what is the real difference?" and the answer is only the process that gets it there and the final representation that determines the speed of stack winding, garbage collection algorithms, register usage and logic representation such as arithmetic.

It's not too hard. If you write your implementation in C you have to write your own GC, JIT and more (to be fast and efficient). To do that really good you need really smart people with a lot of experience and give them a lot of time.
I will go out on a limb here and say that the current implementation of PHP (not based not on the knowledge of the inner working but rather on the benchmarks I have seen and on stuff people who know more about PHP told me) is not state of the art. Facebook tries to address this but they do it in a uncommon way (because of there special needs and the typical use of PHP see http://www.stanford.edu/class/ee380/Abstracts/100505.html).
Summary:
So if somebody implements PHP in java (or on any fast VM) he doesn't need to write a super GC or JIT to be fast "only" a compiler (which can be simple).

There are some hints about what the virtual machine does here. For example, it looks like the Java Virtual Machine first checks which parts of the bytecode are executed most often and then compiles the relevant parts into native code (which then should then execute with similar speed as e.g. compiled C code).
By the way, does PHP compile into bytecode or is it just interpreted using an in-memory data structure ? By translating PHP first into bytecode executable by the Java virtual machine, one benefits automatically from the existing (language-agnostic) optimizations of bytecode execution.

Related

rules for exploiting pypy performance speedup vs python

I have written a python program and run it both with pypy and python
i have inserted some timing prints to measure the performance difference.
in some cases the speedup is 10X, in other there is no change.
Can anyone explain if there are rules to follow when writing a program
in order to exploit the potential speedup given by pypy?
E.G. avoid some syntax, prefer some data strcutures vs others....
I found a speedup between 12x and 1,5x

PyPy tends to be fast in pure numerically-intensive codes with hot loops dealing with (small) integers/float number as it can directly use native types instead of variable-sized dynamic integer/float objects. It will still be slower than a natively-compiled C/C++ code because it needs to check the types at runtime and compile the code at runtime.
PyPy does not like (large) dynamic codes. It uses a tracing just-in-time compiler that can track which part of the code are more likely to be executed and compile this path dynamically at runtime if it is executed often. When there are many path executed changing dynamically, the overhead of the JIT can be significant, and in the worst case, PyPy can choose not to compile any path. The thing is the PyPy fallback interpreter is slower than the one of CPython (due to the ability to trace and compile the code at runtime). Some dynamic features like frame introspection is supported but it is slow (since a code is not expected to use it massively).
PyPy is not fast for short-running scripts since the JIT has to compile the code at runtime and the overhead of the JIT or using the fallback interpreter (slower than CPython) can be higher than just interpreting the code with CPython for such scripts.
PyPy use a Garbage Collector (GC) as opposed to CPython which use Automatic Reference Counting (ARC). GCs can be faster to allocate/free many objects (especially small temporary objects), but they needs to track the object alive to know which one are dead and then free them. This means codes dealing with a huge amount of references and regular object allocations can actually be slower. This includes dynamic graph-based data structures and trees for example.
The C binding APIs (C extension and CTypes but not CFFI) tends to be slower than CPython (mainly because it has been designed for CPython in the first place). This means glue codes calling a lot of wrapped C function will actually be slower with PyPy. A lot of work has been done recently to significantly improve the performance of PyPy in this case, but AFAIK PyPy is still slower. An example of use case is operation on large Numpy arrays (for small ones, embedded JITs like Numba are certainly better), as well as the CSV and pickling packages.
String operations tends to be often slower, and especially string concatenation. One reason is that CPython use efficient algorithms for string operations that are pretty-well optimized and written in C at the expense of a large and complex code base. This is a significant work for the small PyPy teams which needs to reimplement this and maintain it with the additional complexity of the JIT and the GC. As a result, operations can be less well optimized. Regarding the concatenation, the inefficiency comes from the JIT which cannot optimize out intermediate copies. That being said, note that string-appending loops should be avoided anyway.
Generators tends to be slower than simple basic loops. The simpler, the better. One should not expect the JIT to perform complex expensive optimizations at runtime since the overhead of the JIT should not be too big compared to the rest of the code (and PyPy do not know the time taken by the code ahead of time).
Global variables are slow in CPython but not in PyPy. That being said, they should not be used for software engineering reasons anyway.
This is a pretty broad topic. There are many other interesting point to consider to evaluate the performance of a given Python code. For more information, please read:
https://www.pypy.org/performance.html (performance tips)
https://speed.pypy.org (benchmark)
https://www.pypy.org/blog (articles about the development of PyPy)

There are no hard and fast rules. There are some hints in the PyPy FAQ https://doc.pypy.org/en/latest/faq.html#how-fast-is-pypy. PyPy can JIT hot python code, and may be able to make lists, dictionaries, and tuples that store one type of object (int, float, string) more efficient.

Performance of Google's Go?

So has anyone used Google's Go? I was wondering how the mathematical performance (e.g. flops) is compared to other languages with a garbage collector... like Java or .NET?
Has anyone investigated this?

Theoretical performance: The theoretical performance of pure Go programs is somewhere between C/C++ and Java. This assumes an advanced optimizing compiler and it also assumes the programmer takes advantage of all features of the language (be it C, C++, Java or Go) and refactors the code to fit the programming language.
Practical performance (as of July 2011): The standard Go compiler (5g/6g/8g) is currently unable to generate efficient instruction streams for high-performance numerical codes, so the performance will be lower than C/C++ or Java. There are multiple reasons for this: each function call has an overhead of a couple of additional instructions (compared to C/C++ or Java), no function inlining, average-quality register allocation, average-quality garbage collector, limited ability to erase bound checks, no access to vector instructions from Go, compiler has no support for SSE2 on 32-bit x86 CPUs, etc.
Bottom line: As a rule of thumb, expect the performance of numerical codes implemented in pure Go, compiled by 5g/6g/8g, to be 2 times lower than C/C++ or Java. Expect the performance to get better in the future.
Practical performance (September 2013): Compared to older Go from July 2011, Go 1.1.2 is capable of generating more efficient numerical codes but they remain to run slightly slower than C/C++ and Java. The compiler utilizes SSE2 instructions even on 32-bit x86 CPUs which causes 32-bit numerical codes to run much faster, most likely thanks to better register allocation. The compiler now implements function inlining and escape analysis. The garbage collector has also been improved but it remains to be less advanced than Java's garbage collector. There is still no support for accessing vector instructions from Go.
Bottom line: The performance gap seems sufficiently small for Go to be an alternative to C/C++ and Java in numerical computing, unless the competing implementation is using vector instructions.

The Go math package is largely written in assembler for performance.
Benchmarks are often unreliable and are subject to interpretation. For example, Robert Hundt's paper Loop Recognition in C++/Java/Go/Scala looks flawed. The Go blog post on Profiling Go Programs dissects Hundt's claims.

You're actually asking several different questions. First of all, Go's math performance is going to be about as fast as anything else. Any language that compiles down to native code (which arguably includes even JIT languages like .NET) is going to perform extremely well at raw math -- as fast as the machine can go. Simple math operations are very easy to compile into a zero-overhead form. This is the area where compiled (including JIT) languages have a advantage over interpreted ones.
The other question you asked was about garbage collection. This is, to a certain extent, a bit of a side issue if you're talking about heavy math. That's not to say that GC doesn't impact performance -- actually it impacts quite a bit. But the common solution for tight loops is to avoid or minimize GC sweeps. This is often quite simple if you're doing a tight loop -- you just re-use your old variables instead of constantly allocating and discarding them. This can speed your code by several orders of magnitude.
As for the GC implementations themselves -- Go and .NET both use mark-and-sweep garbage collection. Microsoft has put a lot of focus and engineering into their GC engine, and I'm obliged to think that it's quite good all things considered. Go's GC engine is a work in progress, and while it doesn't feel any slower than .NET's architecture, the Golang folks insist that it needs some work. The fact that Go's specification disallows destructors goes a long way in speeding things up, which may be why it doesn't seem that slow.
Finally, in my own anecdotal experience, I've found Go to be extremely fast. I've written very simple and easy programs that have stood up in my own benchmarks against highly-optimized C code from some long-standing and well-respected open source projects that pride themselves on performance.
The catch is that not all Go code is going to be efficient, just like not all C code is efficient. You've got to build it correctly, which often means doing things differently than what you're used to from other languages. The profiling blog post mentioned here several times is a good example of that.

Google did a study comparing Go to some other popular languages (C++, Java, Scala). They concluded it was not as strong performance-wise:
https://days2011.scala-lang.org/sites/days2011/files/ws3-1-Hundt.pdf
Quote from the Conclusion, about Go:
Go offers interesting language features, which also allow for a concise and standardized notation. The compilers for this language are still immature, which reﬂects in both performance and binary sizes.

Questions about possible java(or other memory managed language) optimizations

From what I have read java (usually) seems to compile java to not very (is at all?) optimised java bytecode, leaving it to the jit to optimise. Is this true? And if it is has there been any exploration (possibly in alternative implementations) of getting the compiler to optimise the code so the jit has less work to do (is this possible)?
Also many people seem to have a dislike for native code generation (sometimes referred to as ahead of time compilation) for Java (and many other high level memory managed languages) , for many reasons such as loss of portability (and ect.) , but also partially because (at least for those languages that have a just in time compiler) the thinking goes that ahead of time compilation to machine code will miss the possible optimisations that can be done by a jit compiler and therefore may be slower in the long run.
This leads me to wonder whether anyone has ever tried to implement http://en.wikipedia.org/wiki/Profile-guided_optimization (compiling to a binary + some extras then running the program and analysing the runtime information of the test run to generate a hopefully more optimised binary for real world usage) for java/(other memory managed languages) and how this would compare to jit code? Anyone have a clue?

Personally, I think the big difference is not between JIT compiling and AOT compiling, but between class-compilation and whole-program optimization.
When you run javac, it only looks at a single .java file, compiling it into a single .class file. All the interface implementations and virtual methods and overrides are checked for validity but left unresolved (because it's impossible to know the true method invocation targets without analyzing the whole program).
The JVM uses "runtime loading and linking" to assemble all of your classes into a coherent program (and any class in your program can invoke specialized behavior to change the default loading/linking behavior).
But then, at runtime, the JVM can remove the vast majority of virtual methods. It can inline all of your getters and setters, turning them into raw fields. And when those raw fields are inlined, it can perform constant-propagation to further optimize the code. (At runtime, there's no such thing as a private field.) And if there's only one thread running, the JVM can eliminate all synchronization primitives.
To make a long story short, there are a lot of optimizations that aren't possible without analyzing the whole program, and the best time for doing whole program analysis is at runtime.

Profile-guided optimization has some caveats, one of them mentioned even in the Wiki article you linked. It's results are valid
for the given samples, representing how your code is actually used by the user or other code.
for the given platform (CPU, memory + other hardware, OS, whatever).
From the performance point of view there are quite big differences even among platforms that are usually considered (more or less) the same (e.g. compare a single core, old Athlon with 512M with a 6 core Intel with 8G, running on Linux, but with very different kernel versions).
for the given JVM and its config.
If any of these change then your profiling results (and the optimizations based on them) are not necessary valid any more. Most likely some of the optimizations will still have a beneficial effect, but some of them may turn out suboptimal (or even degrading performance).
As it was mentioned the JIT JVMs do something very similar to profiling, but they do it on the fly. It's also called 'hotspot', because it constantly monitors the executed code, looks for hot spots that are executed frequently and will try to optimize only those parts. At this point it will be able to exploit more knowledge about the code (knowing the context of it, how it is used by other classes, etc.) so - as mentioned by you and the other answers - it can do better optimizations as a static one. It will continue monitoring and if its needed it will do another turn of optimization later, this time trying even harder (looking for more, more expensive optimizations).
Working on the real life data (usage statistics + platform + config) it can avoid the caveats mentioned before.
The price of it is some additional time it needs to spend on "profiling" + JIT-ing. Most of the time its spent quite well.
I guess a profile-guided optimizer could still compete with it (or even beat it), but only in some special cases, if you can avoid the caveats:
you are quite sure that your samples represent the real life scenario well and they won't change too much during execution.
you know your target platform quite precisely and can do the profiling on it.
and of course you know/control the JVM and its config.
It will happen rarely and I guess in general JIT will give you better results, but I have no evidence for it.
Another possibility for getting value from the profile-guided optimization if you target a JVM that can't do JIT optimization (I think most small devices have such a JVM).
BTW one disadvantage mentioned in other answers would be quite easy to avoid: if static/profile guided optimization is slow (which is probably the case) then do it only for releases (or RCs going to testers) or during nightly builds (where time does not matter so much).
I think the much bigger problem would be to have good sample test cases. Creating and maintaining them is usually not easy and takes a lot of time. Especially if you want to be able to execute them automatically, which would be quite essential in this case.

The official Java Hot Spot compiler does "adaptive optimisation" at runtime, which is essentially the same as the profile-guided optimisation you mentioned. This has been a feature of at least this particular Java implementation for a long time.
The trade-off to performing more static analysis or optimisation passes up-front at compile time is essentially the (ever-diminishing) returns you get from this extra effort against the time it takes for the compiler to run. A compiler like MLton (for Standard ML) is a whole-program optimising compiler with a lot of static checks. It produces very good code, but becomes very, very slow on medium-to-large programs, even on a fast system.
So the Java approach seems to be to use JIT and adaptive optimisation as much as possible, with the initial compilation pass just producing an acceptable valid binary. The absolute opposite end is to use an approach like that of something like MLKit, which does a lot of static inference of regions and memory behaviour.

Why are Interpreted Languages Slow?

I was reading about the pros and cons of interpreted languages, and one of the most common cons is the slowness, but why are programs in interpreted languages slow?

Native programs runs using instructions written for the processor they run on.
Interpreted languages are just that, "interpreted". Some other form of instruction is read, and interpreted, by a runtime, which in turn executes native machine instructions.
Think of it this way. If you can talk in your native language to someone, that would generally work faster than having an interpreter having to translate your language into some other language for the listener to understand.
Note that what I am describing above is for when a language is running in an interpreter. There are interpreters for many languages that there is also native linkers for that build native machine instructions. The speed reduction (however the size of that might be) only applies to the interpreted context.
So, it is slightly incorrect to say that the language is slow, rather it is the context in which it is running that is slow.
C# is not an interpreted language, even though it employs an intermediate language (IL), this is JITted to native instructions before being executed, so it has some of the same speed reduction, but not all of it, but I'd bet that if you built a fully fledged interpreter for C# or C++, it would run slower as well.
And just to be clear, when I say "slow", that is of course a relative term.

All answers seem to miss the real important point here. It's the detail how "interpreted" code is implemented.
Interpreted script languages are slower because their method, object and global variable space model is dynamic. In my opinion this is the real definition of of script language not the fact that it is interpreted. This requires many extra hash-table lookups on each access to a variable or method call. And its the main reason why they are all terrible at multithreading and using a GIL (Global Interpreter Lock). This lookups is where most of the time is spent. It is a painful random memory lookup, which really hurts when you get a L1/L2 cache-miss.
Google's Javascript Core8 is so fast and targeting almost C speed for a simple optimization: they take the object data model as fixed and create internal code to access it like the data structure of a native compiled program. When a new variable or method is added or removed then the whole compiled code is discarded and compiled again.
The technique is well explained in the Deutsch/Schiffman paper "Efficient Implementation of the Smalltalk-80 System".
The question why php, python and ruby aren't doing this is pretty simple to answer: the technique is extremely complicated to implement.
And only Google has the money to pay for JavaScript because a fast browser-based JavaScript interpreter is their fundamental need of their billion dollar business model.

Think of the interpeter as an emulator for a machine you don't happen to have
The short answer is that the compiled languages are executed by machine instructions whereas the interpreted ones are executed by a program (written in a compiled language) that reads either the source or a bytecode and then essentially emulates a hypothetical machine that would have run the program directly if the machine existed.
Think of the interpreted runtime as an emulator for a machine that you don't happen to actually have around at the moment.
This is obviously complicated by the JIT (Just In Time) compilers that Java, C#, and others have. In theory, they are just as good as "AOT" ("At One Time") compilers but in practice those languages run slower and are handicapped by needing to have the compiler around using up memory and time at the program's runtime. But if you say any of that here on SO be prepared to attract rabid JIT defenders who insist that there is no theoretical difference between JIT and AOT. If you ask them if Java and C# are as fast as C and C++, then they start making excuses and kind of calm down a little. :-)
So, C++ totally rules in games where the maximum amount of available computing can always be put to use.
On the desktop and web, information-oriented tasks are often done by languages with more abstraction or at least less compilation, because the computers are very fast and the problems are not computationally intensive, so we can spend some time on goals like time-to-market, programmer productivity, reliable memory-safe environments, dynamic modularity, and other powerful tools.

This is a good question, but should be formulated a little different in my opinion, for example: "Why are interpreted languages slower than compiled languages?"
I think it is a common misconception that interpreted languages are slow per se. Interpreted languages are not slow, but, depending on the use case, might be slower than the compiled version. In most cases interpreted languages are actually fast enough!
"Fast enough", plus the increase in productivity from using a language like Python over, for example, C should be justification enough to consider an interpreted language. Also, you can always replace certain parts of your interpreted program with a fast C implementation, if you really need speed. But then again, measure first and determine if speed is really the problem, then optimize.

In addition to the other answers there's optimization: when you're compiling a programme, you don't usually care how long it takes to compile - the compiler has lots of time to optimize your code. When you're interpreting code, it has to be done very quickly so some of the more clever optimizations might not be able to be made.

Loop a 100 times, the contents of the loop are interpreted 100 times into low level code.
Not cached, not reused, not optimised.
In simple terms, a compiler interprets once into low level code
Edit, after comments:
JIT is compiled code, not interpreted. It's just compiled later not up-front
I refer to the classical definition, not modern practical implementations

A simple question, without any real simple answer. The bottom line is that all computers really "understand" is binary instructions, which is what "fast" languages like C are compiled into.
Then there are virtual machines, which understand different binary instructions (like Java and .NET) but those have to be translated on the fly to machine instructions by a Just-In-Compiler (JIT). That is almost as fast (even faster in some specific cases because the JIT has more information than a static compiler on how the code is being used.)
Then there are interpreted languages, which usually also have their own intermediate binary instructions, but the interpreter functions much like a loop with a large switch statement in it with a case for every instruction, and how to execute it. This level of abstraction over the underlying machine code is slow. There are more instructions involved, long chains of function calls in the interpreter to do even simple things, and it can be argued that the memory and cache aren't used as effectively as a result.
But interpreted languages are often fast enough for the purposes for which they're used. Web applications are invariably bound by IO (usually database access) which is an order of magnitude slower than any interpreter.

From about.com:
An Interpreted language is processed
at runtime. Every line is read,
analysed, and executed. Having to
reprocess a line every time in a loop
is what makes interpreted languages so
slow. This overhead means that
interpreted code runs between 5 - 10
times slower than compiled code. The
interpreted languages like Basic or
JavaScript are the slowest. Their
advantage is not needing to be
recompiled after changes and that is
handy when you're learning to program.
The 5-10 times slower is not necessarily true for languages like Java and C#, however. They are interpreted, but the just-in-time compilers can generate machine language instructions for some operations, speeding things up dramatically (near the speed of a compiled language at times).

There is no such thing as an interpreted language. Any language can be implemented by an interpreter or a compiler. These days most languages have implementations using a compiler.
That said, interpreters are usually slower, because they need process the language or something rather close to it at runtime and translate it to machine instructions. A compiler does this translation to machine instructions only once, after that they are executed directly.

Yeah, interpreted languages are slow...
However, consider the following. I had a problem to solve. It took me 4 minutes to solve the problem in Python, and the program took 0.15 seconds to run. Then I tried to write it in C, and I got a runtime of 0.12 seconds, and it took me 1 hour to write it. All this because the practical way to solve problem in question was to use hashtables, and the hashtable dominated the runtime anyway.

Interpreted languages need to read and interpret your source code at execution time. With compiled code a lot of that interpretation is done ahead of time (at compilation time).

Very few contemporary scripting languages are "interpreted" these days; they're typically compiled on the fly, either into machine code or into some intermediate bytecode language, which is (more efficiently) executed in a virtual machine.
Having said that, they're slower because your cpu is executing many more instructions per "line of code", since many of the instructions are spent understanding the code rather than doing whatever the semantics of the line suggest!

Read this Pros And Cons Of Interpreted Languages
This is the relevant idea in that post to your problem.
An execution by an interpreter is
usually much less efficient then
regular program execution. It happens
because either every instruction
should pass an interpretation at
runtime or as in newer
implementations, the code has to be
compiled to an intermediate
representation before every execution.

For the same reason that it's slower to talk via translator than in native language. Or, reading with dictionary. It takes time to translate.
Update: no, I didn't see that my answer is the same as the accepted one, to a degree ;-)

Wikipedia says,
Interpreting code is slower than running the compiled code because the interpreter must analyze each statement in the program each time it is executed and then perform the desired action, whereas the compiled code just performs the action within a fixed context determined by the compilation. This run-time analysis is known as "interpretive overhead". Access to variables is also slower in an interpreter because the mapping of identifiers to storage locations must be done repeatedly at run-time rather than at compile time.
Refer this IBM doc,
Interpreted program must be translated each time it is executed, there is a higher overhead. Thus, an interpreted language is generally more suited to ad hoc requests than predefined requests.
In Java though it is considered as an interpreted language, It uses JIT (Just-in-Time) compilation which mitigate the above issue by using a caching technique to cache the compiled bytecode.
The JIT compiler reads the bytecodes in many sections (or in full, rarely) and compiles them dynamically into machine code so the program can run faster. This can be done per-file, per-function or even on any arbitrary code fragment; the code can be compiled when it is about to be executed (hence the name "just-in-time"), and then cached and reused later without needing to be recompiled.

What makes Ruby slow? [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 8 years ago.
Improve this question
Ruby is slow at certain things. But what parts of it are the most problematic?
How much does the garbage collector affect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how ruby implements basic math?
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Has there been recent work that has significantly improved performance?

Ruby is slow. But what parts of it are the most problematic?
It does "late lookup" for methods, to allow for flexibility. This slows it down quite a bit. It also has to remember variable names per context to allow for eval, so its frames and method calls are slower. Also it lacks a good JIT compiler currently, though MRI 1.9 has a bytecode compiler (which is better), and jruby compiles it down to java bytecode, which then (can) compile via the HotSpot JVM's JIT compiler, but it ends up being about the same speed as 1.9.
How much does the garbage collector effect performance? I know I've had times when running the garbage collector alone took several seconds, especially when working with OpenGL libraries.
from some of the graphs at http://www.igvita.com/2009/06/13/profiling-ruby-with-googles-perftools/ I'd say it takes about 10% which is quite a bit--you can decrease that hit by increasing the malloc_limit in gc.c and recompiling.
I've used matrix math libraries with Ruby that were particularly slow. Is there an issue with how ruby implements basic math?
Ruby 1.8 "didn't" implement basic math it implemented Numeric classes and you'd call things like Fixnum#+ Fixnum#/ once per call--which was slow. Ruby 1.9 cheats a bit by inlining some of the basic math ops.
Are there any dynamic features in Ruby that simply cannot be implemented efficiently? If so, how do other languages like Lua and Python solve these problems?
Things like eval are hard to implement efficiently, though much work can be done, I'm sure. The kicker for Ruby is that it has to accomodate for somebody in another thread changing the definition of a class spontaneously, so it has to be very conservative.
Has there been recent work that has significantly improved performance?
1.9 is like a 2x speedup. It's also more space efficient. JRuby is constantly trying to improve speed-wise [and probably spends less time in the GC than KRI]. Besides that I'm not aware of much except little hobby things I've been working on. Note also that 1.9's strings are at times slower because of encoding friendliness.

Ruby is very good for delivering solutions quickly. Less so for delivering quick solutions. It depends what kind of problem you're trying to solve. I'm reminded of the discussions on the old CompuServe MSBASIC forum in the early 90s: when asked which was faster for Windows development, VB or C, the usual answer was "VB, by about 6 months".
In its MRI 1.8 form, Ruby is - relatively - slow to perform some types of computationally-intensive tasks. Pretty much any interpreted language suffers in that way in comparison to most mainstream compiled languages.
The reasons are several: some fairly easily addressable (the primitive garbage collection in 1.8, for example), some less so.
1.9 addresses some of the issues, although it's probably going to be some time before it becomes generally available. Some of the other implementation that target pre-existing runtimes, JRuby, IronRuby, MagLev for example, have the potential to be significantly quicker.
Regarding mathematical performance, I wouldn't be surprised to see fairly slow throughput: it's part of the price you pay for arbitrary precision. Again, pick your problem. I've solved 70+ of the Project Euler problems in Ruby with almost no solution taking more than a mintue to run. How fast do you need it to run and how soon do you need it?

The most problematic part is "everyone".
Bonus points if that "everyone" didn't really use the language, ever.
Seriously, 1.9 is much faster and now is on par with python, and jruby is faster than jython.
Garbage collectors are everywhere; for example, Java has one, and it's faster than C++ on dynamic memory handling. Ruby isn't suited well for number crunching; but few languages are, so if you have computational-intensive parts in your program in any language, you better rewrite them in C (Java is fast with math due to its primitive types, but it paid dearly for them, they're clearly #1 in ugliest parts of the language).
As for dynamic features: they aren't fast, but code without them in static languages can be even slower; for example, java would use a XML config instead of Ruby using a DSL; and it would likely be SLOWER since XML parsing is costly.

Hmm - I worked on a project a few years ago where I scraped the barrel with Ruby performance, and I'm not sure much has changed since. Right now it's caveat emptor - you have to know not to do certain things, and frankly games / realtime applications would be one of them (since you mention OpenGL).
The culprit for killing interactive performance is the garbage collector - others here mention that Java and other environments have garbage collection too, but Ruby's has to stop the world to run. That is to say, it has to stop running your program, scan through every register and memory pointer from scratch, mark the memory that's still in use, and free the rest. The process can't be interrupted while this happens, and as you might have noticed, it can take hundreds of milliseconds.
Its frequency and length of execution is proportional to the number of objects you create and destroy, but unless you disable it altogether, you have no control. My experience was there were several unsatisfactory strategies to smooth out my Ruby animation loop:
GC.disable / GC.enable around critical animation loops and maybe an opportunistic GC.start to force it to go when it can't do any harm. (because my target platform at the time was a 64MB Windows NT machine, this caused the system to run out of memory occasionally. But fundamentally it's a bad idea - unless you can pre-calculate how much memory you might need before doing this, you're risking memory exhaustion)
Reduce the number of objects you create so the GC has less work to do (reduces the frequency / length of its execution)
Rewrite your animation loop in C (a cop-out, but the one I went with!)
These days I would probably also see if JRuby would work as an alternative runtime, as I believe it relies on Java's more sophisticated garbage collector.
The other major performance issue I've found is basic I/O when trying to write a TFTP server in Ruby a while back (yeah I pick all the best languages for my performance-critical projects this was was just an experiment). The absolute simplest tightest loop to simply respond to one UDP packet with another, contaning the next piece of a file, must have been about 20x slower than the stock C version. I suspect there might have been some improvements to make there based around using low-level IO (sysread etc.) but the slowness might just be in the fact there is no low-level byte data type - every little read is copied out into a String. This is just speculation though, I didn't take this project much further but it warned me off relying on snappy I/O.
The main speed recent increase that has gone on, though I'm not fully up-to-date here, is that the virtual machine implementation was redone for 1.9, resulting in faster code execution. However I don't think the GC has changed, and I'm pretty sure there's nothing new on the I/O front. But I'm not fully up-to-date on bleeding-edge Ruby so someone else might want to chip in here.

I assume that you're asking, "what particular techniques in Ruby tend to be slow."
One is object instantiation. If you are doing large amounts of it, you want to look at (reasonable) ways of reducing that, such as using the flyweight pattern, even if memory usage is not a problem. In one library where I reworked it not to be creating a lot of very similar objects over and over again, I doubled the overall speed of the library.

Steve Dekorte: "Writing a Mandelbrot set calculator in a high level language is like trying to run the Indy 500 in a bus."
http://www.dekorte.com/blog/blog.cgi?do=item&id=4047
I recommend to learn various tools in order to use the right tool for the job. Doing matrix transformations could be done efficiently using high-level API which wraps around tight loops with arithmetic-intensive computations. See RubyInline gem for an example of embedding C or C++ code into Ruby script.
There is also Io language which is much slower than Ruby, but it efficiently renders movies in Pixar and outperforms raw C on vector arithmetics by using SIMD acceleration.
http://iolanguage.com
https://renderman.pixar.com/products/tools/it.html
http://iolanguage.com/scm/git/checkout/Io/docs/IoGuide.html#Primitives-Vector

Ruby 1.9.1 is about twice as fast as PHP, and a little bit faster than Perl, according to some benchmarks.
(Update: My source is this (screenshot). I don't know what his source is, though.)
Ruby is not slow. The old 1.8 is, but the current Ruby isn't.

Ruby is slow because it was designed to optimize the programmers experience, not the program's execution time. Slowness is just a symptom of that design decision. If you would prefer performance to pleasure, you should probably use a different language. Ruby's not for everything.

IMO, dynamic languages are all slow in general. They do something in runtime that static languages do in compiling time.
Syntax Check, Interpreting and Like type checking, converting. this is inevitable, therefore ruby is slower than c/c++/java, correct me if I am wrong.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio