Heap exploitation with Glibc 2.12.1 - glibc

I've been searching with no good results.
I wonder whether the techniques explained in texts such as Malloc Maleficarum or Malloc Des-Maleficarum are still effective in glibc version 2.12.1.
The second text says the techniques were tested against glibc versions 2.7 and 2.8, so I don't really know whether they will work with my glibc version. Of course I could test them, but, first, the techniques are really difficult on their own, and, second, if they didn't work I wouldn't know whether it was because of the glibc version or my own mistake.
Moreover, I haven't found any actual heap exploit, and I couldn't find the changes introduced between these glibc versions either.
Thanks in advance.

As with my other questions on this topic, since nobody has answered, I will answer it myself in case it is useful to someone.
The first thing to say is that some of the techniques from the Malloc Maleficarum have already been patched. For example, the House of Mind was patched in glibc 2.11, so nowadays it is of no use.
But the most important thing is that the majority of the techniques in the MM require the address of a buffer placed in the heap, so those techniques are useless on systems with ASLR enabled (which is all of them?), unless you can find a memory leak. More importantly, if you are able to learn that buffer address, you don't need any of the MM techniques at all: you can use the old unlink technique (with some tricks).
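For reference, here is a rough sketch of the write primitive the old unlink technique abuses; this is simplified pseudo-glibc, not the actual source, and the field layout and names are only illustrative:

    /* Simplified model of an old (pre safe-unlinking) glibc free chunk. */
    struct chunk {
        size_t prev_size;
        size_t size;
        chunk *fd;   /* forward pointer in the free list  */
        chunk *bk;   /* backward pointer in the free list */
    };

    /* Roughly what the old unlink macro did when consolidating chunks. */
    static void unlink_chunk(chunk *p) {
        chunk *fwd = p->fd;
        chunk *bck = p->bk;
        fwd->bk = bck;   /* writes bck to fwd + offsetof(chunk, bk) */
        bck->fd = fwd;   /* writes fwd to bck + offsetof(chunk, fd) */
    }

    /* If an overflow lets you control p->fd and p->bk, the first store is a
       write-what-where (attacker value at attacker address), with the side
       effect that the second store also writes near the "value", so both
       locations must be writable. Modern glibc's "corrupted double-linked
       list" check is what kills this directly. */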
On the other hand, I've only found one exploit using one of the techniques explained in the MM (the House of Mind). I haven't tested it, so try it at your own risk [1].
Another thing to be said, in my opinion after doing some research: the MM was a mind-blowing document, but in practice the techniques explained in it are really difficult to apply to a real case. They have too many prerequisites, and if you fulfill some of them, you can fall back to the unlink technique and forget about all the MM headaches.
P.S.: I feel dirty when setting my own answers as correct...
[1] https://sites.google.com/site/felipeandresmanzano/popplerPOC.tar.bz2

Related

Rubinius or MRI 1.9.3 (YARV)?

So, I have a few questions that I have to ask. I did browse the internet, but there weren't too many reliable answers: mostly blog posts that would cancel each other out because they praised different things and had benchmarks to "prove their viewpoint" (I have never seen so many contradicting benchmarks in my life).
Anyway, my questions are:
Is Rubinius really faster? I was pretty impressed by this apparently honest pro-Rubinius presentation. Another thing that confuses me a little is that a lot of Rubinius is written in Ruby itself, yet somehow it is faster than C-Ruby? It must be a pretty damn good implementation of the language, then!
Does EventMachine work with Rubinius? As far as I know, EventMachine partially relies on Fibers (correct me if I'm wrong), which weren't implemented until 1.9. I know Rubinius will eventually support 1.9, too; I don't mind waiting a little.
Do C extensions work in Rubinius? I have written a C extension which "serializes" binary messages received from a TCP stream into Ruby objects and vice versa (I suppose the details are not important, but if it helps answer this question I will update the post). This can be a lot of messages! I managed to write the same code in Ruby (although it made little sense after a month), but it proved to be a real bottleneck in the application. So I had to use C as a "solution" to my problem.
EDIT: I just remembered, I use C for another task: it is a hit-test method for Arrays. Basically it just checks if a "point" is inside a polygon; it was impossibly slow in CRuby.
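For illustration, a typical ray-casting hit test of that kind looks roughly like the following. This is a minimal C++ sketch of the general idea, not the actual extension code:

    #include <cstddef>
    #include <vector>

    struct Point { double x, y; };

    // Classic crossing-number test: count how many polygon edges a
    // horizontal ray from the point crosses; an odd count means "inside".
    bool point_in_polygon(const std::vector<Point>& poly, Point p) {
        bool inside = false;
        for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
            bool crosses = (poly[i].y > p.y) != (poly[j].y > p.y);
            if (crosses &&
                p.x < (poly[j].x - poly[i].x) * (p.y - poly[i].y) /
                          (poly[j].y - poly[i].y) + poly[i].x)
                inside = !inside;
        }
        return inside;
    }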
If the previous answer was a "No," is there an alternative to C extensions in Rubinius? I gather the VM is written in C++, so perhaps that.
A few "bonus" questions:
Will C-Ruby (2.0+, YARV) ever get rid of GIL? Or at least modify it so CRuby supports true parallelism?
What is exactly mruby? I see matz is working on it, and as far as the description goes it seems pretty awesome. How different is it from CRuby (performance-wise)?
I apologize for this text-storm I unleashed upon you! ♥
Is Rubinius really faster?
In most benchmarks, yes.
But benchmarks are... dumb. Apps are what we really care about, so the best thing to do is benchmark your app & see how well it performs. The two areas where Rubinius will really shine over MRI are parallelism & memory usage. Rubinius has no GIL, so you can utilize all available threads. It also has a much more sophisticated GC, so in general it could perform better with respect to GC.
I did those benchmarks back in Oct '11 for my talk on MagLev at RubyConf
Does EventMachine work with Rubinius?
Yes, and if there are parts that don't work, then the issue should be reported. With that said, currently the EM tests don't pass on any Ruby implementation.
Do C extensions work in Rubinius?
Yes. I maintain the compatibility issue for C-exts, so if there is one you have that is tested on Travis, Rubinius would like to see it pass against rbx. Rubinius has historically had good support for the C-api and C-exts, though it would be nice if someday Rubinius could run Ruby so fast one would not need C-exts or the C-api.
Will C-Ruby (2.0+, YARV) ever get rid of GIL? Or at least modify it so CRuby supports true parallelism?
No, most likely not. Jesse Storimer has a succinct writeup of Matz's opinion (or lack thereof) on threads from RubyConf 2012. Koichi Sasada tried to remove the GIL once and MRI perf just tanked. Evan Phoenix also tried once, before he created Rubinius, but didn't have good results.
What is exactly mruby?
An embeddable Ruby interpreter, akin to Lua. Matt Aimonetti has a few articles that might shed some light for you.
I am not too much into Ruby but I might be able to answer the first question.
Is Rubinius really faster?
I've seen different benchmarks telling different things. However, the fact that Rubinius is partially written in Ruby does not have to mean that it is slower. I thought the same about PyPy, which is Python written in Python. After some research and the right classes in college, I knew why.
As far as I know, both are written in a subset of their language, which should be much simpler. An interpreter (e.g. one written in C) can be optimized much more easily for such a subset than for the whole language.
Writing the Ruby/Python interpreter in its own language allows much more flexibility and quicker prototyping of new interpretation algorithms. The whole point of the existence of Ruby and Python is, among other things, that algorithms can be implemented much more quickly than in, say, C or even assembler. A faster algorithm outweighs the small overhead of an interpreter a lot of the time.
By the way, writing an interpreter for a language in the same language is also a common academic exercise to show how powerful the language is. In one class we wrote Lisp in Lisp in Lisp.

What is the reason why high level abstractions that use lock free programming deep down aren't popular?

From what I gathered about lock-free programming, it is incredibly hard to do right... and I agree.
Just thinking about some of the problems makes my head hurt. But what I wonder is: why is there no widespread use of high-level wrappers around lock-free structures (e.g. a lock-free queue and similar things)?
For example, Boost has no lock-free library, although one was suggested as far as I know.
I mean, I guess there are a lot of applications where you can't avoid the fact that the critical section is a big part of the load. So what are the reasons? Is it...
Patents - I heard that some stuff related to lock-free programming is patented.
Performance.
Google and Microsoft have internal libraries like that, but none of them are public...
Something else?
So my question is: why are high-level abstractions that use lock-free programming deep down not very popular, while at the same time "regular" multi-threaded programming is "in"?
EDIT: Boost has since gained a lock-free library (Boost.Lockfree) :)
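For anyone reading this now, a minimal usage sketch of that library might look like the following (assuming Boost 1.53 or later; the capacity and counts are arbitrary):

    #include <boost/lockfree/queue.hpp>
    #include <iostream>
    #include <thread>

    int main() {
        // Lock-free multi-producer/multi-consumer queue of ints.
        // push() and pop() return bool to signal success.
        boost::lockfree::queue<int> q(1024);

        std::thread producer([&] {
            for (int i = 0; i < 1000; ++i)
                while (!q.push(i)) { /* retry until a node is available */ }
        });

        long long sum = 0;
        std::thread consumer([&] {
            int v;
            int received = 0;
            while (received < 1000)
                if (q.pop(v)) { sum += v; ++received; }
        });

        producer.join();
        consumer.join();
        std::cout << sum << "\n";  // 0 + 1 + ... + 999 = 499500
    }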
There are few people who are familiar enough with the field to implement easy-to-use lock-free libraries. Of those few, even fewer publish work for free and of those almost none do the vital additional work to make the library useable - e.g. publish full API docs, etc. They tend to just release a zip file with code in, which is almost useless. Then of course you also need to find a library which is written in the language you want to use, compiles on the platform you're using and finally, word of the library has to get out, so people know it exists.
Patents are an issue, in that they limit what can be offered. There is, for example, to my knowledge no unpatented lock-free singly-linked list. All the skip-list stuff is heavily patented, too.
A hero in this field is Cliff Click, who came up with a lock-free hash, which he has more-or-less placed in the public domain.
You can find my lock-free library here:
http://www.liblfds.org
Another is Samy Bahra's Concurrency Kit:
http://www.concurrencykit.org
FYI, Microsoft's .NET Framework gained some lock-free classes in .NET 4.0, namely container classes in the System.Collections.Concurrent namespace, which include:
ConcurrentDictionary
ConcurrentQueue
ConcurrentStack
I've looked into their implementation; they are relatively fiddly/complex under the hood, so they do represent a significant amount of design and testing effort (threading issues are, of course, notoriously difficult to test to a high standard).
You can take a look at the libcds C++ library. It is a collection of lock-free containers (stacks, queues, sets and maps) and safe memory reclamation algorithms.
IMHO, regarding C++ (I'm not advanced in other languages): the new C++ standard has only just been released, and compiler developers need time to implement its requirements. Today, no compiler supports the C++11 memory model entirely, since it requires significant changes to the compiler's optimization rules. Recently Microsoft announced support for the atomic operations, which are the basis of lock-free programming, in the VC++ 11 Developer Preview. That is good news for us. As far as I know, GCC is going to support it in 4.8 (or above).
The second problem is patents. Many interesting lock-free container algorithms are patented, which is a barrier to including them in vendors' libraries.
Third, a central part of lock-free containers is garbage collection (safe memory reclamation). C++ is free from any GC (fortunately). There are a few reclamation algorithms (Hazard Pointers, Pass-the-Buck, epoch-based and so on), but most of them are patented too.
Fourth, there are not enough tools to prove the correctness of the memory fences applied in your lock-free implementation. At present I know of only one: Relacy (http://www.1024cores.net/home/relacy-race-detector).
I think in 2-3 years we'll see many production-ready, multi-platform C++ libraries of lock-free containers and algorithms. These libraries are being developed by vendors and enthusiasts.
However, in my opinion, our future is hardware transactional memory (HTM). Today AMD, Sun (sorry, Oracle) and Intel (?) are investigating HTM, with very interesting results. Let's wait.
There is at least one "lock free" framework that is somewhat popular: Erlang.
One major problem is that unless one uses an excessive number of memory barriers, it's hard to be certain that one has enough; if one does use an excessive number of memory barriers, performance is likely to be inferior to what one would have gotten using locks.
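To make the barrier question concrete, here is a minimal and deliberately incomplete C++11 sketch of a Treiber-style lock-free stack that shows where the ordering constraints go. It sidesteps the ABA and safe-memory-reclamation problems by never freeing nodes, which is exactly the hard part in real implementations, so treat it as an illustration only:

    #include <atomic>
    #include <utility>

    template <typename T>
    class naive_lockfree_stack {
        struct node { T value; node* next; };
        std::atomic<node*> head{nullptr};
    public:
        void push(T v) {
            node* n = new node{std::move(v), head.load(std::memory_order_relaxed)};
            // release: publish the node's contents before making it visible
            while (!head.compare_exchange_weak(n->next, n,
                                               std::memory_order_release,
                                               std::memory_order_relaxed)) {
                // on failure, n->next has been refreshed with the current head
            }
        }
        bool pop(T& out) {
            // acquire: see the contents written by the releasing push
            node* n = head.load(std::memory_order_acquire);
            while (n && !head.compare_exchange_weak(n, n->next,
                                                    std::memory_order_acquire,
                                                    std::memory_order_acquire)) {
                // on failure, n has been refreshed with the current head
            }
            if (!n) return false;
            out = std::move(n->value);
            return true;   // n is leaked on purpose: freeing it safely is the
                           // reclamation problem discussed above
        }
    };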
The biggest problem with locks is not performance, but robustness. If a thread gets waylaid while it holds a lock, the system dies. By contrast, if a thread that is accessing a lock-free data structure gets waylaid, it won't affect other threads' use of that structure. In some situations, a lock-free data structure may be preferable to one using locks, even if its performance is inferior, because one must protect the system from being brought down by a malfunctioning thread. For example, even if one were prepared to kill off a thread that hit a StackOverflowException without taking down the process, how would one protect against a thread putting so much on its stack before calling a method that accesses a lock-protected data structure that the lock-guarded method itself hits a stack overflow? If one uses lock-free data structures, such risks aren't a problem.

Is buffer-overflow considered a "solved problem" ? (at least for future systems)

I am looking at various buffer/heap/stack protection technologies such as PaX, DEP, NX, canaries, etc.
And a new one SMEP - http://vulnfactory.org/blog/2011/06/05/smep-what-is-it-and-how-to-beat-it-on-linux/
Assuming that I am using the latest kernel on the latest mainstream processor,
assuming I recompile all my apps with the various compiler protections,
and assuming I run with DEP, ASLR, the NX bit, and whatever other protection is available:
Is it reasonable to say that most buffer-overflow attack will fail?
And this has been solved for future systems?
As a corollary, is it true to say that, as currently implemented, current Win7 systems are hopelessly vulnerable to ROP and other techniques, particularly for 32-bit apps?
p.s. I am purposely not going into the "managed code is safe" argument here
[disclaimer]
I am not advocating sloppy coding.
I realize there are many other attacks.
I know that existing systems far outnumber the "idealized" security configuration
No, buffer-overflow is not a solved problem, at least in unmanaged code.
The basic problem with all the technologies you listed (DEP/NX, ASLR, canaries, SMEP) is that they happen after the fact: they cannot solve the buffer overflow problem, because they do not prevent the root cause; the buffer overflow and memory corruption have already happened.
All of the schemes are simply 'good' but unreliable heuristic ways to attempt to either detect the fault after corruption happens but before it becomes a working exploit, or just make it too hard to engineer a reliable working exploit.
e.g. DEP, ASLR, and canaries on Windows Vista and 7 can be circumvented by ROP plus other techniques, and therefore just raise the difficulty bar for creating a working exploit. But the difficulty increase is quite significant, and that is the point of these techniques.
If you are interested in solving the problem at its root by preventing the original buffer corruption, I think that would be most productive, but it would probably bring us back to the discussion of managed code...
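For concreteness, this is the kind of root cause being talked about, reduced to a toy example (illustrative C++ only, names made up):

    #include <cstdio>
    #include <cstring>

    // The root-cause bug the mitigations above only react to: an unchecked
    // copy into a fixed-size stack buffer. DEP/NX, ASLR, canaries and SMEP
    // all act only after this line has already corrupted memory.
    void greet(const char* name) {
        char buf[16];
        std::strcpy(buf, name);                      // overflows if name is >= 16 chars
        std::printf("hello %s\n", buf);
    }

    // Removing the root cause instead of trying to catch it later:
    void greet_safe(const char* name) {
        char buf[16];
        std::snprintf(buf, sizeof buf, "%s", name);  // truncates instead of overflowing
        std::printf("hello %s\n", buf);
    }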
If all those technologies are enabled and strong compiler checks are in place, most methods that get around current protections are blocked. But, as that SMEP article points out, each of them only raises the bar, and new attacks will require additional checks and complexity in response. The best bet for your required condition of "most" is that eventually the complexity required for exploits will not fit into the space available, as more and more areas become protected, but we are not there yet.

Byte code stack versus three address

When designing a byte code interpreter, is there a consensus these days on whether stack or three address format (or something else?) is better? I'm looking at these considerations:
The objective language is a dynamic language fairly similar to Javascript.
Performance is important, but development speed and portability are more so for the moment.
Therefore the implementation will be strictly an interpreter for the time being; a JIT compiler may come later, resources permitting.
The interpreter will be written in C.
Read The evolution of Lua and The implementation of Lua 5.0 for how Lua changed from a stack-based virtual machine to a register-based virtual machine and why it gained performance doing it.
Experiments done by David Gregg and Roberto Ierusalimschy have shown that a register-based bytecode works better than a stack-based bytecode because fewer bytecode instructions (and therefore less decoding overhead) are required to do the same tasks. So three-address format is a clear winner.
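As a toy illustration of why the instruction counts differ (this is neither Lua's nor any real VM's encoding, just the general idea):

    // The statement  a = b + c  in the two styles:
    //
    //   stack machine:        register machine (three-address):
    //     PUSH  b               ADD a, b, c
    //     PUSH  c
    //     ADD
    //     STORE a
    //
    // Four fetch/decode/dispatch cycles versus one, which is where the
    // decoding-overhead argument in the Lua papers comes from.

    enum Op { OP_ADD /* , ... */ };

    struct Instr {               // one possible three-address encoding
        unsigned char op;        // opcode
        unsigned char a, b, c;   // destination and two source register indices
    };

    void run(const Instr* code, long regs[], int n) {
        for (int pc = 0; pc < n; ++pc) {
            const Instr& i = code[pc];
            switch (i.op) {
            case OP_ADD: regs[i.a] = regs[i.b] + regs[i.c]; break;
            // ... other opcodes ...
            }
        }
    }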
I don't have much (not really any) experience in this area, so you might want to verify some of the following for yourself (or maybe someone else can correct me where necessary?).
The two languages I work with most nowadays are C# and Java, so I am naturally inclined to their methodologies. As most people know, both are compiled to byte code, and both platforms (the CLR and the JVM) use JIT compilation (at least in the mainstream implementations). Also, I would guess that the JIT compilers for each platform are written in C/C++, but I really don't know for sure.
All-in-all, these languages and their respective platforms are pretty similar to your situation (aside from the dynamic part, but I'm not sure if this matters). Also, since they are such mainstream languages, I'm sure their implementations can serve as a pretty good guide for your design.
With that out of the way, I know for sure that both the CLR and the JVM are stack-based architectures. Some of the advantages which I remember for stack-based vs register-based are
Smaller generated code
Simpler interpreters
Simpler compilers
etc.
Also, I find stack-based to be a little more intuitive and readable, but that's a subjective thing, and like I said before, I haven't seen too much byte code yet.
Some advantages of the register-based architecture are
Less instructions must be executed
Faster interpreters (follows from #1)
Can more readily be translated to machine code, since most common hardware is register-based
etc.
Of course, there are always ways to offset the disadvantages for each, but I think these describe the obvious things to consider.
Take a look at the OCaml bytecode interpreter; it's one of the fastest of its kind. It is pretty much a stack machine, translated into threaded code on loading (using the GNU computed-goto extension). You could generate Forth-like threaded code as well; it should be relatively easy to do.
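A minimal sketch of that style of threaded-code dispatch using the GNU computed-goto extension (GCC/Clang only; the opcodes and program are made up for illustration):

    // Opcodes: 0 = push the constant 1, 1 = add top two stack slots, 2 = halt.
    long run(const unsigned char* code, long* stack) {
        void* dispatch[] = { &&do_push1, &&do_add, &&do_halt };
        long* sp = stack;
        const unsigned char* pc = code;

        goto *dispatch[*pc++];      // jump straight to the first handler

    do_push1:
        *sp++ = 1;
        goto *dispatch[*pc++];      // each handler dispatches the next opcode itself
    do_add:
        sp[-2] += sp[-1]; --sp;
        goto *dispatch[*pc++];
    do_halt:
        return sp[-1];
    }

Running it on the byte sequence {0, 0, 1, 2} (push1, push1, add, halt) returns 2; the point is that there is no central switch and no outer loop, just one indirect jump per instruction.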
But if you're keeping future JIT compilation in mind, make sure that your stack machine is not really a full-featured stack machine but an expression-tree serialisation form instead (like the .NET CLI); that way you'd be able to translate your "stack" bytecode into a three-address form and then into SSA.
If you have a JIT in mind, then bytecode is the only option.
Just in case, you can take a look at my TIScript: http://www.codeproject.com/KB/recipes/TIScript.aspx
and sources: http://code.google.com/p/tiscript/

Scaling multithreaded applications on multicored machines

I'm working on a project where we need more performance. Over time we've continued to evolve the design to work more in parallel (both threaded and distributed). The latest step has been to move part of it onto a new machine with 16 cores. I'm finding that we need to rethink how we do things to scale to that many cores in a shared-memory model. For example, the standard memory allocator isn't good enough.
What resources would people recommend?
So far I've found Sutter's column in Dr. Dobb's to be a good start.
I just got The Art of Multiprocessor Programming and the O'Reilly book on Intel Threading Building Blocks.
A couple of other books that are going to be helpful are:
Synchronization Algorithms and Concurrent Programming
Patterns for Parallel Programming
Communicating Sequential Processes by C. A. R. Hoare (a classic, free PDF at that link)
Also, consider relying less on sharing state between concurrent processes. You'll scale much, much better if you can avoid it because you'll be able to parcel out independent units of work without having to do as much synchronization between them.
Even if you need to share some state, see if you can partition the shared state from the actual processing. That will let you do as much of the processing in parallel, independently from the integration of the completed units of work back into the shared state. Obviously this doesn't work if you have dependencies among units of work, but it's worth investigating instead of just assuming that the state is always going to be shared.
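A small C++11 sketch of that idea, assuming the units of work really are independent (the function and names here are made up): each task works only on its own slice, and the results are merged into shared state exactly once, at the end.

    #include <algorithm>
    #include <future>
    #include <numeric>
    #include <vector>

    long long parallel_sum(const std::vector<int>& data, unsigned parts) {
        std::vector<std::future<long long>> futures;
        std::size_t chunk = data.size() / parts + 1;   // parts must be > 0

        for (unsigned p = 0; p < parts; ++p) {
            std::size_t begin = p * chunk;
            std::size_t end   = std::min(data.size(), begin + chunk);
            if (begin >= end) break;
            futures.push_back(std::async(std::launch::async, [&data, begin, end] {
                // independent unit of work: no shared mutable state, no locking
                return std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
            }));
        }

        long long total = 0;              // the only "integration" step
        for (auto& f : futures) total += f.get();
        return total;
    }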
You might want to check out Google's Performance Tools. They've released the version of malloc they use for multi-threaded applications (tcmalloc). It also includes a nice set of profiling tools.
Jeffrey Richter is into threading a lot. He has a few chapters on threading in his books and check out his blog:
http://www.wintellect.com/cs/blogs/jeffreyr/default.aspx.
As Monty Python would say, "and now for something completely different": you could try a language/environment that doesn't use threads, but processes and messaging (no shared state). One of the most mature is Erlang (and this excellent and fun book: http://www.pragprog.com/titles/jaerlang/programming-erlang). It may not be exactly relevant to your circumstances, but you can still pick up a lot of ideas that you may be able to apply in other tools.
For other environments:
.Net has F# (to learn functional programming).
The JVM has Scala (which has actors, very much like Erlang, and is a hybrid functional language). There is also the fork/join framework from Doug Lea for Java, which does a lot of the hard work for you.
The allocator in FreeBSD recently got an update for FreeBSD 7. The new one is called jemalloc and is apparently much more scalable with respect to multiple threads.
You didn't mention which platform you are using, so perhaps this allocator is available to you. (I believe Firefox 3 uses jemalloc, even on Windows, so ports must exist somewhere.)
Take a look at Hoard if you are doing a lot of memory allocation.
Roll your own lock-free list. A good resource is here - it's in C# but the ideas are portable. Once you get used to how they work, you start seeing other places where they can be used, not just in lists.
I will have to check-out Hoard, Google Perftools and jemalloc sometime. For now we are using scalable_malloc from Intel Threading Building Blocks and it performs well enough.
For better or worse, we're using C++ on Windows, though much of our code will compile with gcc just fine. Unless there's a compelling reason to move to Red Hat (the main Linux distro we use), I doubt it's worth the headache/political trouble to move.
I would love to use Erlang, but there is way too much here to redo now. If we think about the requirements around the development of Erlang in a telco setting, they are very similar to our world (electronic trading). Armstrong's book is on my to-read stack :)
In my testing to scale out from 4 cores to 16 cores I've learned to appreciate the cost of any locking/contention in the parallel portion of the code. Luckily we have a large portion that scales with the data, but even that didn't work at first because of an extra lock and the memory allocator.
I maintain a concurrency link blog that may be of ongoing interest:
http://concurrency.tumblr.com
