Memory barriers through programming

Memory barriers through programming - memory-management

Can memory barriers be achieved through code(without using CAS or other locking primitives like volatile,atomic classes etc.)?
I believe the disruptor is able to achieve it,without actually ersorting to any of the locking primitives.
Any pointers or references in understanding this would be helpful.
Suggestions on other programmatic modes(preferrably in java) is also appreciated.

The notion of a memory barrier is orthogonal to CAS and other locking primitives. For example, C++11 allows a CAS operation to not have any memory barrier at all if specified with memory_order_relaxed. Some hardware, notably x68, always associates a memory barrier with an atomic read-modify-write operation.
The best example of an algorithm that requires a memory barrier, but no CAS or locking, is Dekker's protocol. Section 1 of "Location-Based Memory Fences" gives a good overview of the protocol.
See my blog Volatile: Almost Useless for Multi-Threaded Programming for why volatile is useless as a memory barrier.
C++-specific information: In C++11, use std::atomic_thread_fence. The preceding link has a nice example of using it without locking. If dealing with older C++ compilers, you'll need to resort to vendor-specific routines. One way is to use Intel Threading Building Blocks' tbb::atomic_fence(). It's a wrapper around whatever platform-specific fence we could find.

Related

Why do we have a slow `malloc`?

As far as I know, custom memory managers are used in several medium and large-scale projects. This recent answer on security.se discusses the fact that a custom memory allocator in OpenSSL was included for performance reasons and ultimately ended up making the Heartbleed exploit worse. This old thread here discusses memory allocators, and in particular one of the answer links to an academic paper that shows that while people write custom memory allocators for performance reasons because malloc is slow, a general-purpose state-of-the-art allocator easily beats them and causes fewer problems than developers reinventing the wheel in every project.
As someone who does not program professionally, I am curious about how we ended up in this state and why we seem to be stuck there --- assuming my view is correct, which is not necessarily true. I imagine there must be subtler issues such as thread safety. Apologies if I am presenting the situation wrongly.
Why is the system malloc not developed and optimized to match the performance of these "general-purpose state-of-the-art allocators"? It seems to me that it should be quite an important feature for OS and standard library writers to focus on. I have heard a lot of talking about the scheduler implementation in Linux kernel in the past, for instance, and naively I would expect to see more or less the same amount of interest for memory allocators. How come standard malloc is so bad that so many people feel the need roll out a custom allocator? If there are alternative implementations that work so much better, why haven't system programmers included them in Linux and Windows, either as default or as a linking-time option?

There are two problems:
No single allocation scheme fits all application needs.
The C library was poorly designed (or not designed). Some non-eunuchs operating systems have configurable memory managers that can allow the application to choose the allocation scheme. In eunuchs-land, the solution is to link in your own malloc/free implementation into your application.
There is no real standard malloc implementation (GNU LIBC's is the probably the closest to standard). The malloc implementations that come with the OS tend to work fine for more applications.

What to choose between Slab and Slub Allocator in Linux Kernel?

What are the factors which help to decide the choice of memory allocators in Linux Kernel?
In the present Linux Kernel we have the option of choosing SLAB,SLUB or SLOB. I have read that SLOB is used for Kernel of smaller footprints. But I want to know the factors which help to decide between Slab Allocator and Slub Allocator.

In the search of answer, I posted the same question on Quora and Robert Love answered it:
I'm assuming you are asking this from the point-of-view of the user of
a system, or perhaps someone building a kernel for a particular
product. As a kernel developer, you don't care what "slab" allocator
is in use; the API is the same.
First, "slab" has become a generic name referring to a memory
allocation strategy employing an object cache, enabling efficient
allocation and deallocation of kernel objects. It was first documented
by Sun engineer Jeff Bonwick1 and implemented in the Solaris 2.4
kernel.
Linux currently offers three choices for its "slab" allocator:
Slab is the original, based on Bonwick's seminal paper and available
since Linux kernel version 2.2. It is a faithful implementation of
Bonwick's proposal, augmented by the multiprocessor changes described
in Bonwick's follow-up paper2.
Slub is the next-generation replacement memory allocator, which has
been the default in the Linux kernel since 2.6.23. It continues to
employ the basic "slab" model, but fixes several deficiencies in
Slab's design, particularly around systems with large numbers of
processors. Slub is simpler than Slab.
SLOB (Simple List Of Blocks) is a memory allocator optimized for
embedded systems with very little memory—on the order of megabytes. It
applies a very simple first-fit algorithm on a list of blocks, not
unlike the old K&R-style heap allocator. In eliminating nearly all of
the overhad from the memory allocator, SLOB is a good fit for systems
under extreme memory constraints, but it offers none of the benefits
described in 1 and can suffer from pathological fragmentation.
What should you use? Slub, unless you are building a kernel for an
embedded device with limited in memory. In that case, I would
benchmark Slub versus SLOB and see what works best for your workload.
There is no reason to use Slab; it will likely be removed from future
Linux kernel releases.
Please refer to this link for an original response.

SLOB - it's Old.
SLAB - it's Ancient.
SLUB - Use it.

What types of code domains is OpenCL suited to?

I read the OpenCL overview, and it states it is suitable for code that runs of CPUs, GPGPUs, DSPs, etc. However, from looking through the command reference, it seems to be all math and image type operations. I didn't see anything for say strings.
This makes me wonder what would you run on a CPU via OpenCL?
Further, I know OpenCL can be used to perform sorting on GPGPUs. But would one ever use it (or, for that matter, a current GPGPU) to perform string processing such as pattern matching, metaphone extraction, dictionary lookup, or anything else that requires the processing of arrays of strings.
EDIT
I noticed that Intel's upcoming Ivy Bridge is touted as "OpenCL compliant" with reference to its graphics units. Does this infer that the CPU cores are not OpenCL compliant, or is there no such inference?
EDIT
In the interests of non-debate and constructiveness, I would appreciate if anyone could point me to official references that would answer my question.

You can think of OpenCL as a combination of a runtime (for device discovery, queueing) and a C-based programming language. This programming language has native vector types and built-in functions and operations for doing all sorts fun stuff to these vectors. This is nice in that you can write a vectorized kernel in OpenCL, and it it the responsibility of the implementation to map that to the actual vector ISA of your hardware.
From this 4/2011 article, which might vanish:
There are two major CPU architectures out there, x86 and ARM, both of
which should soon run OpenCL code.
If you write an OpenCL application that targets both of these architectures, you wouldn't have to worry about writing two versions, one SSE and one NEON. Just write OpenCL C and be done with it. Yes, I know. This assumes the vendor has done his job and written a solid implementation that fully utilizes the underlying ISA. But if he doesn't, complain!
In addition, some CL implementations offer auto-vectorization of scalar kernels, which are usually easier to write. A good auto-vectorizer would give you a solid performance increase for no effort. Since CL kernels are compiled "online," obtaining such a benefit wouldn't require shipping rebuilt code.

No links, but I would assume this is because algorithms that use strings may do a lot of dynamic memory allocation and branching, both of which GPGPUs are not well-suited for. GPGPUs also have a lot in common with vector processing, so doing units of work with different sized blocks of memory (which a string algorithm will generally work on, you usually don't have a homogeneous group of strings), yields poorer performance and is hard to program.
GPUs were designed to do the same work, with little to no branching, on a homogeneous group of data (such as per-vector or per-pixel operations). Algorithms that can mimic this type of behavior are great on GPUs.

This makes me wonder what would you run on a CPU via OpenCL?
I prefer to use ocl to offload work from the cpu to my graphics hardware. Sometimes there is a limitation with my video card, so I like having a backup kernel for cpu use. Such limitations can be memory size, memory bottleneck, low clock speed, or when the pci-e bus gets in the way.
I say I like using a separate kernel for cpu, because I think all kernels should be tweaked to run on their target hardware. I even like to have an openmp backup plan, as most algorithms I use get tested out in this manner ahead of time.
I suppose it is best practice to test out a gpu kernel on the cpu to make sure it runs as expected. If a user of your software has opencl installed, but only a cpu (or a low-end gpu) it's nice to be able to execute the same code on the different devices.

Atomic load/store for OSs other than BSD?

Among the atomic operations provided by BSD (as given on the atomic(9) man page), there are atomic_load_acq_int() and atomic_store_rel_int(). In looking for the equivalent for other OSs (for example, by reading the atomic(3) man page for Mac OS X, the atomic_ops(3C) man page for Solaris, and the Interlocked*() functions for Windows), there don't seem to be any (obvious) equivalents for just atomically reading/writing an int.
Is this because that it's implied for those OSs that reads/writes for int are guaranteed to be atomic by default? (Or must you use declare them volatile in C/C++?)
If not, then how does one do atomic reads/writes of an int on those OSs?
(Atomic reads can be simulated by returning the result of an atomic add of 0, but there's no equivalent for doing atomic writes.)

I think you are mixing together atomic memory access with cache coherence. The former is the required hardware support for building synchronization primitives in software (spin-locks, semaphores, and mutexes), while the latter is the hardware support for multiple chips (several CPUs, and peripheral devices) working over the same bus, and having consistent view of the main memory.
Different compilers/libraries provide different utilities for the first. Here's, for example, GCC intrinsics for atomic memory access. They all boil down to generating either compare-and-swap or load-linked/store-conditional based instruction blocks depending on the platform support. Compile your source with, say, -S for GCC and see the assembler generated.
You don't have to do anything explicitly for cache coherency - it's all handled in hardware - but it definitely helps to understand how it works to avoid things like cache line ping-pong.
With all that, aligned single word reads and writes are atomic on all commodity platforms (somebody correct me if I'm wrong here). Since ints are less or equal to processor word in size, you are covered (see the GCC builtins link above).
It's the order of reads and writes that is important. Here's where architecture memory model is important. It dictates what operations can and cannot be re-ordered by the hardware. Example would be updating a linked list - you don't want other CPUs see a new item linked until the item itself is in consistent state. Explicit memory barriers (also often called "memory fences") might be required. Acquire barrier ensures that subsequent operations are not re-ordererd before the barrier (say you read the linked-list item pointer before the content of the item), Release barrier ensures that previous operations are not re-ordered after the barrier (you write the item content before writing the new link pointer).
volatile is often misunderstood as being related to all the above. In fact it is just an instruction to the compiler not to cache variable value in register, but read it from memory on each access. Many argue that it's "almost useless" for concurrent programming.
Apologies for lengthy reply. Hope this clears it a bit.
Edit:
Upcoming C++0x standard finally addresses concurrency, see Hans Boehm's C++ memory model papers for many details.

Memory Allocation/Deallocation Bottleneck?

How much of a bottleneck is memory allocation/deallocation in typical real-world programs? Answers from any type of program where performance typically matters are welcome. Are decent implementations of malloc/free/garbage collection fast enough that it's only a bottleneck in a few corner cases, or would most performance-critical software benefit significantly from trying to keep the amount of memory allocations down or having a faster malloc/free/garbage collection implementation?
Note: I'm not talking about real-time stuff here. By performance-critical, I mean stuff where throughput matters, but latency doesn't necessarily.
Edit: Although I mention malloc, this question is not intended to be C/C++ specific.

It's significant, especially as fragmentation grows and the allocator has to hunt harder across larger heaps for the contiguous regions you request. Most performance-sensitive applications typically write their own fixed-size block allocators (eg, they ask the OS for memory 16MB at a time and then parcel it out in fixed blocks of 4kb, 16kb, etc) to avoid this issue.
In games I've seen calls to malloc()/free() consume as much as 15% of the CPU (in poorly written products), or with carefully written and optimized block allocators, as little as 5%. Given that a game has to have a consistent throughput of sixty hertz, having it stall for 500ms while a garbage collector runs occasionally isn't practical.

Nearly every high performance application now has to use threads to exploit parallel computation. This is where the real memory allocation speed killer comes in when writing C/C++ applications.
In a C or C++ application, malloc/new must take a lock on the global heap for every operation. Even without contention locks are far from free and should be avoided as much as possible.
Java and C# are better at this because threading was designed in from the start and the memory allocators work from per-thread pools. This can be done in C/C++ as well, but it isn't automatic.

First off, since you said malloc, I assume you're talking about C or C++.
Memory allocation and deallocation tend to be a significant bottleneck for real-world programs. A lot goes on "under the hood" when you allocate or deallocate memory, and all of it is system-specific; memory may actually be moved or defragmented, pages may be reorganized--there's no platform-independent way way to know what the impact will be. Some systems (like a lot of game consoles) also don't do memory defragmentation, so on those systems, you'll start to get out-of-memory errors as memory becomes fragmented.
A typical workaround is to allocate as much memory up front as possible, and hang on to it until your program exits. You can either use that memory to store big monolithic sets of data, or use a memory pool implementation to dole it out in chunks. Many C/C++ standard library implementations do a certain amount of memory pooling themselves for just this reason.
No two ways about it, though--if you have a time-sensitive C/C++ program, doing a lot of memory allocation/deallocation will kill performance.

In general the cost of memory allocation is probably dwarfed by lock contention, algorithmic complexity, or other performance issues in most applications. In general, I'd say this is probably not in the top-10 of performance issues I'd worry about.
Now, grabbing very large chunks of memory might be an issue. And grabbing but not properly getting rid of memory is something I'd worry about.
In Java and JVM-based languages, new'ing objects is now very, very, very fast.
Here's one decent article by a guy who knows his stuff with some references at the bottom to more related links:
http://www.ibm.com/developerworks/java/library/j-jtp09275.html

A Java VM will claim and release memory from the operating system pretty much indepdently of what the application code is doing. This allows it to grab and release memory in large chunks, which is hugely more efficient than doing it in tiny individual operations, as you get with manual memory management.
This article was written in 2005, and JVM-style memory management was already streets ahead. The situation has only improved since then.
Which language boasts faster raw
allocation performance, the Java
language, or C/C++? The answer may
surprise you -- allocation in modern
JVMs is far faster than the best
performing malloc implementations. The
common code path for new Object() in
HotSpot 1.4.2 and later is
approximately 10 machine instructions
(data provided by Sun; see Resources),
whereas the best performing malloc
implementations in C require on
average between 60 and 100
instructions per call (Detlefs, et.
al.; see Resources). And allocation
performance is not a trivial component
of overall performance -- benchmarks
show that many real-world C and C++
programs, such as Perl and
Ghostscript, spend 20 to 30 percent of
their total execution time in malloc
and free -- far more than the
allocation and garbage collection
overhead of a healthy Java
application.

In Java (and potentially other languages with a decent GC implementation) allocating an object is very cheap. In the SUN JVM it only needs 10 CPU Cycles. A malloc in C/c++ is much more expensive, just because it has to do more work.
Still even allocation objects in Java is very cheap, doing so for a lot of users of a web application in parallel can still lead to performance problems, because more Garbage Collector runs will be triggered.
Therefore there are those indirect costs of an allocation in Java caused by the deallocation done by the GC. These costs are difficult to quantify because they depend very much on your setup (how much memory do you have) and your application.

Allocating and releasing memory in terms of performance are relatively costly operations. The calls in modern operating systems have to go all the way down to the kernel so that the operating system is able to deal with virtual memory, paging/mapping, execution protection etc.
On the other side, almost all modern programming languages hide these operations behind "allocators" which work with pre-allocated buffers.
This concept is also used by most applications which have a focus on throughput.

I know I answered earlier, however, that was ananswer to the other answer's, not to your question.
To speak to you directly, if I understand correctly, your performance use case criteria is throughput.
This to me, means's that you should be looking almost exclusivly at NUMA aware allocators.
None of the earlier references; IBM JVM paper, Microquill C, SUN JVM. Cover this point so I am highly suspect of their application today, where, at least on the AMD ABI, NUMA is the pre-eminent memory-cpu governer.
Hands down; real world, fake world, whatever world... NUMA aware memory request/use technologies are faster. Unfortunately, I'm running Windows currently, and I have not found the "numastat" which is available in linux.
A friend of mine has written about this in depth in his implmentation for the FreeBSD kernel.
Dispite me being able to show at-hoc, the typically VERY large amount of local node memory requests on top of the remote node (underscoring the obvious performance throughput advantage), you can surly benchmark yourself, and that would likely be what you need todo as your performance charicterisitc is going to be highly specific.
I do know that in a lot of ways, at least earlier 5.x VMWARE faired rather poorly, at that time at least, for not taking advantage of NUMA, frequently demanding pages from the remote node. However, VM's are a very unique beast when it comes to memory compartmentailization or containerization.
One of the references I cited is to Microsoft's API implmentation for the AMD ABI, which has NUMA allocation specialized interfaces for user land application developers to exploit ;)
Here's a fairly recent analysis, visual and all, from some browser add-on developers who compare 4 different heap implmentations. Naturally the one they developed turns out on top (odd how the people who do the testing often exhibit the highest score's).
They do cover in some ways quantifiably, at least for their use case, what the exact trade off is between space/time, generally they had identified the LFH (oh ya and by the way LFH is simply a mode apparently of the standard heap) or similarly designed approach essentially consumes signifcantly more memory off the bat however over time, may wind up using less memory... the grafix are neat too...
I would think however that selecting a HEAP implmentation based on your typical workload after you well understand it ;) is a good idea, but to well understand your needs, first make sure your basic operations are correct before you optimize these odds and ends ;)

This is where c/c++'s memory allocation system works the best. The default allocation strategy is OK for most cases but it can be changed to suit whatever is needed. In GC systems there's not a lot you can do to change allocation strategies. Of course, there is a price to pay, and that's the need to track allocations and free them correctly. C++ takes this further and the allocation strategy can be specified per class using the new operator:
class AClass
{
public:
void *operator new (size_t size); // this will be called whenever there's a new AClass
void *operator new [] (size_t size); // this will be called whenever there's a new AClass []
void operator delete (void *memory); // if you define new, you really need to define delete as well
void operator delete [] (void *memory);define delete as well
};
Many of the STL templates allow you to define custom allocators as well.
As with all things to do with optimisation, you must first determine, through run time analysis, if memory allocation really is the bottleneck before writing your own allocators.

According to MicroQuill SmartHeap Technical Specification, "a typical application [...] spends 40% of its total execution time on managing memory". You can take this figure as an upper bound, i personally feel that a typical application spends more like 10-15% of execution time allocating/deallocating memory. It rarely is a bottleneck in single-threaded application.
In multithreaded C/C++ applications standard allocators become an issue due to lock contention. This is where you start to look for more scalable solutions. But keep in mind Amdahl's Law.

Pretty much all of you are off base if you are talking about the Microsoft heap. Syncronization is effortlessly handled as is fragmentation.
The current perferrred heap is the LFH, (LOW FRAGMENTATION HEAP), it is default in vista+ OS's and can be configured on XP, via gflag, with out much trouble
It is easy to avoid any locking/blocking/contention/bus-bandwitth issues and the lot with the
HEAP_NO_SERIALIZE
option during HeapAlloc or HeapCreate. This will allow you to create/use a heap without entering into an interlocked wait.
I would reccomend creating several heaps, with HeapCreate, and defining a macro, perhaps, mallocx(enum my_heaps_set, size_t);
would be fine, of course, you need realloc, free also to be setup as appropiate. If you want to get fancy, make free/realloc auto-detect which heap handle on it's own by evaluating the address of the pointer, or even adding some logic to allow malloc to identify which heap to use based on it's thread id, and building a heierarchy of per-thread heaps and shared global heap's/pools.
The Heap* api's are called internally by malloc/new.
Here's a nice article on some dynamic memory management issues, with some even nicer references. To instrument and analyze heap activity.

Others have covered C/C++ so I'll just add a little information on .NET.
In .NET heap allocation is generally really fast, as it it just a matter of just grabbing the memory in the generation zero part of the heap. Obviously this cannot go on forever, which is where garbage collection comes in. Garbage collection may affect the performance of your application significantly since user threads must be suspended during compaction of memory. The fewer full collects, the better.
There are various things you can do to affect the workload of the garbage collector in .NET. Generally if you have a lot of memory reference the garbage collector will have to do more work. E.g. by implementing a graph using an adjacency matrix instead of references between nodes the garbage collector will have to analyze fewer references.
Whether that is actually significant in your application or not depends on several factors and you should profile the application with actual data before turning to such optimizations.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio