Why is stack-heap collision not prevented by putting the stack and heap in different segments?

I think stack and heap collision could have been prevented by just putting both in different segments. Any reason it wasn't implemented that way?

Compilers produce code that adheres to the operating system's ABI, so the compiler's memory model actually follows the memory model of the operating system. If you are asking about Linux, *BSD, or another OS that has a flat virtual address space with overlapping code/data/stack segments: it's easier to program that way and it greatly simplifies memory management. Being able to prevent heap-stack collisions is too small a gain for what the OS would lose in terms of ease of memory management (managing a flat VA space is already complicated enough).
See what happened to OS/2 - it did use full segmenting in protected mode...

Related

Is there a performance advantage to ARM64?

Recently, 64-bit ARM mobile devices have started appearing. Is there any practical advantage to building an application as 64-bit? Specifically, consider an application that does not have much use for the increased virtual address space¹ but would waste some space due to the increased pointer size.
So does ARM64 have any advantages other than the larger address space that would actually warrant building such an application as 64-bit?
Note: I've seen 64-bit Performance Advantages, but it only mentions x86-64, which does have other improvements besides the extended virtual address space. I also recall that the situation is indeed specific to x86: on some other platforms that went 64-bit, like SPARC, the usual approach was to compile only the kernel and the applications that actually used a lot of memory as 64-bit, and everything else as 32-bit.
¹The application is multi-platform and still needs to be built for and run on devices with as little as 48 MiB of memory. It does have some large data that it reads from external storage, but it never needs more than a few megabytes of it at once.
I am not sure a general response can be given, but I can provide some examples of differences. There are of course additional differences added in version 8 of the ARM architecture, which apply regardless of target instruction set.
Performance-positive additions in AArch64
32 general-purpose registers give compilers more wiggle room.
I/D cache synchronization mechanisms accessible from user mode (no system call needed).
Load/store-pair instructions make it possible to load 128 bits of data with one instruction and still remain RISC-like.
The removal of near-universal conditional execution makes more out-of-order execution possible.
The change in the layout of the NEON registers (D0 is still the lower half of Q0, but D1 is now the lower half of Q1 rather than the upper half of Q0) makes more out-of-order execution possible.
64-bit pointers make pointer tagging possible (a sketch follows at the end of this answer).
CSEL enables all kinds of crazy optimizations.
Performance-negative changes in AArch64
More registers may also mean higher pressure on the stack.
Larger pointers mean larger memory footprint.
Removal of near-universal conditional execution may cause higher pressure on the branch predictor.
Removal of load/store-multiple means more instructions needed for function entry/exit.
Performance-relevant changes in ARMv8-A
Load-Acquire/Store-Release semantics remove the need for explicit memory barriers for basic synchronization operations.
I probably forgot lots of things, but those are some of the more obvious changes.
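To make the pointer-tagging item concrete, here is a minimal C sketch (my own illustration, not from the original answer) that stashes a small tag in the top byte of a 64-bit pointer. It assumes an AArch64-style address space in which user-space addresses leave the upper byte unused; on typical ARMv8 configurations the Top Byte Ignore feature even lets such tagged pointers be dereferenced directly, but stripping the tag before use, as done here, is the portable approach.

    #include <stdint.h>
    #include <stdio.h>

    /* Pack a small tag into the (normally unused) top byte of a 64-bit pointer.
     * Assumes the platform does not use the upper 8 bits of user-space
     * addresses, which holds for typical AArch64 configurations. */
    #define TAG_SHIFT 56
    #define TAG_MASK  ((uint64_t)0xFF << TAG_SHIFT)

    static void *tag_ptr(void *p, uint8_t tag) {
        uint64_t bits = (uint64_t)(uintptr_t)p;
        return (void *)(uintptr_t)((bits & ~TAG_MASK) | ((uint64_t)tag << TAG_SHIFT));
    }

    static uint8_t ptr_tag(void *p) {
        return (uint8_t)((uint64_t)(uintptr_t)p >> TAG_SHIFT);
    }

    static void *untag_ptr(void *p) {
        return (void *)(uintptr_t)((uint64_t)(uintptr_t)p & ~TAG_MASK);
    }

    int main(void) {
        int x = 42;
        void *tagged = tag_ptr(&x, 0x5A);   /* store type/colour info in the tag */
        printf("tag = 0x%02X, value = %d\n",
               ptr_tag(tagged), *(int *)untag_ptr(tagged));  /* strip tag before use */
        return 0;
    }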

What to choose between Slab and Slub Allocator in Linux Kernel?

What are the factors which help to decide the choice of memory allocators in Linux Kernel?
In the present Linux kernel we have the option of choosing SLAB, SLUB, or SLOB. I have read that SLOB is used for kernels with a smaller footprint, but I want to know the factors that help to decide between the slab allocator and the slub allocator.
In search of an answer, I posted the same question on Quora, and Robert Love answered it:
I'm assuming you are asking this from the point-of-view of the user of
a system, or perhaps someone building a kernel for a particular
product. As a kernel developer, you don't care what "slab" allocator
is in use; the API is the same.
First, "slab" has become a generic name referring to a memory
allocation strategy employing an object cache, enabling efficient
allocation and deallocation of kernel objects. It was first documented
by Sun engineer Jeff Bonwick [1] and implemented in the Solaris 2.4
kernel.
Linux currently offers three choices for its "slab" allocator:
Slab is the original, based on Bonwick's seminal paper and available
since Linux kernel version 2.2. It is a faithful implementation of
Bonwick's proposal, augmented by the multiprocessor changes described
in Bonwick's follow-up paper [2].
Slub is the next-generation replacement memory allocator, which has
been the default in the Linux kernel since 2.6.23. It continues to
employ the basic "slab" model, but fixes several deficiencies in
Slab's design, particularly around systems with large numbers of
processors. Slub is simpler than Slab.
SLOB (Simple List Of Blocks) is a memory allocator optimized for
embedded systems with very little memory—on the order of megabytes. It
applies a very simple first-fit algorithm on a list of blocks, not
unlike the old K&R-style heap allocator. In eliminating nearly all of
the overhead from the memory allocator, SLOB is a good fit for systems
under extreme memory constraints, but it offers none of the benefits
described in [1] and can suffer from pathological fragmentation.
What should you use? Slub, unless you are building a kernel for an
embedded device with limited memory. In that case, I would
benchmark Slub versus SLOB and see what works best for your workload.
There is no reason to use Slab; it will likely be removed from future
Linux kernel releases.
Please refer to this link for the original response.
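To illustrate the point in that answer that the kernel-side API is identical regardless of which slab implementation the kernel was built with, here is a minimal sketch of the common kmem_cache interface as it might appear in a kernel module; the widget structure and cache name are made up for the example.

    #include <linux/module.h>
    #include <linux/slab.h>

    /* Example object type; the name and fields are made up for illustration. */
    struct widget {
        int id;
        char name[16];
    };

    static struct kmem_cache *widget_cache;

    static int __init widget_init(void)
    {
        struct widget *w;

        /* The same call works whether the kernel was configured with SLAB,
         * SLUB, or SLOB -- the allocator behind it is invisible here. */
        widget_cache = kmem_cache_create("widget_cache", sizeof(struct widget),
                                         0, SLAB_HWCACHE_ALIGN, NULL);
        if (!widget_cache)
            return -ENOMEM;

        w = kmem_cache_alloc(widget_cache, GFP_KERNEL);
        if (w) {
            w->id = 1;
            kmem_cache_free(widget_cache, w);
        }
        return 0;
    }

    static void __exit widget_exit(void)
    {
        kmem_cache_destroy(widget_cache);
    }

    module_init(widget_init);
    module_exit(widget_exit);
    MODULE_LICENSE("GPL");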
SLOB - it's Old.
SLAB - it's Ancient.
SLUB - Use it.

What data structure is used to implement the dynamic memory allocation heap?

I always assumed a heap (data structure) is used to implement a heap (dynamic memory allocation), but I've been told I'm wrong.
How are heaps (for example, the one implemented by typical malloc routines, or by Windows's HeapCreate, etc.) implemented, typically? What data structures do they use?
What I'm not asking:
While searching online, I've seen tons of descriptions of how to implement heaps with severe restrictions.
To name a few, I've seen lots of descriptions of how to implement:
Heaps that never release memory back to the OS (!)
Heaps that only give reasonable performance on small, similarly-sized blocks
Heaps that only give reasonable performance for large, contiguous blocks
etc.
And it's funny, they all avoid the harder question:
How are "normal", general-purpose heaps (like the one behind malloc, HeapCreate) implemented?
What data structures (and perhaps algorithms) do they use?
Allocators tend to be quite complex and often differ significantly in how they're implemented.
You can't really describe them in terms of one common data structure or algorithm, but there are some common themes:
Memory is taken from the system in large chunks -- often megabytes at a time.
These chunks are then split up into various smaller chunks as you perform allocations. Not exactly the same size as you allocate, but usually in certain ranges (200-250 bytes, 251-500 bytes, etc.). Sometimes this is multi-tiered, where you'd have an additional layer of "medium chunks" which come before your actual requests.
Controlling which "large chunk" to break a piece off of is a very difficult and important thing to do -- this greatly affects memory fragmentation.
One or more free pools (aka "free list", "memory pool", "lookaside list") are maintained for each of these ranges -- sometimes even thread-local pools. This can greatly speed up a pattern of allocating/deallocating many objects of similar size (a minimal sketch of such a size-class free list follows this list).
Large allocations are treated a bit differently so as to not waste a lot of RAM and not be pooled quite so much if at all.
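As a rough illustration of the free-list theme, here is a minimal sketch of a segregated ("size class") free-list allocator. It is my own toy example rather than code from any real allocator: it is single-threaded, takes its large chunks from malloc instead of the OS, never returns memory, and makes the caller pass the size to free, where a real allocator would store a header or look up the owning chunk instead.

    #include <stddef.h>
    #include <stdlib.h>

    #define NUM_CLASSES 4
    static const size_t class_size[NUM_CLASSES] = { 32, 64, 128, 256 };

    struct free_node { struct free_node *next; };

    /* One free list per size class. */
    static struct free_node *free_list[NUM_CLASSES];

    /* Pick the smallest size class that fits the request, or -1 for "large". */
    static int class_for(size_t n) {
        for (int i = 0; i < NUM_CLASSES; i++)
            if (n <= class_size[i]) return i;
        return -1;
    }

    /* Carve a large chunk into fixed-size blocks and thread them onto the
     * class's free list.  The chunk is never given back to the system. */
    static void refill(int c) {
        size_t blk = class_size[c];
        char *chunk = malloc(64 * 1024);      /* "large chunk" from the system */
        if (!chunk) return;
        for (size_t off = 0; off + blk <= 64 * 1024; off += blk) {
            struct free_node *n = (struct free_node *)(chunk + off);
            n->next = free_list[c];
            free_list[c] = n;
        }
    }

    void *tiny_alloc(size_t n) {
        int c = class_for(n);
        if (c < 0) return malloc(n);          /* large allocations handled separately */
        if (!free_list[c]) refill(c);
        struct free_node *head = free_list[c];
        if (!head) return NULL;
        free_list[c] = head->next;
        return head;
    }

    void tiny_free(void *p, size_t n) {
        int c = class_for(n);
        if (c < 0) { free(p); return; }
        struct free_node *node = p;           /* block goes back on its class's list */
        node->next = free_list[c];
        free_list[c] = node;
    }

Real allocators layer thread-local caches, block headers, and careful chunk selection on top of this basic idea, which is where most of the complexity (and the fragmentation behaviour) comes from.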
If you wanted to check out some source code, jemalloc is a modern high-performance allocator and should be representative in complexity of other common ones. TCMalloc is another common general-purpose allocator, and their website goes into all the gory implementation details. Intel's Threading Building Blocks has an allocator built specifically for high concurrency.
One interesting difference can be seen between Windows and *nix. In *nix, the allocator has very low-level control over the address space an app uses. In Windows, you basically have a coarse-grained, slow allocator, VirtualAlloc, to base your own allocator on.
This results in *nix-compatible allocators typically giving you a malloc/free implementation directly, where it's assumed you'll only use one allocator for everything (otherwise they'd trample each other), while Windows-specific allocators provide additional functions, leave malloc/free alone, and can be used in harmony (for instance, you can use HeapCreate to make private heaps which can work alongside others).
In practice, this trade in flexibility gives *nix allocators a small leg up performance-wise. It's very rare to see an app intentionally use multiple heaps on Windows -- mostly it happens by accident, due to different DLLs using different runtimes which each have their own malloc/free, and it can cause a lot of headaches if you're not diligent about tracking which heap some memory came from.
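As a concrete illustration of the private-heap pattern mentioned above, here is a minimal Win32 sketch (my own example, with error handling trimmed); it only demonstrates the HeapCreate/HeapAlloc/HeapFree/HeapDestroy calls, not a custom allocator built on top of them.

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Create a private, growable heap alongside the default process heap. */
        HANDLE heap = HeapCreate(0, 0, 0);   /* flags, initial size, max size (0 = growable) */
        if (!heap) return 1;

        char *buf = HeapAlloc(heap, HEAP_ZERO_MEMORY, 256);
        if (buf) {
            sprintf(buf, "allocated from a private heap");
            puts(buf);
            HeapFree(heap, 0, buf);          /* must free from the same heap it came from */
        }

        HeapDestroy(heap);                   /* releases everything still in that heap */
        return 0;
    }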
Note: The following answer assumes you're using a typical, modern system with virtual memory. The C and C++ standards do not require virtual memory, so of course you can't rely on such assumptions on hardware without this feature (e.g., GPUs typically don't have this feature, nor does extremely small hardware like PIC microcontrollers).
This depends on the platform you're using. Heaps can be very complicated beasts; they don't use only a single data structure; and there is no "standard" data structure. Even where the heap code is located is different depending on the platform. For instance, the heap code is typically provided by the C Runtime on Unix boxes; but is typically provided by the operating system on Windows.
Yes, this is common on Unix machines, due to the way *nix's underlying APIs and memory model operate. Basically, the standard API for returning memory to the operating system on these systems (brk or sbrk) only allows returning memory at the "frontier" between where user memory is allocated and the "hole" between user memory and system facilities like the stack. Instead of returning memory to the operating system, many heaps therefore only try to reuse memory no longer in use by the program proper, and don't try to return memory to the system. This is less common on Windows, because its equivalent of sbrk (VirtualAlloc) doesn't have this limitation. (But like sbrk, VirtualAlloc is very expensive and has caveats, like only allocating page-sized and page-aligned chunks, so heaps try to call either as rarely as possible.)
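To make the "frontier" point concrete, here is a minimal sketch of a bump allocator built directly on sbrk. It is my own illustration: sbrk is a legacy interface and real allocators mostly use mmap today, but it shows why memory below the top of the break cannot be handed back to the OS.

    #include <stddef.h>   /* size_t */
    #include <stdint.h>   /* intptr_t */
    #include <unistd.h>   /* sbrk (legacy API; may need _DEFAULT_SOURCE on glibc) */

    /* Every allocation moves the break ("the frontier") upward. */
    static void *bump_alloc(size_t n) {
        void *p = sbrk((intptr_t)n);
        return (p == (void *)-1) ? NULL : p;
    }

    /* A block can only be handed back if it is the topmost one, i.e. it
     * touches the frontier; anything below it stays claimed even if the
     * program no longer uses it. */
    static void bump_free_top(void *p, size_t n) {
        if ((char *)p + n == (char *)sbrk(0))
            sbrk(-(intptr_t)n);
    }

    int main(void) {
        char *a = bump_alloc(4096);
        char *b = bump_alloc(4096);
        bump_free_top(b, 4096);  /* b sits on the frontier: returned to the OS */
        bump_free_top(a, 4096);  /* a is now topmost, so this works too; had b
                                    still been live, a could not be returned */
        return 0;
    }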
This sounds like a "block allocator", which divides the memory into fixed-size chunks and then just returns one of the free chunks. To my (albeit limited) understanding, Windows' RtlHeap maintains a number of data structures like this for different known block sizes (e.g., it'll have one for blocks of size 16). RtlHeap calls these "lookaside lists".
I don't really know of a specific structure that handles this case well. Large blocks are problematic for most allocation systems because they cause fragmentation of the address space.
The best reference I've found discussing the common allocation strategies employed on major platforms is the book Secure Coding in C and C++, by Robert Seacord. All of chapter 4 is dedicated to heap data structures (and problems caused when users use said heap systems incorrectly).

GPU access to system RAM

I am currently involved in developing a large scientific computing project, and I am exploring the possibility of hardware acceleration with GPUs as an alternative to the MPI/cluster approach. We are in a mainly memory-bound situation, with too much data to fit in GPU memory. To this end, I have two questions:
1) The books I have read say that it is illegal to access memory on the host with a pointer on the device (for obvious reasons). Instead, one must copy the memory from the host's memory to the device memory, then do the computations, and then copy back. My question is whether there is a work-around for this -- is there any way to read a value in system RAM from the GPU?
2) More generally, what algorithms/solutions exist for optimizing the data transfer between the CPU and the GPU during memory-bound computations such as these?
Thanks for your help in this! I am enthusiastic about making the switch to CUDA, simply because the parallelization is much more intuitive!
1) Yes, you can do this with most GPGPU packages.
The one I'm most familiar with, the AMD Stream SDK, lets you allocate a buffer in "system" memory and use that as a texture that is read or written by your kernel. CUDA and OpenCL have the same ability; the key is to set the correct flags on the buffer allocation.
BUT...
You might not want to do that because the data is being read/written across the PCIe bus, which has a lot of overhead.
The implementation is free to interpret your requests liberally. That is, you can tell it to locate the buffer in system memory, but the software stack is free to do things like relocate it into GPU memory on the fly, as long as the computed results are the same.
2) All of the major GPGPU software environments (CUDA, OpenCL, the Stream SDK) support DMA transfers, which is what you probably want.
Even if you could do this, you probably wouldn't want to, since transfers over PCI-whatever will tend to be a bottleneck, whereas bandwidth between the GPU and its own memory is typically very high.
Having said that, if you have relatively little computation to perform per element on a large data set then GPGPU is probably not going to work well for you anyway.
I suggest the CUDA Programming Guide; you will find many answers there.
Check for streams, unified addressing, and cudaHostRegister.
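For the CUDA route specifically, here is a minimal host-side sketch (my own example; error handling mostly omitted) of mapped, pinned host memory, which is the "correct flags on the buffer allocation" idea from the first answer. The kernel launch is only indicated in a comment, and the kernel name is hypothetical.

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        /* Allow the device to map host allocations into its address space. */
        cudaSetDeviceFlags(cudaDeviceMapHost);

        /* Pinned ("page-locked") host memory, mapped into the GPU's address space. */
        float *h_buf = NULL;
        if (cudaHostAlloc((void **)&h_buf, 1024 * sizeof(float),
                          cudaHostAllocMapped) != cudaSuccess) {
            fprintf(stderr, "cudaHostAlloc failed\n");
            return 1;
        }

        /* Device-visible pointer aliasing the same host memory; a kernel can
         * read/write through it directly, with every access crossing PCIe. */
        float *d_alias = NULL;
        cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);

        /* my_kernel<<<blocks, threads>>>(d_alias);   -- hypothetical kernel launch */

        cudaFreeHost(h_buf);
        return 0;
    }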

Why does the Mac ABI require 16-byte stack alignment for x86-32?

I can understand this requirement for the old PPC RISC systems, and even for x86-64, but for the old tried-and-true x86? In this case, the stack needs to be aligned on 4-byte boundaries only. Yes, some of the MMX/SSE instructions require 16-byte alignment, but if that is a requirement of the callee, then it should ensure the alignment is correct. Why burden every caller with this extra requirement? This can actually cause some drops in performance, because every call site must manage this requirement. Am I missing something?
Update: After some more investigation into this and some consultation with some internal colleagues, I have some theories about this:
Consistency between the PPC, x86, and x64 version of the OS
It seems that the GCC codegen now consistently does a sub esp,xxx and then "mov"s the data onto the stack rather than simply doing a "push" instruction. This could actually be faster on some hardware.
While this does complicate the call sites a little, there is very little extra overhead when using the default "cdecl" convention where the caller cleans up the stack.
The issue I have with the last item is that for calling conventions that rely on the callee cleaning the stack, the above requirement really "uglifies" the codegen. For instance, what if some compiler decided to implement a faster register-based calling style for its own internal use (i.e., any code that isn't intended to be called from other languages or sources)? This stack-alignment thing could negate some of the performance gains achieved by passing some parameters in registers.
Update: So far the only real answers have been about consistency, but to me that's a bit too easy an answer. I have well over 20 years' experience with the x86 architecture, and if consistency, rather than performance or something else concrete, is really the reason, then I respectfully suggest that it is a bit naive for the developers to require it. They're ignoring nearly three decades of tools and support. Especially if they're expecting tools vendors to quickly and easily adapt their tools for their platform (maybe not... it is Apple...) without having to jump through several seemingly unnecessary hoops.
I'll give this topic another day or so then close it...
Related
It’s my stack frame, I don’t care about your stack frame!
From "Intel®64 and IA-32 Architectures Optimization Reference Manual", section 4.4.2:
"For best performance, the Streaming SIMD Extensions and Streaming SIMD Extensions 2 require their memory operands to be aligned to 16-byte boundaries. Unaligned data can cause significant performance penalties compared to aligned data."
From Appendix D:
"It is important to ensure that the stack frame is aligned to a 16-byte boundary upon function entry to keep local __m128 data, parameters, and XMM register spill locations aligned throughout a function invocation."
http://www.intel.com/Assets/PDF/manual/248966.pdf
I am not sure, as I don't have first-hand proof, but I believe the reason is SSE. SSE is much faster if your buffers are already aligned on a 16-byte boundary (movps vs. movups), and any x86 machine running Mac OS X has at least SSE2. It can be taken care of by the application user, but the cost is pretty significant. If the overall cost of making it mandatory in the ABI is not too significant, it may be worth it. SSE is used quite pervasively in Mac OS X: the Accelerate framework, etc.
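To see why the ABI guarantee matters in practice, here is a small sketch (my own example, using GCC/Clang alignment attributes) of what the aligned SSE load expects; with a 16-byte-aligned stack the compiler can keep __m128 locals and spill slots aligned and use the aligned movaps/movps forms throughout, instead of falling back to movups.

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    int main(void) {
        /* With a 16-byte-aligned stack, the compiler can place these arrays so
         * the aligned load (movaps) is always legal; on a 4-byte-aligned stack
         * it would have to realign or use the unaligned form (movups). */
        float a[4] __attribute__((aligned(16))) = { 1.0f, 2.0f, 3.0f, 4.0f };
        float b[4] __attribute__((aligned(16))) = { 5.0f, 6.0f, 7.0f, 8.0f };
        float r[4] __attribute__((aligned(16)));

        __m128 va = _mm_load_ps(a);      /* aligned load: faults if a is misaligned */
        __m128 vb = _mm_load_ps(b);
        _mm_store_ps(r, _mm_add_ps(va, vb));

        printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }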
I believe it's to keep it in line with the x86-64 ABI.
First, note that the 16-byte alignment is an exception introduced by Apple to the System V IA-32 ABI.
The stack alignment is only needed when calling system functions, because many system libraries use SSE or AltiVec extensions, which require 16-byte alignment. I found an explicit reference in the libgmalloc man page.
You can perfectly handle your stack frame the way you want, but if you try to call a system function with a misaligned stack, you will end up with a misaligned_stack_error message.
Edit:
For the record, you can get rid of alignment problems when compiling with GCC by using the -mstackrealign option.
This is an efficiency issue.
Making sure the stack is 16-byte aligned in every function that uses the new SSE instructions adds a lot of overhead for using those instructions, effectively reducing performance.
On the other hand, keeping the stack 16-byte aligned at all times ensures that you can use SSE instructions freely with no performance penalty. There is no cost to this (cost measured in instructions at least). It only involves changing a constant in the prologue of the function.
Wasting stack space is cheap; it is probably the hottest part of the cache.
My guess is that Apple believes everyone just uses XCode (gcc) which aligns the stack for you. So requiring the stack to be aligned so the kernel doesn't have to is just a micro-optimization.
While I cannot really answer your question of WHY, you may find the manuals at the following site useful:
http://www.agner.org/optimize/
Regarding the ABI, have a look especially at:
http://www.agner.org/optimize/calling_conventions.pdf
Hope that's useful.
Hmm, didn't the OS X ABI also do funny RISC-like things, like passing small structs in registers?
So that points to the consistency with other platforms theory.
Come to think of it, the FreeBSD syscall API also aligns 64-bit values (e.g., for lseek and mmap).
In order to maintain consistency in the kernel. This allows the same kernel to be booted on multiple architectures without modification.
Not sure why no one has considered the possibility of easy portability from the legacy PowerPC-based platform.
Read this:
http://developer.apple.com/library/mac/#documentation/DeveloperTools/Conceptual/LowLevelABI/100-32-bit_PowerPC_Function_Calling_Conventions/32bitPowerPC.html#//apple_ref/doc/uid/TP40002438-SW20
Then zoom in on "32-bit PowerPC Function Calling Conventions" and finally this:
"These are the embedding alignment modes available in the 32-bit
PowerPC environment:
Power alignment mode is derived from the alignment rules used by the
IBM XLC compiler for the AIX operating system. It is the default
alignment mode for the PowerPC-architecture version of GCC used on AIX
and Mac OS X. Because this mode is most likely to be compatible
between PowerPC-architecture compilers from different vendors, it’s
typically used with data structures that are shared between different
programs."
In view of the legacy PowerPC-based background of OSX, portability is a major consideration - it dictates following the convention all the way back to AIX's XLC compiler. When you think in terms of the need to make sure all the tools and applications will work together with minimal rework, I think it is important to stick to the same legacy ABI as far as possible.
That gives the philosophy; reading further, the rule is explicitly mentioned ("Prolog and Epilog"):
The called function is responsible for allocating
its own stack frame, making sure to preserve 16-byte alignment in the
stack. This operation is accomplished by a section of code called the
prolog, which the compiler places before the body of the subroutine.
After the body of the subroutine, the compiler places an epilog to
restore the processor to the state it was prior to the subroutine
call.

Resources