Question about the implementation of page_address() - linux-kernel

On x86 machines with highmem, when the kernel wants to find the kernel virtual address of a physical page frame, it uses page_address(). How it works depends on whether the macro WANT_PAGE_VIRTUAL is defined, which decides whether a void *virtual field is added to struct page. If there is no void *virtual, the kernel uses a hash table (page_address_htable) to do the conversion, and this seems to be the method x86 uses. By contrast, mips and m68k just read void *virtual directly (I'm not entirely sure about this).
So my question is: why does x86 prefer a hash table over an enlarged struct page? What benefits does that bring?

Since a huge number of struct page instances are needed (one for every physical page!), it is very expensive to add even one more word to the structure (conversely, shrinking the struct by even one word gives a lot of benefit). On 32-bit architectures WANT_PAGE_VIRTUAL is especially expensive -- without it, struct page is exactly 32 bytes, which means it packs nicely into cachelines etc.
On x86 the hash lookup is fast enough (since multiplication is fast on x86) that the tradeoff is strongly in favor of making struct page smaller. I guess on m68k multiplication is expensive enough that bloating struct page to 36 bytes is worth it.
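To make the tradeoff concrete, here is a toy user-space model of both strategies. The WANT_PAGE_VIRTUAL name mirrors the kernel macro, but the table size, the hash constant and the missing collision handling are simplifications for illustration, not the real kernel code (the field is also renamed, since `virtual` is a reserved word in C++):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define HASH_BUCKETS 8

struct page {
    unsigned long flags;                 /* stand-in for the real fields */
#ifdef WANT_PAGE_VIRTUAL
    void *virtual_addr;                  /* the extra word ('virtual' in the kernel) */
#endif
};

#ifdef WANT_PAGE_VIRTUAL
void *page_address(struct page *p)
{
    return p->virtual_addr;              /* trivial load, but struct page grows */
}
#else
/* Hash-table variant: only currently-mapped highmem pages get an entry. */
struct {
    struct page *page;
    void *virtual_addr;
} htable[HASH_BUCKETS];                  /* toy table: no collision handling */

void *page_address(struct page *p)
{
    /* multiplicative hash: one multiply, cheap on x86 */
    size_t h = (size_t)(((uintptr_t)p * 2654435761u) % HASH_BUCKETS);
    return htable[h].page == p ? htable[h].virtual_addr : NULL;
}
#endif

int main(void)
{
    printf("sizeof(struct page) = %zu\n", sizeof(struct page));
    return 0;
}

Compile with and without -DWANT_PAGE_VIRTUAL and compare sizeof(struct page): the hash variant keeps the struct one word smaller, at the cost of a multiply and a lookup per highmem query.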


Why are far pointers slow?

I have been programming some stuff on 16-bit DOS recently for fun. I see a lot of people mentioning that far pointers are slower than near pointers and to avoid them.
Why?
From an assembly point of view, this makes sense. There are several extra instructions involved. You have to save the old value of the segment register, load the new segment value into a general-purpose register, and then move that into DS or ES (loading a segment register from an immediate isn't a valid encoding on the 8086). You can then do whatever you need to do in that segment. Afterwards, you have to restore the old value.
It sounds like a lot, but in reality this doesn't eat up a lot of cycles. I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped. So unless you were bouncing all over the place, which is slow for other reasons, the penalty shouldn't be that bad. If you have to hit DRAM, that should dominate the cost, right?
I feel like there is more to this story and I am having a hard time tracking it down. Hoping an 8086 wizard is hanging around who remembers this stuff.
To clarify: I am interested in actual 16-bit processors like the 8086 and the 80286 in real mode.
Why are far pointers slow?
Segment register loads are too complex for the core's front-end to convert into micro-ops directly. Instead they're "emulated" by micro-ops stored in a small ROM. That sequence begins with some branches (which CPU mode is it?), and typically those branches can't benefit from the CPU's branch prediction, causing stalls.
To avoid segment register loads (e.g. when the same far pointer is used multiple times) software tends to use more segment registers (e.g. tends to use ES, FS and GS); and this adds more prefixes (segment override prefixes) to instructions. These extra prefixes can also slow down instruction decoding.
I guess if every pointer you were using were in a different segment this could add up, but data is usually grouped.
Compilers aren't that smart. If a small piece of code uses 4 far pointers that all happen to use the same segment, the compiler won't know that they're all in the same segment and will do expensive segment register loads regardless. To work around that you could describe the data as a structure (e.g. 1 pointer to a structure that has 4 fields instead of 4 different pointers), as in the sketch below; but that requires the programmer to write the software differently.
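A sketch of that workaround, using the non-standard far keyword from 16-bit Borland/Microsoft-style compilers (the macro at the top is only there so the snippet also compiles on modern compilers, where everything is "near"):

#if !defined(__MSDOS__) && !defined(far)
#define far   /* 'far' is a 16-bit compiler extension; make it a no-op elsewhere */
#endif

struct state {
    int a, b, c, d;
};

/* Four separate far pointers: potentially four segment register loads. */
int sum_scattered(int far *a, int far *b, int far *c, int far *d)
{
    return *a + *b + *c + *d;
}

/* One far pointer to a struct: one segment load covers all four fields. */
int sum_grouped(struct state far *s)
{
    return s->a + s->b + s->c + s->d;
}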
For an example: if you do something like int foo(void) { int a; return bar(&a); } then a far pointer built from SS and the offset of a will probably be passed on the stack, and the callee (bar) will load its segment part into another segment register, because bar() has to assume that the pointer could point anywhere.
The other problem is that sometimes the data is larger than a segment (e.g. an array of 100000 bytes that doesn't fit in a 64 KiB segment); so someone (the programmer or the compiler) has to calculate and load a different segment to access (parts of) the same data. All pointer arithmetic may need to take this into account: a trivial-looking pointer++; may become something more like offset++; segment += (offset >> 4); offset &= 0x000F; which causes a segment register load. See the sketch below.
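Written out, that "huge pointer" arithmetic looks something like the following self-contained model (real 16-bit compilers emitted this inline in assembly; the struct and function names here are made up for illustration):

#include <stdint.h>

/* A real-mode address is segment * 16 + offset. */
struct huge_ptr {
    uint16_t segment;
    uint16_t offset;
};

/* Advance by one byte, then renormalize so the offset stays in 0..15;
 * the segment absorbs the carry, which is what forces a segment reload. */
void huge_inc(struct huge_ptr *p)
{
    p->offset++;
    p->segment += p->offset >> 4;   /* one segment step covers 16 bytes */
    p->offset &= 0x000F;
}

/* The 20-bit linear address the pair refers to. */
uint32_t huge_linear(const struct huge_ptr *p)
{
    return ((uint32_t)p->segment << 4) + p->offset;
}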
If you have to hit DRAM, that should dominate the cost, right?
For real mode, you're limited to about 640 KiB of RAM, while the caches in modern CPUs are typically much larger, so you can expect every memory access to be a cache hit. In some cases (e.g. Cascade Lake CPUs with 1 MiB of L2 cache per core) you won't even use the L3 cache (it'll be all L2 hits).
You can also expect that a segment register load is more expensive than a cache hit.
Hoping an 8086 wizard is hanging around who remembers this stuff.
When people say "segmentation is awful and should be avoided" they're not thinking of a CPU that's been obsolete for 40 years (8086), they're thinking of CPUs that were relevant during this century. Some of them may also be thinking of more than just performance alone (especially for assembly language programmers, segmentation is an annoyance/extra burden).

How to split one heap into several heaps inside one process?

Which ecosystems allow creating multiple heaps right now?
Is it possible to have multiple heaps in java?
garbage collection and memory management in Erlang
Is there any benefit to use multiple heaps for memory management purposes?
AppDomains don't create new heaps (there is still one heap for all domains). So, what does one need to do to run several different GCs inside a single process?
Which syntactic primitives would one need to create? How should a runtime support those primitives?
Which ecosystems allow creating multiple heaps right now?
One obvious answer would be "C++" (feel free to fill in surrounding pieces as you see fit, if you don't consider a language to be an "ecosystem" in itself).
C++ allows you to specify heaps along a few different axes. One is by the type of an object--you can specify allocation for a particular type by overloading operator new and operator delete for that type:
class Foo {
public:
    // Class-specific allocation functions, used by `new Foo` / `delete p`.
    static void *operator new(std::size_t size);
    static void operator delete(void *block, std::size_t size);
};
It's then up to you to connect these heap management functions to an actual source of memory. You might allocate that via ::operator new, or you might (for example) go directly to the OS, such as with something like GlobalAlloc or VirtualAlloc on Windows, sbrk on UNIX-like systems, or just have pre-specified blocks of memory on a bare-metal embedded system.
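For instance, here is one deliberately minimal way to wire those operators to their own source of memory: a hypothetical fixed-size arena with a bump allocator that never reuses freed blocks (a sketch, not a production allocator):

#include <cstddef>
#include <new>

class Foo {
public:
    static void *operator new(std::size_t size);
    static void operator delete(void *block, std::size_t size) noexcept;
    int value = 0;
};

namespace {
    // Hypothetical per-type arena; a real implementation would recycle blocks.
    alignas(std::max_align_t) unsigned char arena[64 * 1024];
    std::size_t arena_used = 0;
}

void *Foo::operator new(std::size_t size)
{
    // Round up so successive allocations stay suitably aligned.
    size = (size + alignof(std::max_align_t) - 1)
         & ~(alignof(std::max_align_t) - 1);
    if (arena_used + size > sizeof arena)
        throw std::bad_alloc{};
    void *p = arena + arena_used;
    arena_used += size;
    return p;
}

void Foo::operator delete(void *, std::size_t) noexcept
{
    // Bump allocator: individual frees are a no-op in this sketch.
}

int main()
{
    Foo *f = new Foo;   // comes from `arena`, not the global heap
    delete f;
}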
Along a somewhat different axis, all the containers in the C++ standard library allocate and free memory via Allocator classes. The Allocator for any particular collection is specified as a template parameter, so (for example) a declaration for std::vector looks something like this:
template <class T, class Alloc = std::allocator<T>>
class vector {
    // ...
};
This lets you specify a heap that will be used to allocate objects in that collection. Much as with operator new and operator delete, this really only specifies the interface by which the collection will allocate and free memory--it's up to you to connect that to code that actually manages the heap.
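As a sketch of where a custom heap would plug in, here is a minimal C++11-style allocator that merely forwards to malloc/free (illustrative only; a real one would go to its own heap):

#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>

template <class T>
struct MallocAllocator {
    using value_type = T;

    MallocAllocator() = default;
    template <class U> MallocAllocator(const MallocAllocator<U> &) {}

    T *allocate(std::size_t n)
    {
        if (void *p = std::malloc(n * sizeof(T)))
            return static_cast<T *>(p);
        throw std::bad_alloc{};
    }
    void deallocate(T *p, std::size_t) noexcept { std::free(p); }
};

template <class T, class U>
bool operator==(const MallocAllocator<T> &, const MallocAllocator<U> &) { return true; }
template <class T, class U>
bool operator!=(const MallocAllocator<T> &, const MallocAllocator<U> &) { return false; }

int main()
{
    // Every allocation this vector makes goes through MallocAllocator.
    std::vector<int, MallocAllocator<int>> v{1, 2, 3};
    v.push_back(4);
}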
Garbage Collection
As far as garbage collection goes: I personally find it annoying, and advise against its use as a general rule. The problem is that while it can (at least from one perspective) fix some types of problems with memory management, it does nothing to help with managing other resources--and (unfortunately) I haven't seen anything like a tracing collector for file handles, network sockets, database connections, and so on. RAII provides a uniform method for dealing with resource management in general.
That said, if you really insist on using GC, C++ does support that as well. Prior to C++11, GC was entirely usable on a practical level, but led to what was technically undefined behavior under a few obscure circumstances, such as:
storing a pointer in a file, and reading it back in, or
modifying the bits of a pointer, later un-doing that modification
...and later taking the re-constituted pointer and dereferencing it. Obviously, while the pointer wasn't visible to the CPU, the pointed-to block of memory became eligible for collection, so the later dereference caused problems. C++11 gave these circumstances defined behavior and added a few library calls (e.g., declare_reachable, undeclare_reachable) to deal with them: if you call declare_reachable(block);, that block is not eligible for collection, regardless of whether a pointer to it is visible. As such, if you want to use GC with C++ you can, and the bounds of defined behavior are thoroughly specified. The only problem is that essentially no code ever calls declare_reachable and/or undeclare_reachable, so in real use they're likely to be of little or no help (but pointer swizzling and/or storage in a file are sufficiently rare that this is unlikely to pose a real problem).
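For reference, those calls live in <memory>; here is a sketch of the hide-and-reconstitute scenario described above (note these functions were removed again in C++23, and most implementations only ever provided them as no-ops, which matches the point about nobody calling them):

#include <cstdint>
#include <memory>

int main()
{
    int *p = new int(42);
    std::declare_reachable(p);            // block survives even with no visible pointer

    // "Hide" the pointer (the disguised-pointer case described above) ...
    std::uintptr_t disguised = reinterpret_cast<std::uintptr_t>(p) ^ 0xDEADBEEF;
    p = nullptr;

    // ... and later reconstitute and dereference it.
    int *q = reinterpret_cast<int *>(disguised ^ 0xDEADBEEF);
    q = std::undeclare_reachable(q);      // pairs with declare_reachable
    int v = *q;                           // well-defined even under a collector
    delete q;
    return v == 42 ? 0 : 1;
}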
For a practical example, you might want to look at the Boehm-Demers-Weiser collector (if you haven't already).
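A minimal Boehm-collector program looks something like this (assuming the collector is installed; link with -lgc):

#include <gc.h>
#include <stdio.h>

int main(void)
{
    GC_INIT();
    for (int i = 0; i < 1000000; i++) {
        int *p = (int *)GC_MALLOC(sizeof *p);  /* never freed explicitly */
        *p = i;
    }
    printf("collected heap size: %zu bytes\n", (size_t)GC_get_heap_size());
    return 0;
}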

Does unordered map frees its memory after calling clear?

I know that unordered_map destroys the objects that were stored in it when clear() is called, but does it also relinquish the memory (back to the OS) that it allocated for itself?
#include <iostream>
#include <unordered_map>

struct A {
    A() { std::cout << "constructor called\n"; }
    ~A() { std::cout << "destructor called\n"; }
};

// this will reserve 1000 buckets
std::unordered_map<int, A> my_map{1000};

// now insert into the map
my_map[1] = A{};
my_map[2] = A{};
// ...
Then I call my_map.clear(), which will run the destructor of each A.
So my question is: will the 1000 buckets that were reserved also be freed? I tried looking at size() after calling clear() and it says zero. Is there something like vector's capacity() for unordered_map that would let me view the reserved size?
will the 1000 buckets that were reserved also be freed?
This is not specified by the standard.
is there something like vector's capacity() for unordered_map that would let me view the reserved size?
A single "capacity" value doesn't make sense for a hash map. Rehashing happens when load_factor() exceeds max_load_factor(). You can check the number of buckets using bucket_count().
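A quick way to observe what your implementation actually does (libstdc++ and libc++ typically keep the bucket array after clear(), but, as said, the standard requires neither behavior):

#include <iostream>
#include <unordered_map>

int main()
{
    std::unordered_map<int, int> m{1000};   // request at least 1000 buckets
    std::cout << "buckets before clear: " << m.bucket_count() << '\n';
    m.clear();
    std::cout << "buckets after clear:  " << m.bucket_count() << '\n';
}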
I know that unordered_map destroys the objects that were stored in it when clear() is called, but does it also relinquish the memory (back to the OS) that it allocated for itself?
There is no guarantee (in particular because the C++11 specification does not know what an OS is), and in practice it often doesn't release the memory to the OS (that is, your virtual address space might stay unchanged).
What happens is that (without a specific Allocator argument to the std::map or std::unordered_map template) the memory is delete-d (or delete[]-d). If you really care a lot, implement your own Allocator and pass it appropriately to your container templates. I'm not sure that is worth the trouble.
Usually new calls malloc and delete calls free. In that case, in most (but not all) situations the memory is not released to the kernel (e.g. with munmap(2) on Linux) but just marked as re-usable by future calls to new.
Details are of course implementation specific. Many free (or delete) implementations will, for example, release a large enough block (several megabytes or more) with munmap. The exact threshold varies from one implementation to the next.
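On glibc you can observe and tune that threshold; the knobs below (M_MMAP_THRESHOLD, malloc_trim) are glibc-specific, so treat this as an illustration rather than portable code:

#include <malloc.h>
#include <stdlib.h>

int main(void)
{
    mallopt(M_MMAP_THRESHOLD, 1 << 20);  /* blocks >= 1 MiB come from mmap */
    void *big = malloc(4 << 20);         /* likely mmap-ed ...             */
    free(big);                           /* ... and munmap-ed right here   */
    malloc_trim(0);                      /* return spare heap top to the OS */
    return 0;
}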
Notice that if you use free software (e.g. a Linux system with GCC and GNU libc), you can look into the source code of your C++ standard library (the code of std::map and of ::operator delete etc. inside GCC) and of your C standard library (the code of free) and find out these gory details for yourself.

Atomicity, Volatility and Thread Safety in Windows

It's my understanding of atomicity that it's used to make sure a value will be read/written in whole rather than in parts. For example, a 64-bit value that is really two 32-bit DWORDs (assume x86 here) must be accessed atomically when shared between threads so that both DWORDs are read/written at the same time. That way one thread can't read a half-updated variable. How do you guarantee atomicity?
Furthermore it's my understanding that volatility does not guarantee thread safety at all. Is that true?
I've seen it implied many places that simply being atomic/volatile is thread-safe. I don't see how that is. Won't I need a memory barrier as well to ensure that any values, atomic or otherwise, are read/written before they can actually be guaranteed to be read/written in the other thread?
So for example let's say I create a thread suspended, do some calculations to change some values to a struct available to the thread and then resume, for example:
HANDLE hThread = CreateThread(NULL, 0, thread_entry, (void *)&data, CREATE_SUSPENDED, NULL);
data.val64 = SomeCalculation();
ResumeThread(hThread);
I suppose this would depend on any memory barriers in ResumeThread? Should I do an interlocked exchange for val64? What if the thread were running, how does that change things?
I'm sure I'm asking a lot here but basically what I'm trying to figure out is what I asked in the title: a good explanation for atomicity, volatility and thread safety in Windows. Thanks
it's used to make sure a value will be read/written in whole
That's just a small part of atomicity. At its core it means "uninterruptible", an instruction on a processor whose side-effects cannot be interleaved with another instruction. By design, a memory update is atomic when it can be executed with a single memory-bus cycle. Which requires the address of the memory location to be aligned so that a single cycle can update it. An unaligned access requires extra work, part of the bytes written by one cycle and part by another. Now it is not uninterruptible anymore.
Getting aligned updates is pretty easy, it is a guarantee provided by the compiler. Or, more broadly, by the memory model implemented by the compiler. Which simply chooses memory addresses that are aligned, sometimes intentionally leaving unused gaps of a few bytes to get the next variable aligned. An update to a variable that's larger than the native word size of the processor can never be atomic.
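As a small illustration of what the compiler does, note the gap it leaves so that the int lands on its natural boundary:

#include <cstdio>

struct S {
    char c;   // offset 0
              // 3 padding bytes on a typical ABI ...
    int  i;   // ... so that i sits at offset 4, naturally aligned
};

int main()
{
    std::printf("sizeof(S) = %zu\n", sizeof(S));   // typically 8, not 5
}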
But much more important are the kind of processor instructions you need to make threading work. Every processor implements a variant of the CAS instruction, compare-and-swap. It is the core atomic instruction you need to implement synchronization. Higher level synchronization primitives, like monitors (aka condition variables), mutexes, signals, critical sections and semaphores are all built on top of that core instruction.
That's the minimum; a processor usually provides extra ones to make simple operations atomic, like incrementing a variable, which is at its core an interruptible operation since it requires a read-modify-write sequence. The need for it to be atomic is very common; almost any C++ program relies on it, for example, to implement reference counting.
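On Windows those surface as the Interlocked* family; a short sketch of the two primitives just mentioned, CAS and atomic increment:

#include <windows.h>

LONG refcount = 0;

void add_ref()
{
    InterlockedIncrement(&refcount);   // atomic read-modify-write increment
}

bool try_claim(volatile LONG *flag)
{
    // The CAS: set *flag to 1 only if it is currently 0;
    // the return value is what the flag held before the attempt.
    return InterlockedCompareExchange(flag, 1, 0) == 0;
}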
volatility does not guarantee thread safety at all
It doesn't. It is an attribute that dates from much easier times, back when machines only had a single processor core. It only affects code generation, in particular the way a code optimizer tries to eliminate memory accesses and use a copy of the value in a processor register instead. Makes a big, big difference to code execution speed, reading a value from a register is easily 3 times faster than having to read it from memory.
Applying volatile ensures that the code optimizer does not consider the value in the register to be accurate and forces it to read memory again. It truly only matters on the kind of memory values that are not stable by themselves, devices that expose their registers through memory-mapped I/O. It has been abused heavily since that core meaning to try to put semantics on top of processors with a weak memory model, Itanium being the most egregious example. What you get with volatile today is strongly dependent on the specific compiler and runtime you use. Never use it for thread-safety, always use a synchronization primitive instead.
simply being atomic/volatile is thread-safe
Programming would be much simpler if that were true. Atomic operations only cover very simple operations, while a real program often needs to keep an entire object thread-safe: all its members updated atomically, never exposing a view of the object that is partially updated. Something as simple as iterating a list is a core example; you can't have another thread modifying the list while you are looking at its elements (see the sketch below). That's when you need to reach for the higher-level synchronization primitives, the kind that can block code until it is safe to proceed.
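A sketch of that list scenario using one such higher-level primitive, a Win32 CRITICAL_SECTION (the names here are illustrative):

#include <windows.h>
#include <list>

CRITICAL_SECTION g_lock;
std::list<int> g_items;

void init_lock()
{
    InitializeCriticalSection(&g_lock);   // call once before the threads start
}

long long sum_items()
{
    EnterCriticalSection(&g_lock);
    long long sum = 0;
    for (int v : g_items)                 // safe: writers are blocked meanwhile
        sum += v;
    LeaveCriticalSection(&g_lock);
    return sum;
}

void add_item(int v)
{
    EnterCriticalSection(&g_lock);
    g_items.push_back(v);
    LeaveCriticalSection(&g_lock);
}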
Real programs often suffer from this synchronization need and exhibit Amdahl's law behavior. In other words, adding an extra thread does not actually make the program faster, and sometimes actually makes it slower. Whoever finds a better mousetrap for this is guaranteed a Nobel; we're still waiting.
In general, C and C++ don't give any guarantees about how reading or writing a 'volatile' object behaves in multithreaded programs. (The 'new' C++11 probably does, since it now includes threads as part of the standard, but traditionally threads have not been part of standard C or C++.) Using volatile and making assumptions about atomicity and cache-coherence in code that's meant to be portable is a problem. It's a crap-shoot as to whether a particular compiler and platform will treat accesses to 'volatile' objects in a thread-safe way.
The general rule is: 'volatile' is not enough to ensure thread safe access. You should use some platform-provided mechanism (usually some functions or synchronisation objects) to access thread-shared values safely.
Now, specifically on Windows, specifically with the VC++ 2005+ compiler, and specifically on x86 and x64 systems, accessing a primitive object (like an int) can be made thread-safe if:
On 64- and 32-bit Windows, the object has to be a 32-bit type, and it has to be 32-bit aligned.
On 64-bit Windows, the object may also be a 64-bit type, and it has to be 64-bit aligned.
It must be declared volatile.
If those are true, then accesses to the object will be volatile, atomic and be surrounded by instructions that ensure cache-coherency. The size and alignment conditions must be met so that the compiler makes code that performs atomic operations when accessing the object. Declaring the object volatile ensures that the compiler doesn't make code optimisations related to caching previous values it may have read into a register and ensures that code generated includes appropriate memory barrier instructions when it's accessed.
Even so, you're probably still better off using something like the Interlocked* functions for accessing small things, and bog standard synchronisation objects like Mutexes or CriticalSections for larger objects and data structures. Ideally, get libraries for and use data structures that already include appropriate locks. Let your libraries & OS do the hard work as much as possible!
In your example, I expect you do need to use a thread-safe access to update val64 whether the thread is started yet or not.
If the thread was already running, then you would definitely need some kind of thread-safe write to val64, either using InterlockedExchange64 or similar, or by acquiring and releasing some kind of synchronisation object which will perform appropriate memory barrier instructions. Similarly, the thread would need to use a thread-safe accessor to read it as well.
In the case where the thread hasn't been resumed yet, it's a bit less clear. It's possible that ResumeThread might use or act like a synchronisation function and do the memory barrier operations, but the documentation doesn't specify that it does, so it is better to assume that it doesn't.
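Putting that together, a thread-safe handoff of val64 could look like the following sketch (the Data struct and function names are hypothetical; InterlockedExchange64 and InterlockedCompareExchange64 are the real Win32 calls):

#include <windows.h>

struct Data {
    volatile LONG64 val64;   // must be 64-bit aligned on 64-bit Windows
};

void publish(Data *data, LONG64 value)
{
    InterlockedExchange64(&data->val64, value);   // atomic store plus full fence
}

LONG64 consume(Data *data)
{
    // Common idiom: a CAS with identical comparand and exchange value
    // performs an atomic 64-bit read without changing the contents.
    return InterlockedCompareExchange64(&data->val64, 0, 0);
}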
References:
On atomicity of 32- and 64- bit aligned types... https://msdn.microsoft.com/en-us/library/windows/desktop/ms684122%28v=vs.85%29.aspx
On 'volatile' including memory fences... https://msdn.microsoft.com/en-us/library/windows/desktop/ms686355%28v=vs.85%29.aspx

Structure memory alignment - Compile time vs Dynamically allocated memory

I was just going through the glibc manual's description of the posix_memalign function when I encountered this statement:
The address of a block returned by malloc or realloc in the GNU system is always a
multiple of eight (or sixteen on 64-bit systems). If you need a block whose address is a
multiple of a higher power of two than that, use memalign, posix_memalign, or valloc.
If I consider a simple structure containing just an int data member:
struct Mystruct
{
    int member;
};
Then I can see that Mystruct should be 4-byte aligned. But according to the libc manual, on a 64-bit architecture, dynamically allocating memory for such a structure would return memory at an address with 16-byte alignment.
Correct me if I am wrong, but to me it seems like the compiler uses the natural alignment of a structure only for global/static/automatic variables (data, bss, stack), whereas to allocate the same structure on the heap, malloc uses a predefined alignment (8 on 32-bit architectures and 16 on 64-bit architectures)?
One thing worth remembering is that malloc() is nothing magical -- it's just a function. Likewise, the heap is just a bunch of memory that the compiler promises not to touch for its own stuff. Allocating things statically (on the stack or in data/bss) is something that the compiler does do, so it can align them with whatever alignment is specified. malloc() (and the associated functions) are functions that manage the heap, but they are called at runtime. The compiler doesn't know anything about malloc() except that it's a function, and malloc() doesn't know anything about the memory that you're asking it to allocate except how much you're asking for. Because 16-byte alignment is fine for the most common uses, malloc() ensures this alignment to be safe.
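You can check both halves of that directly; this sketch prints the alignment malloc actually gave you, then requests a stricter one via posix_memalign (the 64 is an arbitrary example):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

struct Mystruct { int member; };

int main(void)
{
    struct Mystruct *a = (struct Mystruct *)malloc(sizeof *a);
    printf("malloc:         %p  16-byte aligned: %d\n",
           (void *)a, (uintptr_t)a % 16 == 0);

    void *b = NULL;
    if (posix_memalign(&b, 64, sizeof(struct Mystruct)) == 0)
        printf("posix_memalign: %p  64-byte aligned: %d\n",
               b, (uintptr_t)b % 64 == 0);

    free(a);
    free(b);
    return 0;
}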
PS: A fun (and eye-opening) exercise is to write your own malloc() implementation.
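In that spirit, here is roughly the smallest possible "malloc": a bump allocator on top of sbrk, with no free() and no bookkeeping, just to show there is no magic (a toy, not something to use):

#include <stddef.h>
#include <unistd.h>

void *my_malloc(size_t size)
{
    size = (size + 15) & ~(size_t)15;   /* round up; keeps 16-byte alignment,
                                           assuming the initial break is aligned */
    void *p = sbrk(size);               /* push the program break upward */
    return p == (void *)-1 ? NULL : p;
}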
