Difference between the d-cache and the .bss segment - caching

While reading up on branch prediction I came across two terms I hadn't heard before: i-cache and d-cache. I looked them up: the i-cache is for instructions and the d-cache is for data.
I stumbled upon a website that says:
Compiled software binaries usually consist of two or more "segments" that separate code from data (global and static variables, constants, etc.)
As far as I know the .bss segment stores static data and uninitialized variables. Is there any difference between the .bss and the d-cache?
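For what it's worth, a small C file can make the distinction concrete; the section placement shown in the comments is what compilers typically do, not a guarantee:

int zero_init;             /* .bss    - zero-initialized / uninitialized static data */
int counter = 42;          /* .data   - initialized static data                      */
const char msg[] = "hi";   /* .rodata - read-only constants                          */

int add(int a, int b)      /* .text   - machine code, fetched through the i-cache    */
{
    return a + b;          /* the variables above are read and written through the
                              d-cache at run time; segments describe the binary's
                              layout, caches are hardware the bytes pass through     */
}

In short, .bss is a region of the program image, while the d-cache is a hardware structure; any byte, whether it lives in .bss, .data, or the stack, goes through the d-cache when it is accessed as data.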

Related

Does page walk take advantage of shared tables?

Suppose two address spaces share a largish lump of non-contiguous memory.
The system might want to share physical page table(s) between them.
These tables wouldn't use Global bits (even if supported), and would be tied to ASIDs if those are supported.
There are immediate benefits, since the data cache will be less polluted than with a copy, there is less pinned RAM, etc.
Does the page walk take explicit advantage of this in any known architecture?
If so, does that imply the MMU is explicitly caching and sharing interior page-tree nodes based on physical tag?
Sorry for the multiple questions; it really is one question broken down. I am trying to determine whether it is worth devising a measurement test for this.
On modern x86 CPUs (like Sandybridge-family), page walks fetch through the cache hierarchy (L1d / L2 / L3), so yes, there's an obvious benefit there from having two different page directories point to the same subtree for a shared region of virtual address space. Or, on some AMD CPUs, walks fetch through L2, skipping L1d.
"What happens after a L2 TLB miss?" has more details about the fact that page walks definitely fetch through cache, e.g. Broadwell perf counters exist to measure hits.
("The MMU" is part of a CPU core; the L1dTLB is tightly coupled to load/store execution units. The page walker is a fairly separate thing, though, and runs in parallel with instruction execution, but is still part of the core and can be triggered speculatively, etc. So it's tightly coupled enough to access memory through L1d cache.)
Higher-level PDEs (page directory entries) can be worth caching inside the page-walk hardware. Section 3 of this paper confirms that Intel and AMD actually do this in practice, so you need to flush the TLB in cases where you might think you didn't need to.
However, I don't think you'll find that PDE caching happening across a change in the top-level page-tables.
On x86, you install a new page table with a mov to CR3; that implicitly flushes all cached translations and internal page-walker PDE caching, like invlpg does for one virtual address. (Or with ASIDs, makes TLB entries from different ASIDs unavailable for hits).
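As a concrete sketch of that step (assuming kernel-mode code on x86-64 and GCC inline assembly; load_new_page_table and pgd_phys are illustrative names, not kernel API):

/* Reloading CR3 switches to a new top-level page table and, as a side effect,
 * invalidates non-global TLB entries and the page walker's internal caches.
 * This is a privileged instruction: it only works in ring 0. */
static inline void load_new_page_table(unsigned long pgd_phys)
{
    asm volatile("mov %0, %%cr3" : : "r"(pgd_phys) : "memory");
}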
The main issue is that the TLB and the page-walker's internal caches are not coherent with main memory / data caches. I think all ISAs that do HW page walks at all require manual flushing of TLBs, with semantics like x86's for installing a new page table. (Some ISAs like MIPS only do software TLB management, invoking a special kernel TLB-miss handler; your question won't apply there.)
So yes, they could detect the same physical address, but for sanity you'd also have to avoid using cached data that has become stale after a store to that physical address.
Without hardware-managed coherence between page-table stores and TLB/pagewalk, there's no way this cache could happen safely.
That said, some x86 CPUs do go beyond what's on paper and do limited coherency with stores, but only to protect you from speculative page walks, for backwards compatibility with OSes that assumed a valid but not-yet-used PTE could be modified without invlpg. http://blog.stuffedcow.net/2015/08/pagewalk-coherence/
So it's not unheard of for microarchitectures to snoop stores to detect stores to certain ranges; you could plausibly have stores snoop the address ranges near locations the page-walker had internally cached, effectively providing coherence for internal page-walker caches.
Modern x86 does in practice detect self-modifying code by snooping for stores near any in-flight instructions (see Observing stale instruction fetching on x86 with self-modifying code). In that case, snoop hits are handled by nuking the whole back-end state back to retirement state.
So it's plausible that you could in theory design a CPU with an efficient mechanism to be able to take advantage of this transparently, but it has significant cost (snooping every store against a CAM to check for matches on page-walker-cached addresses) for very low benefit. Unless I'm missing something, I don't think there's an easier way to do this, so I'd bet money that no real designs actually do this.
Hard to imagine outside of x86; almost everything else takes a "weaker" / "fewer guarantees" approach and would only snoop the store buffer (for store-forwarding). CAMs (content-addressable-memory = hardware hash table) are power-hungry, and handling the special case of a hit would complicate the pipeline. Especially an OoO exec pipeline where the store to a PTE might not have its store-address ready until after a load wanted to use that TLB entry. Introducing more pipeline nukes is a bad thing.
The benefit of this would be tiny
After the first page-walk fetches data from L1d cache (or farther away if it wasn't hot in L1d either), then the usual cache-within-page-walker mechanisms can act normally.
So further page walks for nearby pages before the next context switch can benefit from page-walker internal caches. This has benefits, and is what some real HW does (at least some x86; IDK about others).
All the argument above about why this would require snooping for coherent page tables is about having the page-walker internal caches stay hot across a context switch.
L1d can easily do that; VIPT caches that behave like PIPT (no aliasing) simply cache based on physical address and don't need flushing on context switch.
If you're context-switching very frequently, the ASIDs let TLB entries proper stay cached. If you're still getting a lot of TLB misses, the worst case is that they have to fetch through cache all the way from the top. This is really not bad and very much not worth spending a lot of transistors and power budget on.
I'm only considering OS on bare metal, not HW virtualization with nested page tables. (Hypervisor virtualizing the guest OS's page tables). I think all the same arguments basically apply, though. Page walk still definitely fetches through cache.

Does CUDA cache data from global memory into the unified cache before storing it into shared memory?

As far as I know, previous NVIDIA GPU architectures follow the path global memory -> L2 -> L1 -> register -> shared memory to store data into shared memory.
However, the Maxwell GPU (GTX 980) has physically separate unified cache and shared memory, and I want to know whether this architecture follows the same steps to store data into shared memory, or whether it supports direct communication between global and shared memory.
(The unified cache is enabled with the option "-dlcm=ca".)
This might answer most of your questions about memory types and steps within the Maxwell architecture:
As with Kepler, global loads in Maxwell are cached in L2 only, unless using the LDG read-only data cache mechanism introduced in Kepler.
In a manner similar to Kepler GK110B, GM204 retains this behavior by default but also allows applications to opt-in to caching of global loads in its unified L1/Texture cache. The opt-in mechanism is the same as with GK110B: pass the -Xptxas -dlcm=ca flag to nvcc at compile time.
Local loads also are cached in L2 only, which could increase the cost of register spilling if L1 local load hit rates were high with Kepler. The balance of occupancy versus spilling should therefore be reevaluated to ensure best performance. Especially given the improvements to arithmetic latencies, code built for Maxwell may benefit from somewhat lower occupancy (due to increased registers per thread) in exchange for lower spilling.
The unified L1/texture cache acts as a coalescing buffer for memory accesses, gathering up the data requested by the threads of a warp prior to delivery of that data to the warp. This function previously was served by the separate L1 cache in Fermi and Kepler.
From section "1.4.2. Memory Throughput", sub-section "1.4.2.1. Unified L1/Texture Cache" in the Maxwell tuning guide from Nvidia.
The sections and sub-sections that follow those two also spell out other useful details about shared memory sizes/bandwidth, caching, etc.
Give it a try!

____cacheline_aligned_in_smp for structure in the Linux kernel

In the Linux kernel, why do many structures use the ____cacheline_aligned_in_smp macro? Does it help increase performance when accessing the structure? If yes then how?
____cacheline_aligned instructs the compiler to instantiate a struct or variable at an address corresponding to the beginning of an L1 cache line for the specific architecture, i.e., so that it is L1 cache-line aligned. ____cacheline_aligned_in_smp is similar, but the object is actually L1 cache-line aligned only when the kernel is compiled in an SMP configuration (i.e., with option CONFIG_SMP). These are defined in include/linux/cache.h.
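Simplified, the definitions in include/linux/cache.h look roughly like this (a sketch of the upstream header, not an exact copy; SMP_CACHE_BYTES is the architecture's L1 line size):

#ifndef ____cacheline_aligned
#define ____cacheline_aligned __attribute__((__aligned__(SMP_CACHE_BYTES)))
#endif

#ifndef ____cacheline_aligned_in_smp
#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp ____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif
#endif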
These definitions are useful for variables (and data structures) that are not allocated dynamically, via some allocator, but are global, compiler-allocated variables (a similar effect can be accomplished by dynamic memory allocators that can allocate memory at specific alignment).
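For example, in user space a cache-line-aligned buffer can be obtained with C11's aligned_alloc (the 64-byte line size here is an assumption for typical x86 parts, not something the language guarantees):

#include <stdlib.h>

int main(void)
{
    /* Request a 4096-byte buffer whose start address is 64-byte aligned,
     * i.e. it begins exactly on a cache-line boundary on typical x86 CPUs.
     * The size passed to aligned_alloc must be a multiple of the alignment. */
    void *buf = aligned_alloc(64, 4096);
    if (!buf)
        return 1;
    free(buf);
    return 0;
}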
The reason for cache-line aligned variables is to manage the cache-to-cache transfers of these variables, by hardware cache coherence mechanisms, in SMP systems, so that their movement does not implicitly occur when other variables are moved. This is for performance critical code, where one expects contention in the access of variables by multiple cpus (cores). The usual problem one tries to avoid, in this case, is false sharing.
A variable's memory starting at the beginning of a cache line is half the work for this purpose; one also needs to "pack with it" only variables that should move together. An example is an array of variables, where each element of the array is to be accessed by only one cpu (core):
struct my_data {
    long int a;
    int b;
} ____cacheline_aligned_in_smp cpu_data[NR_CPUS];
This kind of definition requires the compiler (in an SMP configuration of the kernel) to start each CPU's struct at a cache-line boundary. The compiler will implicitly allocate extra space after each CPU's struct, so that the next CPU's struct also begins at a cache-line boundary.
An alternative is to pad the data structure with a cache line's size of dummy, unused bytes:
struct my_data {
    long int a;
    int b;
    char dummy[L1_CACHE_BYTES];
} cpu_data[NR_CPUS];
In this case, only the dummy, unused data will be moved unintentionally; the data actually accessed by each CPU will only move from cache to memory and vice versa due to cache capacity misses.
Each cache line in any cache (d-cache or i-cache) is 64 bytes on the x86 architecture. Cache alignment is required to avoid false sharing of cache lines. If a cache line is shared between global variables (which happens more often in the kernel) and one of those globals is changed by a processor in its cache, that cache line is marked dirty. In the remaining CPUs' caches the line becomes a stale entry, which must be flushed and re-fetched from memory. This leads to cache line misses, which cost extra CPU cycles and reduce the performance of the system. Remember this is for global variables. Most kernel data structures use this to avoid cache line misses.
Linux manages the CPU cache in a very similar fashion to the TLB. CPU caches, like TLB caches, take advantage of the fact that programs tend to exhibit locality of reference. To avoid having to fetch data from main memory for each reference, the CPU will instead cache very small amounts of data in the CPU cache. Frequently there are two levels, called the Level 1 and Level 2 CPU caches. The Level 2 CPU caches are larger but slower than the L1 cache, but Linux only concerns itself with the Level 1 or L1 cache.
CPU caches are organised into lines. Each line is typically quite small, usually 32 bytes, and each line is aligned to its boundary size. In other words, a cache line of 32 bytes will be aligned on a 32-byte address. With Linux, the size of the line is L1_CACHE_BYTES, which is defined by each architecture.
How addresses are mapped to cache lines vary between architectures but the mappings come under three headings, direct mapping, associative mapping and set associative mapping. Direct mapping is the simplest approach where each block of memory maps to only one possible cache line. With associative mapping, any block of memory can map to any cache line. Set associative mapping is a hybrid approach where any block of memory can map to any line but only within a subset of the available lines.
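As a toy illustration of direct mapping (the cache geometry below is made up: 32 KiB of cache split into 64-byte lines gives 512 lines):

#include <stdint.h>

enum { LINE_SIZE = 64, NUM_LINES = (32 * 1024) / LINE_SIZE };

/* In a direct-mapped cache, every address can live in exactly one line:
 * the line index is just a slice of the address bits. */
static unsigned cache_line_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_SIZE) % NUM_LINES);
}

Two addresses 32 KiB apart map to the same line and will evict each other; a set-associative cache softens this by letting each index hold several lines.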
Regardless of the mapping scheme, they each have one thing in common: addresses that are close together and aligned to the cache size are likely to use different lines. Hence Linux employs simple tricks to try to maximise cache usage:
- Frequently accessed structure fields are placed at the start of the structure, to increase the chance that only one line is needed to address the common fields (see the sketch after this list);
- Unrelated items in a structure should be at least cache-size bytes apart, to avoid false sharing between CPUs;
- Objects in the general caches, such as the mm_struct cache, are aligned to the L1 CPU cache to avoid false sharing.
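A hedged illustration of the first trick; the struct and its field names are made up for the example:

/* Hot fields that the common code path touches on every operation are kept
 * together at the front so they usually fit in one 64-byte cache line;
 * rarely used bookkeeping is pushed to the end of the struct. */
struct request {
    unsigned long state;            /* hot */
    void *buffer;                   /* hot */
    unsigned int len;               /* hot */
    unsigned int flags;             /* hot */
    /* --- cold fields below, typically landing on a different line --- */
    char debug_name[64];
    unsigned long error_count;
};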
If the CPU references an address that is not in the cache, a cache miss occurs and the data is fetched from main memory. The cost of cache misses is quite high, as a reference to cache can typically be performed in less than 10ns, whereas a reference to main memory will typically cost between 100ns and 200ns. The basic objective is then to have as many cache hits and as few cache misses as possible.
Just as some architectures do not automatically manage their TLBs, some do not automatically manage their CPU caches. The hooks are placed in locations where the virtual to physical mapping changes, such as during a page table update. The CPU cache flushes should always take place first as some CPUs require a virtual to physical mapping to exist when the virtual address is being flushed from the cache.
More Information here

Is it true that a modern OS may skip the copy when realloc is called?

While reading https://stackoverflow.com/a/3190489/196561 I have a question. The Qt authors say in Inside the Qt 4 Containers:
... QVector uses realloc() to grow by increments of 4096 bytes. This makes sense because modern operating systems don't copy the entire data when reallocating a buffer; the physical memory pages are simply reordered, and only the data on the first and last pages need to be copied.
My questions are:
1) Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OS X, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?
2) What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now (?))
3) Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs of all the CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
Update for question 3: some tests of mremap vs memcpy
Is it true that modern OSes (Linux is the most interesting to me; also FreeBSD, OS X, Windows) and their realloc implementations are really capable of reallocating pages of data by reordering the virtual-to-physical mapping, without a byte-by-byte copy?
What system call is used to achieve this memory move? (I think it could be splice with SPLICE_F_MOVE, but that was flawed and is a no-op now (?))
See thejh's answer.
Who are the actors?
You have at least three actors with your Qt example.
Qt Vector class
glibc's realloc()
Linux's mremap
QVector::capacity() shows that Qt allocates more elements than required. This means that a typical addition of an element will not realloc() anything. The glibc allocator is based on Doug Lea's allocator. This is a binning allocator which supports the use of Linux's mremap. A binning allocator groups similar-sized allocations in bins, so a typical random-sized allocation will still have some room to grow without needing to call the system. I.e., the free pool or slack is located at the end of the allocated memory.
An answer
Is it profitable to use such page shuffling instead of a byte-by-byte copy, especially in a multicore, multithreaded world, where every change of the virtual-to-physical mapping needs to flush (invalidate) the changed page table entries from the TLBs of all the CPU cores with an IPI? (In Linux this is something like flush_tlb_range or flush_tlb_page.)
First off, the "faster ... than mremap" test is mis-using mremap(), as R. notes there.
There are several things that make mremap() valuable as a primitive for realloc().
Reduced memory consumption.
Preserving page mappings.
Avoiding moving data.
Everything in this answer is based upon Linux's implementation, but the semantics can be transferred to other OS's.
Reduce memory consumption
Consider a naive realloc().
void *realloc(void *ptr, size_t size)
{
    size_t old_size = get_sz(ptr); /* From bin, address, map table, etc */
    if (size <= old_size) {
        resize(ptr);
        return ptr;
    }
    void *new_p = malloc(size);
    if (new_p) {
        memcpy(new_p, ptr, old_size); /* fully committed old_size + new size */
        free(ptr);
    }
    return new_p;
}
In order to support this, you may need double the memory of the realloc() before the system goes to swap, or the reallocation may simply fail.
Preserve page mappings
Linux will by default map new allocations to the zero page: a 4K page full of zero data. This is useful for sparsely mapped data structures. If no one writes to a data page, then no physical memory is allocated besides a possible PTE table. These pages are copy-on-write, or COW. With the naive realloc(), these mappings are not preserved, and full physical memory is allocated for all the zero pages.
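A small user-space demonstration of that behaviour, assuming Linux and an anonymous private mapping (this sketch stands in for what a freshly mapped buffer effectively gets):

#include <stddef.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL * 1024 * 1024;
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;

    char c = p[123];   /* read: satisfied by the shared zero page, no RAM committed */
    p[123] = c + 1;    /* first write: a COW fault allocates one real 4K page       */

    munmap(p, len);
    return 0;
}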
If the task is involved in a fork(), the initial realloc() data may be shared between parent and child. Again, COW will cause physical allocation of pages. The naive implementation will discount this and require separate physical memory per process.
If the system is under memory pressure, the existing realloc() pages may not be in physical memory but in swap. The naive realloc will cause disk reads of the swap page into memory, copy to the updated location, and then likely write the data out to the disk.
Avoid moving data
The issue you raise of updating TLBs is minimal compared to the data. A single TLB entry is typically 4 bytes and represents a 4K page of physical data. If you flush the entire TLB for a 4GB system, that is about 4MB of page-table data that needs to be restored (1M pages times 4 bytes per entry). Copying large amounts of data will blow the L1 and L2 caches. TLB fetches naturally pipeline better than d-cache and i-cache fetches. It is rare that code will get two TLB misses in a row, as most code is sequential.
CPU caches come in two main variants: VIVT (common on non-x86) and VIPT (as on x86). The VIVT versions usually have mechanisms to invalidate single TLB entries. For a VIPT system, the caches should not need to be invalidated, as they are physically tagged.
On multi-core systems, it is atypical to run one process on all cores. Only cores where the process performing mremap() is running need page table updates. As a process is migrated to a core (typical context switch), it needs to have its page table migrated anyway.
Conclusion
You can construct some pathological cases where a naive copy will work better. As Linux (and most OSes) is multi-tasking, more than one process will be running. Also, the worst case will be when swapping, and the naive implementation will always be worse there (unless you have a disk faster than memory). For minimal realloc() sizes, dlmalloc or QVector should have fallow space to avoid a system-level mremap(). A typical mremap() may just expand a virtual address range by growing the region with a random page from the free pool. It is only when the virtual address range must move that mremap() may need a TLB flush, and then only with all of the following being true:
The realloc() memory should not be shared with a parent or child process.
The memory should not be sparse (mostly zero or untouched).
The system should not be under memory pressure using swap.
The TLB flush and IPI need to take place only if the same process is current on other cores. L1-cache loading is not needed for the mremap(), but it is for the naive version. The L2 is usually shared between cores and will be current in all cases. The naive version will force the L2 to reload. The mremap() might leave some unused data out of the L2 cache; this is normally a good thing, but could be a disadvantage under some workloads. There are probably better ways to do this, such as pre-fetching data.
For Linux, see man 2 mremap:
void *mremap(void *old_address, size_t old_size,
             size_t new_size, int flags, ... /* void *new_address */);
mremap() expands (or shrinks) an existing memory mapping, potentially moving it at the same time (controlled by the flags argument and the available virtual address space).
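A minimal sketch of calling it directly, assuming a page-granular buffer originally obtained with mmap(); grow_buffer() is a hypothetical helper, not how glibc's realloc() is implemented:

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stddef.h>

/* Grow (or move) an mmap()-backed region without copying its contents:
 * the kernel remaps the existing physical pages into the new virtual range. */
static void *grow_buffer(void *old, size_t old_size, size_t new_size)
{
    void *p = mremap(old, old_size, new_size, MREMAP_MAYMOVE);
    return (p == MAP_FAILED) ? NULL : p;
}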

Confusion of Assembly and hardware level memory fetching, processing, segmentation, offsets, scope of memory addressing, etc

I am very perplexed, having studied assembly for some time and reviewed many great tutorials on it.
It is surprisingly difficult, I must say, to fully understand the whole scheme of its usefulness, aside from memorizing a few instructions to do some things you don't completely understand.
I seek to be an operating system developer and designer, so I have to know low-level hardware data processing, memory management, processor fetching, decoding, and memory segmentation, memory usage, bit and byte usage, call stack and hardware stacks, and the mechanics of a machine-level program from the hardware itself.
Here are my main questions I am confused about:
The processor fetches bytes from RAM. When writing a bootloader, you "jump" to an address before writing instructions. The first instruction executed after jumping to that address in memory, such as a move/data-copy instruction like MOV AL or MOV BL, retrieves data in the CPU's pipeline which is not used directly in memory. But how can the processor generate a code/data segment in its pipeline if the instruction is loaded/fetched from memory? Or do I have it all wrong here? What are the basic steps the microprocessor performs in a bootloader, and how does the CPU generate code data from a pipeline without using memory, if instructions are all supposedly fetched from memory (e.g. code segments in assembly, but data segments and text segments are all instructions for the processor)?
Also, my next main question is probably very easy to answer for some more experienced than me:
Why is memory/RAM on x86 and other architectures addressed as "segments" with offsets? To me this is more complex than it needs to be. Why can't all memory be linear: addressed, fetched, stored, computed, and moved in and out of the registers to the memory cells in a more straightforward manner? Would that not make the architecture easier to illustrate and understand, and more direct than having multitudes of registers process two-dimensional segmentations of memory-based data storage and access?
It's more than "assembly" vs "high level language".
The real issue is "Real" vs. "Protected" (virtual memory) modes.
And unfortunately, most x86 assembly examples happen to be DOS examples. Which, IMHO, have little/no relevance to contemporary 32/64 bit virtual memory architectures (including, but not limited to, x86).
Excellent primer:
Programming from the Ground Up
PS:
Address space is effectively linear, even for x86, on most modern OSes (including Windows, Linux and Mac OS). x86 segment registers are largely anachronisms from the DOS era.
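If the segmentation part of the question is about real mode specifically, the arithmetic is simple enough to show in a few lines of C (the sample values are the classic boot-sector load address):

#include <stdint.h>
#include <stdio.h>

/* In 16-bit real mode, a segment:offset pair names the linear address
 * segment * 16 + offset, so many different pairs alias the same byte. */
static uint32_t real_mode_linear(uint16_t seg, uint16_t off)
{
    return ((uint32_t)seg << 4) + off;
}

int main(void)
{
    printf("0x%05X\n", (unsigned)real_mode_linear(0x07C0, 0x0000)); /* 0x07C00 */
    printf("0x%05X\n", (unsigned)real_mode_linear(0x0000, 0x7C00)); /* 0x07C00 */
    return 0;
}

In protected/long mode with paging, the segment bases used by mainstream OSes are zero, which is why the address space above is described as effectively linear.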
If you're interested, here's a good overview of the Linux boot process:
http://www.ibm.com/developerworks/library/l-linuxboot/index.html

Resources