The official documentation for cache maintenance operations details the various data cache operations that can be done to a single cache line:
cleaning,
invalidating,
and zeroing,
and the ways in which they can be done:
by set/way,
by virtual address to PoC (point of coherence),
and by virtual address to PoU (point of unification).
The previous page explains what PoC and PoU mean. However, as I was reading the documentation for DC, there seem to be noticeably more options. From what I can tell, the extra operations are:
set allocation tag,
clean allocation tag,
and invalidate allocation tag,
and can be mixed with the usual data operations (e.g. cleaning both data and allocation tags by set/way would be dc cgdsw). I also noticed extra "points" to which these operations can be done:
by virtual address to PoP (point of persistence),
by virtual address to PoDP (point of deep persistence),
and by physical address to PoPA (point of physical aliasing).
However, it is not clear what exactly these points are compared to PoC and PoU, and what purpose they serve. In other words, when would __VAP, __VADP, and __PAPA be used?
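For what it's worth, the PoP and PoDP appear aimed at persistent (non-volatile) memory: a clean to the PoP should guarantee the data has reached the persistence domain and will survive a power failure, with the PoDP covering a deeper failure domain, while the PoPA seems to matter when the same physical location can be reached through multiple physical address spaces (e.g. with the Realm Management Extension). A hedged sketch of the persistent-memory case, assuming an ARMv8.2-A target with FEAT_DCPoP, GCC-style inline assembly, and a fixed 64-byte line size (a real implementation would read the line size from CTR_EL0; all names here are illustrative):

    #include <stdint.h>

    /* Clean one line by VA to the Point of Persistence (DC CVAP). */
    static inline void clean_to_pop(const void *p) {
        __asm__ volatile("dc cvap, %0" : : "r"(p) : "memory");
    }

    /* Make [buf, buf+len) durable: clean every line, then barrier. */
    void persist(const void *buf, uint64_t len) {
        const uintptr_t line = 64;   /* assumed; read CTR_EL0 in practice */
        uintptr_t p = (uintptr_t)buf & ~(line - 1);
        for (; p < (uintptr_t)buf + len; p += line)
            clean_to_pop((const void *)p);
        __asm__ volatile("dsb sy" ::: "memory");   /* order the cleans */
    }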
I would like to build a memory allocator to manage a large chunk of non-shared, linear address space (e.g. 32GB). I need to run this on the Big-3 operating systems, but I can live with having OS-specific code.
Functionality desired:
At start, acquire a 2^N byte contiguous region of virtual address space aligned on a 2^N byte boundary
Initially, the entire region should be inaccessible (i.e. touching any address within this region should result in that OS's equivalent of a SEGV)
Over time, from the base of this region (the lowest address), I will grow the portion that should be made accessible (i.e. allocate backing store, provision PTEs)
It is acceptable, while running, to encounter a failure due to backing store exhaustion
Bonus points:
It is desirable that, initially, no backing store be reserved
Unprovisioning the entire region without releasing the address space
Encourage use of larger pages and TLB entries
Questions:
I know that I can implement this scheme using mmap on Linux. Is it doable on Win64 and MacOS?
If yes, what primitives / system calls should I be looking at?
How can I best achieve 2^N byte alignment when initially acquiring address space?
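On Linux and macOS, a sketch along these lines should work (hedged: error handling is abbreviated, and reserve_aligned/commit are illustrative names, not standard APIs). The alignment trick is to over-reserve by 2x with PROT_NONE and trim the unaligned head and tail:

    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>   /* MAP_ANONYMOUS may be MAP_ANON on older macOS SDKs */

    /* Reserve size bytes of address space aligned to size (size must be a
       power of two). PROT_NONE makes any touch fault until commit() runs. */
    void *reserve_aligned(size_t size) {
        /* Over-reserve by 2x so an aligned sub-range must exist inside. */
        char *raw = mmap(NULL, 2 * size, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (raw == MAP_FAILED)
            return NULL;
        char *base = (char *)(((uintptr_t)raw + size - 1) & ~(uintptr_t)(size - 1));
        char *end  = base + size;
        if (base > raw)                       /* trim unaligned head */
            munmap(raw, base - raw);
        if (raw + 2 * size > end)             /* trim unaligned tail */
            munmap(end, raw + 2 * size - end);
        return base;
    }

    /* Grow the accessible prefix: make [base, base+len) usable. */
    int commit(void *base, size_t len) {
        return mprotect(base, len, PROT_READ | PROT_WRITE);
    }

On Win64 the analogous primitives are VirtualAlloc with MEM_RESERVE to take the address space (no backing store is charged) and MEM_COMMIT later to provision it; newer systems also have VirtualAlloc2, which, if available, can take explicit address-alignment requirements instead of the over-reserve-and-trim trick. On Linux, the PROT_NONE anonymous reservation consumes no backing store until pages are committed and touched, which covers the first bonus point.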
This is my first post. I want to ask how virtual memory relates to paging and segmentation. I have been searching the internet for a few days, but still can't manage to put the information into the right order. Here is what I know so far:
We can talk about addresses at several levels of memory abstraction:
physical level (the CPU talking to the memory controller: "hey, give me the contents of address 0xFFEABCD"). These addresses are addresses of cells in RAM, so cell 0xABCD has physical address 0xABCD. The memory controller can only use physical addresses, so if an address is not physical it must be translated to a physical one.
logical level. This is an abstraction over physical addresses. Here, when a process asks for memory (assume a successful allocation), it is given an address which has no direct relation to cells in RAM. We can say these addresses come from a different pool than physical addresses. As I said before, the memory controller only understands physical addresses, so to use logical addresses we need to convert them to physical addresses. There are two ways for the OS to create logical addresses:
paging - physical memory (RAM) is divided into contiguous blocks of memory called frames, and logical memory is divided into blocks of the same length called pages. The OS keeps in RAM a data structure called the page table. It is essentially an associative array (map) whose primary purpose is to translate logical-level addresses to physical-level addresses. Paging has the following effect: the memory a process has allocated (the frames in physical memory belonging to the program) need not be contiguous, so there may be holes in between. (A small sketch of both translation schemes follows the fragmentation notes below.)
segmentation - the program is divided into parts called segments. Segment sizes are not fixed, so different segments may have different sizes. Each segment gets its own place in RAM, so one segment (call it segmentA) and another (call it segmentB) need not be near each other; segmentA does not have to have segmentB as a neighbour.
internal fragmentation - when memory which belongs to a process isn't 100% used. If a process wants 2 bytes for its own use, the OS must allocate a page (or pages) whose total size is greater than or equal to the amount requested. The typical page size is 4KB, and the page is the unit in which the OS hands memory to a process, so it can't give less than 4KB. If we use only 2 bytes, 4KB - 2B = 4094 bytes are wasted (the memory is associated with our process, so other processes can't use it; only we can, but we only needed 2B).
external fragmentation - when allocated blocks of memory sit near one another but with small holes between them. The holes are free, so other programs could use them, but it is unlikely because they are very small. Such holes will, with high probability, be wasted: more holes means more wasted memory.
Paging may cause internal fragmentation. Segmentation may cause external fragmentation.
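To make the two translation schemes concrete, here is a small illustrative C sketch (not from any real OS; a single-level page table and made-up structures):

    #define PAGE_SIZE 4096u

    /* Paging: fixed-size blocks; the page table maps page number -> frame. */
    unsigned page_translate(const unsigned *page_table, unsigned laddr) {
        unsigned page   = laddr / PAGE_SIZE;   /* which page      */
        unsigned offset = laddr % PAGE_SIZE;   /* where inside it */
        return page_table[page] * PAGE_SIZE + offset;
    }

    /* Segmentation: variable-size blocks; each segment has a base and a
       limit, and exceeding the limit is the classic segfault. */
    struct segment { unsigned base, limit; };

    int seg_translate(const struct segment *table, unsigned seg,
                      unsigned offset, unsigned *paddr) {
        if (offset >= table[seg].limit)
            return -1;                         /* access violation */
        *paddr = table[seg].base + offset;
        return 0;
    }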
virtual level - addresses used in virtual memory. This is an extension of the logical memory level. Now a program doesn't even need to have all of its allocated pages in RAM to start execution. It can be implemented with the following techniques:
paged segmentation - method in which segments are divided into pages.
segmented paging - a less used method, but also possible.
Combining them takes the positive aspects from both solutions.
What I have read about the pros and cons of virtual memory:
PROS:
processes have their own address spaces, which means that if we have two processes A and B, and both of them have a pointer to address e.g. 17, process A's pointer will point to a different frame than the pointer in process B. This results in greater process isolation: processes are protected from each other (one process can't touch another process's memory, unless it is shared memory, because no such entry exists in its mapping), and the OS is more protected from the processes.
you can have more memory than your physical first-level memory (RAM), thanks to swapping to secondary storage.
better use of memory due to:
swapping unused parts of programs to secondary memory.
making page sharing possible, which also enables "copy on write".
improved multiprogramming capability (when unneeded parts of programs are swapped out to secondary memory, they free up space in RAM which can be used for new processes).
improved CPU utilisation (if you can have more processes loaded into memory, there is a bigger probability that some program currently needs CPU work rather than IO work; in such cases you can utilise the CPU better).
CONS:
virtual memory has its overhead, because we need to access memory twice (though a lot of improvement can be achieved using TLB buffers).
it makes the memory-management part of the OS more complicated.
So here we come to the parts which I don't really understand:
Why do some sources describe logical addresses and virtual addresses as synonyms? Am I getting something wrong?
Does virtual memory really provide protection for processes? I mean, segmentation also checked that a process does not access other memory (resulting in a segfault if it does), and paging also has protection bits in the page table, so doesn't the protection come simply from extending the abstraction of logical-level addresses? If VM (Virtual Memory) brings extended protection features, what are they and how do they work? In other words: does creating a separate address space for each process bring extended memory protection, and if so, why can't the same be achieved with paging without VM?
How does paged segmentation really differ from segmented paging? I know the difference lies in how an address is constructed (a page number, a segment number, and so on), but I suppose that isn't enough to justify two strategies. I read that segmented paging is less flexible, and that's the reason why it is rarely used. But why is it less flexible? Is the reason that a program can have only a few segments instead of many pages? If that's the case, paging indeed allows better "granularity".
If VM makes a separate address space for each process, does that mean paging without VM uses logical addresses from "one pool" (is every logical address globally unique in that case)?
Any help on that topic would be appreciated.
Edit: #1
OK. I finally understood that paging which is not on-demand is also virtual memory. Some clarification helped me understand the topic. Below is a link to an image I made to visualize the differences. Thanks for the help.
differences between paging, demand paging and swapping
Why do some sources describe logical addresses and virtual addresses as synonyms? Am I getting something wrong?
Many sources conflate logical and virtual memory translation. In ye olde days, logical address translation never took place without virtual address translation so processor documentation referred to them as the same.
Now we have large memory systems that use logical memory translation without virtual memory.
Does virtual memory really provide protection for processes?
It is the logical memory translation that implements page protections.
How does paged segmentation really differ from segmented paging?
You can really ignore segments. No rationally designed processor architecture created after 1970 uses segments, and they are finally dying out.
If VM makes a separate address space for each process, does that mean paging without VM uses logical addresses from "one pool"?
It is logical memory that creates the separate address space for each process. Paging is virtual memory. You cannot have one without the other.
When paging is enabled, some hardware is responsible for translating virtual memory addresses into physical addresses. Known translations are usually kept in some sort of cache, the translation lookaside buffer (TLB).
Assuming a memory access where the address translation is cached, is it any slower than directly accessing memory without paging enabled?
I'm wondering about the overhead of that translation, even when it's cached, since the access to that cache will probably also take some (although very short) time. Or is that time planned as part of the clock cycle?
(To make it clear, my question is not about page faults or cache misses of the TLB)
Like everything in life, it depends! :-)
Let's assume, for the sake of simplicity, that (a) we're talking about data rather than instructions (b) all data memory accesses hit the level 1 cache (c) the level 1 data cache is a typical set associative cache.
Each block of the data cache must be identified with an address (less the offset). If the cache uses virtual addresses then no translation need take place and there is no overhead. If the cache uses physical addresses then the address must be translated prior to the data access adding additional latency to the request. Even for a small TLB, I don't think a high performance processor could both translate the address and then complete the cache request within the same cycle. So it's fair to assume that a physically addressed cache does indeed have overhead of address translation.
So virtually addressed caches sound like the better deal, right? Unfortunately it's a double-edged sword. The problem is that virtual memory often allows multiple virtual addresses to map to the same physical address. If in our cache there are two virtual addresses that map to a single physical address, modifying one will not be reflected in the other.
So, there is an option between these two extremes. Still assuming a set associative cache, we can use the virtual address as the index while simultaneously translating the address to a physical one. Afterwards, we use the physical address as the tag to access the data. This is commonly called a VIPT (virtually indexed, physically tagged) cache. This way we take the TLB translation off the critical path, achieving performance similar to the virtually addressed cache. It also allows us to avoid the virtual/physical aliasing problem, although it often needs a little extra help from the operating system.
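As a hedged worked example (assumed, but typical, parameters): with a 32 KiB, 8-way set-associative cache and 64-byte lines, there are 32768 / (8 × 64) = 64 sets, so 6 index bits plus 6 offset bits = 12 bits, which is exactly the offset within a 4 KiB page. The index then comes entirely from address bits that paging never changes, so the virtual and physical indices are identical and no OS help is needed:

    #include <stdio.h>

    int main(void) {
        unsigned cache = 32 * 1024, ways = 8, line = 64, page = 4096;
        unsigned sets = cache / (ways * line);                    /* 64 */
        /* __builtin_ctz (GCC/Clang) gives log2 of a power of two. */
        unsigned index_offset_bits = __builtin_ctz(sets * line);  /* 12 */
        unsigned page_offset_bits  = __builtin_ctz(page);         /* 12 */
        /* If the cache index fits inside the page offset, a VIPT cache
           can be indexed before translation with no aliasing issues. */
        printf("index+offset = %u bits, page offset = %u bits\n",
               index_offset_bits, page_offset_bits);
        return 0;
    }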
So, you can see that it can be the same or slower, depending on how the cache is configured.
I've got a question about virtual memory management, more specifically, the address translation.
When an application runs, the CPU receives instructions containing virtual memory addresses, and translates them into physical addresses via the page table.
My question is: since the page table also resides in a memory block, does that mean the CPU has to access memory twice for a single memory-access instruction? If the answer is no, then how does this actually work? Which part did I miss?
Could anyone give me some details about this?
As usual, the answer is neither yes nor no.
Worst case you have to do a walk of the page table, which is indeed stored in (some kind of) memory; this is not necessarily only one lookup, it can be multiple lookups - see for example a two-level table (example from Wikipedia).
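To illustrate why the worst case means multiple lookups, here is a hedged sketch of a two-level walk (hypothetical 32-bit layout: 10-bit first-level index, 10-bit second-level index, 12-bit page offset; the entry format is made up):

    #include <stdint.h>

    #define PRESENT 0x1u

    /* Each step below is itself a read from memory - these are exactly
       the extra accesses the TLB exists to avoid. */
    uint32_t walk(const uint32_t *l1, uint32_t vaddr) {
        uint32_t l1e = l1[vaddr >> 22];                  /* memory read #1 */
        if (!(l1e & PRESENT))
            return 0;                                    /* page fault */
        const uint32_t *l2 =
            (const uint32_t *)(uintptr_t)(l1e & ~0xFFFu);
        uint32_t l2e = l2[(vaddr >> 12) & 0x3FFu];       /* memory read #2 */
        if (!(l2e & PRESENT))
            return 0;                                    /* page fault */
        return (l2e & ~0xFFFu) | (vaddr & 0xFFFu);       /* frame | offset */
    }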
However, this page table is typically accompanied by a hardware assist called the translation lookaside buffer, which is essentially a cache for the page table. It works just as you would expect a cache to work: if a lookup succeeds, you happily continue with the physical fetch; if it fails, you proceed to the aforementioned page walk and update the cache afterwards.
This hardware assist is usually implemented as a CAM (Content Addressable Memory), something that is mostly used in network processing but is also very useful here. It is a memory component that does the lookup not by address but by 'content', or any generic key (the keys don't have to be contiguous, incrementing numbers). In this case the key is your virtual address, and the result of the lookup is your physical address. As this CAM is a separate component and very fast, you can say that as long as you hit it, you don't incur any extra memory overhead for virtual-to-physical address translation.
You could ask why they don't put the whole page table in a CAM. Quite simply, CAMs are both quite expensive and, more importantly, quite power-hungry, so you don't want to make them too big (we wouldn't want a laptop that requires 1 kW to run, do we?).
Sometimes.
The MMU contains a cache of virtual to physical address mapping, called a TLB (Translation Lookaside Buffer).
If the page in question is not in the TLB (a TLB miss), then it needs to load the relevant piece of the page table from main memory into that cache first, which requires additional memory accesses.
Finally, if the page cannot be found at all, a trap is raised (a page fault), and the OS has an opportunity to fix this - e.g. allocate memory, load the page from a file or from swap space, and similar.
The details of how this is done vary between architectures: on some, a TLB miss also requires software to fill the TLB, though on most this is automatic. (But the TLB has to be flushed on a context switch, and a new page table loaded for, e.g., a new process.)
More info e.g. here https://www.kernel.org/doc/gorman/html/understand/understand006.html
I'm implementing IPC between two processes on the same machine (Linux x86_64 shmget and friends), and I'm trying to maximize the throughput of the data between the processes: for example I have restricted the two processes to only run on the same CPU, so as to take advantage of hardware caching.
My question is, does it matter where in the virtual address space each process puts the shared object? For example would it be advantageous to map the object to the same location in both processes? Why or why not?
It doesn't matter as far as the OS is concerned. It would only have been advantageous to use the same base address in both processes if the TLB weren't flushed between context switches. The Translation Lookaside Buffer (TLB) is a small cache of virtual-to-physical address translations for individual pages, used to reduce the number of expensive memory reads from the process page table. Whenever a context switch occurs, the TLB is flushed - you don't want processes to be able to read a small portion of another process's memory just because its page table entries are still cached in the TLB.
A context switch does not occur between processes running on different cores. But then each core has its own TLB, and its contents are completely uncorrelated with the contents of the other core's TLB. A TLB flush does not occur when switching between threads of the same process, but threads share their whole virtual address space anyway.
It only makes sense to attach the shared memory segment at the same virtual address in both processes if you pass around absolute pointers to areas inside it. Imagine, for example, a linked-list structure in shared memory. The usual practice is to use offsets from the beginning of the block instead of absolute pointers, but this is slower as it involves additional pointer arithmetic. That's why you might get better performance with absolute pointers; but finding a suitable place in the virtual address space of both processes might not be an easy task (at least not in a portable way), even on platforms with vast VA spaces like x86-64.
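For illustration, the offset-based technique looks roughly like this (hypothetical names; a real implementation would add bounds checks):

    #include <stddef.h>
    #include <stdint.h>

    typedef uint64_t shm_off;          /* offset from segment base; 0 = null */

    struct node {
        int     value;
        shm_off next;                  /* instead of struct node *next */
    };

    /* The extra arithmetic mentioned above: every dereference pays one add. */
    static inline void *off2ptr(void *base, shm_off off) {
        return off ? (char *)base + off : NULL;
    }

    static inline shm_off ptr2off(void *base, void *p) {
        return p ? (shm_off)((char *)p - (char *)base) : 0;
    }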
I'm not an expert here, but seeing as there are no other answers, I will give it a go. I don't think it will really make a difference, because the virtual address does not necessarily correspond to the physical address. Said another way, the underlying physical address the OS maps your virtual address to does not depend on the virtual address the OS gives you.
Again, I'm not a memory master. Sorry if I am way off here.