I'm doing research on how memory is managed in RTEMS on an ARM-based Xilinx Zynq. The program runs on two cores with SMP.
Having read about memory barriers and the out-of-order execution paradigm, I concluded that a barrier or a fence is a hardware mechanism rather than a software one.
RAM is divided into several sections; however, there are some sections called barriers that share address ranges with other sections. I attach a capture.
xbarrier starts where the next section begins and ends where the previous section ends. Another example:
In this one, the barrier starts at the same address as the previous section and ends before the next section starts.
Are these memory sections related to barrier instructions? Why are these memory sections defined?
Thanks in advance,
Googling "section .rwbarrier" will get you to https://lists.rtems.org/pipermail/users/2015-May/028893.html, which says:
This section helps to protect the code and read-only sections from write access via the MMU.
It looks like this is not related to barrier instructions at all. Could it just be a memory section named this way to separate a region which is read-write from a region which is read-only (vector)?
Barrier instructions are used to enforce ordering in a multiprocessor system; they are never tied to an address. A barrier instruction splits the visibility (to other CPUs or threads) between:
Load and store instructions before the barrier
Load and store instructions after the barrier.
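For illustration, here is a minimal producer/consumer sketch for ARMv7 (the Zynq's Cortex-A9); the variable names are made up, and "dmb sy" is just the generic data memory barrier:

volatile int data;   /* hypothetical shared payload */
volatile int flag;   /* hypothetical "data is ready" flag */

void producer(void)
{
    data = 42;                                /* store before the barrier */
    __asm__ volatile("dmb sy" ::: "memory");  /* order it before the flag */
    flag = 1;                                 /* store after the barrier */
}

void consumer(void)
{
    while (flag == 0)
        ;                                     /* spin until the flag is seen */
    __asm__ volatile("dmb sy" ::: "memory");  /* order the loads as well */
    int value = data;                         /* now guaranteed to see 42 */
    (void)value;
}

The barrier only constrains the order in which the two stores become visible to the other core; it says nothing about any particular address, which is why the linker sections above are unrelated to it.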
On x86/x64, non-temporal store instructions such as MOVNTI and MOVNTPS make weaker memory ordering guarantees than "regular" stores. I understand fences (e.g. SFENCE) are necessary when sharing memory that will be written to non-temporally across threads. However, are fence instructions ever necessary for thread-local memory? If I write to a location via MOVNTPS, is the write guaranteed to be visible to subsequent instructions in the same thread without any fence instruction?
Yes, they will be visible without fences. See section 8.2.2 Memory Ordering in P6 and More Recent Processor Families in the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide, Part 1, which says, among other things:
for memory regions defined as write-back cacheable, [...]
Reads may be reordered with older writes to different locations but
not with older writes to the same location.
and
Writes to memory are not reordered with other writes, with the
following exceptions:
-- streaming stores (writes) executed with the non-temporal move instructions
(MOVNTI, MOVNTQ, MOVNTDQ, MOVNTPS, and MOVNTPD);
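To make that concrete, here is a small single-threaded sketch using the SSE2 intrinsics (the buffer and values are made up); the load right after the streaming store needs no fence, while an SFENCE would only matter before publishing the data to another thread:

#include <immintrin.h>   /* _mm_stream_si32(), _mm_sfence() */
#include <stdio.h>

int main(void)
{
    int buf[16];

    _mm_stream_si32(&buf[0], 42);   /* non-temporal store (MOVNTI) */

    /* Same thread: this load observes 42 even without any fence, since
       reads are not reordered with older writes to the same location. */
    printf("%d\n", buf[0]);

    /* Only needed before another thread is allowed to observe buf,
       e.g. before setting a shared "ready" flag. */
    _mm_sfence();
    return 0;
}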
I am quite perplexed, having studied assembly for some time and reviewed many great tutorials on it.
It is, I must say, surprisingly difficult to fully understand the whole scheme of its usefulness beyond memorizing a few instructions to do things you don't completely understand.
I seek to be an operating system developer and designer, so I need to know low-level hardware data processing, memory management, processor fetching and decoding, memory segmentation and usage, bit and byte handling, the call stack and hardware stacks, and the mechanics of a machine-level program on the hardware itself.
Here are my main questions I am confused about:
The processor fetches bytes from RAM. When writing a bootloader you "jump" to an address before writing instructions. The first instruction executed after jumping to that address, such as a move/data-copy instruction of the MOV AL, BL kind, operates on data in the CPU's pipeline that is not taken directly from memory. But how can the processor produce code or data in its pipeline if the instruction itself is loaded/fetched from memory? Or do I have it all wrong here? What are the basic steps the microprocessor performs in a bootloader, and how does the CPU generate code/data in the pipeline without using memory, if instructions are supposedly all fetched from memory (e.g. code segments in assembly, but data segments and text segments are all instructions for the processor)?
Also, my next main question is probably very easy to answer for some more experienced than me:
Why is memory/RAM on x86 and other architectures addressed as "segments" with offsets? To me this is more complex than it needs to be. Why can't all memory be linear: addressed, fetched, stored, computed on, and moved in and out of the registers in a more straightforward manner? Wouldn't that make the architecture easier to illustrate and understand, and more direct than having multitudes of registers work with a two-dimensional, segmented view of memory?
It's more than "assembly" vs "high level language".
The real issue is "Real" vs. "Protected" (virtual memory) modes.
And unfortunately, most x86 assembly examples happen to be DOS examples, which, IMHO, have little/no relevance to contemporary 32/64-bit virtual memory architectures (including, but not limited to, x86).
Excellent primer:
Programming from the Ground Up
PS:
Address space is effectively linear, even for x86, on most modern OSes (including Windows, Linux and Mac OS). x86 segment registers are largely anachronisms from the DOS era.
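To see why the segment registers are a real-mode leftover, here is a rough sketch of the DOS-era address calculation (the helper name is made up); in 16-bit real mode a physical address is just segment * 16 + offset, whereas in 32/64-bit protected/long mode with a flat memory model the segment bases are effectively zero and addresses are linear:

#include <stdio.h>
#include <stdint.h>

/* Real-mode (16-bit) translation: a 16-bit segment and a 16-bit offset
   form a 20-bit physical address. */
static uint32_t real_mode_linear(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;   /* segment * 16 + offset */
}

int main(void)
{
    /* 07C0:0000 and 0000:7C00 name the same byte - the classic
       BIOS boot-sector load address 0x7C00. */
    printf("0x%05X\n", real_mode_linear(0x07C0, 0x0000));
    printf("0x%05X\n", real_mode_linear(0x0000, 0x7C00));
    return 0;
}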
If you're interested, here's a good overview of the Linux boot process:
http://www.ibm.com/developerworks/library/l-linuxboot/index.html
I have a sequential user-space program (some kind of memory-intensive search data structure). The program's performance, measured as the number of CPU cycles, depends on the memory layout of the underlying data structures and on the data cache size (LLC).
So far my user-space program has been tuned to death; now I am wondering if I can get a performance gain by moving the user-space code into the kernel (as a kernel module). I can think of the following factors that might improve performance in kernel space ...
No system call overhead (how many CPU cycles are gained per avoided system call?). This is less critical, as I barely use any system calls in my program except for allocating memory, and that only when the program starts.
Control over scheduling: I can create a kernel thread and make it run on a given core without being migrated away.
I can allocate memory with kmalloc and thus have more control over the allocated memory, and perhaps also control cache coloring more precisely by controlling where memory is allocated. Is it worth trying?
My questions to the kernel experts...
Have I missed any factors in the above list that can improve performance further?
Is it worth trying, or is it known outright that I will NOT get much of a performance improvement?
If a performance gain is possible in the kernel, is there any estimate of how large it could be (any theoretical guess)?
Thanks.
Regarding point 1: kernel threads can still be preempted, so unless you're making lots of syscalls (which you aren't) this won't buy you much.
Regarding point 2: you can pin a thread to a specific core by setting its affinity, using sched_setaffinity() on Linux.
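A minimal user-space sketch of that pinning (the helper name and core number are made up):

#define _GNU_SOURCE          /* for CPU_ZERO/CPU_SET and sched_setaffinity() */
#include <sched.h>

/* Pin the calling thread to one core; pid 0 means "this thread". */
int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}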
Regarding point 3: What extra control are you expecting? You can already allocate page-aligned memory from user space using mmap(). This already lets you control for the cache's set associativity, and you can use inline assembly or compiler intrinsics for any manual prefetching hints or non-temporal writes. The main difference between memory allocated in the kernel and in user space is that kmalloc() allocates wired (non-pageable) memory. I don't see how this would help.
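For example, a page-aligned anonymous mapping from user space (a rough sketch; huge pages via MAP_HUGETLB or madvise(MADV_HUGEPAGE) are optional extras):

#define _DEFAULT_SOURCE      /* for MAP_ANONYMOUS */
#include <sys/mman.h>
#include <stddef.h>

/* Returns page-aligned memory, or NULL on failure. */
void *alloc_pages(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return (p == MAP_FAILED) ? NULL : p;
}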
I suspect you'll see a much better ROI from parallelising with SIMD or multithreading, or from making further algorithmic or memory optimisations.
Create a dedicated cpuset for your program and move all other processes out of it. Then bump your process' priority to realtime with FIFO scheduling policy using something like:
#include <sched.h>   /* struct sched_param, SCHED_FIFO, sched_setscheduler() */

struct sched_param schedparams;
// Be portable - don't just set priority to 99 :)
schedparams.sched_priority = sched_get_priority_max(SCHED_FIFO);
sched_setscheduler(0, SCHED_FIFO, &schedparams);
Don't do that on a single-core system!
Reserve large enough stack space with alloca(3) and touch all of the allocated stack memory; map more than enough heap space, and then use mlock(2) or mlockall(2) to pin the process memory.
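Roughly like this (the stack-reserve size is an arbitrary placeholder):

#include <sys/mman.h>   /* mlockall() */
#include <string.h>
#include <alloca.h>

#define STACK_RESERVE (512 * 1024)   /* made-up worst-case stack usage */

static void prefault_stack(void)
{
    volatile char *p = alloca(STACK_RESERVE);
    memset((void *)p, 0, STACK_RESERVE);   /* touch every stack page */
}

int main(void)
{
    prefault_stack();
    mlockall(MCL_CURRENT | MCL_FUTURE);    /* keep everything resident */
    /* ... run the tuned workload ... */
    return 0;
}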
Even if your program is a sequential one, if run on a multisocket Nehalem or post-Nehalem Intel system or an AMD64 system, NUMA effects can slow your program down. Use API functions from numa(3) to allocate and keep memory as close to the NUMA node where your program executes as possible.
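Assuming libnuma is installed (link with -lnuma), node-local allocation can be as simple as this sketch:

#include <numa.h>       /* numa_available(), numa_alloc_local() */
#include <stddef.h>

/* Allocate memory on the NUMA node of the calling CPU, or NULL on failure. */
void *alloc_near_me(size_t bytes)
{
    if (numa_available() < 0)
        return NULL;    /* kernel/hardware without NUMA support */
    return numa_alloc_local(bytes);
}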
Try other compilers - some of them might optimise better than the compiler you are currently using. Intel's compiler, for example, is very aggressive in laying out instructions so as to benefit from out-of-order execution, pipelining and branch prediction.
I read that the first 3 GB are reserved for the process and the last GB is for the kernel. I also read that the kernel is loaded starting from the 2nd MB of the physical address space (depending on the configuration). My question is: is the mapping of that last 1 GB the same for all processes, and does it map to this physical area of memory?
Another question: when a process switches to kernel mode (e.g., when a system call occurs), which page tables are used, the process page tables or the kernel page tables? If kernel page tables are used, then the kernel can't access the memory locations belonging to the process. If that is the case, then there is apparently no use for the kernel virtual memory, since all access to kernel code and data would go through the mapping of the last 1 GB of the process address space. Please help me clarify this (any useful links will be much appreciated).
It seems you are talking about 32-bit x86 systems, right?
If I am not mistaken, the kernel can be configured not only for the 3GB/1GB memory split; there are other variants (e.g. 2GB/2GB). Still, 3GB/1GB is probably the most common one on x86-32.
The kernel part of the address space should be inaccessible from user space. From the kernel's point of view, yes, the mapping of the memory occupied by the kernel itself is always the same, no matter in the context of which process (or interrupt handler, or whatever else) the kernel currently operates.
As one of the consequences, if you look at the addresses of kernel symbols in /proc/kallsyms from different processes, you will see the same addresses each time. And these are exactly the addresses of the respective kernel functions, variables and others from the kernel's point of view.
So I suppose the answer to your first question is "yes", but it is probably not very useful for user-space code, as kernel-space memory is not directly accessible from there anyway.
As for the second question: if the kernel currently operates in the context of some process, it can actually access the user-space memory of that process. I can't describe it in detail, but the implementation of the kernel functions copy_from_user and copy_to_user could give you some hints. See arch/x86/lib/usercopy_32.c and arch/x86/include/asm/uaccess.h in the kernel sources. It seems that, on x86-32, the user-space memory is accessed in these functions directly, using the default memory mappings for the current process context. The 'magic' stuff there is only related to optimizations and to checking the address of the memory area for correctness.
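A minimal kernel-side sketch of that idea (a hypothetical handler, not code from the kernel tree): running in process context, the user pointer is resolved through the current process's own mappings:

#include <linux/uaccess.h>   /* copy_from_user() */
#include <linux/errno.h>

/* Hypothetical helper called in process context (e.g. from a syscall or
   ioctl handler). 'ubuf' only makes sense in the current process's
   address space, which is exactly the mapping in effect right now. */
static long read_user_word(const int __user *ubuf, int *out)
{
    int tmp;

    if (copy_from_user(&tmp, ubuf, sizeof(tmp)))
        return -EFAULT;      /* copy_from_user() returns bytes NOT copied */
    *out = tmp;
    return 0;
}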
Yes, the mapping of the kernel part of the address space is the same in all processes. Part of it does map that part of the physical memory where the kernel image is loaded, but that's not the bulk of it - the remainder is used to map other physical memory locations for the kernel's runtime working set.
When a process switches to kernel mode, the page tables are not changed. The kernel part of the address space simply becomes accessible because the CPL (Current Privilege Level) is now zero.
I recently downloaded the Linux source from http://www.kernel.org/pub/linux/kernel/v2.6/linux-2.6.34.1.tar.bz2. I came across the paragraph below in the file spinlocks.txt in the linux-2.6.34.1/Documentation folder.
" it does mean that if you have some code that does
cli();
.. critical section ..
sti();
and another sequence that does
spin_lock_irqsave(flags);
.. critical section ..
spin_unlock_irqrestore(flags);
then they are NOT mutually exclusive, and the critical regions can happen
at the same time on two different CPU's. That's fine per se, but the
critical regions had better be critical for different things (ie they
can't stomp on each other). "
How can they interfere with each other if some code uses cli()/sti() and another part of the same code uses spin_lock_irqsave(flags)/spin_unlock_irqrestore(flags)?
The key part here is "on two different CPUs". Some background:
Historically on uni-processor (UP) systems the only source of concurrency was hardware interrupts. It was enough to cli/sti around the critical section to prevent an IRQ handler from messing things up.
Then there was the giant-lock design, where the kernel would effectively run on a single CPU and only one process could be in the kernel at a time (that's what the giant lock was for). Again, disabling interrupts was enough to protect the kernel from itself.
On full SMP systems, where multiple threads can be active in the kernel at the same time and interrupts can be delivered to pretty much any CPU, it's no longer enough to only disable interrupts on a single processor, or to only grab a single lock. Both are required: disabling interrupts protects against the IRQ handler on the same CPU, while holding a lock protects against other threads entering the same critical section on a different CPU. This is exactly why spin_lock_irqsave() and spin_unlock_irqrestore() were invented.
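The usual pattern looks roughly like this (names are made up): process context disables local interrupts and takes the lock, while the IRQ handler, which already runs with interrupts off on its CPU, just takes the lock to keep the other CPUs out:

#include <linux/spinlock.h>
#include <linux/interrupt.h>

static DEFINE_SPINLOCK(buf_lock);   /* hypothetical lock for shared state */
static int shared_count;

/* Process context: protect against the IRQ handler on this CPU
   (irqsave) and against the other CPUs (the lock itself). */
void update_from_process_context(void)
{
    unsigned long flags;

    spin_lock_irqsave(&buf_lock, flags);
    shared_count++;
    spin_unlock_irqrestore(&buf_lock, flags);
}

/* IRQ handler: local interrupts are already disabled here, but the
   lock is still needed to exclude the other CPUs. */
static irqreturn_t my_irq_handler(int irq, void *dev)
{
    spin_lock(&buf_lock);
    shared_count++;
    spin_unlock(&buf_lock);
    return IRQ_HANDLED;
}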