Flush a cache line from user mode on ARMv7(rpi2) - caching

I am using the following code to flush a cache line on a raspberry pi 2:
static inline void flush(void addr)
{
asm volatile("mcr p15, 0, %0, c7, c6, 1"::"r"(addr));
}
I am getting an error that this is a privileged instruction when I run this. Is this code correct? Is there any way to flush the cache line from user space on this machine? On x86 clflush works without any modification.

Is this code correct?
As a matter of fact, no. That's some bogus non-existent system register encoding - cache maintenance operations live in the c7 space, not c12.
What's more incorrect, though, is the assumption that you can do this. Prior to ARMv8, all cache maintenance operations can only be executed in privileged modes. From userspace, you'd need support from the OS to allow you to request it; Linux, for example, has an ARM-specific syscall which GCC provides an interface to via __clear_cache() - there might be some permission-related caveats, although I don't see any reference to VMA permissions in the current mainline kernel code, so maybe it was a quirk of older kernels.
Either way, the only cache maintenance concern which really applies to userspace code is coherency between the instruction and data caches, to cater for JITs or self-modifying code. Things like data cache coherency with main memory should never be relevant to userspace code (which would normally be calling into driver code within the OS in situations where such things did matter), and on many systems require separate outer cache maintenance which only the OS is in a position to manage anyway.

Related

What is a TRAMPOLINE_ADDR for ARM and ARM64(aarch64)?

I am writing a basic check-pointing mechanism for ARM64 using PTrace in order to do so I am using some code from cryopid and I found a TRAMPOLINE_ADDR macro like the following:
#define TRAMPOLINE_ADDR 0x00800000 /* 8MB mark */ for x86
#define TRAMPOLINE_ADDR 0x00300000 /* 3MB mark */ for x86_64
So when I read about trampolines it is something related to jump statements. But my questions is from where the above values came and what would the corresponding values for the ARM and ARM64 platform.
Thank you
Just read the wikipedia page.
There is nothing magic about a trampoline or certainly a particular address, any address where you can have code that executes can hold a trampoline. there are many use cases for them...for example
say you are booting off of a flash, a spi flash, running at some safe rate so that the chip boots for all users. But you want to increase the rate of the spi flash and the spi peripheral does not allow you to change while executing code. So you would copy some code to ram, that code boosts the spi flash rate to a faster rate so you can use and/or run the flash faster, then you bounce back to running from the flash. you have bounced or trampolined off of that little bit of code in ram.
you have a chip that boots from flash, but has the ability to re-map that address space to ram for example, so you copy some code to some other ram, branch to it that little bit of trampoline code remaps the address space, then bounces you back or bounces you to where the flash is now mapped to or whatever.
you will see the gnu linker sometimes add a small trampoline, say you compile some modules as thumb and some others for arm, you no longer have to use that interwork thing, the linker takes care of cleaning this up, it may add an instruction or two to trampoline you between modes, sometimes it modifies the code to just go where it needs to sometimes it modifies the code to branch link somewhere close and that somewhere close is a trampoline.
I assume there may be a need to do the same thing for aarch64 if/when switching to that mode.
so there should be no magic. your specific application might have one or many trampolines, and the one you are interested might not even be called that, but is probably application specific, absolutely no reason why there would be one address for everyone, unless it is some very rigid operating specific (again "application specific") thing and one specific trampoline for that operating system is at some DEFINEd address.

What's the reason why we must flush data cache after copy code from flash to ram?

In embedded system, boot-loader used to init the board and load image. Usually, boot-loader runs in norflash during the 1st stage and need copy itself (.txte+.date code) from flash to ram, then jump to ram execute code.
My question is: when copy code from flash to RAM and cache enable, do we must flush data cache and invalidate the instruction cache? I found uboot and other bootloader execute this operation, but if I don't do that, the system still could boot successful, what's the reason why we must flush data cache after copy code from flash to ram?
Simple embedded MCUs usually do not have any means to "snoop" the bus checking if anybody (even itself) invalidates cache contents with writes to cached memory addresses.
If your MCU has separate data and instruction caches (which most modern MCUs have) and you copy code as data from flash to RAM, you need to flush the data cache (to ensure everything you copied is physically written to RAM) and invalidate the instruction cache (which might contain "old" information from before the copy) to really execute the code you just copied instead of executing what was there before and still resides in instruction cache.
You might get away not doing the latter if you can be sure your MCU has never "seen" the memory area before you just copied (since it will not have cached anything and needs to physically read RAM anyway), but it's good practice to do data cache flushes and instruction cache invalidation nevertheless to stay on the safe side.
Copying code from flash to RAM is a special case of self modifying code and you as a programmer need to make sure it's doing no damage.
I think a main reason can be found in "more than one core" CPUs. Very important in case of asymmetrical cores, eg. i.MX6SoloX (Cortex A9 and Cortex M4 on single chip).
For example in i.MX6SoloX if the slave core (M5) run on RAM (DDR) the main core (A9) is the main cpu that has to provide the M4 core code loaded into RAM at the correct position. Those cores have different D-caches that don't see each other. If the A9 core, after the FLASH to RAM copy, doesn’t flush it’s D-Chache, some part of code is not copied to RAM actually, because of still in D-Cache memory. If you perform this copy from u-boot you can see that A9 (that is running U-Boot) see all data correctly copied, but the M4 see all code, but not the code that still in D-Cache of A9 core.
In your case (a single core like you, I guess) is not mandatory that U-Boot flush the D-cache (after the kernel copy I guess), because the owner of the D-cache is the core itself: it can see its code inside all its memories.
At the very end the reason is that to grant that a performed data copy completely wrote data to a specific address you have to flush D-Cache, otherwise some data can still in caches.

Can I dump/modify the content of x86 CPU cache/TLB

any apps or the system kernel can access or even modify the content of CPU cahce and/or TLB?
I found a short description about the CPU cache from this webiste:
"No programming language has direct access to CPU cache. Reading and writing the cache is something done automatically by the hardware; there's NO way to write instructions which treat the cache as any kind of separate entity. Reads and writes to the cache happen as side-effect to all instructions that touch memory."
From this message, it seems there is no way to read/write the content of CPU cahce/TLB.
However, I also got another information that conflicts with the above one. That information implies that a debug tool may be able to dump/show the content of CPU cache.
Currently I'm confused. so please help me.
I got some answers from another post: dump the contents of TLB buffer of x86 CPU. Thanks adamdunson.
People could read this document about test registers, but it is only available on very old x86 machines test registers
Another descriptions from wiki https://en.wikipedia.org/wiki/Test_register:
A test register, in the Intel 80486 processor, was a register used by
the processor, usually to do a self-test. Most of these registers were
undocumented, and used by specialized software. The test registers
were named TR3 to TR7. Regular programs don't usually require these
registers to work. With the Pentium, the test registers were replaced
by a variety of model-specific registers (MSRs).
Two test registers, TR6 and TR7, were provided for the purpose of
testing. TR6 was the test command register, and TR7 was the test data
register. These registers were accessed by variants of the MOV
instruction. A test register may either be the source operand or the
destination operand. The MOV instructions are defined in both
real-address mode and protected mode. The test registers are
privileged resources. In protected mode, the MOV instructions that
access them can only be executed at privilege level 0. An attempt to
read or write the test registers when executing at any other privilege
level causes a general protection exception. Also, those instructions
generate invalid opcode exception on any CPU newer than 80486.
In fact, I'm still expecting some similar functions on Intel i7 or i5. Unfortunately, I do not find any related document about that. If anyone has such information, please let me know.

Cleared RW (write protect) flag for PTEs of a process in kernel yet no segmentation fault on write

I implemented incremental process checkpointing at page level(I just dump the data from the process address space into a file).
The approach I used is as follows. I used two system calls:
Complete Checkpoint: copy entire address space. Also if write bit
is set for a page, clear it.
Incremental checkpoint: only dump data if write bit is set and clear it again. So basically, I check if write bit is set for an incremental checkpoint. If yes, dump the page data.
Test program:
char a[10000];
sys_cp_range(a,a+10000);
a[3]='A';
sys_incr_cp_range(a,a+10000);
From what I know, the kernel should be doing page fault and handle illegal write case by killing the process with SIGSEGV. Yet the program is successfully checkpointed.
What is exactly happening here ?
If you modify a PTE when it's still cached in the TLB, the effect of the modification may be unseen for a while (until the PTE gets evicted from the TLB and has to be reread from the page table).
You need to invalidate the PTE in the TLB with the invlpg (I'm assuming x86) instruction after PTE modification. And it has to be done on all CPUs. There must be a dedicated function for this purpose in the kernel.
Also it wouldn't hurt to double check that the compiler didn't reorder or throw away anything from the above code.

How does Windows protect transition into kernel mode?

How does Windows protect against a user-mode thread from arbitrarily transitioning the CPU to kernel-mode?
I understand these things are true:
User-mode threads DO actually transition to kernel-mode when a system call is made through NTDLL.
The transition to kernel-mode is done through processor-specific instructions.
So what is special about these system calls through NTDLL? Why can't the user-mode thread fake-it and execute the processor-specific instructions to transition to kernel-mode? I know I'm missing some key piece of Windows architecture here...what is it?
You're probably thinking that thread running in user mode is calling into Ring 0, but that's not what's actually happening. The user mode thread is causing an exception that's caught by the Ring 0 code. The user mode thread is halted and the CPU switches to a kernel/ring 0 thread, which can then inspect the context (e.g., call stack and registers) of the user mode thread to figure out what to do. Before syscall, it really was an exception rather than a special exception specifically to invoke ring 0 code.
If you take the advice of the other responses and read the Intel manuals, you'll see syscall/sysenter don't take any parameters - the OS decides what happens. You can't call arbitrary code. WinNT uses function numbers that map to which kernel mode function the user mode code will execute (for example, NtOpenFile is fnc 75h on my Windows XP machine (the numbers change all the time; it's one of the jobs of NTDll is to map a function call to a fnc number, put it in EAX, point EDX to the incoming parameters then invoke sysenter).
Intel CPUs enforce security using what's called 'Protection Rings'.
There are 4 of these, numbered from 0 to 3. Code running in ring 0 has the highest privileges; it can (practically) do whatever it pleases with your computer. The code in ring 3, on the other hand, is always on a tight leash; it has only limited powers to influence things. And rings 1 and 2 are currently not used for any purpose at all.
A thread running in a higher privileged ring (such as ring 0) can transition to lower privilege ring (such as ring 1, 2 or 3) at will. However, the transition the other way around is strictly regulated. This is how the security of high privileged resources (such as memory) etc. is maintained.
Naturally, your user mode code (applications and all) runs in ring 3 while the OS's code runs in ring 0. This ensures that the user mode threads can't mess with the OS's data structures and other critical resources.
For details on how all this is actually implemented you could read this article. In addition, you may also want to go through Intel Manuals, especially Vol 1 and Vol 3A, which you can download here.
This is the story for Intel processors. I'm sure other architectures have something similar going on.
I think (I may be wrong) that the mechanism which it uses for transition is simple:
User-mode code executes a software interrupt
This (interrupt) causes a branch to a location specified in the interrupt descriptor table (IDT)
The thing that prevents user-mode code from usurping this is as follows: you need to be priviledged to write to the IDT; so only the kernel is able to specify what happens when an interrupt is executed.
Code running in User Mode (Ring 3) can't arbitrarily change to Kernel Mode (Ring 0). It can only do so using special routes -- jump gates, interrupts, and sysenter vectors. These routes are highly protected and input is scrubbed so that bad data can't (shouldn't) cause bad behavior.
All of this is set up by the kernel, usually on startup. It can only be configured in Kernel Mode so User-Mode code can't modify it.
It's probably fair to say that it does it in a (relatively) similar way to what Linux does. In both cases it's going to be CPU-specific, but on x86 probably either a software interrupt with the INT instruction, or via SYSENTER instruction.
The advantage of looking at how Linux does it is that you can do so without a Windows source licence.
The userspace source part is here here at LXR and the
kernel space bit - look at entry_32.S and entry_64.S
Under Linux on x86 there are three different mechanisms, int 0x80, syscall and sysenter.
A library which is built at runtime by the kernel called vdso is called by the C library to implement the syscall function, which uses a different mechanism depending on the CPU and which system call it is. The kernel then has handlers for those mechanisms (if they exist on the specific CPU variant).

Resources