What does write_cr0(read_cr0() | 0x10000) do?

I searched the web a lot but didn't find a short explanation of what write_cr0(read_cr0() | 0x10000) really does. It is related to the Linux kernel, and I am curious about developing LKMs. I want to know what this really does and what the security issues with it are.
It is used to remove the write protection on the syscall table.
But how does it really work, and what does each part of this line do?

CR0 is one of the control registers available on x86 CPUs, which contains flags controlling CPU features related to memory protection, multitasking, paging, etc. You can find a full description in Volume 3, Section 2.5 of Intel's Software Developer's Manual.
These registers are accessed by special instructions that the compiler doesn't normally generate, so read_cr0() is a function which executes the instruction to read this register (via inline assembly) and returns the result in a general-purpose register. Likewise, write_cr0() writes to this register.
The function calls are likely to be inlined, so that the generated code would be something like
mov eax, cr0
or eax, 0x10000
mov cr0, eax
The OR with 0x10000 sets bit 16, the Write Protect bit. On early 32-bit x86 CPUs, code running at supervisor level (like the kernel) was always allowed to write all of virtual memory, regardless of whether the page was marked read-only. This bit makes that optional, so that when it is set, such accesses will cause page faults. This line of code probably follows an earlier line which temporarily cleared the bit.

Why does Windows use RCX, RDX for pointers in a fresh x64 process, different from EAX, EBX in a newly created 32-bit process?

When I create a Windows x86 process in a suspended state (CREATE_SUSPENDED) its CONTEXT contains:
Virtual Address of Entry Point in Eax register;
Virtual Address of Process Environment Block structure in Ebx register.
But when I do the same for an x86_64 process, the CONTEXT contains:
Virtual Address of Entry Point in Rcx register (why not Rax?)
Virtual Address of PEB structure in Rdx register (why not Rbx?)
It seems logical to me to take Rax in x64 in place of Eax in x86 and Rbx in x64 in place of Ebx in x86.
But instead of Eax→Rax and Ebx→Rbx we see Eax→Rcx and Ebx→Rdx.
Also, I see that 64-bit Cheat Engine is aware of this when opening the 32-bit process (notice the migration of the values eax↔ecx and ebx↔edx).
What was the reason to move from *ax register to *cx and from *bx to *dx in 64-bit processes?
Is it somehow connected to calling conventions?
Is it related to Windows only or do other OSes also have this kind of register repurposing?
Update:
Screenshots of just created x64 process in a suspended state:
It seems logical to me to take Rax in x64 in place of Eax in x86 and Rbx in x64 in place of Ebx in x86.
I don't see why it would be logical to assume so.
Even if MS had defined an internal ABI documenting the context of a just-created 32-bit process, the 64-bit version would have been designed anew, so there is no reason to assume it carries anything over from the old 32-bit ABI.
If Windows uses sysret to return to user space, a process created with a suspended state may leak the target address in rcx.
Returning via other mechanisms (e.g. iret/retf), as could be the case for 32-bit code, will of course leak different data in different registers.
What you are seeing is probably an artifact of how Windows returns to user mode. I don't know exactly what the Windows kernel code to return to user mode is, but it is reasonable to assume that MS kept the same interface for 32-bit processes and that this interface was designed before sysret was widely used.
Note that at the PE entry-point rcx contains a pointer to the PEB and rdx to the entry-point (not the other way around). The former appears to be an undocumented parameter passed to the entry-point function, the latter may be just an artifact of how the entry-point is called.
In fact, a 32-bit process will find a pointer to the PEB on the stack, as the first parameter for the PE entry-point code.
Regarding other OSes, anything that is not documented to be stable is free to change at any time (including what's left in the registers). This is true in general.
As far as stability goes, passing from a 32-bit to a 64-bit implementation is a pretty big step and, again, there is no reason to keep using a very old interface (but with wider registers) instead of improving it with all the recent knowledge.
You can easily see that, for example, Linux "repurposed" the registers in the 64-bit system call ABI.

saving general purpose registers in switch_to() in linux 2.6

I saw the code of switch_to in the article "Evolution of the x86 context switch in Linux" in the link https://www.maizure.org/projects/evolution_x86_context_switch_linux/
Most versions of switch_to only save/restore ESP/RSP and/or EBP/RBP, not other call-preserved registers in the inline asm. But the Linux 2.2.0 version does save them in this function, because it uses software context switching instead of relying on hardware TSS stuff. Later Linux versions still do software context switching, but don't have these push / pop instructions.
Are the registers saved in another function (maybe in schedule())? Or is there no need to save these registers in the kernel context?
(I know that those registers of the user context are saved in the kernel stack when the system enters kernel mode).
Linux versions before 2.2.0 use hardware task switching, where the TSS saves/restores registers for you. That's what the "ljmp %0\n\t" is doing. (ljmp is AT&T syntax for a far jmp, presumably to a task gate). I'm not really familiar with hardware TSS stuff because it's not very relevant; it's still used in modern kernels for getting RSP pointing to the kernel stack for interrupt handlers, but not for context switching between tasks.
Hardware task switching is slow, so later kernels avoid it. Linux 2.2 does save/restore the call-preserved registers manually, with push/pop before/after swapping stacks. EAX, EDX, and ECX are declared as dummy outputs ("=a" (eax), "=d" (edx), "=c" (ecx)) so the compiler knows that the old values of those registers are no longer available.
This is a sensible choice because switch_to is probably used inside a non-inline function. The caller will make a function call that eventually returns (after running another task for a while) with the call-preserved registers restored, and the call-clobbered registers clobbered, just like a regular function call. (So compiler code-gen for the function that uses the switch_to macro doesn't need to emit save/restore code outside of the inline asm). If you think about writing a whole context switch function in asm (not inline asm), you'd get this clobbering of volatile registers for free because callers expect that.
So how do later kernels avoid saving/restoring those registers in inline asm?
Linux 2.4 uses "=b" (last) as an output operand, so the compiler has to save/restore EBX in a function that uses this asm. The asm still saves/restores ESI, EDI, and EBP (as well as ESP). The text of the article notes this:
The 2.4 kernel context switch brings a few minor changes: EBX is no longer pushed/popped, but it is now included in the output of the inline assembly. We have a new input argument.
I don't see where they tell the compiler about EAX, ECX, and EDX not surviving, so that's odd. It might be a bug that they get away with by making the function noinline or something?
Linux 2.6 on i386 uses more output operands that get the compiler to handle the save/restore.
But Linux 2.6 for x86-64 introduces the trick that hands off the save/restore to the compiler easily:
#define __EXTRA_CLOBBER ,"rcx","rbx","rdx","r8","r9","r10", "r11","r12","r13","r14","r15"
Notice the clobbers declaration:
: "memory", "cc" __EXTRA_CLOBBER
This tells the compiler that the inline asm destroys all those registers, so the compiler will emit instructions to save/restore these registers at the start/end of whatever function switch_to ultimately inlines into.
Telling the compiler that all the registers are destroyed after a context switch solves the same problem as manually saving/restoring them with inline asm. The compiler will still make a function that obeys the calling convention.
The context switch swaps to the new task's stack, so the compiler-generated save/restore code is always running with the appropriate stack pointer. Notice that the explicit push/pop instructions inside the inline asm in Linux 2.2 and 2.4 are before / after everything else.

Can I use a register as a loop counter?

Since the calling convention of a function states which registers are preserved, can a register be used as a loop counter?
I first thought that the ecx register is used as a loop counter, but after finding out that an stdcall function I have used has not preserved the value of ecx, I thought otherwise.
Is there a register that is guaranteed (by mostly used calling conventions at least) to be preserved?
Note: I don't have a problem in using a stack variable as a loop counter, I just want to make sure that it is the only way.
You can use any general-purpose register, and occasionally others, as the loop counter (just not the stack pointer of course ☺).
Either you use one to loop manually, i.e. replace…
loop label
… with…
dec ebp
jnz label
… which is faster anyway (because AMD (and later Intel, when they caught up, MHz-wise) artificially slowed down the loop instruction as otherwise, Windows® and some Turbo Pascal compiled software crashed).
Or you just save the counter in between:
label:
push ecx
call func
pop ecx
loop label
Both are standard strategies.
Is there a register that is guaranteed (by mostly used calling conventions at least) to be preserved?
You can choose any free register in your own code if your loop code will not call any external entity.
If your loop code will call an external entity where the only guaranteed contract is the ABI and calling convention then you must save/restore your registers and make the register choice case-by-case.
Quoting Agner Fog's excellent paper Calling conventions for different C++ compilers and operating systems:
6 Register usage
The rules for register usage depend on the operating system, as shown in table 4. Scratch registers are registers that can be used for temporary storage without restrictions (also called caller-save or volatile registers). Callee-save registers are registers that you have to save before using them and restore after using them (also called non-volatile registers). You can rely on these registers having the same value after a call as before the call...
...
See also:
Wikipedia: x86 calling conventions

Can I dump/modify the content of x86 CPU cache/TLB

Can an application or the kernel access, or even modify, the contents of the CPU cache and/or TLB?
I found a short description of the CPU cache on this website:
"No programming language has direct access to CPU cache. Reading and writing the cache is something done automatically by the hardware; there's NO way to write instructions which treat the cache as any kind of separate entity. Reads and writes to the cache happen as side-effect to all instructions that touch memory."
From this, it seems there is no way to read or write the contents of the CPU cache or TLB.
However, I also found other information that conflicts with this. It implies that a debugging tool may be able to dump or show the contents of the CPU cache.
I'm currently confused, so please help me.
I got some answers from another post: "dump the contents of TLB buffer of x86 CPU". Thanks, adamdunson.
There is documentation about test registers, but they are only available on very old x86 machines.
Another description, from Wikipedia (https://en.wikipedia.org/wiki/Test_register):
A test register, in the Intel 80486 processor, was a register used by
the processor, usually to do a self-test. Most of these registers were
undocumented, and used by specialized software. The test registers
were named TR3 to TR7. Regular programs don't usually require these
registers to work. With the Pentium, the test registers were replaced
by a variety of model-specific registers (MSRs).
Two test registers, TR6 and TR7, were provided for the purpose of
testing. TR6 was the test command register, and TR7 was the test data
register. These registers were accessed by variants of the MOV
instruction. A test register may either be the source operand or the
destination operand. The MOV instructions are defined in both
real-address mode and protected mode. The test registers are
privileged resources. In protected mode, the MOV instructions that
access them can only be executed at privilege level 0. An attempt to
read or write the test registers when executing at any other privilege
level causes a general protection exception. Also, those instructions
generate invalid opcode exception on any CPU newer than 80486.
I was still hoping for similar functionality on an Intel i7 or i5. Unfortunately, I have not found any documentation about it. If anyone has such information, please let me know.

Is it possible to check if a cache line is invalid and/or a way to manually revalidate a cache line?

(I'm primarily interested in x86 architectures, but would be interested to hear if there's a way to do this in other architectures also)
Is there any way to programmatically check the state of a cache line containing a certain memory address? I don't want to read the contents of that address, and don't want the penalty of reading from main memory incurred from a cache miss, I just want to check the state of the cache line.
And is there any way to programmatically revalidate an invalid cache line without writing through to memory? (well, I guess with MESI, by "revalidate" I mean change its state to "Modified")
The only thing I can think of is to use the PREFETCH optimization hint. So, instead of:
mov eax, (DWORD PTR [esi])
You would do:
prefetch [esi]
;
; give processor time to load cacheline...
;
mov eax, (DWORD PTR [esi])
Fundamentally I think what you're trying to do is flawed. If you were running on a custom single-tasking operating system it might work, but on today's modern multi-tasking systems it's simply impossible to control the processor like that.
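The cache is largely invisible to the programmer. x86 provides hints and maintenance instructions such as PREFETCHh and CLFLUSH, but no instruction that reports the state of a cache line. The programmer works with virtual addresses, while cache lines are tagged with physical addresses.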
