Does a Mutex make a system call? (Windows)

CRITICAL_SECTION locking (enter) and unlocking (leave) are efficient because
CS testing is performed in user space without making the kernel system call that
a mutex makes. Unlocking is performed entirely in user space, whereas ReleaseMutex requires a system call.
I just read these sentences in this book.
What does "kernel system call" mean here? Could you give me the function's name?
I'm an English newbie. I interpreted them like this:
CS testing doesn't use a system call.
Mutex testing uses a system call. (But I don't know the function name. Let me know.)
CS unlocking doesn't call a system call.
Mutex unlocking requires a system call. (But I don't know the function name. Let me know.)
Another question.
I think CRITICAL_SECTION might call WaitForSingleObject or related functions. Don't those functions require a system call? I guess they do, so "CS testing doesn't use a system call" seems very strange to me.

The implementation of critical sections in Windows has changed over the years, but it has always been a combination of user-mode and kernel calls.
The CRITICAL_SECTION is a structure that contains user-mode-updated values, a handle to a kernel-mode object (an EVENT or something like that), and debug information.
EnterCriticalSection uses an interlocked test-and-set operation to acquire the lock. If successful, this is all that is required (almost - it also updates the owner thread). If the test-and-set operation fails to acquire the lock, a longer path is used which usually requires waiting on a kernel object with WaitForSingleObject. If you initialized with InitializeCriticalSectionAndSpinCount then EnterCriticalSection may spin and retry the interlocked acquisition in user mode before waiting.
Below is a disassembly of the "fast" / uncontended path of EnterCriticalSection in Windows 7 (64-bit) with some comments inline:
0:000> u rtlentercriticalsection rtlentercriticalsection+35
ntdll!RtlEnterCriticalSection:
00000000`77ae2fc0 fff3 push rbx
00000000`77ae2fc2 4883ec20 sub rsp,20h
; RCX points to the critical section rcx+8 is the LockCount
00000000`77ae2fc6 f00fba710800 lock btr dword ptr [rcx+8],0
00000000`77ae2fcc 488bd9 mov rbx,rcx
00000000`77ae2fcf 0f83e9b1ffff jae ntdll!RtlEnterCriticalSection+0x31 (00000000`77ade1be)
; got the critical section - update the owner thread and recursion count
00000000`77ae2fd5 65488b042530000000 mov rax,qword ptr gs:[30h]
00000000`77ae2fde 488b4848 mov rcx,qword ptr [rax+48h]
00000000`77ae2fe2 c7430c01000000 mov dword ptr [rbx+0Ch],1
00000000`77ae2fe9 33c0 xor eax,eax
00000000`77ae2feb 48894b10 mov qword ptr [rbx+10h],rcx
00000000`77ae2fef 4883c420 add rsp,20h
00000000`77ae2ff3 5b pop rbx
00000000`77ae2ff4 c3 ret
So the bottom line is that if the thread does not need to block it will not use a system call, just an interlocked test-and-set operation. If blocking is required, there will be a system call. The release path also uses an interlocked test-and-set and may require a system call if other threads are blocked.
Compare this to a Mutex, which always requires a system call: NtWaitForSingleObject to acquire and NtReleaseMutant to release.
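To make the difference concrete, here is a minimal C sketch of the two APIs being compared (my own illustration, not from the book or the disassembly above; the spin count of 4000 is just an example value):

#include <windows.h>

CRITICAL_SECTION g_cs;
HANDLE g_mutex;

void init(void)
{
    // Spin a little in user mode before falling back to a kernel wait.
    InitializeCriticalSectionAndSpinCount(&g_cs, 4000);
    g_mutex = CreateMutex(NULL, FALSE, NULL);
}

void with_critical_section(void)
{
    EnterCriticalSection(&g_cs);  // interlocked test-and-set; kernel wait only on contention
    // ... protected work ...
    LeaveCriticalSection(&g_cs);  // user-mode release; kernel call only if other threads are blocked
}

void with_mutex(void)
{
    WaitForSingleObject(g_mutex, INFINITE);  // always a system call (NtWaitForSingleObject)
    // ... protected work ...
    ReleaseMutex(g_mutex);                   // always a system call (NtReleaseMutant)
}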

Calling into the kernel requires a context switch, which takes a small (but measurable) performance hit every time. The function in question is ReleaseMutex() itself.
The critical section functions are available in kernel32.dll (at least from the caller's point of view - see comments for discussion about ntdll.dll) and can often avoid making any calls into the kernel.
It is worthwhile to know that Mutex objects can be accessed from different processes at the same time. On the other hand, CRITICAL_SECTION objects are limited to one process.

To my knowledge, critical sections are implemented using semaphores.
The critical section functions are implemented in NTDLL, which implements some runtime functions in user mode and passes control to the kernel for others (system calls). The functions in kernel32.dll are simple function forwarders.
Mutexes on the other hand are kernel objects and require a system call as such. The kernel calls them "mutants", by the way (no joke).

Critical section calls only transition to kernel mode if there is contention and only then if they can't relieve the contention by spinning. In that case the thread blocks and calls a wait function – that's a system call.

Related

What does write_cr0(read_cr0() | 0x10000) do?

I searched the web a lot but didn't find a short explanation of what write_cr0(read_cr0() | 0x10000) really does. It is related to the Linux kernel, and I am curious about developing LKMs. I want to know what this really does and what the security issues with it are.
It is used to remove the write protection on the syscall table.
But how does it really work, and what does each part of this line do?
CR0 is one of the control registers available on x86 CPUs, which contains flags controlling CPU features related to memory protection, multitasking, paging, etc. You can find a full description in Volume 3, Section 2.5 of Intel's Software Developer's Manual.
These registers are accessed by special instructions that the compiler doesn't normally generate, so read_cr0() is a function which executes the instruction to read this register (via inline assembly) and returns the result in a general-purpose register. Likewise, write_cr0() writes to this register.
The function calls are likely to be inlined, so that the generated code would be something like
mov eax, cr0
or eax, 0x10000
mov cr0, eax
The OR with 0x10000 sets bit 16, the Write Protect bit. On early 32-bit x86 CPUs, code running at supervisor level (like the kernel) was always allowed to write all of virtual memory, regardless of whether the page was marked read-only. This bit makes that optional, so that when it is set, such accesses will cause page faults. This line of code probably follows an earlier line which temporarily cleared the bit.
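For illustration, here is a rough C sketch of what such helpers and the surrounding pattern boil down to (my own code, not the kernel's actual implementation - a real module would use the kernel's own read_cr0()/write_cr0(); the names cr0_read, cr0_write and patch_example are made up). Note that newer kernels pin CR0.WP, so write_cr0() may refuse to clear it:

#define X86_CR0_WP 0x10000UL  /* bit 16: Write Protect */

/* Read CR0 via inline assembly (what read_cr0() does under the hood). */
static inline unsigned long cr0_read(void)
{
    unsigned long val;
    asm volatile("mov %%cr0, %0" : "=r"(val));
    return val;
}

/* Write CR0 via inline assembly (what write_cr0() does under the hood). */
static inline void cr0_write(unsigned long val)
{
    asm volatile("mov %0, %%cr0" : : "r"(val) : "memory");
}

/* The typical (and dangerous) pattern around patching read-only kernel data:
 * clear WP, write, then restore it - the "| 0x10000" in the question is the
 * restore step. Preemption/interrupt handling is omitted for brevity. */
static void patch_example(unsigned long *table, int idx, unsigned long value)
{
    unsigned long cr0 = cr0_read();

    cr0_write(cr0 & ~X86_CR0_WP);  /* supervisor writes now ignore read-only pages */
    table[idx] = value;
    cr0_write(cr0 | X86_CR0_WP);   /* put the write protection back */
}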

Is it valid to write below ESP?

For a 32-bit Windows application, is it valid to use stack memory below ESP for temporary swap space without explicitly decrementing ESP?
Consider a function that returns a floating point value in ST(0). If our value is currently in EAX we would, for example,
PUSH EAX
FLD [ESP]
ADD ESP,4 // or POP EAX, etc
// return...
Or without modifying the ESP register, we could just :
MOV [ESP-4], EAX
FLD [ESP-4]
// return...
In both cases the same thing happens except that in the first case we take care to decrement the stack pointer before using the memory, and then to increment it afterwards. In the latter case we do not.
Notwithstanding any real need to persist this value on the stack (reentrancy issues, function calls between PUSHing and reading the value back, etc) is there any fundamental reason why writing to the stack below ESP like this would be invalid?
TL;DR: no, there are some SEH corner cases that can make it unsafe in practice, as well as it being documented as unsafe. @Raymond Chen recently wrote a blog post that you should probably read instead of this answer.
His example of a code-fetch page-fault I/O error that can be "fixed" by prompting the user to insert a CD-ROM and retry is also my conclusion for the only practically-recoverable fault if there aren't any other possibly-faulting instructions between store and reload below ESP/RSP.
Or if you ask a debugger to call a function in the program being debugged, it will also use the target process's stack.
This answer has a list of some things you'd think would potentially step on memory below ESP, but actually don't, which might be interesting. It seems to be only SEH and debuggers that can be a problem in practice.
First of all, if you care about efficiency, can't you avoid x87 in your calling convention? movd xmm0, eax is a more efficient way to return a float that was in an integer register. (And you can often avoid moving FP values to integer registers in the first place, using SSE2 integer instructions to pick apart exponent / mantissa for a log(x), or integer add 1 for nextafter(x).) But if you need to support very old hardware, then you need a 32-bit x87 version of your program as well as an efficient 64-bit version.
But there are other use-cases for small amounts of scratch space on the stack where it would be nice to save a couple instructions that offset ESP/RSP.
Trying to collect up the combined wisdom of other answers and discussion in comments under them (and on this answer):
It is explicitly documented as being not safe by Microsoft: (for 64-bit code, I didn't find an equivalent statement for 32-bit code but I'm sure there is one)
Stack Usage (for x64)
All memory beyond the current address of RSP is considered volatile: The OS, or a debugger, may overwrite this memory during a user debug session, or an interrupt handler.
So that's the documentation, but the interrupt reason stated doesn't make sense for the user-space stack, only the kernel stack. The important part is that they document it as not guaranteed safe, not the reasons given.
Hardware interrupts can't use the user stack; that would let user-space crash the kernel with mov esp, 0, or worse take over the kernel by having another thread in the user-space process modify return addresses while an interrupt handler was running. This is why kernels always configure things so interrupt context is pushed onto the kernel stack.
Modern debuggers run in a separate process, and are not "intrusive". Back in 16-bit DOS days, without a multi-tasking protected-memory OS to give each task its own address space, debuggers would use the same stack as the program being debugged, between any two instructions while single-stepping.
@RossRidge points out that a debugger might want to let you call a function in the context of the current thread, e.g. with SetThreadContext. This would run with ESP/RSP just below the current value. This could obviously have side-effects for the process being debugged (intentional on the part of the user running the debugger), but clobbering local variables of the current function below ESP/RSP would be an undesirable and unexpected side-effect. (So compilers can't put them there.)
(In a calling convention with a red-zone below ESP/RSP, a debugger could respect that red-zone by decrementing ESP/RSP before making the function call.)
There are existing program that intentionally break when being debugged at all, and consider this a feature (to defend against efforts to reverse-engineer them).
Related: the x86-64 System V ABI (Linux, OS X, all other non-Windows systems) does define a red-zone for user-space code (64-bit only): 128 bytes below RSP that is guaranteed not to be asynchronously clobbered. Unix signal handlers can run asynchronously between any two user-space instructions, but the kernel respects the red-zone by leaving a 128 byte gap below the old user-space RSP, in case it was in use. With no signal handlers installed, you have an effectively unlimited red-zone even in 32-bit mode (where the ABI does not guarantee a red-zone). Compiler-generated code, or library code, of course can't assume that nothing else in the whole program (or in a library the program called) has installed a signal handler.
So the question becomes: is there anything on Windows that can asynchronously run code using the user-space stack between two arbitrary instructions? (i.e. any equivalent to a Unix signal handler.)
As far as we can tell, SEH (Structured Exception Handling) is the only real obstacle to what you propose for user-space code on current 32 and 64-bit Windows. (But future Windows could include a new feature.)
And I guess debugging, if you happen to ask your debugger to call a function in the target process/thread, as mentioned above.
In this specific case, not touching any other memory other than the stack, or doing anything else that could fault, it's probably safe even from SEH.
SEH (Structured Exception Handling) lets user-space software have hardware exceptions like divide by zero delivered somewhat similarly to C++ exceptions. These are not truly asynchronous: they're for exceptions triggered by instructions you ran, not for events that happened to come after some random instruction.
But unlike normal exceptions, one thing an SEH handler can do is resume from where the exception occurred. (@RossRidge commented: SEH handlers are initially called in the context of the unwound stack and can choose to ignore the exception and continue executing at the point where the exception occurred.)
So that's a problem even if there's no catch() clause in the current function.
Normally HW exceptions can only be triggered synchronously. e.g. by a div instruction, or by a memory access which could fault with STATUS_ACCESS_VIOLATION (the Windows equivalent of a Linux SIGSEGV segmentation fault). You control what instructions you use, so you can avoid instructions that might fault.
If you limit your code to only accessing stack memory between the store and reload, and you respect the stack-growth guard page, your program won't fault from accessing [esp-4]. (Unless you reached the max stack size (Stack Overflow), in which case push eax would fault, too, and you can't really recover from this situation because there's no stack space for SEH to use.)
So we can rule out STATUS_ACCESS_VIOLATION as a problem, because if we get that on accessing stack memory we're hosed anyway.
An SEH handler for STATUS_IN_PAGE_ERROR could run before any load instruction. Windows can page out any page it wants to, and transparently page it back in when it's needed again (virtual memory paging). But if there's an I/O error, Windows attempts to let your process handle the failure by delivering a STATUS_IN_PAGE_ERROR exception.
Again, if that happens to the current stack, we're hosed.
But code-fetch could cause STATUS_IN_PAGE_ERROR, and you could plausibly recover from that. But not by resuming execution at the place where the exception occurred (unless we can somehow remap that page to another copy in a highly fault-tolerant system??), so we might still be ok here.
An I/O error paging in the code that wants to read what we stored below ESP rules out any chance of reading it. If you weren't planning to do that anyway, you're fine. A generic SEH handler that doesn't know about this specific piece of code wouldn't be trying to do that anyway. I think usually a STATUS_IN_PAGE_ERROR would at most try to print an error message or maybe log something, not try to carry on whatever computation was happening.
Accessing other memory in between the store and reload to memory below ESP could trigger a STATUS_IN_PAGE_ERROR for that memory. In library code, you probably can't assume that some other pointer you passed isn't going to be weird and the caller is expecting to handle STATUS_ACCESS_VIOLATION or PAGE_ERROR for it.
Current compilers don't take advantage of space below ESP/RSP on Windows, even though they do take advantage of the red-zone in x86-64 System V (in leaf functions that need to spill / reload something, exactly like what you're doing for int -> x87). That's because MS says it isn't safe, and they don't know whether SEH handlers exist that could try to resume after an exception.
Things that you'd think might be a problem in current Windows, and why they're not:
The guard page stuff below ESP: as long as you don't go too far below the current ESP, you'll be touching the guard page and trigger allocation of more stack space instead of faulting. This is fine as long as the kernel doesn't check user-space ESP and find out that you're touching stack space without having "reserved" it first.
kernel reclaim of pages below ESP/RSP: apparently Windows doesn't currently do this. So using a lot of stack space once ever will keep those pages allocated for the rest of your process lifetime, unless you manually VirtualAlloc(MEM_RESET) them. (The kernel would be allowed to do this, though, because the docs say memory below RSP is volatile. The kernel could effectively zero it asynchronously if it wants to, copy-on-write mapping it to a zero page instead of writing it to the pagefile under memory pressure.)
APCs (Asynchronous Procedure Calls): They can only be delivered when the thread is in an "alertable state", which means only when inside a call to a function like SleepEx(0,1). Calling a function already uses an unknown amount of space below E/RSP, so you already have to assume that every call clobbers everything below the stack pointer. Thus these "async" callbacks are not truly asynchronous with respect to normal execution the way Unix signal handlers are. (Fun fact: POSIX async I/O does use signal handlers to run callbacks.)
Console-application callbacks for ctrl-C and other events (SetConsoleCtrlHandler). This looks exactly like registering a Unix signal handler, but in Windows the handler runs in a separate thread with its own stack. (See RbMm's comment)
SetThreadContext: another thread could change our EIP/RIP asynchronously while this thread is suspended, but the whole program has to be written specially for that to make any sense. Unless it's a debugger using it. Correctness is normally not required when some other thread is messing around with your EIP unless the circumstances are very controlled.
And apparently there are no other ways that another process (or something this thread registered) can trigger execution of anything asynchronously with respect to the execution of user-space code on Windows.
If there are no SEH handlers that could try to resume, Windows more or less has a 4096 byte red-zone below ESP (or maybe more if you touch it incrementally?), but RbMm says nobody takes advantage of it in practice. This is unsurprising because MS says not to, and you can't always know if your callers might have done something with SEH.
Obviously anything that would synchronously clobber it (like a call) must also be avoided, again same as when using the red-zone in the x86-64 System V calling convention. (See https://stackoverflow.com/tags/red-zone/info for more about it.)
In the general case (on the x86/x64 platform), an interrupt can occur at any time, and it will overwrite memory below the stack pointer if it is serviced on the current stack. Because of this, even temporarily saving something below the stack pointer is not valid in kernel mode - an interrupt will use the current kernel stack. In user mode the situation is different: Windows builds the interrupt table (IDT) in such a way that when an interrupt is raised it is always handled in kernel mode, on a kernel stack. As a result the user-mode stack (below the stack pointer) is not affected, and it is possible to temporarily use some stack space below the stack pointer, as long as you do not make any function calls. If an exception occurs (say, by accessing an invalid address), the space below the stack pointer will also be overwritten - the CPU exception of course begins to be handled in kernel mode on the kernel stack, but the kernel then executes a callback in user space via ntdll!KiUserExceptionDispatcher on the current stack. So in general this is valid in Windows user mode (in the current implementation), but you need to understand well what you are doing. I think this is very rarely used, however.
Of course, as correctly noted in the comments, the fact that we can write below the stack pointer in Windows user mode is just the current implementation's behavior; it is not documented or guaranteed.
But this behavior is very fundamental and unlikely to change: interrupts will always be handled in privileged kernel mode only, and kernel mode will only use a kernel-mode stack. The user-mode context is not trusted at all. What would happen if a user-mode program set an invalid stack pointer, say with
mov rsp,1 or mov esp,1, and an interrupt were raised just after that instruction? If the interrupt were handled on such an invalid esp/rsp, the whole operating system would crash. Exactly because of this, interrupts are handled only on the kernel stack and do not overwrite user stack space.
Also note that the stack is a limited resource (even in user mode); accessing it more than one page (4 KB) below the committed area is already an error (you need to probe the stack page by page to move the guard page down).
Finally, there is usually no real need for MOV [ESP-4], EAX - what is the problem with decrementing ESP first? Even if we need to access that stack space in a loop a huge number of times, the stack pointer only needs to be decremented once - one additional instruction (outside the loop) that changes nothing in performance or code size.
So although formally this will work correctly in Windows user mode, it is better not to use it (and there is no need to).
Of course, the formal documentation says:
Stack Usage
All memory beyond the current address of RSP is considered volatile
But that is for the general case, which includes kernel mode too; I am writing about user mode, based on the current implementation.
It is possible that a future Windows will add "direct" APCs or some kind of "direct" signals - code executed via a callback just as the thread enters the kernel (during an ordinary hardware interrupt). After that, everything below ESP would be undefined. But until such a mechanism exists, this code will always work correctly (in current builds).
In general (not specifically related to any OS); it's not safe to write below ESP if:
It's possible for the code to be interrupted and the interrupt handler will run at the same privilege level. Note: This is typically very unlikely for "user-space" code, but extremely likely for kernel code.
You call any other code (where either the call or the stack used by the called routine can trash the data you stored below ESP)
Something else depends on "normal" stack use. This can include signal handling, (language-based) exception unwinding, debuggers, and "stack smashing protector".
It's safe to write below ESP if it's not "not safe".
Note that for 64-bit code, writing below RSP is built into the x86-64 ABI ("red zone"); and is made safe by support for it in tool chains/compilers and everything else.
When a thread gets created, Windows reserves a contiguous region of virtual memory of a configurable size (the default is 1 MB) for the thread's stack. Initially, the stack looks like this (the stack grows downwards):
--------------
| committed |
--------------
| guard page |
--------------
| . |
| reserved |
| . |
| . |
| |
--------------
ESP will be pointing somewhere inside the committed page. The guard page is used to support automatic stack growth. The reserved pages region ensures that the requested stack size is available in virtual memory.
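As an aside, you can observe this layout at runtime. The sketch below (my own illustration, not part of the original answer) uses VirtualQuery to walk the regions of the current thread's stack reservation from the bottom up and print their state:

#include <windows.h>
#include <stdio.h>

// Walk upward from the low end of the current thread's stack reservation and
// print the regions that make it up (reserved, guard page, committed).
int main(void)
{
    char local = 0;
    MEMORY_BASIC_INFORMATION mbi;

    // AllocationBase of the region containing 'local' is the base of the
    // whole reserved stack area.
    VirtualQuery((LPCVOID)&local, &mbi, sizeof(mbi));
    char *stack_alloc_base = (char *)mbi.AllocationBase;
    char *p = stack_alloc_base;

    while (VirtualQuery(p, &mbi, sizeof(mbi)) == sizeof(mbi) &&
           mbi.AllocationBase == stack_alloc_base)
    {
        printf("%p  size=%8zu  %s%s\n",
               mbi.BaseAddress, mbi.RegionSize,
               mbi.State == MEM_COMMIT ? "committed" :
               mbi.State == MEM_RESERVE ? "reserved " : "free     ",
               (mbi.Protect & PAGE_GUARD) ? " (guard page)" : "");
        p += mbi.RegionSize;
    }
    return 0;
}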
Consider the two instructions from the question:
MOV [ESP-4], EAX
FLD [ESP-4]
There are three possibilities:
The first instruction executes successfully. There is nothing that uses the user-mode stack that can execute between the two instructions. So the second instruction will use the correct value (@RbMm stated this in the comments under his answer, and I agree).
The first instruction raises an exception and an exception handler does not return EXCEPTION_CONTINUE_EXECUTION. As long as the second instruction is immediately after the first one (it is not in the exception handler or placed after it), then the second instruction will not execute, so you're still safe. Execution continues from the stack frame where the exception handler exists.
The first instruction raises an exception and an exception handler returns EXCEPTION_CONTINUE_EXECUTION. Execution continues from the same instruction that raised the exception (potentially with a context modified by the handler). In this particular example, the first instruction will simply be re-executed to write the value below ESP - no problem. But if the second instruction raised an exception, or there were more than two instructions, the exception might occur at a place after a value has been written below ESP. When the exception handler gets called, it may overwrite that value and then return EXCEPTION_CONTINUE_EXECUTION. When execution resumes, the value is assumed to still be there, but it isn't anymore. This is a situation where it's not safe to write below ESP, and it applies even if all of the instructions are placed consecutively. Thanks to @RaymondChen for pointing this out.
In general, if the two instructions are not placed back-to-back and you are writing to locations beyond ESP, there is no guarantee that the written values won't get corrupted or overwritten. One case I can think of where this might happen is structured exception handling (SEH). If a hardware-defined exception (such as divide by zero) occurs, the kernel's exception dispatcher transfers control back to user mode at ntdll!KiUserExceptionDispatcher, which invokes the user-mode side of the dispatching (RtlDispatchException). When switching from user mode to kernel mode and back, whatever value was in ESP is saved and restored. However, the user-mode dispatcher itself uses the user-mode stack and iterates over the registered list of exception handlers, each of which also uses the user-mode stack. These functions modify ESP as required, which may lead to losing the values you've written beyond ESP. A similar situation occurs when using software-defined exceptions (throw in VC++).
I think you can deal with this by registering your own exception handler before any other exception handlers (so that it is called first). When your handler gets called, you can save your data beyond ESP elsewhere. Later, during unwinding, you get the cleanup opportunity to restore your data to the same location (or any other location) on the stack.
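One way to approximate "called before any other handler" is a vectored exception handler, which runs before frame-based SEH handlers. A minimal sketch (my own, with hypothetical names; the actual save/restore logic is only hinted at in the comments):

#include <windows.h>

// Hypothetical first-chance handler: runs before frame-based SEH handlers,
// giving you a chance to copy data stashed below ESP to a safe place before
// the normal handler chain starts using that stack memory.
static LONG CALLBACK RescueBelowEspHandler(PEXCEPTION_POINTERS info)
{
    // info->ContextRecord holds the faulting thread's registers (Esp on x86,
    // Rsp on x64); save whatever you stored below the stack pointer here.
    (void)info;
    return EXCEPTION_CONTINUE_SEARCH;  // then let the regular handlers run
}

void InstallRescueHandler(void)
{
    AddVectoredExceptionHandler(1 /* call first */, RescueBelowEspHandler);
}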
You need also to similarly watch out for asynchronous procedure calls (APCs) and callbacks.
Several answers here mention APCs (Asynchronous Procedure Calls), saying that they can only be delivered when the thread is in an "alertable state" and are not truly asynchronous with respect to normal execution the way Unix signal handlers are.
Windows 10 version 1809 introduces Special User APCs, which can fire at any moment just like Unix signals. See this article for low level details.
The Special User APC is a mechanism that was added in RS5 (and exposed through NtQueueApcThreadEx), but lately (in an insider build) was exposed through a new syscall - NtQueueApcThreadEx2. If this type of APC is used, the thread is signaled in the middle of the execution to execute the special APC.

writing a !address equivalent in WinAPI

Implementing the !address feature of Windbg...
I am using VirtualQueryEx to query another process's memory, and using getModuleFileName on the base addresses returned from VirtualQueryEx gives the module name.
What is left are the other, non-module regions of a process. How do I determine if a file is mapped to a region, or if the region represents the stack, the heap, the PEB/TEB, etc.?
Basically, how do I figure out if a region represents the heap, the stack or the PEB? How does Windbg do it?
One approach is to disassemble the code in the debugger extension DLL that implements !address. There is documentation within the Windbg help file on writing an extension. You could use that documentation to reverse engineer where the handler of !address is located. Then browsing through the disassembly you can see what functions it calls.
Windbg has support for debugging another instance of Windbg, specifically to debug an extension DLL. You can use this facility to better delve into the implementation of !address.
While the reverse engineering approach may be tedious, it will be more deterministic than theorizing how !address is implemented and trying out each theory.
To add to @Χpẘ's answer, reversing the command shouldn't really be hard, as debugger extension DLLs come with symbols (I already reversed one to explain the internal flag of the !heap command).
Note that this is just a quick overview; I haven't dug into it too much.
According to the !address documentation the command is located in exts.dll library. The command itself is located in Extension::address.
There are two commands handled there, a kernel mode (KmAnalyzeAddress) and a user mode one (UmAnalyzeAddress).
Inside UmAnalyzeAddress, the code:
Parses the command line: UmParseCommandLine(CmdArgs &,UmFilterData &)
Checks if the process PEB is available: IsTypeAvailable(char const *,ulong *) with "${$ntdllsym}!_PEB"
Allocates a std::list of user-mode ranges: std::list<UmRange,std::allocator<UmRange>>::list<UmRange,std::allocator<UmRange>>(void)
Starts a loop to gather the required information:
UmRangeData::GetWowState(void)
UmMapBuild
UmMapFileMappings
UmMapModules
UmMapPebs
UmMapTebsAndStacks
UmMapHeaps
UmMapPageHeaps
UmMapCLR
UmMapOthers
Finally, the results are output to the screen using UmPrintResults.
Each of the above functions can be reduced to basic components; e.g. UmMapFileMappings has the following central code:
.text:101119E0 push edi ; hFile
.text:101119E1 push offset LibFileName ; "psapi.dll"
.text:101119E6 call ds:LoadLibraryExW(x,x,x)
.text:101119EC mov [ebp+hLibModule], eax
.text:101119F2 test eax, eax
.text:101119F4 jz loc_10111BC3
.text:101119FA push offset ProcName ; "GetMappedFileNameW"
.text:101119FF push eax ; hModule
.text:10111A00 mov byte ptr [ebp+var_4], 1
.text:10111A04 call ds:GetProcAddress(x,x)
Another example: to find each stack, the code just loops through all threads, gets their TEB and calls:
.text:1010F44C push offset aNttib_stackbas ; "NtTib.StackBase"
.text:1010F451 lea edx, [ebp+var_17C]
.text:1010F457 lea ecx, [ebp+var_CC]
.text:1010F45D call ExtRemoteTyped::Field(char const *)
There is a lot of fetching from _PEB, _TEB, _HEAP and other internal structures, so it's probably not doable without going directly through those structures. So I guess that some of the information returned by !address is not accessible through the usual / common APIs.
You need to determine if the address you are interested in lies within a memory-mapped file. Check out GetMappedFileName. Getting the heap and stack addresses of a process will be a little more problematic, as the ranges are dynamic and don't always lie sequentially.
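A sketch of that check (my own code, assuming you already have a process handle with PROCESS_QUERY_INFORMATION | PROCESS_VM_READ access; GetBackingFile is a made-up name):

#include <windows.h>
#include <psapi.h>

// Returns TRUE and fills 'path' if a file is mapped at 'addr' in hProcess.
// MEM_IMAGE covers loaded modules, MEM_MAPPED covers data/file mappings;
// MEM_PRIVATE regions are the candidates for heap, stack, PEB/TEB, etc.
BOOL GetBackingFile(HANDLE hProcess, LPCVOID addr, WCHAR *path, DWORD cch)
{
    MEMORY_BASIC_INFORMATION mbi;

    if (VirtualQueryEx(hProcess, addr, &mbi, sizeof(mbi)) != sizeof(mbi))
        return FALSE;
    if (mbi.Type != MEM_IMAGE && mbi.Type != MEM_MAPPED)
        return FALSE;

    return GetMappedFileNameW(hProcess, (LPVOID)addr, path, cch) != 0;
}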
Lol, I don't know; I would start with a handle to the heap. If you can spawn/inherit a process, then you more than likely can access a handle to its heap. This function looks promising: GetProcessHeap. That debug app runs as admin; it can walk the process chain and spy on any user-level process. I don't think you will be able to access the protected memory of kernel-mode components such as file system filters, however, as they are dug in a little lower by policy.
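For the heap part specifically, the ToolHelp API can at least give you the base address of each heap in another process, which you can then match against the regions returned by VirtualQueryEx. A rough sketch (my own illustration):

#include <windows.h>
#include <tlhelp32.h>
#include <stdio.h>

// Print the heap IDs (base addresses) of every heap in process 'pid'.
void ListHeaps(DWORD pid)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPHEAPLIST, pid);
    HEAPLIST32 hl = { sizeof(hl) };

    if (snap == INVALID_HANDLE_VALUE)
        return;
    if (Heap32ListFirst(snap, &hl)) {
        do {
            printf("heap at %p%s\n", (void *)hl.th32HeapID,
                   (hl.dwFlags & HF32_DEFAULT) ? " (default process heap)" : "");
        } while (Heap32ListNext(snap, &hl));
    }
    CloseHandle(snap);
}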

Can I use a register as a loop counter?

Since the calling convention of a function states which registers are preserved, can a register be used as a loop counter?
I first thought that the ecx register is used as a loop counter, but after finding out that an stdcall function I have used has not preserved the value of ecx, I thought otherwise.
Is there a register that is guaranteed (by mostly used calling conventions at least) to be preserved?
Note: I don't have a problem in using a stack variable as a loop counter, I just want to make sure that it is the only way.
You can use any general-purpose register, and occasionally others, as the loop counter (just not the stack pointer of course ☺).
Either you use one to loop manually, i.e. replace…
loop label
… with…
dec ebp
jnz label
… which is faster anyway (because AMD (and later Intel, when they caught up, MHz-wise) artificially slowed down the loop instruction as otherwise, Windows® and some Turbo Pascal compiled software crashed).
Or you just save the counter in between:
label:
push ecx
call func
pop ecx
loop label
Both are standard strategies.
Is there a register that is guaranteed (by mostly used calling conventions at least) to be preserved?
You can choose any free register in your own code if your loop code will not call any external entity.
If your loop code will call an external entity where the only guaranteed contract is the ABI and calling convention then you must save/restore your registers and make the register choice case-by-case.
Quoting Agner Fog's excellent paper Calling conventions for different C++ compilers and operating systems:
6 Register usage
The rules for register usage depend on the operating system, as shown in table 4. Scratch registers are registers that can be used for temporary storage without restrictions (also called caller-save or volatile registers). Callee-save registers are registers that you have to save before using them and restore after using them (also called non-volatile registers). You can rely on these registers having the same value after a call as before the call...
...
See also:
Wikipedia: x86 calling conventions

Hooking, DLL injection and thread safety

When I'm overwriting the first opcodes of a function with a jmp opcode, I'm actually writing 5 bytes (or 2 for a short jmp).
But what if another thread (from the same process) calls this function while I'm changing it?
This will cause unexpected behavior.
But I didn't find any explanation. The hooking articles ignore it, as if there were no problem.
Maybe in the Win32 API you can rely on the NOP padding and the mov edi,edi at function entry (hot patching), but my question is more theoretical.
thanks
It is quite possible to cause issues. You can create a critical section protecting the code to be changed and enter it to ensure exclusive access while changing the code.
In the concurrent case, the executing thread can (theoretically) see the new first byte and proceed to execute a jump using the following 4 bytes (in the case of a long jump) before they have been written. In the case of a call, the next instruction pointer is pushed prior to the jump, and that is current + 5. Theoretically, a ret may then cause that thread to run into unmodified instructions (where you might need a nop, for example).
This is all theoretical, but you should prevent concurrent access while changing code.
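For illustration, a rough sketch of the patch itself (my own code, 32-bit style; InstallJmpHook is a made-up name). Note that it does not solve the concurrency problem by itself - you still need to keep other threads away from the function, e.g. by suspending them, while it runs:

#include <windows.h>
#include <string.h>

// Overwrite the first 5 bytes of 'target' with a relative JMP to 'hook'.
// The caller must ensure no other thread is executing 'target' meanwhile.
BOOL InstallJmpHook(void *target, void *hook)
{
    BYTE patch[5];
    DWORD oldProtect;

    patch[0] = 0xE9;  // JMP rel32
    *(LONG *)&patch[1] = (LONG)((BYTE *)hook - ((BYTE *)target + 5));

    if (!VirtualProtect(target, sizeof(patch), PAGE_EXECUTE_READWRITE, &oldProtect))
        return FALSE;
    memcpy(target, patch, sizeof(patch));
    VirtualProtect(target, sizeof(patch), oldProtect, &oldProtect);
    FlushInstructionCache(GetCurrentProcess(), target, sizeof(patch));
    return TRUE;
}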
If you inject into a specific process, you can suspend the process, install all your hooks, and resume it after that.
