TIB Custom Storage

TIB Custom Storage - windows

After quite a bit of googling and some hints given here, I finally managed to find a layout of the FS segment (used by windows to store TIB data). Of particular interest to me is the ArbitraryUserPointer member provided in the PSDK:
typedef struct _NT_TIB {
struct _EXCEPTION_REGISTRATION_RECORD *ExceptionList;
PVOID StackBase;
PVOID StackLimit;
PVOID SubSystemTib;
union {
PVOID FiberData;
DWORD Version;
};
PVOID ArbitraryUserPointer;
struct _NT_TIB *Self;
} NT_TIB;
How safe exactly is it to use this variable (under Vista and above)? and does it still exist on x64?
Secondary to that is the access of this variable. I'm using MSVC, and as such I
have access to the __readfsdword & __readgsqword intrinsics, however, MSDN for some reason marks these as privileged instructions:
These intrinsics are only available in kernel mode, and the routines are only available as intrinsics.
They are of course not kernel only, but why are they marked as such, just incorrect documentation? (my offline VS 2008 docs don't have this clause).
Finally, is it safe to access ArbitraryUserPointer directly via a single __readfsdword(0x14) or is it preferred to use it via the linear TIB address? (which will still require a read from FS).

ArbitraryUserPointer is an internal field not for general use. The operating system uses it internally, and if you overwrite it, you will corrupt stuff. I concede that it has a very poor name.

In case you're still for an answer, I've had the same problem too and posted my question, similar to yours:
Thread-local storage in kernel mode?
I need a TLS-equivalent in the kernel-mode driver. To be exact, I have a deep function call tree which originates at some point (driver's dispatch routine for instance), and I need to pass the context information.
In my specific case the catch is that I don't need a persistent storage, I just need a thread-specific placeholder for something for a single top-level function call. Hence I decided to use an arbitrary entry in the TLS array for the function call, and after it's done - restore its original value.
You get the TLS array by the following:
DWORD* get_Tls()
{
return (DWORD*) (__readfsdword(0x18) + 0xe10);
}
BTW I have no idea why the TIB is usually accessed by reading the contents of fs:[0x18]. It's just pointed by the fs selector. But this is how all the MS's code accesses it, hence I decided to do this as well.
Next, you choose an arbitrary TLS index, say 0.
const DWORD g_dwMyTlsIndex = 0;
void MyTopLevelFunc()
{
// prolog
DWORD dwOrgVal = get_Tls()[g_dwMyTlsIndex];
get_Tls()[g_dwMyTlsIndex] = dwMyContextValue;
DoSomething();
// epilog
get_Tls()[g_dwMyTlsIndex] = dwOrgVal;
}
void DoSomething()
{
DWORD dwMyContext = get_Tls()[g_dwMyTlsIndex];
}

Related

CUDA dynamic parallelism: Access child kernel results in global memory

I am currently trying my first dynamic parallelism code in CUDA. It is pretty simple. In the parent kernel I am doing something like this:
int aPayloads[32];
// Compute aPayloads start values here
int* aGlobalPayloads = nullptr;
cudaMalloc(&aGlobalPayloads, (sizeof(int) *32));
cudaMemcpyAsync(aGlobalPayloads, aPayloads, (sizeof(int)*32), cudaMemcpyDeviceToDevice));
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
// Access results in payload array here
Assuming that I do things right so far, what is the fastest way to access the results in aGlobalPayloads after kernel execution? (I tried cudaMemcpy() to copy aGlobalPayloads back to aPayloads but cudaMemcpy() is not allowed in device code).

You can directly access the data in aGlobalPayloads from your parent kernel code, without any copying:
mykernel<<<1, 1>>>(aGlobalPayloads); // Modifies data in aGlobalPayloads
cudaDeviceSynchronize();
int myval = aGlobalPayloads[0];
I'd encourage careful error checking (Read the whole accepted answer here). You do it in device code the same way as in host code. The programming guide states: "May not pass in local or shared memory pointers". Your usage of aPayloads is a local memory pointer.
If for some reason you want that data to be explicitly put back in your local array, you can use in-kernel memcpy for that:
memcpy(aPayloads, aGlobalPayloads, sizeof(int)*32);
int myval = aPayloads[0]; // retrieves the same value
(that is also how I would fix the issue I mention in item 2 - use in-kernel memcpy)

How do I disable ASLR for heap addresses for a program compiled and linked with mingw-w64 GCC? [duplicate]

For debugging purposes, I would like malloc to return the same addresses every time the program is executed, however in MSVC this is not the case.
For example:
#include <stdlib.h>
#include <stdio.h>
int main() {
int test = 5;
printf("Stack: %p\n", &test);
printf("Heap: %p\n", malloc(4));
return 0;
}
Compiling with cygwin's gcc, I get the same Stack address and Heap address everytime, while compiling with MSVC with aslr off...
cl t.c /link /DYNAMICBASE:NO /NXCOMPAT:NO
...I get the same Stack address every time, but the Heap address changes.
I have already tried adding the registry value HKLM\SYSTEM\CurrentControlSet\Control\Session Manager\Memory Management\MoveImages but it does not work.

Both the stack address and the pointer returned by malloc() may be different every time. As a matter of fact both differ when the program is compiled and run on Mac/OS multiple times.
The compiler and/or the OS may cause this behavior to try and make it more difficult to exploit software flaws. There might be a way to prevent this in some cases, but if your goal is to replay the same series of malloc() addresses, other factors may change the addresses, such as time sensitive behaviors, file system side effects, not to mention non-deterministic thread behavior. You should try and avoid relying on this for your tests.
Note also that &test should be cast as (void *) as %p expects a void pointer, which is not guaranteed to have the same representation as int *.

It turns out that you may not be able to obtain deterministic behaviour from the MSVC runtime libraries. Both the debug and the production versions of the C/C++ runtime libraries end up calling a function named _malloc_base(), which in turn calls the Win32 API function HeapAlloc(). Unfortunately, neither HeapAlloc() nor the function that provides its heap, HeapCreate(), document a flag or other way to obtain deterministic behaviour.
You could roll up your own allocation scheme on top of VirtualAlloc(), as suggested by #Enosh_Cohen, but then you'd loose the debug functionality offered by the MSVC allocation functions.

Diomidis' answer suggests making a new malloc on top of VirtualAlloc, so I did that. It turned out to be somewhat challenging because VirtualAlloc itself is not deterministic, so I'm documenting the procedure I used.
First, grab Doug Lea's malloc. (The ftp link to the source is broken; use this http alternative.)
Then, replace the win32mmap function with this (hereby placed into the public domain, just like Doug Lea's malloc itself):
static void* win32mmap(size_t size) {
/* Where to ask for the next address from VirtualAlloc. */
static char *next_address = (char*)(0x1000000);
/* Return value from VirtualAlloc. */
void *ptr = 0;
/* Number of calls to VirtualAlloc we have made. */
int tries = 0;
while (!ptr && tries < 100) {
ptr = VirtualAlloc(next_address, size,
MEM_RESERVE|MEM_COMMIT, PAGE_READWRITE);
if (!ptr) {
/* Perhaps the requested address is already in use. Try again
* after moving the pointer. */
next_address += 0x1000000;
tries++;
}
else {
/* Advance the request boundary. */
next_address += size;
}
}
/* Either we got a non-NULL result, or we exceeded the retry limit
* and are going to return MFAIL. */
return (ptr != 0)? ptr: MFAIL;
}
Now compile and link the resulting malloc.c with your program, thereby overriding the MSVCRT allocator.
With this, I now get consistent malloc addresses.
But beware:
The exact address I used, 0x1000000, was chosen by enumerating my address space using VirtualQuery to look for a large, consistently available hole. The address space layout appears to have some unavoidable non-determinism even with ASLR disabled. You may have to adjust the value.
I confirmed this works, in my particular circumstances, to get the same addresses during 100 sequential runs. That's good enough for the debugging I want to do, but the values might change after enough iterations, or after rebooting, etc.
This modification should not be used in production code, only for debugging. The retry limit is a hack, and I've done nothing to track when the heap shrinks.

Using Thrust Functions with raw pointers: Controlling the allocation of memory

I have a question regarding the thrust library when using CUDA.
I am using a thrust function, i.e. exclusive_scan, and I want to use raw pointers. I am using raw (device) pointers because I want to have full control of when the memory is allocated and deallocated.
After the function call, I will hand over the pointer to another data structure and then free the memory in either the destructor of this data structure, or in the next function call, when I recompute my (device) pointers. I came across for example this problem here now, which recommends to wrap the data structure in a device_vector. But then I run into the problem that the memory is freed once my device_vector goes out of scope, which I do not want. Having the device pointer globally is also not an option, since I am hacking code, i.e. it is used as a buffer and I would have to rewrite a lot if I wanted to do something like that.
Does anyone have a good workaround regarding this? The only chance I do see right now is to rewrite the thrust-function on my own, only using raw device-pointers.
EDIT: I misread, I can wrap it in a device_ptr instead of a device_vector.
Asking further though, how could I solve this if there wasn't the option of using a device_ptr?

There is no problem using plain pointers in thrust methods.
For data on the device do:
....
struct DoSomething {
__device__ int operator()(int item) { return 1; }
};
int* IntData;
cudaMalloc(&IntData, sizeof(int) * count);
auto dev_data = device_pointer_cast(IntData);
thrust::generate(dev_data, dev_data + count, DoSomething());
thrust::sort(dev_data, dev_data + count);
....
cudaFree(IntData);
For data on the host use plain malloc/free and raw_pointer_cast instead of device_pointer_cast.
See: thrust: Memory management

DRIVER_OBJECT.DriverSection

Does anyone have an idea what is the structure of the DriverSection pointer in the x64 bit version of win7. In 32 bit I used the following:
typedef struct _KLDR_DATA_TABLE_ENTRY {
LIST_ENTRY InLoadOrderLinks;
PVOID ExceptionTable;
ULONG ExceptionTableSize;
//ULONG padding1;
PVOID GpValue;
PVOID NonPagedDebugInfo;
PVOID DllBase;
PVOID EntryPoint;
ULONG SizeOfImage;
UNICODE_STRING FullDllName;
UNICODE_STRING BaseDllName;
ULONG Flags;
USHORT LoadCount;
USHORT __Unused5;
PVOID SectionPointer;
ULONG CheckSum;
//ULONG Padding2;
PVOID LoadedImports;
PVOID PatchInformation;
} KLDR_DATA_TABLE_ENTRY, *PKLDR_DATA_TABLE_ENTRY;
And everything was working but on x64 it is crashing when trying to dereference the LIST_ENTRY. Any pointers/tips would be greatly appreciated

And everything was working but on x64 it is crashing when trying to dereference the LIST_ENTRY. Any pointers/tips would be greatly appreciated
If you can hook up a kernel debugger, you can verify whether or not the DriverSection object matches your definition. To do this, pick a driver you wish to debug - I tend to use a simple one I have. Load its symbols by fixing the symbol path to include its pdb, then break into windbg or kd and type:
.reload
To reload the symbols. Then you can load the driver with:
sc start drivername
having created its service assuming it is a legacy driver. Type:
bu drivername!DriverEntry
to set a breakpoint on the DriverEntry for this module. The difference between bp and bu is that bu breakpoints are evaluated and set on module load. Currently, of course, DriverEntry won't be called, but if we reload the driver it will:
sc stop drivername
sc start drivername
Now your breakpoint should be hit and rcx will contain the DRIVER_OBJECT structure, since it is a pointer argument and pointer/integer arguments are passed in rcx,rdx,r8,r9 according to the Windows ABI. So, you can print out the driver object structure with:
dt _DRIVER_OBJECT (address of rcx)
Which will give you a pointer to the driver section. Then, type:
dt _LDR_DATA_TABLE_ENTRY (driver section object pointer)
This should give you your driver section object. _LDR_DATA_TABLE_ENTRY is actually present in the Windows symbols, so this will work.
Using the debugger, you should be able to dereference the LIST_ENTRY pointers (.flink and .blink) successfully (try dpps on the address of the _LDR_DATA_TABLE_ENTRYstructure, for example). If you do it successfully, one of those addresses will resolve tont!PsLoadedModuleList`.
What I'm trying to say is that in a roundabout way, is that:
Either there is a bug in your code somewhere, or
You've hit upon a synchronisation issue. Remember, this structure is supposed to be opaque and we're not supposed to be modifying it in any way. It's also liable to change on us, and we don't know where the lock for synchronising access to it is.
If you are certain 1 is not the case, it is likely 2 is. Luckily, Microsoft actually provided a function to get the information from these structures called AuxKlibQueryModuleInformation(). You do need to add an extra library to your driver, but that's not the end of the world. Include Aux_klib.h. There's also a code sample on the MSDN page showing how to use it - it's pretty straightforward.

Simple way to hook registry access for specific process

Is there a simple way to hook registry access of a process that my code executes? I know about SetWindowsHookEx and friends, but its just too complex... I still have hopes that there is a way as simple as LD_PRELOAD on Unix...

Read up on the theory of DLL Injection here: http://en.wikipedia.org/wiki/DLL_injection
However, I will supply you with a DLL Injection snippet here: http://www.dreamincode.net/code/snippet407.htm
It's pretty easy to do these types of things once you're in the memory of an external application, upon injection, you might as well be a part of the process.
There's something called detouring, which I believe is what you're looking for, it simply hooks a function, and when that process calls it, it executes your own function instead. (To ensure that it doesn't crash, call the function at the end of your function)
So if you were wanting to write your own function over CreateRegKeyEx
(http://msdn.microsoft.com/en-us/library/ms724844%28v=vs.85%29.aspx)
It might look something like this:
LONG WINAPI myRegCreateKeyEx(HKEY hKey, LPCTSTR lpSubKey, DWORD Reserved, LPTSTR lpClass, DWORD dwOptions, REGSAM samDesired, LPSECURITY_ATTRIBUTES lpSecurityAttributes, PHKEY phkResult, LPDWORD lpdwDisposition)
{
//check for suspicious keys being made via the parameters
RegCreateKeyEx(hKey, lpSubKey, Reserved, lpClass, dwOptions, samDesired, lpSecurityAttributes, phkResult, lpdwDisposition);
}
You can get a very well written detour library called DetourXS here: http://www.gamedeception.net/threads/10649-DetourXS
Here is his example code of how to establish a detour using it:
#include <detourxs.h>
typedef DWORD (WINAPI* tGetTickCount)(void);
tGetTickCount oGetTickCount;
DWORD WINAPI hGetTickCount(void)
{
printf("GetTickCount hooked!");
return oGetTickCount();
}
// To create the detour
oGetTickCount = (tGetTickCount) DetourCreate("kernel32.dll", "GetTickCount", hGetTickCount, DETOUR_TYPE_JMP);
// ...Or an address
oGetTickCount = (tGetTickCount) DetourCreate(0x00000000, hGetTickCount, DETOUR_TYPE_JMP);
// ...You can also specify the detour len
oGetTickCount = (tGetTickCount) DetourCreate(0x00000000, hGetTickCount, DETOUR_TYPE_JMP, 5);
// To remove the detour
DetourRemove(oGetTickCount);
And if you can't tell, that snippet is hooking GetTickCount() and whenever the function is called, he writes "GetTickCount hooked!" -- then he executes the function GetTickCount is it was intended.
Sorry for being so scattered with info, but I hope this helps. :)
-- I realize this is an old question. --

Most winapi calls generate symbol table entries for inter modular calls, this makes it pretty simple to hook them, all you need to do is overwrite the IAT addresses. Using something such as MSDetours, it can be done safely in a few lines of code. MSDetours also provides the tools to inject a custom dll into the target process so you can do the hooking

SetWindowsHookEx won't help at all - it provides different functionality.
Check if https://web.archive.org/web/20080212040635/http://www.codeproject.com/KB/system/RegMon.aspx helps. SysInternals' RegMon uses a kernel-mode driver which is very complicated way.
Update: Our company offers CallbackRegistry product, that lets you track registry operations without hassle. And BTW we offer free non-commercial licenses upon request (subject to approval on case by case basis).

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio