Is there a way to force Windows to cache a file? - windows

Is there a batch command or something similar that will force Windows to cache a file? I am trying to create a game preloader that loads certain game files into the cache before starting the game. Is there any way I can do this?
updated int main code:
int main(int argc, const char** argv)
{
    pf("C:\\Games\\World_of_Tanks\\res\\packages\\gui.pkg");
    return 0;
}

All you need to do is load the files, either using ReadFile or by memory mapping the files and touching every page (in fact, due to allocation granularity every 16th page suffices, but in theory you should be touching every page).
Memory mapping is faster and more cache-friendly, since you do not need to allocate extra memory to hold the data (which you aren't going to use for anything useful!). The OS will reuse the same physical memory for the cache and for the virtual memory that your process can see.
Several mainstream applications, including Microsoft Office and Adobe Reader, do exactly that to launch faster. It's those "delayed start" services that keep your hard disk light flashing for a dozen seconds after you log in.
Do note, however, that while you can force Windows¹ to cache files that way, you cannot force it to keep the files in the cache indefinitely. If there is not enough physical RAM available, the system will throw away cache contents in order to satisfy application demands.
EDIT: Minimal working example using file mapping:
#include <windows.h>
#include <cstdio>

void pf(const char* name)
{
    HANDLE file = CreateFile(name, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if(file == INVALID_HANDLE_VALUE) { printf("couldn't open %s\n", name); return; }
    unsigned int len = GetFileSize(file, 0);
    HANDLE mapping = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, 0);
    if(mapping == 0) { printf("couldn't map %s\n", name); return; }
    const char* data = (const char*) MapViewOfFile(mapping, FILE_MAP_READ, 0, 0, 0);
    if(data)
    {
        printf("prefetching %s... ", name);
        // need volatile or need to use result -- compiler will otherwise optimize out whole loop
        volatile unsigned int touch = 0;
        for(unsigned int i = 0; i < len; i += 4096)
            touch += data[i];
    }
    else
        printf("couldn't create view of %s\n", name);
    UnmapViewOfFile(data);
    CloseHandle(mapping);
    CloseHandle(file);
}

int main(int argc, const char** argv)
{
    if(argc >= 2) for(int i = 1; argv[i]; ++i) pf(argv[i]);
    return 0;
}
The program will try to prefetch any file name given on the command line.
The code isn't overly pretty, but it works. It uses ANSI file names, and it leaks a file handle in case opening succeeds but mapping fails (but bleh... it's not really a problem, the OS will clean up after the program exits -- if that annoys you, wrap the handles in RAII). It's also limited to ca. 1.8 GiB due to address space in a 32-bit build, and otherwise to 4 GiB due to GetFileSize, but that is trivial to fix if you really need files that big; a sketch follows below.
Instead of volatile one might want to return or otherwise consume the "result", but either way works (volatile has no measurable impact on performance compared to a disk access!).
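For what it's worth, here is a sketch of that large-file fix, under the same assumptions as the code above (the 64 MiB chunk size and the function name are mine): use GetFileSizeEx for the 64-bit file size and touch the file through a series of bounded views instead of a single mapping.
// Sketch only: prefetch a file of any size by mapping it in 64 MiB views.
// GetFileSizeEx returns the full 64-bit size; MapViewOfFile takes the view
// offset split into high/low DWORDs. The chunk size is arbitrary but must
// stay a multiple of the 64 KiB allocation granularity.
void pf_large(const char* name)
{
    HANDLE file = CreateFile(name, GENERIC_READ, FILE_SHARE_READ | FILE_SHARE_WRITE, 0, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, 0);
    if(file == INVALID_HANDLE_VALUE) return;
    LARGE_INTEGER size;
    if(!GetFileSizeEx(file, &size)) { CloseHandle(file); return; }
    HANDLE mapping = CreateFileMapping(file, 0, PAGE_READONLY, 0, 0, 0);
    if(mapping)
    {
        const unsigned long long chunk = 64ull << 20; // 64 MiB per view
        volatile unsigned int touch = 0;
        for(unsigned long long off = 0; off < (unsigned long long)size.QuadPart; off += chunk)
        {
            unsigned long long remaining = (unsigned long long)size.QuadPart - off;
            SIZE_T view = (SIZE_T)(remaining < chunk ? remaining : chunk);
            const char* data = (const char*) MapViewOfFile(mapping, FILE_MAP_READ, (DWORD)(off >> 32), (DWORD)off, view);
            if(!data) break;
            for(SIZE_T i = 0; i < view; i += 4096) touch += data[i]; // touch one byte per page
            UnmapViewOfFile(data);
        }
    }
    CloseHandle(mapping);
    CloseHandle(file);
}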
¹Truth be told, you actually can't force Windows, but it incidentally always works that way unless you explicitly request unbuffered I/O.
In theory, you could force the OS to read pages into memory and even force it to keep them in RAM by locking the memory, but your working set quota (which is very small, and you need administrative rights to modify it) will not normally let you do this. That's a good thing, though, since locking large amounts of memory is a very bad idea.
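For completeness, this is roughly what that (usually inadvisable) locking variant would look like. SetProcessWorkingSetSize has to grow the quota before VirtualLock can succeed, and the padding values below are made-up illustrations, not recommendations:
// Sketch only: pin 'bytes' of an already-allocated buffer in RAM.
// The amounts added to the working set bounds are arbitrary.
bool pin(void* p, SIZE_T bytes)
{
    if(!SetProcessWorkingSetSize(GetCurrentProcess(), bytes + (16 << 20), bytes + (64 << 20)))
        return false;                      // quota could not be grown
    return VirtualLock(p, bytes) != 0;     // pages stay resident until VirtualUnlock
}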

Related

How to apply Unified Memory to existing aligned host memory

I'm involved in an effort to integrate CUDA into some existing software. The software I'm integrating into is pseudo real-time, so it has a memory manager library that manually passes out pointers from a single large memory allocation made up front. CUDA's Unified Memory is attractive to us since, in theory, we'd be able to change this large memory chunk to Unified Memory, have the existing CPU code still work, and add GPU kernels with very few changes to the existing data I/O stream.
Parts of our existing CPU processing code require memory to be aligned to a certain alignment. cudaMallocManaged() does not allow me to specify the alignment, and I feel like having to copy between "managed" and strictly-CPU buffers for these CPU sections almost defeats the purpose of UM. Is there a known way to address this issue that I'm missing?
I found this link on Stack Overflow that seems to solve it in theory, but I've been unable to produce good results with this method. Using CUDA 9.1, Tesla M40 (24GB):
#include <stdio.h>
#include <malloc.h>
#include <cuda.h>

#define USE_HOST_REGISTER 1

int main (int argc, char **argv)
{
    int num_float = 10;
    int num_bytes = num_float * sizeof(float);
    float *f_data = NULL;

    #if (USE_HOST_REGISTER > 0)
    printf("%s: Using memalign + cudaHostRegister..\n", argv[0]);
    f_data = (float *) memalign(32, num_bytes);
    cudaHostRegister((void *) f_data, num_bytes, cudaHostRegisterDefault);
    #else
    printf("%s: Using cudaMallocManaged..\n", argv[0]);
    cudaMallocManaged((void **) &f_data, num_bytes);
    #endif

    struct cudaPointerAttributes att;
    cudaPointerGetAttributes(&att, f_data);
    printf("%s: ptr is managed: %i\n", argv[0], att.isManaged);
    fflush(stdout);
    return 0;
}
When using memalign() + cudaHostRegister() (USE_HOST_REGISTER == 1), the last print statement prints 0. Device accesses via kernel launches in larger files unsurprisingly report illegal accesses.
When using cudaMallocManaged() (USE_HOST_REGISTER == 0), the last print statement prints 1 as expected.
edit: cudaHostRegister() and cudaMallocManaged() do return successful error codes for me. I left this error checking out of the sample I shared, but I did check them during my initial integration work. I just added the checking code again, and both still return cudaSuccess.
Thanks for your insights and suggestions.
There is no method currently available in CUDA to take an existing host memory allocation and convert it into a managed memory allocation.
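If the goal is just alignment rather than conversion, one workaround (my own sketch, not something the CUDA API provides for this purpose; the helper name is mine, and the 32-byte figure mirrors the memalign call above) is to over-allocate with cudaMallocManaged and align the pointer yourself:
// Sketch only: carve an aligned pointer out of a managed allocation.
// Keep base_out around -- cudaFree must be called on the original pointer,
// not on the aligned one.
#include <cuda_runtime.h>
#include <cstdint>

float* alloc_managed_aligned(size_t num_bytes, size_t align, void** base_out)
{
    void* base = NULL;
    if (cudaMallocManaged(&base, num_bytes + align) != cudaSuccess)
        return NULL;
    *base_out = base;
    uintptr_t p = ((uintptr_t)base + align - 1) & ~(uintptr_t)(align - 1);
    return (float*)p;  // managed, and aligned to 'align' bytes
}
In practice CUDA allocations are already generously aligned (device allocations are typically at least 256-byte aligned), so for a 32-byte requirement the rounding is usually a no-op; the over-allocation only matters for unusually large alignments.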

How to identify what parts of the allocated virtual memory a process is using

I want to be able to search through the allocated memory of a process (say you open Notepad, type "HelloWorld", and then run a search looking for the string "HelloWorld"). For 32-bit applications this is not a problem, but for 64-bit applications the large quantity of allocated virtual memory takes hours to search through.
Obviously the vast majority of applications are not using the full amount of virtual memory allocated. I can identify the areas of memory allocated to each process with VirtualQueryEx and read them with ReadProcessMemory, but when it comes to 64-bit applications this still takes hours to complete.
Does anyone know of any resources or any methods that could be used to help narrow down the amount of memory to be searched?
It is important that you only scan proper memory. If you just scanned blindly from 0x0 to 0xFFFFFFFFF, it would take at least 5 seconds in most processes. You can skip bad regions of memory by checking the page settings using VirtualQueryEx. This retrieves a MEMORY_BASIC_INFORMATION structure which describes the state of that memory region.
If MemoryBasicInformation.State is not MEM_COMMIT, then it is bad memory.
If MBI.Protect is PAGE_NOACCESS, you also want to skip this memory.
If VirtualQueryEx fails, skip to the next region.
In this manner it should only take 0-2 seconds to scan the memory of your average process, because it is only scanning good memory.
char* ScanEx(char* pattern, char* mask, char* begin, intptr_t size, HANDLE hProc)
{
    char* match{ nullptr };
    SIZE_T bytesRead;
    DWORD oldprotect;
    char* buffer{ nullptr };
    MEMORY_BASIC_INFORMATION mbi;
    mbi.RegionSize = 0x1000; // seed value so the first loop increment is sane
    VirtualQueryEx(hProc, (LPCVOID)begin, &mbi, sizeof(mbi));
    for (char* curr = begin; curr < begin + size; curr += mbi.RegionSize)
    {
        if (!VirtualQueryEx(hProc, curr, &mbi, sizeof(mbi))) continue;
        if (mbi.State != MEM_COMMIT || mbi.Protect == PAGE_NOACCESS) continue;
        delete[] buffer;
        buffer = new char[mbi.RegionSize];
        if (VirtualProtectEx(hProc, mbi.BaseAddress, mbi.RegionSize, PAGE_EXECUTE_READWRITE, &oldprotect))
        {
            ReadProcessMemory(hProc, mbi.BaseAddress, buffer, mbi.RegionSize, &bytesRead);
            VirtualProtectEx(hProc, mbi.BaseAddress, mbi.RegionSize, oldprotect, &oldprotect);
            char* internalAddr = ScanBasic(pattern, mask, buffer, (intptr_t)bytesRead);
            if (internalAddr != nullptr)
            {
                // translate the match from our local buffer back to the remote address
                match = curr + (internalAddr - buffer);
                break;
            }
        }
    }
    delete[] buffer;
    return match;
}
ScanBasic is just a standard comparison function which compares your pattern against the buffer.
Second, if you know the address is relative to a module, only scan the address range of that module; you can get the base and size of the module via ToolHelp32Snapshot, as in the sketch below. If you know it's dynamic memory on the heap, then only scan the heap. You can also get all the heaps with ToolHelp32Snapshot and TH32CS_SNAPHEAPLIST.
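A sketch of the module case (the helper name and error handling are mine):
// Sketch only: find the base address and size of one module in the target
// process so the scan can be restricted to that range.
#include <windows.h>
#include <tlhelp32.h>

bool GetModuleRange(DWORD pid, const wchar_t* name, char*& base, intptr_t& size)
{
    HANDLE snap = CreateToolhelp32Snapshot(TH32CS_SNAPMODULE | TH32CS_SNAPMODULE32, pid);
    if (snap == INVALID_HANDLE_VALUE) return false;
    MODULEENTRY32W me;
    me.dwSize = sizeof(me);
    bool found = false;
    for (BOOL ok = Module32FirstW(snap, &me); ok; ok = Module32NextW(snap, &me))
    {
        if (_wcsicmp(me.szModule, name) == 0)
        {
            base = (char*)me.modBaseAddr;    // start of the module image
            size = (intptr_t)me.modBaseSize; // image size in bytes
            found = true;
            break;
        }
    }
    CloseHandle(snap);
    return found;
}
With that, ScanEx(pattern, mask, base, size, hProc) touches only the module's pages.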
You can also make a wrapper for this function for scanning the entire address space of the process; it might look something like this:
char* Pattern::Ex::ScanProc(char* pattern, char* mask, ProcEx& proc)
{
    unsigned long long int kernelMemory = IsWow64Proc(proc.handle) ? 0x80000000 : 0x800000000000;
    return Scan(pattern, mask, 0x0, (intptr_t)kernelMemory, proc.handle);
}

Trap memory accesses inside a standard executable built with MinGW

So my problem is this.
I have some platform-dependent code (embedded system) which writes to some MMIO locations that are hardcoded at specific addresses.
I compile this code together with some management code inside a standard executable, mainly for testing, but also for simulation (because it takes longer to find basic bugs on the actual HW platform).
To get around the hardcoded pointers, I just redefine them as variables inside a memory pool, and this works really well.
The problem is that some of the MMIO locations have specific hardware behavior (w1c, write-1-to-clear, for example) which makes "correct" testing hard to impossible.
These are the solutions I have thought of:
1 - Somehow redefine the accesses to those registers and insert some intermediate function to simulate the dynamic behavior. This is not really usable, since there are various ways the MMIO locations are written to (pointers and such).
2 - Leave the addresses hardcoded and trap the illegal accesses through seg faults: find the location that triggered the fault, extract exactly where the access was made, handle it, and return. I am not really sure how this would work (or even whether it's possible).
3 - Use some sort of emulation. This would surely work, but it would defeat the whole purpose of running fast and natively on a standard computer.
4 - Virtualization?? It would probably take a lot of time to implement, and I'm not sure the gain is justifiable.
Does anyone have any idea whether this can be accomplished without going too deep? Maybe there is a way to manipulate the compiler to define a memory area for which every access generates a callback? I'm not really an expert in x86/gcc stuff.
Edit: It seems that it's not really possible to do this in a platform-independent way, and since it will be Windows only, I will use the available API (which seems to work as expected). I found this Q here:
Is set single step trap available on win 7?
I will put the whole "simulated" register file inside a number of pages, guard them, and trigger a callback from which I will extract all the necessary info, do my stuff, then continue execution.
Thanks all for responding.
I think #2 is the best approach. I routinely use approach #4, but I use it to test code that is running in the kernel, so I need a layer below the kernel to trap and emulate the accesses. Since you have already put your code into a user-mode application, #2 should be simpler.
The answers to this question may provide help in implementing #2. How to write a signal handler to catch SIGSEGV?
What you really want to do, though, is to emulate the memory access and then have the SIGSEGV handler return to the instruction after the access. This sample code works on Linux. I'm not sure whether the behavior it takes advantage of is undefined, though.
// Note: compile as C++ (static_cast below); ucontext_t and the REG_* indices
// come from <ucontext.h> (GNU extension, enabled by default under g++).
#include <stdint.h>
#include <stdio.h>
#include <signal.h>
#include <ucontext.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    // destination register fixed to eax; the handler assumes a 2-byte mov encoding
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static void segv_handler(int, siginfo_t *, void *);

int main()
{
    struct sigaction action = { 0, };
    action.sa_sigaction = segv_handler;
    action.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &action, NULL);

    // force sigsegv
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static void segv_handler(int, siginfo_t *info, void *ucontext_arg)
{
    ucontext_t *ucontext = static_cast<ucontext_t *>(ucontext_arg);
    ucontext->uc_mcontext.gregs[REG_RAX] = 1234; // emulate the read's result
    ucontext->uc_mcontext.gregs[REG_RIP] += 2;   // skip the 2-byte mov
}
The code to read the register is written in assembly to ensure that both the destination register and the length of the instruction are known.
This is what the Windows version of prl's answer could look like:
#include <stdint.h>
#include <stdio.h>
#include <windows.h>

#define REG_ADDR ((volatile uint32_t *)0x12340000f000ULL)

static uint32_t read_reg(volatile uint32_t *reg_addr)
{
    uint32_t r;
    asm("mov (%1), %0" : "=a"(r) : "r"(reg_addr));
    return r;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *);

int main()
{
    SetUnhandledExceptionFilter(segv_handler);

    // force access violation
    uint32_t a = read_reg(REG_ADDR);

    printf("after segv, a = %d\n", a);
    return 0;
}

static LONG WINAPI segv_handler(EXCEPTION_POINTERS *ep)
{
    // only handle a read access violation at REG_ADDR
    if (ep->ExceptionRecord->ExceptionCode != EXCEPTION_ACCESS_VIOLATION ||
        ep->ExceptionRecord->ExceptionInformation[0] != 0 ||
        ep->ExceptionRecord->ExceptionInformation[1] != (ULONG_PTR)REG_ADDR)
        return EXCEPTION_CONTINUE_SEARCH;
    ep->ContextRecord->Rax = 1234;  // emulate the read's result
    ep->ContextRecord->Rip += 2;    // skip the 2-byte mov
    return EXCEPTION_CONTINUE_EXECUTION;
}
So, the solution (code snippet) is as follows:
First of all, I have a variable:
__attribute__ ((aligned (4096))) int g_test;
Second, inside my main function, I do the following:
AddVectoredExceptionHandler(1, VectoredHandler);
DWORD old;
VirtualProtect(&g_test, 4096, PAGE_READWRITE | PAGE_GUARD, &old);
The handler looks like this:
// PAGE_MASK is not defined by the Windows headers; assume 4 KiB pages
#define PAGE_MASK 0xFFF

LONG WINAPI VectoredHandler(struct _EXCEPTION_POINTERS *ExceptionInfo)
{
    static DWORD_PTR last_addr; // full pointer width, so x64 addresses aren't truncated
    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_GUARD_PAGE_VIOLATION) {
        last_addr = ExceptionInfo->ExceptionRecord->ExceptionInformation[1];
        ExceptionInfo->ContextRecord->EFlags |= 0x100; /* Single step to trigger the next one */
        return EXCEPTION_CONTINUE_EXECUTION;
    }
    if (ExceptionInfo->ExceptionRecord->ExceptionCode == STATUS_SINGLE_STEP) {
        DWORD old;
        VirtualProtect((PVOID)(last_addr & ~(DWORD_PTR)PAGE_MASK), 4096, PAGE_READWRITE | PAGE_GUARD, &old);
        return EXCEPTION_CONTINUE_EXECUTION;
    }
    return EXCEPTION_CONTINUE_SEARCH;
}
This is only a basic skeleton of the functionality. Basically I guard the page on which the variable resides, and I keep some linked lists holding pointers to the callback functions and values for the addresses in question. I check that the faulting address is inside my list, then I trigger the callback.
On the first guard hit, the page protection is disabled by the system, but I can call my PRE_WRITE callback, where I can save the variable's state. Because a single step is requested through EFlags, it is followed immediately by a single-step exception (which means the variable has been written), and I can trigger a WRITE callback. All the data required for the operation is contained in the ExceptionInformation array.
When someone tries to write to that variable:
*(int *)&g_test = 1;
a PRE_WRITE followed by a WRITE will be triggered, and when I do:
int x = *(int *)&g_test;
a READ will be issued.
In this way I can manipulate the data flow in a way that does not require modifications of the original source code.
Note: This is intended to be used as part of a test framework, and any performance penalty is deemed acceptable.
For example, a W1C (write-1-to-clear) register can be modeled like this:
void MYREG_hook(reg_cbk_t type)
{
    /** We need to save the pre-write state.
     * This is safe since we are assured to be called with
     * both PRE_WRITE and WRITE in the correct order.
     */
    static int pre;
    switch (type) {
    case REG_READ:      /* Called pre-read */
        break;
    case REG_PRE_WRITE: /* Called pre-write */
        pre = g_test;
        break;
    case REG_WRITE:     /* Called after write */
        g_test = pre & ~g_test; /* W1C */
        break;
    default:
        break;
    }
}
This was also possible with seg faults on illegal addresses, but I had to take one fault for each read or write and keep track of a "virtual register file", so the penalty was bigger. This way I can guard only specific areas of memory, or none, depending on the registered monitors.

When MPI_Send doesn't block

I have some code that implements a manual MPI broadcast, basically a demo that unicasts an integer from the root to all other nodes. Of course, unicasting to many nodes is less efficient than MPI_Bcast(), but I just want to check how things work.
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>
void my_bcast(void* data, int count, MPI::Datatype datatype, int root, MPI::Intracomm communicator) {
    int world_size = communicator.Get_size();
    int world_rank = communicator.Get_rank();
    if (world_rank == root) {
        // If we are the root process, send our data to everyone
        int i;
        for (i = 0; i < world_size; i++) {
            if (i != world_rank) {
                communicator.Send(data, count, datatype, i, 0);
            }
        }
    } else {
        // If we are a receiver process, receive the data from the root
        communicator.Recv(data, count, datatype, root, 0);
    }
}

int main(int argc, char** argv) {
    MPI::Init();
    int world_rank = MPI::COMM_WORLD.Get_rank();
    int data;
    if (world_rank == 0) {
        data = 100;
        printf("Process 0 broadcasting data %d\n", data);
        my_bcast(&data, 1, MPI::INT, 0, MPI::COMM_WORLD);
    } else {
        my_bcast(&data, 1, MPI::INT, 0, MPI::COMM_WORLD);
        printf("Process %d received data %d from root process\n", world_rank, data);
    }
    MPI::Finalize();
}
What I noticed is that if I remove the check that the root doesn't send to itself,
if (i != world_rank) {
...
}
the program still works and doesn't block, whereas the default behavior of MPI_Send() is supposed to be blocking, i.e. to wait until the data has been received at the other end. But MPI_Recv() is never invoked by the root. Can someone explain why this is happening?
I run the code from the root with the following command (the cluster is set up on Amazon EC2, using NFS as shared storage among the nodes, and all machines have Open MPI 1.10.2 installed):
mpirun -mca btl ^openib -mca plm_rsh_no_tree_spawn 1 /EC2_NFS/my_bcast
The C file is compiled with
mpic++ my_bcast.c
and mpic++ version is 5.4.0.
The code is taken from www.mpitutorial.com
You are mistaking blocking for synchronous behaviour. Blocking means that the call does not return until the operation has completed. The standard send operation (MPI_Send) completes once the supplied buffer is free to be reused by the program. This means either that the message is fully in transit to the receiver or that it was stored internally by the MPI library for later delivery (buffered send). The buffering behaviour is implementation-specific, but most libraries will happily buffer a message the size of a single integer. Force synchronous mode by using MPI_Ssend (or its C++ equivalent) and your program will hang.
Please note that the C++ MPI bindings are no longer part of the standard and should not be used in the development of new software. Use the C bindings MPI_Blabla instead.
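To illustrate both points at once, here is a sketch of the broadcast in the C bindings, with MPI_Ssend substituted for the standard send (the function name is mine, and it mirrors the code above rather than being a drop-in replacement):
// Sketch only: C-bindings version with a synchronous send. MPI_Ssend
// completes only when a matching receive has been posted, so removing
// the i != world_rank check here would deadlock the root, since it
// never posts a receive for itself.
void my_bcast_c(void* data, int count, MPI_Datatype datatype, int root, MPI_Comm comm) {
    int world_size, world_rank;
    MPI_Comm_size(comm, &world_size);
    MPI_Comm_rank(comm, &world_rank);
    if (world_rank == root) {
        for (int i = 0; i < world_size; i++)
            if (i != world_rank)
                MPI_Ssend(data, count, datatype, i, 0, comm);
    } else {
        MPI_Recv(data, count, datatype, root, 0, comm, MPI_STATUS_IGNORE);
    }
}
With plain MPI_Send in its place, the send of a single integer typically returns immediately because the library buffers it, which is exactly the behavior observed in the question.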

How can I synchronize data between different cores on Xeon (Linux: how to use memory barriers)

I wrote a simple program to test memory synchronization: a global queue shared between two processes, with each process bound to a different core. My code is below.
#define _GNU_SOURCE   /* for CPU_ZERO/CPU_SET/sched_setaffinity */
#include <stdio.h>
#include <sched.h>
#include <unistd.h>   /* fork, sleep */

void bindcpu(int pid) {
    int cpuid;
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (pid > 0) {
        cpuid = 1;
    } else {
        cpuid = 5;
    }
    CPU_SET(cpuid, &mask);
    if (sched_setaffinity(0, sizeof(mask), &mask) == -1) {
        printf("warning: could not set CPU affinity, continuing...\n");
    }
}

#define Q_LENGTH 512
int g_queue[512];
struct point {
    int volatile w;
    int volatile r;
};
volatile struct point g_p;

void iwrite(int x) {
    while (g_p.r == g_p.w);
    sleep(0.1);  /* note: sleep() takes an unsigned int, so 0.1 truncates to 0 */
    g_queue[g_p.w] = x;
    g_p.w = (g_p.w + 1) % Q_LENGTH;
    printf("#%d!%d", g_p.w, g_p.r);
}

void iread(int *x) {
    while (((g_p.r + 1) % Q_LENGTH) == g_p.w);
    *x = g_queue[g_p.r];
    g_p.r = (g_p.r + 1) % Q_LENGTH;
    printf("-%d*%d", g_p.r, g_p.w);
}

int main(int argc, char * argv[]) {
    int pid;
    pid = fork();
    g_p.r = Q_LENGTH;
    bindcpu(pid);
    int i = 0, j = 0;
    if (pid > 0) {
        printf("call iread\n");
        while (1) {
            iread(&j);
        }
    } else {
        printf("call iwrite\n");
        while (1) {
            iwrite(i);
            i++;
        }
    }
}
The data shared between the two processes, bound to two different cores, didn't synchronize.
CPU: Intel(R) Xeon(R) CPU E3-1230
OS: 3.8.0-35-generic #50~precise1-Ubuntu SMP
I want to know: beyond IPC, how can I synchronize the data between different cores in user space?
If you want your application to manipulate the CPU's shared cache in order to accomplish IPC, I don't believe you will be able to do that.
Chapter 9 of "Linux Kernel Development, Second Edition" has information on synchronizing multi-threaded applications (including atomic operations, semaphores, barriers, etc.):
http://www.makelinux.net/books/lkd2/ch09
so you may get some ideas on what you are looking for there.
Here is a decent write-up on Intel® Smart Cache, "Software Techniques for Shared-Cache Multi-Core Systems": http://archive.is/hm0y
Here are some Stack Overflow questions/answers that may help you find the information you are looking for:
Storing C/C++ variables in processor cache instead of system memory
C++: Working with the CPU cache
Understanding how the CPU decides what gets loaded into cache memory
Sorry for bombarding you with links, but this is the best I can do without a clearer understanding of what you are trying to accomplish.
I suggest reading "Volatile: Almost Useless for Multi-Threaded Programming" for why volatile should be removed from the example code. Instead, use C11 or C++11 atomic operations. See also the Fenced Data Transfer example in the TBB Design Patterns Manual.
Below I show the parts of the question example that I changed to use C++11 atomics. I compiled it with g++ 4.7.2.
#include <atomic>
...
struct point {
    std::atomic<int> w;
    std::atomic<int> r;
};
struct point g_p;

void iwrite(int x) {
    int w = g_p.w.load(std::memory_order_relaxed);
    int r;
    while ((r = g_p.r.load(std::memory_order_acquire)) == w);
    sleep(0.1);
    g_queue[w] = x;
    w = (w + 1) % Q_LENGTH;
    g_p.w.store(w, std::memory_order_release);
    printf("#%d!%d", w, r);
}

void iread(int *x) {
    int r = g_p.r.load(std::memory_order_relaxed);
    int w;
    while (((r + 1) % Q_LENGTH) == (w = g_p.w.load(std::memory_order_acquire)));
    *x = g_queue[r];
    g_p.r.store((r + 1) % Q_LENGTH, std::memory_order_release);
    printf("-%d*%d", r, w);
}
The key changes are:
I removed "volatile" everywhere.
The members of struct point are declared as std::atomic
Some loads and stores of g_p.r and g_p.w are fenced. Others are hoisted.
When loading a variable modified by another thread, the code "snapshots" it into a local variable.
The code uses "relaxed load" (no fence) where a thread loads a variable that no other thread modifies. I hoisted those loads out of the spin loops since there is no point in repeating them.
The code uses "acquiring load" where a thread loads a "message is ready" indicator that is set by another thread, and uses a "releasing store" where it is storing a "message is ready" indicator" to be read by another thread. The release is necessary to ensure that the "message" (queue data) is written before the "ready" indicator (member of g_p) is written. The acquire is likewise necessary to ensure that the "message" is read after the "ready" indicator is seen.
The snapshots are used so that the printf reports the value that the thread actually used, as opposed to some new value that appeared later. In general I like to use the snapshot style for two reasons. First, touching shared memory can be expensive because it often requires cache-line transfers. Second, the style gives me a stable value to use locally without having to worry that a reread might return a different value.
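One caveat if you try this: the question's program uses fork, and after fork each process has its own copy-on-write copy of g_p and g_queue, so no amount of fencing makes writes visible across the two processes (that would need actual shared memory, e.g. shm_open/mmap). A minimal harness, my own sketch, that exercises the atomics as intended by running the two loops as threads in one process:
// Sketch only: writer and reader share one address space, so the
// acquire/release pairs above are actually what synchronizes them.
#include <thread>

int main()
{
    g_p.r.store(Q_LENGTH - 1);  // illustrative initial value, kept in bounds
    std::thread writer([]{ for (int i = 0;; ++i) iwrite(i); });
    std::thread reader([]{ int j; for (;;) iread(&j); });
    writer.join();              // loops run forever; stop with Ctrl-C
    reader.join();
}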
