When using VirtualAlloc I can (ab)use the following property to simplify memory management.
Actual physical pages are not allocated unless/until the virtual addresses are actually accessed.
I run the following code to allocate the block.
type
PArrayMem = ^TArrayMem; //pointer
TArrayMem = packed record //as per documentation
RefCount: Integer;
Length: NativeInt;
Elements: Integer;
end;
var
a: array of integer; //dynamic array, structure see above
procedure TForm38.Button1Click(Sender: TObject);
const
AllocSize = 1024 * 1024 * 1024; //1 GB
var
ArrayMem: PArrayMem;
begin
//SetLength(a, 1024*1024*1024); //1G x 8*16
ArrayMem:= VirtualAlloc(nil, AllocSize, MEM_COMMIT or MEM_RESERVE, PAGE_READWRITE);
ArrayMem.RefCount:= 1;
ArrayMem.Length:= AllocSize div SizeOf(Integer);
a:= @ArrayMem.Elements; //a:= AddressOf(elements)
a[1]:= 10; //testing, works
a[0]:= 4;
a[500000]:= 56; //Works, autocommits, only adds a few k to the used memory
button1.Caption:= IntToStr(a[500000]); //displays '56'
end;
All this works great. If my structure grows to 1,000,000 elements, everything works.
However, suppose that afterwards my structure shrinks back to 1,000 elements.
How do I release the RAM so that it will get auto-magically committed when needed again?
WARNING
David warned me that allocating and committing large (huge) contiguous pages of memory carries a large cost.
So it may be more advantageous to split up the array in smaller blocks and abstract away the internals using a class/record.
You can decommit pages using VirtualFree passing the MEM_DECOMMIT flag. Then you can commit again using VirtualAlloc.
Or you may use the DiscardVirtualMemory function introduced in Windows 8.1.
Use this function to discard memory contents that are no longer needed, while keeping the memory region itself committed. Discarding memory may give physical RAM back to the system. When the region of memory is again accessed by the application, the backing RAM is restored, and the contents of the memory is undefined.
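A rough sketch of both approaches (in C++/Win32 rather than Delphi; the ShrinkCommitted helper and its rounding logic are illustrative, not part of the question):

#include <windows.h>

void ShrinkCommitted(void* base, size_t usedBytes, size_t totalBytes)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // Keep the whole pages that are still in use, decommit the rest.
    size_t keep = (usedBytes + si.dwPageSize - 1) & ~(size_t)(si.dwPageSize - 1);
    if (keep < totalBytes)
        VirtualFree((char*)base + keep, totalBytes - keep, MEM_DECOMMIT);
}

// Later, re-commit explicitly before touching that range again:
//   VirtualAlloc((char*)base + keep, newBytes, MEM_COMMIT, PAGE_READWRITE);
// Or, on Windows 8.1+, keep the region committed and just release the backing RAM:
//   DiscardVirtualMemory((char*)base + keep, totalBytes - keep);

Note that after MEM_DECOMMIT the pages are no longer committed, so touching them raises an access violation until they are re-committed; DiscardVirtualMemory keeps them committed but leaves their contents undefined.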
You may find something useful in the comments to this related question: New Windows 8.1 APIs for virtual memory management: `DiscardVirtualMemory()` vs `VirtualAlloc()` and `MEM_RESET` and `MEM_RESET_UNDO`
Related
I was wondering if there is an existing system call/API for getting the physical address of a given virtual address?
If there is none, then some direction on how to get that working would help.
Also, how do I get the physical address of MMIO, which is non-pageable physical memory?
The answer lies in IOMemoryDescriptor and IODMACommand objects.
If the memory in question is kernel-allocated, it should be allocated by creating an IOBufferMemoryDescriptor in the first place. If that's not possible, or if it's a buffer allocated in user space, you can wrap the relevant pointer using IOMemoryDescriptor::withAddressRange(address, length, options, task) or one of the other factory functions. In the case of withAddressRange, the address passed in must be virtual, in the address space of task.
You can directly grab physical address ranges from an IOMemoryDescriptor by calling the getPhysicalSegment() function (only valid between prepare()…complete() calls). However, normally you would do this for creating scatter-gather lists (DMA), and for this purpose Apple strongly recommends the IODMACommand. You can create these using IODMACommand::withSpecification(). Then use the genIOVMSegments() function to generate the scatter-gather list.
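For the DMA path, a rough, untested sketch of the IODMACommand route might look like the following, assuming md is an IOMemoryDescriptor you have already created and prepared (the constants are as declared in IODMACommand.h, but verify the exact parameter list against the headers):

IODMACommand* dma = IODMACommand::withSpecification(
    kIODMACommandOutputHost64,  // emit 64-bit host-format segments
    64,                         // number of address bits the device supports
    0,                          // maxSegmentSize: 0 = no limit
    IODMACommand::kMapped,      // use the system mapper (IOMMU) if present
    0,                          // maxTransferSize: 0 = no limit
    1);                         // alignment
// TODO: check dma != nullptr
dma->setMemoryDescriptor(md);   // associates (and by default prepares) the descriptor
UInt64 offset = 0;
while (offset < md->getLength())
{
    IODMACommand::Segment64 segments[32];
    UInt32 numSegments = 32;
    IOReturn ret = dma->genIOVMSegments(&offset, segments, &numSegments);
    // TODO: error handling; hand segments[0..numSegments) to the device
}
dma->clearMemoryDescriptor();
dma->release();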
Modern Macs, and also some old PPC G5s contain an IOMMU (Intel calls this VT-d), so the system memory addresses you pass to PCI/Thunderbolt devices are not in fact physical, but IO-Mapped. IODMACommand will do this for you, as long as you use the "system mapper" (the default) and set mappingOptions to kMapped. If you're preparing addresses for the CPU, not a device, you will want to turn off mapping - use kIOMemoryMapperNone in your IOMemoryDescriptor options. Depending on what exactly you're trying to do, you probably don't need IODMACommand in this case either.
Note: it's often wise to pool and reuse your IODMACommand objects, rather than freeing and reallocating them.
Regarding MMIO, I assume you mean PCI BARs and similar - for IOPCIDevice, you can grab an IOMemoryDescriptor representing the memory-mapped device range using getDeviceMemoryWithRegister() and similar functions.
Example:
If all you want are pure CPU-space physical addresses for a given virtual memory range in some task, you can do something like this (untested as a complete kext that uses it would be rather large):
// INPUTS:
mach_vm_address_t virtual_range_start = …; // start address of virtual memory
mach_vm_size_t virtual_range_size_bytes = …; // number of bytes in range
task_t task = …; // Task object of process in which the virtual memory address is mapped
IOOptionBits direction = kIODirectionInOut; // whether the memory will be written or read, or both during the operation
IOOptionBits options =
kIOMemoryMapperNone // we want raw physical addresses, not IO-mapped
| direction;
// Process for getting physical addresses:
IOMemoryDescriptor* md = IOMemoryDescriptor::withAddressRange(
virtual_range_start, virtual_range_size_bytes, options, task);
// TODO: check for md == nullptr
// Wire down virtual range to specific physical pages
IOReturn result = md->prepare(direction);
// TODO: do error handling
IOByteCount offset = 0;
while (offset < virtual_range_size_bytes)
{
IOByteCount segment_len = 0;
addr64_t phys_addr = md->getPhysicalSegment(offset, &segment_len, kIOMemoryMapperNone);
// TODO: do something with physical range of segment_len bytes at address phys_addr here
offset += segment_len;
}
/* Unwire. Call this only once you're done with the physical ranges
* as the pager can change the physical-virtual mapping outside of
* prepare…complete blocks. */
md->complete(direction);
md->release();
As explained above, this is not suitable for generating DMA scatter-gather lists for device I/O. Note also this code is only valid for 64-bit kernels. You'll need to be careful if you still need to support ancient 32-bit kernels (OS X 10.7 and earlier) because virtual and physical addresses can still be 64-bit (64-bit user processes and PAE, respectively), but not all memory descriptor functions are set up for that. There are 64-bit-safe variants available to be used for 32-bit kexts.
I'm trying to manually update the TLB to translate new virtual memory pages into a certain set of physical pages in kernel space. I want to do this in do_page_fault so that whenever a load/store instruction touches a particular virtual address range (not already mapped), a page table entry is installed in the TLB in advance. The translation is simple. For example, I would like the following piece of code to work properly:
int d;
int *pd = (int *)((unsigned long)&d + 0xffff000000000000UL);
*pd = 25; // A page fault occurs and TLB is updated
printk("d = %d\n", *pd); // No page fault (TLB is already up-to-date)
So, the translation is just a subtraction of 0xffff000000000000. I was wondering what the best way is to implement the TLB update functionality?
Edit: The main reason for doing this is to be able to reserve a large virtual memory space in the kernel. I just want to handle page faults in a particular address range in the kernel. So, first I have to reserve the address range (which may exceed the 100 TB limit). Then, I have to implement a page cache for that particular range. If that is not possible, what is the best solution?
A co-worker mentioned to me a few months ago that one of our internal Delphi applications seems to be taking up 8 GB of RAM. I told him:
That's not possible
A 32-bit application only has a 32-bit virtual address space. Even if there was a memory leak, the most memory it could consume is 2 GB. After that allocations would fail (as there would be no empty space in the virtual address space). And in the case of a memory leak, the virtual pages will be swapped out to the pagefile, freeing up physical RAM.
But he noted that Windows Resource Monitor indicated that less than 1 GB of RAM was available on the system. And while our app was only using 220 MB of virtual memory, closing it freed up 8 GB of physical RAM.
So I tested it
I let the application run for a few weeks, and today I finally decided to test it.
First I looked at memory usage before closing the app, using Process Explorer:
the working set (RAM) is: 241 MB
total virtual memory used: 409 MB
And I used Resource Monitor to check memory used by the app, and total RAM in use:
virtual memory allocated by application: 252 MB
physical memory in use: 14 GB
And then memory usage after closing the app:
physical memory in use: 6.6 GB (7.4 GB less)
I also used Process Explorer to look at a breakdown of physical RAM use before and after. The only difference is that 8 GB of RAM really was uncommitted and now free:
Item                              Before        After
Commit Charge (K)                 15,516,388    7,264,420
Physical Memory Available (K)     1,959,480     9,990,012
Zeroed Paging List (K)            539,212       8,556,340
Note: It's somewhat interesting that Windows would waste time instantly zeroing out all the memory, rather than simply putting it on a standby list, and zero it out as needed (as memory requests need to be satisfied).
None of those things explain what the RAM was doing (What are you doing just sitting there! What do you contain!?)
What is in that memory?
That RAM must contain something useful; it must have some purpose. For that I turned to SysInternals' RAMMap. It can break down memory allocations.
The only clue that RAMMap provides is that the 8 GB of physical memory was associated with something called Session Private. These Session Private allocations are not associated with any process (i.e. not my process):
Item               Before      After
Session Private    8,031 MB    276 MB
Unused             1,111 MB    8,342 MB
I'm certainly not doing anything with EMS, XMS, AWE, etc.
What could possibly be happening in a 32-bit non-Administrator application that is causing Windows to allocate an additional 7 GB of RAM?
It's not a cache of swapped-out items.
It's not a SuperFetch cache.
It's just there, consuming RAM.
Session Private
The only information about "Session Private" memory is from a blog post announcing RAMMap:
Session Private: Memory that is private to a particular logged in session. This will be higher on RDS Session Host servers.
What kind of app is this?
This is a 32-bit native Windows application (i.e. not Java, not .NET). Because it is a native Windows application it, of course, makes heavy use of the Windows API.
It should be noted that I wasn't asking people to debug the application; I was hoping a Windows developer out there would know why Windows might hold memory that I never allocated. Having said that, the only thing that changed recently (in the last 2 or 3 years) that could cause such a thing is the feature that takes a screenshot every 5 minutes and saves it to the user's %LocalAppData% folder. A timer fires every five minutes:
QueueUserWorkItem(TakeScreenshotThreadProc);
And pseudo-code of the thread method:
void TakeScreenshotThreadProc(Pointer data)
{
String szFolder = GetFolderPath(CSIDL_LOCAL_APPDATA);
ForceDirectoryExists(szFolder);
String szFile = szFolder + "\\" + FormatDateTime("yyyyMMdd'_'hhnnss", Now()) + ".jpg";
Image destImage = new Image();
try
{
CaptureDesktop(destImage);
JPEGImage jpg = new JPEGImage();
jpg.CopyFrom(destImage);
jpg.CompressionQuality = 13;
jpg.Compress();
HANDLE hFile = CreateFile(szFile, GENERIC_WRITE,
FILE_SHARE_READ | FILE_SHARE_WRITE, null, CREATE_ALWAYS,
FILE_ATTRIBUTE_ARCHIVE | FILE_ATTRIBUTE_ENCRYPTED, 0);
//error checking elided
try
{
Stream stm = new HandleStream(hFile);
try
{
jpg.SaveToStream(stm);
}
finally
{
stm.Free();
}
}
finally
{
CloseHandle(hFile);
}
}
finally
{
destImage.Free();
}
}
Most likely somewhere in your application you are allocating system resources and not releasing them. Any WinApi call that creates an object and returns a handle could be a suspect. For example (be careful running this on a system with limited memory - if you don't have 6GB free it will page badly):
Program Project1;
{$APPTYPE CONSOLE}
uses
Windows;
var
b : Array[0..3000000] of byte;
i : integer;
begin
for i := 1 to 2000 do
CreateBitmap(1000, 1000, 3, 8, @b);
ReadLn;
end.
This consumes 6GB of session memory due to the allocation of bitmap objects that are not subsequently released. Application memory consumption remains low because the objects are not created on the application's heap.
Without knowing more about your application, however, it is very difficult to be more specific. The above is one way to demonstrate the behaviour you are observing. Beyond that, I think you need to debug.
In this case, there are a large number of GDI objects allocated - this isn't necessarily indicative, however, since there are often a large number of small GDI objects allocated in an application rather than a large number of large objects (The Delphi IDE, for example, will routinely create >3000 GDI objects and this is not necessarily a problem).
In @Abelisto's example (in the comments), by contrast:
Program Project1;
{$APPTYPE CONSOLE}
uses
SysUtils;
var
i : integer;
sr : TSearchRec;
begin
for i := 1 to 1000000 do FindFirst('c:\*', faAnyFile, sr);
ReadLn;
end.
Here the returned handles are not to GDI objects but are rather search handles (which fall under the general category of Kernel Objects). Here we can see that there are a large number of handles used by the process. Again, process memory consumption is low but there is a large increase in session memory used.
Similarly, the objects might be User Objects - these are created by calls to things like CreateWindow, CreateCursor, or by setting hooks with SetWindowsHookEx. For a list of WinAPI calls that create objects and return handles of each type, see:
Handles and Objects : Object Categories -- MSDN
This can help you start to track down the issue by narrowing it to the type of call that could be causing the problem. It may also be in a buggy third-party component, if you are using any.
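One cheap way to narrow the search is to log the process's own object counts while the application runs and watch which category climbs. A small diagnostic sketch (C++/Win32; the APIs are standard, the logging format is merely illustrative):

#include <windows.h>
#include <stdio.h>

void LogResourceCounts(void)
{
    HANDLE self = GetCurrentProcess();
    DWORD gdiObjects  = GetGuiResources(self, GR_GDIOBJECTS);  // GDI objects (bitmaps, DCs, ...)
    DWORD userObjects = GetGuiResources(self, GR_USEROBJECTS); // USER objects (windows, cursors, hooks, ...)
    DWORD handles = 0;
    GetProcessHandleCount(self, &handles);                     // kernel object handles (files, search handles, ...)
    printf("GDI: %lu  USER: %lu  kernel handles: %lu\n", gdiObjects, userObjects, handles);
}

Calling this from an existing timer (the five-minute screenshot timer, say) would show quickly whether one of the three counters grows without bound.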
A tool like AQTime can profile Windows allocations, but I'm not sure if there is a version that supports Delphi 5. There may be other allocation profilers that can help track this down.
I have a very basic question which I have failed to understand after going through the documentation. I am facing this issue while executing one of my projects, as the output I get is totally corrupted, and I believe the problem is either with memory allocation or with thread synchronization.
OK, the question is:
Does every thread create a separate copy of all the variables and pointers passed to the kernel function? Or does it just create a copy of the variables, while the memory the pointers point to is shared among all threads?
e.g.
int main()
{
const int DC4_SIZE = 3;
const int DC4_BYTES = DC4_SIZE * sizeof(float);
float * dDC4_in;
float * dDC4_out;
float hDC4_in[DC4_SIZE];
float hDC4_out[DC4_SIZE];
gpuErrchk(cudaMalloc((void**) &dDC4_in, DC4_BYTES));
gpuErrchk(cudaMalloc((void**) &dDC4_out, DC4_BYTES));
// dc4 initialization function on host which allocates some values to DC4[] array
gpuErrchk(cudaMemcpy(dDC4_in, hDC4_in, DC4_BYTES, cudaMemcpyHostToDevice));
mykernel<<<10,128>>>(VolDepth,dDC4_in);
cudaMemcpy(hDC4_out, dDC4_out, DC4_BYTES, cudaMemcpyDeviceToHost);
}
__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
for(int index =0 to end)
dDC4_out[index]=dDC4_in[index] * VolDepth;
}
So I am passing the dDC4_in and dDC4_out pointers to the GPU, with dDC4_in initialized with some values, computing dDC4_out, and copying it back to the host.
So will all my 1280 threads have separate dDC4_in/out copies, or will they all work on the same copy on the GPU, overwriting the values written by other threads?
global memory is shared by all threads in a grid. The parameters you pass to your kernel (that you've allocated with cudaMalloc) are in the global memory space.
Threads do have their own memory (local memory), but in your example dDC4_in and dDC4_out are shared by all of your threads.
As a general run-down (taken from the CUDA Best Practices documentation):
On the DRAM side:
Local memory (and registers) is per-thread, shared memory is per-block, and global, constant, and texture are per-grid.
In addition, global/constant/texture memory can be read and modified on the host, while local and shared memory are only around for the duration of your kernel. That is, if you have some important information in your local or shared memory and your kernel finishes, that memory is reclaimed and your information lost. Also, this means that the only way to get data into your kernel from the host is via global/constant/texture memory.
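To make that rundown concrete, here is a tiny illustrative kernel (the names are mine, and it assumes the launch exactly covers the data) showing where each kind of variable lives:

__global__ void scopes(float* g_data)        // g_data points into global memory (per grid)
{
    float local_val;                         // register/local memory (per thread)
    __shared__ float tile[128];              // shared memory (per block), sized for a 128-thread block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    tile[threadIdx.x] = g_data[i];           // stage global data in shared memory
    __syncthreads();
    local_val = tile[threadIdx.x] * 2.0f;
    g_data[i] = local_val;                   // results must be written back to global memory
}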
Anyways, in your case it's a bit hard to recommend how to fix your code, because you don't take threads into account at all. Not only that, in the code you posted you're only passing 2 arguments to your kernel (which takes 3 parameters), so it's no surprise your results are somewhat lacking. Even if your code were valid, you would have every thread looping from 0 to end and writing to the same locations in memory (the writes would be serialized, but you wouldn't know which write would be the last one to go through). In addition to that race condition, you have every thread doing the same computation; each of your 1280 threads will execute that for loop and perform the same steps. You have to decide on a mapping of threads to data elements, divide up the work in your kernel based on your thread-to-element mapping, and perform your computation based on that.
e.g. if you have a 1 thread : 1 element mapping,
__global__ void mykernel(float VolDepth,float * dDC4_in,float * dDC4_out)
{
int index = threadIdx.x + blockIdx.x*blockDim.x;
dDC4_out[index]=dDC4_in[index] * VolDepth;
}
of course this would also necessitate changing your kernel launch configuration to have the correct number of threads, and if the threads and elements aren't exact multiples, you'll want some added bounds checking in your kernel.
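For example, with N elements, a bounds-checked kernel and a matching launch might look like this (a sketch; the extra n parameter is an addition to the original signature):

__global__ void mykernel(float VolDepth, float* dDC4_in, float* dDC4_out, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                           // guard the tail when n is not a multiple of blockDim.x
        dDC4_out[index] = dDC4_in[index] * VolDepth;
}

// host side:
//   int threads = 128;
//   int blocks  = (N + threads - 1) / threads;   // round up so every element gets a thread
//   mykernel<<<blocks, threads>>>(VolDepth, dDC4_in, dDC4_out, N);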
Is it possible to wrap up memory mapped files something like this?
TVirtualMemoryManager = class
public
function AllocMem (Size : Integer) : Pointer;
procedure FreeMem (Ptr : Pointer);
end;
Since the memory mapped file API functions all take offsets, I don't know how to manage the free areas in the memory mapped files. My only idea is to implement some kind of basic memory management (maintaining free lists for different block sizes), but I don't know how efficient this will be.
EDIT: What I really want (as David made clear to me) is this:
IVirtualMemory = interface
function ReadMem (Addr : Int64) : TBytes;
function AllocateMem (Data : TBytes) : Int64;
procedure FreeMem (Addr : Int64);
end;
I need to store contiguous blocks of bytes (each relatively small) in virtual memory and be able to read them back into memory using a 64-bit address. Most of the time access is read-only. If a write is necessary I would just use FreeMem followed by AllocMem, since the size will be different anyway.
I want a wrapper for a memory mapped file with this interface. Internally it has a handle to a memory mapped file and uses MapViewOfFile on each ReadMem request. The Addr 64-bit integers are just offsets into the memory mapped file. The open question is how to assign those addresses - I currently keep a list of free blocks that I maintain.
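For reference, a minimal sketch of that per-request mapping (C++/Win32 here; a view offset must be a multiple of the system allocation granularity, typically 64 KB, and the answer below argues this per-call mapping is wasteful):

#include <windows.h>
#include <vector>

std::vector<unsigned char> ReadMem(HANDLE hMapping, unsigned long long addr, size_t size)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    // Round the requested offset down to an allocation-granularity boundary.
    unsigned long long viewStart = addr - (addr % si.dwAllocationGranularity);
    size_t delta = (size_t)(addr - viewStart);
    void* view = MapViewOfFile(hMapping, FILE_MAP_READ,
                               (DWORD)(viewStart >> 32), (DWORD)(viewStart & 0xFFFFFFFF),
                               delta + size);
    // TODO: check view != nullptr
    std::vector<unsigned char> data((unsigned char*)view + delta,
                                    (unsigned char*)view + delta + size);
    UnmapViewOfFile(view);
    return data;
}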
Your proposal that "Internally it has a handle to a memory mapped files and uses MapViewOfFile on each ReadMem request" will be just a waste of CPU resource, IMHO.
It is worth saying that your GetMem / FreeMem requirement won't be able to break the 3/4 GB barrier. Since all allocated memory will be mapped into memory until a call to FreeMem, you'll be short of memory space, just as with the regular Delphi memory manager. The best you can do is to rely on FastMM4, and change your program to reduce its memory use.
IMHO you'll have to change/update your specification. For instance, your "updated" question sounds just like a regular storage problem.
What you want is to be able to allocate more than 3/4 GB of data for your application. You have a working implementation of such a feature in our SynBigTable open source unit. This is a fast and light NoSQL solution in pure Delphi.
It is able to create a file of any size (only 64 bit limited), then will map the content of each record into memory, on request. It will use a memory mapping of the file, if possible. You can implement your interface very directly with TSynBigTable methods: ReadMem=Get, AllocMem=Add, FreeMem=Delete. The IDs will be your pointer-like values, and RawByteString will be used instead of TBytes.
You can access any block of data using an integer ID, or a string ID, or even use a sophisticated field layout (inside the record, or as in-memory metadata - including indexes and fast search).
Or rely on a regular embedded SQL database. For instance, SQLite3 is very good at handling BLOB fields, and is able to store huge amount of data. With a simple in-memory caching mechanism for most used records, it could be a powerful solution.