Is it at all possible to allocate large (i.e. 32mb) physically contiguous memory regions from kernel code at runtime (i.e. not using bootmem)? From my experiments, it seems like it's not possible to get anything more than a 4mb chunk successfully, no matter what GFP flags I use. According to the documentation I've read, GFP_NOFAIL is supposed to make the kmalloc just wait as long as is necessary to free the requested amount, but from what I can tell it just makes the request hang indefinitely if you request more than is availble - it doesn't seem to be actively trying to free memory to fulfil the request (i.e. kswapd doesn't seem to be running). Is there some way to tell the kernel to aggressively start swapping stuff out in order to free the requested allocation?
Edit: So I see from Eugene's response that it's not going to be possible to get a 32mb region from a single kmalloc.... but is there any possibility of getting it done in more of a hackish kind of way? Like identifying the largest available contiguous region, then manually migrating/swapping away data on either side of it?
Or how about something like this:
1) Grab a bunch of 4mb chunks until you're out of memory.
2) Check them all to see if any of them happen to be contiguous, if so,
combine them.
3) kfree the rest
4) goto 1)
Might that work, if given enough time to run?
You might want to take a look at the Contiguous Memory Allocator patches. Judgging from the LWN article, these patches are exactly what you need.
Mircea's link is one option; if you have an IOMMU on your device you may be able to use that to present a contiguous view over a set of non-contiguous pages in memory as well.
Related
Problem (think of the mark phase of a GC)
I have a graph of “objects” that I need to walk, visiting all objects.
I can store in each object if it has been visited.
All the objects are stored in memory and linked together using normal pointers.
The objects are not all the same size.
Sometimes there is not enough ram in the system to hold all the objects in memory at the same time, and I wish to avoid “page thrashing”.
I also wish to avoid TLB faults
Other times, there is more than enough ram.
I do not mind writing low-level code.
I do not mind different code for windows and linux.
The code must run in “user space” without needing none standard permissions.
I don't care the order I visit the nodes in.
I am going to ask more detail questions about possible solutions, linking back to this questions.
Page faults aren't necessarily bad, as long as they're not stalling your progress.
This means that if you have a node Node* p with two candidate successors p->left and p->right, it can be useful to pick the nearest (in terms of (char*)p - (char*)p->next) and pre-fetch the other (e.g. with PrefetchVirtualMemory).
How efficient this will be cannot be predicted; it greatly depends on your graph topology. But the prefetch is virtually free when you have enough RAM.
Closer to the CPU, there's cache prefetching. Same idea, different storage
Use 2M hugepages for address ranges that are full of "hot" data that the kernel can't usefully swap out any / many 4k chunks of. This will reduce TLB misses, but costs extra physical memory if there are any 4k chunks of a hugepage that aren't hot.
Linux does this transparently for anonymous pages (https://www.kernel.org/doc/Documentation/vm/transhuge.txt), but you can use madvise(MADV_HUGEPAGE) on pages you know are worth it, to encourage the kernel to defrag physical memory even if that's not the default in /sys/kernel/mm/transparent_hugepage/defrag. (You can look at /proc/PID/smaps to see how many transparent hugepages are in use for any given mapping.)
Based on what you posted in your answer: An ordered set of nodesToVisit would give you the most locality, but might be too expensive to maintain. Multiple accesses within the same 64-byte cache line are much cheaper than coming back to it later after it's been evicted from L3 cache and has to come from DRAM again.
If you have lots of addresses to visit in your Set, doing one pass of a radix-sort into 2M buckets would give you locality within one hugepage. 2M is also smaller than L3 cache size, so you'll probably get some cache hits when visiting multiple objects in the same cache line, even if you don't hit them back to back.
Depending on how big your Set is, throwing around that many pointers even to partial-sort them might not be worth the memory traffic that takes. But there's probably some sweet spot of taking a window of data and at least partially sorting it. Using the pointers before they are evicted from cache is nice.
SW prefetch can trigger a page-walk to avoid a TLB miss, so you could _mm_prefetch(_MM_HINT_T2) one address from the next 2M bucket before starting on the current bucket. See also Prefetching Examples?. I haven't tested this, but it might work well. It won't help with page faults: prefetch from an unmapped page won't cause a page fault, and you don't want to trigger an actual PF until you're ready to touch the page.
MSalter's suggestion to ask the OS to prefetch and wire the next page is interesting (I think madvise(MADV_WILLNEED) is the Linux equivalent), but a system call will be slow for no benefit if the page was already mapped+wired into the HW page table. There's no x86 asm instruction that just asks if a page is mapped without faulting if it isn't, so I can't think of a way to efficiently choose not to call it. And BTW, I think Linux breaks up transparent hugepages into 4k regular pages for paging in/out. But don't write a big loop that just does _mm_prefetch() or madvise on all the 4k pages in a 2M block; that probably sucks. The prefetcht2 part would probably just result in excess prefetch requests being dropped.
Use perf counters to look at cache hit/miss rates. On Intel CPUs, the mem_load_retired.l1_miss and/or .l2_miss event should show you whether you're getting cache hits on accessing the Set itself, as well as on accessing dereferencing those pointers. Those counters are precise events, so they should map accurately to asm load instructions. (e.g. perf record -e mem_load_retired.l2_miss ./my_program / perf report on Linux).
We remove one item at a time from nodesToVisit
I don't know much about GC design, but can't you use a sequence number or tagged-pointer or something to avoid modifying the Set data structure itself every GC pass? If your minimum object alignment is 4 bytes, you have 2 bits to play with at the bottom of every pointer. ANDing them off before dereferencing is very cheap.
x86-64 with full 64-bit pointers currently requires the top 16 to be the sign-extension of the low 48. So you could use bits there (16 bits, or maybe just the top byte) if you re-canonicalize pointers. (redo sign extension, or just zero the high 16 bits if you want to assume user-space pointers; Linux uses a high-half kernel VM layout so user-space addresses are always in the low half of virtual address space. IDK what Windows does.)
On x86-64, you might consider using the x32 ABI (32-bit pointers in long mode) if 4GiB of address space is enough, especially if you're hitting physical memory limits and swapping. Smaller pointers mean smaller data structures, thus half the cache footprint.
Some Linux systems are built without kernel support for x32, though, only classic x86-64 and usually 32-bit mode. But if it works on your systems, consider gcc -mx32.
These are my first thoughts about a possible solution, they are clearly not optimal. I will delete this answer if someone posts a better answer.
The basic method:
Assume we have a Set<NodePointer> nodesToVisit that contains all nodes we have not yet visited.
We remove one item at a time from nodesToVisit,
and if it has not been visited before we add all “pointers to other nodes” to nodesToVisit.
Improvements:
But we can clearly do better, by ordering nodesToVisit based on address, so that we are more likely to visit nodes that are contained in pages we have recently accessed. This could be as simple as having a second Set<NodePointer> nodesToVisitLater, and putting any node that has an address a long way from the current node into it.
Or we could skip over any node that are contained in pages that are not resident in memory, visiting these nodes after we have visited all nodes that are currently in memory.
(The"set" could just be a stack, as visiting a node more than once is a "no-opp")
https://patents.google.com/patent/US7653797B1/en seems to be related, but I have not read it yet.
https://hosking.github.io/links/Cher+2004ASPLOS.pdf
https://people.cs.umass.edu/~emery/pubs/cramm.pdf
https://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf
https://people.cs.umass.edu/~emery/pubs/04-16.pdf
OpenCL is of course designed to abstract away the details of hardware implementation, so going down too much of a rabbit hole with respect to worrying about how the hardware is configured is probably a bad idea.
Having said that, I am wondering how much local memory is efficient to use for any particular kernel. For example if I have a work group which contains 64 work items then presumably more than one of these may simultaneously run within a compute unit. However it seems that the local memory size as returned by CL_DEVICE_LOCAL_MEM_SIZE queries is applicable to the whole compute unit, whereas it would be more useful if this information was for the work group. Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative? Is tuning by hand the only way to go? To me that means that you are only tuning for one GPU model.
Lastly, I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less? I hear that if you allocate too much then data is just placed in global memory. Is there a way of determining if this is the case?
Is there a way to know how many work groups will need to share this same memory pool if they coexist on the same compute unit?
Not in one step, but you can compute it. First, you need to know how much local memory a workgroup will need. To do so, you can use clGetKernelWorkGroupInfo with the flag CL_KERNEL_LOCAL_MEM_SIZE (strictly speaking it's the local memory required by one kernel). Since you know how much local memory there is per compute unit, you can know the maximum number of workgroups that can coexist on one compute unit.
Actually, this is not that simple. You have to take into consideration other parameters, such as the max number of threads that can reside on one compute unit.
This is a problem of occupancy (that you should try to maximize). Unfortunately, occupancy will vary depending of the underlying architecture.
AMD publish an article on how to compute occupancy for different architectures here.
NVIDIA provide an xls sheet that compute the occupancy for the different architectures.
Not all the necessary information to do the calculation can be queried with OCL (if I recall correctly), but nothing stops you from storing info about different architectures in your application.
I had thought that making sure that my work group memory usage was below one quarter of total local memory size was a good idea. Is this too conservative?
It is quite rigid, and with clGetKernelWorkGroupInfo you don't need to do that. However there is something about CL_KERNEL_LOCAL_MEM_SIZE that needs to be taken into account:
If the local memory size, for any pointer argument to the kernel
declared with the __local address qualifier, is not specified, its
size is assumed to be 0.
Since you might need to compute dynamically the size of the necessary local memory per workgroup, here is a workaround based on the fact that the kernels are compiled in JIT.
You can define a constant in you kernel file and then use the -D option to set its value (previously computed) when calling clBuildProgram.
I would like to know if the whole local memory size is available for user allocation for local memory, or if there are other system overheads that make it less?
Again CL_KERNEL_LOCAL_MEM_SIZE is the answer. the standard states:
This includes local memory that may be needed by an implementation to
execute the kernel...
If your work is fairly independent and doesn't re-use input data you can safely ignore everything about work groups and shared local memory. However, if your work items can share any input data (classic example is a 3x3 or 5x5 convolution that re-reads input data) then the optimal implementation will need shared local memory. Non-independent work can also benefit. One way to think of shared local memory is programmer-managed cache.
Title says it pretty much all : is there a way to get the lowest free virtual memory address under windows ? I should add that I am interested by this information at the beginning of the program (before any dynamic memory allocation has been done).
Why I need it : trying to build a malloc implementation under Windows. If it is not possible I would have to really to whatever VirtualAlloc() returns when given NULL as first parameter. While you would expect it to do something sensible, like allocation memory at the bottom of what is available, there are no guarantees.
This can be implemented yourself by using VirtualQuery looking for pages that are marked as free. It would be relatively slow though. (You will also need to consider allocation granularity which is different from page size.)
I will say that unless you need contiguous blocks of memory, trying to keep everything close together is mostly meaningless since if two pages of virtual memory might be next to each other in the address space, there is no reason to assume they are close to each other in physical memory. In fact, even if they are close to each other at some point in time, if those pages get moved to backing store and then faulted back into memory, the page would not be faulted to the same physical address page.
The OS uses more complicated metrics than just what is the "lowest" memory address available. Specifically, VirtualAlloc allocates pages of memory, so depending on how much you're asking for, at least one page of unused address space has to be available at the starting address. So even if you think there's a "lower" address that it should have used, that address might not have been compatible with the operation that you asked for.
I know that memory usage is a very complex issue on Windows.
I am trying to write a UI control for a large application that shows a 'percentage of memory used' number, in order to give the user an indication that it may be time to clear up some memory, or more likely restart the application.
One implementation used ullAvailVirtual from MEMORYSTATUSEX as a base, then used HeapWalk() to walk the process heap looking for additional free memory. The HeapWalk() step was needed because we noticed that after a while of running the memory allocated and freed by the heap was never returned and reported by the ullAvailVirtual number. After hours of intensive working, the ullAvailVirtual number no longer would accurately report the amount of memory available.
However, this method proved not ideal, due to occasional odd errors that HeapWalk() would return, even when the process heap was not corrupted. Further, since this is a UI control, the heap walking code was executing every 5-10 seconds. I tried contacting Microsoft about why HeapWalk() was failing, escalated a case via MSDN, but never got an answer other than "you probably shouldn't do that".
So, as a second implementation, I used PagefileUsage from PROCESS_MEMORY_COUNTERS as a base. Then I used VirtualQueryEx to walk the virtual address space adding up all regions that weren't MEM_FREE and returned a value for GetMappedFileNameA(). My thinking was that the PageFileUsage was essentially 'private bytes' so if I added to that value the total size of the DLLs my process was using, it would be a good approximation of the amount of memory my process was using.
This second method seems to (sorta) work, at least it doesn't cause crashes like the heap walker method. However, when both methods are enabled, the values are not the same. So one of the methods is wrong.
So, StackOverflow world...how would you implement this?
which method is more promising, or do you have a third, better method?
should I go back to the original method, and further debug the odd errors?
should I stay away from walking the heap every 5-10 seconds?
Keep in mind the whole point is to indicate to the user that it is getting 'dangerous', and they should either free up memory or restart the application. Perhaps a 'percentage used' isn't the best solution to this problem? What is? Another idea I had was a color based system (red, yellow, green, which I could base on more factors than just a single number)
Yes, the Windows memory manager was optimized to fulfill requests for memory as quickly and efficiently possible, it was not optimized to easily measure how much space is used. The first downfall is that heap blocks that are released are rarely unmapped. They are simply marked as "free", to be used by the next allocation. That's why VirtualQueryEx() cannot work.
The problem with HeapWalk is that you have to lock the heap (HeapLock) so that it can walk it without the heap allocation changing. That lock can have very detrimental side-effects. Quoting:
Walking a heap may degrade
performance, especially on symmetric
multiprocessing (SMP) computers. The
side effects may last until the
process ends.
Even then, the number you get back is pretty meaningless. A program never runs out of free space, it runs out of a large enough contiguous chunk of memory to fulfill the request. No happy answers I'm afraid. Except one: a 64-bit operating system cost less than two hundred bucks.
The place to start is probably GetProcessMemoryInfo(). This fills in a structure for you that has, among other things, the current working set in bytes.
Have a look at the following article .NET and running processes
It uses WMI to check for the memory usage of processes, specifically using the
System.Diagnostics.Process
and another link on how to use WMI in C#: WMI Made Easy for C#
Hope this helps.
I need to store large amounts of data on-disk in approximately 1k blocks. I will be accessing these objects in a way that is hard to predict, but where patterns probably exist.
Is there an algorithm or heuristic I can use that will rearrange the objects on disk based on my access patterns to try to maximize sequential access, and thus minimize disk seek time?
On modern OSes (Windows, Linux, etc) there is absolutely nothing you can do to optimise seek times! Here's why:
You are in a pre-emptive multitasking system. Your application and all it's data can be flushed to disk at any time - user switches task, screen saver kicks in, battery runs out of charge, etc.
You cannot guarantee that the file is contiguous on disk. Doing Aaron's first bullet point will not ensure an unfragmented file. When you start writing the file, the OS doesn't know how big the file is going to be so it could put it in a small space, fragmenting it as you write more data to it.
Memory mapping the file only works as long as the file size is less than the available address range in your application. On Win32, the amount of address space available is about 2Gb - memory used by application. Mapping larger files usually involves un-mapping and re-mapping portions of the file, which won't be the best of things to do.
Putting data in the centre of the file is no help as, for all you know, the central portion of the file could be the most fragmented bit.
To paraphrase Raymond Chen, if you have to ask about OS limits, you're probably doing something wrong. Treat your filesystem as an immutable black box, it just is what it is (I know, you can use RAID and so on to help).
The first step you must take (and must be taken whenever you're optimising) is to measure what you've currently got. Never assume anything. Verify everything with hard data.
From your post, it sounds like you haven't actually written any code yet, or, if you have, there is no performance problem at the moment.
The only real solution is to look at the bigger picture and develop methods to get data off the disk without stalling the application. This would usually be through asynchronous access and speculative loading. If your application is always accessing the disk and doing work with small subsets of the data, you may want to consider reorganising the data to put all the useful stuff in one place and the other data elsewhere. Without knowing the full problem domain it's not possible to to be really helpful.
Depending on what you mean by "hard to predict", I can think of a few options:
If you always seek based on the same block field/property, store the records on disk sorted by that field. This lets you use binary search for O(log n) efficiency.
If you seek on different block fields, consider storing an external index for each field. A b-tree gives you O(log n) efficiency. When you seek, grab the appropriate index, search it for your block's data file address and jump to it.
Better yet, if your blocks are homogeneous, consider breaking them down into database records. A database gives you optimized storage, indexing, and the ability to perform advanced queries for free.
Use memory-mapped file access rather than the usual open-seek-read/write pattern. This technique works on Windows and Unix platforms.
In this way the operating system's virtual memory system will handle the caching for you. Accesses of blocks that are already in memory will result in no disk seek or read time. Writes from memory back to disk are handled automatically and efficiently and without blocking your application.
Aaron's notes are good too as they will affect initial-load time for a chunk that's not in memory. Combine that with the memory-mapped technique -- after all it's easier to reorder chunks using memcpy() than by reading/writing from disk and attempting swapouts etc.
The most simple way to solve this is to use an OS which solves that for you under the hood, like Linux. Give it enough RAM to hold 10% of the objects in RAM and it will try to keep as many of them in the cache as possible reducing the load time to 0. The recent server versions of Windows might work, too (some of them didn't for me, that's why I'm mentioning this).
If this is a no go, try this algorithm:
Create a very big file on the harddisk. It is very important that you write this in one go so the OS will allocate a continuous space on disk.
Write all your objects into that file. Make sure that each object is the same size (or give each the same space in the file and note the length in the first few bytes of of each chunk). Use an empty harddisk or a disk which has just been defragmented.
In a data structure, keep the offsets of each data chunk and how often it is accessed. When it is accessed very often, swap its position in the file with a chunk that is closer to the start of the file and which has a lesser access count.
[EDIT] Access this file with the memory-mapped API of your OS to allow the OS to effectively cache the most used parts to get best performance until you can optimize the file layout next time.
Over time, heavily accessed chunks will bubble to the top. Note that you can collect the access patterns over some time, analyze them and do the reorder over night when there is little load on your machine. Or you can do the reorder on a completely different machine and swap the file (and the offset table) when that's done.
That said, you should really rely on a modern OS where a lot of clever people have thought long and hard to solve these issues for you.
That's an interesting challenge. Unfortunately, I don't know how to solve this out of the box, either. Corbin's approach sounds reasonable to me.
Here's a little optimization suggestion, at least: Place the most-accessed items at the center of your disk (or unfragmented file), not at the start of end. That way, seeking to lesser-used data will be closer by average. Err, that's pretty obvious, though.
Please let us know if you figure out a solution yourself.