I'm working on an Entity-Component System, and now that I got everything up and running I decided to delve a bit into optimizations.
Currently, I have a Gameobject class that stores an unordered_map with component IDs and respective component pointers. To speed things up, I want to try to allocate components on the heap in contiguous memory without copying them into a vector. With this I would be able to iterate over them simply by getting the first component then moving up from that address by the size of the component, essentially eliminating the constant searching the computer would otherwise have to go through to find the different memory addresses when iterating over their pointers.
Is there a way to do such a thing?
Related
I am working on some project related with huge objects in physical memory in Windows.
I wanted to create really big structure of data, but therefore I found some problems.
While I am trying to allocate huge amount of data I can just create as large object as heap allows (it also depends on architecture of operating system).
I am not sure if this is restricted by private heap of thread, or some other way.
When I was looking for how operating system places data in memory, I found out that the data is stored in some particular order.
And here comes some questions...
If I want to create large objects, should I have one very large heap region to allocate memory inside? If so, I have to fragmentate data.
In other way, there came an idea, of finding starting addresses of empty regions, and then use this unused place to put data in some data structure.
If this idea is possible to realize, then how it could be done?
Another question is that, do you think that list would be the best option for that sort of huge object? Or maybe it would be better to use another data structure?
Do you think that chosen data structure could be divided into two regions of data separately, but standing as one object?
Thanks in advance, every answer for my questions could be helpful.
There seems to be some kind of misconception about memory allocation here.
(1) Most operating systems do not allocate memory linearly. There usually are discontinuities in the memory mapped to a process address space.
(2) If you want to allocate a huge amount of memory, you should do it directly with the operating system; not through a heap.
I'm trying to figure it out, but I'm getting a little stuck.
The way the types and heaps are related is simple, if a bit strange. (why not just give VkMemoryHeap a VkMemoryType member?)
I think I understand what all the VkMemoryPropertyFlags mean, they seem fairly straightforward.
But what's with the VkMemoryHeap.flags member? It apparently only has one non-zero valid value, VkMemoryHeapFlagBits.VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, and though that wouldn't be too odd on it's own, but there's also a VkMemoryPropertyFlagBits.VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT that could be present on the memory type of the heap.
What does the VkMemoryHeap.flags member mean and how does it relate to the VkMemoryType.flags member?
Vulkan recognizes two distinct concepts when it comes to memory. There are the actual physical pieces of RAM that the device can talk to. Then there are ways to allocate memory from one of those pools of RAM.
A heap represents a specific piece of RAM. VkMemoryHeap is the object that describes one of the available heaps of RAM that the device can talk to. There really aren't that many things that define a particular heap. Just two: the number of bytes of that RAMs storage and the storage's location relative to the Vulkan device (local vs. non-local).
A memory type is a particular means of allocating memory from a specific heap. VkMemoryType is the object that describes a particular way of allocating memory. And there are a lot more descriptive flags for how you can allocate memory from a heap.
For a more concrete example, consider a standard PC setup with a discrete GPU. The device has its own local RAM, but the discrete GPU can also access CPU memory. So a Vulkan device will have two heaps: one of them will be local, the other non-local.
However, there will usually be more than two memory types. You usually have one memory type that represents local memory, which does not have the VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT set. That means you can't map the memory; you can only access it via transfer operations from some other memory type (or from rendering operations or whatever).
But you will often have two memory types that both use the same non-local heap. They will both be VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, thus allowing mapping. However, one of them will likely have the VK_MEMORY_PROPERTY_HOST_CACHED_BIT flag set, while the other will be VK_MEMORY_PROPERTY_HOST_COHERENT_BIT. This allows you to choose whether you want cached CPU access (thus requiring an explicit flush of ranges of modified memory) or uncached CPU access.
But while they are two separate memory types, they both allocate from the same heap. Which is why VkMemoryType has an index that refers to the heap who's memory it is allocating from.
Only thing I'm not getting is how the two DEVICE_LOCAL flags interact.
Did you look at the specification? It's not exactly hiding how this works:
if propertyFlags has the VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT bit set, memory allocated with this type is the most efficient for device access. This property will only be set for memory types belonging to heaps with the VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set.
Is it saying that if the memory is local then all types corresponding to that memory are local, or that they can be local?
You seem to be trying to impose the wrong meaning to these things. Just look at what the specification says and take it at face value.
PROPERTY_DEVICE_LOCAL denotes a memory type which will achieve the best device access performance. The only connection between this and MEMORY_DEVICE_LOCAL is that memory types with PROPERTY_DEVICE_LOCAL will only be associated with memory heaps that use MEMORY_DEVICE_LOCAL.
That's the only relevant meaning here.
If you want an example of when a memory heap would be device local, yet have memory types that aren't, consider a GPU that has no memory of its own. There's only one heap, which is therefore MEMORY_DEVICE_LOCAL.
However, allocating memory from that pool in a way that makes it host-visible may decrease the performance of device access to that memory. Therefore, for such hardware, the host-visible memory types for the same heap will not use PROPERTY_DEVICE_LOCAL.
Then again, other hardware doesn't lose performance from making memory host-visible. So they only have one memory type, which has all of the available properties. For Intel, their on-chip GPUs apparently have access to some level of the CPU's caches.
After some tests with help of Task Manager, I understood one thing about gcnew — memory allocated for local variables remaines allocated even if control leaves function, and is re-allocated only when control re-enters this function — so I'm in perplexity, how to deallocate memory myself. Here is some example of the problem:
void Foo(void)
{
System::Text::StringBuilder ^ t = gcnew System::Text::StringBuilder("");
int i = 0;
while(++i < 20000000) t->Append(i);
return;
}
As I mentioned, memory for variable t remains after leaving Foo(), delete not work as it works for new, and calling Foo() once, only gives me pointless allocated memory.
This is gcnew, which means garbage collected allocation. It will be disposed and deallocated by GC thread
Your function uses memory for code and data. The code is a fixed amount and will be used the entire time the library or program is loaded. The data is only used when the function is executing.
Data used by a program is either static or dynamic. Static data is laid out by the compiler and is basically equivalent to code (except that it might be marked as non-executable and/or read-only to prevent accidents). Dynamic data is temporary and allocated from a stack or heap (or CPU registers).
In a classic program, the stack and heap share the same memory address range with the stack at one end, growing toward the heap and the heap at the other end, trying not to grow into the stack. However, with modern address ranges on the order of 1TB, a heap generally has a lot of room.
Keep in mind that when a program requests an address range, it's just signaling to the operating system that it's okay to use that address for data reading, data writing and/or code execution. Until it actually puts something there, there is no load on the system. Also keep in mind with a virtual memory system, process memory is effectively allocated on the swap file/device (hard drive) with optimizations especially using RAM for caching, copy on write and many other techniques. (Data written to a memory address might never make it to the swap file, but that's up to the operating system.)
The data your function needs is for the two variables: t and i. t is a reference to a garbage collected object. i is an integer. Both are quite small and short-lived. You could think of them as being on the stack. When the function returns, the stack frame is popped and their memory is reused by the next stack operation. If you are looking at memory allocation, there won't be a change because the amount of memory allocated to the stack would not be changed.
Now in the execution of your function, a new object is created and, the way it's filled with data, it takes up quite a bit of memory. You could consider that object to be created in the heap. You don't need to delete it since it is a garbage collection object. When the garbage collector runs by walking all objects reachable from a set of root objects, it will find that the object is not reachable and add its space to a free list. When space for a new object is needed that doesn't fit into any blocks on the free list, more of the heap's address range will be used.
The CLR heap is compactable, which means it can move objects around in order to coalesce free blocks. Using this ability, it can move objects out of areas of allocated memory and give it back to the operating system, thereby freeing up space in the swap file.
So, there are three things that have to happen for you to see a reduction in the amount of memory allocated to the process:
The garbage collection has run to find unreachable objects.
The heap has been compacted.
The heap allocation has been reduced.
None of these things are really necessary until the swap file can't grow anymore. Obviously, the system has been designed for performance and to be a good citizen so it wouldn't take it that far. You can influence when garbage collection runs but this is only very rarely helpful and is generally not done.
I was implementing a heap sort and I start wondering about the different implementations of heaps. When you don need to access the elements by index(like in a heap sort) what are the pros and cons of implementing a heap with an array or doing it like any other linked data structure.
I think it's important to take into account the memory wasted by the nodes and pointers vs the memory wasted by empty spaces in an array, as well as the time it takes to add or remove elements when you have to resize the array.
When I should use each one and why?
As far as space is concerned, there's very little issue with using arrays if you know how much is going into the heap ahead of time -- your values in the heap can always be pointers to the larger structures. This may afford for better cache localization on the heap itself, but you're still going to have to go out someplace to memory for extra data. Ideally, if your comparison is based on a small morsel of data (often just a 4 byte float or integer) you can store that as the key with a pointer to the full data and achieve good cache coherency.
Heap sorts are already not particularly good on cache hits throughout traversing the heap structure itself, however. For small heaps that fit entirely in L1/L2 cache, it's not really so bad. However, as you start hitting main memory performance will dive bomb. Usually this isn't an issue, but if it is, merge sort is your savior.
The larger problem comes in when you want a heap of undetermined size. However, this still isn't so bad, even with arrays. Anymore, in non-embedded environments with nice, pretty memory systems growing an array with some calls (e.g. realloc, please forgive my C background) really isn't all that slow because the data may not need to physically move in memory -- just some address pointer magic for most of it. Added to the fact that if you use a array-size-doubling strategy (array is too small, double the size in a realloc call) you're still ending up with an O(n) amortized cost with relatively few reallocs and at most double wasted space -- but hey, you'd get that with linked lists anyways if you're using a 32-bit key and 32-bit pointer.
So, in short, I'd stick with arrays for the smaller base data structures. When the heap goes away, so do the pointers I don't need anymore with a single deallocation. However, it's easier to read pointer-based code for heaps in my opinion since dealing with the indexing magic isn't quite as straightforward. If performance and memory aren't a concern, I'd recommend that to anyone in a heartbeat.
As part of our system simulation, I'm modeling a memory space with 64-bit addressing using a sparse memory array and keeping a list of objects to keep track of buffers that are allocated within the memory space. Buffers are allocated and de-allocated dynamically.
I have a function that searches for a given address or address range within the allocated buffers to see if accesses to the memory model are in allocated space or not, and my first cut "search through all the buffers until you find a match" is slowing down our simulations by 10%. Our UUT does a lot of memory accesses that have to be vetted by the simulation.
So, I'm trying to optimize. The memory buffer objects contain a starting address and a length. I'm thinking about sorting the object array by starting address at object creation, and then, when the checking function is called, doing a binary search through the array looking to see if a given address falls within a start/end range.
Are there any better/faster ways to do this? There must be some faster/cooler algorithm out there using heaps or hash signatures or some-such, right?
Binary search through a sorted array works but makes allocation/deallocation slow.
A simple case is to make an ordered binary tree (red-black tree, AVR tree, etc.) indexed by the starting address, so that insertion (allocation), removal (deallocation) and searching are all O(log n). Most modern languages provide such data structure (e.g. C++'s std::map) already.
My first thought was also binary search and I think that it is a good idea. You should be able to insert and remove quickly too. Using a hash would just make you put the addresses in buckets (in my opinion) and then you'd get to the right bucket quickly (and then have to search through the bucket).
Basically your problem is that you have a defined intervals of "valid" memory, memory outside those intervals is "invalid", and you want to check for a given address whether it is inside a valid memory block or not.
You can definitely do this by storing the start addresses of all allocated blocks in a binary tree; then search for the largest address at or below the queried address, and just verify that the address falls within the length of the valid address. This gives you O(log n) query time where n = number of allocated blocks. The same query of course can be used also to actually the find the block itself, so you can also read the contents of the block at the given address, which I guess you'd need also.
However, this is not the most efficient scheme. Instead, you could use additionally one-dimensional spatial subdivision trees to mark invalid memory areas. For example, use a tree with branching factor of 256 (corresponding to 8 bits) that maps all those 16kB blocks that have only invalid addresses inside them to "1" and others to "0"; the tree will have only two levels and will be very efficient to query. When you see an address, first ask form this tree if it's certainly invalid; only when it's not, query the other one. This will speed things up ONLY IF YOU ACTUALLY GET LOTS OF INVALID MEMORY REFERENCES; if all the memory references are actually valid and you're just asserting, you won't save anything. But you can flip this idea also around and use the tree mark to all those 16kB or 256B blocks that have only valid addresses inside them; how big the tree grows depends on how your simulated memory allocator works.