Large 3D volume bad_alloc

I'm developing an application that builds a 3D Voronoi diagram from a 3D point cloud, using a dynamically allocated boost multi_array to store the whole diagram.
One of my test cases requires a large amount of memory (a volume of around [600][600][600]), which exceeds what can be allocated and results in bad_alloc.
I already tried splitting the diagram into smaller pieces, but that doesn't work either, as the total memory required is still over the limit.
My question is: how can I work with such a large 3D volume within the PC's constraints?
EDIT
The Element type is a struct as follows:
struct Elem {
    int R[3];
    int d;
    int label;
};
The elements are indexed in the multi_array by their position in 3D space.
The multi_array is constructed by setting specific points in the space from a file, and then filling the intermediate spaces by passing a forward and a backward mask over the whole space.

You didn't say how you get all your points. If you read them from a file, then don't read them all at once. If you compute them, then you can probably recompute them as needed. In both cases you can implement a cache that stores the most often used ones. If you know how your algorithm will use the data, then you can predict which values will be needed next. You can even do this prefetching in a different thread.
The second solution is to work on your data so it fits in RAM. You have 216 million points, but we don't know the size of a point. They are 3D, but do they use floats or doubles? Are they classes or plain structs? Do they have vtables? Are you using a Debug build (objects may be bigger in Debug)? Do you allocate the entire array at the beginning, or incrementally? There should be no problem storing 216M 3D points on a current PC, but it depends on the answers to all those questions.
The third way that comes to mind is to use memory-mapped files, though I have never used them personally.
Here are a few things to try:
Allocate in different batch layouts, like 1 * 216M, 1k * 216k, 1M * 216, to see how much memory you can get in each.
Change the boost multi_array to std::vector, or even a raw void* buffer, and compare the maximum RAM you can get.

You didn't mention the element type. Given a four-byte float element, a 600*600*600 matrix only takes about 860 MB, which is not very big actually. I'd suggest you check your operating system's limit on memory usage per process. On Linux, check it with ulimit -a.
If you really cannot allocate the matrix in memory, create a file of the desired size on disk, map it into memory using mmap, and pass the address returned by mmap to boost::multi_array_ref.

Related

DirectX11 - Buffer size of instanced vertices, with varying count

With DirectX/C++, suppose you are drawing the same model many times to the screen. You can do this with DrawIndexedInstanced(). You need to set the size of the instance buffer when you create it:
D3D11_BUFFER_DESC_instance.ByteWidth = sizeof(struct_with_instance_data)* instance_count;
If the instance_count can vary between a low and a high value, is it customary to create the buffer with the maximum value (max_instance_count) and only draw what is required?
Wouldn't that permanently use a lot of memory?
Is recreating the buffer too slow a solution?
What are good methods?
Thank you.
All methods have pros and cons.
Create max_instance_count: as you pointed out, you'll consume extra memory.
Create some initial count and implement exponential growth: you can still run out of memory at runtime, and growing large buffers may cause spikes in the profiler.
There's another way, which may or may not work for you depending on the application. Create a reasonably large fixed-size buffer; to render more instances, call DrawIndexedInstanced multiple times in a loop, replacing the data in the buffer between calls. This works well if the source data is generated at runtime from something else: you'll need to rework that part to produce fixed-size batches (except the last one) instead of one complete buffer. It won't work if the data in the buffer needs to persist across frames, e.g. if you update it with a compute shader.

How to determine the starting address of unused memory region in operating system?

I am working on a project involving huge objects in physical memory on Windows.
I wanted to create a really big data structure, but I ran into some problems.
When I try to allocate a huge amount of data, I can only create an object as large as the heap allows (this also depends on the operating system's architecture).
I am not sure whether this is restricted by the thread's private heap or by something else.
While looking into how the operating system places data in memory, I found that the data is stored in a particular order.
And here come some questions...
If I want to create large objects, should I have one very large heap region to allocate memory in? If so, I would have to fragment the data.
Alternatively, I had the idea of finding the starting addresses of empty regions and then using this unused space to store data in some data structure.
If this idea can be realized, how could it be done?
Another question: do you think a list would be the best option for that sort of huge object, or would another data structure be better?
Do you think the chosen data structure could be divided into two separate regions of data while still standing as one object?
Thanks in advance; every answer could be helpful.
There seems to be some kind of misconception about memory allocation here.
(1) Most operating systems do not allocate memory linearly. There usually are discontinuities in the memory mapped to a process address space.
(2) If you want to allocate a huge amount of memory, you should do it directly with the operating system; not through a heap.

Efficient way to grow a large std::vector

I'm filling a large (~100 to 1000MiB) std::vector<char> with data using std::ifstream::read(). I know the size of this read in advance, so I can construct the vector to that size.
Afterwards however, I keep reading from the file until I find a particular delimiter. Any data up to that delimiter is to be added to the vector. It's 500KiB at worst, usually much less.
Considering the vector's size, I'm wondering if this append causes an expensive growth (and reallocation). Memory is an issue here, as the vector's size should remain fairly close to that of its construction.
Is it a good solution to extend the vector's capacity slightly beyond its initial size using std::vector::reserve, so that the small amount of extra data doesn't force it to grow? If so, it's probably best to construct the vector empty, reserve the capacity, and then resize it to its initial size, right?

Space-efficient algorithm for tracking writes to 2^32 elements

This is a requirement I came across at work. We have 2^32 (4,294,967,296) contiguous integers allocated as an array in memory, whose function is to provide a mapping into another table. Some entries get written more often than others. We want to track the hot spots and provide an approximate histogram.
The catch is that this is going to be implemented in firmware, so not much memory can be used.
Information:
The mapping is for SCSI LBAs from the host to LBAs on the target, probably drives or flash memory.
Let's say we have 1 MB of space for the metadata required to track hot/cold information. How can we use it efficiently, beyond a simple bitmap that only shows whether an entry has been written? We can extend this and derive a mathematical statement of how accurate the collected data is as a function of how much memory is used for tracking.

Non-temporal stores of portions of a packed double vector using SSE/AVX

This piggybacks on a previous question that I had regarding fanning out the individual elements of an __m256d vector to different memory locations (a scatter operation). My code stores a lot of data to memory that isn't again accessed for a "long time." I would like to reduce the amount of cache pollution generated by all of these stores by using the non-temporal hint instructions. However, I can't come up with a good way to do this. Here's a summary of what my code looks like now:
__m256d src = ... // data
double *dst;
int dst_dist;
__m128d a = _mm256_extractf128_pd(src, 0);
__m128d b = _mm256_extractf128_pd(src, 1);
_mm_storel_pd(dst + 0*dst_dist, a);
_mm_storeh_pd(dst + 1*dst_dist, a);
_mm_storel_pd(dst + 2*dst_dist, b);
_mm_storeh_pd(dst + 3*dst_dist, b);
I would like to perform the 64-bit stores using the non-temporal hint, but there doesn't seem to be a way to do this directly from an XMM register. What would be the best way to accomplish this?
There is a good reason to avoid partial-register stores with the non-temporal hint. If you try to scatter many small pieces of data to completely unrelated memory locations, the CPU's write-combining buffers overflow and you get just ordinary writes through the caches (probably with an additional performance cost).
The correct way to use write combining (the non-temporal hint) is to fill up an entire cache line. So the usual approach is to combine the data pieces into a complete register, then write it out at once with MOVNTDQ.
You can store portions of an SSE vector with a non-temporal hint using the MASKMOVDQU instruction. The semantics don't map precisely onto your example, but it can be made to work. However, this instruction should generally only be used to avoid branching (and even then, it is usually better to use a select and a normal store). It's also simply a little cumbersome to use, since the address to store to is implicit in the instruction.
The operation that you're performing looks rather a lot like a piece of a matrix transpose (or 90 degree image rotation). Do you eventually store other data to the adjacent addresses? Is there some way you can modify your algorithm to batch up those stores and write complete vectors instead (possibly even by using contiguous writes to a small cacheable scratch buffer and doing some write-combining in software)?
