How to allocate 16GB of memory in Go? - memory-management

I'm using the following simple Go code to allocate a 3D array of size 1024x1024x1024:
grid := make([][][]TColor, 1024)
for x := 0; x < 1024; x++ {
    grid[x] = make([][]TColor, 1024)
    for y := 0; y < 1024; y++ {
        grid[x][y] = make([]TColor, 1024)
    }
}
That TColor struct is a 4-component float64 vector:
type TColor struct { R, G, B, A float64 }
Halfway (x=477 and y=~600ish) through the allocation, the inner-most make() call panics with... runtime: out of memory: cannot allocate 65536-byte block (17179869184 in use)
This works fine with lower grid resolutions, i.e. 256³, 128³, etc. Now, since the size of the struct is 4x4 bytes, that whole grid should require exactly 16 GB of memory. My machine (openSUSE 12.1, 64-bit) has 32 GB of addressable physical (i.e. not virtual) memory. Why can Go (weekly.2012-02-22) not allocate even half of this?

The struct has 4x8 bytes, not 4x4: each float64 is 8 bytes, so a TColor is 32 bytes and the full 1024³ grid needs 32 GB, not 16 GB.

In the current implementation of the Go language, on 64-bit CPUs the Go runtime reserves 16GB of virtual memory from the operating system. This limits the total memory used by a Go program to 16GB.
If you plan to use Go in projects that require large amounts of memory, you will need to edit the function runtime·mallocinit in the file malloc.goc and increase the value of the variable arena_size from 16GB to a bigger value (such as 32GB). After the edit, run
cd $GOROOT/src/pkg/runtime
go tool dist install -v
and then recompile your project.
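For orientation, the line in question looked roughly like this in the weekly runtimes of that era (quoted from memory, so verify against your own source tree):

// in $GOROOT/src/pkg/runtime/malloc.goc, inside runtime·mallocinit (approximate):
arena_size = 16LL<<30;  // raise to e.g. 32LL<<30 for a 32 GB arena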

Related

How to leverage multichannel RAM?

My objective
I want to write a RAM bandwidth benchmark
My solution
// Assumed aliases for the question's shorthand types; the original post
// presumably defines these and the profiler object elsewhere.
#include <cstdint>
#include <vector>
using U8  = std::uint8_t;
using U32 = std::uint32_t;
using U64 = std::uint64_t;

U64 benchRead(U64 &io_minLoopTicks) // io_minLoopTicks initial value is U64_MAX
{
    const U32 elementsCount = 100'000'000;
    U64 accumulator = 0;
    std::vector<U64> tab;
    tab.resize(elementsCount); // 800 MB
    for (U8 i = 0; i < 10; ++i) // loops to get a reliable value
    {
        const U64 startTimestamp = profiler.start(); // profiler uses QueryPerformanceCounter under the hood
        for (const U64 j : tab)
            accumulator += j; // do something with the RAM so the reads are not optimized away
        const U64 loopTicks = profiler.end() - startTimestamp;
        if (loopTicks < io_minLoopTicks)
            io_minLoopTicks = loopTicks;
    }
    return accumulator;
}
My problem
The theory would be 57.6 GB/s (8 bytes (bus width) * 2 (channels) * 3.6 GT/s).
In practice I get 27.7 GB/s (9700K @ 4.8 GHz, 4 single-rank 3600 MT/s DDR4 DIMMs).
So my number seems to reflect the performance of only one channel.
My guess is that, the way I coded it, the array sits in an area of RAM that can only be accessed by one channel. I am not familiar at all with multi-channel, so I don't really understand the limits of that technology.
My RAM is definitely dual channel (CPU-Z confirmed it).
Performance analysis
I checked the output of Clang 14 and it uses SSE (-O3 with no -march=native). I could be wrong (I'm not an ASM guy), but it seems it unrolled the loop and does 128 bytes per iteration. A loop iteration is 4.5 cycles on Coffee Lake according to uiCA, leading to 139.38 GB/s, way above the RAM's theoretical speed. So we should be RAM-limited.
My questions
Is my array stored in one bank, hence the fact that it can only be accessed by one channel?
How do we write code to leverage the benefits of multichannel RAM (here dual channel)? See the sketch after this list.
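One experiment worth trying (a minimal sketch, not from the original post; splitRead and the thread count are illustrative): split the read across several threads. A single core can usually keep only a limited number of cache misses in flight, which often caps per-core bandwidth well below the dual-channel maximum; memory is normally interleaved across channels at cache-line granularity, so a plain array is not "owned" by one channel.

#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Sum 'tab' using N threads, each reading a contiguous slice.
// If the multi-threaded figure approaches the dual-channel theoretical
// bandwidth while the single-threaded one does not, the bottleneck was
// the core, not the DIMM-to-channel mapping.
std::uint64_t splitRead(const std::vector<std::uint64_t>& tab, unsigned threadsCount)
{
    std::vector<std::uint64_t> partials(threadsCount, 0);
    std::vector<std::thread> workers;
    const std::size_t sliceSize = tab.size() / threadsCount;
    for (unsigned t = 0; t < threadsCount; ++t)
    {
        workers.emplace_back([&, t] {
            const std::size_t begin = t * sliceSize;
            const std::size_t end = (t + 1 == threadsCount) ? tab.size() : begin + sliceSize;
            std::uint64_t acc = 0;
            for (std::size_t i = begin; i < end; ++i)
                acc += tab[i];            // keep the reads observable
            partials[t] = acc;
        });
    }
    for (auto& w : workers)
        w.join();
    return std::accumulate(partials.begin(), partials.end(), std::uint64_t{0});
}

If the multi-threaded run approaches the theoretical 57.6 GB/s while the single-threaded one stays near 27.7 GB/s, the limit was the core, not the channel mapping.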
References
Godbolt

Allocate maximum memory available

I am currently trying to write a program that is supposed to allocate the maximum memory available. I came to a solution that narrows the range of potentially available memory until both borders are equal (see listing).
#include <cstdlib>
#include <iostream>

void enforceMemoryLeakage(void** arrayOfAllocMemory)
{
    unsigned int maxMemory = 0x80000000;
    unsigned int minMemory = 0x50000000;
    unsigned int attemptAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
    void* pAllocMemory;

    // Binary search: minMemory is the largest size known to succeed,
    // maxMemory the smallest size known to fail.
    while ((maxMemory - minMemory) > 1)
    {
        pAllocMemory = malloc(attemptAllocatedMemory);
        if (pAllocMemory != NULL)
        {
            minMemory = attemptAllocatedMemory;
            attemptAllocatedMemory += (maxMemory - minMemory) / 2;
            free(pAllocMemory);
        }
        else
        {
            maxMemory = attemptAllocatedMemory;
            attemptAllocatedMemory = minMemory + (maxMemory - minMemory) / 2;
        }
    }

    arrayOfAllocMemory[0] = malloc(maxMemory);

    void* pAllocAdditionalMemory = malloc(100);
    if (pAllocAdditionalMemory == NULL)
        std::cout << "Maximum memory: " << minMemory << "\n";
}
The code displayed above works fine. However, when the following commands are executed afterwards,
void* pAllocAdditionalMemory = malloc(100);
if (pAllocAdditionalMemory == NULL)
    std::cout << "Maximum memory: " << minMemory << "\n";
I would have expected that no further memory is available, yet the malloc(100) still succeeds. This brings me to my actual question: why does the approach shown above not work?
Best regards
Ratbald
You did not specify the OS or platform, so I am completely guessing here; read with extreme prejudice...
Assuming you do not have a bug in your binary search code, my bet is that you face memory fragmentation problems. As you successfully allocate and free memory during execution, other processes can do the same, so you might fragment your memory. Example:
the OS has a 2 MByte chunk of contiguous free memory
you allocate 1.5 MByte as a probe and free it again (one binary search step)
meanwhile some other process allocates 1 KByte somewhere inside that region
roughly 2 MByte is still free in total, but now as two fragments separated by the 1 KByte block
you attempt to allocate 1.75 MByte; although enough memory is free in total, the OS does not have 1.75 MByte in a single contiguous chunk, so the allocation fails
your binary search therefore converges on a value well below the total free memory
Depending on the OS's memory management strategies, you sometimes would not even need another process interfering to fragment memory...
There are, however, other possibilities not related to fragmentation. Under emulation or WOW64, the OS will not hand out all the available RAM, and there are also limits on the size of a single contiguous chunk. For example, Win32 will not allow a single block of much more than ~1.25 GByte, but that does not mean there is only 1.25 GByte of free RAM...
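One way to see this effect directly (an illustrative sketch, not part of the original answer): compare the largest single block against the total obtainable in many small pieces. Note that on systems with lazy overcommit, malloc may not fail at all, hence the safety cap.

#include <cstdio>
#include <cstdlib>
#include <vector>

// Allocate as many 1 MB blocks as possible; the total usually exceeds
// the largest single block obtainable, because free memory is scattered.
int main()
{
    const std::size_t blockSize = 1 << 20;
    std::vector<void*> blocks;
    void* p;
    while ((p = std::malloc(blockSize)) != nullptr)
    {
        blocks.push_back(p);
        if (blocks.size() >= 65536) break;   // safety cap at 64 GB
    }
    std::printf("Got %zu MB in 1 MB pieces\n", blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i)
        std::free(blocks[i]);
    return 0;
}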

CUDA my Tiled Matrix Multiplication using shared memory return 0 values at specific matrix size [duplicate]

I have the following matrix multiplication code, implemented using CUDA 3.2 and VS 2008. I am running on Windows server 2008 r2 enterprise. I am running a Nvidia GTX 480. The following code works fine with values of "Width" (Matrix width) up to about 2500 or so.
int size = Width*Width*sizeof(float);
float* Md, *Nd, *Pd;
cudaError_t err = cudaSuccess;
//Allocate Device Memory for M, N and P
err = cudaMalloc((void**)&Md, size);
err = cudaMalloc((void**)&Nd, size);
err = cudaMalloc((void**)&Pd, size);
//Copy Matrix from Host Memory to Device Memory
err = cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
err = cudaMemcpy(Nd, N, size, cudaMemcpyHostToDevice);
//Setup the execution configuration
dim3 dimBlock(TileWidth, TileWidth, 1);
dim3 dimGrid(ceil((float)(Width)/TileWidth), ceil((float)(Width)/TileWidth), 1);
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
//Free Device Memory
cudaFree(Md);
cudaFree(Nd);
cudaFree(Pd);
When I set the "Width" to 3000 or greater, I get the following error after a black screen:
I looked online and saw that some people had this issue because the watchdog was killing the kernel after it hangs for more than 5 seconds. I tried editing the "TdrDelay" value in the registry; this delayed the time before the black screen, but the same error appeared. So I concluded this was not my issue.
I debugged into my code and found this line to be the culprit:
err = cudaMemcpy(P, Pd, size, cudaMemcpyDeviceToHost);
This is what I use to return my result set from the device after my matrix multiplication kernel function is called. Everything up until this point seems to run fine. I believe I am allocating memory correctly and cannot figure out why this is happening. I thought maybe I didn't have enough memory on my card for this but then shouldn't cudaMalloc have returned an error? (I confirmed it didn't while debugging).
Any ideas/assistance would be greatly appreciated!... Thanks a lot guys!!
Kernel code:
//Matrix Multiplication Kernel - Multi-Block Implementation
__global__ void MatrixMultiplicationMultiBlock_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;

    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + threadIdx.y;
    int Column = (TileWidth*blockIdx.x) + threadIdx.x;

    //Pvalue store the Pd element that is computed by the thread
    float Pvalue = 0;
    for (int i = 0; i < Width; ++i)
    {
        float Mdelement = Md[Row * Width + i];
        float Ndelement = Nd[i * Width + Column];
        Pvalue += Mdelement * Ndelement;
    }

    //Write the matrix to device memory each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
I also have this other function that uses shared memory, and it also gives the same error:
Call:
MatrixMultiplicationSharedMemory_Kernel<<<dimGrid, dimBlock, sizeof(float)*TileWidth*TileWidth*2>>>(Md, Nd, Pd, Width);
Kernel code:
//Matrix Multiplication Kernel - Shared Memory Implementation
__global__ void MatrixMultiplicationSharedMemory_Kernel (float* Md, float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;

    //Initialize shared memory
    extern __shared__ float sharedArrays[];
    float* Mds = (float*) &sharedArrays;
    float* Nds = (float*) &Mds[TileWidth*TileWidth];

    int tx = threadIdx.x;
    int ty = threadIdx.y;

    //Get row and column from block and thread ids
    int Row = (TileWidth*blockIdx.y) + ty;
    int Column = (TileWidth*blockIdx.x) + tx;
    float Pvalue = 0;

    //For each tile, load the element into shared memory
    for( int i = 0; i < ceil((float)Width/TileWidth); ++i)
    {
        Mds[ty*TileWidth+tx] = Md[Row*Width + (i*TileWidth + tx)];
        Nds[ty*TileWidth+tx] = Nd[(ty + (i * TileWidth))*Width + Column];
        __syncthreads();
        for( int j = 0; j < TileWidth; ++j)
        {
            Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
        }
        __syncthreads();
    }

    //Write the matrix to device memory each thread writes one element
    Pd[Row * Width + Column] = Pvalue;
}
Controlling the WDDM Timeout
The problem is actually the kernel, not the cudaMemcpy(). When you launch the kernel, the GPU goes off and does the work asynchronously with the CPU, so it is only when you synchronize with the GPU that you have to wait for the work to finish. cudaMemcpy() involves an implicit synchronization, hence that is where you see the problem.
You could double-check this by calling cudaThreadSynchronize() after the kernel and the problem will appear to be on the cudaThreadSynchronize() instead of the cudaMemcpy().
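For example (a sketch using the CUDA 3.2-era API from the question; newer toolkits replace cudaThreadSynchronize() with cudaDeviceSynchronize()):

MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
cudaError_t syncErr = cudaThreadSynchronize();   // forces the wait to happen here
if (syncErr != cudaSuccess)
    printf("Kernel failed: %s\n", cudaGetErrorString(syncErr));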
After changing the TDR timeout, did you restart your machine? Unfortunately Windows needs to be restarted to change the TDR settings. This Microsoft document has a fairly good description of the full settings available.
Kernel problems
In this case the problem is not actually the WDDM timeout. There are errors in the kernel which you would need to resolve (for example, you should be able to increment i by more than one on each iteration), and checking out the matrixMul sample in the SDK may be useful. Incidentally, I hope this is a learning exercise, since in reality you would be better off (for performance) using CUBLAS to perform matrix multiplication.
The most critical problem in the code is that you are using shared memory without actually allocating any. In your kernel you have:
//Initialize shared memory
extern __shared__ float sharedArrays[];
But when you launch the kernel you do not specify how much shared memory to allocate for each block:
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock>>>(Md, Nd, Pd, Width);
The <<<>>> syntax actually takes four arguments where the third and fourth are optional. The fourth is the stream index which is used to get overlap between compute and data transfer (and for concurrent kernel execution) but the third argument specifies the amount of shared memory per block. In this case I assume you want to store TileWidth * TileWidth floats in the shared memory, so you would use:
MatrixMultiplicationMultiBlock_Kernel<<<dimGrid, dimBlock, dimBlock.x * dimBlock.x * sizeof(float)>>>(Md, Nd, Pd, Width);
The main problem
As you mention in your comment, the actual problem was that your matrix width was not a multiple of the block width (and height, since it is square), meaning the threads beyond the end would access memory beyond the end of the array. The code should either handle the non-multiple case or ensure that the width is a multiple of the block size.
I should have suggested this earlier, but it is often useful to run cuda-memcheck to check for memory access violations like this.
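For reference, a sketch of what the bounds-checked shared-memory kernel could look like (hedged: this is an illustration built from the question's kernel, not the poster's final code; launch it with 2 * TileWidth * TileWidth * sizeof(float) bytes of shared memory, as in the question's own shared-memory call):

// Threads outside the matrix still participate in __syncthreads(),
// but load zeros and skip the final store.
__global__ void MatrixMulGuarded (const float* Md, const float* Nd, float* Pd, int Width)
{
    int TileWidth = blockDim.x;
    extern __shared__ float sharedArrays[];
    float* Mds = sharedArrays;
    float* Nds = &sharedArrays[TileWidth*TileWidth];

    int tx = threadIdx.x;
    int ty = threadIdx.y;
    int Row = (TileWidth*blockIdx.y) + ty;
    int Column = (TileWidth*blockIdx.x) + tx;
    float Pvalue = 0.0f;

    int numTiles = (Width + TileWidth - 1) / TileWidth;
    for (int i = 0; i < numTiles; ++i)
    {
        int mCol = i*TileWidth + tx;
        int nRow = i*TileWidth + ty;
        Mds[ty*TileWidth+tx] = (Row < Width && mCol < Width) ? Md[Row*Width + mCol] : 0.0f;
        Nds[ty*TileWidth+tx] = (nRow < Width && Column < Width) ? Nd[nRow*Width + Column] : 0.0f;
        __syncthreads();
        for (int j = 0; j < TileWidth; ++j)
            Pvalue += Mds[ty*TileWidth+j] * Nds[j*TileWidth+tx];
        __syncthreads();
    }

    if (Row < Width && Column < Width)
        Pd[Row * Width + Column] = Pvalue;
}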
You have to change the driver timeout settings; this is a Windows feature that prevents faulty drivers from making the system unresponsive.
Check the Microsoft page describing how to do that.
You should also check the "timeout" flag setting on your GPU Device. If you have the CUDA SDK installed, I believe the "deviceQuery" app will report this property.

Optimize Cuda Kernel time execution

I'm a student learning CUDA, and I would like to optimize the execution time of my kernel function. To that end, I wrote a short program computing the difference between two pictures, and I compared the execution time between a classic CPU execution in C and a GPU execution in CUDA C.
Here you can find the code I'm talking about:
int *imgresult_data = (int *) malloc(width*height*sizeof(int));
int size = width*height;

switch(computing_type)
{
case GPU:
    HANDLE_ERROR(cudaMalloc((void**)&dev_data1, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data2, size*sizeof(unsigned char)));
    HANDLE_ERROR(cudaMalloc((void**)&dev_data_res, size*sizeof(int)));

    HANDLE_ERROR(cudaMemcpy(dev_data1, img1_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data2, img2_data, size*sizeof(unsigned char), cudaMemcpyHostToDevice));
    HANDLE_ERROR(cudaMemcpy(dev_data_res, imgresult_data, size*sizeof(int), cudaMemcpyHostToDevice));

    float time;
    cudaEvent_t start, stop;
    HANDLE_ERROR( cudaEventCreate(&start) );
    HANDLE_ERROR( cudaEventCreate(&stop) );
    HANDLE_ERROR( cudaEventRecord(start, 0) );

    for(int m = 0; m < nb_loops ; m++)
    {
        diff<<<height, width>>>(dev_data1, dev_data2, dev_data_res);
    }

    HANDLE_ERROR( cudaEventRecord(stop, 0) );
    HANDLE_ERROR( cudaEventSynchronize(stop) );
    HANDLE_ERROR( cudaEventElapsedTime(&time, start, stop) );

    HANDLE_ERROR(cudaMemcpy(imgresult_data, dev_data_res, size*sizeof(int), cudaMemcpyDeviceToHost));

    printf("Time to generate: %4.4f ms \n", time/nb_loops);
    break;

case CPU:
    clock_t begin = clock(), diff;
    for (int z=0; z<nb_loops; z++)
    {
        // Apply the difference between 2 images
        for (int i = 0; i < height; i++)
        {
            tmp = i*imgresult_pitch;
            for (int j = 0; j < width; j++)
            {
                imgresult_data[j + tmp] = (int) img2_data[j + tmp] - (int) img1_data[j + tmp];
            }
        }
    }
    diff = clock() - begin;
    float msec = diff*1000/CLOCKS_PER_SEC;
    msec = msec/nb_loops;
    printf("Time taken %4.4f milliseconds", msec);
    break;
}
And here is my kernel function:
__global__ void diff(unsigned char *data1, unsigned char *data2, int *data_res)
{
    int row = blockIdx.x;
    int col = threadIdx.x;
    int v = col + row*blockDim.x;

    if (row < MAX_H && col < MAX_W)
    {
        data_res[v] = (int) data2[v] - (int) data1[v];
    }
}
I obtained these execution times for each one:
CPU: 1.3210 ms
GPU: 0.3229 ms
I wonder why the GPU result is not as low as it should be. I am a beginner in CUDA, so please be understanding if I have made some classic errors.
EDIT1:
Thank you for your feedback. I tried to remove the 'if' condition from the kernel, but it didn't change my program's execution time much.
However, after installing the CUDA profiler, it told me that my threads weren't running concurrently. I don't understand why I get this kind of message, but it seems true, because my application is only 5 or 6 times faster with the GPU than with the CPU. This ratio should be greater, because each thread is supposed to process one pixel concurrently with all the other ones. If you have an idea of what I am doing wrong, it would be helpful...
Flow.
Here are two things you could do which may improve the performance of your diff kernel:
1. Let each thread do more work
In your kernel, each thread handles just a single element; but having a thread do anything at all already carries a bunch of overhead, at the block and the thread level, including obtaining the parameters, checking the condition and doing address arithmetic. Now, you could say "Oh, but the reads and writes take much more time than that; this overhead is negligible"; but you would be ignoring the fact that the latency of these reads and writes is hidden by the presence of many other warps which may be scheduled to do their work.
So, let each thread process more than a single element. Say, 4, as each thread can easily read 4 bytes at once into a register. Or even 8 or 16; experiment with it. Of course you'll need to adjust your grid and block parameters accordingly.
2. "Restrict" your pointers
__restrict is not part of C++, but it is supported in CUDA. It tells the compiler that accesses through different pointers passed to the function never overlap. See:
What does the restrict keyword mean in C++?
Realistic usage of the C99 'restrict' keyword?
Using it allows the CUDA compiler to apply additional optimizations, e.g. loading or storing data via non-coherent cache. Indeed, this happens with your kernel although I haven't measured the effects.
3. Consider using a "SIMD" instruction
CUDA offers this intrinsic:
__device__ unsigned int __vsubss4(unsigned int a, unsigned int b)
which subtracts each signed byte value in b from its corresponding one in a, with saturation. If you can "live" with the result, rather than expecting a larger int variable, that could save you some work, and it goes very well with increasing the number of elements per thread. In fact, it might let you increase it even further to get to the optimum.
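A sketch of how that might look for the diff kernel from the question (diff4, n4 and the byte-sized result buffer are illustrative, not from the original post; it assumes the pixel count is a multiple of 4 and the buffers are 4-byte aligned):

// Each thread handles four bytes at once via __vsubss4; the result is
// stored as saturated signed bytes rather than widened ints.
__global__ void diff4(const unsigned char* __restrict__ data1,
                      const unsigned char* __restrict__ data2,
                      unsigned char* __restrict__ data_res, int n4)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4)
    {
        const unsigned int* a = reinterpret_cast<const unsigned int*>(data1);
        const unsigned int* b = reinterpret_cast<const unsigned int*>(data2);
        unsigned int* r = reinterpret_cast<unsigned int*>(data_res);
        r[i] = __vsubss4(b[i], a[i]); // four saturating byte subtractions
    }
}

It would be launched over size/4 work items, e.g. diff4<<<(size/4 + 255)/256, 256>>>(dev_data1, dev_data2, dev_res_bytes, size/4).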
I don't think you are measuring times correctly; the memory copy is a time-consuming step on the GPU that you should take into account when measuring your time.
I see some details that you can test:
I suppose you are using MAX_H and MAX_W as constants; you may consider setting them with cudaMemcpyToSymbol().
Remember to sync your threads using __syncthreads(), so you don't get issues between each loop iteration.
CUDA works with warps of 32 threads, so blocks and the number of threads per block work better as multiples of 32, but not larger than 512 threads per block unless your hardware supports more. Here is an example using 128 threads per block: <<<(cols*rows+127)/128,128>>>.
Remember as well to free your allocated GPU memory and to destroy the timing events you created.
In your kernel function you can have a single variable int v = threadIdx.x + blockIdx.x * blockDim.x .
Have you tested, besides the execution time, that your result is correct? I think you should use cudaMallocPitch() and cudaMemcpy2D() while working with 2D arrays, due to padding.
Probably there are other issues with the code, but here's what I see. The following lines in __global__ void diff are not optimal:
if (row < MAX_H && col < MAX_W)
{
data_res[v] = (int) data2[v] - (int) data1[v];
}
Conditional operators inside a kernel result in warp divergence. It means that if and else parts inside a warp are executed in sequence, not in parallel. Also, as you might have realized, if evaluates to false only at borders. To avoid the divergence and needless computation, split your image in two parts:
Central part where row < MAX_H && col < MAX_W is always true. Create an additional kernel for this area. if is unnecessary here.
Border areas that will use your diff kernel.
Obviously you'll have to modify the code that calls the kernels.
And on a separate note:
GPUs have a throughput-oriented architecture, not a latency-oriented one like CPUs. That means a CPU may be faster than CUDA when it comes to processing small amounts of data. Have you tried using large data sets?
The CUDA Profiler is a very handy tool that will tell you where your code is not optimal.

Largest allocatable array in VC++ 2010 [duplicate]

Is there a max length for an array in C++?
Is it a C++ limit or does it depend on my machine? Is it tweakable? Does it depend on the type the array is made of?
Can I break that limit somehow or do I have to search for a better way of storing information? And what should be the simplest way?
What I have to do is store long long ints in an array; I'm working in a Linux environment. My question is: what do I have to do if I need to store an array of N long long integers with N > 10 digits?
I need this because I'm writing some cryptographic algorithms (for example the p-Pollard) for school, and hit this wall of integer sizes and array-length representation.
Nobody mentioned the limit on the size of the stack frame.
There are two places memory can be allocated:
On the heap (dynamically allocated memory).
The size limit here is a combination of available hardware and the OS's ability to simulate space by using other devices to temporarily store unused data (i.e. move pages to hard disk).
On the stack (Locally declared variables).
The size limit here is compiler defined (with possible hardware limits). If you read the compiler documentation you can often tweak this size.
Thus if you allocate an array dynamically, the limit is large (and described in detail by other posts):
int* a1 = new int[SIZE]; // SIZE limited only by OS/Hardware
Alternatively if the array is allocated on the stack then you are limited by the size of the stack frame. N.B. vectors and other containers have a small presence in the stack but usually the bulk of the data will be on the heap.
int a2[SIZE]; // SIZE limited by COMPILER to the size of the stack frame
There are two limits, both enforced not by C++ but by the hardware.
The first limit (which should never be reached) is set by the restrictions of the size type used to describe an index into the array (and its size). It is given by the maximum value the system's std::size_t can take, a data type always large enough to contain the size in bytes of any object.
The other limit is a physical memory limit. The larger your objects in the array are, the sooner this limit is reached because memory is full. For example, a vector<int> of a given size n typically takes multiple times as much memory as an array of type vector<char> (minus a small constant value), since int is usually bigger than char. Therefore, a vector<char> may contain more items than a vector<int> before memory is full. The same counts for raw C-style arrays like int[] and char[].
Additionally, this upper limit may be influenced by the type of allocator used to construct the vector, because an allocator is free to manage memory any way it wants. A very odd but nonetheless conceivable allocator could pool memory in such a way that identical instances of an object share resources. This way, you could insert a lot of identical objects into a container that would otherwise use up all the available memory.
Apart from that, C++ doesn't enforce any limits.
Looking at it from a practical rather than theoretical standpoint, on a 32 bit Windows system, the maximum total amount of memory available for a single process is 2 GB. You can break the limit by going to a 64 bit operating system with much more physical memory, but whether to do this or look for alternatives depends very much on your intended users and their budgets. You can also extend it somewhat using PAE.
The type of the array is very important, as default structure alignment on many compilers is 8 bytes, which is very wasteful if memory usage is an issue. If you are using Visual C++ to target Windows, check out the #pragma pack directive as a way of overcoming this.
Another thing to do is look at what in memory compression techniques might help you, such as sparse matrices, on the fly compression, etc... Again this is highly application dependent. If you edit your post to give some more information as to what is actually in your arrays, you might get more useful answers.
Edit: Given a bit more information on your exact requirements, your storage needs appear to be between 7.6 GB and 76 GB uncompressed, which would require a rather expensive 64-bit box to store as an array in memory in C++. It raises the question of why you want to store the data in memory, where one presumes it is for speed of access and to allow random access. The best way to store this data outside of an array depends largely on how you want to access it. If you need to access array members randomly, for most applications there tend to be ways of grouping clumps of data that tend to get accessed at the same time. For example, in large GIS and spatial databases, data often gets tiled by geographic area. In C++ programming terms you can override the [] array operator to fetch portions of your data from external storage as required.
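As an illustration of that last idea (a sketch under stated assumptions: all names are hypothetical, the element type is long long as in the question, and error handling plus the partial final chunk are omitted):

#include <cstddef>
#include <fstream>
#include <vector>

// An array-like class that keeps one fixed-size chunk of a large
// on-disk file cached in RAM and pages chunks in on demand.
class DiskBackedArray {
    static constexpr std::size_t kChunkElems = 1 << 20;   // 8 MB of long long per chunk
    mutable std::fstream file_;
    mutable std::vector<long long> chunk_;
    mutable std::size_t loaded_;                          // index of the chunk currently in RAM

    void load(std::size_t chunkIdx) const {
        file_.seekg(static_cast<std::streamoff>(chunkIdx) * kChunkElems * sizeof(long long));
        file_.read(reinterpret_cast<char*>(chunk_.data()),
                   chunk_.size() * sizeof(long long));
        loaded_ = chunkIdx;
    }
public:
    explicit DiskBackedArray(const char* path)
        : file_(path, std::ios::in | std::ios::binary),
          chunk_(kChunkElems), loaded_(static_cast<std::size_t>(-1)) {}

    long long operator[](std::size_t i) const {           // read-only access
        std::size_t chunkIdx = i / kChunkElems;
        if (chunkIdx != loaded_) load(chunkIdx);           // page the right chunk in
        return chunk_[i % kChunkElems];
    }
};

Usage would look like ordinary array indexing, e.g. DiskBackedArray a("huge.bin"); long long v = a[1234567]; with sequential access patterns keeping most reads in the cached chunk.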
As annoyingly non-specific as all the current answers are, they're mostly right but with many caveats, not always mentioned. The gist is, you have two upper-limits, and only one of them is something actually defined, so YMMV:
1. Compile-time limits
Basically, what your compiler will allow. For Visual C++ 2017 on an x64 Windows 10 box, this is my max limit at compile-time before incurring the 2GB limit,
unsigned __int64 max_ints[255999996]{0};
If I did this instead,
unsigned __int64 max_ints[255999997]{0};
I'd get:
Error C1126 automatic allocation exceeds 2G
I'm not sure how 2G correlates to 255999996/7. I googled both numbers, and the only thing I could find that was possibly related was this *nix Q&A about a precision issue with dc. Either way, it doesn't appear to matter which type of int array you're trying to fill, just how many elements can be allocated.
2. Run-time limits
Your stack and heap have their own limitations. These limits are both values that change based on available system resources, as well as how "heavy" your app itself is. For example, with my current system resources, I can get this to run:
int main()
{
int max_ints[257400]{ 0 };
return 0;
}
But if I tweak it just a little bit...
int main()
{
int max_ints[257500]{ 0 };
return 0;
}
Bam! Stack overflow!
Exception thrown at 0x00007FF7DC6B1B38 in memchk.exe: 0xC00000FD:
Stack overflow (parameters: 0x0000000000000001, 0x000000AA8DE03000).
Unhandled exception at 0x00007FF7DC6B1B38 in memchk.exe: 0xC00000FD:
Stack overflow (parameters: 0x0000000000000001, 0x000000AA8DE03000).
And just to detail the whole heaviness of your app point, this was good to go:
int main()
{
int maxish_ints[257000]{ 0 };
int more_ints[400]{ 0 };
return 0;
}
But this caused a stack overflow:
int main()
{
int maxish_ints[257000]{ 0 };
int more_ints[500]{ 0 };
return 0;
}
I would agree with the above that if you're initializing your array with
int myArray[SIZE]
then SIZE is limited by the size of an integer. But you can always malloc a chunk of memory and have a pointer to it, as big as you want, so long as malloc doesn't return NULL.
To summarize the responses, extend them, and to answer your question directly:
No, C++ does not impose any limits for the dimensions of an array.
But as the array has to be stored somewhere in memory, memory-related limits imposed by other parts of the computer system apply. Note that these limits do not directly relate to the dimensions (= number of elements) of the array, but rather to its size (= amount of memory taken). The dimensions (D) and the in-memory size (S) of an array are not the same; they are related by the memory taken by a single element (E): S = D * E.
Now E depends on:
the type of the array elements (elements can be smaller or bigger)
memory alignment (to increase performance, elements are placed at addresses which are multiples of some value, which introduces ‘wasted space’ (padding) between elements)
the size of static parts of objects (in object-oriented programming, static components of objects of the same type are stored only once, independent of the number of such same-type objects)
Also note that you generally get different memory-related limitations by allocating the array data on the stack (as an automatic variable: int t[N]), on the heap (dynamic allocation with malloc()/new or using STL mechanisms), or in the static part of process memory (as a static variable: static int t[N]). Even when allocating on the heap, you still need some tiny amount of stack memory to store references to the heap-allocated blocks of memory (but this is negligible, usually).
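To make the three cases concrete, a trivial sketch:

#include <cstddef>

static int t_static[1000];            // static storage: reserved at program start

int main()
{
    int t_stack[1000];                // automatic: limited by the stack size
    int* t_heap = new int[1000];      // dynamic: limited by heap/virtual memory
    delete[] t_heap;
    (void)t_stack;                    // silence unused-variable warnings
    (void)t_static;
    return 0;
}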
The size of the size_t type has no influence on the programmer (I assume the programmer uses size_t for indexing, as it is designed for that), since the compiler provider has to typedef it to an integer type big enough to address the maximal amount of memory possible for the given platform architecture.
The sources of the memory-size limitations stem from
amount of memory available to the process (which is limited to 2^32 bytes for 32-bit applications, even on 64-bit OS kernels),
the division of process memory (e.g. the amount of process memory designated for the stack or the heap),
the fragmentation of physical memory (many scattered small free memory fragments cannot hold one monolithic structure),
amount of physical memory,
and the amount of virtual memory.
They cannot be ‘tweaked’ at the application level, but you are free to use a different compiler (to change stack size limits), port your application to 64 bits, port it to another OS, or change the physical/virtual memory configuration of the (virtual? physical?) machine.
It is not uncommon (and even advisable) to treat all the above factors as external disturbances and thus as possible sources of runtime errors, and to carefully check and react to memory-allocation related errors in your program code.
So finally: while C++ does not impose any limits, you still have to check for adverse memory-related conditions when running your code... :-)
As many excellent answers noted, there are a lot of limits that depend on your C++ compiler version, operating system and computer characteristics. However, here is a Python script that checks the limit on your machine.
It uses binary search, and on each iteration checks whether the middle size is feasible by generating code that attempts to create an array of that size. The script tries to compile it (sorry, this part works only on Linux) and adjusts the binary search bounds depending on the success. Check it out:
import os

cpp_source = 'int a[{}]; int main() {{ return 0; }}'

def check_if_array_size_compiles(size):
    # Write the source to 1.cpp
    f = open('1.cpp', 'w')
    f.write(cpp_source.format(size))
    f.close()
    # Attempt to compile it
    os.system('g++ 1.cpp 2> errors')
    # Read the errors file
    errors = open('errors', 'r').read()
    # Return True if there were no errors
    return len(errors) == 0

# Binary search: try to compile an array of size m and adjust the
# l and r borders depending on whether we succeeded or not
l = 0
r = 10 ** 50
while r - l > 1:
    m = (r + l) // 2
    if check_if_array_size_compiles(m):
        l = m
    else:
        r = m

answer = l + check_if_array_size_compiles(r)
print('{} is the maximum available length'.format(answer))
You can save it to your machine and launch it, and it will print the maximum size you can create. For my machine it is 2305843009213693951.
I'm surprised the max_size() member function of std::vector has not been mentioned here.
"Returns the maximum number of elements the container is able to hold due to system or library implementation limitations, i.e. std::distance(begin(), end()) for the largest container."
We know that std::vector is implemented as a dynamic array underneath the hood, so max_size() should give a very close approximation of the maximum length of a dynamic array on your machine.
The following program builds a table of approximate maximum array length for various data types.
#include <iostream>
#include <vector>
#include <string>
#include <limits>
template <typename T>
std::string mx(T e) {
std::vector<T> v;
return std::to_string(v.max_size());
}
std::size_t maxColWidth(std::vector<std::string> v) {
std::size_t maxWidth = 0;
for (const auto &s: v)
if (s.length() > maxWidth)
maxWidth = s.length();
// Add 2 for space on each side
return maxWidth + 2;
}
constexpr long double maxStdSize_t = std::numeric_limits<std::size_t>::max();
// cs stands for compared to std::size_t
template <typename T>
std::string cs(T e) {
std::vector<T> v;
long double maxSize = v.max_size();
long double quotient = maxStdSize_t / maxSize;
return std::to_string(quotient);
}
int main() {
bool v0 = 0;
char v1 = 0;
int8_t v2 = 0;
int16_t v3 = 0;
int32_t v4 = 0;
int64_t v5 = 0;
uint8_t v6 = 0;
uint16_t v7 = 0;
uint32_t v8 = 0;
uint64_t v9 = 0;
std::size_t v10 = 0;
double v11 = 0;
long double v12 = 0;
std::vector<std::string> types = {"data types", "bool", "char", "int8_t", "int16_t",
"int32_t", "int64_t", "uint8_t", "uint16_t",
"uint32_t", "uint64_t", "size_t", "double",
"long double"};
std::vector<std::string> sizes = {"approx max array length", mx(v0), mx(v1), mx(v2),
mx(v3), mx(v4), mx(v5), mx(v6), mx(v7), mx(v8),
mx(v9), mx(v10), mx(v11), mx(v12)};
std::vector<std::string> quotients = {"max std::size_t / max array size", cs(v0),
cs(v1), cs(v2), cs(v3), cs(v4), cs(v5), cs(v6),
cs(v7), cs(v8), cs(v9), cs(v10), cs(v11), cs(v12)};
std::size_t max1 = maxColWidth(types);
std::size_t max2 = maxColWidth(sizes);
std::size_t max3 = maxColWidth(quotients);
for (std::size_t i = 0; i < types.size(); ++i) {
while (types[i].length() < (max1 - 1)) {
types[i] = " " + types[i];
}
types[i] += " ";
for (int j = 0; sizes[i].length() < max2; ++j)
sizes[i] = (j % 2 == 0) ? " " + sizes[i] : sizes[i] + " ";
for (int j = 0; quotients[i].length() < max3; ++j)
quotients[i] = (j % 2 == 0) ? " " + quotients[i] : quotients[i] + " ";
std::cout << "|" << types[i] << "|" << sizes[i] << "|" << quotients[i] << "|\n";
}
std::cout << std::endl;
std::cout << "N.B. max std::size_t is: " <<
std::numeric_limits<std::size_t>::max() << std::endl;
return 0;
}
On my macOS (clang version 5.0.1), I get the following:
| data types | approx max array length | max std::size_t / max array size |
| bool | 9223372036854775807 | 2.000000 |
| char | 9223372036854775807 | 2.000000 |
| int8_t | 9223372036854775807 | 2.000000 |
| int16_t | 9223372036854775807 | 2.000000 |
| int32_t | 4611686018427387903 | 4.000000 |
| int64_t | 2305843009213693951 | 8.000000 |
| uint8_t | 9223372036854775807 | 2.000000 |
| uint16_t | 9223372036854775807 | 2.000000 |
| uint32_t | 4611686018427387903 | 4.000000 |
| uint64_t | 2305843009213693951 | 8.000000 |
| size_t | 2305843009213693951 | 8.000000 |
| double | 2305843009213693951 | 8.000000 |
| long double | 1152921504606846975 | 16.000000 |
N.B. max std::size_t is: 18446744073709551615
On ideone gcc 8.3 I get:
| data types | approx max array length | max std::size_t / max array size |
| bool | 9223372036854775744 | 2.000000 |
| char | 18446744073709551615 | 1.000000 |
| int8_t | 18446744073709551615 | 1.000000 |
| int16_t | 9223372036854775807 | 2.000000 |
| int32_t | 4611686018427387903 | 4.000000 |
| int64_t | 2305843009213693951 | 8.000000 |
| uint8_t | 18446744073709551615 | 1.000000 |
| uint16_t | 9223372036854775807 | 2.000000 |
| uint32_t | 4611686018427387903 | 4.000000 |
| uint64_t | 2305843009213693951 | 8.000000 |
| size_t | 2305843009213693951 | 8.000000 |
| double | 2305843009213693951 | 8.000000 |
| long double | 1152921504606846975 | 16.000000 |
N.B. max std::size_t is: 18446744073709551615
It should be noted that this is a theoretical limit and that on most computers, you will run out of memory far before you reach this limit. For example, we see that for type char on gcc, the maximum number of elements is equal to the max of std::size_t. Trying this, we get the error:
prog.cpp: In function ‘int main()’:
prog.cpp:5:61: error: size of array is too large
char* a1 = new char[std::numeric_limits<std::size_t>::max()];
Lastly, as #MartinYork points out, for static arrays the maximum size is limited by the size of your stack.
One thing I don't think has been mentioned in the previous answers:
I'm always sensing a "bad smell" in the refactoring sense when people use such things in their design.
That's a huge array, and possibly not the best way to represent your data, from both an efficiency and a performance point of view.
cheers,
Rob
If you have to deal with data that large, you'll need to split it up into manageable chunks. It won't all fit into memory on any small computer. You can probably
load a portion of the data from disk (whatever reasonably fits), perform your calculations and changes on it, store it back to disk, then repeat until complete; see the sketch below.
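A minimal sketch of that loop (all names illustrative; the real computation replaces the += 1):

#include <cstddef>
#include <cstdio>
#include <vector>

// Process a large file of long long values one fixed-size buffer at a time.
void processInChunks(const char* inPath, const char* outPath)
{
    const std::size_t kChunk = 1 << 20;            // elements per pass (8 MB)
    std::vector<long long> buf(kChunk);
    FILE* in = std::fopen(inPath, "rb");
    FILE* out = std::fopen(outPath, "wb");
    if (!in || !out) return;                       // error handling kept minimal
    std::size_t got;
    while ((got = std::fread(buf.data(), sizeof(long long), kChunk, in)) > 0)
    {
        for (std::size_t i = 0; i < got; ++i)
            buf[i] += 1;                           // stand-in for the real computation
        std::fwrite(buf.data(), sizeof(long long), got, out);
    }
    std::fclose(in);
    std::fclose(out);
}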
As has already been pointed out, array size is limited by your hardware and your OS (man ulimit). Your software though, may only be limited by your creativity. For example, can you store your "array" on disk? Do you really need long long ints? Do you really need a dense array? Do you even need an array at all?
One simple solution would be to use 64-bit Linux. Even if you do not physically have enough RAM for your array, the OS will allow you to allocate memory as if you do, since the virtual memory available to your process is likely much larger than the physical memory. If you really need to access everything in the array, this amounts to storing it on disk. Depending on your access patterns, there may be more efficient ways of doing this (i.e. using mmap() as sketched below, or simply storing the data sequentially in a file, in which case 32-bit Linux would suffice).
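A minimal sketch of the mmap() route on 64-bit Linux (illustrative: bigarray.bin and the sizes are made up; the kernel pages data in and out on demand):

#include <cstddef>
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main()
{
    const std::size_t n = 1ULL << 31;               // 2^31 elements = 16 GB of long long
    int fd = open("bigarray.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }
    if (ftruncate(fd, n * sizeof(long long)) != 0) { perror("ftruncate"); return 1; }
    long long* a = static_cast<long long*>(
        mmap(nullptr, n * sizeof(long long),
             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0));
    if (a == MAP_FAILED) { perror("mmap"); return 1; }
    a[n - 1] = 42;                                  // pages are materialized only when touched
    munmap(a, n * sizeof(long long));
    close(fd);
    return 0;
}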
I would get around this by making a 2D dynamic array:
long long** a = new long long*[x];
for (unsigned i = 0; i < x; i++) a[i] = new long long[y];
More on this here: https://stackoverflow.com/a/936702/3517001
