I have a very simple example of a strange segfault I am having and it is as follows:
program big_array_segfault
integer :: nX = 13000
integer :: nY = 100000
real(kind = 8), allocatable :: bigarr(:,:)
allocate(bigarr(nX, nY))
end program big_array_segfault
Note that I have 20 GB of RAM to work with and this does not even begin to approach that. Everything I have seen online suggests that this may be a problem with Stack space vs Heap space but I don't know how to control the memory in that way using Fortran.
For what it is worth, I am compiling with gfortran -o big_arr.exe test.f90 so there is nothing special going on in the compilation.
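As a rough check on the sizes involved, the sketch below works out how much memory the allocate statement asks for. It assumes real(kind = 8) occupies 8 bytes per element (the usual case with gfortran, but still an assumption), and the helper name array_bytes is made up for illustration.

```python
# Rough size check for allocate(bigarr(nX, nY)) from the program above.
# Assumption: real(kind = 8) takes 8 bytes per element.
BYTES_PER_REAL8 = 8
INT32_MAX = 2**31 - 1


def array_bytes(nx, ny, bytes_per_element=BYTES_PER_REAL8):
    """Bytes needed for an nx-by-ny array (illustrative helper)."""
    return nx * ny * bytes_per_element


n_elements = 13000 * 100000
total_bytes = array_bytes(13000, 100000)

print("elements:", n_elements)
print("bytes:", total_bytes)
print("GiB:", round(total_bytes / 2**30, 2))
print("element count fits in a 32-bit integer:", n_elements <= INT32_MAX)
print("byte count fits in a 32-bit integer:", total_bytes <= INT32_MAX)
```

Under that assumption the request is roughly 10.4 GB, about half of the 20 GB mentioned, and the byte count no longer fits in a 32-bit integer even though the element count does.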
I have a simple n-body implementation and am trying to compile it to run on NVIDIA GPUs (Tesla K20m/GeForce GTX 650 Ti). I use the following compiler options:
-Minfo=all -acc -Minline -Mfpapprox -ta=tesla:cc35/nvidia
Everything works without -Mfpapprox, but when I use it, the compilation fails with the following output:
346, Accelerator restriction: unsupported operation: RSQRTSS
Line 346 reads:
float rdistance=1.0f/sqrtf(drSquared);
where
float drSquared=dx*dx+dy*dy+dz*dz+softening;
and dx, dy, dz are float values. This line is inside the #pragma acc parallel loop independent for() construction.
What is the problem with -Mfpapprox?
-Mfpapprox tells the compiler to use very low-precision CPU instructions to approximate DIV or SQRT. These instructions are not supported on the GPU. The GPU SQRT is both fast and precise, so there is no need for a low-precision version.
Actually, even on the CPU, I'd recommend you not use -Mfpapprox unless you really understand the mathematics of your code and it can handle a high degree of imprecision (as much as 5-6 bits or ~20 ULPs off). We added this flag about 10 years ago since at the time the CPU's divide operation was very expensive. However, CPU performance for divide has greatly improved since then (as has sqrt), so you're generally better off not sacrificing precision for the little bit of speed-up you might get from this flag.
I'll put in an issue report requesting that the compiler ignore -Mfpapprox for GPU code so you won't see this error.
I am testing FFTW in a Fortran program, because I need to use it. Since I am working with huge matrices, my first solution is to use OpenMP. When my matrix has dimension 500 x 500 x 500, the following error happens:
Operating system error:
Program aborted. Backtrace:
Cannot allocate memory
Allocation would exceed memory limit
I compiled the code using the following: gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp
PROGRAM test_fftw
USE omp_lib
USE, intrinsic:: iso_c_binding
IMPLICIT NONE
INCLUDE 'fftw3.f'
INTEGER::i, DD=500
DOUBLE COMPLEX:: OUTPUT_FFTW(3,3,3)
DOUBLE COMPLEX, ALLOCATABLE:: A3D(:,:,:), FINAL_OUTPUT(:,:,:)
integer*8:: plan
integer::iret, nthreads
INTEGER:: indiceX, indiceY, indiceZ, window=2
!! TESTING 3D FFTW with OPENMP
ALLOCATE(A3D(DD,DD,DD))
ALLOCATE(FINAL_OUTPUT(DD-2,DD-2,DD-2))
write(*,*) '---------------'
write(*,*) '------------TEST 3D FFTW WITH OPENMP----------'
A3D = reshape((/(i, i=1,DD*DD*DD)/),shape(A3D))
CALL dfftw_init_threads(iret)
CALL dfftw_plan_with_nthreads(nthreads)
CALL dfftw_plan_dft_3d(plan, 3,3,3, OUTPUT_FFTW, OUTPUT_FFTW, FFTW_FORWARD, FFTW_ESTIMATE)
FINAL_OUTPUT=0.
!$OMP PARALLEL DO DEFAULT(SHARED) SHARED(A3D,plan,window) &
!$OMP PRIVATE(indiceX, indiceY, indiceZ, OUTPUT_FFTW, FINAL_OUTPUT)
DO indiceZ=1,10!500-window
write(*,*) 'INDICE Z=', indiceZ
DO indiceY=1,10!500-window
DO indiceX=1,10!500-window
CALL dfftw_execute_dft(plan, A3D(indiceX:indiceX+window,indiceY:indiceY+window, indiceZ:indiceZ+window), OUTPUT_FFTW)
FINAL_OUTPUT(indiceX,indiceY,indiceZ)=SUM(ABS(OUTPUT_FFTW))
ENDDO
ENDDO
ENDDO
!$OMP END PARALLEL DO
call dfftw_destroy_plan(plan)
CALL dfftw_cleanup_threads()
DEALLOCATE(A3D,FINAL_OUTPUT)
END PROGRAM test_fftw
Notice that this error occurs when I just allocate the huge matrix (A3D), without running the loop over all the values of this matrix (to cover all values, the limits of the three nested loops would have to be 500-window).
I tried to solve this (tips here and here) with -mcmodel=medium in the compilation, without success.
I had success when I compiled with gfortran -o test teste_fftw_openmp.f90 -I/usr/local/include -L/usr/lib/x86_64-linux-gnu -lfftw3_omp -lfftw3 -lm -fopenmp -fmax-stack-var-size=65536
So, I don't understand:
1) Why is there a memory allocation problem if the huge matrix is a shared variable?
2) Will the solution I found still work if I have more huge matrix variables? For example, 3 more 500 x 500 x 500 matrices to store calculation results.
3) In the tips I found, people said that using allocatable arrays/matrices would solve the problem, but I was already using them and it made no difference. Is there anything else I need to do?
Two double complex arrays with 500 x 500 x 500 elements require about 4 gigabytes of memory (16 bytes per element). It is likely that the amount of available memory in your computer is not sufficient.
If you only work with small windows, you might consider not keeping the whole array in memory all the time, but only parts of it. Or distribute the computation across multiple computers using MPI.
Or just use a computer with more RAM.
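The 4-gigabyte figure can be checked with a few lines of arithmetic. This is only a sketch: it assumes DOUBLE COMPLEX takes 16 bytes per element, and the helper name cube_bytes and the thread count of 4 are illustrative, not taken from the question. The last line is a guess at the effect of listing FINAL_OUTPUT in the PRIVATE clause, under the assumption that each thread then gets its own full-size copy.

```python
# Memory estimate for the two allocatable arrays in the question.
# Assumption: DOUBLE COMPLEX is two 8-byte reals, i.e. 16 bytes per element.
BYTES_PER_DOUBLE_COMPLEX = 16


def cube_bytes(n, bytes_per_element=BYTES_PER_DOUBLE_COMPLEX):
    """Bytes needed for an n x n x n array (illustrative helper)."""
    return n ** 3 * bytes_per_element


a3d_bytes = cube_bytes(500)           # A3D(DD,DD,DD) with DD = 500
final_output_bytes = cube_bytes(498)  # FINAL_OUTPUT(DD-2,DD-2,DD-2)
total_bytes = a3d_bytes + final_output_bytes

print("A3D bytes:", a3d_bytes)
print("FINAL_OUTPUT bytes:", final_output_bytes)
print("total GB:", round(total_bytes / 1e9, 2))
print("total GiB:", round(total_bytes / 2**30, 2))

# Hypothetical: if every thread held a private copy of FINAL_OUTPUT.
example_threads = 4  # illustrative value, not from the question
private_copies_bytes = example_threads * final_output_bytes
print("private copies GB:", round(private_copies_bytes / 1e9, 2))
```

If the private-copy assumption holds, the per-thread copies alone would roughly double the footprint at 4 threads, so it may be worth checking whether FINAL_OUTPUT really needs to be private.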
When testing the following code (notice the *NaN in the second fragment)
tic
x = zeros(1,5000000);
for i=1:10
selector = x > 1;
end
toc
tic
x = zeros(1,5000000)*NaN;
for i=1:10
selector = x > 1;
end
toc
on Matlab revisions
R2012a 64-bit
R2013a 32-bit
I observe the following odd behavior
R2012a 64-bit
Elapsed time is 0.056266 seconds.
Elapsed time is 0.059677 seconds.
R2013a 32-bit
Elapsed time is 0.070116 seconds.
Elapsed time is 3.995697 seconds.
So in the case of R2013a 32-bit, the presence of NaN values drastically increases runtime. Can anyone give me a hint where this might be coming from?
Best regards,
Thomas
You are using an Intel CPU and, for 32-bit code, its FPU. The FPU is awfully slow with NaN, Inf and denormals, and this is an old story. The good news is that the SSE unit is slow with denormals only and handles NaNs at full speed, so if you can convince your compiler to emit SSE code, you should be back up to full speed. This is done automatically for x64, because it implies SSE2 and the ABI uses SSE registers, but since the 32-bit x86 floating-point ABI uses FPU registers, the FPU is used for the calculations to avoid moving things around too much.
I did not dig deeper (we use embedded platforms and not all of them have SSE as of now), but I suspect changing the compiler or some flags should help. If it does, checking how things are inlined would be in order, to see whether you get that SSE-to-FPU-and-back on each function call. If it's a small tight loop somewhere in the code, there is a possibility of using SSE intrinsics.
Update: Oops, just noticed this is MATLAB. The reasoning stands, but you'll have to look for the solutions yourself.
The problem may be due to the fact that your 32-bit system takes longer to reallocate the ~40MB of memory in the x = zeros(1,5000000)*NaN; line. Perhaps there is not enough available RAM and it needs to swap memory to disk. To check which part (the allocation or the comparison) is problematic, tic-toc these parts separately.
BTW, there is no need to multiply by NaN - you can simply do x = nan(1,5000000);
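The ~40MB figure follows directly from the array length. A minimal sketch of the arithmetic, assuming 8 bytes per double and 1 byte per logical element (the usual MATLAB storage sizes); the variable names are illustrative:

```python
# Size of the arrays in the timing test.
# Assumptions: a double takes 8 bytes, a logical takes 1 byte.
N_ELEMENTS = 5000000
BYTES_PER_DOUBLE = 8
BYTES_PER_LOGICAL = 1

x_bytes = N_ELEMENTS * BYTES_PER_DOUBLE
selector_bytes = N_ELEMENTS * BYTES_PER_LOGICAL

print("x bytes:", x_bytes)
print("x MiB:", round(x_bytes / 2**20, 2))
print("selector bytes:", selector_bytes)
```

Multiplying zeros(1,5000000) by NaN builds a second array of the same size before the first is released, which is the reallocation the answer refers to.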
I need to simulate a memory-hungry process. For example, on a machine with 4.0 GiB, I need a process that would eat 3.2 GiB (give or take a few MiB).
I assumed it should be as easy as:
my $mbytes = 3276;
my $huge_string = 'X' x ($mbytes * 1024 * 1024);
But I end up with process eating twice as much memory as I need it to.
This is the same on two Windows 7 amd64 machines: one with a 64-bit build of Strawberry Perl, the other with a 32-bit build.
I'm using Sysinternals Process Explorer and watching "Private Bytes"
Of course, I could just $mbytes /= 2 (for now, I probably will do that), but:
Is there a better way?
Can anyone explain why the amount is twice the length of the string?
Code adapted from http://www.perlmonks.org/index.pl?node_id=948181, all credit goes to Perlmonk BrowserUk.
my $huge_string = 'X';
$huge_string x= $mbytes * 1024 * 1024;
Why is the amount twice the length of the string?
Think about the order of evaluation. The right-hand expression allocates memory for your x expression, and again so does the assignment operation into your new scalar. As usual for Perl, even though the right-hand expression is not referenced anymore, the memory is not freed right away.
Operating on an existing scalar avoids the second allocation, as shown above.
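The doubling described above can be put into numbers. This is a simplified model of the explanation, not a measurement of Perl itself: it assumes one byte per character and counts one buffer for the temporary result of x plus one for the copy made by the assignment. The variable names are illustrative.

```python
# Simplified model: temporary buffer from the x expression plus the
# copy made by the assignment, versus growing one scalar in place.
# Assumption: one byte per character, no allocator overhead.
MBYTES = 3276  # value used in the question

string_bytes = MBYTES * 1024 * 1024
two_buffers_bytes = 2 * string_bytes   # my $s = 'X' x N
in_place_bytes = string_bytes          # $s x= N on an existing scalar

print("string GiB:", round(string_bytes / 2**30, 2))
print("two buffers GiB:", round(two_buffers_bytes / 2**30, 2))
print("in place GiB:", round(in_place_bytes / 2**30, 2))
```

Under this model the first form peaks near 6.4 GiB for a 3.2 GiB string, which matches the "twice as much" observation.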
So, the Ruby Enterprise Edition documentation states that all the values in the GC settings are defined in slots: http://www.rubyenterpriseedition.com/documentation.html#_garbage_collector_performance_tuning
(e.g. RUBY_HEAP_MIN_SLOTS)
We fine-tuned our app's min slot size and increment for the best performance by trial and error (we have enough machines to get a good idea how different values affect the number of malloc calls and Full GCs).
But something has been bugging me for a while: How big is 1 slot in bytes?
From Ruby source:
* sizeof(RVALUE) is
* 20 if 32-bit, double is 4-byte aligned
* 24 if 32-bit, double is 8-byte aligned
* 40 if 64-bit
$ rvm use ruby-1.9.2-p136
$ gdb ruby
(gdb) p sizeof(RVALUE)
$1 = 40
The default in 1.9 is 8K
http://svn.ruby-lang.org/repos/ruby/trunk/gc.c
(search for HEAP_SIZE)
Note well that whenever it runs out of space and needs to reallocate, in 1.9 it allocates exponentially more heaps.
In 1.8 it would allocate bigger and bigger heaps.
After diggin' through the code:
1 slot is sizeof(struct RVALUE) bytes, which depends on the machine.
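Given the sizeof(RVALUE) values quoted from the Ruby source above, a slot count can be converted to bytes as sketched below. The slot count of 500000 is a made-up example, not a recommended setting, and the calculation ignores any per-heap bookkeeping overhead.

```python
# Convert a GC slot count to bytes using the sizeof(RVALUE) values
# quoted from the Ruby source comment above.
RVALUE_BYTES = {
    "32-bit, double 4-byte aligned": 20,
    "32-bit, double 8-byte aligned": 24,
    "64-bit": 40,
}


def slots_to_bytes(slots, rvalue_bytes):
    """Bytes occupied by the given number of slots (illustrative helper)."""
    return slots * rvalue_bytes


example_slots = 500000  # hypothetical value for a setting such as RUBY_HEAP_MIN_SLOTS

for platform, size in RVALUE_BYTES.items():
    total = slots_to_bytes(example_slots, size)
    print(platform, "->", total, "bytes,", round(total / 2**20, 2), "MiB")
```

On a 64-bit build, for instance, 500000 slots come to 20000000 bytes, so the same slot setting costs twice as much memory as on a 32-bit build with 4-byte aligned doubles.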