Segmentation fault with automatic arrays [duplicate] - memory-management

I have some Fortran code that calls RESHAPE to reorder a matrix so that the dimension I am about to loop over becomes the first, fastest-varying dimension (Fortran is column-major).
This has nothing to do with C/Fortran interoperability.
Now the matrix is rather large, and when I call the RESHAPE function I get a segfault which I am very confident is a stack overflow. I know this because I can compile my code in ifort with -heap-arrays and the problem disappears.
I do not want to modify the stack size. This code needs to be portable to any computer without the user having to concern himself with the stack size.
Is there some way I can get this call to RESHAPE to use the heap rather than the stack for its internal memory use?
Worst case, I will have to 'roll my own' RESHAPE function for this instance, but I wish there were a better way.

The Fortran standard does not speak about the stack and the heap at all; that is an implementation detail. Where in memory something is placed, and whether there are any limits, is implementation defined.
Therefore it is impossible to control stack or heap behaviour from the Fortran code itself. The compiler must be instructed by other means if you want to specify this, and compiler options are used for that. Intel Fortran uses the stack by default and has the -heap-arrays n option (n is the size limit in kB); gfortran is slightly different and has the opposite -fstack-arrays option (included in -Ofast, but it can be disabled).
This is valid for all kinds of temporaries and automatic arrays.
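For readers more used to C, the same stack-vs-heap distinction can be sketched like this (illustration only; the function names are invented, and in the Fortran case the compiler makes this choice for its RESHAPE temporary, with the flags above merely telling it which to prefer):

#include <stdlib.h>
#include <string.h>

/* A large automatic array: implementations typically place this on the
   stack, so a big n can overflow the stack limit. */
void workspace_on_stack(size_t n) {
    double tmp[n];                       /* C99 VLA, stack-allocated */
    memset(tmp, 0, n * sizeof tmp[0]);
}

/* The same workspace taken from the heap: limited by available memory,
   not by the much smaller stack limit. */
void workspace_on_heap(size_t n) {
    double *tmp = malloc(n * sizeof *tmp);
    if (tmp == NULL) return;             /* heap allocation can fail */
    memset(tmp, 0, n * sizeof *tmp);
    free(tmp);
}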

Related

Performance of recursive function in register based compiler

I have a question about whether there will be a performance hit when we write recursive functions for register-based VMs like the DVM. I'm aware that recursion isn't recommended on runtimes with a limited stack depth, such as Python's.
Being register-based does not help for recursive functions, they still have the same problem: conceptually, every call creates a new stack frame. If that is implemented literally, then a recursive call is inherently a little slower than looping, and perhaps more importantly, uses up a finite resource so the recursion depth is limited. A register-based code representation does not have the concept of an operand stack, but that concept is mostly disjoint from the concept of a call stack, which is still necessary just to have general subroutines. Subroutines can be implemented without a call stack if recursion is banned, in which case they need not be re-entrant so the local variables and the variable that holds the return address can be statically allocated.
Going through a trampoline works around the stack growth by quickly returning to a special caller that then calls the continuation; that way recursion doesn't grow the stack at all, since the old frame is deallocated before a new one is created, but it adds even more run-time overhead. Tail-call elimination, which rewrites the call into a jump, achieves a similar effect by reusing the same frame and with less associated overhead, but it requires explicit support from the VM.
Both of those techniques apply equally to stack-based and register-based representations of the code, which incidentally is primarily a difference in the format in which the code is stored and need not reflect a difference in how the code is actually executed: a JIT compiler can turn either of them into whatever form the machine requires.
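To make the trampoline idea concrete, here is a minimal C sketch (the names and the struct layout are invented for illustration; real VMs thread a continuation object through instead of a fixed argument record):

#include <stdio.h>

/* A "thunk": either a finished result or the arguments of the next call. */
struct thunk {
    int done;              /* 1 when result is valid   */
    long long result;      /* final value              */
    long long acc, n;      /* state for the next step  */
};

/* One step of sum(n) = n + (n-1) + ... + 1, written so that it never calls
   itself: instead of recursing it merely describes the next call. */
static struct thunk sum_step(long long acc, long long n) {
    if (n == 0)
        return (struct thunk){ .done = 1, .result = acc };
    return (struct thunk){ .done = 0, .acc = acc + n, .n = n - 1 };
}

/* The trampoline: a plain loop that keeps making "the next call", so the
   call stack never grows no matter how deep the logical recursion is. */
static long long trampoline(long long n) {
    struct thunk t = sum_step(0, n);
    while (!t.done)
        t = sum_step(t.acc, t.n);
    return t.result;
}

int main(void) {
    printf("%lld\n", trampoline(1000000));   /* 500000500000, no stack growth */
    return 0;
}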

Best practices to determine stack usage in Ravenscar program

I am writing an Ada program using the Ravenscar subset (thus, I am aware of the number of tasks running at execution time). The code is compiled by gcc with the -fstack-check switch enabled. This should cause the program to raise a STORAGE_ERROR at runtime if any of my tasks exceeds its stack.
Ada allows you to set the upper limit for those (task-specific) stacks in the specification of the respective task, like so:
pragma Storage_Size (Some_Value);
Now I was wondering what options I have to determine Some_Value. What I have heard of so far:
Make wild guesses until no STORAGE_ERROR is raised anymore. This is more or less what the OP suggests here.
Feed the output of -fstack-usage in there.
Use some gnat specific extensions as outlined here (how does this technically differ from item #2?).
Get a stack analyzer like gnatstack and let it do the work for you.
If I understand this correctly, all the above techniques are dynamic (i.e. they require the program to run in order to work). Are static approaches also conceivable? E.g. by restricting myself further through some of Ada's high-integrity options (such as No_Recursion; what else?).
Perhaps any of you can name some best practices to tackle this problem and/or extend/comment on my (surely incomplete) list.
Bonus question: What is the default size of a task's stack when the above pragma is not specified? GCC's docs only state this value depends on the runtime, without giving any concrete numbers.
You can generally estimate the stack space required by objects of individual types with the 'Size attribute (which counts in bits).
Once you have tabulated this (you may need to round it up to whole words/double words), you can add up how much stack space is used by each declarative region, and then walk through your calls to find the maximum stack usage.

Why is Befunge considered hard to compile?

One of the design goals of Befunge was to be hard to compile. However, it is quite easy to interpret. One can write an interpreter in a conventional language, say C. To translate a Befunge program to equivalent machine code, one can hard-code the Befunge code into the C interpreter, and compile the resulting C program to machine code. Or does "compile" mean something more restricted which excludes this translation?
"To translate a Befunge program to equivalent machine code, one can hard-code the Befunge code into the C interpreter, and compile the resulting C program to machine code."
Yes, sure. This can be applied to any interpreter, esoteric language or not, and under some definitions this can be called a compiler.
But that's not what is meant by "compilation" in the context of Befunge, and I'd argue that calling this a "compiler" very much misses the point of compilation, which is to convert code in some (higher-level) language into semantically equivalent code in some other (lower-level) language. No such conversion is being done here.
Under this definition, Befunge is indeed a hard language to convert in such a way, since given an instruction it's hard to know - at compile time - what the next instruction will be.
Befunge is impossible to truly AOT compile because of the p command. In terms of JITs, though, it's a cakewalk compared to all those dynamic languages out there. I've worked on some fast implementations.
marsh gains its speed by being a threaded interpreter. In order to speed up instruction dispatch it has to create 4 copies of the interpreter, one for each direction. I optimize bounds checking & lookup by storing the program in an 80x32 space instead of an 80x25 space.
bejit came out of my observation that the majority of program time is spent in moving around. bejit records a trace as it interprets, & if the same location is ever hit in the same direction we jump to an internal bytecode format that the trace recorded. When p performs a write to program source that we've traced, we drop all traces & return to the interpreter. In practice this executes stuff like mandel.bf 3x faster. It also opens up peephole optimization, where the tracer can apply constant propagation. This is especially useful in Befunge, since constants are built up out of multiple instructions.
My Python implementations compile the whole program before executing it, since a Python function's bytecode is immutable. This opens up the possibility of whole-program analysis.
funge.py traces Befunge instructions into CPython bytecode. It has to keep an int at the top of the stack to track stack height, since CPython doesn't handle stack underflow. I was originally hoping to create a generic Python bytecode optimizer, but I ended up realizing that it'd be more efficient to optimize in an intermediate format which lacked jump offsets. Besides that, the common advice that arrays are faster than linked lists doesn't apply as much in CPython, since arrays are arrays of pointers & a linked list just spreads those pointers out. So I created funge2.py
(wfunge.py is a port of funge.py in preparation for http://bugs.python.org/issue26647)
funge2.py traces instructions into a control flow graph. Unfortunately we don't get to have the static stack adjustments that the JVM & CIL demand, so optimizations are a bit harder. funge2.py does constant folding, loop unrolling, some stack depth tracking to reduce stack depth checks, & I'm in the process of adding more (jump to jump optimizations, smarter stack depth juggling, not-jump or jump-pop or dup-jump combining)
By the time funge2 gets to optimizing Befunge, it's a pretty simple IR:
load const
binop (+, -, *, /, %, >)
not
pop
dup
swap
printint/printchar/printstr (the last for when constant folding makes these deterministic)
getint/getchar
readmem
writemem
jumprand
jumpif
exit
Which doesn't seem so hard to compile.
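For what it's worth, here is a minimal C sketch of a dispatch loop for an IR of roughly that shape (the opcode names, the encoding, and the stack handling are invented for illustration; a real implementation would also have to deal with Befunge's bottomless stack):

#include <stdio.h>

enum op { OP_CONST, OP_ADD, OP_DUP, OP_SWAP, OP_POP, OP_PRINTINT, OP_JUMPIF, OP_EXIT };

struct insn { enum op op; long arg; };        /* arg = constant or jump target */

static void run(const struct insn *code) {
    long stack[1024];
    int sp = 0, pc = 0;
    for (;;) {
        struct insn i = code[pc++];
        switch (i.op) {
        case OP_CONST:    stack[sp++] = i.arg;                    break;
        case OP_ADD:      sp--; stack[sp-1] += stack[sp];         break;
        case OP_DUP:      stack[sp] = stack[sp-1]; sp++;          break;
        case OP_SWAP:   { long t = stack[sp-1];
                          stack[sp-1] = stack[sp-2];
                          stack[sp-2] = t; }                      break;
        case OP_POP:      sp--;                                   break;
        case OP_PRINTINT: printf("%ld ", stack[--sp]);            break;
        case OP_JUMPIF:   if (stack[--sp]) pc = (int)i.arg;       break;
        case OP_EXIT:     return;
        }
    }
}

int main(void) {
    /* push 2 and 3, add, print -> "5 " */
    const struct insn prog[] = {
        { OP_CONST, 2 }, { OP_CONST, 3 }, { OP_ADD, 0 },
        { OP_PRINTINT, 0 }, { OP_EXIT, 0 },
    };
    run(prog);
    return 0;
}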
Befunge is hard to compile because of the p and g commands. With these you can put and get instructions at runtime, i.e. write self-modifying code.
There is no way you can translate that directly to assembly, let alone binary code.
If you embed a Befunge-program into the interpreter code and compile that, you are still compiling the interpreter, not the Befunge-program...

BLAS and CUBLAS

I'm wondering about NVIDIA's cuBLAS library. Does anybody have experience with it? For example, if I write a C program using BLAS, will I be able to replace the calls to BLAS with calls to cuBLAS? Or, even better, implement a mechanism which lets the user choose at runtime?
What about if I use the BLAS Library provided by Boost with C++?
The answer by janneb is incorrect: cuBLAS is not a drop-in replacement for a CPU BLAS. It assumes data is already on the device, and the function signatures have an extra parameter to keep track of a cuBLAS context.
However, coming in CUDA 6.0 is a new library called NVBLAS which provides exactly this "drop-in" functionality. It intercepts Level-3 BLAS calls (GEMM, TRSM, etc.) and automatically sends them to the GPU, effectively tiling the work so that PCIe transfers overlap with on-GPU computation.
There is some information here: https://developer.nvidia.com/cublasxt, and CUDA 6.0 is available to CUDA registered developers today.
Full docs will be online once CUDA 6.0 is released to the general public.
CUBLAS does not wrap around BLAS.
CUBLAS also accesses matrices in column-major ordering, like Fortran code and BLAS. I am more used to writing code in C, even for CUDA. Code written with CBLAS (which is a C wrapper for BLAS) can easily be changed into CUDA code.
Be aware that Fortran code that uses BLAS is quite different from C/C++ code that uses CBLAS. Fortran and BLAS normally store matrices (double arrays) in column-major ordering, but C/C++ normally use row-major ordering. I normally handle this problem by saving the matrices in 1D arrays and using #define to write a macro to access the element (i,j) of a matrix:
/* macro to access A(ii,jj) in the row-major array A[M*N] (1-based indices) */
#define indrow(ii,jj,N) (((ii)-1)*(N)+(jj)-1)  /* does not depend on the number of rows M */
/* macro to access A(ii,jj) in the column-major array A[M*N] (1-based indices) */
#define indcol(ii,jj,M) (((jj)-1)*(M)+(ii)-1)
The CBLAS library has well-organized parameters and conventions (const enum variables) for telling each function the ordering of the matrices.
Beware that the storage of matrices also varies: a row-major banded matrix is not stored the same way as a column-major banded matrix.
I don't think there is a mechanism that lets the user choose between BLAS and CUBLAS without writing the code twice. CUBLAS also takes a "handle" variable in most function calls that does not appear in BLAS. I thought of using #define to change the name at each function call, but this might not work.
I've been porting BLAS code to CUBLAS. The BLAS library I use is ATLAS, so what I say may be correct only up to the choice of BLAS library.
ATLAS BLAS requires you to specify whether you are using column-major or row-major ordering; I chose column-major, since I was using CLAPACK, which uses column-major ordering. LAPACKE, on the other hand, would use row-major ordering. CUBLAS uses column-major ordering. You may need to adjust accordingly.
Even when ordering is not an issue, porting to CUBLAS is by no means a drop-in replacement. The largest issue is that you must move the data onto and off of the GPU's memory space. That memory is set up using cudaMalloc() and released with cudaFree(), which act as one might expect. You move data into GPU memory using cudaMemcpy(). The time to do this will be a large factor in deciding whether it's worthwhile to move from CPU to GPU.
Once that's done, however, the calls are fairly similar: CblasNoTrans becomes CUBLAS_OP_N and CblasTrans becomes CUBLAS_OP_T. If your BLAS library allows you to pass scalars by value (as ATLAS does), you will have to convert that to pass by reference (as is normal for FORTRAN).
Given this, any switch that allows for a choice of CPU/GPU would most easily be at a higher level than within the function using BLAS. In my case I have CPU and GPU variants of the algorithm and chose them at a higher level depending on the size of the problem.
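To make the porting differences concrete, here is a hedged sketch (the function names gemm_cpu/gemm_gpu are mine, and all error checking is omitted) of the same column-major DGEMM done once through CBLAS and once through the cuBLAS v2 API:

#include <cblas.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

/* C = A*B with A (m x k), B (k x n), C (m x n), all column-major. */
void gemm_cpu(int m, int n, int k, const double *A, const double *B, double *C) {
    double alpha = 1.0, beta = 0.0;                  /* scalars passed by value */
    cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                m, n, k, alpha, A, m, B, k, beta, C, m);
}

void gemm_gpu(int m, int n, int k, const double *A, const double *B, double *C) {
    double alpha = 1.0, beta = 0.0;                  /* scalars passed by pointer */
    double *dA, *dB, *dC;
    cublasHandle_t handle;

    cublasCreate(&handle);                           /* the extra "handle" */
    cudaMalloc((void **)&dA, (size_t)m * k * sizeof *dA);
    cudaMalloc((void **)&dB, (size_t)k * n * sizeof *dB);
    cudaMalloc((void **)&dC, (size_t)m * n * sizeof *dC);

    /* the data must be moved to the GPU first -- the non-drop-in part */
    cudaMemcpy(dA, A, (size_t)m * k * sizeof *dA, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, (size_t)k * n * sizeof *dB, cudaMemcpyHostToDevice);

    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,    /* CblasNoTrans -> CUBLAS_OP_N */
                m, n, k, &alpha, dA, m, dB, k, &beta, dC, m);

    cudaMemcpy(C, dC, (size_t)m * n * sizeof *dC, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    cublasDestroy(handle);
}

A runtime CPU/GPU switch then simply picks between two such functions, which matches the suggestion above to make the choice at a level above the BLAS call itself.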

Getting Stack overflow with GNU CLisp (Windows)

I'm getting "Program stack overflow RESET" message while running my program. So I set added a counter to see how many times I'm recursively calling the main function in my program. Turns out that it is around 30,000 times and the data I'm stacking are lists of length around 10 elements, which I think are not so many. My question is whether this amount of recursive call and memory usage are common or not, or is it more likely that I'm doing something wrong? I checked the resource manager of vista and found the memory only grew for like 1MB for lisp.exe process. And how do I adjust the stack overflow limit of CLisp?
http://clisp.cons.org/impnotes.html#faq-stack
Note that if you do tail calls and compile your function(s) there will be no limit at all.
1 MB seems to be the default stack size on Windows. I do not know if it is possible to change it without relinking the program, but in any case I would recommend either converting the program to tail-recursive form and using the CLisp byte compiler, which will optimize it away, or just converting it to iterative form. While many Common Lisp compilers do implement tail call optimization, the standard does not require it, so unbounded recursion should not be used.

Resources