Stack overflow when using gp2c but not when using gp directly with the same program (PARI/GP)

So I wanted to calculate a sum from a particular Project Euler problem using gp. Here's the, admittedly unreadable, code:
{
n = 10000;
m=100;
sum(k=1,n,eulerphi(k),0.) - (sum(k=1,n,eulerphi(k)*(k * binomial(n-k,m-1) + sum(p = max(m + k - n - 1,1), k-1, (k-p)*binomial(n+p-k-1,m-2),0.)), 0.))/binomial(n,m)
}
This code takes two or three minutes to output the answer on my fairly modest machine, but it does so without exceeding the default parisize = 8000000 (around 8 MB of memory).
Now, I read somewhere that gp2c, which compiles GP scripts into C code, can improve performance.
So I just made a program.gp file:
calculate() = {n = 10000; m=100; sum(k=1,n,eulerphi(k),0.) - (sum(k=1,n,eulerphi(k)*(k * binomial(n-k,m-1) + sum(p = max(m + k - n - 1,1), k-1, (k-p)*binomial(n+p-k-1,m-2),0.)), 0.))/binomial(n,m)}
And ran it with gp2c-run program.gp.
In the interactive prompt that comes up, I just executed calculate(). However, to my surprise, I got a stack overflow, with gp asking me to increase the stack size, even after I raised parisizemax to almost 2 GB.
? default(parisizemax, 2000000000)
*** Warning: new maximum stack size = 2000003072 (1907.352 Mbytes).
? calculate()
*** calculate: Warning: increasing stack size to 16000000.
*** calculate: Warning: increasing stack size to 32000000.
*** calculate: Warning: increasing stack size to 64000000.
*** calculate: Warning: increasing stack size to 128000000.
*** calculate: Warning: increasing stack size to 256000000.
*** calculate: Warning: increasing stack size to 512000000.
*** calculate: Warning: increasing stack size to 1024000000.
*** calculate: Warning: increasing stack size to 2000003072.
*** at top-level: calculate()
*** ^-----------
*** calculate: the PARI stack overflows !
current stack size: 2000003072 (1907.352 Mbytes)
[hint] you can increase 'parisizemax' using default()
*** Break loop: type 'break' to go back to GP prompt
Why does the same program, when compiled to C, need so much extra memory?
For reference, the same program with n = 1000 instead of 10000 only shows the answer after increasing the stack size to 256000000 (250 MB), whereas it needs only the default 8 MB when run directly in gp. Something doesn't add up.

By default, neither gp2c nor gp2c-run generates code to handle the PARI stack, meaning you will quickly get stack overflows. Use gp2c-run -g program.gp: the -g flag makes gp2c generate code that cleans up the stack as the computation progresses. There is such an example in the gp2c tutorial.

Related

How to increase stack size for Julia in Windows?

I wrote a recursive function (basically a flood fill), it works fine on smaller datasets, but for slightly larger input it throws StackOverflowError.
How to increase the stack size for Julia under Windows 10? Ideally the solution should be applicable to JupyterLab.
It's a single-use program, no point in optimizing/rewriting it, I just need to peek at the result and forget about the code.
Update: As a test case, I provide the following MWE. This is just a simple algorithm that recursively visits each cell of an n-by-n array:
n = 120
visited = fill(false, (n,n))
function visit_single_neighbour(i,j,Δi,Δj)
    if 1 ≤ i + Δi ≤ n && 1 ≤ j + Δj ≤ n
        if !visited[i+Δi, j+Δj]
            visited[i+Δi, j+Δj] = true
            visit_four_neighbours(i+Δi, j+Δj)
        end
    end
end
function visit_four_neighbours(i,j)
    visit_single_neighbour(i,j,1,0)
    visit_single_neighbour(i,j,0,1)
    visit_single_neighbour(i,j,-1,0)
    visit_single_neighbour(i,j,0,-1)
end
@time visit_four_neighbours(1,1)
For n = 120 the output is 0.003341 seconds, but for n = 121 it throws StackOverflowError.
On a Linux machine with ulimit -s unlimited the code runs no problem for n = 2000 and takes about 2.4 seconds.
I've mirrored the question to Julia Discourse: https://discourse.julialang.org/t/ow-to-increase-stack-size-for-julia-in-windows/79932
As you are no doubt aware, Julia is not very optimized for recursion, and the recommendation will probably always be to rewrite the code in some way.
With that said, there are of course ways to increase the stack limit. One undocumented way to do it from inside Julia is to reserve stack space when creating a Task:
Core.Task(f, reserved_stack::Int=0)
Let's create a function wrapping such a task:
with_stack(f, n) = fetch(schedule(Task(f,n)))
For n = 2000 the following works on both Windows and Linux (as long as enough memory is available):
julia> with_stack(2_000_000_000) do
           visit_four_neighbours(1,1)
       end

Minimize integer round-off error in PLL frequency calculation

On a particular STM32 microcontroller, the system clock is driven by a PLL whose frequency F is given by the following formula:
F := (S/M * (N + K/8192)) / P
S is the PLL input source frequency (1 to 64000000 Hz, i.e. at most 64 MHz).
The other factors M, N, K, and P are the parameters the user can modify to calibrate the frequency. Judging by the bitmasks in the SDK I'm using, the value of each can be limited to a maximum of M < 64, N < 512, K < 8192, and P < 128.
Unfortunately, my target firmware does not have FPU support, so floating-point arithmetic is out. Instead, I need to compute F using integer-only arithmetic.
I have tried to rearrange the given formula with 3 goals in mind:
Expand and distribute all multiplication factors
Minimize the number of factors in each denominator
Minimize the total number of divisions performed
If two expressions have the same number of divisions, choose the one whose denominators have the least maximum (identified in earlier paragraph)
However, each of my attempts to expand and rearrange the expression produces a larger error than the original formula evaluated verbatim.
To test out different arrangements of the formula and compare error, I've written a small Go program you can run online here.
Is it possible to improve this formula so that error is minimized when using integer arithmetic? Also are any of my goals listed above incorrect or useless?
I took your formula (your first pair of parentheses is redundant, so I removed it):
 S            K
--- * ( N + ------ )
 M           8192
--------------------
         P
and ran it through QuickMath [1], and got this:
S * (8192 * N + K)
------------------
   8192 * M * P
or in Go code:
S * (8192 * N + K) / (8192 * M * P)
So it does reduce the number of divisions. You could improve it further by pulling out the lower constant:
S * (8192 * N + K) / (M * P) >> 13
[1] https://quickmath.com
Looking at the answer by @StevenPerry, I realized the majority of the error is introduced by the limited precision we have to represent K/8192. This error then gets propagated into the other factors and dividends.
Postponing that division, however, often results in integer overflow before it is ever reached. Thus, the solution I've found unfortunately depends on widening these operands to 64-bit.
The result is of the same form as the other answer, but it must be emphasized that widening the operands to 64-bit is essential. In Go source code, this looks like:
var S, N, M, P, K uint32
...
F := uint32(uint64(S) * uint64(8192*N+K) / uint64(8192*M*P))
To see the accuracy of all three of these expressions, run the code yourself on the Go Playground.
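To make the overflow hazard concrete, here is a small, self-contained Go sketch comparing a floating-point reference, the naive 32-bit evaluation, and the widened 64-bit evaluation. The parameter values are made up purely for illustration and stay within the ranges stated in the question; they are not taken from any real STM32 configuration.

package main

import "fmt"

func main() {
    // Hypothetical example values, chosen only to stay inside the ranges from
    // the question (M < 64, N < 512, K < 8192, P < 128).
    var S, M, N, K, P uint32 = 16000000, 4, 336, 3072, 2

    // Floating-point reference, used here only to judge the integer variants.
    ref := (float64(S) / float64(M) * (float64(N) + float64(K)/8192.0)) / float64(P)

    // Naive 32-bit evaluation: S*(8192*N+K) wraps around in uint32 long before
    // the division, so the result is meaningless for realistic inputs.
    f32 := S * (8192*N + K) / (8192 * M * P)

    // Widened 64-bit evaluation, as in the answer above.
    f64 := uint32(uint64(S) * uint64(8192*N+K) / uint64(8192*M*P))

    fmt.Printf("float reference: %.2f\n", ref)
    fmt.Printf("uint32 only:     %d\n", f32)
    fmt.Printf("uint64 widened:  %d\n", f64)
}

For these particular inputs, the 32-bit product wraps around well before the division, while the 64-bit version matches the floating-point reference exactly.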

Why do memory managers use (size + PAGE_SIZE-1) / PAGE_SIZE to calculate the number of pages to allocate?

I have come across this formula multiple times in various places (e.g., the Linux kernel and glibc). Why do they use this formula instead of simply:
pages = (size / PAGE_SIZE) + 1;
As a guess, I think the problem with the formula above is when size is PAGE_SIZE-aligned (a multiple of PAGE_SIZE), because in that case it reports one more page than needed, so we would also have to do:
pages = (size / PAGE_SIZE) + 1;
if (!(size & (PAGE_SIZE-1))) /* is size a multiple of PAGE_SIZE? */
    pages--;
which is obviously more code than just (size + PAGE_SIZE-1) / PAGE_SIZE!
Yes, it's used to get the ceiling of the division result, i.e. to round the quotient up.
The problem with
pages = (size / PAGE_SIZE) + 1;
if (!(size & (PAGE_SIZE-1))) /* is size a multiple of PAGE_SIZE? */
    pages--;
is not only that it's a lot more code, but also that:
it has worse performance due to the branch
it only works when PAGE_SIZE is a power of 2
Of course, the page size on a binary computer is always a power of 2, but (size + PAGE_SIZE-1) / PAGE_SIZE works for any value of the divisor.
See also
Fast ceiling of an integer division in C / C++
Rounding integer division (instead of truncating)
Dividing two integers and rounding up the result, without using floating point
What's the right way to implement integer division-rounding-up?
This is rounding up. You add (PAGE_SIZE - 1) to size, so if size is an exact multiple of PAGE_SIZE you still get the minimum number of pages you need, and if you exceed a multiple by even one byte, you get an extra page to hold it.
There is another way to do it using the log2 of the page size, which in the Linux kernel source is PAGE_SHIFT. For example, if PAGE_SIZE is 4096 = 2^12 bytes, PAGE_SHIFT is 12 (i.e. log2(4096)). So, to get the number of pages for a given size, the following formula is often used:
pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT
By the way, the page size is often defined as:
#define PAGE_SIZE (1 << PAGE_SHIFT)
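To see the behaviour of the two formulas side by side, here is a short sketch (written in Go for brevity; the arithmetic is identical in C). The names pageSize, pageShift, naive and roundUp are made up for this illustration, with a hypothetical 4096-byte page.

package main

import "fmt"

// Hypothetical values for illustration only; real code takes these from the platform.
const (
    pageShift = 12
    pageSize  = 1 << pageShift // 4096
)

// naive is the questioned formula: one page too many when size is already a
// multiple of pageSize.
func naive(size uint) uint { return size/pageSize + 1 }

// roundUp is the (size + PAGE_SIZE-1) / PAGE_SIZE idiom discussed above.
func roundUp(size uint) uint { return (size + pageSize - 1) / pageSize }

// roundUpShift is the same idea using PAGE_SHIFT instead of a division.
func roundUpShift(size uint) uint { return (size + pageSize - 1) >> pageShift }

func main() {
    for _, size := range []uint{1, 4095, 4096, 4097, 8192} {
        fmt.Printf("size=%5d  naive=%d  roundUp=%d  shift=%d\n",
            size, naive(size), roundUp(size), roundUpShift(size))
    }
}

Only the aligned case (size = 4096, 8192) differs: the naive formula reports one page too many, while both round-up variants agree.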

Why is there a runtime.morestack in make([]int, 14) but not in make([]int, 13)?

I already know that runtime.morestack can cause a goroutine context switch (if the sysmon goroutine has marked it as "has to switch").
While experimenting around this, I found an interesting fact.
Compare the following two programs.
func main() {
    _ = make([]int, 13)
}
func main() {
    _ = make([]int, 14)
}
Compile them by running the following command (tried with go1.9 and go1.11):
$ go build -gcflags "-S -l -N" x.go
You may notice one major difference: the output for the second program contains CALL runtime.morestack_noctxt(SB), while the output for the first doesn't.
I guess it is an optimization, but why?
Finally, I got the answer.
A slice that is smaller than 65,536 bytes and does not escape the function will be allocated on the stack, not the heap.
stackguard0 is at least 128 bytes above the lowest address of the stack (even after the stack shrinks).
make([]int, 13) allocates 128 bytes of memory in total:
sizeof(struct slice) + 13 * 8 = 24 + 104 = 128.
So the answer is clear: this is an optimization (shown here for amd64).
For a leaf function whose frame fits within StackSmall (128 bytes), the compiler doesn't generate the code that checks for stack overflow, because there is enough headroom below the stack guard.
Here is the explanation in go/src/runtime/stack.go:
The per-goroutine g->stackguard is set to point StackGuard bytes
above the bottom of the stack. Each function compares its stack
pointer against g->stackguard to check for overflow. To cut one
instruction from the check sequence for functions with tiny frames,
the stack is allowed to protrude StackSmall bytes below the stack
guard. Functions with large frames don't bother with the check and
always call morestack. The sequences are (for amd64, others are
similar):
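As a quick sanity check of the 128-byte arithmetic above, here is a small Go sketch (sizes assume a 64-bit platform such as amd64) that prints the space the two slices need; it is only an illustration of the arithmetic, not of how the compiler computes frame sizes.

package main

import (
    "fmt"
    "unsafe"
)

func main() {
    // A slice header is three words (pointer, len, cap): 24 bytes on a 64-bit platform.
    var s []int
    header := unsafe.Sizeof(s)

    // Stack space for a stack-allocated make([]int, n): header + backing array.
    for _, n := range []uintptr{13, 14} {
        array := n * unsafe.Sizeof(int(0)) // 8 bytes per int on amd64
        fmt.Printf("make([]int, %d): %d + %d = %d bytes\n", n, header, array, header+array)
    }
}

128 bytes still fits within the small-frame allowance (StackSmall) mentioned in the comment above, while 136 bytes does not, which matches the observation that only the make([]int, 14) version gets the morestack check.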

CUDA Parallel Cross Product

Disclaimer: I am fairly new to CUDA and parallel programming - so if you're not going to bother to answer my question, just ignore this, or at least point me to the right resources so I can find the answer myself.
Here's the particular problem I'm looking to solve using parallel programming. I have some 1D arrays that store 3D vectors in this format -> [v0x, v0y, v0z, ... vnx, vny, vnz], where n is the vector index and x, y, z are the respective components.
Suppose I want to find the cross product between vectors [v0, v1, ... vn] in one array and their corresponding vectors [v0, v1, ... vn] in another array.
The calculation is pretty straightforward without parallelization:
result[x] = vec1[y]*vec2[z] - vec1[z]*vec2[y];
result[y] = vec1[z]*vec2[x] - vec1[x]*vec2[z];
result[z] = vec1[x]*vec2[y] - vec1[y]*vec2[x];
The problem I'm having is understanding how to implement CUDA parallelization for the arrays I currently have. Since each value in the result vector is a separate calculation, I can effectively run the above calculation for each vector in parallel. Since each component of the resulting cross product is a separate calculation, those too could run in parallel. How would I go about setting up the blocks and threads for such a problem?
The top 2 optimization priorities for any CUDA programmer are to use memory efficiently, and expose enough parallelism to hide latency. We'll use those to guide our algorithmic choices.
A very simple thread strategy (the thread strategy answers the question, "what will each thread do or be responsible for?") in any transformation (as opposed to reduction) type problem is to have each thread be responsible for 1 output value. Your problem fits the description of transformation - the output data set size is on the order of the input data set size(s).
I'll assume that you intended to have two equal length vectors containing your 3D vectors, and that you want to take the cross product of the first 3D vectors in each and the 2nd 3D vectors in each, and so on.
If we choose a thread strategy of 1 output point per thread (i.e. result[x] or result[y] or result[z], all together would be 3 output points), then we will need 3 threads to compute the output of each vector cross product. If we have enough vectors to multiply, then we will have enough threads to keep our machine "busy" and do a good job of hiding latency. As a rule of thumb, your problem will start to become interesting on GPUs if the number of threads is 10000 or more, so this means we would want your 1D vectors to consist of about 3000 3D vectors or more. Let's assume that is the case.
In order to tackle the memory efficiency objective, our first task is to load your vector data from global memory. We will want this ideally to be coalesced, which roughly means adjacent threads access adjacent elements in memory. We'll want the output stores to be coalesced also, and our thread strategy of choosing one output point/one vector component per thread will work nicely to support that.
For efficient memory usage, we'd like to ideally load each item from global memory only once. Your algorithm naturally involves a small amount of data reuse. The data reuse is evident since the computation of result[y] depends on vec2[z] and the computation of result[x] also depends on vec2[z] to pick just one example. Therefore a typical strategy when there is data reuse is to load the data first into CUDA shared memory, and then allow the threads to perform their computations based on the data in shared memory. As we will see, this makes it fairly easy/convenient for us to arrange for coalesced loads from global memory, since the global data load arrangement is no longer tightly coupled to the threads or the usage of the data for computation.
The last challenge is to figure out an indexing pattern so that each thread will select the proper elements from shared memory to multiply together. If we look at your calculation pattern that you have depicted in your question, we see that the first load from vec1 follows an offset pattern of +1(modulo 3) from the index that the result is being computed for. So x->y, y->z, and z -> x. Likewise we see a +2(modulo 3) for the next load from vec2, another +2(modulo 3) pattern for the next load from vec1 and another +1(modulo 3) pattern for the final load from vec2.
If we combine all these ideas, we can then write a kernel that should have generally efficient characteristics:
$ cat t1003.cu
#include <stdio.h>
#define TV1 1
#define TV2 2
const size_t N = 4096; // number of 3D vectors
const int blksize = 192; // choose as multiple of 3 and 32, and less than 1024
typedef float mytype;
//pairwise vector cross product
template <typename T>
__global__ void vcp(const T * __restrict__ vec1, const T * __restrict__ vec2, T * __restrict__ res, const size_t n){
  __shared__ T sv1[blksize];
  __shared__ T sv2[blksize];
  size_t idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < 3*n){ // grid-stride loop
    // load shared memory using coalesced pattern to global memory
    sv1[threadIdx.x] = vec1[idx];
    sv2[threadIdx.x] = vec2[idx];
    // compute modulo/offset indexing for thread loads of shared data from vec1, vec2
    int my_mod = threadIdx.x%3; // costly, but possibly hidden by global load latency
    int off1 = my_mod+1;
    if (off1 > 2) off1 -= 3;
    int off2 = my_mod+2;
    if (off2 > 2) off2 -= 3;
    __syncthreads();
    // each thread loads its computation elements from shared memory
    T t1 = sv1[threadIdx.x-my_mod+off1];
    T t2 = sv2[threadIdx.x-my_mod+off2];
    T t3 = sv1[threadIdx.x-my_mod+off2];
    T t4 = sv2[threadIdx.x-my_mod+off1];
    // compute result, and store using coalesced pattern, to global memory
    res[idx] = t1*t2-t3*t4;
    idx += gridDim.x*blockDim.x;} // for grid-stride loop
}
int main(){
  mytype *h_v1, *h_v2, *d_v1, *d_v2, *h_res, *d_res;
  h_v1 = (mytype *)malloc(N*3*sizeof(mytype));
  h_v2 = (mytype *)malloc(N*3*sizeof(mytype));
  h_res = (mytype *)malloc(N*3*sizeof(mytype));
  cudaMalloc(&d_v1, N*3*sizeof(mytype));
  cudaMalloc(&d_v2, N*3*sizeof(mytype));
  cudaMalloc(&d_res, N*3*sizeof(mytype));
  for (int i = 0; i<N; i++){
    h_v1[3*i] = TV1;
    h_v1[3*i+1] = 0;
    h_v1[3*i+2] = 0;
    h_v2[3*i] = 0;
    h_v2[3*i+1] = TV2;
    h_v2[3*i+2] = 0;
    h_res[3*i] = 0;
    h_res[3*i+1] = 0;
    h_res[3*i+2] = 0;}
  cudaMemcpy(d_v1, h_v1, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_v2, h_v2, N*3*sizeof(mytype), cudaMemcpyHostToDevice);
  vcp<<<(N*3+blksize-1)/blksize, blksize>>>(d_v1, d_v2, d_res, N);
  cudaMemcpy(h_res, d_res, N*3*sizeof(mytype), cudaMemcpyDeviceToHost);
  // verification
  for (int i = 0; i < N; i++) if ((h_res[3*i] != 0) || (h_res[3*i+1] != 0) || (h_res[3*i+2] != TV1*TV2)) { printf("mismatch at %d, was: %f, %f, %f, should be: %f, %f, %f\n", i, h_res[3*i], h_res[3*i+1], h_res[3*i+2], (float)0, (float)0, (float)(TV1*TV2)); return -1;}
  printf("%s\n", cudaGetErrorString(cudaGetLastError()));
  return 0;
}
$ nvcc t1003.cu -o t1003
$ cuda-memcheck ./t1003
========= CUDA-MEMCHECK
no error
========= ERROR SUMMARY: 0 errors
$
Note that I've chosen to write the kernel using a grid-stride loop. This isn't terribly important to this discussion, and not that relevant for this problem, because I've chosen a grid size equal to the problem size (4096*3). However for much larger problem sizes, you might choose a smaller grid size than the overall problem size, for some possible small efficiency gain.
For such a simple problem as this, it's fairly easy to define "optimality". The optimal scenario would be however long it takes to load the input data (just once) and write the output data. If we consider a larger version of the test code above, changing N to 40960 (and making no other changes), then the total data read and written would be 40960*3*4*3 bytes. If we profile that code and then compare to bandwidthTest as a proxy for peak achievable memory bandwidth, we observe:
$ CUDA_VISIBLE_DEVICES="1" nvprof ./t1003
==27861== NVPROF is profiling process 27861, command: ./t1003
no error
==27861== Profiling application: ./t1003
==27861== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 65.97% 162.22us 2 81.109us 77.733us 84.485us [CUDA memcpy HtoD]
30.04% 73.860us 1 73.860us 73.860us 73.860us [CUDA memcpy DtoH]
4.00% 9.8240us 1 9.8240us 9.8240us 9.8240us void vcp<float>(float const *, float const *, float*, unsigned long)
API calls: 99.10% 249.79ms 3 83.263ms 6.8890us 249.52ms cudaMalloc
0.46% 1.1518ms 96 11.998us 374ns 454.09us cuDeviceGetAttribute
0.25% 640.18us 3 213.39us 186.99us 229.86us cudaMemcpy
0.10% 255.00us 1 255.00us 255.00us 255.00us cuDeviceTotalMem
0.05% 133.16us 1 133.16us 133.16us 133.16us cuDeviceGetName
0.03% 71.903us 1 71.903us 71.903us 71.903us cudaLaunchKernel
0.01% 15.156us 1 15.156us 15.156us 15.156us cuDeviceGetPCIBusId
0.00% 7.0920us 3 2.3640us 711ns 4.6520us cuDeviceGetCount
0.00% 2.7780us 2 1.3890us 612ns 2.1660us cuDeviceGet
0.00% 1.9670us 1 1.9670us 1.9670us 1.9670us cudaGetLastError
0.00% 361ns 1 361ns 361ns 361ns cudaGetErrorString
$ CUDA_VISIBLE_DEVICES="1" /usr/local/cuda/samples/bin/x86_64/linux/release/bandwidthTest
[CUDA Bandwidth Test] - Starting...
Running on...
Device 0: Tesla K20Xm
Quick Mode
Host to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6375.8
Device to Host Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 6554.3
Device to Device Bandwidth, 1 Device(s)
PINNED Memory Transfers
Transfer Size (Bytes) Bandwidth(MB/s)
33554432 171220.3
Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
$
The kernel takes 9.8240us to execute, and in that time loads or stores a total of 40960*3*4*3 bytes of data. Therefore the achieved memory bandwidth by the kernel is 40960*3*4*3/0.000009824 or 150 GB/s. The proxy measurement for peak achievable on this GPU is 171 GB/s, so this kernel achieves 88% of the optimal throughput. With more careful benchmarking to run the kernel twice in a row, the 2nd execution requires only 8.99us to execute. This brings the achieved bandwidth in this case up to 96% of peak achievable throughput.
