Here is the code:
System.out.println("Runtime max: " + mb(Runtime.getRuntime().maxMemory()));
MemoryMXBean m = ManagementFactory.getMemoryMXBean();
System.out.println("Non-heap: " + mb(m.getNonHeapMemoryUsage().getMax()));
System.out.println("Heap: " + mb(m.getHeapMemoryUsage().getMax()));
for (MemoryPoolMXBean mp : ManagementFactory.getMemoryPoolMXBeans()) {
    System.out.println("Pool: " + mp.getName() +
            " (type " + mp.getType() + ")" +
            " = " + mb(mp.getUsage().getMax()));
}
Running the code on JDK 8 gives:
[root@docker-runner-2486794196-0fzm0 docker-runner]# java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
[root@docker-runner-2486794196-0fzm0 docker-runner]# java -jar -Xmx1024M -XX:+UnlockExperimentalVMOptions -XX:+UseCGroupMemoryLimitForHeap test.jar
Runtime max: 954728448 (910.50 M)
Non-heap: -1 (-0.00 M)
Heap: 954728448 (910.50 M)
Pool: Code Cache (type Non-heap memory) = 251658240 (240.00 M)
Pool: Metaspace (type Non-heap memory) = -1 (-0.00 M)
Pool: Compressed Class Space (type Non-heap memory) = 1073741824 (1024.00 M)
Pool: PS Eden Space (type Heap memory) = 355467264 (339.00 M)
Pool: PS Survivor Space (type Heap memory) = 1048576 (1.00 M)
Pool: PS Old Gen (type Heap memory) = 716177408 (683.00 M)
Runtime max: 954728448 (910.50 M)
The Runtime.maxMemory() value is 910.50 M; I want to know how this value is worked out.
On JDK7, "Runtime.getRuntime().maxMemory()" = "-Xmx" - "Survivor"
, But it does not work on JDK8。
In JDK 8 the formula Runtime.maxMemory() = Xmx - Survivor still holds, but the trick is how Survivor is estimated.
You haven't set the initial heap size (-Xms), and the Adaptive Size Policy is on by default. This means the heap can resize and the generation boundaries can move at runtime. Runtime.maxMemory() therefore estimates the available memory conservatively, subtracting the maximum possible survivor size from the size of the New Generation.
Runtime.maxMemory() = OldGen + NewGen - MaxSurvivor
where MaxSurvivor = NewGen / MinSurvivorRatio
In your example OldGen = 683 MB, NewGen = 341 MB and MinSurvivorRatio = 3 by default. That is,
Runtime.maxMemory() = 683 + 341 - (341/3) = 910.333 MB
If you disable adaptive sizing with -XX:-UseAdaptiveSizePolicy or set the initial heap size (-Xms) to the same value as -Xmx, you'll again see that Runtime.maxMemory() = OldGen + Eden + Survivor.
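To see what these knobs resolve to on a given JVM, you can query the HotSpot diagnostic MXBean (a minimal sketch; the class name FlagCheck is made up here, and the bean is HotSpot-specific). The same values are also visible via java -XX:+PrintFlagsFinal -version.
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class FlagCheck {
    public static void main(String[] args) {
        // HotSpot-only diagnostic bean; not available on every JVM
        HotSpotDiagnosticMXBean hs =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        for (String flag : new String[] {
                "MinSurvivorRatio", "SurvivorRatio", "UseAdaptiveSizePolicy"}) {
            System.out.println(flag + " = " + hs.getVMOption(flag).getValue());
        }
    }
}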
The assumption that the discrepancy between the reported max heap and the actual max heap stems from the survivor space was based on empirical data, but has not been proven to be an intentional feature.
I expanded the program a bit (code at the end). Running this expanded program on JDK 6 with -Xmx1G -XX:-UseParallelGC gave me
Runtime max: 1037959168 (989 MiB)
Heap: 1037959168 (989 MiB)
Pool: Eden Space = 286326784 (273 MiB)
Pool: Survivor Space = 35782656 (34 MiB)
Pool: Tenured Gen = 715849728 (682 MiB)
Pool: Heap memory total = 1037959168 (989 MiB)
Eden + 2*Survivor + Tenured = 1073741824 (1024 MiB)
(Non-heap: omitted)
Here, the values match. The reported max size is equal to the sum of the heap spaces, so the sum of the reported max size and one Survivor Space’s size is equal to the result of the formula Eden + 2*Survivor + Tenured, the precise heap size.
The reason why I specified -XX:-UseParallelGC was that the term "Tenured" in the linked answer gave me a hint about where this assumption came from. When I run the program on Java 6 without -XX:-UseParallelGC on my machine, I get
Runtime max: 954466304 (910 MiB)
Heap: 954466304 (910 MiB)
Pool: PS Eden Space = 335609856 (320 MiB)
Pool: PS Survivor Space = 11141120 (10 MiB)
Pool: PS Old Gen = 715849728 (682 MiB)
Pool: Heap memory total = 1062600704 (1013 MiB)
Eden + 2*Survivor + Tenured = 1073741824 (1024 MiB)
(Non-heap: omitted)
Here, the reported max size is not equal to the sum of the heap memory pools, hence the "reported max size plus Survivor" formula produces a different result. These are the same values I get with Java 8 using default options, so your problem is not related to Java 8; even on Java 6, the values do not match when the garbage collector differs from the one used in the linked Q&A.
Note that starting with Java 9, -XX:+UseG1GC became the default and with that, I get
Runtime max: 1073741824 (1024 MiB)
Heap: 1073741824 (1024 MiB)
Pool: G1 Eden Space = unspecified/unlimited
Pool: G1 Survivor Space = unspecified/unlimited
Pool: G1 Old Gen = 1073741824 (1024 MiB)
Pool: Heap memory total = 1073741824 (1024 MiB)
Eden + 2*Survivor + Tenured = N/A
(Non-heap: omitted)
The bottom line is, the assumption that the difference is equal to the size of the Survivor Space only holds for one specific (outdated) garbage collector. But when applicable, the formula Eden + 2*Survivor + Tenured gives the exact heap size. For the "Garbage First" collector, where the formula is not applicable, the reported max size is already the correct value.
So the best strategy is to get the max values for Eden, Survivor, and Tenured (aka Old), then check whether any of these values is -1. If so, just use Runtime.getRuntime().maxMemory(); otherwise, calculate Eden + 2*Survivor + Tenured.
The program code:
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;

public class MaxHeap { // class wrapper and imports added so the snippet compiles as-is
    public static void main(String[] args) {
        System.out.println("Runtime max: " + mb(Runtime.getRuntime().maxMemory()));
        MemoryMXBean m = ManagementFactory.getMemoryMXBean();
        System.out.println("Heap: " + mb(m.getHeapMemoryUsage().getMax()));
        scanPools(MemoryType.HEAP);
        checkFormula();
        System.out.println();
        System.out.println("Non-heap: " + mb(m.getNonHeapMemoryUsage().getMax()));
        scanPools(MemoryType.NON_HEAP);
        System.out.println();
    }

    // Sums Eden + 2*Survivor + Old/Tenured, printing "N/A" if any pool is
    // missing, duplicated, or reports an unlimited (-1) max size.
    private static void checkFormula() {
        long total = 0;
        boolean eden = false, old = false, survivor = false, na = false;
        for (MemoryPoolMXBean mp : ManagementFactory.getMemoryPoolMXBeans()) {
            final long max = mp.getUsage().getMax();
            if (mp.getName().contains("Eden")) { na = eden; eden = true; }
            else if (mp.getName().matches(".*(Old|Tenured).*")) { na = old; old = true; }
            else if (mp.getName().contains("Survivor")) {
                na = survivor;
                survivor = true;
                total += max; // counted twice: Eden + 2*Survivor + Tenured
            }
            else continue;
            if (max == -1) na = true;
            if (na) break;
            total += max;
        }
        System.out.println("Eden + 2*Survivor + Tenured = "
            + (!na && eden && old && survivor ? mb(total) : "N/A"));
    }

    // Prints every memory pool of the given type and the sum of their max sizes.
    private static void scanPools(final MemoryType type) {
        long total = 0;
        for (MemoryPoolMXBean mp : ManagementFactory.getMemoryPoolMXBeans()) {
            if (mp.getType() != type) continue;
            long max = mp.getUsage().getMax();
            System.out.println("Pool: " + mp.getName() + " = " + mb(max));
            if (max != -1) total += max;
        }
        System.out.println("Pool: " + type + " total = " + mb(total));
    }

    private static String mb(long mem) {
        return mem == -1 ? "unspecified/unlimited"
                         : String.format("%d (%d MiB)", mem, mem >>> 20);
    }
}
Related
I have a simple CUDA kernel to test loop unrolling, and then I discovered another thing: when the loop count is 10, the kernel takes 34 milliseconds; when the loop count is 90, it takes 59 milliseconds; but when the loop count is 100, it takes 423 milliseconds!
Launch configuration is the same, only loop count changed.
So, my question is, what could be the reason for this performance drop?
Here is the code; the input is an array of 128x1024x1024 elements, and I'm using PyCUDA:
__global__ void copy(float *input, float *output) {
    int tidx = blockIdx.y * blockDim.x + threadIdx.x;
    int stride = 1024 * 1024;
    for (int i = 0; i < 128; i++) {
        int idx = i * stride + tidx;
        float x = input[idx];
        float y = 0;
        for (int j = 0; j < 100; j += 10) {
            x = x + sqrt(float(j));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+1));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+2));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+3));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+4));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+5));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+6));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+7));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+8));
            y = sqrt(abs(x)) + sin(x) + cos(x);
            x = x + sqrt(float(j+9));
            y = sqrt(abs(x)) + sin(x) + cos(x);
        }
        output[idx] = y;
    }
}
The loop count I mentioned is this line:
for (int j = 0; j < 100; j += 10)
And sample outputs here:
10 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 34.24 miliseconds
90 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 59.33 miliseconds
100 loops
griddimx: 1 griddimy: 1024 griddimz: 1
blockdimx: 1024 blockdimy: 1 blockdimz: 1
nthreads: 1048576 blocks: 1024
prefetch.py:82: UserWarning: The CUDA compiler succeeded, but said the following:
ptxas info : 0 bytes gmem, 24 bytes cmem[3]
ptxas info : Compiling entry function 'copy' for 'sm_61'
ptxas info : Function properties for copy
32 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 22 registers, 336 bytes cmem[0], 52 bytes cmem[2]
computation takes 422.96 miliseconds
The problem seems to come from loop unrolling.
Indeed, the 10-loops case can be trivially unrolled by NVCC since the loop is actually always executed once (thus the for line can be removed with j set to 0).
The 90-loops case is unrolled by NVCC (there are only 9 actual iterations). The resulting code is thus much bigger but still fast since no branches are performed (GPUs hate branches). However, the 100-loops case is not unrolled by NVCC (you hit a threshold of the compiler optimizer). The resulting code is small, but it leads to more branches being executed at runtime: branching is performed for each executed loop iteration (a total of 10).
You can see the assembly code difference here.
You can force unrolling using the #pragma unroll directive. However, keep in mind that increasing the size of the code can reduce its performance.
PS: the slightly higher number of registers used in the last version may decrease performance, but simulations show that it should be OK in this case.
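For illustration, a minimal self-contained sketch of where the directive would go (the kernel name unroll_demo and its signature are made up here; only the #pragma line is the point):
__global__ void unroll_demo(const float *input, float *output, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n) return;
    float x = input[idx];
    float y = 0.0f;
    // Ask nvcc to fully unroll the following loop regardless of its
    // default unrolling threshold.
    #pragma unroll
    for (int j = 0; j < 100; j += 10) {
        x += sqrtf((float)j);
        y = sqrtf(fabsf(x)) + sinf(x) + cosf(x);
    }
    output[idx] = y;
}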
Given a NuSMV model, how can I find its runtime and how much memory it consumed?
The runtime can be found using this command at the system prompt: /usr/bin/time -f "time %e s" NuSMV filename.smv
The above gives the wall-clock time. Is there a better way to obtain runtime statistics from within NuSMV itself?
Also, how can I find out how much RAM the program used while processing the file?
One possibility is to use the usage command, which displays both the amount of RAM currently being used and the User and System time used by the tool since it was started (thus, usage should be called both before and after each operation you want to profile).
An example execution:
NuSMV > usage
Runtime Statistics
------------------
Machine name: *****
User time 0.005 seconds
System time 0.005 seconds
Average resident text size = 0K
Average resident data+stack size = 0K
Maximum resident size = 6932K
Virtual text size = 8139K
Virtual data size = 34089K
data size initialized = 3424K
data size uninitialized = 178K
data size sbrk = 30487K
Virtual memory limit = -2147483648K (-2147483648K)
Major page faults = 0
Minor page faults = 2607
Swaps = 0
Input blocks = 0
Output blocks = 0
Context switch (voluntary) = 9
Context switch (involuntary) = 0
NuSMV > reset; read_model -i nusmvLab.2018.06.07.smv ; go ; check_property ; usage
-- specification (L6 != pc U cc = len) IN mm is true
-- specification F (min = 2 & max = 9) IN mm is true
-- specification G !((((max > arr[0] & max > arr[1]) & max > arr[2]) & max > arr[3]) & max > arr[4]) IN mm is true
-- invariant max >= min IN mm is true
Runtime Statistics
------------------
Machine name: *****
User time 47.214 seconds
System time 0.284 seconds
Average resident text size = 0K
Average resident data+stack size = 0K
Maximum resident size = 270714K
Virtual text size = 8139K
Virtual data size = 435321K
data size initialized = 3424K
data size uninitialized = 178K
data size sbrk = 431719K
Virtual memory limit = -2147483648K (-2147483648K)
Major page faults = 1
Minor page faults = 189666
Swaps = 0
Input blocks = 48
Output blocks = 0
Context switch (voluntary) = 12
Context switch (involuntary) = 145
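If an external measurement is enough, GNU time's verbose mode also reports peak memory (a general command-line alternative, not specific to NuSMV):
/usr/bin/time -v NuSMV filename.smv
# the verbose report includes, among other counters:
#   Maximum resident set size (kbytes): ...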
I am currently testing Julia (I've worked with Matlab).
In Matlab, computing N^3 is slower than N*N*N. This doesn't happen with N^2 versus N*N. Matlab uses a different algorithm for higher-order exponents because it prefers accuracy over speed.
I think Julia does the same thing.
I wanted to ask whether there is a way to force Julia to compute powers of N using multiplication instead of the default algorithm, at least for cube exponents.
Some time ago I did a few tests of this in Matlab. I translated that code to Julia.
Link to the code:
http://pastebin.com/bbeukhTc
(I can't upload all the links here :( )
Results of the scripts on Matlab 2014:
Exponente1
Elapsed time is 68.293793 seconds. (17.7x the fastest)
Exponente2
Elapsed time is 24.236218 seconds. (6.3x the fastest)
Exponente3
Elapsed time is 3.853348 seconds.
Results of the scripts on Julia 0.4.6:
Exponente1
18.423204 seconds (8.22 k allocations: 372.563 KB) (51.6x the fastest)
Exponente2
13.746904 seconds (9.02 k allocations: 407.332 KB) (38.5x the fastest)
Exponente3
0.356875 seconds (10.01 k allocations: 450.441 KB)
In my tests Julia is faster than Matlab, but I am using a relatively old version; I can't test other versions.
Checking Julia's source code:
julia/base/math.jl:
^(x::Float64, y::Integer) =
box(Float64, powi_llvm(unbox(Float64,x), unbox(Int32,Int32(y))))
^(x::Float32, y::Integer) =
box(Float32, powi_llvm(unbox(Float32,x), unbox(Int32,Int32(y))))
julia/base/fastmath.jl:
pow_fast{T<:FloatTypes}(x::T, y::Integer) = pow_fast(x, Int32(y))
pow_fast{T<:FloatTypes}(x::T, y::Int32) =
box(T, Base.powi_llvm(unbox(T,x), unbox(Int32,y)))
We can see that Julia uses powi_llvm.
Checking LLVM's source code:
define double @powi(double %F, i32 %power) {
; CHECK: powi:
; CHECK: bl __powidf2
  %result = call double @llvm.powi.f64(double %F, i32 %power)
  ret double %result
}
Now, __powidf2 is the interesting function here:
COMPILER_RT_ABI double
__powidf2(double a, si_int b)
{
    const int recip = b < 0;
    double r = 1;
    while (1)
    {
        if (b & 1)
            r *= a;
        b /= 2;
        if (b == 0)
            break;
        a *= a;
    }
    return recip ? 1/r : r;
}
Example 1: given a = 2; b = 7:
- r = 1
- iteration 1: r = 1 * 2 = 2; b = (int)(7/2) = 3; a = 2 * 2 = 4
- iteration 2: r = 2 * 4 = 8; b = (int)(3/2) = 1; a = 4 * 4 = 16
- iteration 3: r = 8 * 16 = 128;
Example 2: given a = 2; b = 8:
- r = 1
- iteration 1: r = 1; b = (int)(8/2) = 4; a = 2 * 2 = 4
- iteration 2: r = 1; b = (int)(4/2) = 2; a = 4 * 4 = 16
- iteration 3: r = 1; b = (int)(2/2) = 1; a = 16 * 16 = 256
- iteration 4: r = 1 * 256 = 256; b = (int)(1/2) = 0;
Integer power is always implemented as a sequence of multiplications. That's why N^3 is slower than N^2.
jl_powi_llvm (called from fastmath.jl; the "jl_" prefix is added by macro expansion), on the other hand, casts the exponent to floating point and calls pow(). C source code:
JL_DLLEXPORT jl_value_t *jl_powi_llvm(jl_value_t *a, jl_value_t *b)
{
    jl_value_t *ty = jl_typeof(a);
    if (!jl_is_bitstype(ty))
        jl_error("powi_llvm: a is not a bitstype");
    if (!jl_is_bitstype(jl_typeof(b)) || jl_datatype_size(jl_typeof(b)) != 4)
        jl_error("powi_llvm: b is not a 32-bit bitstype");
    jl_value_t *newv = newstruct((jl_datatype_t*)ty);
    void *pa = jl_data_ptr(a), *pr = jl_data_ptr(newv);
    int sz = jl_datatype_size(ty);
    switch (sz) {
    /* choose the right size c-type operation */
    case 4:
        *(float*)pr = powf(*(float*)pa, (float)jl_unbox_int32(b));
        break;
    case 8:
        *(double*)pr = pow(*(double*)pa, (double)jl_unbox_int32(b));
        break;
    default:
        jl_error("powi_llvm: runtime floating point intrinsics are not implemented for bit sizes other than 32 and 64");
    }
    return newv;
}
Lior's answer is excellent. Here is a solution to the problem you posed: yes, there is a way to force the use of multiplication, at the cost of accuracy. It's the @fastmath macro:
julia> @benchmark 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 999
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 16.00 bytes
allocs estimate: 1
minimum time: 13.00 ns (0.00% GC)
median time: 14.00 ns (0.00% GC)
mean time: 15.74 ns (6.14% GC)
maximum time: 1.85 μs (98.16% GC)
julia> @benchmark @fastmath 1.1 ^ 3
BenchmarkTools.Trial:
samples: 10000
evals/sample: 1000
time tolerance: 5.00%
memory tolerance: 1.00%
memory estimate: 0.00 bytes
allocs estimate: 0
minimum time: 2.00 ns (0.00% GC)
median time: 3.00 ns (0.00% GC)
mean time: 2.59 ns (0.00% GC)
maximum time: 20.00 ns (0.00% GC)
Note that with @fastmath, performance is much better.
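For reference, a minimal sketch of benchmarking the two paths inside functions (this assumes the BenchmarkTools package is installed; whether @fastmath actually lowers a literal cube to plain multiplications depends on the Julia version):
using BenchmarkTools

cube(x) = x^3                  # generic integer-power path
cube_fast(x) = @fastmath x^3   # fast-math path, less accurate in general

x = 1.1
@btime cube($x)        # interpolate $x so the call is not constant-folded away
@btime cube_fast($x)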
I stumbled upon an interesting thing while checking the performance of memory allocation in Go.
package main

import (
    "fmt"
    "time"
)

func main() {
    const alloc int = 65536
    now := time.Now()
    loop := 50000
    for i := 0; i < loop; i++ {
        sl := make([]byte, alloc)
        i += len(sl) * 0
    }
    elapsed := time.Since(now)
    fmt.Printf("took %s to allocate %d bytes %d times", elapsed, alloc, loop)
}
I am running this on a Core i7-2600 with Go 1.6, 64-bit (same results on 32-bit), and 16 GB of RAM, on Windows 10.
When alloc is 65536 (exactly 64 KB) it runs for 30 seconds (!!!!).
When alloc is 65535 it takes ~200 ms.
Can someone explain this to me please?
I tried the same code at home with my Core i7-920 @ 3.8 GHz but it didn't show the same results (both took around 200 ms). Anyone have an idea what's going on?
Setting GOGC=off improved performance (down to less than 100 ms). Why?
Because of escape analysis. When you build with go build -gcflags -m, the compiler prints which allocations escape to the heap. It really depends on your machine and Go compiler version, but when the compiler decides that an allocation should move to the heap it means two things:
1. The allocation will take longer (since "allocating" on the stack is just one CPU instruction).
2. The GC will have to clean up that memory later, costing more CPU time.
For my machine, the allocation of 65536 bytes escapes to the heap and 65535 doesn't; that's why one byte changed the whole process from 200 ms to 30 s. Amazing.
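A minimal way to see the compiler's decision yourself (the file name main.go is just an example; the exact diagnostic wording varies between Go versions):
// main.go
package main

func main() {
    // Escape analysis decides whether this buffer lives on the stack or
    // on the heap; build with `go build -gcflags -m` to print the decision.
    sl := make([]byte, 65536)
    _ = sl
}
Running go build -gcflags -m main.go then reports whether the make call escapes to the heap.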
Note/Update 2021: as Tapir Liu notes in Go101 with this tweet:
As of Go 1.17, Go runtime will allocate the elements of slice x on stack if the compiler proves they are only used in the current goroutine and N <= 64KB:
var x = make([]byte, N)
And Go runtime will allocate the array y on stack if the compiler proves it is only used in the current goroutine and N <= 10MB:
var y [N]byte
Then how to allocated (the elements of) a slice which size is larger than 64KB but not larger than 10MB on stack (and the slice is only used in one goroutine)?
Just use the following way:
var y [N]byte
var x = y[:]
Considering stack allocation is faster than heap allocation, that would have a direct effect on your test for alloc equal to 65536 and above.
Tapir adds:
In fact, we could allocate slices with an arbitrarily large total element size on the stack.
const N = 500 * 1024 * 1024 // 500M
var v byte = 123

func createSlice() byte {
    var s = []byte{N: 0}
    for i := range s { s[i] = v }
    return s[v]
}
Changing 500 to 512 makes the program crash.
The reason is very simple; compare the generated assembly for the two values of alloc:
const alloc int = 65535
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $65784-0
const alloc int = 65536
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $248-0
The difference is where the slice is created: with 65535 the buffer lives in main's stack frame (note the 65784-byte frame size), while with 65536 it is allocated on the heap (the frame shrinks to 248 bytes).
I was trying to solve a puzzle in Haskell and had written the following code:
u 0 p = 0.0
u 1 p = 1.0
u n p = 1.0 + minimum [((1.0-q)*(s k p)) + (u (n-k) p) | k <-[1..n], let q = (1.0-p)**(fromIntegral k)]
s 1 p = 0.0
s n p = 1.0 + minimum [((1.0-q)*(s (n-k) p)) + q*((s k p) + (u (n-k) p)) | k <-[1..(n-1)], let q = (1.0-(1.0-p)**(fromIntegral k))/(1.0-(1.0-p)**(fromIntegral n))]
This code was terribly slow though. I suspect the reason for this is that the same things get calculated again and again. I therefore made a memoized version:
import Data.Array

memoUa = array (0,10000) ((0,0.0):(1,1.0):[(k,mua k) | k <- [2..10000]])
mua n = 1.0 + minimum [((1.0-q)*(memoSa ! k)) + (memoUa ! (n-k)) | k <- [1..n], let q = (1.0-0.02)**(fromIntegral k)]
memoSa = array (0,10000) ((0,0.0):(1,0.0):[(k,msa k) | k <- [2..10000]])
msa n = 1.0 + minimum [((1.0-q) * (memoSa ! (n-k))) + q*((memoSa ! k) + (memoUa ! (n-k))) | k <- [1..(n-1)], let q = (1.0-(1.0-0.02)**(fromIntegral k))/(1.0-(1.0-0.02)**(fromIntegral n))]
This seems to be a lot faster, but now I get an out-of-memory error. I do not understand why this happens (the same strategy in Java, without recursion, has no problems). Could somebody point me in the right direction on how to improve this code?
EDIT: I am adding my Java version here (as I don't know where else to put it). I realize that the code isn't really reader-friendly (no meaningful names, etc.), but I hope it is clear enough.
public class Main {
    public static double calc(double p) {
        double[] u = new double[10001];
        double[] s = new double[10001];
        u[0] = 0.0;
        u[1] = 1.0;
        s[0] = 0.0;
        s[1] = 0.0;
        for (int n = 2; n < 10001; n++) {
            double q = 1.0;
            double denom = 1.0;
            for (int k = 1; k <= n; k++) {
                denom = denom * (1.0 - p);
            }
            denom = 1.0 - denom;
            s[n] = (double) n;
            u[n] = (double) n;
            for (int k = 1; k <= n; k++) {
                q = (1.0 - p) * q;
                if (k < n) {
                    double qs = (1.0 - q) / denom;
                    double bs = (1.0 - qs) * s[n-k] + qs * (s[k] + u[n-k]) + 1.0;
                    if (bs < s[n]) {
                        s[n] = bs;
                    }
                }
                double bu = (1.0 - q) * s[k] + 1.0 + u[n-k];
                if (bu < u[n]) {
                    u[n] = bu;
                }
            }
        }
        return u[10000];
    }

    public static void main(String[] args) {
        double s = 0.0;
        int i = 2;
        //for (int i = 1; i<51; i++) {
        s = s + calc(i * 0.01);
        //}
        System.out.println("result = " + s);
    }
}
I don't run out of memory when I run the compiled version, but there is a significant difference between how the Java version works and how the Haskell version works which I'll illustrate here.
The first thing to do is to add some important type signatures. In particular, you don't want Integer array indices, so I added:
memoUa :: Array Int Double
memoSa :: Array Int Double
I found these using ghc-mod check. I also added a main so that you can run it from the command line:
import System.Environment

main = do
  (arg:_) <- getArgs
  let n = read arg
  print $ mua n
Now to gain some insight into what's going on, we can compile the program using profiling:
ghc -O2 -prof memo.hs
Then when we invoke the program like this:
memo 1000 +RTS -s
we will get profiling output which looks like:
164.31333233347755
98,286,872 bytes allocated in the heap
29,455,360 bytes copied during GC
657,080 bytes maximum residency (29 sample(s))
38,260 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 161 colls, 0 par 0.03s 0.03s 0.0002s 0.0011s
Gen 1 29 colls, 0 par 0.03s 0.03s 0.0011s 0.0017s
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.21s ( 0.21s elapsed)
GC time 0.06s ( 0.06s elapsed)
RP time 0.00s ( 0.00s elapsed)
PROF time 0.00s ( 0.00s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.27s elapsed)
%GC time 21.8% (22.3% elapsed)
Alloc rate 468,514,624 bytes per MUT second
Productivity 78.2% of total user, 77.3% of total elapsed
Important things to pay attention to are:
maximum residency
Total time
%GC time (or Productivity)
Maximum residency is a measure of how much memory is needed by the program. %GC time is the proportion of the time spent in garbage collection, and Productivity is the complement (100% - %GC time).
If you run the program for various input values you will see a productivity of around 80%:
n Max Res. Prod. Time Output
2000 779,076 79.4% 1.10s 328.54535361588535
4000 1,023,016 80.7% 4.41s 657.0894961398351
6000 1,299,880 81.3% 9.91s 985.6071032981068
8000 1,539,352 81.5% 17.64s 1314.0968411684714
10000 1,815,600 81.7% 27.57s 1642.5891214360522
This means that about 20% of the run time is spent in garbage collection. Also, we see increasing memory usage as n increases.
It turns out we can dramatically improve productivity and memory usage by telling Haskell the order in which to evaluate the array elements instead of relying on lazy evaluation:
import Control.Monad (forM_)

main = do
  (arg:_) <- getArgs
  let n = read arg
  forM_ [1..n] $ \i -> mua i `seq` return ()
  print $ mua n
And the new profiling stats are:
n Max Res. Prod. Time Output
2000 482,800 99.3% 1.31s 328.54535361588535
4000 482,800 99.6% 5.88s 657.0894961398351
6000 482,800 99.5% 12.09s 985.6071032981068
8000 482,800 98.1% 21.71s 1314.0968411684714
10000 482,800 96.1% 34.58s 1642.5891214360522
Some interesting observations here: productivity is up and memory usage is down (constant now over the range of inputs), but run time is up. This suggests that we forced more computations than we needed to. In an imperative language like Java you have to give an evaluation order, so you know exactly which computations need to be performed. It would be interesting to see your Java code to see which computations it is performing.