Unique matrix transpose problem: contradictory reports from cachegrind and perf - caching
In the following, we're dealing with an algorithm which transposes a matrix of complex values, struct complex {double real = 0.0; double imag = 0.0;};. Owing to a special data layout, there is a stride of n*n elements between consecutive rows, which means that loading a subsequent row evicts the previously loaded row from the cache. All runs have been done using 1 thread only.
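For concreteness, the access pattern can be thought of along the following lines. This is only a sketch: the question does not show the actual container, so the type name, operator() and the roughly n*n*n-element backing store are assumptions made purely for illustration.

#include <cstddef>
#include <vector>

struct complex { double real = 0.0; double imag = 0.0; };

// Hypothetical container illustrating the layout: consecutive rows are n*n
// elements apart, so iterating over the first index jumps n*n elements per
// access and re-loading a row evicts the previously loaded one.
struct matrix
{
    std::size_t n;
    std::vector<complex> data; // roughly n*n*n elements in this layout sketch

    complex& operator()(std::size_t row, std::size_t col)
    {
        return data[row * n * n + col]; // stride of n*n between rows
    }
};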
I'm trying to understand why my 'optimized' transpose function, which uses 2D blocking, performs badly (follow-up to: 2D blocking with unique matrix transpose problem), so I'm using performance counters and cache simulators to get a reading on what's going wrong.
According to my analysis, with n=500 being the size of the matrix, b=4 my block size and c=4 my cache-line size in elements (64 B per line / 16 B per complex), I expect for the naive algorithm,
for (auto i1 = std::size_t{}; i1 < n1; ++i1)
{
    for (auto i3 = std::size_t{}; i3 < n3; ++i3)
    {
        mat_out(i3, i1) = mat_in(i1, i3);
    }
}
Number of cache-references: (read) n*n + (write) n*n
Number of cache-misses: (read) n*n / c + (write) n*n
Rate of misses: 62.5 %.
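Spelled out, that is (n*n/c + n*n) / (n*n + n*n) = (1/4 + 1) / 2 = 0.625, i.e. 62.5 %.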
Sure enough, cachegrind reports essentially the same numbers:
==21470== Cachegrind, a cache and branch-prediction profiler
==21470== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21470== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21470== Command: ./benchmark/benchmarking_transpose_vslices_dir2_naive 500
==21470==
--21470-- warning: L3 cache found, using its data for the LL simulation.
--21470-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21470-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21470==
==21470== I refs: 30,130,879,636
==21470== I1 misses: 7,666
==21470== LLi misses: 6,286
==21470== I1 miss rate: 0.00%
==21470== LLi miss rate: 0.00%
==21470==
==21470== D refs: 13,285,386,487 (6,705,198,115 rd + 6,580,188,372 wr)
==21470== D1 misses: 8,177,337,186 (1,626,402,679 rd + 6,550,934,507 wr)
==21470== LLd misses: 3,301,064,720 (1,625,156,375 rd + 1,675,908,345 wr)
==21470== D1 miss rate: 61.6% ( 24.3% + 99.6% )
==21470== LLd miss rate: 24.8% ( 24.2% + 25.5% )
==21470==
==21470== LL refs: 8,177,344,852 (1,626,410,345 rd + 6,550,934,507 wr)
==21470== LL misses: 3,301,071,006 (1,625,162,661 rd + 1,675,908,345 wr)
==21470== LL miss rate: 7.6% ( 4.4% + 25.5% )
Now for the implementation with blocking, I expect,
Note: the following code is shown without the remainder loops. The container intermediate_result, sized b x b and suggested by @JérômeRichard, is used to prevent cache-thrashing (a sketch of the assumed scratch buffer follows the loop nest below).
for (auto bi1 = std::size_t{}; bi1 < n1; bi1 += block_size)
{
    for (auto bi3 = std::size_t{}; bi3 < n3; bi3 += block_size)
    {
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                intermediate_result(i3, i1) = mat_in(bi1 + i1, bi3 + i3);
            }
        }
        for (auto i1 = std::size_t{}; i1 < block_size; ++i1)
        {
            #pragma omp simd safelen(8)
            for (auto i3 = std::size_t{}; i3 < block_size; ++i3)
            {
                mat_out(bi3 + i1, bi1 + i3) = intermediate_result(i1, i3);
            }
        }
    }
}
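For reference, intermediate_result in the snippet above is a small contiguous b x b scratch buffer; its actual type is not shown here, so the following is only an assumed sketch (the type and member names are illustrative).

#include <cstddef>
#include <vector>

struct complex { double real = 0.0; double imag = 0.0; }; // as in the question

// Hypothetical b x b scratch tile: contiguous, so for b = 4 the whole tile
// (4 * 4 * 16 B = 256 B) spans only a few cache lines and stays resident
// while one block is being transposed.
struct tile
{
    std::size_t b;
    std::vector<complex> data;

    explicit tile(std::size_t b_) : b{b_}, data(b_ * b_) {}

    complex& operator()(std::size_t i, std::size_t j) { return data[i * b + j]; }
};

// e.g. tile intermediate_result{block_size};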
Number of cache-references (per b x b block): (read) b*b + (write) b*b
Number of cache-misses (per b x b block): (read) b*b / c + (write) b*b / c
Rate of misses: 25 %.
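As a quick sanity check of the hand counts above (a stand-alone snippet, not part of the benchmark), the two predicted rates come out as:

#include <iostream>

int main()
{
    constexpr double c = 4.0; // complex elements per 64-byte cache line

    // Naive kernel: reads miss once per line, strided writes miss every time.
    constexpr double naive = (1.0 / c + 1.0) / 2.0;       // 0.625 -> 62.5 %

    // Blocked kernel: per b x b block, reads and writes each miss once per line.
    constexpr double blocked = (1.0 / c + 1.0 / c) / 2.0; // 0.25  -> 25 %

    std::cout << naive * 100.0 << " %  vs  " << blocked * 100.0 << " %\n";
}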
Once again, cachegrind gives me the following report:
==21473== Cachegrind, a cache and branch-prediction profiler
==21473== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21473== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==21473== Command: ./benchmark/benchmarking_transpose_vslices_dir2_best 500 4
==21473==
--21473-- warning: L3 cache found, using its data for the LL simulation.
--21473-- warning: specified LL cache: line_size 64 assoc 12 total_size 9,437,184
--21473-- warning: simulated LL cache: line_size 64 assoc 18 total_size 9,437,184
==21473==
==21473== I refs: 157,135,137,350
==21473== I1 misses: 11,057
==21473== LLi misses: 9,604
==21473== I1 miss rate: 0.00%
==21473== LLi miss rate: 0.00%
==21473==
==21473== D refs: 43,995,141,079 (29,709,076,051 rd + 14,286,065,028 wr)
==21473== D1 misses: 3,307,834,114 ( 1,631,898,173 rd + 1,675,935,941 wr)
==21473== LLd misses: 3,301,066,570 ( 1,625,157,620 rd + 1,675,908,950 wr)
==21473== D1 miss rate: 7.5% ( 5.5% + 11.7% )
==21473== LLd miss rate: 7.5% ( 5.5% + 11.7% )
==21473==
==21473== LL refs: 3,307,845,171 ( 1,631,909,230 rd + 1,675,935,941 wr)
==21473== LL misses: 3,301,076,174 ( 1,625,167,224 rd + 1,675,908,950 wr)
==21473== LL miss rate: 1.6% ( 0.9% + 11.7% )
I cannot explain this discrepancy (a measured D1 miss rate of 7.5 % against the predicted 25 %) at this point, except to speculate that it might be due to prefetching.
Now, when I profile the same naive implementation using perf stat (with the -d option), I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_naive 500':
91.122,33 msec task-clock # 0,933 CPUs utilized
870.939 context-switches # 0,010 M/sec
17 cpu-migrations # 0,000 K/sec
50.807.083 page-faults # 0,558 M/sec
354.169.268.894 cycles # 3,887 GHz
217.031.159.494 instructions # 0,61 insn per cycle
34.980.334.095 branches # 383,883 M/sec
148.578.378 branch-misses # 0,42% of all branches
58.473.530.591 L1-dcache-loads # 641,704 M/sec
12.636.479.302 L1-dcache-load-misses # 21,61% of all L1-dcache hits
440.543.654 LLC-loads # 4,835 M/sec
276.733.102 LLC-load-misses # 62,82% of all LL-cache hits
97,705649040 seconds time elapsed
45,526653000 seconds user
47,295247000 seconds sys
When I do the same for the implementation with 2D-blocking, I get:
Performance counter stats for './benchmark/benchmarking_transpose_vslices_dir2_best 500 4':
79.865,16 msec task-clock # 0,932 CPUs utilized
766.200 context-switches # 0,010 M/sec
12 cpu-migrations # 0,000 K/sec
50.807.088 page-faults # 0,636 M/sec
310.452.015.452 cycles # 3,887 GHz
343.399.743.845 instructions # 1,11 insn per cycle
51.889.725.247 branches # 649,717 M/sec
133.541.902 branch-misses # 0,26% of all branches
81.279.037.114 L1-dcache-loads # 1017,703 M/sec
7.722.318.725 L1-dcache-load-misses # 9,50% of all L1-dcache hits
399.149.174 LLC-loads # 4,998 M/sec
123.134.807 LLC-load-misses # 30,85% of all LL-cache hits
85,660207381 seconds time elapsed
34,524170000 seconds user
46,884443000 seconds sys
Questions:
Why do cachegrind and perf report such different figures for L1D and LLC here?
Why are we seeing such a bad L3 cache-miss rate (according to perf) for the blocking algorithm? This is only exacerbated when I start using 6 cores.
Any tips on how to detect cache-thrashing will also be appreciated.
Thanks in advance for your time and help; I'm glad to provide additional information upon request.
Additional Info:
The processor used for testing here is the (Coffee Lake) Intel(R) Core(TM) i5-8400 CPU @ 2.80GHz.
CPU with 6 cores operating at 2.80 GHz - 4.00 GHz
L1 6x 32 KiB 8-way set associative (64 sets)
L2 6x 256 KiB 4-way set associative (1024 sets)
shared L3 9 MiB 12-way set associative (12288 sets)
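The set counts listed above are consistent with sets = size / (associativity * line size); as a quick check:

static_assert(32 * 1024 / (8 * 64) == 64, "L1d sets");
static_assert(256 * 1024 / (4 * 64) == 1024, "L2 sets");
static_assert(9 * 1024 * 1024 / (12 * 64) == 12288, "L3 sets");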