What is do_cos_slow.isra? - glibc

I wrote a simple program to try out the profiler.
#include <cmath>
#include <random>

double bar_compute(double d) {
    double t = std::abs(d);
    t += std::sqrt(d);
    t += std::cos(d);
    return t;
}

// Do some computation n times
double foo_compute(unsigned n) {
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> dist(0.0, 1.0);
    double total = 0;
    for (unsigned i = 0; i < n; i++) {
        double d = dist(mt);
        total += bar_compute(d);
    }
    return total;
}
When I run the profiler and view the output, I get:
56.14% runcode libm-2.23.so [.] __cos_avx
27.34% runcode runcode [.] _Z11foo_computej
13.92% runcode runcode [.] _Z11bar_computed
0.86% runcode libm-2.23.so [.] do_cos_slow.isra.1
0.44% runcode runcode [.] cos@plt
0.41% runcode libm-2.23.so [.] sloww1
0.35% runcode libm-2.23.so [.] __dubcos
0.17% runcode ld-2.23.so [.] _dl_lookup_symbol_x
What do do_cos_slow.isra and sloww1 mean?
Is there a faster version of cos that I can use? Otherwise why would it be called slow?

do_cos_slow comes from its declaration in glibc/sysdeps/ieee754/dbl-64/s_sin.c. It is called do_cos_slow because it is more precise than do_cos, the function it is based on, as per the comment above its declaration on line 164.
The .isra suffix is there because the function is a version that has been optimised by IPA SRA, as per the Stack Overflow answer to "What does the GCC function suffix “isra” mean?".
sloww1 is a function that computes sin(x+dx) as per the comment above it.
Regarding a faster version of cos: I am not sure whether there is one, but if you update your glibc (or whichever libc implementation provides your libm) to at least glibc 2.28, you will get the benefit of Wilco Dijkstra's removal of these slow-path functions and his refactor of dosincos, which gives a speed boost.
From the commit message:
Refactor the sincos implementation - rather than rely on odd partial inlining
of preprocessed portions from sin and cos, explicitly write out the cases.
This makes sincos much easier to maintain and provides an additional 16-20%
speedup between 0 and 2^27. The overall speedup of sincos is 48% over this range.
Between 0 and PI it is 66% faster.
Other alternatives you can try are other libc or libm implementations, or other cos implementations, including avx_mathfun, avx_mathfun with some fixes for newer GCC, or supersimd.


Formulas in perf stat

I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?
The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are probably calculated (and averaged with stddev) in perf_stat__print_shadow_stats(), with some stats collected into arrays in perf_stat__update_shadow_stats():
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
    total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
    if (total) {
        ratio = avg / total;
        print_metric(ctxp, NULL, "%7.2f ",
                     "insn per cycle", ratio);
    } else {
        print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
    }
Branch misses come from print_branch_misses, as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS.
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too, such as HW_CACHE_MISSES / HW_CACHE_REFERENCES and some more detailed ones (perf stat -d mode).
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two has remained unclear since it was asked on LKML in 2010 and on SO in 2014):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
update_stats(&runtime_nsecs_stats[cpu], count[0]);
There are also several formulas for transactions (perf stat -T mode).
"CPU utilized" is task-clock or cpu-clock divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself (in userspace, using a wall clock, i.e. astronomical time):
static inline unsigned long long rdclock(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}
...
static int __run_perf_stat(int argc, const char **argv)
{
...
    /*
     * Enable counters and exec the command:
     */
    t0 = rdclock();
    clock_gettime(CLOCK_MONOTONIC, &ref_time);
    if (forks) {
...
    }
...
    t1 = rdclock();
    update_stats(&walltime_nsecs_stats, t1 - t0);
...
}
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake, IDF2015; #22 in Gregg's Methodology List). They were described in 2016 by Andi Kleen in "Add top down metrics to perf stat", https://lwn.net/Articles/688335/ (the perf stat --topdown -I 1000 cmd mode).
And finally, if there is no exact formula for the event currently being printed, there is a universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 The event count is divided by the runtime in nsec (from the task-clock or cpu-clock events, if they were present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
    char unit = 'M';
    char unit_buf[10];

    total = avg_stats(&runtime_nsecs_stats[cpu]);

    if (total)
        ratio = 1000.0 * avg / total;
    if (ratio < 0.001) {
        ratio *= 1000;
        unit = 'K';
    }
    snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
    print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
}
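As a worked check of the formulas above, here is a small Python sketch that recomputes the derived figures from the raw counters in the question. The variable names are illustrative, not perf's. It assumes, as the source excerpts suggest, that the task-clock runtime is accumulated in nanoseconds.

```python
# Raw counter values copied from the perf stat output in the question.
task_clock_msec = 1080267.226401
cycles = 1_592_123_216_789
instructions = 871_190_006_655
cache_references = 3_697_548_810
cache_misses = 459_457_321

# Assumption: perf keeps the runtime in nanoseconds (runtime_nsecs_stats).
runtime_nsec = task_clock_msec * 1e6

# Cycles per nanosecond is the same number as GHz.
ghz = cycles / runtime_nsec
insn_per_cycle = instructions / cycles
# Generic fallback metric: 1000.0 * avg / total, i.e. millions of events per second.
cache_refs_m_per_sec = 1000.0 * cache_references / runtime_nsec
cache_miss_pct = 100.0 * cache_misses / cache_references

print(f"{ghz:.3f} GHz")
print(f"{insn_per_cycle:.2f} insn per cycle")
print(f"{cache_refs_m_per_sec:.3f} M/sec")
print(f"{cache_miss_pct:.3f} % of all cache refs")
```

The four printed values should match the 1.474 GHz, 0.55 insn per cycle, 3.423 M/sec and 12.426 % shown in the question's output.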

DirectX 11 Compute Shader device synchronization?

Background: performing benchmarking/comparison across GPGPU platforms.
Problem: Device synchronization when dispatching a DirectX 11 Compute Shader.
Looking for the equivalent of cudaDeviceSynchronize() or clFinish(...) to make a fair comparison of how my algorithm performs.
The CUDA and OpenCL functions are clearer about blocking/non-blocking behaviour. DirectCompute, however, is more closely tied to the graphics pipeline (which I am still learning and am very unfamiliar with), so I have trouble finding out whether a Dispatch call is blocking or whether previous memory allocations/transfers have finished.
Code DX_1:
// Setup
for (...) {
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
}
// Release
Code DX_2:
for (...) {
    // Setup
    startTimer();
    context->Dispatch(number_of_groups, 1, 1);
    times[i] = stopTimer();
    // Release
}
Results (average times of 2^2 to 2^11 elements):
1.6 205.5 24.8
1.8 133.4 24.8
29.1 186.5 25.6
18.6 175.0 25.6
11.4 187.5 26.6
85.2 127.7 26.3
166.4 151.1 28.1
98.2 149.5 35.2
26.8 203.5 31.6
Note: these timings were taken on a desktop GPU with a screen connected, so some erratic timings are expected. The times are not supposed to include host-to-device buffer transfers.
Note 2: These are very short sequences (4 - 2048 elements); the interesting tests are performed on problem sizes of up to 2^26 elements.
My new solution is to avoid synchronization with the device. I have looked into some methods of retrieving timestamps instead; the results look OK and I'm fairly sure the comparisons are fair enough. I compared my CUDA times (Event Record vs. QPC) and the difference is small, a seemingly constant overhead.
CUDA Event Host QPC
4,6 30,0
4,8 30,0
5,0 31,0
5,2 32,0
5,6 34,0
6,1 34,0
6,9 31,0
8,3 47,0
9,2 34,0
12,0 39,0
16,7 46,0
20,5 55,0
32,1 69,0
48,5 111,0
86,0 134,0
182,4 237,0
419,0 473,0
In case my question brings someone in hopes of finding how to do gpgpu benchmarking I will leave some code behind demonstrating my current benchmarking strategy.
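Code Examples, CUDA
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
// Launch my algorithm
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);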
Code Examples, OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
// Enqueue a dummy kernel for the start event.
clEnqueueNDRangeKernel(..., &start_event);
// Launch my algorithm
// Enqueue a dummy kernel for the end event.
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
// Profiling timestamps are in nanoseconds.
timeInMS = (double)(end - start)*(double)(1e-06);
Here I followed the suggestion from Adam Miles and looked into that source. It will look something like this:
ID3D11Device* device = nullptr;
// Setup
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
if (disjoint_query == NULL)
{
    D3D11_QUERY_DESC desc;
    desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
    desc.MiscFlags = 0;
    device->CreateQuery(&desc, &disjoint_query);
    desc.Query = D3D11_QUERY_TIMESTAMP;
    device->CreateQuery(&desc, &q_start);
    device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
// Launch my algorithm
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
// Host timer based on QueryPerformanceCounter (QPC)
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;
static void __inline startTimer()
{
    QueryPerformanceFrequency(&Frequency);
    QueryPerformanceCounter(&StartingTime);
}
static double __inline stopTimer()
{
    QueryPerformanceCounter(&EndingTime);
    ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
    ElapsedMicroseconds.QuadPart *= 1000000;
    ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
    return (double)ElapsedMicroseconds.QuadPart;
}
My code examples are taken out of context and I tried to do some clean-up but errors might be present.
If you're interested in how long a particular Draw or Dispatch is taking on the GPU then you should take a look at DirectX 11's Timestamp queries. You can query the GPU's clock frequency and current clock value before and after some GPU work and figure out how long that took in wall time.
This is probably a good primer / example on how to do it:

fast small angle sinus/cosinus approximation

I'm doing some rigid-body rotation dynamics simulation, which means I have to compute many rotations by a small angle; the performance bottleneck is the evaluation of the trigonometric functions. At the moment I do it with a Taylor (Maclaurin) series:
class double2 {
    double x, y;

    // Intrinsic full sin/cos
    final void rotate(double a) {
        double x_ = x;
        double ca = Math.cos(a); double sa = Math.sin(a);
        x = ca*x_ - sa*y; y = sa*x_ + ca*y;
    }

    // Taylor 7th-order approximation
    final void rotate_d7(double a) {
        double x_ = x;
        double a2 = a*a;
        double a4 = a2*a2;
        double a6 = a4*a2;
        double ca = 1.0d - a2/2.0d + a4/24.0d - a6/720.0d;
        double sa = a - a2*a/6.0d + a4*a/120.0d - a6*a/5040.0d;
        x = ca*x_ - sa*y; y = sa*x_ + ca*y;
    }
}
but the accuracy/speed trade-off is not as good as I would expect:
error(100x dphi=Pi/100 ) time [ns per rotation]
v.rotate_d1() : -0.010044860504615213 9.314306 ns/op
v.rotate_d3() : 3.2624666136960023E-6 16.268745 ns/op
v.rotate_d5() : -4.600003294941146E-10 35.433617 ns/op
v.rotate_d7() : 3.416711358283919E-14 49.831547 ns/op
v.rotate() : 3.469446951953614E-16 75.70213 ns/op
Is there any faster method to evaluate an approximation of sin() and cos() for small angles (like < Pi/100)?
I was thinking of maybe some rational series, or a continued fraction approximation. Do you know any? (A precomputed table doesn't make sense here.)
You might find that adjusting your calculations can improve performance. E.g.:
final double c7 = -1/5040d;
final double c5 = 1/120d;
final double c3 = -1/6d;
double a2 = a * a;
double sa = (((c7 * a2 + c5) * a2 + c3) * a2 + 1) * a;
// similarly for cos
Now the optimiser might be doing some of this itself anyway, so your mileage may vary. Would be interested to know the results either way.
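To make the "similarly for cos" part concrete, here is a minimal Python sketch of the same Horner-style evaluation for both polynomials. The function names are illustrative and not from the question. It only checks accuracy at the asker's step size and says nothing about JVM timings.

```python
import math

def sin_taylor7(a):
    """7th-order Taylor polynomial for sin(a), evaluated in Horner form."""
    a2 = a * a
    return (((-1.0 / 5040.0 * a2 + 1.0 / 120.0) * a2 - 1.0 / 6.0) * a2 + 1.0) * a

def cos_taylor6(a):
    """6th-order Taylor polynomial for cos(a), evaluated in Horner form."""
    a2 = a * a
    return ((-1.0 / 720.0 * a2 + 1.0 / 24.0) * a2 - 0.5) * a2 + 1.0

def rotate(x, y, a):
    """Rotate the vector (x, y) by the small angle a using the polynomials above."""
    sa, ca = sin_taylor7(a), cos_taylor6(a)
    return ca * x - sa * y, sa * x + ca * y

a = math.pi / 100
sin_err = abs(sin_taylor7(a) - math.sin(a))
cos_err = abs(cos_taylor6(a) - math.cos(a))
print(sin_err, cos_err)
```

Each polynomial needs only one squaring and a short chain of multiply-adds, so the separate a4 and a6 products are no longer computed.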
Instead of optimizing the trig functions, see if you can do without them. Rigid-body simulations tend to be a perfectly natural fit for vector math.
Two ways: reduce the precision if possible (as is often done in video games; use the minimal acceptable precision if you are aiming for performance),
then you should try to use tabulated values. Once per execution (when the game loads?) compute an array of sine/cosine values that you can then access in constant time.
float cosAlpha = COSINUS[(int)(k*alpha)]; // e.g: k = 1000
Tune k and the array size to choose angle resolution vs. memory footprint.
Edit: don't forget to use the parity of the sine/cosine functions to avoid duplicate values in the table.
Edit 2: try floats instead of doubles. The difference will be insignificant for the player, and the performance impact may be interesting. Test it!
Can you add some inline assembler? Targeting the i386 'fsincos' instruction is probably the fastest method:
Vector2 unit_vector ( Angle angle ) {
    Vector2 r;
    // now the normal processor detection
    // and various platform specific versions
# if defined (__i386__) && !defined (NO_ASM)
#  if defined __GNUC__
#   define ASM_SINCOS
    asm ("fsincos" : "=t" (r.x), "=u" (r.y) : "0" (angle.radians()));
#  elif defined _MSC_VER
#   define ASM_SINCOS
    double a = angle.radians();
    __asm fld a
    __asm fsincos
    __asm fstp r.x
    __asm fstp r.y
#  endif
# endif
    return r;
}
from here.
This has the added bonus of calculating both sin and cos in a single call.
EDIT : it's Java.
Are your rotations suitably self-contained that you can offload thousands at a time over JNI? Otherwise this hardware-specific approach is no good.
For small x (x<0.2 in radians) you can safely assume sin(x) = x.
The maximum deviation is 0.0013.
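The deviation figure is easy to check numerically. This short Python sketch (the helper name is illustrative) evaluates the error of sin(x) = x at the 0.2 rad bound and at the Pi/100 step size from the question.

```python
import math

def small_angle_error(x):
    """Absolute error of the approximation sin(x) ~ x."""
    return abs(x - math.sin(x))

# x - sin(x) grows with x on [0, 0.2], so the worst case is at the upper bound.
err_bound = small_angle_error(0.2)
err_step = small_angle_error(math.pi / 100)
print(err_bound)
print(err_step)
```

At 0.2 rad the error is about 0.0013, as stated. At Pi/100 it is only a few millionths per evaluation, though such errors can accumulate over many successive rotations.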

What does 'Samples' mean in perf output?

I used linux perf to profile my program and I can not understand the result.
10.5% 2 fun ..........
|- 80% - ABC
| call_ABC
-- 20% - DEF
The above example means that 'fun' has two samples and contributes 10.5% overheads,
and 80% of them are called from ABC and 20% from DEF. Am I right?
Now, we have only two samples, so how does 'perf' calculate the fractions for ABC and DEF?
Why aren't they 50% each? Does 'perf' use additional information?
The above example means that 'fun' has two samples and contributes 10.5% overheads,
Yes, this part of the perf report -g -n output shows that 2 of 19 samples (2 is 10.5% of 19) were in the fun function itself. The other 17 samples were taken in other functions.
I just reproduced your code with a recent gcc (-static -O3 -fno-inline -fno-omit-frame-pointer -g) and perf (perf record -e cycles:u -c 500000 -g ./test12968422 for low-resolution samples, or -c 5000 for high resolution). perf now has slightly different weighting rules, but the idea should be the same. When there are only 2 samples for the program and both are in fun, the call graph (perf report -n -g callee) shows 50% for each of call_DEF and call_ABC (no additional information is used), even though this program actually spent 86% of its runtime in fun: 61% when called from ABC and 25% when called from DEF:
100% 2 fun
- fun
+ 50% call_DEF
+ 50% call_ABC
What kind of additional information might perf use to reconstruct more detail? I think it could be either the self weight of call_DEF and call_ABC, or the frequency of the "call_ABC->fun" and "call_DEF->fun" parts of the call chain across all sample call stacks.
With perf from Linux kernel versions 4.4 / 4.10 I can't reproduce your situation. I added different amounts of self work to call_ABC and call_DEF; both of them just call fun for a fixed amount of work. Now I have 19 samples with -e cycles:u -c 54000 -g: 13 for call_ABC, 2 for call_DEF, 2 for fun (and 2 in some random functions):
Children Self Samples Symbol
74% 68% 13 [.] call_ABC
16% 10.5% 2 [.] call_DEF
10.5% 10.5% 2 [.] fun
- fun
+ 5.26% call_ABC
+ 5.26% call_DEF
So, try a newer version of perf, not one from the era of 3.2 Linux kernels.
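The percentages in the 19-sample run can be reproduced by simple counting. The Python sketch below is a simplified model, not perf's actual implementation. The call chains are hypothetical, shaped to match the sample counts quoted above, and assume that one fun sample came from each caller.

```python
from collections import Counter

# Hypothetical call chains (leaf first), one tuple per sample.
samples = (
    [("call_ABC", "main")] * 13
    + [("call_DEF", "main")] * 2
    + [("fun", "call_ABC", "main"), ("fun", "call_DEF", "main")]
    + [("other", "main")] * 2
)

total = len(samples)
self_counts = Counter(chain[0] for chain in samples)
fun_callers = Counter(chain[1] for chain in samples if chain[0] == "fun")

# Self overhead of fun as a share of all samples.
fun_self_pct = 100.0 * self_counts["fun"] / total
# Caller split expressed against all samples (the 5.26% figures).
caller_abs_pct = {c: 100.0 * n / total for c, n in fun_callers.items()}
# Caller split expressed against fun's own samples (the 50% figures).
caller_rel_pct = {c: 100.0 * n / self_counts["fun"] for c, n in fun_callers.items()}

print(f"fun self: {fun_self_pct:.1f}%")
print(caller_abs_pct)
print(caller_rel_pct)
```

With only two samples in fun, a caller's share can only be 0%, 50% or 100% of fun's samples under this model, which is why an 80/20 split cannot come from counting these two samples alone.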
First source: work only in fun, with unequal shares when called from ABC and from DEF:
#define g 100000
int a[2+g];
void fill_a(){
for(int f=0;f<g;f++)
int fun(int b)
return b;
int call_ABC(int b)
int d = b;
b = fun(d);
return d-b;
int call_DEF(int b)
int e = b;
b = fun(e);
return e+b;
int main()
int c,d;
return c+d;
Second source: unequal work in ABC and DEF, with equal small work in fun:
#define g 100000
int a[2+g];
void fill_a(){
for(int f=0;f<g;f++)
int fun(int b)
return b;
int call_ABC(int b)
int d = b;
b = fun(5000);
return d-b;
int call_DEF(int b)
int e = b;
b = fun(5000);
return e+b;
int main()
int c,d;
return c+d;

Performance problem with Euler problem and recursion on Int64 types

I'm currently learning Haskell using the project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar
programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation, and I make sure that the updates of the accumulation variable keeping track of the arc length are strict. Yet it takes almost one and a half minutes to complete (compiled with the -O flag with ghc).
import Data.Int

arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
    public static void main(String args[]) {
        long n = 1 << 30;
        long x = 0;
        long y = n-1;
        long acc = 0;
        long norm2 = 0;
        long time = System.currentTimeMillis();
        while(x <= y) {
            if (norm2 < 0) {
                norm2 += 2*x + 1;
                x++;
            } else if (norm2 > 2*(n-1)) {
                norm2 += 2 - 2*(x+y);
                x--;
                y--;
            } else {
                norm2 += 2*x + 1;
                x++;
                acc++;
            }
        }
        time = System.currentTimeMillis() - time;
        System.out.println(acc);
        System.out.println(time);
    }
}
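For checking the control flow by hand on a tiny grid, here is a Python transcription of the same loop. It is a sketch that assumes the three branches update x, y and acc exactly as the guards in the Haskell version do. Python integers are arbitrary precision, so the overflow question does not arise here.

```python
def arc_length(n):
    """Iterative version of arcLength, following the Haskell guards."""
    x, y, norm2, acc = 0, n - 1, 0, 0
    while x <= y:
        if norm2 < 0:
            norm2 += 2 * x + 1
            x += 1
        elif norm2 > 2 * (n - 1):
            norm2 += 2 - 2 * (x + y)
            x -= 1
            y -= 1
        else:
            norm2 += 2 * x + 1
            x += 1
            acc += 1
    return acc

print(arc_length(4))  # prints 3
```

Tracing n = 4 by hand gives three steps through the last branch, one through the middle branch and one through the first before x exceeds y, hence the result 3.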
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different versions: with Int64 or Int, and with bangs before either norm2 or both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All were compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on the Int64 gets turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3
% time ./arc
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know which compiler version you are using.
There's a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
    where
        go :: Int -> Int -> Int -> Int -> Int
        go x y acc norm2
            | x <= y = case () of { _
                | norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
                | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
                | otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
              }
            | otherwise = acc

main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
(+# sc3_sQd 1);
True ->
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
(-# sc2_sQc 1)
(-# sc3_sQd 1)
True ->
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And it performs just fine (Linux/x86-64/GHC 7.0.3):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
cmpq %rdi, %r8
jg .L8
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
movq %rsi, %rbx
jmp *0(%rbp)
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"
I've played with the code a little, and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' :: Int -> Int -> Int -> Int -> Int
    arcLength' !x !y !norm2 !acc
        | x > y = acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
        | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
        | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
real 0m3.553s
user 0m3.539s
sys 0m0.006s
