What does 'Samples' mean in perf output?

I used Linux perf to profile my program and I cannot understand the result.
10.5% 2 fun ..........
|- 80% - ABC
| call_ABC
-- 20% - DEF
The above example means that 'fun' has two samples and contributes 10.5% overheads,
and 80% of those come from calls by ABC and 20% from calls by DEF. Am I right?
With only two samples, how does 'perf' calculate the fractions for ABC and DEF?
Why aren't they 50% each? Does 'perf' use additional information?

The above example means that 'fun' has two samples and contributes 10.5% overheads,
Yes, this part of the perf report -g -n output shows that 2 of 19 samples (2 is 10.5% of 19) were in the fun function itself. The other 17 samples were taken in other functions.
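As a quick check of that arithmetic, here is a minimal sketch. It assumes every sample carries the same weight, which is the case when a fixed period is requested with -c. The helper name overhead_percent is made up for illustration and is not part of perf.

```python
def overhead_percent(samples_in_symbol, total_samples):
    """Share of all samples attributed to one symbol, as a percentage."""
    return 100.0 * samples_in_symbol / total_samples

# 2 of the 19 samples landed in fun itself.
print(round(overhead_percent(2, 19), 1))  # 10.5
```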
I just reproduced your code with a recent gcc (-static -O3 -fno-inline -fno-omit-frame-pointer -g) and perf (perf record -e cycles:u -c 500000 -g ./test12968422 for low-resolution samples, or -c 5000 for high resolution). perf now has slightly different weighting rules, but the idea should be the same. When the program has only 2 samples and both are in fun, the call graph (perf report -n -g callee) shows 50% for each of call_DEF and call_ABC (no additional information is used). This program actually spent 86% of its runtime in fun: 61% when called from ABC and 25% when called from DEF (61% + 25% = 86%):
100% 2 fun
- fun
+ 50% call_DEF
+ 50% call_ABC
What kind of additional information might perf use to reconstruct more detail? I think it could be either the self weight of call_DEF and call_ABC, or the frequency of the "call_ABC->fun" and "call_DEF->fun" parts of the callchain across all sampled call stacks.
With perf from Linux kernel versions 4.4 / 4.10 I can't reproduce your situation. I added different amounts of self work to call_ABC and call_DEF; both of them just call fun for a fixed amount of work. Now I have 19 samples with -e cycles:u -c 54000 -g: 13 for call_ABC, 2 for call_DEF, 2 for fun (and 2 in some random functions):
Children Self Samples Symbol
74% 68% 13 [.] call_ABC
16% 10.5% 2 [.] call_DEF
10.5% 10.5% 2 [.] fun
- fun
+ 5.26% call_ABC
+ 5.26% call_DEF
So try a newer version of perf, not one from the era of 3.2 Linux kernels.
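One way to read the 5.26% figures in that report: each caller of fun is credited with the samples whose call chain passes through it, divided by the total sample count. The sketch below is only my reading of the numbers, not a description of perf's internals. The sample data is hypothetical, shaped like the 19-sample run, and assumes the two fun samples came from one caller each.

```python
from collections import Counter

def caller_shares(callchains, callee):
    """Percent of all samples per immediate caller of `callee`.

    Each chain lists frames innermost first, e.g. ["fun", "call_ABC", "main"].
    """
    total = len(callchains)
    callers = Counter(
        chain[1] for chain in callchains
        if len(chain) > 1 and chain[0] == callee
    )
    return {name: 100.0 * n / total for name, n in callers.items()}

# Hypothetical chains: 13 samples in call_ABC, 2 in call_DEF,
# 2 elsewhere, and 2 in fun (one reached through each caller).
chains = (
    [["call_ABC", "main"]] * 13
    + [["call_DEF", "main"]] * 2
    + [["other", "main"]] * 2
    + [["fun", "call_ABC", "main"], ["fun", "call_DEF", "main"]]
)
shares = caller_shares(chains, "fun")
print({name: round(pct, 2) for name, pct in shares.items()})
```

With one sample per caller, each share is 1/19 of the total, which matches the 5.26% lines under fun.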
First source, where only fun does work, with unequal shares when called from ABC and from DEF:
#define g 100000
int a[2+g];
void fill_a(){
for(int f=0;f<g;f++)
int fun(int b)
return b;
int call_ABC(int b)
int d = b;
b = fun(d);
return d-b;
int call_DEF(int b)
int e = b;
b = fun(e);
return e+b;
int main()
int c,d;
return c+d;
Second source, with unequal work in ABC and DEF and an equal, small amount of work in fun:
#define g 100000
int a[2+g];
void fill_a(){
for(int f=0;f<g;f++)
int fun(int b)
return b;
int call_ABC(int b)
int d = b;
b = fun(5000);
return d-b;
int call_DEF(int b)
int e = b;
b = fun(5000);
return e+b;
int main()
int c,d;
return c+d;


Haskell program runs very slow

I wrote my first program calculating prime numbers. However, it runs really slowly and I can't figure out why. I wrote similar code in Java, and for n = 10000 the Java program takes practically no time, while the Haskell program takes about 2 minutes.
import Data.List
main = do
    print "HowManyPrimes? - OnlyInteger"
    inputNumber <- getLine
    let x = (read inputNumber :: Int)
    print (firstNPrimes x)
-- prime - algorithm
primeNumber:: Int -> Bool
primeNumber 2 = True
primeNumber x = primNumberRec x (div x 2)
primNumberRec:: Int -> Int -> Bool
primNumberRec x y
    |y == 0 = False
    |y == 1 = True
    |mod x y == 0 = False
    |otherwise = primNumberRec x (y-1)
-- prime numbers till n
primesTillN:: Int -> [Int]
primesTillN n = 2:[ x | x <- [3,5..n], primeNumber x ]
firstNPrimes:: Int -> [Int]
firstNPrimes 0 = []
firstNPrimes n = 2: take (n-1) [x|x <- [3,5..], primeNumber x]
Thanks in advance.
Similar java code:
import java.util.Scanner;
public class PrimeNumbers{
static Scanner scan = new Scanner(System.in);
public boolean primeAlgorithm(int x){
if (x < 2)
return false;
return primeAlgorithm(x, (int)Math.sqrt(x));
public boolean primeAlgorithm(int x, int divider){
if (divider == 1)
return true;
if (x%divider == 0)
return false;
return primeAlgorithm(x, divider-1);
public static void main(String[] args){
PrimeNumbers p = new PrimeNumbers();
int howManyPrimes = scan.nextInt();
int number = 3;
System.out.print(number+" ");
When doing timing measurements, always compile; ghci is designed for a fast change-rebuild-run loop, not for speedy execution of the produced code. However, even after following this advice there is a huge timing difference between your two snippets.
The key difference between your Java and Haskell versions is that the Java code starts testing divisors at the square root, while the Haskell code starts at x divided by 2. Your originals, on my machine:
% javac Test.java && echo 10000 | /usr/bin/time java Test >/dev/null
0.21user 0.02system 0:00.13elapsed 186%CPU (0avgtext+0avgdata 38584maxresident)k
0inputs+0outputs (0major+5823minor)pagefaults 0swaps
% ghc -O2 test && echo 10000 | /usr/bin/time ./test >/dev/null
8.85user 0.00system 0:08.87elapsed 99%CPU (0avgtext+0avgdata 4668maxresident)k
0inputs+0outputs (0major+430minor)pagefaults 0swaps
So 0.2s for Java, 8.9s for Haskell. After switching to using square root with the following change:
- primeNumber x = primNumberRec x (div x 2)
+ primeNumber x = primNumberRec x (ceiling (sqrt (fromIntegral x)))
I get the following timing for the Haskell:
% ghc -O2 test && echo 10000 | /usr/bin/time ./test >/dev/null
0.07user 0.00system 0:00.07elapsed 98%CPU (0avgtext+0avgdata 4560maxresident)k
0inputs+0outputs (0major+427minor)pagefaults 0swaps
That is now 3x faster than the Java code (0.07s versus 0.21s of user time). (And of course there are significantly better algorithms that will make it faster still.)
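To see why the starting divisor matters so much, the following sketch counts the modulo tests performed with each bound. It mirrors the countdown in primNumberRec, but the function names and the choice of 500 primes are assumptions made for this illustration, and it measures the number of tests, not wall-clock time.

```python
import math

def is_prime_counting(x, start):
    """Mirror primNumberRec: try divisors start, start-1, ..., 2.

    Returns (is_prime, number_of_modulo_tests).
    """
    if start == 0:
        return False, 0
    tests = 0
    y = start
    while y > 1:
        tests += 1
        if x % y == 0:
            return False, tests
        y -= 1
    return True, tests

def first_n_primes(n, bound):
    """Collect the first n primes; bound(x) picks the first divisor to try."""
    if n == 0:
        return [], 0
    primes, total_tests = [2], 0
    x = 3
    while len(primes) < n:
        ok, tests = is_prime_counting(x, bound(x))
        total_tests += tests
        if ok:
            primes.append(x)
        x += 2
    return primes, total_tests

def half(x):
    return x // 2                      # like div x 2

def root(x):
    return math.ceil(math.sqrt(x))     # like ceiling (sqrt (fromIntegral x))

primes_half, tests_half = first_n_primes(500, half)
primes_root, tests_root = first_n_primes(500, root)
print(primes_half == primes_root, tests_half, tests_root)
```

Both bounds yield the same primes, but every prime x costs about x/2 tests with the first bound and only about sqrt(x) with the second, so the gap widens as the candidates grow.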
Compile it!
Haskell code in GHCi is far from optimised; try to compile it into a binary with ghc -o prime prime.hs or even better use -O2 optimisation. I had a script once that took 5min in GHCi but mere seconds once compiled.

Formulas in perf stat

I am wondering about the formulas used in perf stat to calculate figures from the raw data.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses ./myapp
1080267.226401 task-clock (msec) # 19.062 CPUs utilized
1,592,123,216,789 cycles # 1.474 GHz (50.00%)
871,190,006,655 instructions # 0.55 insn per cycle (75.00%)
3,697,548,810 cache-references # 3.423 M/sec (75.00%)
459,457,321 cache-misses # 12.426 % of all cache refs (75.00%)
In this context, how do you calculate M/sec from cache-references?
The formulas do not seem to be implemented in builtin-stat.c (where the default event sets for perf stat are defined); they are probably calculated (and averaged, with stddev) in perf_stat__print_shadow_stats(), with some stats collected into arrays in perf_stat__update_shadow_stats():
When HW_INSTRUCTIONS is counted:
"Instructions per clock" = HW_INSTRUCTIONS / HW_CPU_CYCLES; "stalled cycles per instruction" = HW_STALLED_CYCLES_FRONTEND / HW_INSTRUCTIONS
if (perf_evsel__match(evsel, HARDWARE, HW_INSTRUCTIONS)) {
total = avg_stats(&runtime_cycles_stats[ctx][cpu]);
if (total) {
ratio = avg / total;
print_metric(ctxp, NULL, "%7.2f ",
"insn per cycle", ratio);
} else {
print_metric(ctxp, NULL, NULL, "insn per cycle", 0);
Branch misses are from print_branch_misses as HW_BRANCH_MISSES / HW_BRANCH_INSTRUCTIONS
There are several cache miss ratio calculations in perf_stat__print_shadow_stats() too like HW_CACHE_MISSES / HW_CACHE_REFERENCES and some more detailed (perf stat -d mode).
GHz is computed as HW_CPU_CYCLES / runtime_nsecs_stats, where runtime_nsecs_stats is updated from either of the software events task-clock or cpu-clock (SW_TASK_CLOCK and SW_CPU_CLOCK; the exact difference between the two has remained unanswered since it was asked on LKML in 2010 and on SO in 2014):
if (perf_evsel__match(counter, SOFTWARE, SW_TASK_CLOCK) ||
perf_evsel__match(counter, SOFTWARE, SW_CPU_CLOCK))
update_stats(&runtime_nsecs_stats[cpu], count[0]);
There are also several formulas for transactions (perf stat -T mode).
"CPUs utilized" is task-clock or cpu-clock divided by walltime_nsecs_stats, where the wall time is measured by perf stat itself in userspace (wall-clock time, read with clock_gettime):
static inline unsigned long long rdclock(void)
struct timespec ts;
clock_gettime(CLOCK_MONOTONIC, &ts);
return ts.tv_sec * 1000000000ULL + ts.tv_nsec;
static int __run_perf_stat(int argc, const char **argv)
* Enable counters and exec the command:
t0 = rdclock();
clock_gettime(CLOCK_MONOTONIC, &ref_time);
if (forks) {
t1 = rdclock();
update_stats(&walltime_nsecs_stats, t1 - t0);
There are also some estimations from the Top-Down methodology (Tuning Applications Using a Top-down Microarchitecture Analysis Method; Software Optimizations Become Simple with Top-Down Analysis .. Name Skylake, IDF2015; #22 in Gregg's Methodology List), described in 2016 by Andi Kleen in "Add top down metrics to perf stat", https://lwn.net/Articles/688335/ (perf stat --topdown -I 1000 cmd mode).
And finally, if there is no specific formula for the event currently being printed, there is the universal "%c/sec" (K/sec or M/sec) metric: http://elixir.free-electrons.com/linux/v4.13.4/source/tools/perf/util/stat-shadow.c#L845 It is the event count divided by the runtime in nanoseconds (taken from the task-clock or cpu-clock events, if they were present in the perf stat event set):
} else if (runtime_nsecs_stats[cpu].n != 0) {
char unit = 'M';
char unit_buf[10];
total = avg_stats(&runtime_nsecs_stats[cpu]);
if (total)
ratio = 1000.0 * avg / total;
if (ratio < 0.001) {
ratio *= 1000;
unit = 'K';
snprintf(unit_buf, sizeof(unit_buf), "%c/sec", unit);
print_metric(ctxp, NULL, "%8.3f", unit_buf, ratio);
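Applying these ratios to the raw counts from the question reproduces the figures in the right-hand column. This is a plain re-computation under the assumption that the task-clock value is the runtime used as the divisor; it is not code from perf, and the variable names are mine.

```python
# Raw counts copied from the perf stat output in the question.
task_clock_msec = 1080267.226401
cycles = 1_592_123_216_789
instructions = 871_190_006_655
cache_references = 3_697_548_810
cache_misses = 459_457_321

runtime_nsec = task_clock_msec * 1e6  # perf keeps the runtime in nanoseconds

ghz = cycles / runtime_nsec                           # cycles per ns == GHz
insn_per_cycle = instructions / cycles
miss_percent = 100.0 * cache_misses / cache_references
m_per_sec = 1000.0 * cache_references / runtime_nsec  # events per us == M/sec

print(f"{ghz:.3f} GHz")                       # 1.474 GHz
print(f"{insn_per_cycle:.2f} insn per cycle") # 0.55 insn per cycle
print(f"{m_per_sec:.3f} M/sec")               # 3.423 M/sec
print(f"{miss_percent:.3f} % of all cache refs")  # 12.426 % of all cache refs
```

So the M/sec value for cache-references is simply the count divided by the task-clock runtime: multiplying events per nanosecond by 1000 gives events per microsecond, which is the same as millions per second.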

Why is the Rust random number generator slower with multiple instances running?

I am doing some random number generation for my lotto simulation and was wondering why it is MUCH slower when multiple instances are running.
I am running this program under Ubuntu 15.04 (linux kernel 4.2). rustc 1.7.0-nightly (d5e229057 2016-01-04)
Overall CPU utilization is about 45% during these tests, but each individual process is using 100% of the hardware thread it runs on.
Here is the script I am using to start multiple instances at the same time.
#!/usr/bin/env bash
pkill lotto_sim
for _ in `seq 1 14`;
do
    ./lotto_sim 15000000 1>> /var/log/syslog &
done
Took PT38.701900316S seconds to generate 15000000 random tickets
Took PT39.193917241S seconds to generate 15000000 random tickets
Took PT39.412279484S seconds to generate 15000000 random tickets
Took PT39.492940352S seconds to generate 15000000 random tickets
Took PT39.715433024S seconds to generate 15000000 random tickets
Took PT39.726609237S seconds to generate 15000000 random tickets
Took PT39.884151996S seconds to generate 15000000 random tickets
Took PT40.025874106S seconds to generate 15000000 random tickets
Took PT40.088332517S seconds to generate 15000000 random tickets
Took PT40.112601899S seconds to generate 15000000 random tickets
Took PT40.205958636S seconds to generate 15000000 random tickets
Took PT40.227956170S seconds to generate 15000000 random tickets
Took PT40.393753486S seconds to generate 15000000 random tickets
Took PT40.465173616S seconds to generate 15000000 random tickets
However, a single run gives this output:
$ ./lotto_sim 15000000
Took PT9.860698141S seconds to generate 15000000 random tickets
My understanding is that each process has its own memory and doesn't share anything. Correct?
Here is the relevant code:
extern crate rand;
extern crate itertools;
extern crate time;
use std::env;
use rand::{Rng, Rand};
use itertools::Itertools;
use time::PreciseTime;
struct Ticket {
whites: Vec<u8>,
power_ball: u8,
is_power_play: bool,
const WHITE_MIN: u8 = 1;
const WHITE_MAX: u8 = 69;
const POWER_BALL_MIN: u8 = 1;
const POWER_BALL_MAX: u8 = 26;
impl Rand for Ticket {
fn rand<R: Rng>(rng: &mut R) -> Self {
let pp_guess = rng.gen_range(0, 100);
let pp_value = pp_guess < POWER_PLAY_PERCENTAGE;
let mut whites_vec: Vec<_> = (0..).map(|_| rng.gen_range(WHITE_MIN, WHITE_MAX + 1))
let pb_value = rng.gen_range(POWER_BALL_MIN, POWER_BALL_MAX + 1);
Ticket { whites: whites_vec, power_ball: pb_value, is_power_play: pp_value}
fn gen_test(num_tickets: i64) {
let mut rng = rand::thread_rng();
let _: Vec<_> = rng.gen_iter::<Ticket>()
.take(num_tickets as usize)
fn main() {
let args: Vec<_> = env::args().collect();
let num_tickets: i64 = args[1].parse::<i64>().unwrap();
let start = PreciseTime::now();
let end = PreciseTime::now();
println!("Took {} seconds to generate {} random tickets", start.to(end), num_tickets);
Maybe a better question would be how do I debug and figure this out? Where would I look within the program or within my OS to find the performance hindrances? I am new to Rust and lower level programming like this that relies so heavily on the OS.

Scheduling Algorithm with limitations

Thanks to user3125280, D.W. and Evgeny Kluev the question is updated.
I have a list of webpages that I must download regularly, and each webpage has a different download frequency. Based on this frequency we sort the webpages into 5 groups:
Items in group 1 are downloaded once per 1 hour
items in group 2 once per 2 hours
items in group 3 once per 4 hours
items in group 4 once per 12 hours
items in group 5 once per 24 hours
This means, we must download all the group 1 webpages in 1 hour, all the group 2 in 2 hours etc.
I am trying to make an algorithm. As input, I have:
a) DATA_ARR = one array with 5 numbers. Each number represents the number of items in this group.
b) TIME_ARR = one array with 5 numbers (1, 2, 4, 12, 24) representing how often the items will be downloaded.
c) X = the total number of webpages to download per hour. This is calculated by summing items_in_group/download_frequency over all groups and rounding up. If we have 15 items in group 5 and 3 items in group 4, this will be 15/24 + 3/12 = 0.875, which rounds up to 1.
Every hour my program must download at most X sites. I expect the algorithm to output something like:
Hour 1: A1 B0 C4 D5
Hour 2: A2 B1 C2 D2
A1 = 2nd item of 1st group
C0 = 1st item of 3rd group
My algorithm must be as efficient as possible. This means:
a) the pattern must be extendable to at least 200+ hours
b) no need to create a repeatable pattern
c) spaces are needed when possible in order to use the absolute minimum bandwidth
d) never ever download an item more often than the update frequency, no exceptions
group 1: 0 items | once per 1 hour
group 2: 3 items | once per 2 hours
group 3: 4 items | once per 4 hours
group 4: 0 items | once per 12 hours
group 5: 0 items | once per 24 hours
We calculate the number of items we can take per hour: 3/2+4/4 = 2.5. We round this upwards and it's 3.
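The calculation of X can be written down in a few lines. This is a sketch of the formula as stated in the question, using exact fractions to avoid rounding surprises. The function name downloads_per_hour is an assumption of mine.

```python
from fractions import Fraction
from math import ceil

TIME_ARR = [1, 2, 4, 12, 24]  # hours between downloads, per group

def downloads_per_hour(data_arr, time_arr=TIME_ARR):
    """X = ceil(sum(items_in_group / period)), computed exactly."""
    load = sum(Fraction(items, period)
               for items, period in zip(data_arr, time_arr))
    return ceil(load)

print(downloads_per_hour([0, 3, 4, 0, 0]))   # 3/2 + 4/4 = 2.5     -> 3
print(downloads_per_hour([0, 0, 0, 3, 15]))  # 3/12 + 15/24 = 0.875 -> 1
```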
Using pencil and paper, we can find the following solution:
Hour 1: B0 C0 B1
Hour 2: B2 C1 C2
Hour 3: B0 C3 B1
Hour 4: B2
Hour 5: B0 C0 B1
Hour 6: B2 C1 C2
Hour 7: B0 C3 B1
Hour 8: B2
Hour 9: B0 C0 B1
Hour 10: B2 C1 C2
Hour 11: B0 C3 B1
Hour 12: B2
Hour 13: B0 C0 B1
Hour 14: B2 C1 C2
and continue the above.
We take C0, C1, C2 and C3 once every 4 hours. We also take B0, B1 and B2 once every 2 hours.
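The hand-made schedule can be checked mechanically against the two constraints it is meant to satisfy: at most X downloads per hour, and every item repeated exactly at its group's interval. The checker below is a sketch for this example only. The names CYCLE, PERIOD and check are assumptions, and hours are counted from 0 rather than 1.

```python
from collections import defaultdict

# One 4-hour cycle of the hand-made schedule; it repeats after hour 4.
CYCLE = [["B0", "C0", "B1"], ["B2", "C1", "C2"], ["B0", "C3", "B1"], ["B2"]]
PERIOD = {"B": 2, "C": 4}  # hours between downloads, per group letter
X = 3                      # maximum downloads per hour

def check(cycle, hours):
    """True if no hour exceeds X and every item recurs exactly at its period."""
    seen = defaultdict(list)
    for hour in range(hours):
        slot = cycle[hour % len(cycle)]
        if len(slot) > X:
            return False
        for item in slot:
            seen[item].append(hour)
    for item, times in seen.items():
        gaps = {later - earlier for earlier, later in zip(times, times[1:])}
        if gaps and gaps != {PERIOD[item[0]]}:
            return False
    return True

print(check(CYCLE, 200))  # True
```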
Question: Please explain how to design an algorithm that downloads the items while using the absolute minimum number of downloads. Brute force is NOT a solution, and the algorithm must be CPU-efficient because the number of elements can be huge.
You may read the answer posted here: https://cs.stackexchange.com/a/19422/12497 as well as the answer posted below by user3125280.
Your problem is a typical scheduling problem. These kinds of problems are well studied in computer science, so there is a huge body of literature to consult.
The code is kind of like Deficit round robin, but with a few simplifications. First, we feed the queues ourselves by adding to the data_to_process variable. Secondly, the queues just iterate through a list of values.
One difference is that this solution will get the optimal value you want, barring mathematical error.
Rough sketch: have not compiled (c++11) unix based, to spec code
#include <iostream>
#include <vector>
#include <numeric>
#include <unistd.h>
//#include <cmath> //for ceil
#define TIME_SCALE ((double)60.0) //1 for realtime speed
//Assuming you are not refreshing ints in the real case
template<typename T>
struct queue
{
    const std::vector<T> data; //this will be filled with numbers
    int position;
    double refresh_rate; //must be refreshed every ~ hours
    double data_rate; //this many refreshes per hour
    double credit; //amount of refreshes owed
    queue(std::initializer_list<T> v, int r ) :
        data(v), position(0), refresh_rate(r), credit(0) {
        data_rate = data.size() / (double) refresh_rate;
    }
    int getNext() {
        return data[position++ % data.size()];
    }
};
double time_passed(){
    static double total;
    //if(total < 20){ //stop early
    usleep(60000000 / TIME_SCALE); //sleep for a minute
    total += 1.0 / 60.0; //add a minute
    std::cout << "Time: " << total << std::endl;
    return 1.0; //change to 1.0 / 60.0 for real time speed
    //} else return 0;
}
int main()
{
    //keep a list of the queues
    std::vector<queue<int> > queues{
        {{1, 2, 3}, 2},
        {{1, 2, 3, 4}, 3}};
    double total_data_rate = 0;
    for(const auto& q : queues) total_data_rate += q.data_rate;
    double data_to_process = 0; //how many refreshes we have to do
    int queue_number = 0; //which queue we are processing
    auto current_queue = &queues[0];
    while(1) {
        data_to_process += time_passed() * total_data_rate;
        //data_to_process = ceil(data_to_process) //optional
        while(data_to_process >= 1){
            //data_to_process >= 0 will make the scheduler more
            //eager in the first time period (i.e. everything will be updated
            //correctly in the first period and the following periods)
            if(current_queue->credit >= 1){
                //don't change here though, since credit determines the weighting only,
                //not how many refreshes are made
                std::cout << "From queue " << queue_number << " refreshed " <<
                    current_queue->getNext() << std::endl;
                current_queue->credit -= 1;
                data_to_process -= 1;
            } else {
                queue_number = (queue_number + 1) % queues.size();
                current_queue = &queues[queue_number];
                current_queue->credit += current_queue->data_rate;
            }
        }
    }
    return 0;
}
The example should now compile on gcc with --std=c++11 and give you what you want.
and here is test case output: (for non-time scaled earlier code)
Time: 0
From queue 1 refreshed 1
From queue 0 refreshed 1
From queue 1 refreshed 2
Time: 1
From queue 0 refreshed 2
From queue 0 refreshed 3
From queue 1 refreshed 3
Time: 2
From queue 0 refreshed 1
From queue 1 refreshed 4
From queue 1 refreshed 1
Time: 3
From queue 0 refreshed 2
From queue 0 refreshed 3
From queue 1 refreshed 2
Time: 4
From queue 0 refreshed 1
From queue 1 refreshed 3
From queue 0 refreshed 2
Time: 5
From queue 0 refreshed 3
From queue 1 refreshed 4
From queue 1 refreshed 1
As an extension, the repeating-pattern problem can be answered by letting this scheduler complete only the first lcm(update_rate * lcm(...refresh rates...), ceil(update_rate)) steps and then repeating that pattern.
ALSO: this will, indeed, be unsolvable sometimes because of the requirement on hour boundaries. When I use your unsolvable example, and modify time_passed to return 0.1, the schedule is solved with updates every 1.1 hours (just not at the hour boundaries!).
It seems your constraints are all over the place. To quickly summarise my other answer:
It meets the refresh rates only on average
It does the least number of downloads at hour intervals required to fulfil the above
It was based on these (sometimes unfulfillable) constraints:
1. Update at discrete, 1 hour intervals
2. Update the fewest items each time
3. Update each item at fixed intervals
and broke constraint 3.
Since both the hourly interval and least-each-time constraints are not really necessary, I will give a simpler, better answer here, which breaks 2.
#include <iostream>
#include <vector>
#include <numeric>
#include <unistd.h>
#define TIME_SCALE ((double)60.0)
//Assuming you are not refreshing ints in the real case
template<typename T>
struct queue
{
    const std::vector<T> data; //this is the data to refresh
    int position; //this is the data we are up to
    double refresh_rate; //must be refreshed every this many hours
    double data_rate; //this many refreshes per hour
    double credit; //is owed this many refreshes
    const char* name;//a name for each queue
    queue(std::initializer_list<T> v, int r, const char* n ) :
        data(v), position(0), refresh_rate(r), credit(0), name(n) {
        data_rate = data.size() / (double) refresh_rate;
    }
    void refresh() {
        std::cout << "From queue " << name << " refreshed " << data[position++ % data.size()] << "\n";
    }
};
double time_passed(){
    static double total;
    usleep(60000000 / TIME_SCALE); //sleep for a minute
    total += 1.0; //add a minute
    std::cout << "Time: " << total << std::endl;
    return 1.0; //change to 1.0 / 60.0 for real time speed
}
int main()
{
    //keep a list of the queues
    std::vector<queue<int> > queues{
        {{1}, 1, "A"},
        {{1}, 2, "B"}};
    while(1) {
        auto t = time_passed();
        for(queue<int>& q : queues) {
            q.credit += q.data_rate * t;
            while(q.credit >= 1){
                q.credit -= 1.0;
                q.refresh(); //assumed: this call appears to be missing from the listing
            }
        }
    }
    return 0;
}
It has the potential, however, to schedule many refreshes on the same hour. There is a third option as well, which breaks the hour-interval rule and updates only one at a time.
I think this is the easiest and requires the minimal number of updates (like the previous answer) but doesn't break rule 3.
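Returning to the credit-based listing above, its behaviour can be simulated without sleeping. The sketch assumes that each whole unit of credit triggers exactly one refresh, which is my reading of the loop. It uses exact fractions, and the schedule function with its "name:item" output format is made up for this illustration.

```python
from fractions import Fraction

def schedule(queues, hours):
    """Simulate the credit rule hour by hour.

    `queues` maps a name to (items, period_in_hours). Each hour a queue
    earns len(items)/period credit and refreshes one item per whole unit
    of credit it holds. Returns one list of "name:item" strings per hour.
    """
    credit = {name: Fraction(0) for name in queues}
    position = {name: 0 for name in queues}
    timeline = []
    for _ in range(hours):
        this_hour = []
        for name, (items, period) in queues.items():
            credit[name] += Fraction(len(items), period)
            while credit[name] >= 1:
                item = items[position[name] % len(items)]
                position[name] += 1
                credit[name] -= 1
                this_hour.append(f"{name}:{item}")
        timeline.append(this_hour)
    return timeline

# Same two queues as in the listing: A every hour, B every 2 hours.
plan = schedule({"A": ([1], 1), "B": ([1], 2)}, 4)
for hour, refreshed in enumerate(plan, start=1):
    print(hour, refreshed)
```

Queue B accumulates half a credit per hour, so it is refreshed on every second hour together with A. This is the situation where several refreshes land in the same hour.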

Performance problem with Euler problem and recursion on Int64 types

I'm currently learning Haskell using the project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar
programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation, and I make sure that the updates of the accumulation variable keeping track of the arc length are strict. Yet it takes almost one and a half minutes to complete (compiled with the -O flag with ghc).
import Data.Int
arcLength :: Int64->Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' x y norm2 acc
        | x > y = acc
        | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x +1) acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
        | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)
main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
public static void main(String args[]) {
long n = 1 << 30;
long x = 0;
long y = n-1;
long acc = 0;
long norm2 = 0;
long time = System.currentTimeMillis();
while(x <= y) {
if (norm2 < 0) {
norm2 += 2*x + 1;
} else if (norm2 > 2*(n-1)) {
norm2 += 2 - 2*(x+y);
} else {
norm2 += 2*x + 1;
time = System.currentTimeMillis() - time;
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different versions: with Int64 or Int, and with bangs before either norm2 alone or both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 under a 64-bit Windows 7 (The Haskell platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on Int64 gets turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3
% time ./arc
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know which compiler version you are using.
There's a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int
loop :: Int -> Int
loop n = go 0 (n-1) 0 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go x y acc norm2
      | x <= y = case () of { _
          | norm2 < 0 -> go (x+1) y acc (norm2 + 2*x + 1)
          | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
          | otherwise -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
          }
      | otherwise = acc
main = print $ loop (1 `shiftL` 30)
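The control flow of go can also be exercised on small grids to confirm that the order of the two norm2 tests does not change the result. The Python sketch below transcribes the three branches and compares it with a variant that checks the upper bound first. Python integers never overflow, so this only checks the logic and says nothing about speed. The function names are mine.

```python
def arc_length(n):
    """Same branches as `go`: test norm2 < 0 first, then the upper bound."""
    x, y, acc, norm2 = 0, n - 1, 0, 0
    while x <= y:
        if norm2 < 0:
            norm2 += 2 * x + 1
            x += 1
        elif norm2 > 2 * (n - 1):
            norm2 += 2 - 2 * (x + y)
            x -= 1
            y -= 1
        else:
            norm2 += 2 * x + 1
            x += 1
            acc += 1
    return acc

def arc_length_swapped(n):
    """Variant with the two norm2 tests in the opposite order."""
    x, y, acc, norm2 = 0, n - 1, 0, 0
    while x <= y:
        if norm2 > 2 * (n - 1):
            norm2 += 2 - 2 * (x + y)
            x -= 1
            y -= 1
        elif norm2 < 0:
            norm2 += 2 * x + 1
            x += 1
        else:
            norm2 += 2 * x + 1
            x += 1
            acc += 1
    return acc

print(arc_length(8))  # 7
print(all(arc_length(n) == arc_length_swapped(n) for n in range(1, 64)))
```

The two conditions are mutually exclusive for n >= 1, so differences in test order between the Haskell and Java versions affect only the branch layout, not the answer.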
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
(+# sc3_sQd 1);
True ->
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
(-# sc2_sQc 1)
(-# sc3_sQd 1)
True ->
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.0.3):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
cmpq %rdi, %r8
jg .L8
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
movq %rsi, %rbx
jmp *0(%rbp)
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
tommd@Mavlo:Test$ time ./so
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"
I've played with the code a little, and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}
arcLength :: Int->Int
arcLength n = arcLength' 0 (n-1) 0 0 where
    arcLength' :: Int -> Int -> Int -> Int -> Int
    arcLength' !x !y !norm2 !acc
        | x > y = acc
        | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
        | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
        | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)
main = print $ arcLength (2^30)
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
real 0m3.553s
user 0m3.539s
sys 0m0.006s
