JVM crashes while authenticating pub/sub - spring-boot

I use the GCP client libraries to implement the pub/sub model in my Spring Boot application. For authentication I'm using the GOOGLE_APPLICATION_CREDENTIALS path environment variable. It works fine with other JDK/JRE versions, but it fails with a segmentation fault on the JDK/JRE mentioned below.
Environment details
Java version:
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Zulu 8.60.0.22-SA-linux-musl-x64) (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Zulu 8.60.0.22-SA-linux-musl-x64) (build 25.322-b06, mixed mode)
Log:
# A fatal error has been detected by the Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=1, tid=0x00007f99a14fcb38
#
# JRE version: OpenJDK Runtime Environment (Zulu 8.60.0.22-SA-linux-musl-x64) (8.0_322-b06) (build 1.8.0_322-b06)
#
# Java VM: OpenJDK 64-Bit Server VM (25.322-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000003fd6
#
# Core dump written. Default location: //core or core.1
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
#
# If you would like to submit a bug report, please visit:
# http://www.azul.com/support/
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+92
j java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+57
j java.lang.System.load(Ljava/lang/String;)V+7
j io.grpc.netty.shaded.io.netty.util.internal.NativeLibraryUtil.loadLibrary(Ljava/lang/String;Z)V+5
v ~StubRoutines::call_stub
J 2066 sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (0 bytes) @ 0x00007f5cad99bdf7 [0x00007f5cad99bd80+0x77]
J 2065 C1 sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (104 bytes) @ 0x00007f5cad9a2a8c [0x00007f5cad9a1900+0x118c]
J 1974 C1 sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (10 bytes) @ 0x00007f5cad961784 [0x00007f5cad961680+0x104]
J 2084 C1 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (62 bytes) @ 0x00007f5cad9a3e8c [0x00007f5cad9a3aa0+0x3ec]
j io.grpc.netty.shaded.io.netty.util.internal.NativeLibraryLoader$1.run()Ljava/lang/Object;+53
v ~StubRoutines::call_stub
J 1349 java.security.AccessController.doPrivileged(Ljava/security/PrivilegedAction;)Ljava/lang/Object; (0 bytes) @ 0x00007f5cad764f4f [0x00007f5cad764f00+0x4f]
j io.grpc.netty.shaded.io.netty.util.internal.NativeLibraryLoader.loadLibraryByHelper(Ljava/lang/Class;Ljava/lang/String;Z)V+10
j io.grpc.netty.shaded.io.netty.util.internal.NativeLibraryLoader.loadLibrary(Ljava/lang/ClassLoader;Ljava/lang/String;Z)V+15
j io.grpc.netty.shaded.io.netty.util.internal.NativeLibraryLoader.load(Ljava/lang/String;Ljava/lang/ClassLoader;)V+359
j io.grpc.netty.shaded.io.netty.channel.epoll.Native.loadNativeLibrary()V+60
j io.grpc.netty.shaded.io.netty.channel.epoll.Native.<clinit>()V+76
v ~StubRoutines::call_stub
j io.grpc.netty.shaded.io.netty.channel.epoll.Epoll.<clinit>()V+28
v ~StubRoutines::call_stub
J 993 java.lang.Class.forName0(Ljava/lang/String;ZLjava/lang/ClassLoader;Ljava/lang/Class;)Ljava/lang/Class; (0 bytes) @ 0x00007f5cad6995fa [0x00007f5cad699580+0x7a]
J 1952 C1 java.lang.Class.forName(Ljava/lang/String;)Ljava/lang/Class; (15 bytes) @ 0x00007f5cad948d4c [0x00007f5cad948ba0+0x1ac]
j io.grpc.netty.shaded.io.grpc.netty.Utils.isEpollAvailable()Z+3
j io.grpc.netty.shaded.io.grpc.netty.Utils.<clinit>()V+144
v ~StubRoutines::call_stub
j io.grpc.netty.shaded.io.grpc.netty.NettyChannelBuilder.<clinit>()V+16
v ~StubRoutines::call_stub
j io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(Ljava/lang/String;I)Lio/grpc/netty/shaded/io/grpc/netty/NettyChannelBuilder;+2
j io.grpc.netty.shaded.io.grpc.netty.NettyChannelProvider.builderForAddress(Ljava/lang/String;I)Lio/grpc/ManagedChannelBuilder;+3
j io.grpc.ManagedChannelBuilder.forAddress(Ljava/lang/String;I)Lio/grpc/ManagedChannelBuilder;+5
j com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createSingleChannel()Lio/grpc/ManagedChannel;+285
j com.google.api.gax.grpc.InstantiatingGrpcChannelProvider$$Lambda$596.createSingleChannel()Lio/grpc/ManagedChannel;+4
j com.google.api.gax.grpc.ChannelPool.<init>(Lcom/google/api/gax/grpc/ChannelPoolSettings;Lcom/google/api/gax/grpc/ChannelFactory;Ljava/util/concurrent/ScheduledExecutorService;)V+71
j com.google.api.gax.grpc.ChannelPool.create(Lcom/google/api/gax/grpc/ChannelPoolSettings;Lcom/google/api/gax/grpc/ChannelFactory;)Lcom/google/api/gax/grpc/ChannelPool;+9
j com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.createChannel()Lcom/google/api/gax/rpc/TransportChannel;+10
j com.google.api.gax.grpc.InstantiatingGrpcChannelProvider.getTransportChannel()Lcom/google/api/gax/rpc/TransportChannel;+35
j com.google.api.gax.rpc.ClientContext.create(Lcom/google/api/gax/rpc/StubSettings;)Lcom/google/api/gax/rpc/ClientContext;+179
j com.google.cloud.pubsub.v1.stub.GrpcSubscriberStub.create(Lcom/google/cloud/pubsub/v1/stub/SubscriberStubSettings;)Lcom/google/cloud/pubsub/v1/stub/GrpcSubscriberStub;+6
j com.google.cloud.pubsub.v1.Subscriber.doStart()V+16
j com.google.api.core.AbstractApiService$InnerService.doStart()V+4
j com.google.common.util.concurrent.AbstractService.startAsync()Lcom/google/common/util/concurrent/Service;+33
j com.google.api.core.AbstractApiService.startAsync()Lcom/google/api/core/ApiService;+4
j com.google.cloud.pubsub.v1.Subscriber.startAsync()Lcom/google/api/core/ApiService;+1
Dependencies:
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>com.google.cloud</groupId>
      <artifactId>libraries-bom</artifactId>
      <version>25.1.0</version>
      <type>pom</type>
      <scope>import</scope>
    </dependency>
  </dependencies>
</dependencyManagement>
<dependencies>
  <dependency>
    <groupId>com.google.cloud</groupId>
    <artifactId>google-cloud-pubsub</artifactId>
  </dependency>
</dependencies>
I'd also like to know: is there any way to authenticate other than using the path environment variable? Can I use spring.cloud.gcp.credentials.location=file:{location} with the GCP client libraries instead of the env variable?

As mentioned by @Juraj Martinka, it was a problem with the underlying Google library io.grpc.netty.shaded. Netty does not support Alpine out of the box: its native transport depends on glibc, but Alpine does not ship glibc; it uses musl libc instead.
The issue disappears if you disable Netty's native support
or if you use an image that has glibc, e.g.:
azul/zulu-openjdk-alpine:11-jre: Alpine-based, no glibc -> does not work
azul/zulu-openjdk:11: Ubuntu-based, has glibc -> works
Using -Dio.grpc.netty.shaded.io.netty.transport.noNative=true avoids the segfault, for example:
java -Dio.grpc.netty.shaded.io.netty.transport.noNative=true -jar app.jar
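If you cannot easily change the launch command, the same property can be set programmatically instead, before the first Pub/Sub client (and therefore gRPC/Netty) is initialized. A minimal sketch, assuming a standard Spring Boot entry point:

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;

@SpringBootApplication
public class Application {

    public static void main(String[] args) {
        // Disable Netty's native (epoll) transport before any gRPC channel is built,
        // so the shaded Netty never tries to load its glibc-linked .so on musl.
        System.setProperty("io.grpc.netty.shaded.io.netty.transport.noNative", "true");
        SpringApplication.run(Application.class, args);
    }
}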
The other workaround is to use grpc-netty instead of grpc-netty-shaded:
<dependency>
  <groupId>com.google.cloud</groupId>
  <artifactId>google-cloud-pubsub</artifactId>
  <exclusions>
    <exclusion>
      <groupId>io.grpc</groupId>
      <artifactId>grpc-netty-shaded</artifactId>
    </exclusion>
  </exclusions>
</dependency>
<dependency>
  <groupId>io.grpc</groupId>
  <artifactId>grpc-netty</artifactId>
</dependency>
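As for the authentication part of the question: with the plain GCP client library you can also supply credentials explicitly instead of relying on the GOOGLE_APPLICATION_CREDENTIALS variable. A rough sketch; the key path, project and subscription names below are placeholders:

import com.google.api.gax.core.FixedCredentialsProvider;
import com.google.auth.oauth2.GoogleCredentials;
import com.google.cloud.pubsub.v1.Subscriber;
import com.google.pubsub.v1.ProjectSubscriptionName;

import java.io.FileInputStream;

public class ExplicitCredentialsExample {

    public static void main(String[] args) throws Exception {
        // Load a service-account key file directly instead of via the env variable.
        GoogleCredentials credentials;
        try (FileInputStream keyFile = new FileInputStream("/path/to/service-account.json")) {
            credentials = GoogleCredentials.fromStream(keyFile);
        }

        Subscriber subscriber = Subscriber
                .newBuilder(ProjectSubscriptionName.of("my-project", "my-subscription"),
                        (message, consumer) -> consumer.ack())
                .setCredentialsProvider(FixedCredentialsProvider.create(credentials))
                .build();

        subscriber.startAsync().awaitRunning();
    }
}

Note that spring.cloud.gcp.credentials.location is read by the Spring Cloud GCP starters, not by the bare google-cloud-pubsub library, so with the plain client library you wire the credentials yourself as above.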

Related

What is the usual cause of such "illegal instruction" error, and how could I debug?

I am working on a program using the GNU Scientific Library. It gives an "illegal instruction (core dumped)" partway through solving a nonlinear equation. See below. What is the usual cause of such an "illegal instruction" error, and how could I debug it in situations like this?
...
iter 0: A = 1.0000, lambda = 1.0000, b = 0.0000, cond(J) = 6.0000, |f(x)| = 101.0200
iter 1: A = 3.5110, lambda = -12.8820, b = 1.2364, cond(J) = 92.8216, |f(x)| = -nan
iter 2: A = 3.5110, lambda = -12.8820, b = 1.2364, cond(J) = nan, |f(x)| = -nan
iter 3: A = 3.5110, lambda = -12.8820, b = 1.2364, cond(J) = nan, |f(x)| = -nan
iter 4: A = 3.5110, lambda = -12.8820, b = 1.2364, cond(J) = nan, |f(x)| = -nan
Illegal instruction (core dumped)
With gdb, I got a bit of additional info.
Program received signal SIGILL, Illegal instruction.
0x00000000004d1030 in nielsen_reject (nu=<optimized out>, mu=<optimized out>) at nielsen.c:98
98 *nu <<= 1;
(gdb) p nu
$1 = <optimized out>
(gdb) x/i $pc
=> 0x4d1030 <trust_iterate+8912>: ud2
Above, nielsen.c:98 looks like this:
...
static int
nielsen_reject(double * mu, long * nu)
{
  *mu *= (double) *nu;
  /* nu := 2*nu */
  *nu <<= 1;
  return GSL_SUCCESS;
}
The CPU is x86_64 according to uname -m. The OS is Ubuntu 16.04 in a virtual machine on a Mac host. The GCC version is 5.4.
gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
The code is compiled with gcc, with AddressSanitizer and UndefinedBehaviorSanitizer switched on (-fsanitize=address,undefined).
It seems that the compiler produces a "ud2" instruction when the sanitizers are switched on. I would like to know whether this is a compiler bug, a sanitizer bug, or a bug in my code.

Spring Boot 2. Hikari Connection Pool optimization

I have a Spring Boot app. I was doing some performance tests in the controller, and I realized that whichever query I put first in the controller takes ages compared to the others... (the DB is a remote connection, but I can't change that)
long t1 = System.nanoTime();
menuPriceSummaryService.findAllVegan().stream();
long t2 = System.nanoTime();
long elapsedTimeInSeconds = (t2 - t1) / 1000000000;
System.out.println("elapsedTimeInSeconds1 -> " + elapsedTimeInSeconds);
t1 = System.nanoTime();
menuPriceSummaryService.findAllVegan();
t2 = System.nanoTime();
elapsedTimeInSeconds = (t2 - t1) / 1000000000;
System.out.println("elapsedTimeInSeconds2 -> " + elapsedTimeInSeconds);
t1 = System.nanoTime();
menuPriceSummaryService.findAllVegan().parallelStream();
t2 = System.nanoTime();
elapsedTimeInSeconds = (t2 - t1) / 1000000000;
System.out.println("elapsedTimeInSeconds3 -> " + elapsedTimeInSeconds);
t1 = System.nanoTime();
menuPriceSummaryService.findAllVegan().parallelStream().filter(this::notInMyFavourites);
t2 = System.nanoTime();
elapsedTimeInSeconds = (t2 - t1) / 1000000000;
the time:
elapsedTimeInSeconds1 -> 76
elapsedTimeInSeconds2 -> 0
elapsedTimeInSeconds3 -> 0
elapsedTimeInSeconds4 -> 0
Is it normal?
Is there is something I can do configuring the Hikari pool to optimize this?
the pom.xml
<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
  <groupId>com.h2database</groupId>
  <artifactId>h2</artifactId>
</dependency>
<dependency>
  <groupId>mysql</groupId>
  <artifactId>mysql-connector-java</artifactId>
  <scope>runtime</scope>
</dependency>
the application.properties:
spring.datasource.url=jdbc:mysql://elcordelaciutat.awob1oxhu1so.eu-central-1.rds.amazonaws.com:3306/elcor
spring.datasource.username=elcor
spring.datasource.password=elcor2#$
spring.jpa.show-sql=false
spring.jpa.properties.hibernate.format_sql=true
hibernate.dialect=org.hibernate.dialect.MySQLDialect
You should follow Hikari's MySQL Configuration:
A typical MySQL configuration for HikariCP might look something like this:
dataSource.cachePrepStmts=true
dataSource.prepStmtCacheSize=250
dataSource.prepStmtCacheSqlLimit=2048
dataSource.useServerPrepStmts=true
dataSource.useLocalSessionState=true
dataSource.useLocalTransactionState=true
dataSource.rewriteBatchedStatements=true
dataSource.cacheResultSetMetadata=true
dataSource.cacheServerConfiguration=true
dataSource.elideSetAutoCommits=true
dataSource.maintainTimeStats=false
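If Hikari is configured through Spring Boot's application.properties, as in your setup, these driver settings can be passed through as Hikari data source properties. A sketch, assuming Spring Boot's spring.datasource.hikari.data-source-properties binding (the remaining settings follow the same pattern):

# Passed through by HikariCP to the MySQL Connector/J driver
spring.datasource.hikari.data-source-properties.cachePrepStmts=true
spring.datasource.hikari.data-source-properties.prepStmtCacheSize=250
spring.datasource.hikari.data-source-properties.prepStmtCacheSqlLimit=2048
spring.datasource.hikari.data-source-properties.useServerPrepStmts=true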

SIGSEGV in Chronicle Queue 4.5.19

What would cause Chronicle Queue to segfault? I assume I've missed a configuration somewhere. I have a readOnly Chronicle Queue created like this:
ChronicleQueue readQueue = SingleChronicleQueueBuilder.binary(readBasePath).readOnly(true).build();
The JVM segfaulted at 2016-12-31T00:00:00, which is when I assume the queue file was cycled. This is the environment:
Chronicle Queue 4.5.19
JVM OpenJDK 1.8.0_112-b16
Ubuntu 14.04.3 LTS Linux 3.13.0-74
Here is the stacktrace:
V [libjvm.so+0xa08d97]
J 875 sun.misc.Unsafe.compareAndSwapInt(Ljava/lang/Object;JII)Z (0 bytes) @ 0x00007fde1d328c46 [0x00007fde1d328b80+0xc6]
j net.openhft.chronicle.core.UnsafeMemory.compareAndSwapInt(JII)Z+8
j net.openhft.chronicle.bytes.NativeBytesStore.compareAndSwapInt(JII)Z+17
j net.openhft.chronicle.bytes.AbstractBytes.compareAndSwapInt(JII)Z+16
j net.openhft.chronicle.wire.AbstractWire.writeEndOfWire(JLjava/util/concurrent/TimeUnit;J)V+32
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueStore.writeEOF(Lnet/openhft/chronicle/wire/Wire;J)V+9
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.checkMoveToNextCycle(ZLnet/openhft/chronicle/bytes/Bytes;)Z+43
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.inACycle(Z)Z+176
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.next(Z)Z+12
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.readingDocument(Z)Lnet/openhft/chronicle/wire/DocumentContext;+6
j net.openhft.chronicle.queue.ExcerptTailer.readingDocument()Lnet/openhft/chronicle/wire/DocumentContext;+2
j net.openhft.chronicle.wire.MarshallableIn.readDocument(Lnet/openhft/chronicle/wire/ReadMarshallable;)Z+1
That looks like a race condition. Once a memory mapping has truly been freed it cannot be accessed, or it will trigger a segmentation fault. The reason I suspect this is that the mapping should be freed on a roll from one cycle to the next.
I have added an issue https://github.com/OpenHFT/Chronicle-Queue/issues/319

segmentation fault with negative index

This code fragment gives a segfault on the line marked with ->; please note n=3
real_t _b[n+1];
real_t * b = _b+1;
std::fill( b, b + n , (real_t)0.0 );
for ( unsigned c = 0; c < n; c ++ )
{
-> b[c-1] = 0; b[c] = 1;
Lsolve( xtmp, lu, b, n );
I'm told this is because I'm on 64-bit (Linux amd64, gcc 4.6, debug flag -O0).
Could anyone tell me more?
It's to do with the index expression: c is unsigned, so with c == 0 the expression c-1 wraps around to 4294967295 rather than -1, and that value is added to the address. On a 32-bit build the pointer arithmetic wraps modulo 2^32, so it happens to land on b[-1] (i.e. _b[0]) and appears to work; on a 64-bit build it becomes a huge positive offset and the access lands far outside the array.
Detailed here: http://www.devx.com/tips/Tip/41349

Performance problem with Euler problem and recursion on Int64 types

I'm currently learning Haskell, using the Project Euler problems as my playground.
I was astounded by how slow my Haskell programs turned out to be compared to similar programs written in other languages. I'm wondering if I've overlooked something, or if this is the kind of performance penalty one has to expect when using Haskell.
The following program is inspired by Problem 331, but I've changed it before posting so I don't spoil anything for other people. It computes the arc length of a discrete circle drawn on a 2^30 x 2^30 grid. It is a simple tail-recursive implementation, and I make sure that the updates of the accumulation variable keeping track of the arc length are strict. Yet it takes almost one and a half minutes to complete (compiled with the -O flag with GHC).
import Data.Int

arcLength :: Int64 -> Int64
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' x y norm2 acc
    | x > y = acc
    | norm2 < 0 = arcLength' (x + 1) y (norm2 + 2*x + 1) acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y-1) (norm2 - 2*(x + y) + 2) acc
    | otherwise = arcLength' (x + 1) y (norm2 + 2*x + 1) $! (acc + 1)

main = print $ arcLength (2^30)
Here is a corresponding implementation in Java. It takes about 4.5 seconds to complete.
public class ArcLength {
    public static void main(String args[]) {
        long n = 1 << 30;
        long x = 0;
        long y = n - 1;
        long acc = 0;
        long norm2 = 0;
        long time = System.currentTimeMillis();
        while (x <= y) {
            if (norm2 < 0) {
                norm2 += 2*x + 1;
                x++;
            } else if (norm2 > 2*(n-1)) {
                norm2 += 2 - 2*(x+y);
                x--;
                y--;
            } else {
                norm2 += 2*x + 1;
                x++;
                acc++;
            }
        }
        time = System.currentTimeMillis() - time;
        System.err.println(acc);
        System.err.println(time);
    }
}
EDIT: After the discussions in the comments I made some modifications to the Haskell code and did some performance tests. First I changed n to 2^29 to avoid overflows. Then I tried 6 different versions: with Int64 or Int, and with bangs before either norm2 alone or both norm2 and acc in the declaration arcLength' x y !norm2 !acc. All are compiled with
ghc -O3 -prof -rtsopts -fforce-recomp -XBangPatterns arctest.hs
Here are the results:
(Int !norm2 !acc)
total time = 3.00 secs (150 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 !acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int norm2 acc)
total time = 3.56 secs (178 ticks @ 20 ms)
total alloc = 2,892 bytes (excludes profiling overheads)
(Int64 norm2 acc)
arctest.exe: out of memory
(Int64 norm2 !acc)
total time = 48.46 secs (2423 ticks @ 20 ms)
total alloc = 26,246,173,228 bytes (excludes profiling overheads)
(Int64 !norm2 !acc)
total time = 31.46 secs (1573 ticks @ 20 ms)
total alloc = 3,032 bytes (excludes profiling overheads)
I'm using GHC 7.0.2 on 64-bit Windows 7 (the Haskell Platform binary distribution). According to the comments, the problem does not occur when compiling under other configurations. This makes me think that the Int64 type is broken in the Windows release.
Hm, I installed a fresh Haskell platform with 7.0.3, and get roughly the following core for your program (-ddump-simpl):
Main.$warcLength' =
\ (ww_s1my :: GHC.Prim.Int64#) (ww1_s1mC :: GHC.Prim.Int64#)
(ww2_s1mG :: GHC.Prim.Int64#) (ww3_s1mK :: GHC.Prim.Int64#) ->
case {__pkg_ccall ghc-prim hs_gtInt64 [...]
ww_s1my ww1_s1mC GHC.Prim.realWorld#
[...]
So GHC has realized that it can unpack your integers, which is good. But this hs_gtInt64 call looks suspiciously like a C call. Looking at the assembler output (-ddump-asm), we see stuff like:
pushl %eax
movl 76(%esp),%eax
pushl %eax
call _hs_gtInt64
addl $16,%esp
So this looks very much like every operation on the Int64 gets turned into a full-blown C call in the backend. Which is slow, obviously.
The source code of GHC.IntWord64 seems to verify that: In a 32-bit build (like the one currently shipped with the platform), you will have only emulation via the FFI interface.
Hmm, this is interesting. So I just compiled both of your programs, and tried them out:
% java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.7) (6b18-1.8.7-2~squeeze1)
OpenJDK 64-Bit Server VM (build 14.0-b16, mixed mode)
% javac ArcLength.java
% java ArcLength
843298604
6630
So about 6.6 seconds for the Java solution. Next is ghc with some optimization:
% ghc --version
The Glorious Glasgow Haskell Compilation System, version 6.12.1
% ghc --make -O arc.hs
% time ./arc
843298604
./arc 12.68s user 0.04s system 99% cpu 12.718 total
Just under 13 seconds for ghc -O
Trying with some further optimization:
% ghc --make -O3
% time ./arc
843298604
./arc 5.75s user 0.00s system 99% cpu 5.754 total
With further optimization flags, the Haskell solution took under 6 seconds.
It would be interesting to know what version compiler you are using.
There are a couple of interesting things in your question.
You should be using -O2 primarily. It will just do a better job (in this case, identifying and removing laziness that was still present in the -O version).
Secondly, your Haskell isn't quite the same as the Java (it does different tests and branches). As with others, running your code on my Linux box results in around 6s runtime. It seems fine.
Make sure it is the same as the Java
One idea: let's do a literal transcription of your Java, with the same control flow, operations and types.
import Data.Bits
import Data.Int

loop :: Int -> Int
loop n = go 0 (n-1) 0 0
  where
    go :: Int -> Int -> Int -> Int -> Int
    go x y acc norm2
      | x <= y = case () of { _
          | norm2 < 0         -> go (x+1) y acc     (norm2 + 2*x + 1)
          | norm2 > 2 * (n-1) -> go (x-1) (y-1) acc (norm2 + 2 - 2 * (x+y))
          | otherwise         -> go (x+1) y (acc+1) (norm2 + 2*x + 1)
        }
      | otherwise = acc

main = print $ loop (1 `shiftL` 30)
Peek at the core
We'll take a quick peek at the Core, using ghc-core, and it shows a very nice loop of unboxed type:
main_$s$wgo
:: Int#
-> Int#
-> Int#
-> Int#
-> Int#
main_$s$wgo =
\ (sc_sQa :: Int#)
(sc1_sQb :: Int#)
(sc2_sQc :: Int#)
(sc3_sQd :: Int#) ->
case <=# sc3_sQd sc2_sQc of _ {
False -> sc1_sQb;
True ->
case <# sc_sQa 0 of _ {
False ->
case ># sc_sQa 2147483646 of _ {
False ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
(+# sc1_sQb 1)
sc2_sQc
(+# sc3_sQd 1);
True ->
main_$s$wgo
(-#
(+# sc_sQa 2)
(*# 2 (+# sc3_sQd sc2_sQc)))
sc1_sQb
(-# sc2_sQc 1)
(-# sc3_sQd 1)
};
True ->
main_$s$wgo
(+# (+# sc_sQa (*# 2 sc3_sQd)) 1)
sc1_sQb
sc2_sQc
(+# sc3_sQd 1)
that is, all unboxed into registers. That loop looks great!
And performs just fine (Linux/x86-64/GHC 7.0.3):
./A 5.95s user 0.01s system 99% cpu 5.980 total
Checking the asm
We get reasonable assembly too, as a nice loop:
Main_mainzuzdszdwgo_info:
cmpq %rdi, %r8
jg .L8
.L3:
testq %r14, %r14
movq %r14, %rdx
js .L4
cmpq $2147483646, %r14
jle .L9
.L5:
leaq (%rdi,%r8), %r10
addq $2, %rdx
leaq -1(%rdi), %rdi
addq %r10, %r10
movq %rdx, %r14
leaq -1(%r8), %r8
subq %r10, %r14
jmp Main_mainzuzdszdwgo_info
.L9:
leaq 1(%r14,%r8,2), %r14
addq $1, %rsi
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
.L8:
movq %rsi, %rbx
jmp *0(%rbp)
.L4:
leaq 1(%r14,%r8,2), %r14
leaq 1(%r8), %r8
jmp Main_mainzuzdszdwgo_info
Using the -fvia-C backend.
So this looks fine!
My suspicion, as mentioned in the comment above, is something to do with the version of libgmp you have on 32 bit Windows generating poor code for 64 bit ints. First try upgrading to GHC 7.0.3, and then try some of the other code generator backends, then if you still have an issue with Int64, file a bug report to GHC trac.
Broadly confirming that it is indeed the cost of making those C calls in the 32 bit emulation of 64 bit ints, we can replace Int64 with Integer, which is implemented with C calls to GMP on every machine, and indeed, runtime goes from 3s to well over a minute.
Lesson: use hardware 64 bits if at all possible.
The normal optimization flag for performance concerned code is -O2. What you used, -O, does very little. -O3 doesn't do much (any?) more than -O2 - it even used to include experimental "optimizations" that often made programs notably slower.
With -O2 I get performance competitive with Java:
tommd@Mavlo:Test$ uname -r -m
2.6.37 x86_64
tommd@Mavlo:Test$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.0.3
tommd@Mavlo:Test$ ghc -O2 so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.948s
user 0m4.896s
sys 0m0.000s
And Java is about 1 second faster (20%):
tommd@Mavlo:Test$ time java ArcLength
843298604
3880
real 0m3.961s
user 0m3.936s
sys 0m0.024s
But an interesting thing about GHC is it has many different backends. By default it uses the native code generator (NCG), which we timed above. There's also an LLVM backend that often does better... but not here:
tommd@Mavlo:Test$ ghc -O2 so.hs -fllvm -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m5.973s
user 0m5.968s
sys 0m0.000s
But, as FUZxxl mentioned in the comments, LLVM does much better when you add a few strictness annotations:
$ ghc -O2 -fllvm -fforce-recomp so.hs
[1 of 1] Compiling Main ( so.hs, so.o )
Linking so ...
tommd@Mavlo:Test$ time ./so
843298604
real 0m4.099s
user 0m4.088s
sys 0m0.000s
There's also an old "via-c" generator that uses C as an intermediate language. It does well in this case:
tommd@Mavlo:Test$ ghc -O2 so.hs -fvia-c -fforce-recomp
[1 of 1] Compiling Main ( so.hs, so.o )
on the commandline:
Warning: The -fvia-c flag will be removed in a future GHC release
Linking so ...
ttommd@Mavlo:Test$ ti
tommd@Mavlo:Test$ time ./so
843298604
real 0m3.982s
user 0m3.972s
sys 0m0.000s
Hopefully the NCG will be improved to match via-c for this case before they remove this backend.
dberg, I feel like all of this got off to a bad start with the unfortunate -O flag. Just to emphasize a point made by others, for run-of-the-mill compilation and testing, do like me and paste this into your .bashrc or whatever:
alias ggg="ghc --make -O2"
alias gggg="echo 'Glorious Glasgow for Great Good!' && ghc --make -O2 -fforce-recomp"
I've played with the code a little, and this version seems to run faster than the Java version on my laptop (3.55s vs 4.63s):
{-# LANGUAGE BangPatterns #-}

arcLength :: Int -> Int
arcLength n = arcLength' 0 (n-1) 0 0 where
  arcLength' :: Int -> Int -> Int -> Int -> Int
  arcLength' !x !y !norm2 !acc
    | x > y = acc
    | norm2 > 2*(n-1) = arcLength' (x - 1) (y - 1) (norm2 - 2*(x + y) + 2) acc
    | norm2 < 0 = arcLength' (succ x) y (norm2 + x*2 + 1) acc
    | otherwise = arcLength' (succ x) y (norm2 + 2*x + 1) (acc + 1)

main = print $ arcLength (2^30)
Compiled and run:
$ ghc -O2 tmp1.hs -fforce-recomp
[1 of 1] Compiling Main ( tmp1.hs, tmp1.o )
Linking tmp1 ...
$ time ./tmp1
843298604
real 0m3.553s
user 0m3.539s
sys 0m0.006s
