SIGSEGV in Chronicle Queue 4.5.19 - chronicle

What would cause Chronicle Queue to segfault? I assume I've missed a configuration somewhere. I have a readOnly Chronicle Queue created like this:
ChronicleQueue readQueue = SingleChronicleQueueBuilder.binary(readBasePath).readOnly(true).build();
The JVM segfaulted at 2016-12-31T00:00:00, which is when I assume the queue file was cycled. This is the environment:
Chronicle Queue 4.5.19
JVM OpenJDK 1.8.0_112-b16
Ubuntu 14.04.3 LTS Linux 3.13.0-74
Here is the stacktrace:
> V []
J 875 sun.misc.Unsafe.compareAndSwapInt(Ljava/lang/Object;JII)Z (0 bytes) @ 0x00007fde1d328c46 [0x00007fde1d328b80+0xc6]
j net.openhft.chronicle.core.UnsafeMemory.compareAndSwapInt(JII)Z+8
j net.openhft.chronicle.bytes.NativeBytesStore.compareAndSwapInt(JII)Z+17
j net.openhft.chronicle.bytes.AbstractBytes.compareAndSwapInt(JII)Z+16
j net.openhft.chronicle.wire.AbstractWire.writeEndOfWire(JLjava/util/concurrent/TimeUnit;J)V+32
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueStore.writeEOF(Lnet/openhft/chronicle/wire/Wire;J)V+9
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.checkMoveToNextCycle(ZLnet/openhft/chronicle/bytes/Bytes;)Z+43
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.inACycle(Z)Z+176
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$
j net.openhft.chronicle.queue.impl.single.SingleChronicleQueueExcerpts$StoreTailer.readingDocument(Z)Lnet/openhft/chronicle/wire/DocumentContext;+6
j net.openhft.chronicle.queue.ExcerptTailer.readingDocument()Lnet/openhft/chronicle/wire/DocumentContext;+2
j net.openhft.chronicle.wire.MarshallableIn.readDocument(Lnet/openhft/chronicle/wire/ReadMarshallable;)Z+1

That looks like a race condition. Once a memory mapping is truly freed, it cannot be accessed without triggering a segmentation fault. The reason I suspect this is that the mapping should be freed on a roll from one cycle to the next.
I have added an issue


JVM crashes while authenticating pub/sub

I use the GCP client libraries to implement a pub/sub model in my Spring Boot application. For authentication I'm using the GOOGLE_APPLICATION_CREDENTIALS path environment variable. It works fine with other JDK/JRE versions, but it fails with a segmentation fault on the JDK/JRE listed below.
Environment details
Java version:
openjdk version "1.8.0_322"
OpenJDK Runtime Environment (Zulu (build 1.8.0_322-b06)
OpenJDK 64-Bit Server VM (Zulu (build 25.322-b06, mixed mode)
# A fatal error has been detected by the Java Runtime Environment:
# SIGSEGV (0xb) at pc=0x0000000000003fd6, pid=1, tid=0x00007f99a14fcb38
# JRE version: OpenJDK Runtime Environment (Zulu (8.0_322-b06) (build 1.8.0_322-b06)
# Java VM: OpenJDK 64-Bit Server VM (25.322-b06 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C 0x0000000000003fd6
# Core dump written. Default location: //core or core.1
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
# If you would like to submit a bug report, please visit:
Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j java.lang.ClassLoader$NativeLibrary.load(Ljava/lang/String;Z)V+0
j java.lang.ClassLoader.loadLibrary0(Ljava/lang/Class;Ljava/io/File;)Z+328
j java.lang.ClassLoader.loadLibrary(Ljava/lang/Class;Ljava/lang/String;Z)V+92
j java.lang.Runtime.load0(Ljava/lang/Class;Ljava/lang/String;)V+57
j java.lang.System.load(Ljava/lang/String;)V+7
v ~StubRoutines::call_stub
J 2066 sun.reflect.NativeMethodAccessorImpl.invoke0(Ljava/lang/reflect/Method;Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (0 bytes) @ 0x00007f5cad99bdf7 [0x00007f5cad99bd80+0x77]
J 2065 C1 sun.reflect.NativeMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (104 bytes) @ 0x00007f5cad9a2a8c [0x00007f5cad9a1900+0x118c]
J 1974 C1 sun.reflect.DelegatingMethodAccessorImpl.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (10 bytes) @ 0x00007f5cad961784 [0x00007f5cad961680+0x104]
J 2084 C1 java.lang.reflect.Method.invoke(Ljava/lang/Object;[Ljava/lang/Object;)Ljava/lang/Object; (62 bytes) @ 0x00007f5cad9a3e8c [0x00007f5cad9a3aa0+0x3ec]
v ~StubRoutines::call_stub
J 1349;)Ljava/lang/Object; (0 bytes) @ 0x00007f5cad764f4f [0x00007f5cad764f00+0x4f]
v ~StubRoutines::call_stub
v ~StubRoutines::call_stub
J 993 java.lang.Class.forName0(Ljava/lang/String;ZLjava/lang/ClassLoader;Ljava/lang/Class;)Ljava/lang/Class; (0 bytes) @ 0x00007f5cad6995fa [0x00007f5cad699580+0x7a]
J 1952 C1 java.lang.Class.forName(Ljava/lang/String;)Ljava/lang/Class; (15 bytes) @ 0x00007f5cad948d4c [0x00007f5cad948ba0+0x1ac]
v ~StubRoutines::call_stub
v ~StubRoutines::call_stub
j io.grpc.ManagedChannelBuilder.forAddress(Ljava/lang/String;I)Lio/grpc/ManagedChannelBuilder;+5
I would also like to know whether there is any other way to authenticate besides the path environment variable. Can I use {location} with the GCP client libraries instead of the env variable?
As mentioned by @Juraj Martinka, the problem was with the underlying Google library io.grpc.netty.shaded. It seems Netty does not support Alpine, since Netty depends on glibc and Alpine ships musl libc instead.
The issue disappears if you disable Netty's native support
or if you use an image that has glibc, e.g.:
azul/zulu-openjdk-alpine:11-jre: Alpine-based, no glibc -> does not work
azul/zulu-openjdk:11: Ubuntu-based, has glibc -> works
Using avoids the segfault
java -jar app.jar
The other workaround is to use grpc-netty instead of grpc-netty-shaded.

CUDA profiler reports inefficient global memory access

I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses. My kernel code is:
__global__ void update_particles_kernel
(
    float4 *pos,
    float4 *vel,
    float4 *acc,
    float dt,
    int numParticles
)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    int offset = 0;
    // Grid-stride loop: each thread handles every (blockDim.x * gridDim.x)-th particle
    while(index + offset < numParticles)
    {
        vel[index + offset].x += dt*acc[index + offset].x; // line 247
        vel[index + offset].y += dt*acc[index + offset].y;
        vel[index + offset].z += dt*acc[index + offset].z;
        pos[index + offset].x += dt*vel[index + offset].x; // line 251
        pos[index + offset].y += dt*vel[index + offset].y;
        pos[index + offset].z += dt*vel[index + offset].z;
        offset += blockDim.x * gridDim.x;
    }
}
In particular the profiler reports the following:
From the CUDA best practices guide it says:
"For devices of compute capability 2.x, the requirements can be summarized quite easily: the concurrent accesses of the threads of a warp will coalesce into a number of transactions equal to the number of cache lines necessary to service all of the threads of the warp. By default, all accesses are cached through L1, which as 128-byte lines. For scattered access patterns, to reduce overfetch, it can sometimes be useful to cache only in L2, which caches shorter 32-byte segments (see the CUDA C Programming Guide).
For devices of compute capability 3.x, accesses to global memory are cached only in L2; L1 is reserved for local memory accesses. Some devices of compute capability 3.5, 3.7, or 5.2 allow opt-in caching of globals in L1 as well."
Based on this information, I would expect my kernel to need 16 accesses to service a 32-thread warp, because float4 is 16 bytes and on my card (a GTX 770M, compute capability 3.0) reads from the L2 cache are performed in 32-byte chunks (16 bytes * 32 threads / 32-byte cache lines = 16 accesses). Indeed, as you can see, the profiler reports that I am doing 16 accesses. What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247 and only 4 L2 transactions per access for the remaining lines. Can someone explain what I am missing here?
"I have a simple CUDA kernel which I thought was accessing global memory efficiently. The Nvidia profiler however reports that I am performing inefficient global memory accesses."
To take one example, your float4 vel array is stored in memory like this:
0.x 0.y 0.z 0.w 1.x 1.y 1.z 1.w 2.x 2.y 2.z 2.w 3.x 3.y 3.z 3.w ...
 ^               ^               ^               ^              ...
thread0         thread1         thread2         thread3
So when you do this:
vel[index + offset].x += ...; // line 247
you are accessing (storing) at the locations (.x) that I have marked above. The gaps in between each ^ mark indicate an inefficient access pattern, which the profiler is pointing out. (It does not matter that in the very next line of code, you are storing to the .y locations.)
There are at least 2 solutions, one of which would be a classical AoS -> SoA reorganization of your data, with appropriate code adjustments. This is well documented (e.g. here on the cuda tag and elsewhere) in terms of what it means, and how to do it, so I will let you look that up.
The other typical solution is to load a float4 quantity per thread, when you need it, and store a float4 quantity per thread, when you need to. Your code can be trivially reworked to do this, which should give improved profiling results:
//preceding code need not change
while(index + offset < numParticles)
{
    // Load whole float4 values so each thread issues one 16-byte access
    float4 my_vel = vel[index + offset];
    float4 my_acc = acc[index + offset];
    my_vel.x += dt*my_acc.x;
    my_vel.y += dt*my_acc.y;
    my_vel.z += dt*my_acc.z;
    vel[index + offset] = my_vel;
    float4 my_pos = pos[index + offset];
    my_pos.x += dt*my_vel.x;
    my_pos.y += dt*my_vel.y;
    my_pos.z += dt*my_vel.z;
    pos[index + offset] = my_pos;
    offset += blockDim.x * gridDim.x;
}
Even though you might think that this code is "less efficient" than your code, because your code "appears" to be only loading and storing .x, .y, .z, whereas mine "appears" to also load and store .w, in fact there is essentially no difference, due to the way a GPU loads and stores to/from global memory. Although your code does not appear to touch .w, in the process of accessing the adjacent elements, the GPU will load the .w elements from global memory, and also (eventually) store the .w elements back to global memory.
"What I don't understand is why the profiler reports that the ideal access would involve 8 L2 transactions per access for line 247"
For line 247 in your original code, you are accessing one float quantity per thread for the load operation of acc.x, and one float quantity per thread for the load operation of vel.x. A float quantity per thread by itself should require 128 bytes for a warp, which is 4 32-byte L2 cachelines. Two loads together would require 8 L2 cacheline loads. This is the ideal case, which assumes that the quantities are packed together nicely (SoA). But that is not what you have (you have AoS).
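The transaction counts above can be checked with a little arithmetic. The following is a minimal sketch in Go (not CUDA), assuming 32-byte L2 segments and a 32-thread warp as described in the question; the function name segmentsTouched is hypothetical and only models which segments a warp touches, not the real hardware.

```go
package main

import "fmt"

const (
	warpSize     = 32 // threads per warp
	segmentBytes = 32 // assumed L2 transaction size, as stated in the question
)

// segmentsTouched counts the distinct 32-byte segments covered when each
// thread t of a warp accesses accessBytes bytes starting at byte offset
// t*strideBytes.
func segmentsTouched(strideBytes, accessBytes int) int {
	seen := make(map[int]bool)
	for t := 0; t < warpSize; t++ {
		first := t * strideBytes
		last := first + accessBytes - 1
		for s := first / segmentBytes; s <= last/segmentBytes; s++ {
			seen[s] = true
		}
	}
	return len(seen)
}

func main() {
	// AoS: one float (4 bytes) per thread, 16 bytes apart (the .x of a float4).
	fmt.Println("AoS, float per thread: ", segmentsTouched(16, 4))
	// SoA: one float per thread, packed back to back (the profiler's ideal).
	fmt.Println("SoA, float per thread: ", segmentsTouched(4, 4))
	// AoS: a whole float4 (16 bytes) per thread.
	fmt.Println("AoS, float4 per thread:", segmentsTouched(16, 16))
}
```

Under these assumptions the strided float access touches 16 segments per load, while the packed layout touches 4, which matches the "ideal" figure of 4 per load (8 for the two loads on line 247). Loading a full float4 also touches 16 segments, but then every byte fetched is requested by a thread.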

How to detect what is preventing multiple cores being used in golang?

So, I have a piece of code that is concurrent and is meant to run on each CPU/core.
There are two large vectors with input/output values
var (
	input  = make([]float64, rowCount)
	output = make([]float64, rowCount)
)
These are filled, and I want to compute the distance (error) between each input-output pair. Since the pairs are independent, a possible concurrent version is the following:
var d float64 // Error to be computed
// Set up a worker "for each CPU"
ch := make(chan float64)
nw := runtime.NumCPU()
for w := 0; w < nw; w++ {
	go func(id int) {
		var wd float64
		// e.g. nw = 4
		// worker0, i = 0, 4, 8, 12...
		// worker1, i = 1, 5, 9, 13...
		// worker2, i = 2, 6, 10, 14...
		// worker3, i = 3, 7, 11, 15...
		for i := id; i < rowCount; i += nw {
			res := compute(input[i])
			wd += distance(res, output[i])
		}
		ch <- wd // send this worker's partial sum
	}(w)
}
// Compute total distance
for w := 0; w < nw; w++ {
	d += <-ch
}
The idea is to have a single worker for each CPU/core, and each worker processes a subset of the rows.
The problem I'm having is that this code is no faster than the serial code.
Now, I'm using Go 1.7, so runtime.GOMAXPROCS should already be set to runtime.NumCPU(), but even setting it explicitly does not improve performance.
distance is just (a-b)*(a-b);
compute is a bit more complex, but should be reentrant and use global data only for reading (and uses math.Pow and math.Sqrt functions);
no other goroutine is running.
So, besides accessing the global data (input/output) for reading, there are no locks/mutexes that I am aware of (not using math/rand, for example).
I also compiled with -race and nothing emerged.
My host has 4 virtual cores, but when I run this code htop shows CPU usage of about 102%. I expected something around 380%, as happened in the past with other Go code that used all the cores.
I would like to investigate, but I don't know how the runtime allocates threads and schedules goroutines.
How can I debug this kind of issue? Can pprof help me in this case? What about the runtime package?
Thanks in advance
Sorry, but in the end I got the measurement wrong. @JimB was right, and I had a minor leak, but not enough to justify a slowdown of this magnitude.
My expectations were too high: the function I was making concurrent was called only at the beginning of the program, so the performance improvement was minor.
After applying the pattern to other sections of the program, I got the expected results. My mistake was in evaluating which section mattered most.
Anyway, I learned a lot of interesting things meanwhile, so thanks a lot to all the people trying to help!

golang slice allocation performance

I stumbled upon an interesting thing while checking the performance of memory allocation in Go.
package main

import (
	"fmt"
	"time"
)

func main() {
	const alloc int = 65536
	now := time.Now()
	loop := 50000
	for i := 0; i < loop; i++ {
		sl := make([]byte, alloc)
		i += len(sl) * 0 // use sl so the compiler does not reject it as unused
	}
	elapsed := time.Since(now)
	fmt.Printf("took %s to allocate %d bytes %d times", elapsed, alloc, loop)
}
I am running this on a Core i7-2600 with Go 1.6 64-bit (same results on 32-bit) and 16 GB of RAM, on Windows 10.
When alloc is 65536 (exactly 64K) it runs for 30 seconds (!!!!).
When alloc is 65535 it takes ~200ms.
Can someone explain this to me please?
I tried the same code at home with my Core i7-920 @ 3.8GHz, but it didn't show the same results (both took around 200ms). Does anyone have an idea what's going on?
Setting GOGC=off improved performance (down to less than 100ms). Why?
Because of escape analysis. When you build with go build -gcflags -m, the compiler prints which allocations escape to the heap. It depends on your machine and Go compiler version, but when the compiler decides that an allocation should move to the heap, it means two things:
1. The allocation will take longer (since "allocating" on the stack is just 1 CPU instruction).
2. The GC will have to clean up that memory later, costing more CPU time.
On my machine, the allocation of 65536 bytes escapes to the heap and 65535 doesn't.
That's why 1 byte changed the whole process from 200ms to 30s. Amazing.
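A rough way to observe this without reading compiler output is to count heap allocations per call with testing.AllocsPerRun. The sketch below is illustrative only: the function names are hypothetical, and the result for sizes near 64KB depends on the Go version, so the comments make claims only about the clearly small and clearly large cases.

```go
package main

import (
	"fmt"
	"testing"
)

// sink is a package-level variable so the compiler cannot discard the work.
var sink byte

// allocsSmall: a 1KB slice that does not escape is expected to stay on the stack.
func allocsSmall() float64 {
	return testing.AllocsPerRun(100, func() {
		buf := make([]byte, 1024)
		buf[len(buf)-1] = 1
		sink += buf[len(buf)-1]
	})
}

// allocs65535 and allocs65536 straddle the 64KB boundary from the question.
func allocs65535() float64 {
	return testing.AllocsPerRun(100, func() {
		buf := make([]byte, 65535)
		buf[len(buf)-1] = 1
		sink += buf[len(buf)-1]
	})
}

func allocs65536() float64 {
	return testing.AllocsPerRun(100, func() {
		buf := make([]byte, 65536)
		buf[len(buf)-1] = 1
		sink += buf[len(buf)-1]
	})
}

// allocsLarge: a 1MB slice is well past the limit and is expected to go to the heap.
func allocsLarge() float64 {
	return testing.AllocsPerRun(100, func() {
		buf := make([]byte, 1<<20)
		buf[len(buf)-1] = 1
		sink += buf[len(buf)-1]
	})
}

func main() {
	fmt.Println("heap allocations per call")
	fmt.Println("  1024 bytes: ", allocsSmall())
	fmt.Println("  65535 bytes:", allocs65535())
	fmt.Println("  65536 bytes:", allocs65536()) // version-dependent
	fmt.Println("  1 MB:       ", allocsLarge())
}
```

A count of 0 means the slice stayed on the stack; 1 means it went to the heap and will later cost GC work. Running the same program with different Go releases should show whether the boundary sits at 65535 or 65536 for that compiler.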
Note/Update 2021: as Tapir Liu notes in Go101 with this tweet:
As of Go 1.17, Go runtime will allocate the elements of slice x on stack if the compiler proves they are only used in the current goroutine and N <= 64KB:
var x = make([]byte, N)
And Go runtime will allocate the array y on stack if the compiler proves it is only used in the current goroutine and N <= 10MB:
var y [N]byte
Then how to allocate (the elements of) a slice whose size is larger than 64KB but not larger than 10MB on the stack (when the slice is only used in one goroutine)?
Just use the following way:
var y [N]byte
var x = y[:]
Considering that stack allocation is faster than heap allocation, that would have a direct effect on your test when alloc is 65536 or more.
Tapir adds:
In fact, we could allocate slices with arbitrary sum element sizes on stack.
const N = 500 * 1024 * 1024 // 500M
var v byte = 123
func createSlice() byte {
	var s = []byte{N: 0}
	for i := range s { s[i] = v }
	return s[v]
}
Changing 500 to 512 makes the program crash.
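The reason is very simple. Compare the frame size (the number after the $) in the generated assembly:
const alloc int = 65535
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $65784-0
const alloc int = 65536
0x0000 00000 (example.go:8) TEXT "".main(SB), ABIInternal, $248-0
The difference is where the slice is created. With 65535, main has a 65784-byte stack frame, so the slice lives on the stack; with 65536, the frame is only 248 bytes, so the slice is allocated on the heap.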

OpenMP program freezing before starting loop?

I have a program I am trying to parallelize using OpenMP - it makes a very large loop over some data. Since incrementing a shared variable (so I can report progress as it goes) is somewhat of an issue, I thought I'd break the loop up into smaller chunks, loop over those multiple times, and just report the status at the end of/outside the openmp loop.
Problem is, before the OpenMP for loop starts for the 3rd time, the program locks up. Just sits there, does nothing. I've stripped out all but the simplest code. Here it is:
// some other variable declarations for removed code above here
int dbl = 0;
int lasttime = 0;
int seedbase = 0;
const char *pl = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
const double mm = 62.0 / 2147483647.0;
for(dbl = 0; dbl < 2048 && !abort; dbl++) {
    seedbase = dbl; //(dbl * 2097152) - 2147483648;
    printf("Loop %d %d\n", dbl, abort);
    #pragma omp parallel for private(seed) shared(dbl)
    for(seed = 0; seed < 20971; seed++) { //52
        if(dbl == 2)
            printf("oo");
    }
    lasttime = time();
    hps = (double)((dbl*2097152) * clk_tck) / (double)((times(&tms) - start_time));
    printf("So far: %0.2fsec (%0.2fhps) %0.2f sec left\n", (double)(times(&tms) - start_time) / (double)clk_tck, hps, (((long)1 << 32) - (dbl * 2097152)) / hps);
}
When compiled and run, I get:
Loop 0 0
So far: 0.02sec (0.00hps) inf sec left
Loop 1 0
So far: 0.02sec (104857600.00hps) 40.94 sec left
Loop 2 0
Loop 0 starts, the OpenMP loop runs (and does nothing) and then exits, and the "So far:" line is printed.
Loop 1 starts, same thing.
Loop 2 starts, and everything hangs. The printf("oo"); never happens. If I change the line to be if(dbl <= 2) my screen fills with looped "oo"'s as the loop runs.
But before the seed loop ever happens the third time - it's dead. Just sits there chewing up CPU time doing nothing.
Can you not re-enter an OpenMP loop in quick succession? Is that the issue? I find it odd that it ALWAYS stops before the 3rd run, regardless of how complex the code inside the seed loop is (I removed 200 lines of code and it had no effect).
