XIO: fatal IO error 11 caused by 32-bit libxcb - x11

Yes, this question has been asked before, but reading the answers didn't enlighten me much.
I wrote a C program that crashes after a few days of use. An important point is that it does NOT generate a core file, even though everything is set up so that it should (core_pattern, ulimit -c unlimited, etc.; I can trigger a core dump fine with kill -SIGQUIT).
The program logs extensively what it does, but there's no hint about the crash in the log.
The only message displayed at the crash (or before?) is:
XIO: fatal IO error 11 (Resource temporarily unavailable) on X server ":0"
after 2322 requests (2322 known processed) with 0 events remaining.
So two questions:
- How is it possible for a program to crash (returning $? = 1) without a core dump?
- What is this error message about, and what can I do?
System is Red Hat Enterprise Linux 6.4.
Edit:
I managed to force a core dump by calling abort() from inside an atexit() callback:
(gdb) bt
#0 0x00bc8424 in __kernel_vsyscall ()
#1 0x0085a861 in raise () from /lib/libc.so.6
#2 0x0085c13a in abort () from /lib/libc.so.6
#3 0x0808f5cf in Unexpected () at MyCode.c:1378
#4 0x0085de9f in exit () from /lib/libc.so.6
#5 0x00c85701 in _XDefaultIOError () from /usr/lib/libX11.so.6
#6 0x00c85797 in _XIOError () from /usr/lib/libX11.so.6
#7 0x00c84055 in _XReply () from /usr/lib/libX11.so.6
#8 0x00c68b8f in XGetImage () from /usr/lib/libX11.so.6
#9 0x004fd6a7 in ?? () from /usr/local/lib/libcvi.so
#10 0x00478ad5 in ?? () from /usr/local/lib/libcvi.so
...
#29 0x001eed9d in ?? () from /usr/local/lib/libcvi.so
#30 0x001eee41 in RunUserInterface () from /usr/local/lib/libcvi.so
#31 0x0808fab4 in main (argc=2, argv=0xbfbdc984) at MyCode.c:1540
Can anyone enlighten me about this X11 problem? libcvi.so is not mine; only MyCode.c is (it's a LabWindows/CVI application).
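For reference, the atexit() trick is roughly this (a minimal sketch, not the actual MyCode.c; in the real code you'd want a flag so that a normal exit doesn't abort too):
#include <stdlib.h>

/* Registered with atexit(): if the program reaches exit() unexpectedly
   (as Xlib's default IO error handler does), call abort() so a core
   file is written instead of a silent exit(1). */
static void Unexpected(void)
{
    abort();
}

int main(int argc, char *argv[])
{
    atexit(Unexpected);
    /* ... set up the UI, then RunUserInterface(), etc. ... */
    return 0;
}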
Edit 2014-12-05:
Here's an even more precise backtrace. Things definitely happen inside X11, but I'm no X11 programmer, so looking at the X source code at the lines below tells me only that the X server (?) is temporarily unavailable. Is there any way to simply tell it to ignore this error if it's only temporary?
#4 0x00965eaf in __run_exit_handlers (status=1) at exit.c:78
#5 exit (status=1) at exit.c:100
#6 0x00c356b1 in _XDefaultIOError (dpy=0x88aeb80) at XlibInt.c:1292
#7 0x00c35747 in _XIOError (dpy=0x88aeb80) at XlibInt.c:1498
#8 0x00c340a6 in _XReply (dpy=0x88aeb80, rep=0xbf82fa90, extra=0, discard=0) at xcb_io.c:708
#9 0x00c18c0f in XGetImage (dpy=0x88aeb80, d=27263845, x=0, y=0, width=60, height=20, plane_mask=4294967295, format=2) at GetImage.c:75
#10 0x005f46a7 in ?? () from /usr/local/lib/libcvi.so
Corresponding lines:
XlibInt.c: _XDefaultIOError()
1292: exit(1);
XlibInt.c: _XIOError
1498: _XDefaultIOError(dpy);
xcb_io.c: _XReply()
708: if(!reply) _XIOError(dpy);
GetImage.c: XGetImage()
74: if (_XReply (dpy, (xReply *) &rep, 0, xFalse) == 0 || ...

OK, I finally found the cause (thanks to someone at National Instruments), a better diagnostic and a workaround.
The bug is in many versions of libxcb and is a 32-bit counter rollover problem that has been known for a few years: https://bugs.freedesktop.org/show_bug.cgi?id=71338
Not all versions of libxcb are affected: libxcb-1.9-5 has it, libxcb-1.5-1 doesn't. According to the bug report, 64-bit OSes shouldn't be affected, but I managed to trigger it on at least one version.
Which brings me to a better diagnostic. The following program will crash in less than 15 minutes on affected libraries (better than the entire week it previously took):
// Compile with: gcc test.c -lX11 && time ./a.out
#include <X11/Xlib.h>

int main(void)
{
    Display *d = XOpenDisplay(NULL);
    if (d)
        for (;;)            /* hammer the connection until the sequence counter wraps */
            XNoOp(d);
    return 0;
}
One final thing: the above program compiled and run on a 64-bit system works fine, and compiled and run on an old 32-bit system it also works fine, but if I transfer the 32-bit binary to the 64-bit system, it crashes after a few minutes.

I just had a program that acted exactly like this, with exactly the same error message. I would expect the counter bug to need 2^32 events to be processed before crashing, though.
The program was structured so that a worker thread has a separate X connection, which it uses to send messages to the X (GUI) thread to update the window.
In the end I traced the problem down to a place where the function sending the events to the window to redraw it was called by multiple threads, without a mutex around it. Since Xlib is not re-entrant on the same X connection, it crashed with this exact error. I put a mutex around the function and have had no problems since.
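To illustrate the fix, a minimal sketch (the names are made up, not the actual code): every Xlib call on the shared connection goes through one mutex.
#include <pthread.h>
#include <X11/Xlib.h>

/* One mutex guarding all Xlib calls on the shared Display*.  Xlib is not
   re-entrant on a single connection (unless XInitThreads() locking is used),
   so unsynchronized calls from several threads can garble the request
   stream and end in "XIO: fatal IO error". */
static pthread_mutex_t x_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical helper: ask the GUI thread to redraw, callable from any worker. */
void request_redraw(Display *dpy, Window win)
{
    XEvent ev = { 0 };
    ev.xclient.type   = ClientMessage;
    ev.xclient.window = win;
    ev.xclient.format = 32;              /* payload interpreted as 32-bit values */

    pthread_mutex_lock(&x_lock);
    XSendEvent(dpy, win, False, NoEventMask, &ev);
    XFlush(dpy);                         /* push the request to the server now */
    pthread_mutex_unlock(&x_lock);
}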

Comprehensive list of GLIBC functions that can execute a file (execv, execve, fexecve, posix_spawn,..)

I am writing an LD_PRELOAD utility that wraps all calls to exec() type functions.
But holy cow, there are a lot of them. So far I have found:
execv, execvp, execve, execvpe
fexecve, execveat,
execl, execlp, execle, execlpe,
posix_spawn, posix_spawnp
Is there a comprehensive list somewhere of all the lib functions that execute another program (and aren't just wrappers to one of these functions)?
As an example, I just discovered that whatever the Perl library IPC::Open3 uses is not on the list above, so I don't see the exec that happens. (strace sees it, but it claims that everything on the list above is just 'execve', which is not actually true.)
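For context, a minimal sketch of the kind of wrapper I mean (the execvp case, forwarding via dlsym(RTLD_NEXT, ...); the logging and names here are just illustrative; build with gcc -shared -fPIC wrapper.c -o wrapper.so -ldl):
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <unistd.h>

/* LD_PRELOAD interposer: log the call, then forward to the real execvp().
   The same pattern is repeated for every function in the list above. */
int execvp(const char *file, char *const argv[])
{
    static int (*real_execvp)(const char *, char *const []);
    if (!real_execvp)
        real_execvp = (int (*)(const char *, char *const []))
                          dlsym(RTLD_NEXT, "execvp");

    fprintf(stderr, "[exec-wrapper] execvp(%s)\n", file);
    return real_execvp(file, argv);
}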
At least on my (Debian GNU/Linux) system, the execve system call comes from Perl calling execvp:
#0 0x00007ffff7d35787 in execve () at ../sysdeps/unix/syscall-template.S:120
#1 0x00007ffff7d3603b in __execvpe_common (file=0x5555559ee710 "some", argv=0x5555558e31f0, envp=0x5555558c6980, exec_script=<optimized out>) at execvpe.c:136
#2 0x00005555556bcf50 in Perl_do_aexec5 ()
#3 0x00005555556b14ef in Perl_pp_exec ()
#4 0x0000555555652cf6 in Perl_runops_standard ()
#5 0x00005555555c6a6c in perl_run ()
#6 0x000055555559c472 in main ()
(gdb) fr 2
#2 0x00005555556bcf50 in Perl_do_aexec5 ()
(gdb) x/i $pc-5
0x5555556bcf4b <Perl_do_aexec5+427>: callq 0x55555559b2c0 <execvp@plt>
Using ltrace also confirms that:
ltrace -f --library=libc.so.6 perl foo.pl |& grep execv
[pid 1789012] perl->execvp(0x55cd3c4a4bc0, 0x55cd3c3a28e0, 0x7fff82e61770, 0x7fd4f3317212) = 0xffffffff
...

File doesn't open in Fortran [duplicate]

I get the following error when I execute Fortran code compiled with gfortran. The error is followed by a backtrace pointing to memory locations.
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x2b2f8e39da2d in ???
#1 0x2b2f8e39cc63 in ???
#2 0x311823256f in ???
#3 0x311827a7be in ???
#4 0x2b2f8e39cff2 in ???
#5 0x2b2f8e4adde6 in ???
#6 0x2b2f8e4ae047 in ???
#7 0x2b2f8e4a62d7 in ???
#8 0x40482a in instrument_
at /home/user/model/instrument.f:90
#9 0x406c1e in funcdet
at /home/user/model/funcsynth.f:67
#10 0x406c98 in main
at /home/user/model/funcsynth.f:78
Segmentation fault (core dumped)
I would like to know where the first instance of the error arises: is it the first line of the backtrace or the last line? Also, what strategies might help me debug the issue?
Update:
Following the backtrace, line 90 of instrument.f opens a file like so:
out_file3 = 'new_file'
OPEN(unit=3,file=out_file3,status='unknown')
To determine the underlying issue, I've added error checking like this:
OPEN(unit=3,file=out_file3,status='unknown',iostat=io_status, err=100)
100 write(STDOUT,*) 'io status=', io_status
The code exits with the error:
Error: Invalid value for ERR specification at (1)
How do I determine the appropriate value for the ERR specification? This led me to suspect that unit=3 might be the cause of the error. I've changed the value for unit, but every time I get the "Segmentation fault (core dumped)" error.
Update 2:
OPEN(unit=3,file=out_file3,status='unknown',err=17)
17 write(STDOUT,*) 'Problem'
I still get the Segmentation fault (core dumped) error at the line corresponding to OPEN.... I can only guess that unit=3 is the root cause of the problem.
Update 3
Attempt at a self-sufficient example:
character*280 testfile,finalfile,outfile
DIR = '/storage/work/user/'
testfile = 'test.dat'
CALL getenv(DIR,outfile)
CALL sappend(outfile,testfile,finalfile)
OPEN(unit=3,file=finalfine,status='new')
write(3,*) 'Test'
END

Unable to get full user-mode stacktrace while kernel debugging in windbg

I have a virtual Windows 7 x64 machine on a Windows 10 host, and I kernel debug it with windbg 10.0.10586.567. I'm running my own application on it, which I have full source and private symbols for. Whenever I break in and ask for stack traces of the app's threads, the backtrace always stops when one of my application's binaries are "hit."
So for instance, if I break in, switch to the process, and request a stacktrace with !thread [thread address] 1f, I get something like this (note the "early" zero return address at the last line):
fffff880`0534e870 fffff800`026d6992 nt!KiSwapContext+0x7a
fffff880`0534e9b0 fffff800`026d81a2 nt!KiCommitThreadWait+0x1d2
fffff880`0534ea40 fffff800`029c7a2e nt!KeDelayExecutionThread+0x186
fffff880`0534eab0 fffff800`026d08d3 nt!NtDelayExecution+0x59
fffff880`0534eae0 00000000`76e7165a nt!KiSystemServiceCopyEnd+0x13 (TrapFrame @ fffff880`0534eae0)
00000000`00276708 000007fe`fcf91203 ntdll!NtDelayExecution+0xa
00000000`00276710 00000001`410e7dd9 KERNELBASE!SleepEx+0xab
00000000`002767b0 00000000`00000000 MyApp!MainMessageLoop+0x4b1 [d:\whatever\path\myapplication.cpp @ 3024]
This looks very similar to when you are missing a binary while debugging a user-mode dump of an x64 process (lack of unwind data), except in that case the stack trace usually does not stop this suddenly; rather, it goes astray at that point and shows bogus values.
Some extra info/things I tried:
I have the correct symbol paths set up (both the Microsoft symbol server, and a local folder on the host with matching PDBs, even though the latter is not needed for just the stack trace)
I have a binary path set up (.exepath) containing matching binaries on the host (I've made absolutely sure of this; copied the binaries directly from the guest to the host machine)
If I put a breakpoint in one of the app's exported DLL functions, then when the debugger breaks in, I get a one-liner stack trace like this: 0000000000274b40 0000000000000000 MyAppDLL!SomeExportedFunction+0x32 [d:\whatever\path\myapplicationDLL.cpp @ 232]
I've tried virtually every combination of commands to get a stacktrace (.process /i, .process /r /p, !process -1 7, .reloads, .reload /users, .reload /f MyApp.exe, !thread [address] 1f, etc.) with no success
Tried with an older version of windbg (6.11.0001.404) as well, same result
Also tried on Windows 8.1 as a guest with the very same binaries, same result
!sym noisy output (irrelevant lines omitted):
0: kd>.process /i [address]
0: kd>g
0: kd>.reload /user
0: kd> !process -1 2
0: kd> !thread [address] 1f
[...]
DBGHELP: d:\symbolcache\MyApp.pdb\76931C5A6C284779AD2F916CA324617E1\MyApp.pdb already cached
DBGHELP: MyApp - private symbols & lines
[...]
lmvm MyApp output:
[...]
Loaded symbol image file: MyApp.exe
Image path: C:\MyApp\MyApp.exe
[...]
Any ideas?
I accidentally stumbled into a linker switch that solves this problem: /DEBUGTYPE with the PDATA argument. If you link your binaries with this switch, unwind information will be copied into your PDBs.
I recompiled/relinked the application in question with /DEBUGTYPE:CV,PDATA (/DEBUGTYPE:CV is the default if /DEBUG is specified, see the documentation), now everything works like a charm, I always get full call stacks.
One strange aspect of this: windbg happily uses unwind data found in the PDBs, but ignores the very same data in the mapped binaries (both on the host machine).
This is not a perfect solution to the problem (or any solution at all, one might say), but I'm providing this provisional answer with a workaround.
You should be able to get the information you want, albeit not so well formatted, using something like dps @rsp L10.
On x86-64 you don't have a parallel of the x86 EBP chain, but the return addresses are still on the stack. Those will give you the functions in the stack, and the values between them will be the arguments passed to the functions (plus saved registers on the stack, etc.). A random example from Google (as I'm not on my Windows machine right now):
0:017> dps @rsp
00000000`1bb0fbb8 00000000`00000020
00000000`1bb0fbc0 00000000`00000000
00000000`1bb0fbc8 00000000`008bc6c6 Dolphin!ReadDataFromFifoOnCPU+0xb6 [d:\sources\comex\source\core\videocommon\fifo.cpp @ 245]
00000000`1bb0fbd0 00000000`1ba0ffeb
00000000`1bb0fbd8 00000000`00000020
00000000`1bb0fbe0 00000000`00000020
00000000`1bb0fbe8 00000000`00000800
00000000`1bb0fbf0 00000000`1ba0ffeb
00000000`1bb0fbf8 00000000`008c2ff5 Dolphin!InterpretDisplayListPreprocess+0x45 [d:\sources\comex\source\core\videocommon\opcodedecoding.cpp @ 87]
00000000`1bb0fc00 00000000`00000000
00000000`1bb0fc08 00000000`008bc041 Dolphin!RunGpu+0x81 [d:\sources\comex\source\core\videocommon\fifo.cpp @ 389]
00000000`1bb0fc10 00000000`8064cbc0
00000000`1bb0fc18 00000000`1bb0fcc0
00000000`1bb0fc20 00000000`00000000
00000000`1bb0fc28 00000000`008c2dda Dolphin!OpcodeDecoder_Preprocess+0x14a [d:\sources\comex\source\core\videocommon\opcodedecoding.cpp @ 326]
00000000`1bb0fc30 00000000`8064cbe0
Given that you have symbols, the return addresses are easily distinguishable.
The unwind data is lazily loaded for user-mode modules, so it's not going to be mapped unless someone needs it. Unfortunately the kernel debugger doesn't force the information to be present for user images, so sometimes you get this behavior. You can see whether the data is mapped by dumping the PE header (!dh) and checking the state of the Exception Directory (!pte imagename+offset).
Given that you own the app, try forcing the information to be resident by doing a stack walk NOP somewhere in your app:
PVOID stack[2];                                  /* requires <windows.h> */
(void)CaptureStackBackTrace(0, 2, stack, NULL);  /* result deliberately ignored */
That doesn't guarantee the entire directory will be present, but usually good enough.

Get instruction pointer of running application on Unix

Is there a way to get the instruction pointer of a running application on Unix?
I have a running process (C++) and want to get its current location, and then, in GDB on a different machine, map that location to a source location (the 'list' command).
On Linux, there is /proc/[pid]/stat.
From "man proc":
stat Status information about the process. This is used by
ps(1). It is defined in /usr/src/linux/fs/proc/array.c.
...
kstkeip %lu
The current EIP (instruction pointer).
AFAICT, the 30th field of the output corresponds to the current instruction pointer of the process. For example:
gdb date
GNU gdb Red Hat Linux (6.0post-0.20040223.20rh)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu"...(no debugging symbols found)...Using host libthread_db library "/lib64/tls/libthread_db.so.1".
(gdb) set stop-on-solib-events 1
(gdb) run
(no debugging symbols found)...(no debugging symbols found)...(no debugging symbols found)...[Thread debugging using libthread_db enabled]
[New Thread 182896391360 (LWP 27968)]
(no debugging symbols found)...Stopped due to shared library event
(gdb) c
[Switching to Thread 182896391360 (LWP 27968)]
Stopped due to shared library event
(gdb) where
#0 0x00000036b060bb20 in _dl_debug_state_internal () from /lib64/ld-linux-x86-64.so.2
#1 0x00000036b060b51c in _dl_init_internal () from /lib64/ld-linux-x86-64.so.2
#2 0x00000036b0600f72 in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#3 0x0000000000000001 in ?? ()
#4 0x0000007fbff62728 in ?? ()
#5 0x0000000000000000 in ?? ()
(gdb) shell cat /proc/27968/stat
27968 (date) T 27839 27968 8955 34817 27839 4194304 42 0 330 0 0 0 0 0 18 0 0 0 1881668573 6144000 78 18446744073709551615 4194304 4234416 548680739552 18446744073709551615 234887363360 0 0 0 0 18446744071563322838 0 0 17 0 0 0 0 0 0 0
(gdb) p/a 234887363360 <--- the value of the 30th field
$1 = 0x36b060bb20 <_dl_debug_state_internal>
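If you want that field programmatically rather than via cat, a minimal sketch along these lines should work (my own illustration; it assumes the field layout documented in proc(5)):
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }

    char path[64], buf[4096];
    snprintf(path, sizeof path, "/proc/%s/stat", argv[1]);

    FILE *f = fopen(path, "r");
    if (!f || !fgets(buf, sizeof buf, f)) { perror(path); return 1; }
    fclose(f);

    /* comm (field 2) may contain spaces, so skip past the last ')' first */
    char *p = strrchr(buf, ')');
    if (!p) return 1;

    /* kstkeip is field 30; counting from the state field (3), that is
       the 28th whitespace-separated token after the ')' */
    p = strtok(p + 1, " ");
    for (int i = 1; i < 28 && p; i++)
        p = strtok(NULL, " ");

    if (p) printf("kstkeip = %#llx\n", strtoull(p, NULL, 10));
    return 0;
}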
The instruction pointer can be retrieved on Linux with the following code:
pid_t traced_process = 1234;        /* PID of the target process */
struct user_regs_struct regs;

/* needs <sys/ptrace.h>, <sys/user.h>, <sys/wait.h> and <stdio.h> */
ptrace(PTRACE_ATTACH, traced_process, NULL, NULL);
waitpid(traced_process, NULL, 0);   /* wait until the target actually stops */
ptrace(PTRACE_GETREGS, traced_process, NULL, &regs);
printf("EIP: %lx\n", regs.eip);     /* on x86-64 the field is regs.rip */
ptrace(PTRACE_DETACH, traced_process, NULL, NULL);
You will need to temporarily stop the process or thread in order to get its current instruction pointer. You can do it by attaching to the process with ptrace() or (on HP-UX) ttrace() and accessing the registers.
If you're using gdb anyway, you can simply attach yourself to a running process like this:
gdb program 1234
where program is the name of the executable you're debugging, and 1234 is the PID. You can then use all of the facilities of gdb to debug the process.

Ruby/Glibc coredump (double free or corruption)

I am using a distributed continuous integration tool which I have written myself in Ruby. It uses a fork of Mike Perham's "politics" for distributing the tasks. The "politics" module uses threads for the mDNS part.
Every now and then I encounter a core dump which I don't understand:
*** glibc detected *** ruby: double free or corruption (fasttop): 0x086d8600 ***
======= Backtrace: =========
/lib/libc.so.6[0xb7cef494]
/lib/libc.so.6[0xb7cf0b93]
/lib/libc.so.6(cfree+0x6d)[0xb7cf3c7d]
/usr/lib/libruby18.so.1.8[0xb7e8adf8]
/usr/lib/libruby18.so.1.8(ruby_xmalloc+0x85)[0xb7e8b395]
/usr/lib/libruby18.so.1.8[0xb7e5065e]
...
/usr/lib/libruby18.so.1.8[0xb7e717f4]
/usr/lib/libruby18.so.1.8[0xb7e74296]
/usr/lib/libruby18.so.1.8(rb_yield+0x27)[0xb7e7fb57]
======= Memory map: ========
...
I am running on Gentoo and have rebuilt Ruby and glibc with "-ggdb" and turned off stripping to get a meaningful core:
...
Core was generated by `ruby /home/develop/dcc/bin/dcc-worker'.
Program terminated with signal 6, Aborted.
#0 0xb7f20410 in __kernel_vsyscall ()
(gdb) bt
#0 0xb7f20410 in __kernel_vsyscall ()
#1 0xb7cacb60 in *__GI___open_catalog (cat_name=0x6 <Address 0x6 out of bounds>, nlspath=0xbf9d6f00 " ", env_var=0x0, catalog=0x1) at open_catalog.c:237
#2 0xb7cae498 in __sigdelset (set=0x6) from /lib/libc.so.6
#3 *__GI_sigfillset (set=0x6) at ../signal/sigfillset.c:42
#4 0xb7ce952d in freopen64 (filename=0x2 <Address 0x2 out of bounds>, mode=0xb7db02c8 "\" total=\"%zu\" count=\"%zu\"/>\n", fp=0x9) at freopen64.c:47
#5 0xb7cef494 in _IO_str_init_readonly (sf=0x86d8600, ptr=0xb7eef5a9 "te\213V\b\205\322\017\204\220", size=-1210273804) at strops.c:88
#6 0xb7cf0b93 in mALLINFo (av=0xb) at malloc.c:5865
#7 0xb7cf3c7d in __libc_calloc (n=141395456, elem_size=3214793136) at malloc.c:4019
#8 0xb7e8adf8 in ?? () at gc.c:1390 from /usr/lib/libruby18.so.1.8
#9 0x086d8600 in ?? ()
#10 0xb7e89400 in rb_gc_disable () at gc.c:256
#11 0xb7e8b395 in add_freelist () at gc.c:1087
#12 gc_sweep () at gc.c:1186
#13 garbage_collect () at gc.c:1524
#14 0xb7e5065e in ?? () from /usr/lib/libruby18.so.1.8
#15 0x00000340 in ?? ()
#16 0x00000000 in ?? ()
(gdb)
Hmm??? To me this looks like it's entirely internal to Ruby. In other "double free or corruption" questions here on Stack Overflow I have seen that threads may be part of the problem.
Also, the problem does not occur at exactly the same position. I have another backtrace which is much longer; the crash is also in garbage_collect, but via a slightly different path:
(gdb) bt
#0 0xffffe430 in __kernel_vsyscall ()
#1 0xf7c8b8c0 in *__GI_raise (sig=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#2 0xf7c8d1f5 in *__GI_abort () at abort.c:88
#3 0xf7cc7e35 in __libc_message (do_abort=2, fmt=0xf7d8daa8 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:170
#4 0xf7ccdd24 in malloc_printerr (action=2, str=0xf7d8dbec "double free or corruption (fasttop)", ptr=0x911f5d0) at malloc.c:6197
#5 0xf7ccf403 in _int_free (av=0xf7daa380, p=0x911f5c8) at malloc.c:4750
#6 0xf7cd24ad in *__GI___libc_free (mem=0x911f5d0) at malloc.c:3716
#7 0xf7e68768 in obj_free () at gc.c:1366
#8 gc_sweep () at gc.c:1174
#9 garbage_collect () at gc.c:1524
#10 0xf7e68be5 in rb_newobj () at gc.c:436
#11 0xf7eb9840 in str_alloc (klass=0) at string.c:67
... (150 lines of rb_eval/call/yield etc.)
Does anyone have a suggestion for how to isolate and maybe solve this problem?
Quick, easy, and not as helpful: export MALLOC_CHECK_=2. This causes glibc to do an extra level of checking during free() to detect heap corruption. It will abort() and give a core dump as soon as it detects corruption, instead of waiting until the corruption causes an actual problem.
Not quite as quick and easy, but much more helpful (if you get it working): valgrind.
Valgrind makes it easy to find heap corruption issues. There are some spurious errors reported when using Ruby 1.8 under valgrind, but they can be eliminated using this ruby patch (and configuring with --enable-valgrind) or using a valgrind suppression file. To run your ruby program under valgrind, just prefix the command with valgrind:
valgrind ruby /home/develop/dcc/bin/dcc-worker
If the crashing process is a child of the process you are running, use valgrind --trace-children=yes. Look in particular for invalid writes, which are a sign of heap corruption.
I got this very same error in a simple C program called rd_test; it would just read a given number of bytes using read(2) from a given input file (which could be a device file).
The actual bug turned out to be a one-byte buffer overflow, as I did
...
buf[n] = '\0';
...
where 'n' is the number of bytes read into the buffer 'buf'.
Silly me.
But the thing is, I never caught that until I ran it with valgrind!
So IMHO valgrind is definitely worth running in cases like this.
The 'double free or corruption' error went away as soon as I got rid of the offending bug.
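To illustrate (my reconstruction, not the original rd_test source), the pattern was essentially this; the fix is simply to leave room for the terminating byte:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>

int main(int argc, char **argv)
{
    if (argc < 2) return 1;

    size_t n = 512;                    /* number of bytes to read */
    char *buf = malloc(n);             /* BUG: no room for the '\0' below  */
 /* char *buf = malloc(n + 1);            FIX: one extra byte for the '\0' */

    int fd = open(argv[1], O_RDONLY);
    ssize_t got = read(fd, buf, n);
    if (got < 0) got = 0;

    buf[got] = '\0';                   /* when the read fills the buffer this
                                          writes buf[n]: a 1-byte heap overflow
                                          that glibc may later report as
                                          "double free or corruption" */
    printf("read %zd bytes\n", got);

    free(buf);                         /* glibc may abort here, or on a later
                                          malloc/free, once the heap metadata
                                          next to buf has been trampled */
    close(fd);
    return 0;
}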
I got the same error message, not in Ruby but in a zenity program.
I discovered it had something to do with closing an open pipe twice!
Check that you don't free the same heap memory two or more times, or close already-closed files or pipes.
Good luck.
