How to diagnose/trace "sendsig: useracc failed." problem in HP-UX

How to diagnose/trace "sendsig: useracc failed." problem in HP-UX - ruby

I am trying to compile Ruby 1.9.1-p0 on HP-UX. After a small change to ext/pty.c it compiles successfully, albeit with a lot of warning messages (about 5K). When I run the self-tests using "make test" it crashes and core-dumps with the following error:
sendsig: useracc failed. 0x9fffffffbf7dae00 0x00000000005000
Pid 3044 was killed due to failure in writing the signal context - possible stack overflow.
Illegal instruction
From googling this problem the Illegal instruction is just a signal that the system uses to kill the process, and not related to the problem. It would seem that there is a problem with the re-establishing the context when calling the signal handler. Bringing the core up in gdb doesn't show a particularly deep stack, so I don't think the "possible stack overflow" is right either.
The gdb stack backtrace output looks like this:
#0 0xc00000000033a990:0 in __ksleep+0x30 () from /usr/lib/hpux64/libc.so.1
#1 0xc0000000001280a0:0 in __mxn_sleep+0xae0 ()
from /usr/lib/hpux64/libpthread.so.1
#2 0xc0000000000c0f90:0 in <unknown_procedure> + 0xc50 ()
from /usr/lib/hpux64/libpthread.so.1
#3 0xc0000000000c1e30:0 in pthread_cond_timedwait+0x1d0 ()
from /usr/lib/hpux64/libpthread.so.1

Answering my own question:
The problem was that the stack being allocated was too small. So it really was a stack overflow. The sendsig() function was preparing a context structure to be copied from kernel space to user space. The useracc() function checks that there's enough space at the address specified to do so.
The Ruby 1.9.1-p0 code was using PTHREAD_STACK_MIN to allocate the stack for any threads created. According to HP-UX documentation, on Itanium this is 256KB, but when I checked the header files, it was only 4KB. The error message from useracc() indicated that it was trying to copy 20KB.
So if a thread received a signal, it wouldn't have enough space to receive the signal context on its stack.

Related

Jump into `SignalHandler` right after a call instruction

I'm doing a debug on a program which report:
Thread 1 "test.out" received signal SIGSEGV, Segmentation fault.
I then gdbed the program and found out that the program jump into a SignalHandler function right after a call instruction call 0x401950.
I tested the call destination rax, rdi and rsi (Input of the call). However nothing strange found.
I haven't meet this situation before, I guess there is an soft interrupt created due to some of the exemption, which then scheduled after the instruction. So the actual problem may occur earlier.
Now I need to find out where rises the exemption, so that I could fix it. But I do not have any clue on how.
Therefore I came to ask if anyone could help me on that.
One big sorry for not showing the code, since it is company asset.....
Thanks for anyone who helps!!!

I then gdbed the program and found out that the program jump into a SignalHandler function right after a call instruction call 0x401950.
You didn't say which processor and OS you are using. Guessing Linux and x86_64, you have a stack overflow.
The CALL instruction pushes return address onto the stack, and this operation will generate SIGSEGV if your stack is exhausted.
You can confirm this guess by using (gdb) where (which is likely to show very deep recursion, though other reasons for stack exhaustion are also possible), and by looking at the value of RSP (which should be just below page boundary).
Depending on how the stack size is set up, using ulimit -s unlimited before invoking the program may work around this crash (though you really ought to fix the root cause by some other mechanism).

electric-fence segfaults in malloc

I've got a rather complicated program that does a lot of memory allocation, and today by surprise it started segfaulting in a weird way that gdb couldn't pin-point the location of. Suspecting memory corruption somewhere, I linked it against Electric Fence, but I am baffled as to what it is telling me:
ElectricFence Exiting: mprotect() failed:
Program received signal SIGSEGV, Segmentation fault.
__strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
99 ../sysdeps/i386/i686/multiarch/strlen.S: No such file or directory.
in ../sysdeps/i386/i686/multiarch/strlen.S
#0 __strlen_sse2 () at ../sysdeps/i386/i686/multiarch/strlen.S:99
#1 0xb7fd6f2d in ?? () from /usr/lib/libefence.so.0
#2 0xb7fd6fc2 in EF_Exit () from /usr/lib/libefence.so.0
#3 0xb7fd6b48 in ?? () from /usr/lib/libefence.so.0
#4 0xb7fd66c9 in memalign () from /usr/lib/libefence.so.0
#5 0xb7fd68ed in malloc () from /usr/lib/libefence.so.0
#6 <and above are frames in my program>
I'm calling malloc with a value of 36, so I'm pretty sure that shouldn't be a problem.
What I don't understand is how it is even possible that I could be trashing the heap in malloc. In reading the manual page a bit more, it appears that maybe I am writing to a free page, or maybe I'm underwriting a buffer. So, I have tried the following environment variables, together and by themselves:
EF_PROTECT_FREE=1
EF_PROTECT_BELOW=1
EF_ALIGNMENT=64
EF_ALIGNMENT=4096
The last two had absolutely no effect.
The first one changed the portions of the stack frame which are in my program (where in my program was executing when malloc was called fatally), but with identical frames once malloc was entered.
The second one changed a bit more; in addition to the crash occurring at a different place in my program, it also occurred in a call to realloc instead of malloc, although realloc is directly calling malloc and otherwise the back trace is identical to above.
I'm not explicitly linking against any other libraries besides fence.
Update: I found several places where it suggests that the message: " mprotect() failed: Cannot allocate memory" means that there is not enough memory on the machine. But I am not seeing the "Cannot allocate memory" part, and ps says I am only using 15% of memory. With such a small allocation (4k+32) could this really be the problem?

I just wasted several hours on the same problem.
It turns out that it is to do with the setting in
/proc/sys/vm/max_map_count
From the kernel documentation:
"This file contains the maximum number of memory map areas a process may have. Memory map areas are used as a side-effect of calling malloc, directly by mmap and mprotect, and also when loading shared libraries.
While most applications need less than a thousand maps, certain programs, particularly malloc debuggers, may consume lots of them, e.g., up to one or two maps per allocation."
So you can 'cat' that file to see what it is set to, and then you can 'echo' a bigger number into it. Like this: echo 165535 > /proc/sys/vm/max_map_count
For me, this allowed electric fence to get past where it was before, and start to find real bugs.

How to debug a foobar Program Counter

I was immediately suspicious of the crash. A Floating Point Exception in a method whose only arithmetic was a "divide by sizeof(short)".
I looked at the stack crawl & saw that the offset into the method was "+91". Then I examined a disassembly of that method & confirmed that the Program Counter was in fact foobar at the time of the crash. The disassembly showed instructions at +90 and +93 but not +91.
This is a method, 32-bit x86 instructions, that gets called very frequently in the life of the application. This crash has been reported 3 times.
How does this happen? How do I set a debugging trap for the situation?

Generally when you fault in the middle of an instruction, its due to bad flow control(ie: a broken jump, call, retn), an overflow, bad dereferencing or your debug symbols being out-of-sync making the stack trace show incorrect info. Your first step is to reliably reproduce the error everytime, else you'll have trouble trapping it, from there I'd just run it in a debugger, force the conditions to make it explode, then examine the (call) stack and registers to see if they are valid values etc.

Allocating a buffer of more a page size on stack will corrupt memory?

In Windows, stack is implemented as followed: a specified page is followed committed stack pages. It's protection flag is as guarded. So when thead references an address on the guared page, an memory fault rises which makes memory manager commits the guarded page to the stack and clean the page's guarded flag, then it reserves a new page as guarded.
when I allocate an buffer which size is more than one page(4KB), however, an expected error haven't happen. Why?

Excellent question (+1).
There's a trick, and few people know about it (besides driver writers).
When you allocate large buffer on the stack - the compiler automatically adds so-called stack probes. It's an extra code (implemented in CRT usually), which probes the allocated region, page-by-page, in the needed order.
EDIT:
The function is _chkstk.

The fault doesn't reach your program - it is handled by the operating system. Similar thing happens when your program tries to read memory that happens to be written into the swap file - a trap occurs and the operating system unswaps the page and your program continues.

ARM Data Abort error exception debugging

So now I understand that I'm getting a ARM Data Abort exception - I see how to trap the exception itself (a bad address in the STL library), but I would like to walk back up the stack frame before the exception. I'm using the IAR toolchain, and it tells me the call stack is unavailable after the exception - is there a trick way to convince the tool to show me the call stack? Thanks for all the quick help!

if you look at the ARM ARM (ARM Architecture Reference Manual, just google "arm arm"), Programmers Model -> Processor modes and Registers sections. When in abort mode you are priveledged so you can switch from abort to say supervisor and then make a copy of r13, then switch back to abort mode and dump the stack from the copy of r13. Your r14 also tells you where the abort occurred.
I wouldnt be surprised if this abort was from an alignment. Trying to read/write a word with an address with something other than zeros in the lower two bits or a halfword with the lsbit of the address set. Actually if you take the link register and a dump of the registers (r0-r12) since abort and user/supervisor use the same register space, you can look at the instruction that caused the abort and the address to see if it was indeed an alignment problem or something else. Note that the pc is one, two or three instructions ahead depending on the mode thumb or arm that had the abort, if you are not using thumb at all then this nothing to worry about.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio