Why calling dlopen sometimes breaks my application by damaging class variables content? - glibc

I am trying to load library with dlopen().
But call to this dlopen() function sometimes (not always) damages my class variables and then app goes to segmentation fault.
Below is not precise code (pseudocode), but explanation what happens:
class MyClass {
public:
int MyVar;
void Print() { printf("Simply breakpoint\n"); };
void LoadLibrary() { dlopen("/usr/lib/x86_64-linux-gnu/libavcodec.so.58.54.100",RTLD_LAZY); };
MyClass() {
MyVar = 12345;
printf("MyVar address %p\n",&MyVar);
Print();
LoadLibrary();
};
}
void main()
{
MyClass obj;
}
I do debug it with gdb following way:
>gdb MyApp
>break Print
>run
when it stops at Print function breakpoint I see printed address of variable MyVar.
MyVar address 0x7fff900bc2bc
Also I can check its content.
Then I do
>watch *0x7fff900bc2bc
Hardware watchpoint 2: *0x7fff900bc2bc
>cont
When it continues it breaks on unexpected writing to my variable MyVar:
Thread 1 "MyApp" hit Hardware watchpoint 2: *0x7fff900bc2bc
Old value = 12345
New value = 32767
memmove () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:356
356 ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S: No such file or directory.
(gdb) backtrace
#0 memmove () at ../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:356
#1 0x00007ffff7fde759 in _dl_map_object_deps (map=map#entry=0x7fff90145110, preloads=preloads#entry=0x0,
npreloads=npreloads#entry=0, trace_mode=trace_mode#entry=0, open_mode=open_mode#entry=-2147483648)
at dl-deps.c:446
#2 0x00007ffff7fe4db0 in dl_open_worker (a=a#entry=0x7fffa6fd80f0) at dl-open.c:571
#3 0x00007ffff53dd928 in __GI__dl_catch_exception (exception=<optimized out>, operate=<optimized out>,
args=<optimized out>) at dl-error-skeleton.c:208
#4 0x00007ffff7fe460a in _dl_open (file=0x42d8ee0 "/usr/lib/x86_64-linux-gnu/libavcodec.so.58.54.100",
mode=-2147483646, caller_dlopen=<optimized out>, nsid=-2, argc=2, argv=0x7fffffffea88, env=0x54037d0)
at dl-open.c:837
#5 0x00007ffff57bc34c in dlopen_doit (a=a#entry=0x7fffa6fd8310) at dlopen.c:66
#6 0x00007ffff53dd928 in __GI__dl_catch_exception (exception=exception#entry=0x7fffa6fd82b0,
operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:208
#7 0x00007ffff53dd9f3 in __GI__dl_catch_error (objname=0x7fff900d8770, errstring=0x7fff900d8778,
mallocedp=0x7fff900d8768, operate=<optimized out>, args=<optimized out>) at dl-error-skeleton.c:227
#8 0x00007ffff57bcb59 in _dlerror_run (operate=operate#entry=0x7ffff57bc2f0 <dlopen_doit>,
args=args#entry=0x7fffa6fd8310) at dlerror.c:170
#9 0x00007ffff57bc3da in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#10 0x000000000209ec5b in MyClass::LoadLibrary() ()
.......
From stack backtrace I see that MyVar is damaged by call to dlopen()
But why?
What I am doing wrong?
How to resolve?
Unfortunately I cannot show all source code because it is huge and involves many different components, many threads, many 3rd party libraries.
I cannot simply dynamically link my app with libavcodec because it is already statically linked in 3rd party library but 3rd party library is built without required features unfortunately (without VAAPI support). Dynamic linking makes symbol conflicts.
That is why I was decided try to load libavcodec manually by dlopen() and get all required function pointers from dlsym().

But why? What I am doing wrong? How to resolve?
You didn't say which version of GLIBC you are using (or which distribution).
The code in GLIBC-2.27 dl-deps.c reads:
struct link_map **l_initfini = (struct link_map **)
malloc ((2 * nneeded + 1) * sizeof needed[0]);
if (l_initfini == NULL)
_dl_signal_error (ENOMEM, map->l_name, NULL,
N_("cannot allocate dependency list"));
l_initfini[0] = l;
memcpy (&l_initfini[1], needed, nneeded * sizeof needed[0]);
memcpy (&l_initfini[nneeded + 1], l_initfini,
nneeded * sizeof needed[0]); // line 446
You also didn't say whether MyClass is heap or stack allocated.
One way that the GLIBC code could write over your variable is when you have already corrupted heap earlier. This is especially likely if MyClass is in fact heap-allocated (which it appears to be given the 0x7fff900bc2bc address).
The fact that this "write over" happens only some of the time is also symptomatic of heap corruption.
As the very first step, I would run the program under Valgrind and make sure that no heap corruption (heap buffer overflow, free unallocated, double-free, etc.) is detected before LoadLibrary() runs.

Related

core dump stack indicates SIGSEGV due to vector<vector<int>> usage

I have a code snippet that is behaving weirdly. The code is simply aiming to implement radix and bucket sort. When I comment in the main one of either sort and run it works perfectly. But when I enable both of them i am getting a core dump. And the weird part is core dump as indicated by the stack is crossing over into the stl_vector.h.
The code reference is here:- https://rextester.com/RUUDP10453
When i enable only one of the sorts like below in main it works fine.
//doRadixSort(arr, size);
doBucketSort(arr, size);
or
doRadixSort(arr, size);
//doBucketSort(arr, size);
But when both are enabled there is segmentation fault after both sorts are completed as indicated by the
cout << "i am here at exit" << endl;
The core dump stack indicates some reference/hint at vector of vector buckets. But i have properly allocated and reserved it the required memory. so why this is happening i need some expertise to dig out. I have tried debugging this in eclipse CDT C++ for about 2 hrs with no lead.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 _int_free (av=0x7f66d702eb00 <main_arena>, p=0xf98020, have_lock=0) at malloc.c:3976
3976 >= ((char *) av->top + chunksize(av->top)), 0))
(gdb) where
#0 _int_free (av=0x7f66d702eb00 <main_arena>, p=0xf98020, have_lock=0) at malloc.c:3976
#1 0x00007f66d6cf33dc in __GI___libc_free (mem=<optimized out>) at malloc.c:2966
#2 0x00000000004030fa in __gnu_cxx::new_allocator<std::vector<int, std::allocator<int> > >::deallocate (this=0x7fffc6ffa060, __p=0xf98030) at /usr/include/c++/6.3.1/ext/new_allocator.h:110
#3 0x0000000000402d23 in std::allocator_traits<std::allocator<std::vector<int, std::allocator<int> > > >::deallocate (__a=..., __p=0xf98030, __n=10) at /usr/include/c++/6.3.1/bits/alloc_traits.h:442
#4 0x00000000004027ac in std::_Vector_base<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::_M_deallocate (this=0x7fffc6ffa060, __p=0xf98030, __n=10)
at /usr/include/c++/6.3.1/bits/stl_vector.h:178
#5 0x00000000004025e4 in std::_Vector_base<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::~_Vector_base (this=0x7fffc6ffa060, __in_chrg=<optimized out>)
at /usr/include/c++/6.3.1/bits/stl_vector.h:160
#6 0x000000000040211d in std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > >::~vector (this=0x7fffc6ffa060, __in_chrg=<optimized out>)
at /usr/include/c++/6.3.1/bits/stl_vector.h:427
#7 0x0000000000401d4b in doBucketSort (arr=0x7fffc6ffa100, size=#0x7fffc6ffa0f8: 12) at tako.cpp:97
#8 0x0000000000401e29 in main (argc=1, argv=0x7fffc6ffa218) at tako.cpp:141
(gdb)
Alternatively, I found the below also works which is equivalent to the resize function.
vector<vector<int>> buckets;
constexpr size_t size=10, bucketSize=10;
buckets.reserve(bucketSize);
for(unsigned int i=0; i<=bucketSize; ++i)
buckets.push_back({ });
for(unsigned int i=0; i<=bucketSize; ++i)
buckets[i].reserve(size);

conflict in symbols exposed by tcmalloc and glibc

I was recently debugging a crash in a product and identified the cause to be a conflict in the memory allocation symbols exposed by glibc and tcmalloc. I wrote the following sample code for exposing this issue:
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#include <assert.h>
#include <stdlib.h>
int main()
{
struct addrinfo hints = {0}, *res = NULL;
hints.ai_family = AF_UNSPEC;
hints.ai_socktype = SOCK_STREAM;
int rc = getaddrinfo("myserver", NULL, &hints, &res);
assert(rc == 0);
return 0;
}
I compiled it using the following command:
g++ temp.cpp -g -lresolv
I executed the program using the following command:
LD_PRELOAD=/path/to/libtcmalloc_minimal.so.4 ./a.out
The program crashes with the following stack:
#0 0x00007ffff6c7c875 in *__GI_raise (sig=) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007ffff6c7de51 in *__GI_abort () at abort.c:92
#2 0x00007ffff6cbd8bf in __libc_message (do_abort=2, fmt=0x7ffff6d8c460 "*** glibc detected *** %s: %s: 0x%s ***\n") at ../sysdeps/unix/sysv/linux/libc_fatal.c:186
#3 0x00007ffff6cc30c8 in malloc_printerr (action=2, str=0x7ffff6d88fec "free(): invalid pointer", ptr=) at malloc.c:6282
#4 0x00007ffff6cc810c in *__GI___libc_free (mem=) at malloc.c:3733
#5 0x00007ffff6839e89 in _nss_dns_gethostbyname4_r (name=0x400814 "myserver", pat=0x7fffffffdfa8, buffer=0x7fffffffd9b0 "myserver.mydomain.com", buflen=1024, errnop=0x7fffffffdfbc, herrnop=0x7fffffffdf98, ttlp=0x0) at nss_dns/dns-host.c:341
#6 0x00007ffff6d11917 in gaih_inet (name=0x400814 "myserver", service=0x7fffffffdf88, req=0x7fffffffe1d0, pai=0x7fffffffe160, naddrs=0x7fffffffe168) at ../sysdeps/posix/getaddrinfo.c:880
#7 0x00007ffff6d14301 in *__GI_getaddrinfo (name=0x400814 "myserver", service=0x0, hints=0x7fffffffe1d0, pai=0x7fffffffe200) at ../sysdeps/posix/getaddrinfo.c:2452
#8 0x00000000004006f0 in main () at temp.cpp:12
The reason for this is that the free() function called by _nss_dns_gethostbyname4_r() from libnss_dns.so is from libc.so while the corresponding malloc() was called from libresolv.so from libtcmalloc_minimal.so. The addresses of tcmalloc's malloc() and free() functions are getting into the GOT of libresolv.so leading to this crash. The crash goes away if I don't link my program to libresolv.so.
Now for my question. Is there any documentation which explains how to safely use tcmalloc to avoid crashes like this ?
glibc has some documentation for interposing malloc:
Replacing malloc
Something else must be going here, though. Typical builds of glibc and glibc will get this right (even in fairly old versions of either package).
My best guess is you are using some SUSE glibc variant, which uses RTLD_DEEPBIND for NSS modules. This results in a known issue with malloc interposition. SUSE suggests setting the RTLD_DEEPBIND=0 environment variable as a workaround.

Segmentation fault on using boost syslog and cpp-netlib

After adding boost syslog into source code, segmentation fault appears inside cpp-netlib library.
I was able to prepare minimum working code snippet to reproduce the problem.
#include <boost/network/protocol/http/client.hpp>
#include <boost/log/utility/setup/file.hpp>
#include <boost/log/sinks/syslog_backend.hpp>
#include <iostream>
using namespace boost::network;
using namespace boost::network::http;
namespace sinks = boost::log::sinks;
int main()
{
client::request request_("http://www.boost.org/");
client client_;
client::response response_ = client_.get(request_);
std::string body_ = body(response_);
std::cout << "body: " << body_;
using syslog_sinkT = sinks::synchronous_sink <sinks::syslog_backend>;
boost::shared_ptr <sinks::syslog_backend> backend = boost::make_shared <sinks::syslog_backend> ();
boost::shared_ptr<syslog_sinkT> sink = boost::make_shared <syslog_sinkT> (backend);
}
When last 2 lines are commented, segmentation fault disappears and everything works fine.
gdb stack trace (approximately, may vary):
Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0xf7c29b40 (LWP 19874)]
0x00000000 in ?? ()
(gdb) where
#0 0x00000000 in ?? ()
#1 0x083c376c in boost::asio::detail::task_io_service_operation::complete (bytes_transferred=260, ec=..., owner=...,
this=0xf6900710)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/asio/detail/task_io_service_o$
eration.hpp:38
#2 boost::asio::detail::task_io_service::do_run_one (ec=..., this_thread=..., lock=..., this=<optimized out>)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/asio/detail/impl/task_io_serv$
ce.ipp:372
#3 boost::asio::detail::task_io_service::run (ec=..., this=0x84ea280)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/asio/detail/impl/task_io_serv$
ce.ipp:149
#4 boost::asio::io_service::run (this=0x84e9a94)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/asio/impl/io_service.ipp:59
#5 0x083b5766 in boost::_mfi::mf0<unsigned int, boost::asio::io_service>::operator() (p=<optimized out>, this=<optimized out>)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/bind/mem_fn_template.hpp:49
#6 boost::_bi::list1<boost::_bi::value<boost::asio::io_service*> >::operator()<unsigned int, boost::_mfi::mf0<unsigned int, boost::a$
io::io_service>, boost::_bi::list0> (a=<synthetic pointer>, f=..., this=0x84ea94c)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/bind/bind.hpp:249
#7 boost::_bi::bind_t<unsigned int, boost::_mfi::mf0<unsigned int, boost::asio::io_service>, boost::_bi::list1<boost::_bi::value<boo$
t::asio::io_service*> > >::operator() (this=0x84ea944)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/bind/bind.hpp:1222
#8 boost::detail::thread_data<boost::_bi::bind_t<unsigned int, boost::_mfi::mf0<unsigned int, boost::asio::io_service>, boost::_bi::$
ist1<boost::_bi::value<boost::asio::io_service*> > > >::run (this=0x84ea828)
at /home/kostidov/prj/third_party-master/boost/boost_1_60_0/__public__/v0/Linux-libc6/include/boost/thread/detail/thread.hpp:116
#9 0x0840a4f8 in boost::(anonymous namespace)::thread_proxy (param=0x84ea828) at libs/thread/src/pthread/thread.cpp:167
#10 0xf7de1f70 in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
#11 0xf7d18bee in clone () from /lib/i386-linux-gnu/libc.so.6
Problem exists on Ubuntu 14.04 with cpp-netlib 0.11.2 and both boost versions 1_58_0 and 1_60_0. Boost, cpp-netlib and my application are compiled with -std=c++11.
Note 1. Segmentation fault appears inside cpp-netlib before reaching syslog_backend creation. Only presence of last 2 lines guarantees SIGSEGV reproduction.
Note 2. Reproduces only with syslog_backend. Any other logging targets (file, consol) work fine.
The best idea I have is the problem may lay inside boost during static variables initialisation, but I have no proves regarding this version.
Any suggestions?
Seems like I used too many compile options for building both boost and cpp-netlib.
I prepared new build for both boost and cpp-netlib once again, but this time I used as less additional options as possible.
And it works fine.
EDIT: I found the key which causes the error. It's BOOST_ASIO_ENABLE_HANDLER_TRACKING It was defined during the boost compilation, but wasn't defined during compilation of cpp-netlib and my application.
https://svn.boost.org/trac/boost/ticket/11945

helgrind does not detect recursive locking of std::mutex

I observed that helgrind won't detect a recursive lock on a non-recursive c++11 std::mutex. The problem is however detected when using pthread_mutex_lock.
Two simple testcases to demonstrate the problem:
// Test code: C++11 std::mutex
// helgrind does not detect recursive locking
void test_cpp11()
{
std::mutex m;
m.lock();
m.lock();
}
// pthread-based test code
// helgrind does detect recursive locking
void test_pth()
{
pthread_mutex_t m;
pthread_mutex_init(&m, 0);
pthread_mutex_lock(&m);
pthread_mutex_lock(&m);
}
gdb shows that the same pthread library functions are being called:
#0 __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1 0x00007ffff78c2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007ffff78c2480 in __GI___pthread_mutex_lock (mutex=0x7fffffffe450) at ../nptl/pthread_mutex_lock.c:79
#3 0x00000000004008ad in test_pth() ()
#1 0x00007ffff78c2657 in _L_lock_909 () from /lib/x86_64-linux-gnu/libpthread.so.0
#2 0x00007ffff78c2480 in __GI___pthread_mutex_lock (mutex=0x7fffffffe450) at ../nptl/pthread_mutex_lock.c:79
#3 0x00000000004007f7 in __gthread_mutex_lock(pthread_mutex_t*) ()
#4 0x00000000004008ec in std::mutex::lock() ()
#5 0x0000000000400857 in test_cpp11() ()
This was observed with g++ 4.7.3, 4.8.2 and 4.9.0 on Ubuntu 14.04 64-bit.
Does anyone have an idea what might be the reason and what might be done to get helgrind to detect the recursive locking?
Not an answer to the original question but I think it's worth mentioning that one should always check the program with both helgrind and drd. The drd tool successfully detects the problem in both scenarios.

How to inspect the return value of a function in GDB?

Is it possible to inspect the return value of a function in gdb assuming the return value is not assigned to a variable?
I imagine there are better ways to do it, but the finish command executes until the current stack frame is popped off and prints the return value -- given the program
int fun() {
return 42;
}
int main( int argc, char *v[] ) {
fun();
return 0;
}
You can debug it as such --
(gdb) r
Starting program: /usr/home/hark/a.out
Breakpoint 1, fun () at test.c:2
2 return 42;
(gdb) finish
Run till exit from #0 fun () at test.c:2
main () at test.c:7
7 return 0;
Value returned is $1 = 42
(gdb)
The finish command can be abbreviated as fin. Do NOT use the f, which is abbreviation of frame command!
Yes, just examine the EAX register by typing print $eax. For most functions, the return value is stored in that register, even if it's not used.
The exceptions to this are functions returning types larger than 32 bits, specifically 64-bit integers (long long), doubles, and structs or classes.
The other exception is if you're not running on an Intel architecture. In that case, you'll have to figure out which register is used, if any.
Here's how todo this with no symbols.
gdb ls
This GDB was configured as "ppc64-yellowdog-linux-gnu"...
(no debugging symbols found)
Using host libthread_db library "/lib64/libthread_db.so.1".
(gdb) break __libc_start_main
Breakpoint 1 at 0x10013cb0
(gdb) r
Starting program: /bin/ls
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
(no debugging symbols found)
Breakpoint 1 at 0xfdfed3c
(no debugging symbols found)
[Thread debugging using libthread_db enabled]
[New Thread 4160418656 (LWP 10650)]
(no debugging symbols found)
(no debugging symbols found)
[Switching to Thread 4160418656 (LWP 10650)]
Breakpoint 1, 0x0fdfed3c in __libc_start_main () from /lib/libc.so.6
(gdb) info frame
Stack level 0, frame at 0xffd719a0:
pc = 0xfdfed3c in __libc_start_main; saved pc 0x0
called by frame at 0x0
Arglist at 0xffd71970, args:
Locals at 0xffd71970, Previous frame's sp is 0xffd719a0
Saved registers:
r24 at 0xffd71980, r25 at 0xffd71984, r26 at 0xffd71988, r27 at 0xffd7198c,
r28 at 0xffd71990, r29 at 0xffd71994, r30 at 0xffd71998, r31 at 0xffd7199c,
pc at 0xffd719a4, lr at 0xffd719a4
(gdb) frame 0
#0 0x0fdfed3c in __libc_start_main () from /lib/libc.so.6
(gdb) info fr
Stack level 0, frame at 0xffd719a0:
pc = 0xfdfed3c in __libc_start_main; saved pc 0x0
called by frame at 0x0
Arglist at 0xffd71970, args:
Locals at 0xffd71970, Previous frame's sp is 0xffd719a0
Saved registers:
r24 at 0xffd71980, r25 at 0xffd71984, r26 at 0xffd71988, r27 at 0xffd7198c,
r28 at 0xffd71990, r29 at 0xffd71994, r30 at 0xffd71998, r31 at 0xffd7199c,
pc at 0xffd719a4, lr at 0xffd719a4
Formatting kinda messed up there, note the use of "info frame" to inspect frames, and "frame #" to navigate your context to another context (up and down the stack)
bt also show's an abbreviated stack to help out.

Resources