What are the debugging restrictions inside Docker?

I am debugging a compiled (C++) Python module which mysteriously crashes at import time inside a Docker container. When I run python3 inside gdb, then import the module, and then try to print a backtrace, I only get:
# gdb /usr/bin/python3
GNU gdb (Ubuntu 7.11.1-0ubuntu1~16.04) 7.11.
...
(gdb) run -c 'import myModule'
warning: Error disabling address space randomization: Operation not permitted
During startup program terminated with signal SIGSEGV, Segmentation fault.
(gdb) bt
No stack.
(gdb)
Are there restrictions on what gdb can (and cannot) do inside a Docker container? The crash itself is surprising, but the inability to debug it even more so. The module is compiled with -ggdb2 to include extra debugging information.

For the record: debugging with gdb (and also strace) needs the CAP_SYS_PTRACE capability, which Docker drops by default. The problem was solved by running the container with --privileged.
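If full --privileged access is undesirable, a narrower alternative (a sketch; myimage stands in for your own image) is to grant just the ptrace capability, optionally relaxing the seccomp profile so gdb can also disable address space randomization:
docker run --cap-add=SYS_PTRACE --security-opt seccomp=unconfined -it myimage gdb /usr/bin/python3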

Related

TensorFlow in Docker on macOS cannot load libraries with the tensorflow/tensorflow:latest image

Following the Getting Started guide for running TensorFlow in a Docker container yields an error very quickly. Working through the instructions here, I pulled the tensorflow/tensorflow:latest image and then tried running:
docker run -it --rm tensorflow/tensorflow:latest python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
It's billed as a CPU-only image (under the section https://www.tensorflow.org/install/docker#examples_using_cpu-only_images), which I believe I need because I'm running on a Mac and don't have an Nvidia graphics card. However, the following errors occur:
~/> docker run -it --rm tensorflow/tensorflow:latest python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
2020-01-12 18:53:17.848471: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer.so.6'; dlerror: libnvinfer.so.6: cannot open shared object file: No such file or directory
2020-01-12 18:53:17.848763: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libnvinfer_plugin.so.6'; dlerror: libnvinfer_plugin.so.6: cannot open shared object file: No such file or directory
2020-01-12 18:53:17.848843: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:30] Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
2020-01-12 18:53:18.288367: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory
2020-01-12 18:53:18.288461: E tensorflow/stream_executor/cuda/cuda_driver.cc:351] failed call to cuInit: UNKNOWN ERROR (303)
2020-01-12 18:53:18.288516: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (996a2dd2ab59): /proc/driver/nvidia/version does not exist
2020-01-12 18:53:18.289117: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-01-12 18:53:18.295734: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2592000000 Hz
2020-01-12 18:53:18.296844: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x55c37f5f81f0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-01-12 18:53:18.296881: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
tf.Tensor(-203.31216, shape=(), dtype=float32)
Are the instructions misleading (saying latest is a CPU-only image), or am I missing something? I see references to CUDA in the output, so it would seem this isn't a CPU-only image? Should I be using a different image?
Here are my specs:
macOS Mojave - 10.14.6
Docker Desktop - 2.1.0.6
docker - 19.03.5
I had this same issue not long ago, but I discovered that it actually prints the correct result; those log lines are only warnings.
The result of print(tf.reduce_sum(tf.random.normal([1000, 1000]))) is indeed tf.Tensor(-203.31216, shape=(), dtype=float32), which is the last line of the output.
If you can run your code in a Jupyter notebook, then you should use the jupyter/tensorflow-notebook image.
Set it up this way:
docker pull jupyter/tensorflow-notebook
docker run -p 8888:8888 jupyter/tensorflow-notebook
After running the container, a link will be displayed in the terminal that you can use to access the notebook. Copy the link, paste it into your browser, and start writing your code. It worked well for me without any of those "warnings" printed out.
If the tensorflow container runs your code correctly with those errors still printed out, please let me know. But for now I am using the jupyter notebook container and I am not getting any errors or warnings.
If I find a way of clearing those warnings, I will let you know here.
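For what it's worth, TensorFlow's C++ log verbosity is controlled by the TF_CPP_MIN_LOG_LEVEL environment variable (2 hides INFO and WARNING messages), so a sketch for a quieter run of the original command would be:
docker run -it --rm -e TF_CPP_MIN_LOG_LEVEL=2 tensorflow/tensorflow:latest python -c "import tensorflow as tf; print(tf.reduce_sum(tf.random.normal([1000, 1000])))"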
I met this problem too while trying to run on a PC. Because nvidia-docker doesn't support Windows, I also have to use the CPU-only image.
As Odohi David said, you can run your code in a jupyter notebook. You may run it with the official image like this:
docker run -it -p 8888:8888 tensorflow/tensorflow:latest-py3-jupyter
Then open localhost:8888 in your browser. You will need a token to access the notebook and you can find it in the terminal. For example:
[I 13:57:13.801 NotebookApp] http://8c6f388f31c0:8888/?token=44ec1f209c1777c4c1b28b38b9842f297e170cbacf24b291
In this case, the token is 44ec1f209c1777c4c1b28b38b9842f297e170cbacf24b291.
After that, you can access the official TensorFlow tutorials and write your code in the notebook.
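If the terminal output with the token has scrolled away, Jupyter can list running servers together with their tokens. A sketch, assuming the container was started with the hypothetical name tf-notebook (set via --name tf-notebook):
docker exec tf-notebook jupyter notebook list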

Building parallel HDF5 with Cygwin

I recently attempted to install parallel HDF5 1.10.1 from source with Cygwin 2.8.2 and stumbled into a "SIG_SETMASK undeclared" error during make. I wanted to use h5py to write HDF5 files in parallel (in a Windows environment).
I tried to follow the instructions from https://support.hdfgroup.org/ftp/HDF5/current/src/unpacked/release_docs/INSTALL_parallel
First I ran the C sample Sample_mpio.c, and it seems to have worked:
Proc 2: all tests passed
Proc 3: all tests passed
Proc 1: all tests passed
Proc 0: all tests passed
So the first step was to configure:
CC=mpicc ./configure --enable-parallel --enable-shared
and then make
However, at some point make exited with several "SIG_SETMASK undeclared" error messages pointing to the sigsetjmp(J,N) part of #define HDsigsetjmp(J,N) sigsetjmp(J,N), followed by "in expansion of macro ..." for several macros.
Since there are no separate build instructions for Cygwin, as there are for the serial case, I thought it might be a Cygwin issue.
What is the proper way to install parallel HDF5 (and h5py) on Windows?
Edit 1:
Thanks to @varro's suggestion, running make CFLAGS=-D_POSIX_C_SOURCE resolved the error.
Nevertheless, make check now gives a segmentation fault...
Testing testhdf5.exe
/bin/sh: line 30: 22284 Segmentation fault (core dumped) srcdir="." ./${tname} >> ${log} 2>&1
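For reference, the sequence that got past the compile error was (a sketch based on the commands above; a bare -D_POSIX_C_SOURCE defines the macro to 1, and some headers may want an explicit level such as -D_POSIX_C_SOURCE=200112L):
CC=mpicc ./configure --enable-parallel --enable-shared
make CFLAGS=-D_POSIX_C_SOURCE
For the h5py half of the question, h5py's documentation describes building against a parallel HDF5 roughly like this (assuming mpicc and pip are available in the same environment):
CC=mpicc HDF5_MPI="ON" pip install --no-binary=h5py h5py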

Upstart causes kernel panic on embedded Linux

I am using ptxdist to create kernel and rootfs images for an embedded Linux system running on an ARM Cortex-A8 CPU.
I was trying to use a newer compiler (GCC 5+) and so was forced to upgrade several external packages that would not compile under the new GCC.
I compiled the following versions of Upstart and its immediate dependencies:
upstart: 1.13.2
libnih: 1.0.3
dbus: 1.11.2
json-c: 0.12.1
When I boot, I get the following message:
init: com.ubuntu.Upstart.c:3525: Assertion failed in control_emit_event_emitted: env != NULL
init: Caught abort, core dumped
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000600
Searching online did not yield useful hints - the only relevant issue I found is this, but it concerns an older version of Upstart, and my libnih is already at the correct version.
According to comment #8 in the bug report you linked, it is not enough to use version 1.0.3 of libnih -- you have to use the Ubuntu version specifically, as it seems to include dbus fixes which could solve the problem you are seeing. From the bug report:
David Ireland (e-david) wrote on 2013-04-22: #7
I've built libnih 1.0.3 from source and also made sure that upstart builds with that version of the nih-dbus-tool. I'm still having this problem.
James Hunt (jamesodhunt) wrote on 2013-04-22: #8
Which problem? The crash? If so, you are still using the wrong version of libnih: you should be using the Ubuntu version (specifically 1.0.3-4ubuntu16) from here:
https://code.launchpad.net/~ubuntu-branches/ubuntu/raring/libnih/raring
You do not need the --session flag to run a "Session Init" (yes, this is a little confusing but --session was added for testing a long time ago and is still required for that). A "Session Init" only requires "--user".

Debugging TensorFlow's C++ code behind the SWIG interface

I'm not sure how to debug (presumably with gdb) the C++ code behind a SWIG interface.
I can use ipdb to watch the execution of TensorFlow's Python code all the way to the SWIG wrapper (e.g. tf_session.TF_Run in session.py), but I would like to debug the C++ code behind the SWIG interface.
Presumably I build TensorFlow with bazel build --config debug, but how do I attach gdb to the resulting code when it is called from the Python interface?
TensorFlow's C++ code executes in the same process as the Python code that calls it (or, if you are using the distributed version, in the same process as one of the Python programs that created a tf.GrpcServer).
The simplest interface between Python and C++ is the pure-C API in tensor_c_api.h. To intercept one of these calls, you can attach gdb to the process ID of the Python interpreter that is running TensorFlow, and create a breakpoint on one of these functions.
For example, using an interactive Python session, in the first terminal enter:
$ python
>>> import tensorflow as tf
>>> import os
>>> os.getpid()
14680
Then, in another terminal, start gdb:
$ gdb -p 14680
[...]
(gdb) break TF_NewSession
Breakpoint 1 at 0x7f15f450a4d0
(gdb) continue
Continuing.
Back in the Python interpreter, create a new session:
>>> sess = tf.Session()
The interpreter will pause, and your debugger will print something like the following:
Breakpoint 1, 0x00007f15f450a4d0 in TF_NewSession () from [...]/tensorflow/python/_pywrap_tensorflow.so
(gdb) backtrace
#0 0x00007f15f450a4d0 in TF_NewSession () from [...]/tensorflow/python/_pywrap_tensorflow.so
#1 0x00007f15f3ac5cdb in _wrap_TF_NewSession () from [...]/tensorflow/python/_pywrap_tensorflow.so
#2 0x000000000049968d in PyEval_EvalFrameEx ()
#3 0x00000000004a090c in PyEval_EvalCodeEx ()
#4 0x0000000000499a52 in PyEval_EvalFrameEx ()
[...]
You can now use the full power of gdb to debug TensorFlow.
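If you repeat this often, the attach-and-break steps can be scripted with gdb's -ex flag (a sketch reusing the process ID and breakpoint from the example above):
gdb -p 14680 -ex 'break TF_NewSession' -ex 'continue'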

Ocropus 4.4 crashes on Xubuntu 10.04

I have problems with Ocropus 4.4 (open source OCR). I installed Ocropus and the necessary libraries following this script. Compilation and installation complete without any problem.
However, after a successful installation of Ocropus, I am not able to run any of the ocropus-* commands. ocropus-page and ocropus-hocr crash without giving any meaningful error message. I just get a cryptic Killed after two minutes of processing:
ocropus-pages page_001.jpeg
[note] line recognizer: >
[note] *** 1 page_001.jpeg ***
Killed
I do not know how to start fixing this. Does anybody have an idea where to start?
My machine is Xubuntu 10.04 64-bit (running in VirtualBox) with 512 MB of base memory.
I would try converting your images to uncompressed TIFF before processing. I know that Ocropus works closely with Tesseract and that Tesseract handles 1 bpp uncompressed TIFFs best.
Despite this being an old question, I thought I'd post a solution and some debugging tips, as I experienced the same problem in the same scenario: Debian running under VirtualBox with 512 MB RAM.
In order to debug "Killed" messages, you should look in /var/log for help. In my case kern.log contained:
debian kernel: [89675.791910] Out of memory: Kill process 13004 (ocropus-rtrain) score 806 or sacrifice child
debian kernel: [89675.791951] Killed process 13004 (ocropus-rtrain) total-vm:800816kB, anon-rss:412424kB, file-rss:4kB
It's likely you need to increase the amount of resources assigned to the VM particularly if you're running some big fancy window manager in the background!
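A quick, generic way to confirm an OOM kill without hunting through /var/log is to search the kernel ring buffer (a sketch; the exact message wording varies between kernel versions):
dmesg | grep -i 'out of memory'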
