I run a Docker cluster with a few thousand containers. A few times per day, at random, a process gets "stuck" and blocks its container from stopping. Below is an example container with its corresponding process and everything I have tried in order to kill the container / process.
The container:
# docker ps | grep 950677e2317f
950677e2317f 7e553d1d9f6f "/bin/sh -c /minecraf" 2 days ago Up 2 days 0.0.0.0:22661->22661/tcp, 0.0.0.0:22661->22661/udp, 0.0.0.0:37681->37681/tcp, 0.0.0.0:37681->37681/udp gloomy_jennings
Trying to stop the container via the Docker daemon (it hangs forever without result):
# time docker stop --time=1 950677e2317f
^C
real 0m13.508s
user 0m0.036s
sys 0m0.008s
Daemon log while trying to stop:
# journalctl -fu docker.service
-- Logs begin at Fri 2015-12-11 15:40:55 CET. --
Dec 31 23:30:33 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:33.164731953+01:00" level=info msg="POST /v1.21/containers/950677e2317f/stop?t=1"
Dec 31 23:30:34 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:34.165531990+01:00" level=info msg="Container 950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942 failed to exit within 1 seconds of SIGTERM - using the force"
Dec 31 23:30:44 m3561.contabo.host docker[9988]: time="2015-12-31T23:30:44.165954266+01:00" level=info msg="Container 950677e2317f failed to exit within 10 seconds of kill - trying direct SIGKILL"
Looking at the processes running on the machine reveals the stuck process (pid 11991 on the host machine):
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1300:19 bash -c (echo stop > /tmp/minecraft &)
# top -b | grep [1]1991
11991 root 20 0 5836 132 20 R 89.5 0.0 1300:29 bash
And it is indeed a process running within our container (check container id):
# cat /proc/11991/mountinfo
...
/var/lib/docker/containers/950677e2317fcd2403ef5b5ffad37204e880136e91f76b0a8682e04a93e80942/resolv.conf /etc/resolv.conf rw,relatime - ext4 /dev/sda2 rw,errors=remount-ro,data=ordered
Trying to kill the process yields nothing:
# kill -9 11991
# ps aux | grep [1]1991
root 11991 84.3 0.0 5836 132 ? R Dec30 1303:58 bash -c (echo stop > /tmp/minecraft &)
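For what it's worth, a process in state R that survives kill -9 is not a true zombie: it is spinning inside the kernel and never returns to user space, which is the only point where a signal (even SIGKILL) is acted on. A quick way to confirm this from the host (a sketch; proc_state is a helper name of my own):

```shell
# proc_state PID: print the one-letter scheduler state from /proc/PID/stat,
# e.g. R (running), S (sleeping), D (uninterruptible sleep), Z (zombie).
proc_state() {
    awk '{print $3}' "/proc/$1/stat"
}

# For the stuck process above one would run (as root):
#   proc_state 11991        # keeps printing R even after kill -9
#   cat /proc/11991/stack   # kernel call stack shows where it is spinning
```

A Z state would mean an exited-but-unreaped child; an R that never yields, as here, points at a kernel-side loop rather than anything Docker itself can fix.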
Some overview data:
# docker version
Client:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.1
API version: 1.21
Go version: go1.4.2
Git commit: a34a1d5
Built: Fri Nov 20 13:20:08 UTC 2015
OS/Arch: linux/amd64
# docker info
Containers: 189
Images: 322
Server Version: 1.9.1
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 700
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 4.2.0-19-generic
Operating System: Ubuntu 15.10
CPUs: 24
Total Memory: 125.8 GiB
Name: m3561.contabo.host
ID: ZM2Q:RA6Q:E4NM:5Q2Q:R7E4:BFPQ:EEVK:7MEO:YRH6:SVS6:RIHA:3I2K
# uname -a
Linux m3561.contabo.host 4.2.0-19-generic #23-Ubuntu SMP Wed Nov 11 11:39:30 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
Even if I stop the Docker daemon, the process lives on. The only way to get rid of it is to restart the host machine. As this happens fairly frequently (every node needs a restart every 3-7 days), it has a serious impact on the uptime of the overall cluster.
Any ideas on what to do here?
Okay, I think I found the root cause of this. The folks over at Docker helped me out; check out this thread on GitHub.
It turns out this is most likely a bug in the 4.* Linux kernel (this host runs 4.2.0). I'll be rolling back to an older version until it is fixed.
UPDATE: I've been running 3.* kernels in my cluster for several days now without any issues. This was almost certainly a kernel bug.
I had a similar problem, and switching to the overlay2 storage driver made it go away. Note that changing the storage driver will lose all Docker state (images & containers). It seems the aufs storage driver has some problems that might be a source of lock-ups.
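For reference, the switch itself is small: a sketch assuming a systemd host, the default /etc/docker path, and a Docker version with daemon.json support (1.12+). Stop the daemon, set the storage driver in the daemon configuration, then start it again. Note once more that this discards all existing images and containers.

```json
{
    "storage-driver": "overlay2"
}
```

With this in /etc/docker/daemon.json, docker info should report Storage Driver: overlay2 after the restart; the old /var/lib/docker/aufs tree can be deleted once the new setup is verified.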
Processes are slow when run in a Docker container on an Ubuntu 18 machine, but the same process with the same Docker version runs fine on an Ubuntu 16 machine.
I have a Node application listening on a port. It accepts GET requests on the paths "/" and "/docker", which simply run the command "whoami" on the host machine and in a Docker container respectively, and return the result. The same Node application with the same Docker container is running on both machines (Ubuntu 16 and Ubuntu 18).
First, I sent 20 concurrent GET requests with path "/" to both machines. Both executed the command in an average of 35-40 ms.
Second, I sent 20 concurrent GET requests with path "/docker" to both machines. The Ubuntu 16 machine took a maximum of 4.3 seconds and an average of 3 seconds, while the Ubuntu 18 machine took a maximum of 10 seconds and an average of 9 seconds.
I repeated the test multiple times and concluded that, when running the process inside Docker, execution takes almost twice as long on the Ubuntu 18 machine as on Ubuntu 16.
I checked the following:
I monitored with top and htop while sending the 20 requests, but everything looks the same on both machines.
I also tried monitoring with the perf command, but could not find any meaningful difference there. I am not very familiar with perf, so I may be misreading its output.
While the 20 requests were being processed, I ran the same Docker command manually under strace. The results were inconsistent: on Ubuntu 18 the time was sometimes spent in clock_gettime or futex (FUTEX_WAIT), and sometimes the trace simply ended with the +++ exited with 0 +++ message; on Ubuntu 16 the same command consistently took less time.
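To take Node out of the loop entirely, it may help to time docker exec straight from the shell. Below is a small timing helper (my own sketch, not from the question; it assumes GNU date for the nanosecond format):

```shell
# time_ms CMD [ARGS...]: print the wall-clock duration of a command in ms.
# Assumes GNU coreutils date (the %N nanosecond format).
time_ms() {
    local start end
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Fire 20 concurrent execs, as in the test, and compare the spread:
#   for i in $(seq 1 20); do time_ms docker exec ubuntu whoami & done; wait
```

If the same spread shows up here (seconds on Ubuntu 18 versus well under that on Ubuntu 16), the Node application is ruled out and the difference lies in Docker or the kernel.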
Below are the different configurations and code snippets I am using and running:
Machine 1 (better performance):
node v10.16.0
npm 6.9.0
docker 18.09.8
ubuntu 16.04.3 LTS, xenial
Machine 2 (poor performance):
node v10.16.0
npm 6.9.0
docker 18.09.8
ubuntu 18.04.2 LTS, bionic
Node application code snippet:
// for path "/docker"
var excuteInDocker = function() {
var cmd = "docker";
var args = ["exec", "ubuntu", "whoami"];
return executeCmd(cmd, args);
}
// for path "/"
var execute = function(){
var cmd = 'whoami';
var args = [];
return executeCmd(cmd, args);
}
Output of docker info that is common to both Ubuntu 16 and 18:
Containers: 1
Running: 1
Paused: 0
Stopped: 0
Images: 2
Server Version: 18.09.8
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc version: 425e105d5a03fabd737a126ad93d62a9eeede87f
init version: fec3683
Security Options:
apparmor
seccomp
Profile: default
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 7.296GiB
Name: myhostname
ID: LLLO:OMTS:PNNM:T3MP:AD2F:UMDG:IIZK:OGBO:3ZLL:YDBX:ONAO:AY5G
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 27
Goroutines: 42
System Time: 2019-07-25T15:25:54.991694211+05:30
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
docker info specific to Ubuntu 16:
Kernel Version: 4.4.0-112-generic
Operating System: Ubuntu 16.04.3 LTS
Total Memory: 7.303GiB
ID: FOFI:RW7N:RZSP:HHKH:BKS3:LMWL:TC2J:W7V2:222Y:Q2AU:XMU3:KLU7
docker info specific to Ubuntu 18:
Kernel Version: 4.15.0-1040-aws
Operating System: Ubuntu 18.04.2 LTS
Total Memory: 7.296GiB
ID: LLLO:OMTS:PNNM:T3MP:AD2F:UMDG:IIZK:OGBO:3ZLL:YDBX:ONAO:AY5G
Ubuntu 16 machine data:
1. Execution time data:
2019-07-25 14:06:42.851 INFO uid: 540ae880-aeb7-11e9-919d-dd32b3cf84d5 time: 475 result: {"success":true,"data":"root"}
2019-07-25 14:06:43.183 INFO uid: 54145e60-aeb7-11e9-919d-dd32b3cf84d5 time: 745 result: {"success":true,"data":"root"}
2019-07-25 14:06:45.711 INFO uid: 540c4810-aeb7-11e9-919d-dd32b3cf84d5 time: 3326 result: {"success":true,"data":"root"}
.
.
.
2019-07-25 14:06:46.835 INFO uid: 541d5f10-aeb7-11e9-919d-dd32b3cf84d5 time: 4338 result: {"success":true,"data":"root"}
Logs of command strace -t docker exec ubuntu whoami:
Result of perf top --sort comm,dso:
Ubuntu 18 machine data:
1. Execution time data:
2019-07-25 14:07:32.559 INFO uid: 715a6af0-aeb7-11e9-a5a9-2fffd4e800d1 time: 1008 result: {"success":true,"data":"root"}
2019-07-25 14:07:32.941 INFO uid: 7178c860-aeb7-11e9-a5a9-2fffd4e800d1 time: 1191 result: {"success":true,"data":"root"}
2019-07-25 14:07:40.363 INFO uid: 71767e70-aeb7-11e9-a5a9-2fffd4e800d1 time: 8628 result: {"success":true,"data":"root"}
.
.
.
2019-07-25 14:07:41.970 INFO uid: 718af0d0-aeb7-11e9-a5a9-2fffd4e800d1 time: 10101 result: {"success":true,"data":"root"}
Logs of command strace -t docker exec ubuntu whoami:
Result of perf top --sort comm,dso:
So, I need help debugging what is wrong with Docker on the Ubuntu 18 machine, or finding out whether there is some limitation of Docker on Ubuntu 18, or perhaps a machine or Docker configuration issue.
I did not encounter such a problem on my desktop, but a MySQL container was running inexplicably slowly on my Ubuntu laptop. This solution solved my problem.
I could be missing something ridiculous, but none of the Docker containers I have tried to expose to my host machine (Mac) seem to work. I can tell the containers are running and appear to be properly exposed on the ports I chose. Am I missing something obvious? Any help would be greatly appreciated.
I pulled down the latest Elasticsearch image: https://www.elastic.co/guide/en/elasticsearch/reference/current/docker.html
Run Docker:
docker run -d -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.4.0
Listing running containers:
docker ps
The running container:
5e8ae3b13f7c docker.elastic.co/elasticsearch/elasticsearch:5.4.0 "/bin/bash bin/es-..." 4 seconds ago Up 4 seconds 0.0.0.0:9200->9200/tcp, 9300/tcp eloquent_almeida
Running lsof to see if anything is listening on port 9200:
lsof -i tcp:9200
Nothing is returned.
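To double-check from both sides, one could ask Docker what it thinks the mapping is and probe the port with a plain TCP connect. port_open below is a helper name of my own; it relies on bash's /dev/tcp redirection:

```shell
# port_open HOST PORT: succeed if a TCP connection can be opened.
port_open() {
    (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# On the host:
#   docker port 5e8ae3b13f7c            # should show 9200/tcp -> 0.0.0.0:9200
#   docker logs 5e8ae3b13f7c | tail     # check whether Elasticsearch is up
#   port_open localhost 9200 && echo open || echo closed
```

Elasticsearch 5.x also takes a while to boot and can exit shortly after start (for example when memory or vm.max_map_count limits are too low), so docker logs is worth checking before concluding the port mapping itself is broken.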
Mac OS: 10.12.4
Docker Updated Version:
docker version
Client:
Version: 17.04.0-ce
API version: 1.27 (downgraded from 1.28)
Go version: go1.7.5
Git commit: 4845c56
Built: Wed Apr 5 23:33:17 2017
OS/Arch: darwin/amd64
Server:
Version: 17.03.1-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: c6d412e
Built: Mon Mar 27 16:58:30 2017
OS/Arch: linux/amd64
Experimental: false
I downloaded nmap and ran it against port 9200 on localhost. I also made sure 9200 is now open in /etc/pf.conf.
Nmap scan report for localhost (127.0.0.1)
Host is up (0.00016s latency).
Other addresses for localhost (not scanned): ::1
PORT STATE SERVICE
9200/tcp closed wap-wsp
I also tried using the docker-machine IP:
docker-machine ip default
192.168.99.100
I tried 192.168.99.100:9200 and still had no luck.
It looks like something is wrong with the downloaded image or your Docker installation. I repeated your steps and everything works:
[06:40 PM] borlaze#mac: /tmp $ docker run -d -p 9200:9200 docker.elastic.co/elasticsearch/elasticsearch:5.4.0
[06:41 PM] borlaze#mac: /tmp $ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fd05a1fe9b5a docker.elastic.co/elasticsearch/elasticsearch:5.4.0 "/bin/bash bin/es-..." 9 seconds ago Up 7 seconds 0.0.0.0:9200->9200/tcp, 9300/tcp practical_bell
[06:41 PM] borlaze#mac: /tmp $ lsof -i tcp:9200
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
com.docke 32108 borlaze 21u IPv4 0x601aa3189a6fc3e3 0t0 TCP *:wap-wsp (LISTEN)
com.docke 32108 borlaze 22u IPv6 0x601aa318a167e6cb 0t0 TCP localhost:wap-wsp (LISTEN)
Checked on macOS 10.12.4, Docker version:
[06:45 PM] borlaze#mac: /tmp $ docker version
Client:
Version: 17.03.1-ce
API version: 1.27
Go version: go1.7.5
Git commit: c6d412e
Built: Tue Mar 28 00:40:02 2017
OS/Arch: darwin/amd64
Server:
Version: 17.03.1-ce
API version: 1.27 (minimum version 1.12)
Go version: go1.7.5
Git commit: c6d412e
Built: Fri Mar 24 00:00:50 2017
OS/Arch: linux/amd64
Experimental: true
Try removing the image and repeating the steps.
I need to start dockerd during system boot in Fedora 25.
I have installed docker-engine in Fedora 25 Server Edition.
Docker version:
Client:
Version: 1.13.0
API version: 1.25
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:06 2017
OS/Arch: linux/amd64
Server:
Version: 1.13.0
API version: 1.25 (minimum version 1.12)
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:06 2017
OS/Arch: linux/amd64
Experimental: false
I store Docker's data in a custom location, so I start dockerd like this:
dockerd -g /u01/docker
I found these init scripts, init1 and init2, but they only start Docker with its default settings. I need to start dockerd with the custom location, like:
dockerd -g /u01/docker
How can I change that GitHub init script, or can you suggest how to write a new one?
I found the solution myself.
I took the default docker.service file and changed it to use our custom location path.
I replaced ExecStart=/usr/bin/dockerd with ExecStart=/usr/bin/dockerd -g /u01/docker.
docker.service:
[Unit]
Description=Docker Application Container Engine
Documentation=https://docs.docker.com
After=network.target firewalld.service
[Service]
Type=notify
# the default is not to use systemd for cgroups because the delegate issues still
# exists and systemd currently does not support the cgroup feature set required
# for containers run by docker
ExecStart=/usr/bin/dockerd -g /u01/docker
ExecReload=/bin/kill -s HUP $MAINPID
# Having non-zero Limit*s causes performance problems due to accounting overhead
# in the kernel. We recommend using cgroups to do container-local accounting.
LimitNOFILE=infinity
LimitNPROC=infinity
LimitCORE=infinity
# Uncomment TasksMax if your systemd version supports it.
# Only systemd 226 and above support this version.
#TasksMax=infinity
TimeoutStartSec=0
# set delegate yes so that systemd does not reset the cgroups of docker containers
Delegate=yes
# kill only the docker process, not all processes in the cgroup
KillMode=process
[Install]
WantedBy=multi-user.target
After that I ran:
systemctl daemon-reload
systemctl enable docker.service
systemctl start docker.service
Now Docker starts at system boot.
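One caveat with editing the packaged docker.service directly: a package upgrade can overwrite it. A drop-in override achieves the same thing and survives upgrades (a sketch, assuming the same binary and data paths as above):

```ini
# /etc/systemd/system/docker.service.d/override.conf
# (can be created with: systemctl edit docker.service)
[Service]
ExecStart=
ExecStart=/usr/bin/dockerd -g /u01/docker
```

The empty ExecStart= line clears the packaged command first; systemd allows only one ExecStart for a non-oneshot service, so the override must reset it before setting the new value. Follow with systemctl daemon-reload and systemctl restart docker.service as above.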
I simply run the following command:
docker run -d -p 80:80 --name webserver nginx
and after pulling all images returns this error:
docker: Error response from daemon: driver failed programming external
connectivity on endpoint webserver
(ac5719bc0e95ead1a4ec6b6ae437c4c0b8a9600ee69ecf72e73f8d2d12020f97):
Error starting userland proxy: Bind for 0.0.0.0:80: unexpected error
(Failure EADDRINUSE).
Here is my docker Version info:
Client:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 21:15:28 2016
OS/Arch: darwin/amd64
Server:
Version: 1.12.0
API version: 1.24
Go version: go1.6.3
Git commit: 8eab29e
Built: Thu Jul 28 21:15:28 2016
OS/Arch: linux/amd64
How to fix this?
You didn't provide information such as your system details or the processes running, so I'll assume the most likely situation.
The output contains Failure EADDRINUSE, which means port 80 is already in use by something else. You can use lsof -i TCP:80 to check which process is listening on that port. If nothing is running on the port, it might be an issue with Docker itself, for example the known issue of ports not being released immediately.
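If something else legitimately owns port 80, the quickest workaround is to publish nginx on a different host port. As a convenience, here is a small bash helper (my own sketch, using bash's /dev/tcp redirection) that finds a port nothing is listening on:

```shell
# free_port: print the first TCP port in 8080-8180 with no listener.
# A failed connect (connection refused) means the port is free.
free_port() {
    local p
    for p in $(seq 8080 8180); do
        if ! (exec 3<>"/dev/tcp/127.0.0.1/$p") 2>/dev/null; then
            echo "$p"
            return 0
        fi
    done
    return 1
}

# e.g. docker run -d -p "$(free_port)":80 --name webserver nginx
```

This sidesteps the conflict entirely; you would then reach nginx on the chosen high port instead of 80.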
I'm trying to set up Docker Swarm on my virtual cluster. First, I try to install the swarm master on localhost with docker-machine.
The problem is that the machine needs to use a proxy to reach the discovery token.
First I request a token with swarm create. To do that, I created this file:
$cat /etc/systemd/system/docker.service.d/http_proxy.conf
[Service]
Environment="HTTP_PROXY=http://**.**.**.**:3128/" "HTTPS_PROXY=http://**.**.**.**:3128/" "NO_PROXY=localhost,127.0.0.1,192.168.2.100,192.168.2.101,192.168.2.102,192.168.2.103,192.168.2.104,192.168.2.105,192.168.2.106,192.168.2.107,192.168.2.108,192.168.2.194,192.168.2.110"
I restarted the daemon, and I can pull the swarm image:
$docker run -e "http_proxy=http://**.**.**.**:3128/" -e "https_proxy=http://**.**.**.**:3128/" swarm create
b54d8665e72939d2c611d8f9e99521b4
Next, I want to create the swarm master:
$docker-machine create -d generic --generic-ip-address localhost \
--engine-env HTTP_PROXY=http://192.168.254.10:3128/ \
--engine-env HTTPS_PROXY=http://192.168.254.10:3128/ \
--engine-env NO_PROXY=localhost,192.168.2.102,192.168.2.100 \
--swarm --swarm-master --swarm-discovery \
token://b54d8665e72939d2c611d8f9e99521b4 swarm-master
Result:
Running pre-create checks...
Creating machine...
Waiting for machine to be running, this may take a few minutes...
Machine is running, waiting for SSH to be available...
Detecting operating system of created instance...
Provisioning created instance...
Copying certs to the local machine directory...
Copying certs to the remote machine...
Setting Docker configuration on the remote daemon...
Configuring swarm...
To see how to connect Docker to this machine, run: docker-machine env swarm-master
And I see errors in the logs of the join and manage containers (I think the errors occur because the containers don't use the proxy):
$docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
6fbf967cdb60 swarm:latest "/swarm join --advert" 53 seconds ago Up 52 seconds 2375/tcp swarm-agent
8b176116989e swarm:latest "/swarm manage --tlsv" 54 seconds ago Up 53 seconds 2375/tcp, 0.0.0.0:3376->3376/tcp swarm-agent-master
$docker logs 6fbf967cdb60
time="2015-11-17T19:37:21Z" level=info msg="Registering on the discovery service every 20s..." addr="localhost:2376" discovery="token://b54d8665e72939d2c611d8f9e99521b4"
time="2015-11-17T19:37:41Z" level=error msg="Post https://discovery.hub.docker.com/v1/clusters/b54d8665e72939d2c611d8f9e99521b4?ttl=60: dial tcp: lookup discovery.hub.docker.com on 8.8.4.4:53: read udp 172.17.0.3:46576->8.8.4.4:53: i/o timeout"
$docker logs 8b176116989e
time="2015-11-17T19:37:20Z" level=info msg="Listening for HTTP" addr="0.0.0.0:3376" proto=tcp
time="2015-11-17T19:37:40Z" level=error msg="Discovery error: Get https://discovery.hub.docker.com/v1/clusters/b54d8665e72939d2c611d8f9e99521b4: dial tcp: lookup discovery.hub.docker.com on 8.8.4.4:53: read udp 172.17.0.2:44241->8.8.4.4:53: i/o timeout"
Is this a bug in the generic driver?
Some other information:
# docker version
Client:
Version: 1.9.0
API version: 1.21
Go version: go1.4.2
Git commit: 76d6bc9
Built: Tue Nov 3 17:29:38 UTC 2015
OS/Arch: linux/amd64
Server:
Version: 1.9.0
API version: 1.21
Go version: go1.4.2
Git commit: 76d6bc9
Built: Tue Nov 3 17:29:38 UTC 2015
OS/Arch: linux/amd64
# docker info
Containers: 2
Images: 8
Server Version: 1.9.0
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 12
Dirperm1 Supported: true
Execution Driver: native-0.2
Logging Driver: json-file
Kernel Version: 3.16.0-4-amd64
Operating System: Debian GNU/Linux 8 (jessie)
CPUs: 2
Total Memory: 1000 MiB
Name: swarm-master
ID: 6SDE:CQRA:NM6W:TY2H:4DPB:O4YO:IGRT:33AA:OKQP:M6UK:EMSR:H4WR
WARNING: No memory limit support
WARNING: No swap limit support
Labels:
provider=generic
Thank you :)
The problem was that it's not possible to use docker-machine to create the swarm master on the same machine. So I created two VMs: one running docker-machine (and the mh-keystore) and another for the swarm master.
Creating the mh-keystore on localhost:
$docker-machine create -d generic --generic-ip-address localhost mh-keystore
$docker $(docker-machine config mh-keystore) run -d \
-p "8500:8500" \
-h "consul" \
progrium/consul -server -bootstrap
$docker ps
Installing the swarm master on the other machine:
$ docker-machine create \
-d generic --generic-ip-address 192.168.2.100 \
--swarm --swarm-image="swarm" --swarm-master \
--swarm-discovery="consul://192.168.2.103:8500" \
swarm-master
Creating the agent:
$ docker-machine create \
-d generic --generic-ip-address 192.168.2.102 \
--swarm \
--swarm-discovery="consul://192.168.2.103:8500" \
swarm-agent-00