Docker service tasks stuck in "Preparing" state after reboot on Windows

Restarting a Windows server that is a swarm worker causes Windows containers to get stuck in a "Preparing" state indefinitely once the server and the Docker daemon are back online.
Image of tasks/containers stuck in preparing state:
https://user-images.githubusercontent.com/4528753/65180353-4e5d6e80-da22-11e9-8060-451150865177.png
Steps to reproduce the issue:
1. Create a swarm (in my case I have CentOS 7 managers and a few Windows Server 1903 workers).
2. Create a "global" docker service that only runs on the Windows machines. The containers should start up fine initially and work just fine.
3. Drain one or more of the Windows nodes that are running the Windows container(s) from step 2: docker node update --availability=drain nodename
4. Restart one or more of the nodes that were drained in step 3 and wait for them to come back up.
5. Set the Windows node(s) back to active: docker node update --availability=active nodename
6. Observe that the docker service created in step 2 is now stuck "Preparing" the containers on those nodes indefinitely (docker service ps servicename --no-trunc). You can observe this, and run these commands, from any manager node. For reference, the whole cycle is sketched below.
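For reference, a minimal sketch of the drain/restart/activate cycle as run from a manager node (nodename and servicename are placeholders for your own values):

docker node update --availability=drain nodename    # step 3
# reboot the Windows worker and wait for it to come back up (step 4)
docker node update --availability=active nodename   # step 5
docker service ps servicename --no-trunc            # step 6: tasks sit in "Preparing"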
Log output from the Windows worker while the tasks are stuck:
memberlist: Refuting a suspect message (from: c9347e85405d)
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
agent: session failed [node.id=wuhifvg9li3v5zuq2xu7c6hxa module=node/agent error=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.60.3.69:2377: connectex: A socket operation was attempted to an unreachable host." backoff=6.3s]
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.109: write udp 10.60.3.40:7946->10.60.3.109:7946: wsasendto: The requested address is not valid in its context.
(the "Failed to send gossip" lines above repeat continuously for each of these addresses)
memberlist: Failed to send gossip to 10.60.3.105:7946: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
memberlist: Failed to send gossip to 10.60.3.186:7946: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Many of these errors are odd. For example, port 7946 is completely open between the cluster nodes; telnet confirms this.
I expect the service's containers to start promptly rather than sit in a "Preparing" state. The image is already pulled, so startup should be fast.
docker version output
Client: Docker Engine - Enterprise
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:38:11
OS/Arch: windows/amd64
Experimental: false
Server: Docker Engine - Enterprise
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.24)
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:35:47
OS/Arch: windows/amd64
Experimental: false
docker info output
Client:
Debug Mode: false
Plugins:
cluster: Manage Docker clusters (Docker Inc., v1.1.0-8c33de7)
Server:
Containers: 4
Running: 0
Paused: 0
Stopped: 4
Images: 4
Server Version: 19.03.2
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
Swarm: active
NodeID: wuhifvg9li3v5zuq2xu7c6hxa
Is Manager: false
Node Address: 10.60.3.40
Manager Addresses:
10.60.3.110:2377
10.60.3.186:2377
10.60.3.69:2377
Default Isolation: process
Kernel Version: 10.0 18362 (18362.1.amd64fre.19h1_release.190318-1202)
Operating System: Windows Server Datacenter Version 1903 (OS Build 18362.356)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: SWARMWORKER1
ID: V2WJ:OEUM:7TUQ:WPIO:UOK4:IAHA:KWMN:RQFF:CAUO:LUB6:DJIJ:OVBX
Docker Root Dir: E:\docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node
Additional details:
- These nodes are not using Docker Desktop for Windows. I provision Docker on each box primarily following the PowerShell instructions here: https://docs.docker.com/install/windows/docker-ee/ (sketched after this list)
- Windows firewall is disabled
- iptables/firewalld is disabled
- Communication is completely open between the cluster nodes
- The nodes are fully up to date on cumulative updates
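For context, the provisioning roughly follows the PowerShell steps documented on that page; a sketch (exact package names and parameters are per the linked instructions, so treat this as an outline):

Install-Module DockerMsftProvider -Repository PSGallery -Force
Install-Package -Name docker -ProviderName DockerMsftProvider
Restart-Computer -Force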
I posted on the moby repo issues but never heard a peep:
https://github.com/moby/moby/issues/39955
The ONLY way I've found to temporarily fix the issue is to drain the node, remove it from the swarm, delete the Docker files, reinstall the Windows "Containers" feature, and then rejoin the node to the swarm (roughly the sequence sketched below). But it happens again on the next reboot.
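A rough sketch of that recovery sequence, run in PowerShell on the affected worker. The join token and manager address are placeholders, E:\docker is the Docker Root Dir from the docker info output below, and reboots are needed around the feature reinstall; this is an outline of what I do, not a tested script:

docker swarm leave
Stop-Service docker
Remove-Item -Recurse -Force E:\docker\*
Uninstall-WindowsFeature Containers    # reboot here
Install-WindowsFeature Containers      # reboot again
Start-Service docker
docker swarm join --token <worker-token> <manager-ip>:2377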
What's interesting is that when I see a swarm task in a "Preparing" state on the Windows worker, the server doesn't seem to be doing anything at all; it's as if the manager thinks the worker is preparing the container, but it isn't.
Anyone have any suggestions?

Related

Windows unable to connect to exposed Docker port

So I have a container running with port forwarding set up. The port appears to be listening on the local Windows host, but for some reason the connection won't go through.
The command to run the docker container:
docker run -p 4400:4400 storybook:latest
Inside the container itself, I can verify the service is running on port 4400:
netstat -ltnp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 127.0.0.1:4400 0.0.0.0:* LISTEN 33/node
wget http://0.0.0.0:4400
--2022-08-23 19:57:12-- http://0.0.0.0:4400/
Connecting to 0.0.0.0:4400... connected.
HTTP request sent, awaiting response... 200 OK
Length: 5866 (5.7K) [text/html]
Saving to: 'index.html'
100%[==============================================================================>] 5,866 --.-K/s in 0s
2022-08-23 19:57:12 (363 MB/s) - 'index.html' saved [5866/5866]
And on the Windows host, I can verify Docker is listening on port 4400:
netstat -aon | find /i "listening"
TCP [::]:4400 [::]:0 LISTENING 20412
tasklist
Image Name PID Session Name Session# Mem Usage
========================= ======== ================ =========== ============
com.docker.backend.exe 20412 Console 1 26,540 K
But I can't access the service via the Windows host.
wget http://localhost:4400
wget : The underlying connection was closed: The connection was closed unexpectedly.
I even tried getting the IP address of the docker container:
docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
898079f335be storybook:latest "npx nx serve storyb…" 34 minutes ago Up 34 minutes 0.0.0.0:4400->4400/tcp, :::4400->4400/tcp relaxed_mayer
> docker inspect -f '{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 898079f335be
172.17.0.2
And tried accessing the service via that IP:
wget http://172.17.0.2:4400
wget : Unable to connect to the remote server
The version of Windows:
Edition: Windows 10 Enterprise
Version: 21H2
OS Build: 19044.1766
Docker information:
Client:
Context: desktop-linux
Debug Mode: false
Plugins:
buildx: Build with BuildKit (Docker Inc., v0.5.1-docker)
compose: Docker Compose (Docker Inc., 2.0.0-beta.4)
scan: Docker Scan (Docker Inc., v0.8.0)
Server:
Containers: 7
Running: 1
Paused: 0
Stopped: 6
Images: 22
Server Version: 20.10.7
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
userxattr: false
Logging Driver: json-file
Cgroup Driver: cgroupfs
Cgroup Version: 1
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
Default Runtime: runc
Init Binary: docker-init
containerd version: d71fcd7d8303cbf684402823e425e9dd2e99285d
runc version: b9ee9c6314599f1b4a7f497e1f1f856fe433d3b7
init version: de40ad0
Security Options:
seccomp
Profile: default
Kernel Version: 5.4.72-microsoft-standard-WSL2
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 25.06GiB
Name: docker-desktop
ID: VFD3:RX76:D4JD:5Z6P:R2IQ:7JD4:FFQS:YDLJ:BDNW:J4UX:4U5A:GF4S
Docker Root Dir: /var/lib/docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
WARNING: No blkio throttle.read_bps_device support
WARNING: No blkio throttle.write_bps_device support
WARNING: No blkio throttle.read_iops_device support
WARNING: No blkio throttle.write_iops_device support
EDIT: I am using WSL2 as the backend.
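One detail worth flagging from the in-container netstat output above: node is listening on 127.0.0.1:4400, and a loopback-only listener is generally unreachable through Docker's port forwarding, which delivers traffic to the container's own network interface. A quick way to re-check the bind address (container name taken from the docker ps output above; look for 0.0.0.0:4400 rather than 127.0.0.1:4400):

docker exec relaxed_mayer netstat -ltn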

Docker on Windows getting Client.Timeout exceeded while awaiting headers for any image pull or login

When trying to run docker run hello-world on Windows 10, I am getting this error (a common error I saw on many threads):
Unable to find image 'hello-world:latest' locally
docker: Error response from daemon: Get https://registry-1.docker.io/v2/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers).
I already had some images pulled earlier (a couple of months back) which still work, but I am not able to pull any new image (e.g. mongo), not even hello-world. I have searched around and tried setting DNS to 8.8.8.8/8.8.4.4 and experimental=true in the Docker config (Docker Desktop GUI), yet I'm unable to resolve it. One thing I wasn't expecting was an HTTP proxy in docker info, as I had removed the proxy settings from Docker's GUI and even from the environment variables.
docker info:
Client:
Debug Mode: false
Plugins:
scan: Docker Scan (Docker Inc., v0.3.4)
Server:
Containers: 3
Running: 2
Paused: 0
Stopped: 1
Images: 4
Server Version: 19.03.13
Storage Driver: overlay2
Backing Filesystem: extfs
Supports d_type: true
Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host ipvlan macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 8fba4e9a7d01810a393d5d25a3621dc101981175
runc version: dc9208a3303feef5b3839f4323d9beb36df0a9dd
init version: fec3683
Security Options:
seccomp
Profile: default
Kernel Version: 5.4.39-linuxkit
Operating System: Docker Desktop
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.941GiB
Name: docker-desktop
ID: <Docker Id>
Docker Root Dir: /var/lib/docker
Debug Mode: false
HTTP Proxy: http://<ip>:port/
HTTPS Proxy: http://<ip>:port/
No Proxy: localhost,127.0.0.2,firm.com,firm.org
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
Can someone please help me remove these proxies from docker info and let me pull images again?
I have not ticked "Expose daemon on tcp://localhost:2375 without TLS" or "Use the WSL 2 based engine" in the Docker Desktop GUI. Inside Proxies, manual configuration is turned off, and the network uses a manual DNS configuration of 8.8.8.8. I am also unable to ping hub.docker.com (request timeout), and docker login from cmd returns the same timeout error, yet Docker Desktop's GUI shows my user as logged in. I feel that if we can remove the proxy from docker info, it might solve the problem.
I apparently solved it by clearing the proxy URLs in Docker Desktop's proxy settings (before/after screenshots omitted here).
This is weird to me: with manual proxy configuration disabled, Docker shouldn't be using those values at all, and even after restarting and stopping the services it didn't work. Eventually I removed all the proxy URLs (despite manual proxy being off) and then it worked. Just Windows stuff, I guess.
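If it helps anyone hitting the same thing, a quick way to confirm what proxy the daemon actually sees, and whether a leftover "proxies" section survives in the client config, using only standard Windows commands:

docker info | findstr /i proxy
type %USERPROFILE%\.docker\config.json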

Go net.Listen() cannot bind to a docker service port after updating to docker version 19.03.2

I am using docker-compose to expose a Docker service on my Windows 10 machine.
I am also using a Go function to check whether the service is completely up:
package main

import (
	"fmt"
	"net"
)

func main() {
	err := ping(9800)
	fmt.Println(err)
}

// ping checks whether the local TCP port is free by trying to bind it:
// a nil error means we could listen on the port ourselves.
func ping(port uint16) (err error) {
	fmt.Println("checking port:", port)
	conn, err := net.Listen("tcp", fmt.Sprintf("localhost:%d", port))
	if err != nil {
		return
	}
	conn.Close()
	return
}
Docker and Go versions used now are:
C:\>docker version
Client: Docker Engine - Community
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:26:49 2019
OS/Arch: windows/amd64
Experimental: false
Server: Docker Engine - Community
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.12)
Go version: go1.12.8
Git commit: 6a30dfc
Built: Thu Aug 29 05:32:21 2019
OS/Arch: linux/amd64
Experimental: false
containerd:
Version: v1.2.6
GitCommit: 894b81a4b802e4eb2a91d1ce216b8817763c29fb
runc:
Version: 1.0.0-rc8
GitCommit: 425e105d5a03fabd737a126ad93d62a9eeede87f
docker-init:
Version: 0.18.0
GitCommit: fec3683
C:\>go version
go version go1.13.1 windows/amd64
The container is up and the service is exposed via hostPort 50014:
C:\>docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
48eeb27f5ddc d.reg.io/adata "/usr/local/bin/adat…" 7 seconds ago Up 4 seconds 0.0.0.0:50014->50014/tcp desktop_adata_1
When running the go script to bind to the port 50014, it returns error:
C:\>go run ping.go
checking port: 50014
listen tcp 127.0.0.1:50014: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
This happens only after the update to docker-for-windows version 19.03.2.
Can someone help me to solve this issue?
UPDATE:
There is a related question: what are "administered port exclusions" in Windows 10?
It is true that port 50014 falls in the administered port exclusions range. I thought removing the port from that exclusion list, or using any other port, would work. But I exposed the service on host ports 80, 8080, 50014 and 9800 one by one and tried to bind each of them, and every time it failed. (The exclusion ranges can be listed with the command shown below.)
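For reference, the standard Windows command to list the TCP port ranges that are excluded/reserved:

netsh int ipv4 show excludedportrange protocol=tcp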
Ports 80 and 8080 are non-excluded ports. The container is up and the service is listening on the port, but the Go function gives an error while trying to bind:
checking port: 80
listen tcp 127.0.0.1:80: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
Port 50014 is in the administered port exclusion range. The container is up and the service is listening on the port. It also gives the same error while trying to bind:
checking port: 50014
listen tcp 127.0.0.1:50014: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
Port 9800 is in the normal port exclusion range. The difference is, this time the container will not be up. Docker cannot use that host port to expose a service. It will give an error when running docker-compose up -d :
Creating desktop_adata_1 ... error
ERROR: for desktop_adata_1 Cannot start service adata: driver failed programming external connectivity on endpoint desktop_adata_1 (1a5978c5fbf35cb08fce14c8d5192756b3de8a77bd815f490e3e8ce542abaeaa): Error starting userland proxy: listen tcp 0.0.0.0:9800: bind: An attempt was made to access a socket in a way forbidden by its access permissions.
That means, in my case, the cause of the error is that net.Listen() fails to bind to a port that is already used for a Docker container service.
After the Docker update, net.Listen() could no longer bind to a port that is already used to expose a Docker container service.
As a workaround, I used net.Dial() instead of net.Listen() in the Go code: dialing only checks that something is accepting connections, rather than trying to take over the port.
It works as expected.
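For what it's worth, the same connect-style check (what net.Dial does) can be run from PowerShell without binding the port, using the port from the example above:

Test-NetConnection -ComputerName localhost -Port 50014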

Error response from daemon: rpc error: code = 2 desc = The swarm does not have a leader [closed]

I have created a 3-node swarm cluster in VirtualBox using docker-machine. All three nodes are running, and I'm able to use 'docker-machine ssh' to connect to each one. The problem is that after I restarted the physical machine, the cluster no longer seems to work. Why? The details follow. Thanks for your guidance and advice.
san@san-System-Product-Name:~$ docker-machine ls
NAME     ACTIVE   DRIVER       STATE     URL                         SWARM   DOCKER        ERRORS
first    -        virtualbox   Running   tcp://192.168.99.100:2376           v17.06.0-ce
second   -        virtualbox   Running   tcp://192.168.99.101:2376           v17.06.0-ce
third    -        virtualbox   Running   tcp://192.168.99.102:2376           v17.06.0-ce
The first node is the leader and the second is a manager, while the third is a worker. I tried 'docker-machine ssh first docker node ls' and got:
Error response from daemon:
`rpc error: code = 2 desc = The swarm does not have a leader`.
It's possible that too few managers are online. Make sure more than
half of the managers are online.
exit status 1
san@san-System-Product-Name:~$ docker-machine ssh first docker info
Containers: 2
Running: 0
Paused: 0
Stopped: 2
Images: 3
Server Version: 17.06.0-ce
Storage Driver: aufs
Root Dir: /mnt/sda1/var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 17
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: pending
NodeID: dowdk4pzfzm85zijbo23e6xs3
Error: rpc error: code = 2 desc = The swarm does not have a leader. It's possible that too few managers are online. Make sure more than half of the managers are online.
Is Manager: true
Node Address: 192.168.99.100
Manager Addresses:
192.168.99.100:2375
192.168.99.102:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
seccomp
Profile: default
Kernel Version: 4.4.74-boot2docker
Operating System: Boot2Docker 17.06.0-ce (TCL 7.2); HEAD : 0672754 - Thu Jun 29 00:06:31 UTC 2017
OSType: linux
Architecture: x86_64
CPUs: 1
Total Memory: 995.8MiB
Name: first
ID: ACGX:Z6QQ:5KOX:7W2O:OMMM:43PB:4QES:KKGJ:IXUC:J2SW:F4SJ:QMQ4
Docker Root Dir: /mnt/sda1/var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
File Descriptors: 24
Goroutines: 76
System Time: 2017-07-28T01:57:37.410536525Z
EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels: provider=virtualbox
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
san@san-System-Product-Name:~$ docker-machine ssh first docker network ls
NETWORK ID NAME DRIVER SCOPE
22e85840407d bridge bridge local
fc3c6786739c docker_gwbridge bridge local
e294dde63753 host host local
55f8e340b794 none null local
How can I fix this problem and run
docker node ls
on a manager node? Many thanks for your advice.
I had the same problem, but I'm not sure what caused it. I was able to fix it by running:
docker swarm init --force-new-cluster
Everything got restored.
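For anyone recovering the same way, the sequence run against the surviving manager (machine name per the docker-machine listing above) is roughly:

docker-machine ssh first docker swarm init --force-new-cluster
docker-machine ssh first docker node ls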

Cannot connect to Cloudera Manager, not listening on port 7180

I'd really appreciate some help to get cloudera manager running on AWS EC2.
It's my first install, and I'm aiming to use the AWS Free Tier to spin up a few nodes and do some training on a Hadoop cluster and the Cloudera distribution. I'm using the RedHat RHEL 7.2 image on AWS EC2.
I am following the instructions here... Cloudera Manager installation
I have installed Cloudera Manager OK, and I get to the screen where it invites you to use a browser to log in to the Cloudera Manager server. But that's where the problem starts: the app is not listening on port 7180, so there's no hope of connecting from another machine across the network. I can't even connect locally on the server, yet the service appears to be running OK.
Q1 - How can I confirm the config is set to use port 7180?
Q2 - are there obvious steps that I'm missing here ?
Thanks in advance,
[Edit..]
I'm beginning to wonder if the free-tier EC2 host is running short on memory to run Cloudera Manager. I saw one comment that implied that: AWS Forum post. But the process doesn't crash or report any problems in its logfile, so it must be OK, right?
[Edit.... with more diagnostic info....]
Here's a list of the diagnostics I've checked:
- SELinux is not running [for install and testing purposes]
- WAN firewalls
- EC2 firewall/security group
- local firewall on the server
- Cloudera Manager log
- is the service up and running?
- can you connect locally?
The security group on the EC2 instance contains SSH and port 7180.
For the firewall/iptables/firewalld on the RedHat instance, I tried adding the ports to iptables, then disabling iptables, then adding the ports to firewalld, then disabling the firewalld service.
$ sudo iptables -L
Chain INPUT (policy ACCEPT)
target prot opt source destination
ACCEPT all -- anywhere anywhere ctstate RELATED,ESTABLISHED
ACCEPT tcp -- anywhere anywhere tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7180
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:7182
But I'm getting the feeling that the Cloudera Manager installation is not happy, or not running correctly.
I've checked the Cloudera Manager log, and it ends with the following.
$ tail /var/log/cloudera-scm-server/cloudera-scm-server.log
2016-02-25 11:02:23,581 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 19264 new metrics
2016-02-25 11:02:28,920 INFO main:com.cloudera.cmon.components.MetricSchemaUpdate: persisting 0 updated metrics
2016-02-25 11:02:28,924 INFO main:com.cloudera.cmon.components.MetricSchemaManager: Cross entity aggregates processed.
And when I use tail -f and restart the cloudera-scm-server service, the log scrolls a lot and comes back to the same state. If I search for ERROR, there are no lines containing "ERR".
$ sudo service cloudera-scm-server start
Starting cloudera-scm-server (via systemctl): [ OK ]
$ sudo systemctl status cloudera-scm-server
● cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server)
Active: active (exited) since Thu 2016-02-25 12:23:03 EST; 44s ago
Docs: man:systemd-sysv-generator(8)
Process: 747 ExecStart=/etc/rc.d/init.d/cloudera-scm-server start (code=exited, status=0/SUCCESS)
So when I try to test the service by connecting from the local machine, I get the sort of behaviour that makes me think it's just not listening, and maybe not started correctly.
I try poking it with curl from the same shell where the cloudera-scm-server service was started:
$ curl localhost:7180
curl: (7) Failed connect to localhost:7180; Connection refused
$ wget localhost:7180
--2016-02-25 08:00:16-- http://localhost:7180/
Resolving localhost (localhost)... ::1, 127.0.0.1
Connecting to localhost (localhost)|::1|:7180... failed: Connection refused.
Connecting to localhost (localhost)|127.0.0.1|:7180... failed: Connection refused.
Checking what ports are listening on that machine shows no 7180. What's up with that?
$ netstat -nltp
(No info could be read for "-p": geteuid()=1000 but you should be root.)
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:7432 0.0.0.0:* LISTEN -
tcp 0 0 0.0.0.0:22 0.0.0.0:* LISTEN -
tcp 0 0 127.0.0.1:25 0.0.0.0:* LISTEN -
tcp6 0 0 :::7432 :::* LISTEN -
tcp6 0 0 :::22 :::* LISTEN -
tcp6 0 0 ::1:25 :::* LISTEN -
Here's what to look for, and a possible solution - give it more memory...
Check the status of the cloudera-scm-server service using [depending on your flavour of linux]
$ sudo service cloudera-scm-server status
OR
$ sudo systemctl status cloudera-scm-server
Look for the status - Active: active (running)
But if you find - Active: active (exited)
you may have a problem during the startup of the cloudera-scm-server.
In which case, look at the log files for cloudera-scm-server
$ sudo ls -l /var/log/cloudera-scm-server
$ sudo cat /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 265809920 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hs_err_pid831.log
[ec2-user@ip-172-31-31-166 ~]$ sudo tail -100 /var/log/cloudera-scm-server/cloudera-scm-server.out
JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
Java HotSpot(TM) 64-Bit Server VM warning: INFO: os::commit_memory(0x000000078dc58000, 265809920, 0) failed; error='Cannot allocate memory' (errno=12)
Use the top command to see how much memory is available on your system.
Possible solution: have a look at this discussion on the Cloudera forum.
In this case the Java heap size was too small:
As we see that heap was exhausted, assuming this is not a memory leak or something of the sort, Cloudera Manager may need more heap to operate. This can be configured in /etc/default/cloudera-scm-server. You could, for instance, change "-Xmx2G" to "-Xmx3G" or "-Xmx4G". If the problem still happens, perhaps the heap dumps will yield some clues.
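A sketch of what the relevant line in that file typically looks like; the exact options and values on a given install may differ, so treat this as an assumed example rather than the definitive config:

# /etc/default/cloudera-scm-server
export CMF_JAVA_OPTS="-Xmx4G -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp"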
I'd suggest you tail the logs. If you are using the free tier, Cloudera Manager will take a while to come up, possibly 5 minutes or more after you start cloudera-scm-server.
The logs should show whether there are any errors, possibly memory-allocation issues, since the free-tier servers have limited memory available. The little snippet of log entries looks fine and typical; it will go through a long list of processes before the UI comes up on 7180.
Also, while that is going on, run top or even free -g to see how much of the resources are being used, particularly memory.
I was having the exact same issue: unable to reach the CM login using the public DNS or IP on port 7180.
The following steps will help you:
1. Stop iptables (service iptables stop)
2. Disable SELinux (go to /etc/selinux/config and disable SELinux)
3. Verify curl/wget localhost:7180 works (check the curl status)
4. ufw allow 7180
5. service httpd status should be running
6. Check the /var/log/cloudera-scm-server log; if any error is found, troubleshoot it
7. cloudera-scm-server status (should be in a running state)
8. netstat -nap | grep 7180 (if another service is holding the port, kill it)
9. telnet localhost 7180 (should connect)
1] Check the status:
sudo service cloudera-scm-server status
cloudera-scm-server.service - LSB: Cloudera SCM Server
Loaded: loaded (/etc/rc.d/init.d/cloudera-scm-server; bad; vendor preset: disabled)
Active: active (exited) since UTC; 47min ago
Docs: man:systemd-sysv-generator(8)
If a stale pid file is present, remove it: rm /var/run/cloudera-scm-server.pid
NOTE : The Cloudera Manager service will not be running as it exited abnormally.
Running service cloudera-scm-server status will print following message "cloudera-scm-server dead but pid file exists".
Reason: Out of memory.
Solution : Examine the heap dump that the Cloudera Manager Server creates when it runs out of memory. The heap dump file is created in the /tmp directory, has file extension .hprof and file permission of 600. Its owner and group will be the owner and group of the Cloudera Manager server process, normally cloudera-scm:cloudera-scm.
Link : http://www.cloudera.com/documentation/manager/5-0-x/Cloudera-Manager-Diagnostics-Guide/cm5dg_troubleshooting_cluster_config.html
Check the status of cloudera-scm-server and follow the instructions ahead:
[root@quickstart ~]# service cloudera-scm-server status
By default, Cloudera's QuickStart VM manages CDH using Linux's configuration
and service management. To use Cloudera Manager instead, you must shut down
and disable the existing CDH services and then start Cloudera Manager. You can
do this by running the following command:
sudo /home/cloudera/cloudera-manager
[root@quickstart ~]# sudo /home/cloudera/cloudera-manager
[QuickStart] Shutting down CDH services via init scripts...
JMX enabled by default
Using config: /etc/zookeeper/conf/zoo.cfg
[QuickStart] Disabling CDH services on boot...
[QuickStart] Starting Cloudera Manager services...
[QuickStart] Deploying client configuration...
[QuickStart] Starting CM Management services...
[QuickStart] Enabling CM services on boot...
[QuickStart] Starting CDH services...
________________________________________________________________________________
Success! You can now log into Cloudera Manager from the QuickStart VM's browser:
http://quickstart.cloudera:7180
Username: cloudera
Password: cloudera
