Changing Hostname on Mesos-Slave not working
I followed the tutorial at https://open.mesosphere.com/getting-started/install/ to set up Mesos and Marathon.
I am using Vagrant to create 2 nodes, a master and a slave.
At the end of the tutorial I have Marathon and Mesos functioning.
First problem: only the slave running on the master machine is visible to Mesos. The "independent" slave on the second Vagrant node is not visible in Mesos, even though I have put the same settings in /etc/mesos/zk on both nodes. From what I understand, this is the file that tells the slave where to find the master(s).
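For reference, /etc/mesos/zk on both nodes points at the master's ZooKeeper, which matches the --master flag that shows up in the agent logs below:

zk://192.168.33.20:2181/mesos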
Second problem: when I set the hostname for the slave on the master machine to its IP address, the slave does not run.
When I remove the file /etc/mesos-slave/hostname and restart, the slave starts running again.
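Roughly what I did to set the hostname (a sketch; the restart command assumes the Ubuntu 14.04 upstart packaging used by the tutorial):

echo 192.168.33.20 | sudo tee /etc/mesos-slave/hostname   # advertise the private IP instead of the machine hostname
sudo service mesos-slave restart                          # restart the agent so it picks up the new flag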
I get the following logs in /var/log/mesos:
Log file created at: 2016/09/13 19:38:57
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0913 19:38:57.316082 2870 logging.cpp:194] INFO level logging started!
I0913 19:38:57.319680 2870 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0913 19:38:57.321099 2870 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0913 19:38:57.322904 2870 main.cpp:434] Starting Mesos agent
I0913 19:38:57.323637 2887 slave.cpp:198] Agent started on 1)#10.0.2.15:5051
I0913 19:38:57.323648 2887 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.20" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I0913 19:38:57.323942 2887 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0913 19:38:57.323969 2887 slave.cpp:527] Agent attributes: [ ]
I0913 19:38:57.323974 2887 slave.cpp:532] Agent hostname: 192.168.33.20
I0913 19:38:57.326578 2886 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
After this, when I run "sudo service mesos-slave status", it reports stop/waiting.
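For what it's worth, this is how I checked (the log directory matches the --log_dir flag shown in the startup flags above):

$ sudo service mesos-slave status
mesos-slave stop/waiting
$ ls -lt /var/log/mesos | head   # newest agent log files first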
I am not sure how to deal with these two problems. Any help is appreciated.
Update
On the "Independent Slave Machine" I am getting the following logs:
file: mesos-slave.vagrant-ubuntu-trusty-64.invalid-user.log.ERROR.20160914-141226.1197
Log file created at: 2016/09/14 14:12:26
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0914 14:12:26.699146 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:26.700430 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:27.634099 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:28.784499 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:34.914746 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:36.906472 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:37.242663 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:40.442214 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:42.033504 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:47.239245 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:50.712105 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:13:03.200935 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
file: mesos-slave.vagrant-ubuntu-trusty-64.invalid-user.log.INFO.20160914-141502.4788
Log file created at: 2016/09/14 14:15:02
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0914 14:15:02.491973 4788 logging.cpp:194] INFO level logging started!
I0914 14:15:02.495968 4788 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0914 14:15:02.497270 4788 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0914 14:15:02.498855 4788 main.cpp:434] Starting Mesos agent
I0914 14:15:02.499091 4788 slave.cpp:198] Agent started on 1)#10.0.2.15:5051
I0914 14:15:02.499195 4788 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I0914 14:15:02.499560 4788 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0914 14:15:02.499620 4788 slave.cpp:527] Agent attributes: [ ]
I0914 14:15:02.499650 4788 slave.cpp:532] Agent hostname: 192.168.33.31
I0914 14:15:02.502511 4803 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0914 14:15:02.502554 4803 state.cpp:697] No checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
I0914 14:15:02.502630 4803 state.cpp:100] Failed to find the latest agent from '/var/lib/mesos/meta'
I0914 14:15:02.510077 4807 status_update_manager.cpp:200] Recovering status update manager
I0914 14:15:02.510150 4807 containerizer.cpp:522] Recovering containerizer
I0914 14:15:02.510758 4807 provisioner.cpp:253] Provisioner recovery complete
I0914 14:15:02.510815 4807 slave.cpp:4782] Finished recovery
I0914 14:15:02.511342 4804 group.cpp:349] Group process (group(1)#10.0.2.15:5051) connected to ZooKeeper
I0914 14:15:02.511368 4804 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0914 14:15:02.511376 4804 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0914 14:15:02.513720 4804 detector.cpp:152] Detected a new leader: (id='4')
I0914 14:15:02.513813 4804 group.cpp:706] Trying to get '/mesos/json.info_0000000004' in ZooKeeper
I0914 14:15:02.514854 4804 zookeeper.cpp:259] A new leading master (UPID=master#10.0.2.15:5050) is detected
I0914 14:15:02.514928 4804 slave.cpp:895] New master detected at master#10.0.2.15:5050
I0914 14:15:02.514940 4804 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0914 14:15:02.514961 4804 slave.cpp:927] Detecting new master
I0914 14:15:02.514976 4804 status_update_manager.cpp:174] Pausing sending status updates
E0914 14:15:03.228878 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:03.229086 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:03.229099 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:03.342586 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:03.342675 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:03.342685 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:06.773352 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:06.773438 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:06.773448 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:09.190912 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:09.191007 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:09.191017 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:16.597836 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:16.597929 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:16.597940 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0914 14:15:33.944555 4809 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:33.944607 4809 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:33.944682 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:16:02.515676 4804 slave.cpp:4591] Current disk usage 4.72%. Max allowed age: 5.969647788608773days
E0914 14:16:11.307096 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:16:11.307189 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:16:11.307199 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
Update 2
The configurations on the two machines appear to be the same (I say "appear" because I have verified that they match, yet the remote slave still cannot connect to the master, so something must be going wrong).
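This is roughly how I compared the two machines (a sketch; as far as I understand, with the packaging from the tutorial every file under /etc/mesos-slave becomes a --<filename> flag at startup):

for f in /etc/mesos/zk /etc/mesos-slave/*; do
  echo "== $f =="    # print the file name, then its contents
  cat "$f"
done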
The logs for the machine slave1 are as follows:
mesos-slave.slave1.invalid-user.log.WARNING
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:34.797917 17129 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:35.612185 17133 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:37.841723 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:38.358711 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:51.705704 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
mesos-slave.slave1.invalid-user.log.INFO
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:34.011777 17112 logging.cpp:194] INFO level logging started!
I0917 20:28:34.014294 17112 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:34.016263 17112 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:34.017916 17112 main.cpp:434] Starting Mesos agent
I0917 20:28:34.018307 17112 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:34.018381 17112 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:34.018896 17112 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:34.018959 17112 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:34.018987 17112 slave.cpp:532] Agent hostname: 192.168.33.31
I0917 20:28:34.022061 17127 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0917 20:28:34.022337 17127 state.cpp:697] No checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
I0917 20:28:34.022431 17127 state.cpp:100] Failed to find the latest agent from '/var/lib/mesos/meta'
I0917 20:28:34.028128 17133 group.cpp:349] Group process (group(1)#127.0.0.1:5051) connected to ZooKeeper
I0917 20:28:34.028177 17133 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0917 20:28:34.028187 17133 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0917 20:28:34.028659 17130 status_update_manager.cpp:200] Recovering status update manager
I0917 20:28:34.028875 17129 containerizer.cpp:522] Recovering containerizer
I0917 20:28:34.029595 17129 provisioner.cpp:253] Provisioner recovery complete
I0917 20:28:34.029912 17112 slave.cpp:4782] Finished recovery
I0917 20:28:34.030637 17133 detector.cpp:152] Detected a new leader: (id='6')
I0917 20:28:34.030733 17133 group.cpp:706] Trying to get '/mesos/json.info_0000000006' in ZooKeeper
I0917 20:28:34.032158 17133 zookeeper.cpp:259] A new leading master (UPID=master#127.0.0.1:5050) is detected
I0917 20:28:34.032232 17133 slave.cpp:895] New master detected at master#127.0.0.1:5050
I0917 20:28:34.032245 17133 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0917 20:28:34.032263 17133 slave.cpp:927] Detecting new master
I0917 20:28:34.032281 17133 status_update_manager.cpp:174] Pausing sending status updates
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:34.797904 17129 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:34.797917 17129 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:35.612174 17133 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:35.612185 17133 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:37.841713 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:37.841723 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:38.358700 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:38.358711 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:51.705665 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:51.705704 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
mesos-slave.slave1.invalid-user.log.ERROR
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
mesos-slave.INFO
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:34.011777 17112 logging.cpp:194] INFO level logging started!
I0917 20:28:34.014294 17112 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:34.016263 17112 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:34.017916 17112 main.cpp:434] Starting Mesos agent
I0917 20:28:34.018307 17112 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:34.018381 17112 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:34.018896 17112 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:34.018959 17112 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:34.018987 17112 slave.cpp:532] Agent hostname: 192.168.33.31
I0917 20:28:34.022061 17127 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
mesos-slave.ERROR
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
Logs for the master machine are as follows:
mesos-slave.master.invalid-user.log.WARNING
Log file created at: 2016/09/17 20:28:30
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0917 20:28:30.118418 21439 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
mesos-slave.master.invalid-user.log.INFO
Log file created at: 2016/09/17 20:28:30
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:30.107797 21423 logging.cpp:194] INFO level logging started!
I0917 20:28:30.112454 21423 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:30.113862 21423 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:30.114965 21423 main.cpp:434] Starting Mesos agent
I0917 20:28:30.118180 21439 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:30.118201 21439 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.20" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:30.118418 21439 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:30.118688 21439 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:30.118716 21439 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:30.118719 21439 slave.cpp:532] Agent hostname: 192.168.33.20
I0917 20:28:30.121039 21440 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
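Note: both agents print the "Agent bound to loopback interface!" warning above, so libprocess is binding to 127.0.0.1 on each machine even though --hostname is set. The warning suggests setting the --ip flag to a routable address; with the per-flag file layout from the tutorial that would look roughly like this (a sketch using the addresses from the --hostname flags above; I have not yet confirmed whether it fixes the problem):

# on slave1
echo 192.168.33.31 | sudo tee /etc/mesos-slave/ip
sudo service mesos-slave restart

# on the agent running on the master machine
echo 192.168.33.20 | sudo tee /etc/mesos-slave/ip
sudo service mesos-slave restart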
Related
Greenplum Operator on kubernetes zapr error
I am trying to deploy Greenplum Operator on kubernetes and I get the following error: kubectl describe pod greenplum-operator-87d989b4d-ldft6: Name: greenplum-operator-87d989b4d-ldft6 Namespace: greenplum Priority: 0 Node: node-1/some-ip Start Time: Mon, 23 May 2022 14:07:26 +0200 Labels: app=greenplum-operator pod-template-hash=87d989b4d Annotations: cni.projectcalico.org/podIP: some-ip cni.projectcalico.org/podIPs: some-ip Status: Running IP: some-ip IPs: IP: some-ip Controlled By: ReplicaSet/greenplum-operator-87d989b4d Containers: greenplum-operator: Container ID: docker://364997050b1f337ff61b8ce40534697bbc13aae29f7b9ae5255245375acce03f Image: greenplum-operator:v2.3.0 Image ID: docker-pullable://greenplum-operator:v2.3.0 Port: <none> Host Port: <none> Command: greenplum-operator --logLevel debug State: Waiting Reason: CrashLoopBackOff Last State: Terminated Reason: Error Exit Code: 1 Started: Mon, 23 May 2022 15:29:59 +0200 Finished: Mon, 23 May 2022 15:30:32 +0200 Ready: False Restart Count: 19 Environment: GREENPLUM_IMAGE_REPO: greenplum-operator:v2.3.0 GREENPLUM_IMAGE_TAG: v2.3.0 OPERATOR_IMAGE_REPO: greenplum-operator:v2.3.0 OPERATOR_IMAGE_TAG: v2.3.0 Mounts: /var/run/secrets/kubernetes.io/serviceaccount from greenplum-system-operator-token-xcz4q (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: greenplum-system-operator-token-xcz4q: Type: Secret (a volume populated by a Secret) SecretName: greenplum-system-operator-token-xcz4q Optional: false QoS Class: BestEffort Node-Selectors: <none> Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s node.kubernetes.io/unreachable:NoExecute op=Exists for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Warning BackOff 32s (x340 over 84m) kubelet Back-off restarting failed container kubectl logs greenplum-operator-87d989b4d-ldft6 {"level":"INFO","ts":"2022-05-23T13:35:38.735Z","logger":"setup","msg":"Go Info","Version":"go1.14.10","GOOS":"linux","GOARCH":"amd64"} {"level":"INFO","ts":"2022-05-23T13:35:41.242Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"} {"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"setup","msg":"starting manager"} {"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"admission","msg":"starting greenplum validating admission webhook server"} {"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumpxfservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.264Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumplservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.264Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumcluster","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"} {"level":"INFO","ts":"2022-05-23T13:35:41.265Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumtextservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.361Z","logger":"admission","msg":"CertificateSigningRequest: created"} {"level":"INFO","ts":"2022-05-23T13:35:41.363Z","logger":"controller-runtime.controller","msg":"Starting 
EventSource","controller":"greenplumpxfservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.364Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumplservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.364Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumcluster","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.366Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumtextservice","source":"kind source: /, Kind="} {"level":"INFO","ts":"2022-05-23T13:35:41.464Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumpxfservice"} {"level":"INFO","ts":"2022-05-23T13:35:41.464Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumplservice"} {"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumplservice","worker count":1} {"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumcluster"} {"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumpxfservice","worker count":1} {"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumcluster","worker count":1} {"level":"INFO","ts":"2022-05-23T13:35:41.466Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumtextservice"} {"level":"INFO","ts":"2022-05-23T13:35:41.466Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumtextservice","worker count":1} {"level":"ERROR","ts":"2022-05-23T13:36:11.368Z","logger":"setup","msg":"error","error":"getting certificate for webhook: failure while waiting for approval: timed out waiting for the condition","errorCauses":[{"error":"getting certificate for webhook: failure while waiting for approval: timed out waiting for the condition"}],"stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr#v0.1.0/zapr.go:128\nmain.main\n\t/greenplum-for-kubernetes/greenplum-operator/cmd/greenplumOperator/main.go:35\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"} I tried to redeploy the cert-manager and check logs but couldn't find anything. Documentation of the greenplum-for-kubernetes doesn't mention anything about that. Read the whole troubleshooting document from the pivotal website too
Ansible - Gather Confluent services Status from all hosts in a file
This is my set-up Bootstrap servers confl-server1 confl-server2 confl-server3 Connect Servers confl-server4 confl-server5 REST Proxy confl-server4 Schema Registry confl-server4 confl-server5 Control Center confl-server6 Zookeepers confl-server7 confl-server8 confl-server9 When I execute the systemctl status confluent-* command on On confl-server4, I get the below output. systemctl status confluent-* ● confluent-kafka-connect.service - Apache Kafka Connect - distributed Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-connect.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/confluent-kafka-connect.service.d └─override.conf Active: active (running) since Thu 2022-02-24 17:33:06 EST; 1 day 18h ago Docs: http://docs.confluent.io/ Main PID: 29825 (java) CGroup: /system.slice/confluent-kafka-connect.service └─29825 java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX... ● confluent-schema-registry.service - RESTful Avro schema registry for Apache Kafka Loaded: loaded (/usr/lib/systemd/system/confluent-schema-registry.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/confluent-schema-registry.service.d └─override.conf Active: active (running) since Thu 2022-01-06 15:49:55 EST; 1 months 20 days ago Docs: http://docs.confluent.io/ Main PID: 23391 (java) CGroup: /system.slice/confluent-schema-registry.service └─23391 java -Xmx1000M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.aw... ● confluent-kafka-rest.service - A REST proxy for Apache Kafka Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-rest.service; enabled; vendor preset: disabled) Drop-In: /etc/systemd/system/confluent-kafka-rest.service.d └─override.conf Active: active (running) since Sun 2022-01-02 00:06:07 EST; 1 months 25 days ago Docs: http://docs.confluent.io/ Main PID: 890 (java) CGroup: /system.slice/confluent-kafka-rest.service └─890 java -Xmx256M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.h... 
I want to write an ansible script where I can get the service status information on a particular host with the service name in a single output that I can redirect to a file on broker1 This is what I have tried ( based on some post on SO ), --- - name: Check Confluent services status # hosts: localhost hosts: all gather_facts: false become: true vars: ansible_ssh_extra_args: "-o StrictHostKeyChecking=no" ansible_host_key_checking: false tasks: - name: Check if confluent is active command: systemctl status confluent-* register: confluent_check ignore_errors: yes no_log: True failed_when: false - name: Debug message - Check if confluent is active debug: msg: "{{ ansible_play_hosts | map('extract', hostvars, 'confluent_check') | map(attribute='stdout') | list }}" but it gives the output and a lot more for different confluent components for every service in long format on every server ok: [confl-server4] => { "msg": [ "● confluent-zookeeper.service - Apache Kafka - ZooKeeper\n Loaded: loaded (/usr/lib/systemd/system/confluent-zookeeper.service; enabled; vendor preset: disabled)\n Drop-In: /etc/systemd/system/confluent-zookeeper.service.d\n └─override.conf\n Active: active (running) since Mon 2022-01-10 11:54:38 EST; 1 months 16 days ago\n Docs: http://docs.confluent.io/\n Main PID: 26052 (java)\n CGroup: /system.slice/confluent-zookeeper.service\n └─26052 java -Xmx1g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../ce-broker-plugins/build/libs/*:/usr/bin/../ce-broker-plugins/build/dependant-libs/*:/usr/bin/../ce-auth-providers/build/libs/*:/usr/bin/../ce-auth-providers/build/dependant-libs/*:/usr/bin/../ce-rest-server/build/libs/*:/usr/bin/../ce-rest-server/build/dependant-libs/*:/usr/bin/../ce-audit/build/libs/*:/usr/bin/../ce-audit/build/dependant-libs/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-metadata-service/*:/usr/bin/../share/java/rest-utils/*:/usr/bin/../share/java/confluent-common/*:/usr/bin/../share/java/confluent-security/schema-validator/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/*:/usr/bin/../support-metrics-fullcollector/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-fullcollector/build/libs/*:/usr/share/java/support-metrics-fullcollector/* -Dlog4j.configuration=file:/etc/kafka/zookeeper_log4j.properties org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/kafka/zookeeper.properties\n\nFeb 26 07:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 07:54:39,613] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO Purge task started. 
(org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,612] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,613] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)", "● confluent-zookeeper.service - Apache Kafka - ZooKeeper\n Loaded: loaded (/usr/lib/systemd/system/confluent-zookeeper.service; enabled; vendor preset: disabled)\n Drop-In: /etc/systemd/system/confluent-zookeeper.service.d\n └─override.conf\n Active: active (running) since Mon 2022-01-10 11:52:47 EST; 1 months 16 days ago\n Docs: http://docs.confluent.io/\n Main PID: 23394 (java)\n CGroup: /system.slice/confluent-zookeeper.service\n └─23394 java -Xmx1g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../ce-broker-plugins/build/libs/*:/usr/bin/../ce-broker-plugins/build/dependant-libs/*:/usr/bin/../ce-auth-providers/build/libs/*:/usr/bin/../ce-auth-providers/build/dependant-libs/*:/usr/bin/../ce-rest-server/build/libs/*:/usr/bin/../ce-rest-server/build/dependant-libs/*:/usr/bin/../ce-audit/build/libs/*:/usr/bin/../ce-audit/build/dependant-libs/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-metadata-service/*:/usr/bin/../share/java/rest-utils/*:/usr/bin/../share/java/confluent-common/*:/usr/bin/../share/java/confluent-security/schema-validator/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/*:/usr/bin/../support-metrics-fullcollector/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-fullcollector/build/libs/*:/usr/share/java/support-metrics-fullcollector/* -Dlog4j.configuration=file:/etc/kafka/zookeeper_log4j.properties org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/kafka/zookeeper.properties\n\nFeb 26 
07:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 07:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,216] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,216] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)", I also tried - name: checking service status command: systemctl status "{{ item }}" loop: "{{ ansible_facts.services.keys() | select('match', '^.*confluent.*$') | list }}" register: result ignore_errors: yes - name: checking service status showing report debug: var: result But that gives even longer output for each host I would like to get the server name, service name and status - running or failed - Server: confl-server4 Service: confluent-kafka-connect.service Active: Active (Running). or (Failed), if failed for services on all servers in a single file on the broker1 host How can I achieve that? Thank you
Getting Module Failure error while running Ansible playbook
I am getting the following error when running my Ansible playbook: hosts file [node1] rabbit-node1 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu [node2] rabbit-node2 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu [node3] rabbit-node3 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu [workers] rabbit-node2 rabbit-node3 [all_group] rabbit-node1 rabbit-node2 rabbit-node3 [all:vars] ansible_python_interpreter=/usr/bin/python3 ansible_ssh_user=ubuntu ansible_private_key_file=private key path ansible_ssh_extra_args='-o StrictHostKeyChecking=no' Error fatal: [rabbit-node1]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_name": "setup"}, "module_stderr": "OpenSSH_7.2p2 Ubuntu-4ubuntu2.1, OpenSSL 1.0.2g 1 Mar 2016\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for x.x.x.x\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 25400\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to x.x.x.x closed.\r\n", "module_stdout": " File \"/home/ubuntu/.ansible/tmp/ansible-tmp-1584361123.7-96709661573808/setup\", line 3160\r\n except OSError, e:\r\n ^\r\nSyntaxError: invalid syntax\r\n", "msg": "MODULE FAILURE", "parsed": false} 3 playbook.yml Playbook file for Setting hostname installing rabbitmq and creating cluster of rabitmq having 3 nodes. - name: deploy RabbitMQ and setup the environment hosts: - all_group #gather_facts: False user: ubuntu sudo: yes roles: - set_hostname - install_rabbitmq - name: Configure RabbitMQ Cluster hosts: - workers user: ubuntu sudo: yes roles: - cluster_setup
Missing queues from RabbitMQ Metricbeat
It looks like only a fraction of the queues on my RabbitMQ cluster are making it into Elasticsearch via Metricbeat. When I query RabbitMQ's /api/overview, I see 887 queues reported: object_totals: { consumers: 517, queues: 887, exchanges: 197, connections: 305, channels: 622 }, When I query RabbitMQ's /api/queues (which is what Metricbeat hits), I count 887 queues there as well. When I get a unique count of the field rabbitmq.queue.name in Elasticsearch, I am seeing only 309 queues. I don't see anything in the debug output that stands out to me. It's just the usual INFO level startup messages, followed by the publish information: root#rabbitmq:/etc/metricbeat# metricbeat -e 2019-06-24T21:13:33.692Z INFO instance/beat.go:571 Home path: [/usr/share/metricbeat] Config path: [/etc/metricbeat] Data path: [/var/lib/metricbeat] Logs path: [/var/log/metricbeat] 2019-06-24T21:13:33.692Z INFO instance/beat.go:579 Beat ID: xxx 2019-06-24T21:13:33.692Z INFO [index-management.ilm] ilm/ilm.go:129 Policy name: metricbeat-7.1.1 2019-06-24T21:13:33.692Z INFO [seccomp] seccomp/seccomp.go:116 Syscall filter successfully installed 2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:827 Beat info {"system_info": {"beat": {"path": {"config": "/etc/metricbeat", "data": "/var/lib/metricbeat", "home": "/usr/share/metricbeat", "logs": "/var/log/metricbeat"}, "type": "metricbeat", "uuid": "xxx"}}} 2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:836 Build info {"system_info": {"build": {"commit": "3358d9a5a09e3c6709a2d3aaafde628ea34e8419", "libbeat": "7.1.1", "time": "2019-05-23T13:23:10.000Z", "version": "7.1.1"}}} 2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:839 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":4,"version":"go1.11.5"}}} [...] 
2019-06-24T21:13:33.694Z INFO [beat] instance/beat.go:872 Process info {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/etc/metricbeat", "exe": "/usr/share/metricbeat/bin/metricbeat", "name": "metricbeat", "pid": 30898, "ppid": 30405, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2019-06-24T21:13:33.100Z"}}} 2019-06-24T21:13:33.694Z INFO instance/beat.go:280 Setup Beat: metricbeat; Version: 7.1.1 2019-06-24T21:13:33.694Z INFO [publisher] pipeline/module.go:97 Beat name: metricbeat 2019-06-24T21:13:33.694Z INFO instance/beat.go:391 metricbeat start running. 2019-06-24T21:13:33.694Z INFO cfgfile/reload.go:150 Config reloader started 2019-06-24T21:13:33.694Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s [...] 
2019-06-24T21:13:43.696Z INFO filesystem/filesystem.go:57 Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, cgroup2, tmpfs, devtmpfs, configfs, debugfs, tracefs, securityfs, sockfs, dax, bpf, pipefs, hugetlbfs, devpts, ecryptfs, fuse, fusectl, pstore, mqueue, autofs 2019-06-24T21:13:43.696Z INFO fsstat/fsstat.go:59 Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, cgroup2, tmpfs, devtmpfs, configfs, debugfs, tracefs, securityfs, sockfs, dax, bpf, pipefs, hugetlbfs, devpts, ecryptfs, fuse, fusectl, pstore, mqueue, autofs 2019-06-24T21:13:44.696Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://xxx)) 2019-06-24T21:13:44.711Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://xxx)) established 2019-06-24T21:14:03.696Z INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":130,"time":{"ms":131}},"total":{"ticks":1960,"time":{"ms":1965},"value":1960},"user":{"ticks":1830,"time":{"ms":1834}}},"handles":{"limit":{"hard":1048576,"soft":1024},"open":12},"info":{"ephemeral_id":"xxx","uptime":{"ms":30030}},"memstats":{"gc_next":30689808,"memory_alloc":21580680,"memory_total":428076400,"rss":79917056}},"libbeat":{"config":{"module":{"running":0},"reloads":2},"output":{"events":{"acked":7825,"batches":11,"total":7825},"read":{"bytes":66},"type":"logstash","write":{"bytes":870352}},"pipeline":{"clients":4,"events":{"active":313,"published":8138,"retry":523,"total":8138},"queue":{"acked":7825}}},"metricbeat":{"rabbitmq":{"connection":{"events":2987,"failures":10,"success":2977},"exchange":{"events":1970,"success":1970},"node":{"events":10,"success":10},"queue":{"events":3130,"failures":10,"success":3120}},"system":{"cpu":{"events":2,"success":2},"filesystem":{"events":7,"success":7},"fsstat":{"events":1,"success":1},"load":{"events":2,"success":2},"memory":{"events":2,"success":2},"network":{"events":4,"success":4},"process":{"events":18,"success":18},"process_summary":{"events":2,"success":2},"socket_summary":{"events":2,"success":2},"uptime":{"events":1,"success":1}}},"system":{"cpu":{"cores":4},"load":{"1":0.48,"15":0.28,"5":0.15,"norm":{"1":0.12,"15":0.07,"5":0.0375}}}}}} I think if there were a problem getting the queue, I should see an error in the logs above as per https://github.com/elastic/beats/blob/master/metricbeat/module/rabbitmq/queue/data.go#L94-L104 Here's the metricbeat.yml: metricbeat.config.modules: path: ${path.config}/modules.d/*.yml reload.enabled: true reload.period: 10s setup.template.settings: index.number_of_shards: 1 index.codec: best_compression name: metricbeat fields: environment: development processors: - add_cloud_metadata: ~ output.logstash: hosts: ["xxx"] Here's the modules.d/rabbitmq.yml: - module: rabbitmq metricsets: ["node", "queue", "connection", "exchange"] enabled: true period: 2s hosts: ["xxx"] username: xxx password: xxx
I solved it by upgrading Elastic Stack from 7.1.1 to 7.2.0.
websockets on GKE with istio gives 'no healthy upstream' and 'CrashLoopBackOff'
I am on GKE using Istio version 1.0.3 . I try to get my express.js with socket.io (and uws engine) backend working with websockets and had this backend running before on a 'non kubernetes server' with websockets without problems. When I simply enter the external_gke_ip as url I get my backend html page, so http works. But when my client-app makes socketio authentication calls from my client-app I get 503 errors in the browser console: WebSocket connection to 'ws://external_gke_ip/socket.io/?EIO=3&transport=websocket' failed: Error during WebSocket handshake: Unexpected response code: 503 And when I enter the external_gke_ip as url while socket calls are made I get: no healthy upstream in the browser. And the pod gives: CrashLoopBackOff. I find somewhere: 'in node.js land, socket.io typically does a few non-websocket Handshakes to the Server before eventually upgrading to Websockets. If you don't have sticky-sessions, the upgrade never works.' So maybe I need sticky sessions? Or not... as I just have one replica of my app? It seems to be done by setting sessionAffinity: ClientIP, but with istio I do not know how to do this and in the GUI I can edit some values of the loadbalancers, but Session affinity shows 'none' and I can not edit it. Other settings that might be relevant and that I am not sure of (how to set using istio) are: externalTrafficPolicy=Local Ttl My manifest config file: apiVersion: v1 kind: Service metadata: name: myapp labels: app: myapp spec: selector: app: myapp ports: - port: 8089 targetPort: 8089 protocol: TCP name: http --- apiVersion: apps/v1 kind: Deployment metadata: name: myapp labels: app: myapp spec: selector: matchLabels: app: myapp template: metadata: labels: app: myapp spec: containers: - name: app image: gcr.io/myproject/firstapp:v1 imagePullPolicy: Always ports: - containerPort: 8089 env: - name: POSTGRES_DB_HOST value: 127.0.0.1:5432 - name: POSTGRES_DB_USER valueFrom: secretKeyRef: name: mysecret key: username - name: POSTGRES_DB_PASSWORD valueFrom: secretKeyRef: name: mysecret key: password readinessProbe: httpGet: path: /healthz scheme: HTTP port: 8089 initialDelaySeconds: 10 timeoutSeconds: 5 - name: cloudsql-proxy image: gcr.io/cloudsql-docker/gce-proxy:1.11 command: ["/cloud_sql_proxy", "-instances=myproject:europe-west4:osm=tcp:5432", "-credential_file=/secrets/cloudsql/credentials.json"] securityContext: runAsUser: 2 allowPrivilegeEscalation: false volumeMounts: - name: cloudsql-instance-credentials mountPath: /secrets/cloudsql readOnly: true volumes: - name: cloudsql-instance-credentials secret: secretName: cloudsql-instance-credentials --- apiVersion: networking.istio.io/v1alpha3 kind: Gateway metadata: name: myapp-gateway spec: selector: istio: ingressgateway servers: - port: number: 80 name: http protocol: HTTP hosts: - "*" --- apiVersion: networking.istio.io/v1alpha3 kind: VirtualService metadata: name: myapp spec: hosts: - "*" gateways: - myapp-gateway http: - match: - uri: prefix: / route: - destination: host: myapp weight: 100 websocketUpgrade: true --- apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: google-apis spec: hosts: - "*.googleapis.com" ports: - number: 443 name: https protocol: HTTPS location: MESH_EXTERNAL --- apiVersion: networking.istio.io/v1alpha3 kind: ServiceEntry metadata: name: cloud-sql-instance spec: hosts: - 35.204.XXX.XX # ip of cloudsql database ports: - name: tcp number: 3307 protocol: TCP location: MESH_EXTERNAL Various output (while making socket calls, when I stop these the deployment 
Various output (while making socket calls; when I stop these, the deployment restarts and READY returns to 3/3):

$ kubectl get pods
NAME         READY   STATUS             RESTARTS   AGE
myapp-8888   2/3     CrashLoopBackOff   11         1h

$ kubectl describe pod/myapp-8888 gives:

Name:           myapp-8888
Namespace:      default
Node:           gke-standard-cluster-1-default-pool-888888-9vtk/10.164.0.36
Start Time:     Sat, 19 Jan 2019 14:33:11 +0100
Labels:         app=myapp
                pod-template-hash=207157
Annotations:    kubernetes.io/limit-ranger: LimitRanger plugin set: cpu request for container app; cpu request for container cloudsql-proxy
                sidecar.istio.io/status: {"version":"3c9617ff82c9962a58890e4fa987c69ca62487fda71c23f3a2aad1d7bb46c748","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status:         Running
IP:             10.44.0.5
Controlled By:  ReplicaSet/myapp-64c59c94dc
Init Containers:
  istio-init:
    Container ID:   docker://a417695f99509707d0f4bfa45d7d491501228031996b603c22aaf398551d1e45
    Image:          gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0
    Image ID:       docker-pullable://gcr.io/gke-release/istio/proxy_init@sha256:e30d47d2f269347a973523d0c5d7540dbf7f87d24aca2737ebc09dbe5be53134
    Port:           <none>
    Host Port:      <none>
    Args:           -p 15001 -u 1337 -m REDIRECT -i * -x -b 8089, -d
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jan 2019 14:33:19 +0100
      Finished:     Sat, 19 Jan 2019 14:33:19 +0100
    Ready:          True
    Restart Count:  0
    Environment:    <none>
    Mounts:         <none>
Containers:
  app:
    Container ID:   docker://888888888888888888888888
    Image:          gcr.io/myproject/firstapp:v1
    Image ID:       docker-pullable://gcr.io/myproject/firstapp@sha256:8888888888888888888888888
    Port:           8089/TCP
    Host Port:      0/TCP
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jan 2019 14:40:14 +0100
      Finished:     Sat, 19 Jan 2019 14:40:37 +0100
    Last State:     Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Sat, 19 Jan 2019 14:39:28 +0100
      Finished:     Sat, 19 Jan 2019 14:39:46 +0100
    Ready:          False
    Restart Count:  3
    Requests:
      cpu:          100m
    Readiness:      http-get http://:8089/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
    Environment:
      POSTGRES_DB_HOST:      127.0.0.1:5432
      POSTGRES_DB_USER:      <set to the key 'username' in secret 'mysecret'>  Optional: false
      POSTGRES_DB_PASSWORD:  <set to the key 'password' in secret 'mysecret'>  Optional: false
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
  cloudsql-proxy:
    Container ID:   docker://788888888888888888888888888
    Image:          gcr.io/cloudsql-docker/gce-proxy:1.11
    Image ID:       docker-pullable://gcr.io/cloudsql-docker/gce-proxy@sha256:5c690349ad8041e8b21eaa63cb078cf13188568e0bfac3b5a914da3483079e2b
    Port:           <none>
    Host Port:      <none>
    Command:        /cloud_sql_proxy -instances=myproject:europe-west4:osm=tcp:5432 -credential_file=/secrets/cloudsql/credentials.json
    State:          Running
      Started:      Sat, 19 Jan 2019 14:33:40 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:          100m
    Environment:    <none>
    Mounts:
      /secrets/cloudsql from cloudsql-instance-credentials (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
  istio-proxy:
    Container ID:   docker://f3873d0f69afde23e85d6d6f85b1f
    Image:          gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0
    Image ID:       docker-pullable://gcr.io/gke-release/istio/proxyv2@sha256:826ef4469e4f1d4cabd0dc846
    Port:           <none>
    Host Port:      <none>
    Args:           proxy sidecar --configPath /etc/istio/proxy --binaryPath /usr/local/bin/envoy --serviceCluster myapp --drainDuration 45s --parentShutdownDuration 1m0s --discoveryAddress istio-pilot.istio-system:15007 --discoveryRefreshDelay 1s --zipkinAddress zipkin.istio-system:9411 --connectTimeout 10s --statsdUdpAddress istio-statsd-prom-bridge.istio-system:9125 --proxyAdminPort 15000
                    --controlPlaneAuthPolicy NONE
    State:          Running
      Started:      Sat, 19 Jan 2019 14:33:54 +0100
    Ready:          True
    Restart Count:  0
    Requests:
      cpu:          10m
    Environment:
      POD_NAME:                      myapp-64c59c94dc-8888 (v1:metadata.name)
      POD_NAMESPACE:                 default (v1:metadata.namespace)
      INSTANCE_IP:                   (v1:status.podIP)
      ISTIO_META_POD_NAME:           myapp-64c59c94dc-8888 (v1:metadata.name)
      ISTIO_META_INTERCEPTION_MODE:  REDIRECT
    Mounts:
      /etc/certs/ from istio-certs (ro)
      /etc/istio/proxy from istio-envoy (rw)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  cloudsql-instance-credentials:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  cloudsql-instance-credentials
    Optional:    false
  default-token-rclsf:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-rclsf
    Optional:    false
  istio-envoy:
    Type:    EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:  Memory
  istio-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  istio.default
    Optional:    true
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason                 Age                   From                                                        Message
  ----     ------                 ----                  ----                                                        -------
  Normal   Scheduled              7m31s                 default-scheduler                                           Successfully assigned myapp-64c59c94dc-tdb9c to gke-standard-cluster-1-default-pool-65b9e650-9vtk
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "istio-envoy"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "cloudsql-instance-credentials"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "default-token-rclsf"
  Normal   SuccessfulMountVolume  7m31s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  MountVolume.SetUp succeeded for volume "istio-certs"
  Normal   Pulling                7m30s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
  Normal   Pulled                 7m25s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
  Normal   Created                7m24s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                7m23s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                7m4s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/cloudsql-docker/gce-proxy:1.11"
  Normal   Pulled                 7m3s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.11"
  Normal   Started                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
  Normal   Created                7m2s                  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Pulled                 6m54s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
  Normal   Created                6m51s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                6m48s                 kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulling                111s (x2 over 7m22s)  kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  pulling image "gcr.io/myproject/firstapp:v3"
  Normal   Created                110s (x2 over 7m4s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Created container
  Normal   Started                110s (x2 over 7m4s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Started container
  Normal   Pulled                 110s (x2 over 7m7s)   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Successfully pulled image "gcr.io/myproject/firstapp:v3"
  Warning  Unhealthy              99s                   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Readiness probe failed: HTTP probe failed with statuscode: 503
  Warning  BackOff                85s                   kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk  Back-off restarting failed container

And:

$ kubectl logs myapp-8888 myapp

> api_server@0.0.0 start /usr/src/app
> node src/

info: Feathers application started on http://localhost:8089

And the database logs (which look ok, as some 'startup script entries' from the app can be retrieved using psql):

$ kubectl logs myapp-8888 cloudsql-proxy
2019/01/19 13:33:40 using credential file for authentication; email=proxy-user@myproject.iam.gserviceaccount.com
2019/01/19 13:33:40 Listening on 127.0.0.1:5432 for myproject:europe-west4:osm
2019/01/19 13:33:40 Ready for new connections
2019/01/19 13:33:54 New connection for "myproject:europe-west4:osm"
2019/01/19 13:33:55 couldn't connect to "myproject:europe-west4:osm": Post https://www.googleapis.com/sql/v1beta4/projects/myproject/instances/osm/createEphemeral?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp 74.125.143.95:443: getsockopt: connection refused
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 Client closed local connection on 127.0.0.1:5432
2019/01/19 13:39:13 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"

EDIT: Here is the server-side log of the 503s on the websocket calls to my app:

{
  insertId: "465nu9g3xcn5hf"
  jsonPayload: {
    apiClaims: ""
    apiKey: ""
    clientTraceId: ""
    connection_security_policy: "unknown"
    destinationApp: "myapp"
    destinationIp: "10.44.XX.XX"
    destinationName: "myapp-888888-88888"
    destinationNamespace: "default"
    destinationOwner: "kubernetes://apis/extensions/v1beta1/namespaces/default/deployments/myapp"
    destinationPrincipal: ""
    destinationServiceHost: "myapp.default.svc.cluster.local"
    destinationWorkload: "myapp"
    httpAuthority: "35.204.XXX.XXX"
    instance: "accesslog.logentry.istio-system"
    latency: "1.508885ms"
    level: "info"
    method: "GET"
    protocol: "http"
    receivedBytes: 787
    referer: ""
    reporter: "source"
    requestId: "bb31d922-8f5d-946b-95c9-83e4c022d955"
    requestSize: 0
    requestedServerName: ""
    responseCode: 503
    responseSize: 57
    responseTimestamp: "2019-01-18T20:53:03.966513Z"
    sentBytes: 164
    sourceApp: "istio-ingressgateway"
    sourceIp: "10.44.X.X"
    sourceName: "istio-ingressgateway-8888888-88888"
    sourceNamespace: "istio-system"
    sourceOwner: "kubernetes://apis/extensions/v1beta1/namespaces/istio-system/deployments/istio-ingressgateway"
    sourcePrincipal: ""
    sourceWorkload: "istio-ingressgateway"
    url: "/socket.io/?EIO=3&transport=websocket"
    userAgent: "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
    xForwardedFor: "10.44.X.X"
  }
  logName: "projects/myproject/logs/stdout"
  metadata: {
    systemLabels: {
"gcr.io/gke-release/istio/mixer:1.0.2-gke.0" container_image_id: "docker-pullable://gcr.io/gke-release/istio/mixer#sha256:888888888888888888888888888888" name: "mixer" node_name: "gke-standard-cluster-1-default-pool-88888888888-8887" provider_instance_id: "888888888888" provider_resource_type: "gce_instance" provider_zone: "europe-west4-a" service_name: [ 0: "istio-telemetry" ] top_level_controller_name: "istio-telemetry" top_level_controller_type: "Deployment" } userLabels: { app: "telemetry" istio: "mixer" istio-mixer-type: "telemetry" pod-template-hash: "88888888888" } } receiveTimestamp: "2019-01-18T20:53:08.135805255Z" resource: { labels: { cluster_name: "standard-cluster-1" container_name: "mixer" location: "europe-west4-a" namespace_name: "istio-system" pod_name: "istio-telemetry-8888888-8888888" project_id: "myproject" } type: "k8s_container" } severity: "INFO" timestamp: "2019-01-18T20:53:03.965100Z" } In the browser at first it properly seems to switch protocol but then causes a repeated 503 response and subsequent health issues cause a repeating restart. The protocol switch websocket call: General: Request URL: ws://localhost:8080/sockjs-node/842/s4888/websocket Request Method: GET Status Code: 101 Switching Protocols [GREEN] Response headers: Connection: Upgrade Sec-WebSocket-Accept: NS8888888888888888888 Upgrade: websocket Request headers: Accept-Encoding: gzip, deflate, br Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 Cache-Control: no-cache Connection: Upgrade Cookie: _ga=GA1.1.1118102238.18888888; hblid=nSNQ2mS8888888888888; olfsk=ol8888888888 Host: localhost:8080 Origin: http://localhost:8080 Pragma: no-cache Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits Sec-WebSocket-Key: b8zkVaXlEySHasCkD4aUiw== Sec-WebSocket-Version: 13 Upgrade: websocket User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1 Its frames: Following the above I get multiple of these: Chrome output regarding websocket call: general: Request URL: ws://35.204.210.134/socket.io/?EIO=3&transport=websocket Request Method: GET Status Code: 503 Service Unavailable response headers: connection: close content-length: 19 content-type: text/plain date: Sat, 19 Jan 2019 14:06:39 GMT server: envoy request headers: Accept-Encoding: gzip, deflate Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7 Cache-Control: no-cache Connection: Upgrade Host: 35.204.210.134 Origin: http://localhost:8080 Pragma: no-cache Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits Sec-WebSocket-Key: VtKS5xKF+GZ4u3uGih2fig== Sec-WebSocket-Version: 13 Upgrade: websocket User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1 The frames: Data: (Opcode -1) Length: 63 Time: 15:06:44.412
Using uws (uWebSockets) as the websocket engine was causing these errors. When I swap this code in my backend app:

app.configure(socketio({
  wsEngine: 'uws',
  timeout: 120000,
  reconnect: true
}))

for this:

app.configure(socketio())

everything works as expected.

EDIT: Now it also works with uws. I was using an Alpine Docker image based on Node 10, which does not work with uws. After switching to an image based on Node 8, it works.
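For completeness, the Deployment change amounts to pointing the app container at an image rebuilt from a Node 8 base. Something like the sketch below; the tag v2-node8 is made up for illustration, the actual tag does not matter:

# Only the image of the "app" container changes; everything else in the
# Deployment stays as in the question. "v2-node8" is a hypothetical tag for
# an image rebuilt from a Node 8 base instead of the Node 10 Alpine base.
spec:
  template:
    spec:
      containers:
      - name: app
        image: gcr.io/myproject/firstapp:v2-node8
        imagePullPolicy: Always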