Changing Hostname on Mesos-Slave not working - mesos

I followed the tutorial from https://open.mesosphere.com/getting-started/install/ to setup mesos and marathon.
I am using Vagrant to create 2 nodes, a master and a slave.
At the end of the tutorial I have marathon and mesos functioning.
First problem: Only the slave running on the master machine is visible to Mesos. The "independent" slave on the second Vagrant node is not visible in Mesos, even though I have put the same settings in /etc/mesos/zk on both nodes. From what I understand, this is the file that provides the master node addresses.
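For reference, /etc/mesos/zk on both nodes contains a single line along these lines (matching the --master flag visible in the logs below):
zk://192.168.33.20:2181/mesos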
Second problem: When I set the hostname for the slave on the master machine to its IP address, the slave does not run.
When I remove the file /etc/mesos-slave/hostname and restart, the slave starts running again.
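For context, I set the hostname roughly like this on the master machine (192.168.33.20 is that node's private-network address):
echo 192.168.33.20 | sudo tee /etc/mesos-slave/hostname
sudo service mesos-slave restart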
I get the following logs in /var/log/mesos:
Log file created at: 2016/09/13 19:38:57
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0913 19:38:57.316082 2870 logging.cpp:194] INFO level logging started!
I0913 19:38:57.319680 2870 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0913 19:38:57.321099 2870 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0913 19:38:57.322904 2870 main.cpp:434] Starting Mesos agent
I0913 19:38:57.323637 2887 slave.cpp:198] Agent started on 1)#10.0.2.15:5051
I0913 19:38:57.323648 2887 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.20" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I0913 19:38:57.323942 2887 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0913 19:38:57.323969 2887 slave.cpp:527] Agent attributes: [ ]
I0913 19:38:57.323974 2887 slave.cpp:532] Agent hostname: 192.168.33.20
I0913 19:38:57.326578 2886 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
After this, when I run "sudo service mesos-slave status", it says stop/waiting.
I am not sure how to go about dealing with these two problems. Any help is appreciated.
Update
On the "Independent Slave Machine" I am getting the following logs:
file: mesos-slave.vagrant-ubuntu-trusty-64.invalid-user.log.ERROR.20160914-141226.1197
Log file created at: 2016/09/14 14:12:26
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0914 14:12:26.699146 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:26.700430 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:27.634099 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:28.784499 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:34.914746 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:36.906472 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:37.242663 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:40.442214 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:42.033504 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:47.239245 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:12:50.712105 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0914 14:13:03.200935 1234 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
file: mesos-slave.vagrant-ubuntu-trusty-64.invalid-user.log.INFO.20160914-141502.4788
Log file created at: 2016/09/14 14:15:02
Running on machine: vagrant-ubuntu-trusty-64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0914 14:15:02.491973 4788 logging.cpp:194] INFO level logging started!
I0914 14:15:02.495968 4788 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0914 14:15:02.497270 4788 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0914 14:15:02.498855 4788 main.cpp:434] Starting Mesos agent
I0914 14:15:02.499091 4788 slave.cpp:198] Agent started on 1)#10.0.2.15:5051
I0914 14:15:02.499195 4788 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
I0914 14:15:02.499560 4788 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0914 14:15:02.499620 4788 slave.cpp:527] Agent attributes: [ ]
I0914 14:15:02.499650 4788 slave.cpp:532] Agent hostname: 192.168.33.31
I0914 14:15:02.502511 4803 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0914 14:15:02.502554 4803 state.cpp:697] No checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
I0914 14:15:02.502630 4803 state.cpp:100] Failed to find the latest agent from '/var/lib/mesos/meta'
I0914 14:15:02.510077 4807 status_update_manager.cpp:200] Recovering status update manager
I0914 14:15:02.510150 4807 containerizer.cpp:522] Recovering containerizer
I0914 14:15:02.510758 4807 provisioner.cpp:253] Provisioner recovery complete
I0914 14:15:02.510815 4807 slave.cpp:4782] Finished recovery
I0914 14:15:02.511342 4804 group.cpp:349] Group process (group(1)#10.0.2.15:5051) connected to ZooKeeper
I0914 14:15:02.511368 4804 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0914 14:15:02.511376 4804 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0914 14:15:02.513720 4804 detector.cpp:152] Detected a new leader: (id='4')
I0914 14:15:02.513813 4804 group.cpp:706] Trying to get '/mesos/json.info_0000000004' in ZooKeeper
I0914 14:15:02.514854 4804 zookeeper.cpp:259] A new leading master (UPID=master#10.0.2.15:5050) is detected
I0914 14:15:02.514928 4804 slave.cpp:895] New master detected at master#10.0.2.15:5050
I0914 14:15:02.514940 4804 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0914 14:15:02.514961 4804 slave.cpp:927] Detecting new master
I0914 14:15:02.514976 4804 status_update_manager.cpp:174] Pausing sending status updates
E0914 14:15:03.228878 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:03.229086 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:03.229099 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:03.342586 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:03.342675 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:03.342685 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:06.773352 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:06.773438 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:06.773448 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:09.190912 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:09.191007 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:09.191017 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:16.597836 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:15:16.597929 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:16.597940 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
I0914 14:15:33.944555 4809 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:15:33.944607 4809 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0914 14:15:33.944682 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:16:02.515676 4804 slave.cpp:4591] Current disk usage 4.72%. Max allowed age: 5.969647788608773days
E0914 14:16:11.307096 4811 process.cpp:2105] Failed to shutdown socket with fd 11: Transport endpoint is not connected
I0914 14:16:11.307189 4806 slave.cpp:3732] master#10.0.2.15:5050 exited
W0914 14:16:11.307199 4806 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
Update 2
The configuration on both machines appears to be the same (I say "appears" because I have verified that it is identical), yet I still cannot get the remote slave to connect, so something must be going wrong.
The logs for the machine slave1 are as follows:
mesos-slave.slave1.invalid-user.log.WARNING
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:34.797917 17129 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:35.612185 17133 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:37.841723 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:38.358711 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
W0917 20:28:51.705704 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
mesos-slave.slave1.invalid-user.log.INFO
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:34.011777 17112 logging.cpp:194] INFO level logging started!
I0917 20:28:34.014294 17112 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:34.016263 17112 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:34.017916 17112 main.cpp:434] Starting Mesos agent
I0917 20:28:34.018307 17112 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:34.018381 17112 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:34.018896 17112 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:34.018959 17112 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:34.018987 17112 slave.cpp:532] Agent hostname: 192.168.33.31
I0917 20:28:34.022061 17127 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
I0917 20:28:34.022337 17127 state.cpp:697] No checkpointed resources found at '/var/lib/mesos/meta/resources/resources.info'
I0917 20:28:34.022431 17127 state.cpp:100] Failed to find the latest agent from '/var/lib/mesos/meta'
I0917 20:28:34.028128 17133 group.cpp:349] Group process (group(1)#127.0.0.1:5051) connected to ZooKeeper
I0917 20:28:34.028177 17133 group.cpp:837] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
I0917 20:28:34.028187 17133 group.cpp:427] Trying to create path '/mesos' in ZooKeeper
I0917 20:28:34.028659 17130 status_update_manager.cpp:200] Recovering status update manager
I0917 20:28:34.028875 17129 containerizer.cpp:522] Recovering containerizer
I0917 20:28:34.029595 17129 provisioner.cpp:253] Provisioner recovery complete
I0917 20:28:34.029912 17112 slave.cpp:4782] Finished recovery
I0917 20:28:34.030637 17133 detector.cpp:152] Detected a new leader: (id='6')
I0917 20:28:34.030733 17133 group.cpp:706] Trying to get '/mesos/json.info_0000000006' in ZooKeeper
I0917 20:28:34.032158 17133 zookeeper.cpp:259] A new leading master (UPID=master#127.0.0.1:5050) is detected
I0917 20:28:34.032232 17133 slave.cpp:895] New master detected at master#127.0.0.1:5050
I0917 20:28:34.032245 17133 slave.cpp:916] No credentials provided. Attempting to register without authentication
I0917 20:28:34.032263 17133 slave.cpp:927] Detecting new master
I0917 20:28:34.032281 17133 status_update_manager.cpp:174] Pausing sending status updates
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:34.797904 17129 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:34.797917 17129 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:35.612174 17133 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:35.612185 17133 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:37.841713 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:37.841723 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:38.358700 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:38.358711 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
I0917 20:28:51.705665 17128 slave.cpp:3732] master#127.0.0.1:5050 exited
W0917 20:28:51.705704 17128 slave.cpp:3737] Master disconnected! Waiting for a new master to be elected
mesos-slave.slave1.invalid-user.log.ERROR
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
mesos-slave.INFO
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:34.011777 17112 logging.cpp:194] INFO level logging started!
I0917 20:28:34.014294 17112 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:34.016263 17112 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:34.017916 17112 main.cpp:434] Starting Mesos agent
I0917 20:28:34.018307 17112 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:34.018381 17112 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.31" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:34.018565 17112 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:34.018896 17112 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:34.018959 17112 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:34.018987 17112 slave.cpp:532] Agent hostname: 192.168.33.31
I0917 20:28:34.022061 17127 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
mesos-slave.ERROR
Log file created at: 2016/09/17 20:28:34
Running on machine: slave1
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
E0917 20:28:34.797722 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:35.612090 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:37.841622 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:38.358543 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
E0917 20:28:51.705592 17135 process.cpp:2105] Failed to shutdown socket with fd 12: Transport endpoint is not connected
The logs for the master machine are as follows:
mesos-slave.master.invalid-user.log.WARNING
Log file created at: 2016/09/17 20:28:30
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
W0917 20:28:30.118418 21439 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
mesos-slave.master.invalid-user.log.INFO
Log file created at: 2016/09/17 20:28:30
Running on machine: master
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0917 20:28:30.107797 21423 logging.cpp:194] INFO level logging started!
I0917 20:28:30.112454 21423 containerizer.cpp:196] Using isolation: posix/cpu,posix/mem,filesystem/posix,network/cni
I0917 20:28:30.113862 21423 linux_launcher.cpp:101] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
I0917 20:28:30.114965 21423 main.cpp:434] Starting Mesos agent
I0917 20:28:30.118180 21439 slave.cpp:198] Agent started on 1)#127.0.0.1:5051
I0917 20:28:30.118201 21439 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="1mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="192.168.33.20" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="posix/cpu,posix/mem" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://192.168.33.20:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
W0917 20:28:30.118418 21439 slave.cpp:202]
**************************************************
Agent bound to loopback interface! Cannot communicate with remote master(s). You might want to set '--ip' flag to a routable IP address.
**************************************************
I0917 20:28:30.118688 21439 slave.cpp:519] Agent resources: cpus(*):1; mem(*):244; disk(*):35164; ports(*):[31000-32000]
I0917 20:28:30.118716 21439 slave.cpp:527] Agent attributes: [ ]
I0917 20:28:30.118719 21439 slave.cpp:532] Agent hostname: 192.168.33.20
I0917 20:28:30.121039 21440 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
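Given the "Agent bound to loopback interface" warning on both machines, one thing I plan to try (just a sketch, not yet verified) is setting the agent's bind address via an /etc/mesos-slave/ip file, the same way the hostname file works, e.g. on slave1:
echo 192.168.33.31 | sudo tee /etc/mesos-slave/ip
sudo service mesos-slave restart
(and analogously 192.168.33.20 for the slave on the master machine).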

Related

Greenplum Operator on kubernetes zapr error

I am trying to deploy the Greenplum Operator on Kubernetes and I get the following error:
kubectl describe pod greenplum-operator-87d989b4d-ldft6:
Name: greenplum-operator-87d989b4d-ldft6
Namespace: greenplum
Priority: 0
Node: node-1/some-ip
Start Time: Mon, 23 May 2022 14:07:26 +0200
Labels: app=greenplum-operator
pod-template-hash=87d989b4d
Annotations: cni.projectcalico.org/podIP: some-ip
cni.projectcalico.org/podIPs: some-ip
Status: Running
IP: some-ip
IPs:
IP: some-ip
Controlled By: ReplicaSet/greenplum-operator-87d989b4d
Containers:
greenplum-operator:
Container ID: docker://364997050b1f337ff61b8ce40534697bbc13aae29f7b9ae5255245375acce03f
Image: greenplum-operator:v2.3.0
Image ID: docker-pullable://greenplum-operator:v2.3.0
Port: <none>
Host Port: <none>
Command:
greenplum-operator
--logLevel
debug
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Mon, 23 May 2022 15:29:59 +0200
Finished: Mon, 23 May 2022 15:30:32 +0200
Ready: False
Restart Count: 19
Environment:
GREENPLUM_IMAGE_REPO: greenplum-operator:v2.3.0
GREENPLUM_IMAGE_TAG: v2.3.0
OPERATOR_IMAGE_REPO: greenplum-operator:v2.3.0
OPERATOR_IMAGE_TAG: v2.3.0
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from greenplum-system-operator-token-xcz4q (ro)
Conditions:
Type Status
Initialized True
Ready False
ContainersReady False
PodScheduled True
Volumes:
greenplum-system-operator-token-xcz4q:
Type: Secret (a volume populated by a Secret)
SecretName: greenplum-system-operator-token-xcz4q
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning BackOff 32s (x340 over 84m) kubelet Back-off restarting failed container
kubectl logs greenplum-operator-87d989b4d-ldft6
{"level":"INFO","ts":"2022-05-23T13:35:38.735Z","logger":"setup","msg":"Go Info","Version":"go1.14.10","GOOS":"linux","GOARCH":"amd64"}
{"level":"INFO","ts":"2022-05-23T13:35:41.242Z","logger":"controller-runtime.metrics","msg":"metrics server is starting to listen","addr":":8080"}
{"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"setup","msg":"starting manager"}
{"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"admission","msg":"starting greenplum validating admission webhook server"}
{"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumpxfservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.264Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumplservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.264Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumcluster","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.262Z","logger":"controller-runtime.manager","msg":"starting metrics server","path":"/metrics"}
{"level":"INFO","ts":"2022-05-23T13:35:41.265Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumtextservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.361Z","logger":"admission","msg":"CertificateSigningRequest: created"}
{"level":"INFO","ts":"2022-05-23T13:35:41.363Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumpxfservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.364Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumplservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.364Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumcluster","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.366Z","logger":"controller-runtime.controller","msg":"Starting EventSource","controller":"greenplumtextservice","source":"kind source: /, Kind="}
{"level":"INFO","ts":"2022-05-23T13:35:41.464Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumpxfservice"}
{"level":"INFO","ts":"2022-05-23T13:35:41.464Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumplservice"}
{"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumplservice","worker count":1}
{"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumcluster"}
{"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumpxfservice","worker count":1}
{"level":"INFO","ts":"2022-05-23T13:35:41.465Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumcluster","worker count":1}
{"level":"INFO","ts":"2022-05-23T13:35:41.466Z","logger":"controller-runtime.controller","msg":"Starting Controller","controller":"greenplumtextservice"}
{"level":"INFO","ts":"2022-05-23T13:35:41.466Z","logger":"controller-runtime.controller","msg":"Starting workers","controller":"greenplumtextservice","worker count":1}
{"level":"ERROR","ts":"2022-05-23T13:36:11.368Z","logger":"setup","msg":"error","error":"getting certificate for webhook: failure while waiting for approval: timed out waiting for the condition","errorCauses":[{"error":"getting certificate for webhook: failure while waiting for approval: timed out waiting for the condition"}],"stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/pkg/mod/github.com/go-logr/zapr#v0.1.0/zapr.go:128\nmain.main\n\t/greenplum-for-kubernetes/greenplum-operator/cmd/greenplumOperator/main.go:35\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}
I tried redeploying cert-manager and checking its logs but couldn't find anything. The greenplum-for-kubernetes documentation doesn't mention anything about this; I also read the whole troubleshooting document on the Pivotal website.
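Since the error is about the webhook's CertificateSigningRequest timing out while waiting for approval, one thing worth checking (the CSR name below is a placeholder) is whether the CSR created by the operator is stuck in a Pending state:
kubectl get csr
kubectl describe csr <greenplum-webhook-csr-name>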

Ansible - Gather Confluent services Status from all hosts in a file

This is my set-up
Bootstrap servers
confl-server1
confl-server2
confl-server3
Connect Servers
confl-server4
confl-server5
REST Proxy
confl-server4
Schema Registry
confl-server4
confl-server5
Control Center
confl-server6
Zookeepers
confl-server7
confl-server8
confl-server9
When I execute the systemctl status confluent-* command on confl-server4, I get the output below.
systemctl status confluent-*
● confluent-kafka-connect.service - Apache Kafka Connect - distributed
Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-connect.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/confluent-kafka-connect.service.d
└─override.conf
Active: active (running) since Thu 2022-02-24 17:33:06 EST; 1 day 18h ago
Docs: http://docs.confluent.io/
Main PID: 29825 (java)
CGroup: /system.slice/confluent-kafka-connect.service
└─29825 java -Xms256M -Xmx2G -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX...
● confluent-schema-registry.service - RESTful Avro schema registry for Apache Kafka
Loaded: loaded (/usr/lib/systemd/system/confluent-schema-registry.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/confluent-schema-registry.service.d
└─override.conf
Active: active (running) since Thu 2022-01-06 15:49:55 EST; 1 months 20 days ago
Docs: http://docs.confluent.io/
Main PID: 23391 (java)
CGroup: /system.slice/confluent-schema-registry.service
└─23391 java -Xmx1000M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.aw...
● confluent-kafka-rest.service - A REST proxy for Apache Kafka
Loaded: loaded (/usr/lib/systemd/system/confluent-kafka-rest.service; enabled; vendor preset: disabled)
Drop-In: /etc/systemd/system/confluent-kafka-rest.service.d
└─override.conf
Active: active (running) since Sun 2022-01-02 00:06:07 EST; 1 months 25 days ago
Docs: http://docs.confluent.io/
Main PID: 890 (java)
CGroup: /system.slice/confluent-kafka-rest.service
└─890 java -Xmx256M -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -Djava.awt.h...
I want to write an Ansible playbook that collects the status of each Confluent service, together with the host name and service name, into a single output that I can redirect to a file on broker1.
This is what I have tried (based on a post on SO):
---
- name: Check Confluent services status
  # hosts: localhost
  hosts: all
  gather_facts: false
  become: true
  vars:
    ansible_ssh_extra_args: "-o StrictHostKeyChecking=no"
    ansible_host_key_checking: false
  tasks:
    - name: Check if confluent is active
      command: systemctl status confluent-*
      register: confluent_check
      ignore_errors: yes
      no_log: True
      failed_when: false
    - name: Debug message - Check if confluent is active
      debug:
        msg: "{{ ansible_play_hosts | map('extract', hostvars, 'confluent_check') | map(attribute='stdout') | list }}"
but it gives that output, and a lot more, for every Confluent component and every service, in long format, on every server:
ok: [confl-server4] => {
"msg": [
"● confluent-zookeeper.service - Apache Kafka - ZooKeeper\n Loaded: loaded (/usr/lib/systemd/system/confluent-zookeeper.service; enabled; vendor preset: disabled)\n Drop-In: /etc/systemd/system/confluent-zookeeper.service.d\n └─override.conf\n Active: active (running) since Mon 2022-01-10 11:54:38 EST; 1 months 16 days ago\n Docs: http://docs.confluent.io/\n Main PID: 26052 (java)\n CGroup: /system.slice/confluent-zookeeper.service\n └─26052 java -Xmx1g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../ce-broker-plugins/build/libs/*:/usr/bin/../ce-broker-plugins/build/dependant-libs/*:/usr/bin/../ce-auth-providers/build/libs/*:/usr/bin/../ce-auth-providers/build/dependant-libs/*:/usr/bin/../ce-rest-server/build/libs/*:/usr/bin/../ce-rest-server/build/dependant-libs/*:/usr/bin/../ce-audit/build/libs/*:/usr/bin/../ce-audit/build/dependant-libs/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-metadata-service/*:/usr/bin/../share/java/rest-utils/*:/usr/bin/../share/java/confluent-common/*:/usr/bin/../share/java/confluent-security/schema-validator/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/*:/usr/bin/../support-metrics-fullcollector/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-fullcollector/build/libs/*:/usr/share/java/support-metrics-fullcollector/* -Dlog4j.configuration=file:/etc/kafka/zookeeper_log4j.properties org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/kafka/zookeeper.properties\n\nFeb 26 07:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 07:54:39,613] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 08:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 08:54:39,612] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,612] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 09:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 09:54:39,613] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO Purge task started. 
(org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 10:54:39 confl-server8 zookeeper-server-start[26052]: [2022-02-26 10:54:39,612] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)",
"● confluent-zookeeper.service - Apache Kafka - ZooKeeper\n Loaded: loaded (/usr/lib/systemd/system/confluent-zookeeper.service; enabled; vendor preset: disabled)\n Drop-In: /etc/systemd/system/confluent-zookeeper.service.d\n └─override.conf\n Active: active (running) since Mon 2022-01-10 11:52:47 EST; 1 months 16 days ago\n Docs: http://docs.confluent.io/\n Main PID: 23394 (java)\n CGroup: /system.slice/confluent-zookeeper.service\n └─23394 java -Xmx1g -server -XX:+UseG1GC -XX:MaxGCPauseMillis=20 -XX:InitiatingHeapOccupancyPercent=35 -XX:+ExplicitGCInvokesConcurrent -XX:MaxInlineLevel=15 -Djava.awt.headless=true -Xloggc:/var/log/kafka/zookeeper-gc.log -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+UseGCLogFileRotation -XX:NumberOfGCLogFiles=10 -XX:GCLogFileSize=100M -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dkafka.logs.dir=/var/log/kafka -Dlog4j.configuration=file:/etc/kafka/log4j.properties -cp /usr/bin/../ce-broker-plugins/build/libs/*:/usr/bin/../ce-broker-plugins/build/dependant-libs/*:/usr/bin/../ce-auth-providers/build/libs/*:/usr/bin/../ce-auth-providers/build/dependant-libs/*:/usr/bin/../ce-rest-server/build/libs/*:/usr/bin/../ce-rest-server/build/dependant-libs/*:/usr/bin/../ce-audit/build/libs/*:/usr/bin/../ce-audit/build/dependant-libs/*:/usr/bin/../share/java/kafka/*:/usr/bin/../share/java/confluent-metadata-service/*:/usr/bin/../share/java/rest-utils/*:/usr/bin/../share/java/confluent-common/*:/usr/bin/../share/java/confluent-security/schema-validator/*:/usr/bin/../support-metrics-client/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-client/build/libs/*:/usr/share/java/support-metrics-client/*:/usr/bin/../support-metrics-fullcollector/build/dependant-libs-2.12.10/*:/usr/bin/../support-metrics-fullcollector/build/libs/*:/usr/share/java/support-metrics-fullcollector/* -Dlog4j.configuration=file:/etc/kafka/zookeeper_log4j.properties org.apache.zookeeper.server.quorum.QuorumPeerMain /etc/kafka/zookeeper.properties\n\nFeb 26 07:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 07:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,216] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 08:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 08:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO Purge task started. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 09:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 09:52:48,216] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,216] INFO Purge task started. 
(org.apache.zookeeper.server.DatadirCleanupManager)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,216] INFO zookeeper.snapshot.trust.empty : false (org.apache.zookeeper.server.persistence.FileTxnSnapLog)\nFeb 26 10:52:48 confl-server7 zookeeper-server-start[23394]: [2022-02-26 10:52:48,217] INFO Purge task completed. (org.apache.zookeeper.server.DatadirCleanupManager)",
I also tried
- name: checking service status
  command: systemctl status "{{ item }}"
  loop: "{{ ansible_facts.services.keys() | select('match', '^.*confluent.*$') | list }}"
  register: result
  ignore_errors: yes
- name: checking service status showing report
  debug:
    var: result
But that gives even longer output for each host.
I would like to get just the server name, service name, and status (running or failed), like this:
Server: confl-server4
Service: confluent-kafka-connect.service
Active: Active (Running).
or (Failed) if the service has failed,
for the services on all servers, in a single file on the broker1 host.
How can I achieve that?
Thank you
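For reference, the direction I am currently leaning towards is collecting service_facts on every host and rendering one combined report on broker1; this is a rough, untested sketch (the destination path is just an example):
---
- name: Gather Confluent service status from all hosts
  hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Collect service facts on each host
      service_facts:

    - name: Write one combined report on broker1
      copy:
        dest: /tmp/confluent_services_report.txt   # example path, adjust as needed
        content: |
          {% for host in ansible_play_hosts %}
          {% for svc, info in hostvars[host].ansible_facts.services.items() %}
          {% if 'confluent' in svc %}
          Server: {{ host }}
          Service: {{ svc }}
          Active: {{ info.state }}
          {% endif %}
          {% endfor %}
          {% endfor %}
      delegate_to: broker1
      run_once: true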

Getting Module Failure error while running Ansible playbook

I am getting the following error when running my Ansible playbook:
hosts file
[node1]
rabbit-node1 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu
[node2]
rabbit-node2 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu
[node3]
rabbit-node3 ansible_ssh_host=x.x.x.x ansible_ssh_user=ubuntu
[workers]
rabbit-node2
rabbit-node3
[all_group]
rabbit-node1
rabbit-node2
rabbit-node3
[all:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_user=ubuntu
ansible_private_key_file=private key path
ansible_ssh_extra_args='-o StrictHostKeyChecking=no'
Error
fatal: [rabbit-node1]: FAILED! => {"changed": false, "failed": true, "invocation": {"module_name": "setup"}, "module_stderr": "OpenSSH_7.2p2 Ubuntu-4ubuntu2.1, OpenSSL 1.0.2g 1 Mar 2016\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: Applying options for x.x.x.x\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 25400\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\nShared connection to x.x.x.x closed.\r\n", "module_stdout": " File \"/home/ubuntu/.ansible/tmp/ansible-tmp-1584361123.7-96709661573808/setup\", line 3160\r\n except OSError, e:\r\n ^\r\nSyntaxError: invalid syntax\r\n", "msg": "MODULE FAILURE", "parsed": false}
playbook.yml
Playbook file for setting the hostname, installing RabbitMQ, and creating a 3-node RabbitMQ cluster:
- name: deploy RabbitMQ and setup the environment
  hosts:
    - all_group
  #gather_facts: False
  user: ubuntu
  sudo: yes
  roles:
    - set_hostname
    - install_rabbitmq

- name: Configure RabbitMQ Cluster
  hosts:
    - workers
  user: ubuntu
  sudo: yes
  roles:
    - cluster_setup
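The module_stdout in the error shows the setup module failing on the Python 2 style line "except OSError, e:" with a SyntaxError, which suggests this Ansible release ships Python 2 module code while the inventory forces /usr/bin/python3. If Python 2 is available on the target nodes (an assumption), one variant to try is pointing the interpreter back at it in the inventory:
[all:vars]
ansible_python_interpreter=/usr/bin/python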

Missing queues from RabbitMQ Metricbeat

It looks like only a fraction of the queues on my RabbitMQ cluster are making it into Elasticsearch via Metricbeat.
When I query RabbitMQ's /api/overview, I see 887 queues reported:
object_totals: {
consumers: 517,
queues: 887,
exchanges: 197,
connections: 305,
channels: 622
},
When I query RabbitMQ's /api/queues (which is what Metricbeat hits), I count 887 queues there as well.
When I get a unique count of the field rabbitmq.queue.name in Elasticsearch, I am seeing only 309 queues.
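The counts were obtained roughly like this (the exact commands may differ; host names and credentials are placeholders):
curl -s -u user:pass 'http://rabbitmq-host:15672/api/queues?columns=name' | jq 'length'
curl -s 'http://elasticsearch-host:9200/metricbeat-*/_search?size=0' -H 'Content-Type: application/json' -d '{"aggs":{"unique_queues":{"cardinality":{"field":"rabbitmq.queue.name"}}}}'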
I don't see anything in the debug output that stands out to me. It's just the usual INFO level startup messages, followed by the publish information:
root#rabbitmq:/etc/metricbeat# metricbeat -e
2019-06-24T21:13:33.692Z INFO instance/beat.go:571 Home path: [/usr/share/metricbeat] Config path: [/etc/metricbeat] Data path: [/var/lib/metricbeat] Logs path: [/var/log/metricbeat]
2019-06-24T21:13:33.692Z INFO instance/beat.go:579 Beat ID: xxx
2019-06-24T21:13:33.692Z INFO [index-management.ilm] ilm/ilm.go:129 Policy name: metricbeat-7.1.1
2019-06-24T21:13:33.692Z INFO [seccomp] seccomp/seccomp.go:116 Syscall filter successfully installed
2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:827 Beat info {"system_info": {"beat": {"path": {"config": "/etc/metricbeat", "data": "/var/lib/metricbeat", "home": "/usr/share/metricbeat", "logs": "/var/log/metricbeat"}, "type": "metricbeat", "uuid": "xxx"}}}
2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:836 Build info {"system_info": {"build": {"commit": "3358d9a5a09e3c6709a2d3aaafde628ea34e8419", "libbeat": "7.1.1", "time": "2019-05-23T13:23:10.000Z", "version": "7.1.1"}}}
2019-06-24T21:13:33.692Z INFO [beat] instance/beat.go:839 Go runtime info {"system_info": {"go": {"os":"linux","arch":"amd64","max_procs":4,"version":"go1.11.5"}}}
[...]
2019-06-24T21:13:33.694Z INFO [beat] instance/beat.go:872 Process info {"system_info": {"process": {"capabilities": {"inheritable":null,"permitted":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"effective":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"bounding":["chown","dac_override","dac_read_search","fowner","fsetid","kill","setgid","setuid","setpcap","linux_immutable","net_bind_service","net_broadcast","net_admin","net_raw","ipc_lock","ipc_owner","sys_module","sys_rawio","sys_chroot","sys_ptrace","sys_pacct","sys_admin","sys_boot","sys_nice","sys_resource","sys_time","sys_tty_config","mknod","lease","audit_write","audit_control","setfcap","mac_override","mac_admin","syslog","wake_alarm","block_suspend","audit_read"],"ambient":null}, "cwd": "/etc/metricbeat", "exe": "/usr/share/metricbeat/bin/metricbeat", "name": "metricbeat", "pid": 30898, "ppid": 30405, "seccomp": {"mode":"filter","no_new_privs":true}, "start_time": "2019-06-24T21:13:33.100Z"}}}
2019-06-24T21:13:33.694Z INFO instance/beat.go:280 Setup Beat: metricbeat; Version: 7.1.1
2019-06-24T21:13:33.694Z INFO [publisher] pipeline/module.go:97 Beat name: metricbeat
2019-06-24T21:13:33.694Z INFO instance/beat.go:391 metricbeat start running.
2019-06-24T21:13:33.694Z INFO cfgfile/reload.go:150 Config reloader started
2019-06-24T21:13:33.694Z INFO [monitoring] log/log.go:117 Starting metrics logging every 30s
[...]
2019-06-24T21:13:43.696Z INFO filesystem/filesystem.go:57 Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, cgroup2, tmpfs, devtmpfs, configfs, debugfs, tracefs, securityfs, sockfs, dax, bpf, pipefs, hugetlbfs, devpts, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
2019-06-24T21:13:43.696Z INFO fsstat/fsstat.go:59 Ignoring filesystem types: sysfs, rootfs, ramfs, bdev, proc, cpuset, cgroup, cgroup2, tmpfs, devtmpfs, configfs, debugfs, tracefs, securityfs, sockfs, dax, bpf, pipefs, hugetlbfs, devpts, ecryptfs, fuse, fusectl, pstore, mqueue, autofs
2019-06-24T21:13:44.696Z INFO pipeline/output.go:95 Connecting to backoff(async(tcp://xxx))
2019-06-24T21:13:44.711Z INFO pipeline/output.go:105 Connection to backoff(async(tcp://xxx)) established
2019-06-24T21:14:03.696Z INFO [monitoring] log/log.go:144 Non-zero metrics in the last 30s {"monitoring": {"metrics": {"beat":{"cpu":{"system":{"ticks":130,"time":{"ms":131}},"total":{"ticks":1960,"time":{"ms":1965},"value":1960},"user":{"ticks":1830,"time":{"ms":1834}}},"handles":{"limit":{"hard":1048576,"soft":1024},"open":12},"info":{"ephemeral_id":"xxx","uptime":{"ms":30030}},"memstats":{"gc_next":30689808,"memory_alloc":21580680,"memory_total":428076400,"rss":79917056}},"libbeat":{"config":{"module":{"running":0},"reloads":2},"output":{"events":{"acked":7825,"batches":11,"total":7825},"read":{"bytes":66},"type":"logstash","write":{"bytes":870352}},"pipeline":{"clients":4,"events":{"active":313,"published":8138,"retry":523,"total":8138},"queue":{"acked":7825}}},"metricbeat":{"rabbitmq":{"connection":{"events":2987,"failures":10,"success":2977},"exchange":{"events":1970,"success":1970},"node":{"events":10,"success":10},"queue":{"events":3130,"failures":10,"success":3120}},"system":{"cpu":{"events":2,"success":2},"filesystem":{"events":7,"success":7},"fsstat":{"events":1,"success":1},"load":{"events":2,"success":2},"memory":{"events":2,"success":2},"network":{"events":4,"success":4},"process":{"events":18,"success":18},"process_summary":{"events":2,"success":2},"socket_summary":{"events":2,"success":2},"uptime":{"events":1,"success":1}}},"system":{"cpu":{"cores":4},"load":{"1":0.48,"15":0.28,"5":0.15,"norm":{"1":0.12,"15":0.07,"5":0.0375}}}}}}
I think that if there were a problem getting the queues, I should see an error in the logs above, as per https://github.com/elastic/beats/blob/master/metricbeat/module/rabbitmq/queue/data.go#L94-L104
Here's the metricbeat.yml:
metricbeat.config.modules:
  path: ${path.config}/modules.d/*.yml
  reload.enabled: true
  reload.period: 10s
setup.template.settings:
  index.number_of_shards: 1
  index.codec: best_compression
name: metricbeat
fields:
  environment: development
processors:
  - add_cloud_metadata: ~
output.logstash:
  hosts: ["xxx"]
Here's the modules.d/rabbitmq.yml:
- module: rabbitmq
  metricsets: ["node", "queue", "connection", "exchange"]
  enabled: true
  period: 2s
  hosts: ["xxx"]
  username: xxx
  password: xxx
I solved it by upgrading Elastic Stack from 7.1.1 to 7.2.0.

websockets on GKE with istio gives 'no healthy upstream' and 'CrashLoopBackOff'

I am on GKE using Istio version 1.0.3. I am trying to get my Express.js backend with socket.io (and the uws engine) working over websockets; I had this backend running with websockets on a non-Kubernetes server before without problems.
When I simply enter the external_gke_ip as the URL, I get my backend HTML page, so HTTP works. But when my client app makes socket.io authentication calls, I get 503 errors in the browser console:
WebSocket connection to 'ws://external_gke_ip/socket.io/?EIO=3&transport=websocket' failed: Error during WebSocket handshake: Unexpected response code: 503
And when I enter the external_gke_ip as the URL while socket calls are being made, I get "no healthy upstream" in the browser, and the pod goes into CrashLoopBackOff.
I find somewhere: 'in node.js land, socket.io typically does a few non-websocket Handshakes to the Server before eventually upgrading to Websockets. If you don't have sticky-sessions, the upgrade never works.' So maybe I need sticky sessions? Or not... as I just have one replica of my app? It seems to be done by setting sessionAffinity: ClientIP, but with istio I do not know how to do this and in the GUI I can edit some values of the loadbalancers, but Session affinity shows 'none' and I can not edit it.
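If sticky sessions do turn out to be needed, I believe the Istio-side equivalent would be a DestinationRule with consistent-hash load balancing. This is only an untested sketch; the host myapp refers to my Service below, and hashing on the source IP is my assumption:
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp                  # the Kubernetes Service defined in my manifest below
  trafficPolicy:
    loadBalancer:
      consistentHash:
        useSourceIp: true      # keep each client IP pinned to the same pod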
Other settings that might be relevant, and that I am not sure how to set when using Istio, are (see the Service sketch after this list):
externalTrafficPolicy=Local
Ttl
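For reference, my understanding is that on a plain Kubernetes Service these would look roughly like the sketch below; the values are assumptions and I have not verified this (externalTrafficPolicy requires a NodePort or LoadBalancer Service, and timeoutSeconds would be the affinity "Ttl"):
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  type: LoadBalancer              # assumed; externalTrafficPolicy needs NodePort/LoadBalancer
  externalTrafficPolicy: Local
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800       # the affinity "Ttl" (Kubernetes default, assumed here)
  selector:
    app: myapp
  ports:
  - port: 8089
    targetPort: 8089
    protocol: TCP
    name: http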
My manifest config file:
apiVersion: v1
kind: Service
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    app: myapp
  ports:
  - port: 8089
    targetPort: 8089
    protocol: TCP
    name: http
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: gcr.io/myproject/firstapp:v1
        imagePullPolicy: Always
        ports:
        - containerPort: 8089
        env:
        - name: POSTGRES_DB_HOST
          value: 127.0.0.1:5432
        - name: POSTGRES_DB_USER
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: username
        - name: POSTGRES_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mysecret
              key: password
        readinessProbe:
          httpGet:
            path: /healthz
            scheme: HTTP
            port: 8089
          initialDelaySeconds: 10
          timeoutSeconds: 5
      - name: cloudsql-proxy
        image: gcr.io/cloudsql-docker/gce-proxy:1.11
        command: ["/cloud_sql_proxy",
                  "-instances=myproject:europe-west4:osm=tcp:5432",
                  "-credential_file=/secrets/cloudsql/credentials.json"]
        securityContext:
          runAsUser: 2
          allowPrivilegeEscalation: false
        volumeMounts:
        - name: cloudsql-instance-credentials
          mountPath: /secrets/cloudsql
          readOnly: true
      volumes:
      - name: cloudsql-instance-credentials
        secret:
          secretName: cloudsql-instance-credentials
---
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: myapp-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - "*"
  gateways:
  - myapp-gateway
  http:
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: myapp
      weight: 100
    websocketUpgrade: true
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: google-apis
spec:
  hosts:
  - "*.googleapis.com"
  ports:
  - number: 443
    name: https
    protocol: HTTPS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: cloud-sql-instance
spec:
  hosts:
  - 35.204.XXX.XX # ip of cloudsql database
  ports:
  - name: tcp
    number: 3307
    protocol: TCP
  location: MESH_EXTERNAL
Various output (while socket calls are being made; when I stop them, the deployment recovers and READY returns to 3/3):
$ kubectl get pods
NAME READY STATUS RESTARTS AGE
myapp-8888 2/3 CrashLoopBackOff 11 1h
$ kubectl describe pod/myapp-8888 gives:
Name: myapp-8888
Namespace: default
Node: gke-standard-cluster-1-default-pool-888888-9vtk/10.164.0.36
Start Time: Sat, 19 Jan 2019 14:33:11 +0100
Labels: app=myapp
pod-template-hash=207157
Annotations:
kubernetes.io/limit-ranger:
LimitRanger plugin set: cpu request for container app; cpu request for container cloudsql-proxy
sidecar.istio.io/status:
{"version":"3c9617ff82c9962a58890e4fa987c69ca62487fda71c23f3a2aad1d7bb46c748","initContainers":["istio-init"],"containers":["istio-proxy"]...
Status: Running
IP: 10.44.0.5
Controlled By: ReplicaSet/myapp-64c59c94dc
Init Containers:
istio-init:
Container ID: docker://a417695f99509707d0f4bfa45d7d491501228031996b603c22aaf398551d1e45
Image: gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0
Image ID: docker-pullable://gcr.io/gke-release/istio/proxy_init@sha256:e30d47d2f269347a973523d0c5d7540dbf7f87d24aca2737ebc09dbe5be53134
Port: <none>
Host Port: <none>
Args:
-p
15001
-u
1337
-m
REDIRECT
-i
*
-x
-b
8089,
-d
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:33:19 +0100
Finished: Sat, 19 Jan 2019 14:33:19 +0100
Ready: True
Restart Count: 0
Environment: <none>
Mounts: <none>
Containers:
app:
Container ID: docker://888888888888888888888888
Image: gcr.io/myproject/firstapp:v1
Image ID: docker-pullable://gcr.io/myproject/firstapp@sha256:8888888888888888888888888
Port: 8089/TCP
Host Port: 0/TCP
State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:40:14 +0100
Finished: Sat, 19 Jan 2019 14:40:37 +0100
Last State: Terminated
Reason: Completed
Exit Code: 0
Started: Sat, 19 Jan 2019 14:39:28 +0100
Finished: Sat, 19 Jan 2019 14:39:46 +0100
Ready: False
Restart Count: 3
Requests:
cpu: 100m
Readiness: http-get http://:8089/healthz delay=10s timeout=5s period=10s #success=1 #failure=3
Environment:
POSTGRES_DB_HOST: 127.0.0.1:5432
POSTGRES_DB_USER: <set to the key 'username' in secret 'mysecret'> Optional: false
POSTGRES_DB_PASSWORD: <set to the key 'password' in secret 'mysecret'> Optional: false
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
cloudsql-proxy:
Container ID: docker://788888888888888888888888888
Image: gcr.io/cloudsql-docker/gce-proxy:1.11
Image ID: docker-pullable://gcr.io/cloudsql-docker/gce-proxy@sha256:5c690349ad8041e8b21eaa63cb078cf13188568e0bfac3b5a914da3483079e2b
Port: <none>
Host Port: <none>
Command:
/cloud_sql_proxy
-instances=myproject:europe-west4:osm=tcp:5432
-credential_file=/secrets/cloudsql/credentials.json
State: Running
Started: Sat, 19 Jan 2019 14:33:40 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 100m
Environment: <none>
Mounts:
/secrets/cloudsql from cloudsql-instance-credentials (ro)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-rclsf (ro)
istio-proxy:
Container ID: docker://f3873d0f69afde23e85d6d6f85b1f
Image: gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0
Image ID: docker-pullable://gcr.io/gke-release/istio/proxyv2@sha256:826ef4469e4f1d4cabd0dc846
Port: <none>
Host Port: <none>
Args:
proxy
sidecar
--configPath
/etc/istio/proxy
--binaryPath
/usr/local/bin/envoy
--serviceCluster
myapp
--drainDuration
45s
--parentShutdownDuration
1m0s
--discoveryAddress
istio-pilot.istio-system:15007
--discoveryRefreshDelay
1s
--zipkinAddress
zipkin.istio-system:9411
--connectTimeout
10s
--statsdUdpAddress
istio-statsd-prom-bridge.istio-system:9125
--proxyAdminPort
15000
--controlPlaneAuthPolicy
NONE
State: Running
Started: Sat, 19 Jan 2019 14:33:54 +0100
Ready: True
Restart Count: 0
Requests:
cpu: 10m
Environment:
POD_NAME: myapp-64c59c94dc-8888 (v1:metadata.name)
POD_NAMESPACE: default (v1:metadata.namespace)
INSTANCE_IP: (v1:status.podIP)
ISTIO_META_POD_NAME: myapp-64c59c94dc-8888 (v1:metadata.name)
ISTIO_META_INTERCEPTION_MODE: REDIRECT
Mounts:
/etc/certs/ from istio-certs (ro)
/etc/istio/proxy from istio-envoy (rw)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
cloudsql-instance-credentials:
Type: Secret (a volume populated by a Secret)
SecretName: cloudsql-instance-credentials
Optional: false
default-token-rclsf:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-rclsf
Optional: false
istio-envoy:
Type: EmptyDir (a temporary directory that shares a pod's lifetime)
Medium: Memory
istio-certs:
Type: Secret (a volume populated by a Secret)
SecretName: istio.default
Optional: true
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s
node.kubernetes.io/unreachable:NoExecute for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 7m31s default-scheduler Successfully assigned myapp-64c59c94dc-tdb9c to gke-standard-cluster-1-default-pool-65b9e650-9vtk
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "istio-envoy"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "cloudsql-instance-credentials"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "default-token-rclsf"
Normal SuccessfulMountVolume 7m31s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk MountVolume.SetUp succeeded for volume "istio-certs"
Normal Pulling 7m30s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
Normal Pulled 7m25s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/gke-release/istio/proxy_init:1.0.2-gke.0"
Normal Created 7m24s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 7m23s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 7m4s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/cloudsql-docker/gce-proxy:1.11"
Normal Pulled 7m3s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/cloudsql-docker/gce-proxy:1.11"
Normal Started 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
Normal Created 7m2s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Pulled 6m54s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/gke-release/istio/proxyv2:1.0.2-gke.0"
Normal Created 6m51s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 6m48s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulling 111s (x2 over 7m22s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk pulling image "gcr.io/myproject/firstapp:v3"
Normal Created 110s (x2 over 7m4s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Created container
Normal Started 110s (x2 over 7m4s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Started container
Normal Pulled 110s (x2 over 7m7s) kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Successfully pulled image "gcr.io/myproject/firstapp:v3"
Warning Unhealthy 99s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Readiness probe failed: HTTP probe failed with statuscode: 503
Warning BackOff 85s kubelet, gke-standard-cluster-1-default-pool-65b9e650-9vtk Back-off restarting failed container
And:
$ kubectl logs myapp-8888 myapp
> api_server@0.0.0 start /usr/src/app
> node src/
info: Feathers application started on http://localhost:8089
And the database logs (which look OK, as some 'startup script entries' from the app can be retrieved using psql):
$ kubectl logs myapp-8888 cloudsql-proxy
2019/01/19 13:33:40 using credential file for authentication; email=proxy-user@myproject.iam.gserviceaccount.com
2019/01/19 13:33:40 Listening on 127.0.0.1:5432 for myproject:europe-west4:osm
2019/01/19 13:33:40 Ready for new connections
2019/01/19 13:33:54 New connection for "myproject:europe-west4:osm"
2019/01/19 13:33:55 couldn't connect to "myproject:europe-west4:osm": Post https://www.googleapis.com/sql/v1beta4/projects/myproject/instances/osm/createEphemeral?alt=json: oauth2: cannot fetch token: Post https://oauth2.googleapis.com/token: dial tcp 74.125.143.95:443: getsockopt: connection refused
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:06 Client closed local connection on 127.0.0.1:5432
2019/01/19 13:39:13 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
2019/01/19 13:39:14 New connection for "myproject:europe-west4:osm"
EDIT:
Here is the server-side log of the 503 for WebSocket calls to my app:
{
insertId: "465nu9g3xcn5hf"
jsonPayload: {
apiClaims: ""
apiKey: ""
clientTraceId: ""
connection_security_policy: "unknown"
destinationApp: "myapp"
destinationIp: "10.44.XX.XX"
destinationName: "myapp-888888-88888"
destinationNamespace: "default"
destinationOwner: "kubernetes://apis/extensions/v1beta1/namespaces/default/deployments/myapp"
destinationPrincipal: ""
destinationServiceHost: "myapp.default.svc.cluster.local"
destinationWorkload: "myapp"
httpAuthority: "35.204.XXX.XXX"
instance: "accesslog.logentry.istio-system"
latency: "1.508885ms"
level: "info"
method: "GET"
protocol: "http"
receivedBytes: 787
referer: ""
reporter: "source"
requestId: "bb31d922-8f5d-946b-95c9-83e4c022d955"
requestSize: 0
requestedServerName: ""
responseCode: 503
responseSize: 57
responseTimestamp: "2019-01-18T20:53:03.966513Z"
sentBytes: 164
sourceApp: "istio-ingressgateway"
sourceIp: "10.44.X.X"
sourceName: "istio-ingressgateway-8888888-88888"
sourceNamespace: "istio-system"
sourceOwner: "kubernetes://apis/extensions/v1beta1/namespaces/istio-system/deployments/istio-ingressgateway"
sourcePrincipal: ""
sourceWorkload: "istio-ingressgateway"
url: "/socket.io/?EIO=3&transport=websocket"
userAgent: "Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1"
xForwardedFor: "10.44.X.X"
}
logName: "projects/myproject/logs/stdout"
metadata: {
systemLabels: {
container_image: "gcr.io/gke-release/istio/mixer:1.0.2-gke.0"
container_image_id: "docker-pullable://gcr.io/gke-release/istio/mixer#sha256:888888888888888888888888888888"
name: "mixer"
node_name: "gke-standard-cluster-1-default-pool-88888888888-8887"
provider_instance_id: "888888888888"
provider_resource_type: "gce_instance"
provider_zone: "europe-west4-a"
service_name: [
0: "istio-telemetry"
]
top_level_controller_name: "istio-telemetry"
top_level_controller_type: "Deployment"
}
userLabels: {
app: "telemetry"
istio: "mixer"
istio-mixer-type: "telemetry"
pod-template-hash: "88888888888"
}
}
receiveTimestamp: "2019-01-18T20:53:08.135805255Z"
resource: {
labels: {
cluster_name: "standard-cluster-1"
container_name: "mixer"
location: "europe-west4-a"
namespace_name: "istio-system"
pod_name: "istio-telemetry-8888888-8888888"
project_id: "myproject"
}
type: "k8s_container"
}
severity: "INFO"
timestamp: "2019-01-18T20:53:03.965100Z"
}
In the browser it at first seems to switch protocols properly, but then I get repeated 503 responses, and the resulting health-check failures cause repeated restarts. The protocol-switch WebSocket call:
General:
Request URL: ws://localhost:8080/sockjs-node/842/s4888/websocket
Request Method: GET
Status Code: 101 Switching Protocols [GREEN]
Response headers:
Connection: Upgrade
Sec-WebSocket-Accept: NS8888888888888888888
Upgrade: websocket
Request headers:
Accept-Encoding: gzip, deflate, br
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Cookie: _ga=GA1.1.1118102238.18888888; hblid=nSNQ2mS8888888888888; olfsk=ol8888888888
Host: localhost:8080
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: b8zkVaXlEySHasCkD4aUiw==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
Its frames:
Following the above I get multiple of these:
Chrome output regarding websocket call:
general:
Request URL: ws://35.204.210.134/socket.io/?EIO=3&transport=websocket
Request Method: GET
Status Code: 503 Service Unavailable
response headers:
connection: close
content-length: 19
content-type: text/plain
date: Sat, 19 Jan 2019 14:06:39 GMT
server: envoy
request headers:
Accept-Encoding: gzip, deflate
Accept-Language: nl-NL,nl;q=0.9,en-US;q=0.8,en;q=0.7
Cache-Control: no-cache
Connection: Upgrade
Host: 35.204.210.134
Origin: http://localhost:8080
Pragma: no-cache
Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits
Sec-WebSocket-Key: VtKS5xKF+GZ4u3uGih2fig==
Sec-WebSocket-Version: 13
Upgrade: websocket
User-Agent: Mozilla/5.0 (iPhone; CPU iPhone OS 10_3_1 like Mac OS X) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.0 Mobile/14E304 Safari/602.1
The frames:
Data: (Opcode -1)
Length: 63
Time: 15:06:44.412
Using uws (uWebSockets) as the WebSocket engine causes these errors. When, in my backend app, I swap this code:
app.configure(socketio({
  wsEngine: 'uws',
  timeout: 120000,
  reconnect: true
}))
for this:
app.configure(socketio())
everything works as expected.
EDIT: It now also works with uws. I was using an Alpine Docker image based on Node 10, which does not work with uws. After switching to an image based on Node 8, it works.

Resources