Envoy: connection refused (via Consul)

An Envoy Docker sidecar is deployed alongside each of our workload services (a Nomad cluster with Docker deployments, on a private network of our own between our servers) and is supposed to connect to a Consul service mesh.
Our service is correctly deployed and accessible on the corresponding (Nomad) network, but Consul fails to validate the Envoy sidecar health check:
Connect Sidecar Listening
ServiceName my-service-sidecar-proxy
CheckID service:_nomad-task-xxxxxx-xxxx-xxxxxx-xxxxx-xxxxxxxxxx-group-my-service-my-service-interface-sidecar-proxy:2
Type tcp
Output dial tcp 10.0.0.72:30241: connect: connection refused
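For context, the sidecar comes from Nomad's Consul Connect integration, declared roughly like this (a minimal sketch with placeholder names; our actual job file differs):
service {
  name = "my-service"
  port = "interface"

  connect {
    sidecar_service {}
  }
}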
From our servers, the workload service is reachable over our private network (the Nomad network), but the Envoy proxy is not:
# Workload service
server-node-0:~$ wget 10.0.0.63:28947/health
--2022-09-07 15:38:28-- http://10.0.0.63:28947/health
Connecting to 10.0.0.63:28947... connected.
HTTP request sent, awaiting response... 200 OK
# Envoy
server-node-0:~$ wget 10.0.0.63:30241
--2022-09-07 15:38:57-- http://10.0.0.63:30241/
Connecting to 10.0.0.63:30241... failed: Connection refused.
On the server where the services are deployed, as well as inside their Docker container, netstat shows that only the workload service port is open, not Envoy's.
client-node-1:~$ netstat -a | grep 28947
tcp 0 0 client-node-1:49236 client-node-1:28947 TIME_WAIT
client-node-1:~$ netstat -a | grep 30241
client-node-1:~$
Yet the Envoy process seems to be running properly in its sidecar container. Its admin port (19001) is open and responds:
root@3d7c60d8b94a:/# ps auxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
envoy 1 0.1 1.3 2256608 53300 ? Ssl 15:53 0:05 envoy -c /secrets/envoy_bootstrap.json -l debug --concurrency 1 --disable-hot-restart
root@3d7c60d8b94a:/# curl -X POST 127.0.0.2:19001/ready
LIVE
What could cause Envoy not to listen on its port, so that it can properly act as a proxy?
With debug logging enabled, we noticed the following messages, without knowing whether they are related:
[2022-09-07 17:03:13.173][1][debug][config] [./source/common/config/grpc_stream.h:61] Establishing new gRPC bidi stream for rpc DeltaAggregatedResources(stream .envoy.service.discovery.v3.DeltaDiscoveryRequest) returns (stream .envoy.service.discovery.v3.DeltaDiscoveryResponse);
[2022-09-07 17:03:13.173][1][debug][router] [source/common/router/router.cc:471] [C0][S3688902394289443788] cluster 'local_agent' match for URL '/envoy.service.discovery.v3.AggregatedDiscoveryService/DeltaAggregatedResources'
[2022-09-07 17:03:13.173][1][debug][router] [source/common/router/router.cc:675] [C0][S3688902394289443788] router decoding headers:
':method', 'POST'
':path', '/envoy.service.discovery.v3.AggregatedDiscoveryService/DeltaAggregatedResources'
':authority', 'local_agent'
':scheme', 'http'
'te', 'trailers'
'content-type', 'application/grpc'
'x-consul-token', 'bf88ad2f-7d22-716c-6e5f-8c404f866d6e'
'x-envoy-internal', 'true'
'x-forwarded-for', '172.26.64.4'
[2022-09-07 17:03:13.173][1][debug][pool] [source/common/http/conn_pool_base.cc:78] queueing stream due to no available connections (ready=0 busy=0 connecting=0)
[2022-09-07 17:03:13.173][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:290] trying to create new connection
[2022-09-07 17:03:13.173][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:145] creating a new connection (connecting=0)
[2022-09-07 17:03:13.173][1][debug][http2] [source/common/http/http2/codec_impl.cc:1788] [C294] updating connection-level initial window size to 268435456
[2022-09-07 17:03:13.173][1][debug][connection] [./source/common/network/connection_impl.h:89] [C294] current connecting state: true
[2022-09-07 17:03:13.173][1][debug][client] [source/common/http/codec_client.cc:57] [C294] connecting
[2022-09-07 17:03:13.173][1][debug][connection] [source/common/network/connection_impl.cc:924] [C294] connecting to alloc/tmp/consul_grpc.sock
[2022-09-07 17:03:13.174][1][debug][connection] [source/common/network/connection_impl.cc:683] [C294] connected
[2022-09-07 17:03:13.174][1][debug][client] [source/common/http/codec_client.cc:89] [C294] connected
[2022-09-07 17:03:13.174][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:327] [C294] attaching to next stream
[2022-09-07 17:03:13.174][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:181] [C294] creating stream
[2022-09-07 17:03:13.174][1][debug][router] [source/common/router/upstream_request.cc:424] [C0][S3688902394289443788] pool ready
[2022-09-07 17:03:13.179][1][debug][connection] [source/common/network/connection_impl.cc:651] [C294] remote close
[2022-09-07 17:03:13.179][1][debug][connection] [source/common/network/connection_impl.cc:250] [C294] closing socket: 0
[2022-09-07 17:03:13.179][1][debug][client] [source/common/http/codec_client.cc:108] [C294] disconnect. resetting 1 pending requests
[2022-09-07 17:03:13.179][1][debug][client] [source/common/http/codec_client.cc:140] [C294] request reset
[2022-09-07 17:03:13.179][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:214] [C294] destroying stream: 0 remaining
[2022-09-07 17:03:13.179][1][debug][router] [source/common/router/router.cc:1198] [C0][S3688902394289443788] upstream reset: reset reason: connection termination, transport failure reason:
[2022-09-07 17:03:13.179][1][debug][http] [source/common/http/async_client_impl.cc:101] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection termination'
[2022-09-07 17:03:13.179][1][warning][config] [./source/common/config/grpc_stream.h:196] DeltaAggregatedResources gRPC config stream closed since 4161s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: connection termination
[2022-09-07 17:03:13.179][1][debug][config] [source/common/config/grpc_subscription_impl.cc:113] gRPC update for type.googleapis.com/envoy.config.cluster.v3.Cluster failed
[2022-09-07 17:03:13.179][1][debug][config] [source/common/config/grpc_subscription_impl.cc:113] gRPC update for type.googleapis.com/envoy.config.listener.v3.Listener failed
[2022-09-07 17:03:13.179][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:483] [C294] client disconnected, failure reason:
[2022-09-07 17:03:13.179][1][debug][pool] [source/common/conn_pool/conn_pool_base.cc:453] invoking idle callbacks - is_draining_for_deletion_=false
Envoy version: 1.23.0
Consul version: 1.13.1

After discovering this message in the Consul logs:
consul grpc: Server.Serve failed to complete security handshake from
"127.0.0.1:50356": tls: first record does not look like a TLS
handshake
the root cause has been identified: Consul 1.13.1 ships with an incompatibility with Envoy when certain parameters are used (namely auto_encrypt or auto_config).
The solution is then either:
to roll back to Consul 1.12, or
to wait for Consul 1.13.2, which will ship a fix for this problem (even then, some parameters will have to be applied, see this comment).
We rolled back to Consul 1.12 (and now Envoy fails to communicate with its attached service, but that is surely a separate problem).
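For the record, the rollback itself was just a matter of pinning the agent back to a 1.12 release; the exact version and packaging below are illustrative placeholders, not from the original report:
# official Docker image
docker pull consul:1.12.4
# or, with the HashiCorp apt repository
sudo apt-get install consul=1.12.4-1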

Related

failed to initialize redis client

I am trying to run a new Redis client but I am facing the issue below.
failed to initialize redis client: dial tcp [::1]:6379: connectex: No connection could be made because the target machine actively refused it.
exit status 1

Consul Servers with TLS enabled but not required, how to check if clients are using TLS?

I'm enabling TLS connections for a Consul datacenter with 3 servers and more than 100 client nodes.
So I configured the servers' auto_encrypt with allow_tls set to true:
"ca_file": "/etc/consul/consul-agent-ca.pem",
"cert_file": "/etc/consul/dc1-server-consul-0.pem",
"key_file": "/etc/consul/dc1-server-consul-0-key.pem",
"auto_encrypt": {
"allow_tls": true
}
But I did not require TLS verification, because it's a dynamic environment: I add and remove clients on a daily basis, and I cannot afford to break node status information.
"verify_outgoing": false,
"verify_server_hostname": false,
I need to identify which client nodes are connecting to the servers using TLS and which are not, so I can configure the missing client nodes.
I have tried a few consul commands, without success.
consul members
consul info
consul operator raft list-peers
Any tips?
There is a workaround: the service log shows the Non-TLS connection attempts.
systemctl -l status consul
Sep 29 20:04:38 dev consul[25527]: 2021-09-29T20:04:38.280+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=57.243.24.234:57367
Sep 29 20:05:35 dev consul[25527]: 2021-09-29T20:05:35.845+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=127.45.234.453:60991
Sep 29 20:06:33 dev consul[25527]: 2021-09-29T20:06:33.193+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=45.34.30.345:62555
Sep 29 20:07:12 dev consul[25527]: 2021-09-29T20:07:12.051+0300 [WARN] agent.server.rpc: Non-TLS connection attempted with VerifyIncoming set: conn=from=234.45.24.345:57433
This is not a perfect solution, because I cannot query for TLS or Non-TLS, but it helps to check if there are clients without TLS enabled.
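If it helps, those warnings can be reduced to a list of offending source addresses with standard tooling (a sketch, assuming Consul runs under systemd and logs to the journal):
journalctl -u consul | grep "Non-TLS connection attempted" \
  | grep -oE "from=[0-9.]+" | sort -u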

Docker service tasks stuck in preparing state after reboot on Windows

Restarting a Windows server that is a swarm worker causes Windows containers to get stuck in a "Preparing" state indefinitely once the server and the Docker daemon are back online.
Image of tasks/containers stuck in preparing state:
https://user-images.githubusercontent.com/4528753/65180353-4e5d6e80-da22-11e9-8060-451150865177.png
Steps to reproduce the issue:
1. Create a swarm (in my case I have CentOS 7 managers and a few Windows Server 1903 workers).
2. Create a "global" docker service that only runs on the Windows machines. They should start up fine initially and work just fine.
3. Drain one or more of the Windows nodes that are running the Windows container(s) from step 2 (docker node update --availability=drain nodename).
4. Restart one or more of the nodes that were drained in step 3, and wait for them to come back up.
5. Set the Windows node(s) back to active (docker node update --availability=active nodename).
6. At this point, just observe that the docker service created in step 2 will be "Preparing" the containers on these nodes, and there it will stay (docker service ps servicename --no-trunc); you can observe this and run these commands from any manager node. A condensed command sketch follows this list.
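To make the repro concrete, here is a condensed sketch of those steps (image, service, and node names are placeholders, not from the original report):
# run from a manager node
docker service create --name winsvc --mode global \
  --constraint 'node.platform.os==windows' \
  mcr.microsoft.com/windows/servercore:1903
docker node update --availability=drain SWARMWORKER1
# reboot SWARMWORKER1 and wait for it to come back online
docker node update --availability=active SWARMWORKER1
docker service ps winsvc --no-trunc   # tasks remain stuck in "Preparing"
The log lines below are from the affected worker after such a reboot: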
memberlist: Refuting a suspect message (from: c9347e85405d)
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
memberlist: Failed to send ping: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
grpc: addrConn.createTransport failed to connect to {10.60.3.110:2377 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.60.3.110:2377: connectex: A socket operation was attempted to an unreachable host.". Reconnecting... [module=grpc]
agent: session failed [node.id=wuhifvg9li3v5zuq2xu7c6hxa module=node/agent error=rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 10.60.3.69:2377: connectex: A socket operation was attempted to an unreachable host." backoff=6.3s]
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.186: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.105: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.109: write udp 10.60.3.40:7946->10.60.3.109:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.69: write udp 10.60.3.40:7946->10.60.3.69:7946: wsasendto: The requested address is not valid in its context.
Failed to send gossip to 10.60.3.110: write udp 10.60.3.40:7946->10.60.3.110:7946: wsasendto: The requested address is not valid in its context.
memberlist: Failed to send gossip to 10.60.3.105:7946: write udp 10.60.3.40:7946->10.60.3.105:7946: wsasendto: The requested address is not valid in its context.
memberlist: Failed to send gossip to 10.60.3.186:7946: write udp 10.60.3.40:7946->10.60.3.186:7946: wsasendto: The requested address is not valid in its context.
Many of these errors are odd. For example, port 7946 is completely open between the cluster nodes; telnet tests confirm this.
I expect the docker service containers to start promptly, not get stuck in a Preparing state. The Docker image is already pulled, so it should be fast.
docker version output
Client: Docker Engine - Enterprise
Version: 19.03.2
API version: 1.40
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:38:11
OS/Arch: windows/amd64
Experimental: false
Server: Docker Engine - Enterprise
Engine:
Version: 19.03.2
API version: 1.40 (minimum version 1.24)
Go version: go1.12.8
Git commit: c92ab06ed9
Built: 09/03/2019 16:35:47
OS/Arch: windows/amd64
Experimental: false
docker info output
Client:
Debug Mode: false
Plugins:
cluster: Manage Docker clusters (Docker Inc., v1.1.0-8c33de7)
Server:
Containers: 4
Running: 0
Paused: 0
Stopped: 4
Images: 4
Server Version: 19.03.2
Storage Driver: windowsfilter
Windows:
Logging Driver: json-file
Plugins:
Volume: local
Network: ics l2bridge l2tunnel nat null overlay transparent
Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
Swarm: active
NodeID: wuhifvg9li3v5zuq2xu7c6hxa
Is Manager: false
Node Address: 10.60.3.40
Manager Addresses:
10.60.3.110:2377
10.60.3.186:2377
10.60.3.69:2377
Default Isolation: process
Kernel Version: 10.0 18362 (18362.1.amd64fre.19h1_release.190318-1202)
Operating System: Windows Server Datacenter Version 1903 (OS Build 18362.356)
OSType: windows
Architecture: x86_64
CPUs: 4
Total Memory: 8GiB
Name: SWARMWORKER1
ID: V2WJ:OEUM:7TUQ:WPIO:UOK4:IAHA:KWMN:RQFF:CAUO:LUB6:DJIJ:OVBX
Docker Root Dir: E:\docker
Debug Mode: false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false
Product License: this node is not a swarm manager - check license status on a manager node
Additional Details
These nodes are not using Docker Desktop for Windows. I am provisioning Docker on the box primarily based on the PowerShell instructions here: https://docs.docker.com/install/windows/docker-ee/
Windows firewall is disabled
iptables/firewalld is disabled
Communication is completely open between the cluster nodes
Totally up-to-date on cumulative updates
I posted on the moby repo issues but never heard a peep:
https://github.com/moby/moby/issues/39955
The ONLY way I've found to temporarily fix the issue is to drain the node from the swarm, delete the Docker files, reinstall the Windows "Containers" feature, and then rejoin the swarm (sketched below). But it happens again on reboot.
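A rough sketch of that temporary workaround (hypothetical commands; the data directory comes from docker info above, and the join token is a placeholder):
# on the affected worker, after draining it from a manager
docker swarm leave
Stop-Service docker
Remove-Item -Recurse -Force E:\docker   # Docker Root Dir from docker info
Uninstall-WindowsFeature -Name Containers
Restart-Computer
# after the reboot:
Install-WindowsFeature -Name Containers
Start-Service docker
docker swarm join --token <worker-token> 10.60.3.110:2377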
What's interesting is that when I see a swarm task in a "Preparing" state on the Windows worker, the server doesn't seem to be doing anything at all; it's as if the manager thinks the worker is preparing the container, but it isn't...
Anyone have any suggestions??

Mosquitto ERR_CONNECTION_REFUSED using websockets (paho client) on win 10

I've read all the threads with similar questions, but couldn't find an answer.
Mosquitto config:
listener 1883 127.0.0.1
protocol mqtt
listener 9001 127.0.0.1
protocol websockets
log output:
1567705166: mosquitto version 1.6.2 starting
1567705166: Config loaded from C:\Program Files (x86)\mosquitto\mosquitto.conf.
1567705166: Opening ipv4 listen socket on port 1883.
1567705166: Opening websockets listen socket on port 9001.
1567705166: Opening websockets listen socket on port 1883.
Chrome devtools:
mqttws31.js:977 WebSocket connection to 'ws://127.0.0.1:9001/mqtt' failed: Error in connection establishment: net::ERR_CONNECTION_REFUSED
I've tried many things but nothing helped:
Trying websockets only
Trying another port (1883 and 9001 instead of 8080)
Switching off Windows firewall
If I change the config file to:
#listener 1884 127.0.0.1
#protocol mqtt
#listener 1883 127.0.0.1
protocol websockets
Mosquitto listens for websockets on port 1883, but the log file reads:
1567706943: mosquitto version 1.6.2 starting
1567706943: Config loaded from C:\Program Files (x86)\mosquitto\mosquitto.conf.
1567706943: Opening websockets listen socket on port 1883.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
1567706943: Error in poll: No error.
Changing the config to:
protocol websockets
listener 8080 127.0.0.1
protocol mqtt
Gives me a logfile that says:
1567707450: mosquitto version 1.6.2 starting
1567707450: Config loaded from C:\Program Files (x86)\mosquitto\mosquitto.conf.
1567707450: Opening ipv4 listen socket on port 8080.
1567707450: Opening websockets listen socket on port 1883.
(no extra crap)
After following up on answer no. 1:
config:
protocol websockets
listener 1883 127.0.0.1
protocol mqtt
console:
WebSocket connection to 'ws://127.0.0.1:1883/mqtt' failed: Error during WebSocket handshake: net::ERR_CONNECTION_RESET
log:
1567716915: mosquitto version 1.6.2 starting
1567716915: Config loaded from C:\Program Files (x86)\mosquitto\mosquitto.conf.
1567716915: Opening ipv4 listen socket on port 1883.
1567716915: Opening websockets listen socket on port 1883.
1567716920: New connection from 127.0.0.1 on port 1883.
1567716920: Socket error on client <unknown>, disconnecting.
1567716920: New connection from 127.0.0.1 on port 1883.
1567716920: Socket error on client <unknown>, disconnecting.
1567715492: Error in poll: No error.
1567715492: Error in poll: No error.
1567715492: Error in poll: No error.
1567715492: Error in poll: No error.
1567715492: Error in poll: No error.
1567715492: Error in poll: No error.
Tried another websockets client (https://www.eclipse.org/paho/clients/js/utility/) --> Failed to connect: AMQJSC0001E Connect timed out.
I can't get websockets to work with any configuration or port...
Can anyone confirm that websockets in Mosquitto (32-bit version 1.6.2 or 1.6.4) for Win 10 are working?
Your first config file won't work because you have both native MQTT and websockets trying to listen on port 1883 (the default listener starts on port 1883). I'm not 100% sure how this is possible unless it's some strange IPv6 thing on Windows.
The second just changes the default listener's protocol to websockets, which in theory should work, assuming you try to connect to port 1883 from the webpage.
The third one makes the default listener on port 1883 websockets and native MQTT on 8080. Again, it should work, assuming you are trying to connect to port 1883.
The simplest config to enable Websockets should look like this:
listener 9001 127.0.0.1
protocol websockets
This will leave the native default listener alone on port 1883 (listening on all interfaces; use bind_address 127.0.0.1 before the listener line to make it listen only on localhost) and start the websocket listener on port 9001, for example:
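Putting that together, a complete minimal mosquitto.conf for local testing could look like this (the bind_address line is optional, per the note above):
bind_address 127.0.0.1

listener 9001 127.0.0.1
protocol websockets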
This turned out to be quite the challenge.
First I tried to get things working using an online broker and client. Still not working. I then switched to another computer and everything was working fine.
Switching back to the original computer, I decided to test whether websockets were working at all by going to https://www.websocket.org/echo.html
From that moment on, everything started to work: first the online broker and client, and then also the local server and clients. I have no idea why...

How to set proxy for connecting of MQTT bridge?

I have a Mosquitto broker running on a Linux machine behind my company proxy.
I have configured a bridge to AWS as follows (mosquitto.conf):
connection bridge
address ec2-xxx-xxx-xxx-xxx.ap-northeast-1.compute.amazonaws.com:8089
remote_username admin
remote_password password
topic abc/raspi01 both 0
bridge_cafile /etc/pki/tls/certs/nginx-selfsigned.crt
bridge_insecure false
But when I restart the mosquitto service with:
service mosquitto restart
the log file shows some errors:
1554356888: mosquitto version 1.5.5 starting
1554356888: Config loaded from /etc/mosquitto/mosquitto.conf.
1554356888: Opening ipv4 listen socket on port 1883.
1554356888: Opening ipv6 listen socket on port 1883.
1554356888: Warning: Address family not supported by protocol
1554356888: Connecting bridge bridge (ec2-xxx-xxx-xxx-xxx.ap-northeast-1.compute.amazonaws.com:8089)
1554356888: Error creating bridge: Name or service not known.
1554356888: Warning: Unable to connect to bridge bridge.
1554356901: New connection from 127.0.0.1 on port 1883.
1554356901: New connection from 127.0.0.1 on port 1883.
1554356901: New client connected from 127.0.0.1 as mqtt_fd05fada.b70918 (c1, k60).
1554356901: New client connected from 127.0.0.1 as mqtt_2a3a025d.6c941e (c1, k60).
1554356919: Connecting bridge bridge (ec2-xxx-xxx-xxx-xxx.ap-northeast-1.compute.amazonaws.com:8089)
1554356919: Error creating bridge: Name or service not known.
1554356950: Connecting bridge bridge (ec2-xxx-xxx-xxx-xxx.ap-northeast-1.compute.amazonaws.com:8089)
1554356950: Error creating bridge: Name or service not known.
I think the cause is my company proxy.
I have tried proxy settings in mosquitto.service, but it did not resolve the issue.
[Unit]
Description=Mosquitto MQTT v3.1/v3.1.1 Broker
Documentation=man:mosquitto.conf(5) man:mosquitto(8)
After=network-online.target
Wants=network-online.target
[Service]
Type=notify
NotifyAccess=main
ExecStart=/usr/sbin/mosquitto -c /etc/mosquitto/mosquitto.conf
Environment="HTTPS_PROXY=http://user:pass#proxyhost:8800"
Environment="HTTP_PROXY=http://user:pass#proxyhost:8800"
Environment="NO_PROXY=127.0.0.1,localhost"
Restart=on-failure
[Install]
WantedBy=multi-user.target
Can anybody help me? Thanks so much.
You cannot use an HTTP proxy for an MQTT bridge connection (or any native MQTT connection); MQTT is a totally different protocol.
Only MQTT over websockets would work via an HTTP proxy, but you cannot configure mosquitto to run a bridge with MQTT over websockets.
One could accomplish this by issuing an HTTP CONNECT request before sending the MQTT connect, in net_mosq.c, right after:
rc = connect(*sock, rp->ai_addr, rp->ai_addrlen);
HTTP CONNECT is protocol-agnostic: it works on the underlying TCP connection, as sketched below.
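A rough sketch of that idea (hypothetical code, not taken from mosquitto; variable names and error handling are simplified placeholders):
/* after connect() to the proxy succeeds, and before any MQTT bytes are sent */
char req[256], resp[512];
snprintf(req, sizeof(req),
         "CONNECT %s:%d HTTP/1.1\r\nHost: %s:%d\r\n\r\n",
         broker_host, broker_port, broker_host, broker_port);
if(send(*sock, req, strlen(req), 0) < 0) return MOSQ_ERR_ERRNO;
ssize_t n = recv(*sock, resp, sizeof(resp) - 1, 0);
if(n <= 0) return MOSQ_ERR_CONN_LOST;
resp[n] = '\0';
if(!strstr(resp, " 200 ")) return MOSQ_ERR_CONN_REFUSED;
/* the socket is now a transparent TCP tunnel to the broker, and the normal
   MQTT handshake (including TLS, if configured) proceeds unchanged */
Since CONNECT only sets up the tunnel, everything after the proxy's 200 response is raw TCP, so a TLS bridge (bridge_cafile) would still negotiate normally inside it.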
