I'm running a few Windows IIS containers on a Service Fabric cluster. Occasionally, especially after high load, outbound connections from inside the containers become very slow and cause timeouts. Restarting the containers does not fix the issue. I even tried running a container directly on the node with docker run, as opposed to using an SF deployment, and the new container has the same slow network. What resolves it is restarting the Fabric.exe process on the node. The issue is random in that it affects only one node at a time.
Any ideas what could be causing this?
Currently, I'm running an application on VMs behind a load balancer. When I need to roll out a new version, a new set of VMs is spun up and the new version is deployed to them. The old VMs are given time to complete in-flight requests (10 minutes in my case).
I'd like to move the whole project to Docker and Kubernetes using Microsoft's standard aspnet images, but I cannot find any documentation on how in-flight requests are handled with this stack.
I have a 23 node cluster running CoreOS Stable 681.2.0 on AWS across 4 availability zones. All nodes are running etcd2 and flannel. Of the 23 nodes, 8 are dedicated etcd2 nodes and the rest are specifically designated as etcd2 proxies.
Scheduled to the cluster are 3 nginx plus containers, a private Docker registry, SkyDNS, and 4 of our application containers. The application containers register themselves with etcd2, and the nginx containers pick up any changes, render the necessary files, and finally reload.
This all works perfectly until a single etcd2 node becomes unavailable for any reason.
If the cluster of voting etcd2 members loses connectivity to even a single other voting etcd2 member, all of the services scheduled to the fleet become unstable. Scheduled services begin stopping and starting without my intervention.
As a test, I began stopping the EC2 instances which host voting etcd2 nodes until quorum was lost. After the first etcd2 node was stopped, the above symptoms began. After a second node was stopped, the services remained unstable, with no further observable change. Then, after the third was stopped, quorum was lost and all units were unscheduled. I then started all three etcd2 nodes again and within 60 seconds the cluster had returned to a stable state.
Subsequent tests yield identical results.
Am I hitting a known bug in etcd2, fleet or CoreOS?
Is there a setting I can modify to keep units scheduled onto a node even if etcd is unavailable for any reason?
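For anyone reproducing this, the voting membership and the state of the scheduled units can be inspected with the standard etcd2 and fleet tooling; a quick sketch:

# list voting members and overall health of the etcd2 cluster
etcdctl member list
etcdctl cluster-health

# see which machines fleet knows about and where units are scheduled
fleetctl list-machines
fleetctl list-units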
I've experienced the same thing. In my case, running one specific unit caused everything to blow up. Units that were scheduled and running perfectly fine were suddenly lost without any notice, and machines even dropped out of the cluster.
I'm still not sure what the exact problem was, but I think it might have had something to do with etcd vs etcd2. I had a dependency on etcd.service in the unit file, which (I think, but I'm not sure) caused CoreOS to try to start etcd.service while etcd2.service was already running. That might have caused the conflict in my case and messed up the etcd registry of units and machines.
Something similar might be happening to you, so I suggest you check on each host whether you're running etcd or etcd2, and check your unit files to see which one they depend on.
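A quick way to check both things, assuming standard systemd tooling and the usual unit file locations (a sketch, adapt the paths to your setup):

# which of the two daemons is actually running on this host?
systemctl status etcd.service etcd2.service

# which one do the unit files depend on?
grep -E '^(Requires|After|BindsTo|Wants)=' /etc/systemd/system/*.service | grep etcd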
I had a set of t2.micro instances behind a load balancer that kept dying. By dying I mean the REST API running on them would stop responding and I couldn't SSH to the instances. I would have to launch new instances from a saved AMI.
I decided there must be something wrong with the AMI, so I rebuilt the server, got everything running and created a new AMI.
After about a week, the same problem came back. The basic monitoring shows little to no activity on the servers.
The System Log on the instance shows that the server starts fine and my Node.js REST API is launched.
Has anyone else experienced this and been able to find a solution?
I'm trying to put a new version of my webserver (which runs as a binary) on an amazon ec2 instance. The problem is that I have to shut the process down each time to do so. Does anyone know a workaround where I could upload it while the process is still running?
Even if you could, you don't want to. What you want to do is:
Have at least 2 machines running behind a load balancer
Take one of them out of the LB pool
Shut down the processes on it
Replace them (binaries, resources, config, whatever)
Bring them back up
Then put it back in the pool.
Do the same for the other machine.
Make sure your changes are backward compatible, as there will be a short period of time when both versions run concurrently.
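If the machines sit behind a classic ELB, taking an instance out of the pool and putting it back can be scripted with the AWS CLI. A rough sketch, where the load balancer name, instance ID, host, paths and service name are all placeholders:

# take the instance out of the pool
aws elb deregister-instances-from-load-balancer \
    --load-balancer-name my-lb --instances i-0123456789abcdef0

# give in-flight requests time to finish (or rely on ELB connection draining)
sleep 120

# stop the process, replace the binary, bring it back up
ssh ec2-user@host-a 'sudo systemctl stop webserver'
scp ./webserver ec2-user@host-a:/opt/app/webserver
ssh ec2-user@host-a 'sudo systemctl start webserver'

# put the instance back in the pool
aws elb register-instances-with-load-balancer \
    --load-balancer-name my-lb --instances i-0123456789abcdef0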
I am currently tinkering with CoreOS and creating a cluster based upon it. So far, the experience with CoreOS on a single host is quite smooth. But things get a little hazy when it comes to service discovery. Somehow I don't get the overall idea, hence I am asking here now for help.
What I want to do is to have two Docker containers running where the first relies on the second. If we are talking pure Docker, I can solve this using linked containers. So far, so good.
But this approach does not work across machine boundaries, because Docker cannot link containers across multiple hosts. So I am wondering how to do this.
What I've understood so far is that CoreOS's way of dealing with this is its etcd service, which is basically a distributed key-value store that is accessible locally on each host via port 4001, so that you (as a consumer of etcd) do not have to deal with any networking details: just access localhost:4001 and you're fine.
So, in my head, the idea is that when a Docker container which provides a service spins up, it registers itself (i.e. its IP address and port) in the local etcd, and etcd takes care of distributing that information across the network. This way you get key-value pairs such as:
RedisService => 192.168.3.132:49236
Now, when another Docker container needs to access the RedisService, it gets the IP address and port from its own local etcd, at least once the information has been distributed across the network. So far, so good.
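In concrete terms, that register/lookup round trip could look like this (a sketch; the key name and address are just the example values from above):

# on the host where the Redis container came up: register it in the local etcd
curl -L http://localhost:4001/v2/keys/services/redis -XPUT -d value="192.168.3.132:49236"

# on any other host: resolve the service through that host's own local etcd
curl -L http://localhost:4001/v2/keys/services/redis
# or, equivalently:
etcdctl get /services/redis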
But now I have a question that I cannot answer, and that has been puzzling me for a few days: What happens when a service goes down? Who cleans up the data inside of etcd? If it is not cleaned up, all the clients try to access a service that is no longer there.
The only (reliable) solution I can think of at the moment is making use of etcd's TTL feature for data, but this involves a trade-off: either you generate quite a lot of network traffic, as you need to send a heartbeat every few seconds, or you have to live with stale data. Neither is fine.
The other, well, "solution" I can think of is to make a service deregister itself when it goes down, but this only works for planned shutdowns, not for crashes, power outages, and so on.
So, how do you solve this?
There are a few different ways to solve this: the sidekick method, using ExecStopPost and removing on failure. I'm assuming a trio of CoreOS, etcd and systemd, but these concepts could apply elsewhere too.
The Sidekick Method
This involves running a separate process next to your main application that heartbeats to etcd. On the simple side, this is just a shell loop that runs forever. You can use systemd's BindsTo to ensure that when your main unit stops, this service registration unit stops too. In the ExecStop you can explicitly delete the key you're setting. We're also setting a TTL of 60 seconds to handle any ungraceful stoppage.
[Unit]
Description=Announce nginx1.service
# Binds this unit and nginx1 together. When nginx1 is stopped, this unit will be stopped too.
BindsTo=nginx1.service
[Service]
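# Heartbeat: re-set the key every 45 seconds with a 60-second TTL,
# so the entry expires on its own if the unit dies without running ExecStop.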
ExecStart=/bin/sh -c "while true; do etcdctl set /services/website/nginx1 '{ \"host\": \"10.10.10.2\", \"port\": 8080, \"version\": \"52c7248a14\" }' --ttl 60;sleep 45;done"
ExecStop=/usr/bin/etcdctl delete /services/website/nginx1
[Install]
WantedBy=local.target
On the complex side, this could be a container that starts up and hits a /health endpoint that your app provides to run a health check before sending data to etcd.
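For example, the heartbeat loop with a health check folded in could look roughly like this (a sketch; the /health path is whatever your app exposes, and the address, key and version are the example values from the unit above):

#!/bin/sh
# sidekick heartbeat: only refresh the etcd entry while the app reports healthy
while true; do
  if curl -fs http://10.10.10.2:8080/health > /dev/null; then
    etcdctl set /services/website/nginx1 \
      '{ "host": "10.10.10.2", "port": 8080, "version": "52c7248a14" }' --ttl 60
  else
    # unhealthy: remove the key right away instead of waiting for the TTL to expire
    etcdctl rm /services/website/nginx1
  fi
  sleep 45
done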
ExecStopPost
If you don't want to run something beside your main app, you can have etcdctl commands within your main unit to run on start and stop. Be aware, this won't catch all failures, as you mentioned.
[Unit]
Description=MyWebApp
After=docker.service
Requires=docker.service
After=etcd.service
Requires=etcd.service
[Service]
ExecStart=/usr/bin/docker run --rm --name myapp1 -p 8084:80 username/myapp command
ExecStartPost=/usr/bin/etcdctl set /services/myapp/%H:8084 '{ \"host\": \"%H\", \"port\": 8084, \"version\": \"52c7248a14\" }'
ExecStopPost=/usr/bin/etcdctl rm /services/myapp/%H:8084
[Install]
WantedBy=local.target
%H is a systemd variable that substitutes in the hostname for the machine. If you're interested in more variable usage, check out the CoreOS Getting Started with systemd guide.
Removing on Failure
On the client side, you could remove any instance that you have failed to connect to more than X times. If you get a 500 or a timeout from /services/myapp/instance1, you could keep increasing the failure count stored in the key and then try to connect to the other hosts in the /services/myapp/ directory.
etcdctl set /services/myapp/instance1 '{ \"host\": \"%H\", \"port\": 8084, \"version\": \"52c7248a14\", \"failures\": 1 }'
When you hit your desired threshold, remove the key with etcdctl.
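A simplified client-side sketch of that idea (it counts failures locally rather than writing them back into the etcd value, and the threshold, URL and key are just examples):

#!/bin/sh
# try one instance a few times; after too many failures, drop it from etcd and move on
KEY=/services/myapp/instance1
URL=http://10.10.10.2:8084/
THRESHOLD=3

failures=0
while [ "$failures" -lt "$THRESHOLD" ]; do
  if curl -fs --max-time 2 "$URL" > /dev/null; then
    exit 0                      # instance is healthy, nothing to do
  fi
  failures=$((failures + 1))
done

# too many failures: remove the key so other clients stop picking this instance,
# then choose another host from the /services/myapp/ directory
etcdctl rm "$KEY"
etcdctl ls /services/myapp/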
Regarding the network traffic that heartbeating would cause: in most cases you should be sending this traffic over a local private network run by your provider, so it should be free and very fast. Besides, etcd is constantly heartbeating with its peers anyway, so this is just a small increase in traffic.
Hop into #coreos on Freenode if you have any other questions!