Private etcd cluster using tokens - etcd

I'm trying to set up an embedded etcd cluster on a bunch of machines that I'm keeping track of (so I always have the list of machine IPs), and there will always be at least one machine with a known static IP that's always in the cluster.
I'd like to set it up so any machine that has a secret token can join this cluster, assuming I bootstrap it with the token and one or more current cluster member IP addresses.
The only options that seem relevant are initial-cluster and initial-cluster-token, but I can't work out where to set up the bootstrap servers.
Or does it make sense to just trust the (or a private) discovery service for this? Does the discovery service automatically scale membership past the initial size, and does it work well for long-lived clusters that have a lot of servers leaving and joining over months or years?
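For what it's worth, as far as I know etcd has no secret-token join mechanism out of the box; --initial-cluster-token only fingerprints a cluster at bootstrap so that members of different clusters don't accidentally interact. The usual way to let a new machine join an existing cluster is runtime reconfiguration: register the member on an existing node, then start it with --initial-cluster-state existing. A command sketch, with all names and IPs hypothetical (10.0.0.5 standing in for the known static member, 10.0.0.9 for the joiner):

```shell
# On any existing member: register the new machine first
etcdctl --endpoints=http://10.0.0.5:2379 member add node9 \
  --peer-urls=http://10.0.0.9:2380

# On the new machine: start etcd pointing at the current membership,
# with cluster state "existing" rather than "new"
etcd --name node9 \
  --listen-peer-urls http://10.0.0.9:2380 \
  --initial-advertise-peer-urls http://10.0.0.9:2380 \
  --listen-client-urls http://10.0.0.9:2379 \
  --advertise-client-urls http://10.0.0.9:2379 \
  --initial-cluster "node1=http://10.0.0.5:2380,node9=http://10.0.0.9:2380" \
  --initial-cluster-state existing
```

The "secret token" gate would then live in whatever tooling runs the member add step, not in etcd itself.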

Related

Apache Geode Locator on different boxes, start but not forming a cluster

I have already gone through one of the posts; the recommendation there was to run things in the same LAN segment, or maybe use Swarm for Docker: Docker Geode remote locator
My question is: if I run 2 locators on 2 different nodes (VMs or physical hardware), each specifying --locators='hostN[port]' to reference the other, how do I know whether they formed a cluster?
Do I need to configure them in a WAN configuration even when they are part of the same LAN but on independent nodes (rather than processes within one node)?
list members is not showing both locators.
I am able to connect from one gfsh console to the locator on a different node. As long as I am not giving any bind addresses while starting the locators, both locators start; however, I don't know whether they formed a cluster (or are connected to each other).
I am evaluating Apache Geode but need HA across VMs, not processes within the same node. Please guide. Thanks in advance.
I changed the IPv4 configuration and provided a nearby LAN address; the recommendation is to bring them into the same LAN segment.
If I start the locators virtually simultaneously, rather than waiting for the other locator to already be up, each seems to find the other.
Other options I found was to increase:
ack-wait-threshold=15
member-timeout=5000
But keep in mind: the more time you allow for acknowledgement, the longer it will take to be told that a node is not responding if it fails for any reason.
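For reference, a minimal sketch of starting two locators that reference each other and then checking membership; hostnames and ports here are hypothetical:

```shell
# On host1 (names, hosts, and ports are placeholders)
gfsh start locator --name=locator1 --port=10334 \
  --locators=host1[10334],host2[10334]

# On host2
gfsh start locator --name=locator2 --port=10334 \
  --locators=host1[10334],host2[10334]

# From either host, connect and verify membership:
gfsh
gfsh> connect --locator=host1[10334]
gfsh> list members
```

If the two locators formed one cluster, list members should show both locator1 and locator2.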

How to properly scale Jelastic app servers horizontally

I have several stateless app servers packed into Docker containers. I have a lot of load on top of them and I want to horizontally scale this setup. My setup doesn't include load balancer nodes.
What I've done is simply increase the node count — so far so good.
From my understanding, Jelastic has an internal load balancer which decides which node to pass each incoming request to, e.g.:
user -> jelastic.my-provider.com -> one of 10 of app nodes created.
But I've noticed that a lot of my nodes (especially the last ones) are not receiving any requests and just idle, while the first nodes receive the lion's share of incoming requests (and I have a lot of them!). This looks strange to me, because I thought the internal load balancer was doing some kind of round-robin distribution.
How do I set up round-robin balancing properly? I came to the conclusion that I have to create another environment with nginx/haproxy and manually add all 10 of my nodes to its list of backend servers.
Edit: I've set up a separate HAProxy instance and manually added all my nodes to haproxy.cfg, and it worked like a charm. But the question is still open, since I want to achieve automatic/scheduled horizontal scaling.
Edit2: I use Jelastic v5.3 Cerebro. I use custom Docker images (btw, I have something like ~20 environments, all built from custom images except the databases).
My topology for this specific case is pretty simple: a single Docker environment with the app server configured and scaled to 10 nodes. I don't use a public IP.
Edit3: I don't need sticky sessions at all. All my requests come from another service deployed to Jelastic (1 node).
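The manual HAProxy setup described in the first edit might look roughly like this; all addresses and ports are placeholders:

```
# haproxy.cfg fragment -- addresses and ports are placeholders
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend app_front
    bind *:80
    default_backend app_nodes

backend app_nodes
    balance roundrobin
    server node1 10.0.0.1:8080 check
    server node2 10.0.0.2:8080 check
    # one "server" line per node, up to node10
```

Since sticky sessions aren't needed, plain roundrobin with health checks (the check keyword) is sufficient; scaling then comes down to regenerating the server lines when nodes are added or removed.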

Is it safe to use etcd across multiple data centers?

Is it safe to use etcd across multiple data centers? It exposes the etcd port to the public internet.
Do I have to use client certificates in this case, or does etcd have some sort of authentication?
Yes, but there are two big issues you need to tackle:
Security. This all depends on what type of info you are storing in etcd. Using a point-to-point VPN is probably preferred over exposing the entire cluster to the internet. Client certificates can also be used.
Tuning. etcd relies on replication between machines for two things: aliveness and consensus. Since a successful write must be committed to a majority of the cluster before it returns as successful, your write performance will degrade as the distance between the machines increases. Aliveness is measured with periodic heartbeats between the machines. By default, etcd has a fairly aggressive 50ms heartbeat timeout, which is optimized for bare metal servers running on a local network. Without tuning this timeout value, your cluster will constantly think that members have disappeared and trigger frequent master elections. This gets worse if both of your environments are on cloud providers that have variable networks plus disk writes that traverse the network, a double whammy.
More info on etcd tuning: https://etcd.io/docs/latest/tuning/
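The timeouts discussed above are set with the --heartbeat-interval and --election-timeout flags (both in milliseconds). An illustrative sketch for a high-latency cross-DC link, following the rule of thumb from the tuning docs that the election timeout should be roughly ten times the heartbeat interval; the specific numbers are examples, not a recommendation for any particular network:

```shell
# Heartbeat on the order of the cross-DC round-trip time;
# election timeout about 10x the heartbeat interval.
etcd --heartbeat-interval=300 --election-timeout=3000
```

All members of a cluster should be started with the same values, since mismatched timeouts cause their own instability.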

Node thinks that it is online when its network cable is unplugged. Pacemaker/Corosync

I am trying to cluster 2 computers together with Pacemaker/Corosync. The only resource that they share is an ocf:heartbeat:IPaddr. This is the main problem:
Since there are only two nodes, failover will only occur if no-quorum-policy=ignore.
When the network cable is pulled from node A, corosync on node A binds to 127.0.0.1, and pacemaker believes that node A is still online and node B is the one offline.
Pacemaker attempts to start the IPaddr on node A, but it fails to start because there is no network connection. Node B, on the other hand, recognizes that node A is offline, and if the IPaddr service was started on node A it will start it on itself (node B) successfully.
However, since the service failed to start on node A, node A enters a fatal state and has to be rebooted to rejoin the cluster. (You could restart some of the needed services instead.)
One workaround is to set start-failure-is-fatal="false", which makes node A keep trying to start the IPaddr service until it succeeds. The problem with this is that once it succeeds you have an IP conflict between the two nodes until they re-cluster and one of them gives up the resource.
I am playing around with the idea of having a node attribute that mirrors cat /sys/class/net/eth0/carrier, which is 1 when the cable is connected and 0 when it is disconnected, and then having a location rule that says: if "connected" == 0, don't start the service. But we'll see.
Any thoughts or ideas would be greatly appreciated.
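The carrier-mirroring idea is close to what the stock ocf:pacemaker:ethmonitor agent provides: it periodically records the interface's link state in a node attribute (named after the interface) that a location rule can test. A sketch in pcs syntax; the resource names are hypothetical and the attribute name is assumed to be ethmonitor-eth0:

```shell
# Clone an ethmonitor resource so every node reports link state for eth0;
# it maintains a node attribute "ethmonitor-eth0" (1 = link up)
pcs resource create link-check ocf:pacemaker:ethmonitor interface=eth0 --clone

# Keep the IP resource off any node whose eth0 link is down
pcs constraint location cluster-ip rule score=-INFINITY ethmonitor-eth0 ne 1
```

This only papers over the symptom, though; as the answer below explains, the underlying fix for this class of problem is fencing.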
After speaking with Andrew Beekhof (author of Pacemaker) and Digimer on the freenode.net #linux-cluster IRC network, I have learned that the actual cause behind this issue is that the cluster is improperly fenced.
Fencing or having stonith enabled is absolutely essential to having a successful High Availability Cluster. The following page is a must read on the subject:
Cluster Tutorial: Concept - Fencing
Many thanks to Digimer for providing this invaluable resource. The section on clustering answers this question, however the entire article is beneficial.
Basically, fencing and STONITH (Shoot The Other Node In The Head) are mechanisms that a cluster uses to make sure that a down node is actually dead. It needs to do this to avoid shared-memory corruption and split-brain status (multiple nodes taking over shared resources), and to make sure that your cluster does not get stuck in recovery or crash.
If you don't have stonith/fencing configured and enabled in your cluster environment you really need it.
Other issues to look out for are Stonith Deathmatch, and Fencing Loops.
In short the issue of loss of network connectivity causing split brain was solved by creating our own Stonith Device and writing a stonith agent following the /usr/share/doc/cluster-glue/stonith/README.external tutorial, and then writing a startup script that checks to see if the node is able to support joining the cluster and then starts corosync or waits 5 minutes and checks again.
According to your configuration, the heartbeat between the two nodes will use 127.0.0.1, which I think is wrong.
Usually corosync needs to bind to private IPs, and the IPaddr resource should use a different IP, called the traffic IP.
For example:
Node A: 192.168.1.100 (for heartbeat); 10.0.0.1 (traffic IP)
Node B: 192.168.1.101 (for heartbeat); 10.0.0.2 (traffic IP)
If my understanding is correct, the IPaddr service will start up a virtual IP based on the traffic IP; let's assume it's 10.0.0.3.
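With that addressing scheme, corosync's totem interface would bind to the 192.168.1.x heartbeat network while Pacemaker manages the floating 10.0.0.3 address. A corosync.conf fragment along these lines, using the values from the example above:

```
# /etc/corosync/corosync.conf fragment: bind the heartbeat ring
# to the private 192.168.1.x network
totem {
    version: 2
    interface {
        ringnumber: 0
        bindnetaddr: 192.168.1.0   # network address of the heartbeat IPs
        mcastport: 5405
    }
}
```

The IPaddr/IPaddr2 resource would then be created with ip=10.0.0.3, so the virtual IP floats between the nodes' traffic interfaces rather than the heartbeat network.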

Haproxy Load Balancer, EC2, writing my own availability script

I've been looking at high availability solutions such as heartbeat and keepalived to fail over when an HAProxy load balancer goes down. I realised that although we would like high availability, at this point in time it doesn't justify the expense of having 2 load balancer instances running at any one time just to get instant failover (particularly as one LB would be redundant in our setup).
My alternate solution is to fire up a new load balancer EC2 instance from an AMI if the current load balancer has stopped working, and associate it with the Elastic IP that our domain name points to. This should ensure that downtime is limited to the time it takes to fire up the new instance and associate the Elastic IP, which given our current circumstances seems like a reasonably cost-effective approach to high availability, particularly as we can easily do it across multiple availability zones. I am looking to do this using the following steps:
Prepare an AMI of the load balancer
Fire up a single ec2 instance acting as the load balancer and assign the Elastic IP to it
Have a micro server ping the current load balancer at regular intervals (we always have an extra micro server running anyway)
If the ping times out, fire up a new EC2 instance using the load balancer AMI
Associate the elastic ip to the new instance
Shut down the old load balancer instance
Repeat step 3 onwards with the new instance
I know how to run the commands in my script to start up and shut down EC2 instances, associate the elastic IP address to an instance, and ping the server.
My question is what would be a suitable ping here? Would a standard ping suffice at regular intervals, and what would be a good interval? Or is this a rather simplistic approach and there is a smarter health check that I should be doing?
Also if anyone foresees any problems with this approach please feel free to comment
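The monitor-and-replace loop in steps 3 through 7 could be sketched with the AWS CLI. An HTTP check via curl is usually more telling than an ICMP ping, since it verifies HAProxy is actually answering rather than just that the host is up. The AMI ID, allocation ID, instance ID, and URL below are all placeholders:

```shell
#!/bin/sh
# Watchdog sketch -- every identifier here is a placeholder.
LB_URL="http://lb.example.com/"    # hostname behind the Elastic IP
AMI_ID="ami-00000000"              # load balancer AMI
EIP_ALLOC="eipalloc-00000000"      # Elastic IP allocation ID
OLD_INSTANCE="i-00000000"          # currently running LB instance

while true; do
    # HTTP-level health check: 3 retries, 5s timeout each.
    if ! curl -sf --max-time 5 --retry 3 -o /dev/null "$LB_URL"; then
        # Fire up a replacement from the AMI and capture its instance ID
        NEW_INSTANCE=$(aws ec2 run-instances --image-id "$AMI_ID" \
            --count 1 --query 'Instances[0].InstanceId' --output text)
        aws ec2 wait instance-running --instance-ids "$NEW_INSTANCE"
        # Move the Elastic IP, then retire the dead instance
        aws ec2 associate-address --instance-id "$NEW_INSTANCE" \
            --allocation-id "$EIP_ALLOC"
        aws ec2 terminate-instances --instance-ids "$OLD_INSTANCE"
        OLD_INSTANCE="$NEW_INSTANCE"
    fi
    sleep 30
done
```

A 30-second interval with a few retries is a reasonable starting point; anything much tighter risks replacing the LB on a transient network blip.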
I understand exactly where you're coming from; my company is in the same position. We care about having a highly available, fault-tolerant system, but the overhead cost simply isn't viable for the traffic we get.
One problem I have with your solution is that you're assuming the micro instance and the load balancer won't both die at the same time. From my experience with Amazon I can tell you it's definitely possible, however unlikely, that whatever causes your load balancer to die also takes down the micro instance.
Another potential problem is that you also assume you will always be able to start a replacement instance during downtime. This is simply not the case; take for example an outage Amazon had in their us-east-1 region a few days ago. A power outage caused one of their zones to lose power, and when they restored power and began to recover the instances, their APIs were not working properly because of the sheer load. It took almost an hour before they were available again. If an outage like this knocks out your load balancer and you're unable to start another, you'll be down.
That being said, I find the ELBs provided by Amazon are a better solution for me. I'm not sure what the reasoning is behind using HAProxy, but I recommend investigating ELBs, as they allow you to do things such as auto-scaling.
For each ELB you create, Amazon creates one load balancer in each zone that has an instance registered. These are still vulnerable to certain problems during severe outages like the one described above. For example, during that downtime I could not add new instances to the load balancers, but my existing instances (the ones not affected by the power outage) were still serving requests.
UPDATE 2013-09-30
Recently we've changed our infrastructure to use a combination of ELB and HAProxy. I find that ELB gives the best availability, but the fact that it uses DNS load balancing doesn't work well for my application. So our setup is ELB in front of a 2-node HAProxy cluster. Using HAProxyCloud, a tool I created for AWS, I can easily add auto-scaling groups behind the HAProxy servers.
I know this is a little old, but the solution you suggest is overcomplicated; there's a much simpler method that does exactly what you're trying to accomplish.
Just put your HAProxy machine, with your custom AMI, in an auto-scaling group with a minimum AND maximum of 1 instance. That way, when your instance goes down, the ASG will bring it right back up, EIP and all. No external monitoring necessary, and the same if not faster response to downed instances.
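The min = max = 1 auto-scaling group can be sketched with the AWS CLI; all names and IDs are placeholders. (In my understanding, re-attaching the EIP is typically handled by a small user-data script that calls associate-address on boot, since the ASG launches the replacement but does not move Elastic IPs by itself.)

```shell
# Self-healing single instance: an ASG pinned to exactly one instance.
# Names, AMI ID, instance type, and zones are placeholders.
aws autoscaling create-launch-configuration \
    --launch-configuration-name haproxy-lc \
    --image-id ami-00000000 --instance-type t3.micro

aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name haproxy-asg \
    --launch-configuration-name haproxy-lc \
    --min-size 1 --max-size 1 --desired-capacity 1 \
    --availability-zones us-east-1a us-east-1b
```

If the instance fails its health check, the ASG terminates it and launches a fresh one from the same launch configuration, which is exactly the watchdog behaviour the question tried to script by hand.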
