Consul infrastructure footprint and performance - consul

Is there any documentation on Consul's requirements as it pertains to infrastructure footprint ? (e.g. memory / disk / cpu requirement and typical usage of Consul agents and master themselves). How does this compare with other similar service discovery solutions ?

I know of no official documentation on this. It will depend in part on the number of total nodes you are running (i.e. server agents plus client agents on each machine running services.
This thread includes comments (and a good link to a presentation by Darron at DataDog) from production users. They indicate using AWS m3.medium to m3.large instances in their production situations (you can find specs on those instance types here). Darron's presentation includes some information on # of nodes in their scenario as well as comments on how they scaled as the number of nodes grew.

Related

How does the Titan (not backend storage) clustering work?

Context
I'm using Titan v1.0.0, on AWS infrastructure and want to support failover/fault tolerance. AWS will take care of DynamoDb storage backend but it seems necessary to have several titan instances serviced by an (ELB) load balancer.
I'm using a nodeJs library to get to gremlin, and gremlin to access Titan.
Question
So, how does the Titan (not backend storage) clustering work? If at all.
To be clear, I'm not talking about any backend storage clustering, as I'm using dynamoDb on AWS. The documentation on transaction locking suggests to me that a titan cluster must exist, as other titan nodes wouldn't know about the locking without some sort of inter-communication. But I don't see any configuration options that support this.
If clustering is possible on titan, does anyone have any information on how to get this setup in a production High-Availability setup?
An illustration of the server-side architecture:
[NodeJsA]\ /[TitanA]\
\ / \
[ELB (AWS)] [DynamoDb (AWS)]
/ \ /
[NodeJsB]/ \[TitanB]/
Further, if there is no clustering of the titan nodes then a change made via the TitanA node (above) could take the following amount of time to be seem on TitanB node (worst case):
(AWS eventual consistency convergence time (~1 sec) + TitanB cache timeout + poll time from NodeJs nodes to titanDB)
Another consequence of the lack of clustering would be that sessions would have be to pinned in the ELB else a read request, after an update, could be served by a different node with stale information.
Titan doesn't do any "clustering" outside of what is supported by the selected backend. You referred to "locking" as something that would indicate that clustering is supported, but if you read about the locking providers in that link you supplied you'll see that locking isn't really doing anything terribly fancy at a Titan level and that it is backend dependent. So Titan instances really don't have any external clustering capabilities or knowledge of each other. You therefore need to take that into account with respect to your architecture.

Websphere application server 8.5.5 clustering same application

I have the same application running on two WAS clusters. Each cluster has 3 application servers based in different datacenters. In front of each cluster are 3 IHS servers.
Can I specify a primary cluster, and a failover cluster within the plugin-cfg.xml? Currently I have both clusters defined within the plugin, but I'm only hitting 1 cluster for every request. The second cluster is completely ignored.
Thanks!
As noted already the WAS HTTP server plugin doesn't provide the function your're seeking as documented in the WAS KC http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/rwsv_plugincfg.html?lang=en
assuming that by "failover cluster" what is actually meant is "BackupServers" in the plugin-cfg.xml
The ODR alternative mentioned previously likely isn't an option either, this because the ODR isn't supported for use in the DMZ (it's not been security hardened for DMZ deployment) http://www-01.ibm.com/support/knowledgecenter/SSAW57_8.5.5/com.ibm.websphere.nd.doc/ae/twve_odoecreateodr.html?lang=en
From an effective HA/DR perspective what you're seeking to accomplish should handled at the network layer, using the global load balancer (global site selector, global traffic manager, etc) that is routing traffic into the data centers, this is usually accomplished by setting a "site cookie" using the load balancer
This is by design. IHS, at least at the 8.5.5 level, does not allow for what you are trying to do. You will have to implement such level of high availability in a higher level in your topology.
There are a few options.
If the environemnt is relatively static, you could post-process plugin-cfg.xml and combine them into a single ServerCluster with the "dc2" servers listed as <BackupServer>'s in the cluster. The "dc1" servers are probably already listed as <PrimaryServer>'s
BackupServers are only used when no PrimaryServers are reachable.
Another option is to use the Java On-Demand Router, which has first-class awareness of applications running in two cells. Rules can be written that dictate the behavior of applications residing in two clusters (load balancer, failover, etc.). I believe these are "ODR Routing Rules".

Mesos real world use-cases

I'm trying to figure out what would be the reasons for using Mesos. Can you come up with other ones?
Running all of your services in the same cluster instead of dedicated clusters (your end-applications + DevOps such as Jenkins)
Running different maturity applications in same cluster (dev, test, production), or is this viable? Kubernetes has a similar approach with Labels
Mesos simplifies the use of traditional distributed applications such as Hadoop by easing deployment, unified API, bin-packing of resources
Full-disclosure: I currently work at Twitter and I'm involved in both Apache Mesos and Aurora.
Mesos uses cases can vary based upon a few dimensions: scale (10 servers vs 10s of thousands), available hardware (dedicated/static or in the public cloud/scalable), and workloads (primarily services, batch, or both).
Your list is a great start. Here are a few additional use cases / features to add.
Container Orchestration
As container runtimes like Docker have become popular, lots of potential users are looking at Mesos + a scheduler to manage orchestration once container images are created. Mesos is already quite mature and has been proven at scale, which I think has given it a leg up over some emergent solutions.
Increased Resource Utilization
For companies running >50 servers, a common motivation for adopting Mesos is to increase resource utilization to reduce CapEx. There are a number of examples of this in both the public and private cloud. In the case of Ebay they have been running Jenkins on Mesos and were able to reduce their VM footprint. Mesosphere has also published a case study of HubSpot (runnning on AWS), and how they've been able to replace hundreds of smaller servers with dozens of larger ones by more-efficiently using their available hardware.
Preemption
At Twitter we're running Mesos via one scheduler: Apache Aurora. One of the ways we can improve utilization relates to your use case: running different maturity applications in the same cluster. Aurora has a concept of environments, so you can run applications that are production, development, or test. Additionally, Aurora has a built-in preemption feature which allows it to prioritize production over non-production tasks, killing non-production tasks when those resources are needed to run production ones as well as a priority system within each environment.
Long-term, functionality related to preemption will also be located in the Mesos core itself -- it's a killer feature related to both increased resource utilization and running different maturity applications (dev, test, prod). There are a few Mesos tickets to follow if you're interested in keeping up to date, including MESOS-155 for preemption, and MESOS-1474 for inverse offers.
Colocating Batch and Services
Running batch and services in a shared Mesos cluster will be key to driving up utilization even further as js84 points out. Check out Project Myriad, an effort to colocate Mesos and YARN workloads in the same cluster. At this time I'm not aware of any large deployments running both batch and services, but it's certainly the direction the community is moving in as it becomes easier for multiple frameworks to run in a shared cluster.
At least one additional use case comes to mind: Development SDK for developing distributed applications. If you have a look at Mesos Frameworks you will find a number of frameworks which have been developed on top of Mesos. Also interesting Apple's Siri framework powering Siri.
Regarding your 1): One additional angle you should keep in mind here is scaling your applications in the same cluster. I.e. at peak load of your website, shift resources easily towards the webservers while scaling down the Hadoop analytical processing.

Docker-Swarm, Kubernetes, Mesos & Core-OS Fleet

I am relatively new to all these, but I'm having troubles getting a clear picture among the listed technologies.
Though, all of these try to solve different problems, but do have things in common too. I would like to understand what are the things that are common and what is different. It is likely that the combination of few would be great fit, if so what are they?
I am listing a few of them along with questions, but it would be great if someone lists all of them in detail and answers the questions.
Kubernetes vs Mesos:
This link
What's the difference between Apache's Mesos and Google's Kubernetes
provides a good insight into the differences, but I'm unable to understand as to why Kubernetes should run on top of Mesos. Is it more to do with coming together of two opensource solutions?
Kubernetes vs Core-OS Fleet:
If I use kubernetes, is fleet required?
How does Docker-Swarm fit into all the above?
Disclosure: I'm a lead engineer on Kubernetes
I think that Mesos and Kubernetes are largely aimed at solving similar problems of running clustered applications, they have different histories and different approaches to solving the problem.
Mesos focuses its energy on very generic scheduling, and plugging in multiple different schedulers. This means that it enables systems like Hadoop and Marathon to co-exist in the same scheduling environment. Mesos is less focused on running containers. Mesos existed prior to widespread interest in containers and has been re-factored in parts to support containers.
In contrast, Kubernetes was designed from the ground up to be an environment for building distributed applications from containers. It includes primitives for replication and service discovery as core primitives, where-as such things are added via frameworks in Mesos. The primary goal of Kubernetes is a system for building, running and managing distributed systems.
Fleet is a lower-level task distributor. It is useful for bootstrapping a cluster system, for example CoreOS uses it to distribute the kubernetes agents and binaries out to the machines in a cluster in order to turn-up a kubernetes cluster. It is not really intended to solve the same distributed application development problems, think of it more like systemd/init.d/upstart for your cluster. It's not required if you run kubernetes, you can use other tools (e.g. Salt, Puppet, Ansible, Chef, ...) to accomplish the same binary distribution.
Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API. Fundamentally, our experience at Google and elsewhere indicates that the node API is insufficient for a cluster API. You can see a bunch of discussion on this here: https://github.com/docker/docker/pull/8859 and here: https://github.com/docker/docker/issues/8781
Join us on IRC # #google-containers if you want to talk more.
I think the simplest answer is that there is no simple answer. The swift rise to power of containers, and Docker in particular has left a power vacuum for "container scheduling and orchestration", whatever that might mean. In reality, that means you have a number of technologies that can work in harmony on some levels, but with certain aspects in competition. For example, Kubernetes can be used as a one stop shop for deploying and managing containers on a compute cluster (as Google originally designed it), but could also sit atop Fleet, making use of the resilience tier that Fleet provides on CoreOS.
As this Google vid states Kubernetes is not a complete out the box container scaling solution, but is a good statement to start from. In the same way, you would at some stage expect Apache Mesos to be able to work with Kubernetes, but not with Marathon, in as much as Marathon appears to fulfil the same role as Kubernetes. Somewhere I think I've read these could become part of the same effort, but I could be wrong about that - it's really about the strategic direction of Mesosphere and the corresponding adoption of Kubernetes principles.
In the DockerCon keynote, Solomon Hykes suggested Swarm would be a tier that could provide a common interface onto the many orchestration and scheduling frameworks. From what I can see, Swarm is designed to provide a smooth Docker deployment workflow, working with some existing container workflow frameworks such as Deis, but flexible enough to yield to "heavyweight" deployment and resource management such as Mesos.
Hope this helps - this could be an enormous post. I think the key is that these are young, evolving services that will likely merge and become interoperable, but we need to ride out the next 12 months to see how it plays out. There's some very clever people on the problem, so the future looks very bright.
As far as I understand it:
Mesos, Kubernetes and Fleet are all trying to solve a very similar problem. The idea is that you abstract away all your hardware from developers and the 'cluster management tool' sorts it all out for you. Then all you need to do is give a container to the cluster, give it some info (keep it running permanently, scale up if X happens etc) and the cluster manager will make it happen.
With Mesos, it does all the cluster management for you, but it doesn't include the scheduler. The scheduler is the bit that says, ok this process needs 2 procs and 512MB RAM, and I have a machine over there with that free, so I'll run it on that machine. There are some plugin schedulers available for Mesos: Marathon and Chronos and you can write your own. This gives you a lot of power of resource distribution and cluster scaling etc.
Fleet and Kubernetes seem to abstract away those sorts of details (so you don't have to write your own scheduler basically). This means you have to define your tasks and submit them in the format/manner defined by Fleet or Kubernetes and then they take over and schedule the tasks (containers) for you.
So I guess: Using Mesos may mean a bit more work in writing your own scheduler, but potentially provides more flexibility if required.
I think the idea of running Kubernetes on top of Mesos is that Kubernetes acts as the scheduler for Mesos. Personally I'm not sure what benefits this brings over running one or the other on its own though (hopefully someone will jump in and explain!)
As MikeB said.. it's early days, and it's all up for grabs (keep an eye on Amazon's ECS as well) so there are many competing standards and a lot of overlap!
-edit- I didn't mention Docker swarm as I don't really have much experience with it.
For anyone coming to this after 2017 fleet is deprecated. Do not use it anymore.
Fleet docs say "fleet is no longer actively developed or maintained by CoreOS" and link to Container orchestration: Moving from fleet to Kubernetes. Fleet was removed from Container Linux (formerly known as CoreOS Linux) and replaced with Kubernetes kubelet (agent). This coincided with a corporate pivot to offer Tectonic (a Kubernetes distro) as their primary product.

How Amazon ELB will distribute requests to Amazon EC2 instances of different instance types?

Anyone has any idea how the ELB will distribute requests if I register multiple EC2 instances of different sizes. Say one m1.medium, one m1.large and one m1.xlarge.
Will it be different if I register EC2 instances of same size? If so then how?
That's a fairly complicated topic, mostly due to the Amazon ELB routing documentation falling short of being non existent, so one needs to assemble some pieces to draw a conclusion - see my answer to the related question Can Elastic Load Balancers correctly distribute traffic to different size instances for a detailed analysis including all the references I'm aware of.
For the question at hand I think it boils down to the somewhat vague AWS team response from 2009 to ELB Strategy:
ELB loosely keeps track of how many requests (or connections in the
case of TCP) are outstanding at each instance. It does not monitor
resource usage (such as CPU or memory) at each instance. ELB
currently will round-robin amongst those instances that it believes
has the fewest outstanding requests. [emphasis mine]
Depending on your application architecture and request variety, larger Amazon EC2 instance types might be able to serve requests faster, thus have less outstanding requests and receive more traffic accordingly, but either way the ELB supposedly distributes traffic appropriately on average, i.e. should implicitly account for the uneven instance characteristics to some extent - I haven't tried this myself though and would recommend both, Monitoring Your Load Balancer Using CloudWatch as well as monitoring your individual EC2 instances and correlate the results in order to gain respective insight and confidence into such a setup eventually.
Hi I agree with Steffen Opel, also I met one of the solution architect of AWS recently, he gave couple of heads up on Elastic Load balancing to achieve better performance through ELB.
1) make sure you have equal number of instances running on all the availability zones. For example in case of ap-southeast, we have to availability zones 1a and 1b so make sure you have equal number of instances attached to the ELB from both the regions.
2) Make sure your application is stateless, that's the cloud enforces and suggests the developers.
3)Dont use sticky sessions.
5)Reduce the TTL (Time to live) to the maximum possible level, like 10 secs or something.
6) Unhealthy checks TTL should be minimum so that ELB doesn't retain the unhealthy instances.
7)If you are excepting a lot of traffic to your ELB make sure you do a load testing on ELB itself, it doesn't scale as fast as your ec2 instances.
8) If you are caching then think thousand times from which point you are picking the data to cache.
Above all tips is just to help you get this better. Its better you have a same size of instances.

Resources