Managing server instance identity on EC2 - amazon-ec2

I recently brought up a cluster on EC2, and I felt like I had to invent a lot of things. I'm wondering what kinds of tools, patterns, ideas are out there for how to deal with this.
Some context:
I had 3 different kinds of servers, so first I created AMIs for each of them. The first AMI had zookeeper, so step one in deploying the system was to get the zookeeper server running.
My script then made a note of the mapping between EC2's completely arbitrary and unpredictable hostnames, and the zookeeper server.
Then as I brought up new instances of the other 2 kinds of servers, the first thing I would do is ssh to the new server, and add the zookeeper server to its /etc/hosts file. Then as the server software on each instance starts up, it can find zookeeper.
Obviously this is a problem that lots of people have to solve, and it probably works a little bit differently in different clouds.
Are there products that address this concept? I was pretty surprised that EC2 didn't provide some kind of way to tie your own name to its name.
Thanks for any ideas.

How to do some service discovery on Amazon EC2 seems to have some good options.

I think you might want to look at http://puppetlabs.com/mcollective/introduction/ and the suite of tools from http://puppetlabs.com in general.
From the site:
The Marionette Collective AKA MCollective is a framework to build server orchestration or parallel job execution systems.
Primarily we’ll use it as a means of programmatic execution of Systems Administration actions on clusters of servers. In this regard we operate in the same space as tools like Func, Fabric or Capistrano.
I am fairly certain mcollective was built to solve exactly the problem you are trying to address. But, be forewarned, it's not a DNS-based solution, it's a method of addressing arbitrarily large and arbitrarily tagged groups of hosts.

Related

Clustering Microservice Components

We have a set of Microservices collaborating with each other in the eco system. We used to have occasional problems where one or more of these Microservices would go down accidentally. Thankfully, we have some monitoring built around which would realize this and take corrective action.
Now, we would like to have redundancy built around each of those Microservices. I'm thinking more like a master / slave approach where a slave is always on stand by and when the master goes off, the slave picks it up.
Should we consider using any framework that we could use as service registry, where we register each of those Microservices and allow them to be controlled? Any other suggestions on how to achieve the kind of master / slave architecture with the Microservices that would enable us to have failover redundancy?
I thought about this for a couple of minutes and this is what I currently think is the best method, based on experience.
There are a couple of problems you will face with availability. First is always having at least one endpoint up. This is easy enough to do by installing on multiple servers. In the enterprise space, you would use a name for the endpoint and then have it resolve to multiple servers (virtual or hardware). You would also load balance it.
The second is registry. This is a very easy problem with API management software. The really good software in this space is not cheap, so this is not a weekend hobbyist type of software. But there are open source API Management solutions out there. As I work in the Enterprise space, I am very familiar with options like Apigee, CA, Mashery, etc. so I cannot recommend an open source option and feel good about myself.
You could build your own registry, if you desire. Just be careful how you design it, as a "registry of all interface points" leads to a service that becomes more tightly coupled.

Marathon vs Aurora and their purposes

Both Marathon and Aurora are built on Mesos and supposedly are engineered for running long running services. My questions are:
What are their differences? I have struggled in finding any good explanations regarding their key differences
Do these frameworks run anything that runs on Linux? For Marathon they state that it can run anything that "is executable in a shell" but this is sort of vague :)
Thanks!
Disclaimer: I am the VP of Apache Aurora, and have been the tech lead of the Aurora team at Twitter for ~5 years. My likely-biased opinions are my own and do not necessarily represent those of Twitter or the ASF.
Do these frameworks run anything that runs on Linux? For Marathon they
state that it can run anything that "is executable in a shell" but
this is sort of vague :)
Essentially, yes. Ultimately these systems are sophisticated machinery to execute shell code somewhere in a cluster :-)
What are their differences? I have struggled in finding any good
explanations regarding their key differences
Aurora and Marathon do indeed offer similar feature sets, both being classified as "service schedulers". In other words, you hand us instructions for how to run your application servers, and we do our best to keep them up.
I'll offer some differences in broad strokes. When it comes to shortcomings mentioned in each, I think it's safe to say that the communities are aware and intend to fix them.
Ease of use
Aurora is not easy to install. It will likely feel like you are trailblazing while setting it up. It exposes a thrift API, which means you'll need a thrift client to interact with it programmatically (a REST-like API is coming, but is vaporware at the moment), or use our command line client. Aurora has a DSL for configuration which can be daunting, but allows you to easily share templates and common patterns as you use the system more.
Marathon, on the other hand, helps you to run 'Hello World' as quickly as possible. It has great docs to do this in many environments and there's little overhead to get going. It has a REST API, making it easier to adapt to custom tools. It uses JSON for configuration, which is easy to start with but more prone to cargo culting.
Targeted use cases
Aurora has always been designed to handle a large engineering organization. The clusters at Twitter have tens of thousands of machines and hundreds of engineers using them. It is critical to Twitter's business. As a result, we take our requirements of scale, stability, and security very seriously. We make sure to only condone features that we believe are trustworthy at scale in production (for example, we have our Docker support labeled as beta because of known issues with Docker itself and the Mesos-Docker integration). We also have features like preemption that make our clusters suitable for mixing business-critical services with prototypes and experiments.
I can't make any claim for or against Marathon's scalability. On the feature front, Marathon has build out features quickly, but this can feel bleeding edge in practice (Docker support is a good example). This is not always due to Marathon itself, but also layers down the stack. Marathon does not provide preemption.
Ownership
To some, ownership and governance of a project is important. It feel that in practice it does not define the openness of a project, but for some people/companies the legal fine print can be a deal-breaker.
Marathon is owned by a company (Mesosphere)
To some, this is beneficial, to others is is not. It means that you can pay for support and features. It also means that there is something to be sold, and the project direction is ultimately decided by Mesosphere's interests.
Aurora is owned by the Apache Software Foundation
This means it is subject to the governance model of the ASF, driven by the community. Aurora does not have paying customers, and there is not currently a software shop that you can pay for development.
tl;dr If you are just getting your feet wet with running services on Mesos, I would suggest Marathon as your first port of call. It will be easier for you to get running and poke around the ecosystem. If you are forming the 'private cloud strategy' for a company, I suggest seriously considering Aurora, as it is proven and specifically designed for that.
So I've been evaluating both and this is my summary.
Aurora
[+] also handles recurring jobs
[+] finer grained, extensive file-based configuration
[+] has namespaces so multiple environments can co-exist
[-] read-only UI, no official API
[~] file based configuration and cli based execution brings overhead (which can be justified with more extensive feature set)
Marathon
[+] very easy to setup and use
[+] UI that provides control and extensive API (even with features missing from UI at the moment)
[+] event bus to listen in on api calls
[-] handles only long-running jobs
[-] does not have separate deployment-run-cleanup steps, these if necessary need to be combined in a script of one-liner
Even though Aurora has better capabilities, I prefer Marathon due to Auroras complexity/overhead and lack of UI (for control) & API
I have more experience with Marathon.
Ideological:
Marathon is a relatively tested product that is used in production at AirBnB. Aurora is an early Apache project (so YMMV).
Both are open source and active. Feel free to contribute pull requests or file issues!
Technical:
Marathon doesn't schedule batch tasks or cron jobs
Marathon has a friendly UI and better health indicators (in 0.8.x)
In regards to your second question, you can run any command or docker container, and Mesos will do the resource isolation for you. If you have 50% CentOS nodes and 50% Ubuntu nodes and you run a task that executes apt-get, the task will have a 50% chance of failure. Mesos and Marathon have no awareness of the actual machines.
Disclaimer: I don't have hands-on experience with Aurora, only with Marathon.
ad Q1: In a nutshell Apache Aurora is capable of doing what Marathon + Chronos can provide, that is, schedule both long-running services and recurring (batch) jobs; see also Aurora user guide.
ad Q2: Yes, anything. Currently based on cgroups and Docker but hey, you can roll your own.

Docker-Swarm, Kubernetes, Mesos & Core-OS Fleet

I am relatively new to all these, but I'm having troubles getting a clear picture among the listed technologies.
Though, all of these try to solve different problems, but do have things in common too. I would like to understand what are the things that are common and what is different. It is likely that the combination of few would be great fit, if so what are they?
I am listing a few of them along with questions, but it would be great if someone lists all of them in detail and answers the questions.
Kubernetes vs Mesos:
This link
What's the difference between Apache's Mesos and Google's Kubernetes
provides a good insight into the differences, but I'm unable to understand as to why Kubernetes should run on top of Mesos. Is it more to do with coming together of two opensource solutions?
Kubernetes vs Core-OS Fleet:
If I use kubernetes, is fleet required?
How does Docker-Swarm fit into all the above?
Disclosure: I'm a lead engineer on Kubernetes
I think that Mesos and Kubernetes are largely aimed at solving similar problems of running clustered applications, they have different histories and different approaches to solving the problem.
Mesos focuses its energy on very generic scheduling, and plugging in multiple different schedulers. This means that it enables systems like Hadoop and Marathon to co-exist in the same scheduling environment. Mesos is less focused on running containers. Mesos existed prior to widespread interest in containers and has been re-factored in parts to support containers.
In contrast, Kubernetes was designed from the ground up to be an environment for building distributed applications from containers. It includes primitives for replication and service discovery as core primitives, where-as such things are added via frameworks in Mesos. The primary goal of Kubernetes is a system for building, running and managing distributed systems.
Fleet is a lower-level task distributor. It is useful for bootstrapping a cluster system, for example CoreOS uses it to distribute the kubernetes agents and binaries out to the machines in a cluster in order to turn-up a kubernetes cluster. It is not really intended to solve the same distributed application development problems, think of it more like systemd/init.d/upstart for your cluster. It's not required if you run kubernetes, you can use other tools (e.g. Salt, Puppet, Ansible, Chef, ...) to accomplish the same binary distribution.
Swarm is an effort by Docker to extend the existing Docker API to make a cluster of machines look like a single Docker API. Fundamentally, our experience at Google and elsewhere indicates that the node API is insufficient for a cluster API. You can see a bunch of discussion on this here: https://github.com/docker/docker/pull/8859 and here: https://github.com/docker/docker/issues/8781
Join us on IRC # #google-containers if you want to talk more.
I think the simplest answer is that there is no simple answer. The swift rise to power of containers, and Docker in particular has left a power vacuum for "container scheduling and orchestration", whatever that might mean. In reality, that means you have a number of technologies that can work in harmony on some levels, but with certain aspects in competition. For example, Kubernetes can be used as a one stop shop for deploying and managing containers on a compute cluster (as Google originally designed it), but could also sit atop Fleet, making use of the resilience tier that Fleet provides on CoreOS.
As this Google vid states Kubernetes is not a complete out the box container scaling solution, but is a good statement to start from. In the same way, you would at some stage expect Apache Mesos to be able to work with Kubernetes, but not with Marathon, in as much as Marathon appears to fulfil the same role as Kubernetes. Somewhere I think I've read these could become part of the same effort, but I could be wrong about that - it's really about the strategic direction of Mesosphere and the corresponding adoption of Kubernetes principles.
In the DockerCon keynote, Solomon Hykes suggested Swarm would be a tier that could provide a common interface onto the many orchestration and scheduling frameworks. From what I can see, Swarm is designed to provide a smooth Docker deployment workflow, working with some existing container workflow frameworks such as Deis, but flexible enough to yield to "heavyweight" deployment and resource management such as Mesos.
Hope this helps - this could be an enormous post. I think the key is that these are young, evolving services that will likely merge and become interoperable, but we need to ride out the next 12 months to see how it plays out. There's some very clever people on the problem, so the future looks very bright.
As far as I understand it:
Mesos, Kubernetes and Fleet are all trying to solve a very similar problem. The idea is that you abstract away all your hardware from developers and the 'cluster management tool' sorts it all out for you. Then all you need to do is give a container to the cluster, give it some info (keep it running permanently, scale up if X happens etc) and the cluster manager will make it happen.
With Mesos, it does all the cluster management for you, but it doesn't include the scheduler. The scheduler is the bit that says, ok this process needs 2 procs and 512MB RAM, and I have a machine over there with that free, so I'll run it on that machine. There are some plugin schedulers available for Mesos: Marathon and Chronos and you can write your own. This gives you a lot of power of resource distribution and cluster scaling etc.
Fleet and Kubernetes seem to abstract away those sorts of details (so you don't have to write your own scheduler basically). This means you have to define your tasks and submit them in the format/manner defined by Fleet or Kubernetes and then they take over and schedule the tasks (containers) for you.
So I guess: Using Mesos may mean a bit more work in writing your own scheduler, but potentially provides more flexibility if required.
I think the idea of running Kubernetes on top of Mesos is that Kubernetes acts as the scheduler for Mesos. Personally I'm not sure what benefits this brings over running one or the other on its own though (hopefully someone will jump in and explain!)
As MikeB said.. it's early days, and it's all up for grabs (keep an eye on Amazon's ECS as well) so there are many competing standards and a lot of overlap!
-edit- I didn't mention Docker swarm as I don't really have much experience with it.
For anyone coming to this after 2017 fleet is deprecated. Do not use it anymore.
Fleet docs say "fleet is no longer actively developed or maintained by CoreOS" and link to Container orchestration: Moving from fleet to Kubernetes. Fleet was removed from Container Linux (formerly known as CoreOS Linux) and replaced with Kubernetes kubelet (agent). This coincided with a corporate pivot to offer Tectonic (a Kubernetes distro) as their primary product.

Basic AWS questions

I'm newbie on AWS, and it has so many products (EC2, Load Balancer, EBS, S3, SimpleDB etc.), and so many docs, that I can't figure out where I must start from.
My goal is to be ready for scalability.
Suppose I want to set up a simple webserver, which access a database in mongolab. I suppose I need one EC2 instance to run it. At this point, do I need something more (EBS, S3, etc.)?
At some point of time, my app has reached enough traffic and I must scale it. I was thinking of starting a new copy (instance) of my EC2 machine. But then it will have another IP. So, how traffic is distributed between both EC2 instances? Is that did automatically? Must I hire a Load Balancer service to distribute the traffic? And then will I have to pay for 2 EC2 instances and 1 LB? At this point, do I need something more (e.g.: Elastic IP)?
Welcome to the club Sony Santos,
AWS is a very powerfull architecture, but with this power comes responsibility. I and presumably many others have learned the hard way building applications using AWS's services.
You ask, where do I start? This is actually a very good question, but you probably won't like my answer. You need to read and do research about all the technologies offered by amazon and even other providers such as Rackspace, GoGrid, Google's Cloud and Azure. Amazon is not easy to get going but its not meant to be really, its focus is more about being very customizable and have a very extensive api. But lets get back to your question.
To run a simple webserver you would need to start an EC2 instance this instance by default runs on a diskdrive called EBS. Essentially an EBS drive is a normal harddrive except that you can do lots of other cool stuff with it like take it off one server and move it to another. S3 is really more of a file storage system its more useful if you have a bunch of images or if you want to store a lot of backups of your databases etc, but its not a requirement for a simple webserver. Just running an EC2 instance is all you need, everything else will happen behind the scenes.
If you app reaches a lot of traffic you have two options. You can scale your machine up by shutting it off and starting it with a larger instance. Generally speaking this is the easiest thing to do, but you'll get to a point where you either cannot handle all the traffic with 1 instance even at the larger size and you'll decide you need two OR you'll want a more fault tolerant application that will still be online in the event of a failure or update.
If you create a second instance you will need to do some form of loadbalancing. I recommend using amazons Elastic Load Balancer as its easy to configure and its integration with the cloud is better than using Round Robin DNS or a application like haproxy. Elastic Load Balancers are not expensive, I believe they cost around $18 / month + data that's passed between the loadbalancer.
But no, you don't need anything else to do scale up your site. 2 EC2 instances and a ELB will do the trick.
Additional questions you didn't ask but probably should have.
How often does an EC2 instance experience hardware failure and crash my server. What can I do if this happens?
It happens frequently, usually in batches. Sometimes I go months without any problems then I will get a few servers crash at a time. But its defiantly something you should plan for I didn't in the beginning and I paid for it. Make sure you create scripts and have backups and a backup plan ready incase your server fails. Be ok with it being down or have a load balanced solution from day 1.
Whats the hardest part about scalabilty?
Testing testing testing testing... Don't ever assume anything. Also be prepared for sudden spikes in your traffic. You have to be prepared for anything if you page goes from 1 to 1000 people over night are you prepared to handle it? Have you tested what you "think" will happen?
Best of luck and have fun... I know I have :)

Amazon EC2 consideration - redundancy and elastic IPs

I've been tasked with determining if Amazon EC2 is something we should move our ecommerce site to. We currently use Amazon S3 for a lot of images and files. The cost would go up by about $20/mo for our host costs, but we could sell our server for a few thousand dollars. This all came up because right now there are no procedures in place if something happened to our server.
How reliable is Amazon EC2? Is the redundancy good, I don't see anything about this in the FAQ and it's a problem on our current system I'm looking to solve.
Are elastic IPs beneficial? It sounds like you could point DNS to that IP and then on Amazon's end, reroute that IP address to any EC2 instance so you could easily get another instance up and running if the first one failed.
I'm aware of scalability, it's the redundancy and reliability that I'm asking about.
At work, I've had something like 20-40 instances running at all times for over a year. I think we've had 1-3 alert emails come from amazon suggesting that we terminate and boot another instance (presumably because they are detecting possible failure in the underlying hardware). We've never had an instance go down suddenly, which seems rather good.
Elastic IP's are amazing and are part of the solution. The other part is being able to rapidly bring up new instances. I've learned that you shouldn't care about instances going down, that it's more important to use proper load balancing and be able to bring up commodity instances quickly.
Yes, it's very good. If you aren't able to put together a concurrent redundancy (where you have multiple servers fulfilling requests simultaneously), using the elastic IP to quickly redirect to another EC2 instance would be a way to minimize downtime.
Yeah I think moving from inhouse server to Amazon will definitely make a lot of sense economically. EBS backed instances ensure that even if the machine gets rebooted, the transient memory is not lost. And if you have a clear separation between your application and data layer and can have them on different machines, then you can build even better redundancy for your data.
For ex, if you use mysql, then you can consider using Amazon RDS service - which gives you a highly available and reliable MySQL instance, fully managed (patches and all). The application layer then can be made more resilient by having more smaller instances rather than one larger instance, through load balancing.
The cost you will save on is really hardware maintenance and the cost you would have to incur to build in disaster recovery.

Resources