Provision a simple stack in any of several clouds from a single configuration?

Suppose I want to provision a simple stack in any of several public clouds from a single configuration. Will any existing IaC tools do this?
For example:
a virtual network (VPC, VNET)
one small* Ubuntu instance on the latest Xenial image from Canonical with a public IP (EC2, VM, GCE)
one big* server on the latest Trusty image from Canonical
an internet gateway (IGW)
I can do this easily using Terraform or Ansible. But, per my current understanding, this would mean a separate Ansible playbook or Terraform configuration for each cloud environment (AWS, Azure, GCP).
Does a tool exist that would allow me to point to a single configuration and pass the cloud in which to provision the stack as an option?
i.e.
toolname create --config=my_simple_stack --provider=azure
or
toolname create --config=my_simple_stack --provider=gcp
Then if my_simple_stack configuration needed to change, the change could be made in one place rather than three.
* Sizes are ballpark, as I realize that available VM sizes are not necessarily consistent across providers. So small might be 2+ cores / 2+ GB RAM and big might be 16+ cores / 16+ GB RAM, depending on what the provider offers.
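For what it's worth, the closest approximation I can imagine is a thin wrapper that keeps the shared definition in one place and dispatches to per-provider Terraform configurations. A rough sketch only; the wrapper, directory layout and file names here are all invented:

#!/bin/sh
# create-stack.sh: hypothetical wrapper over per-provider Terraform configs.
# Usage: ./create-stack.sh my_simple_stack azure
CONFIG="$1"
PROVIDER="$2"   # aws | azure | gcp
cd "configs/${CONFIG}/${PROVIDER}" || exit 1
# Shared sizing/naming lives in one tfvars file; each provider directory
# maps it to provider-specific instance types, images and networks.
terraform apply -var-file="../common.tfvars"

This still means three provider directories, but the values that actually change (sizes, image families, counts) live in the one shared file.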

Thanks all. I suspect that there isn't tooling available to do this, for the reason ydaetskcoR gave: it isn't necessarily a good idea.
Containerization is the better approach rather than trying to build like-for-like least-common-denominator environments across a bunch of different cloud platforms.

What are the minimum machine specifications necessary for Admin and Container processes?

The reference material simply states that JDK7 is required for Spring XD.
What are the minimum requirements (RAM, CPU, Disk) for hosts meant to run Spring XD Admin?
What are the minimum requirements (RAM, CPU, Disk) for hosts meant to run Spring XD Containers?
The answer in both cases is that it depends on what you need them for. Spring XD appears to be designed for high-throughput computing (HTC), so unlike traditional high-performance computing, adding GPUs or coprocessors would probably not be particularly beneficial. If you just want to try it out and happen to have several servers lying around, anything powerful enough to run an OS that supports Java could probably at least make it work. If you are in the initial stages of testing Spring XD to see whether it will integrate with your existing infrastructure, that is enough to try it out. If you have passed that stage of testing, are confident that Spring XD will work, and would like to purchase hardware to optimize its performance, read on.
I have not used Spring XD before, but based on the documentation I have been reading and some experience with HTC, there are a few considerations when setting up systems to run it. If you take a look at the diagram from the docs and read a little about the services, it seems the Admin, Zookeeper, Analytics Repo and Batch Job DB could be hosted on virtual machines (VMs) under the hypervisor of your choice.
Running several of the subsystems required by the distributed model on VMs would give you the ability to scale resources as necessary, e.g. at first a single hypervisor may be sufficient to run everything, but as traffic and use grow it may be desirable to separate the VMs onto multiple hypervisors and give some of the VMs additional resources.
The containers seem to behave like many other virtualization or containerization schemes for HTC: more powerful systems (e.g. lots of RAM, SSD storage) allow users to run more containers on a single physical box.
To adequately assess the needs of a new system running any application, it is important to understand the limiting factor of the problem: is it memory-bound, IO-bound or CPU-bound? For large-scale parallel applications there are a variety of tools for profiling code and finding where bottlenecks occur. TAU is a common profiling utility in HPC, and there are several proprietary offerings available as well.
Once the limitations of the program are clear, speccing out a system with hardware to reduce or remove the bottleneck is a lot easier, and normally less expensive. Hopefully this information is helpful.
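As a small-scale illustration of that idea (using gprof, which ships with gcc, rather than TAU; the program name is a placeholder):

# Quick CPU profiling of a C program with gprof:
gcc -pg -O2 myapp.c -o myapp      # -pg adds profiling instrumentation
./myapp                           # running it writes gmon.out
gprof myapp gmon.out | head -40   # flat profile: where the time goes

If the profile shows time concentrated in compute kernels, the job is CPU-bound; if the program mostly waits, look at memory and IO instead.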
Additions based on comments:
It seems like it would run with 128k of memory if you have an OS that will boot and run Java and any other requirements. If there is a backend storage setup somewhere, such as a standalone DB server that can be used for the databases as described in the DB Config section of the guide, it seems only a small amount of local storage would be necessary.
Depending on how you deploy the OS images for the Admin node, even that may not be necessary: you could use KIWI to create and deploy a custom OS image of your choosing, with configuration files and other customizations embedded in the image. The image could be loaded over the network via PXE, or built in one of the other output formats KIWI supports, such as VM images, bootable USB and more.
The exact configuration of the systems running Spring XD will depend on the end goals, available infrastructure and a number of other things. It seems like the Spring XD Admin node could be run on most infrastructure servers. Factors such as reliability, stability and desired performance must also be considered when choosing hardware.
Q: Will Spring XD Admin run on a system with RaspberryPi like specs?
A: Based on the documentation, yes.
Q: Will it run with good performance or reliably on such a system?
A: Probably not if being used for extended periods of time or for large amounts of traffic.

Heroku-like deployment and environment configuration via EC2

I really like the approach of a 12factor app, which you are more or less forced into when you deploy an application to Heroku. For this question I'm particularly interested in setting environment variables for configuration, like one would do on Heroku.
As far as I can tell, there's no way to change the ENV for one or more instances from the EC2 console (though it seems to be possible to set 5 ENV vars when using Elastic Beanstalk). Therefore my next bet on an Ubuntu-based system would be to use /etc/environment, /etc/profile, ~/.profile or just the export command to set ENV variables.
Is this the correct approach or am I missing something?
And if so, is there a best practice for how to do it? I guess I could use something like Capistrano or Fabric, get a list of servers from the AWS API, connect to all of them and change the mentioned files or call export. Though 12factor is pretty well known, I couldn't find any blog post describing how to handle the ENV for a non-trivial number of instances on EC2. And I don't want to implement such a thing if somebody already did it very well and I just missed something.
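To make that concrete, here is a minimal sketch of what I have in mind (the tag name, key path and variable are made up; assumes the AWS CLI and SSH access to the instances):

#!/bin/sh
# Push an environment variable to every running instance tagged app=myapp.
HOSTS=$(aws ec2 describe-instances \
  --filters "Name=tag:app,Values=myapp" "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].PublicIpAddress" --output text)
for HOST in $HOSTS; do
  # /etc/environment is read at login; a re-login or service restart picks it up.
  ssh -i ~/.ssh/mykey.pem "ubuntu@${HOST}" \
    "echo 'MY_SETTING=some-value' | sudo tee -a /etc/environment"
done

But looping over SSH like this feels fragile, which is exactly why I'm asking for the established practice.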
Note: I want a solution without using Elastic Beanstalk, and I don't care about git push deployment or any other Heroku-like feature; this is solely about app configuration.
Any hints appreciated, thanks!
Good question. There are many ways you can approach your deployment/environment setup.
One thing to keep in mind is that with Heroku (or Elastic Beanstalk for that matter) you only push the code. Their service takes care of the scalability factor and replication of your services across their infrastructure (once you push the code).
If you are using Fabric (or Capistrano) you are using a push model too, but you have to take care of all the scalability/replication/fault tolerance of your application yourself.
Having said that, if you are using EC2, in my opinion it's better to leverage AMIs, Auto Scaling and CloudFormation for your deployments. This is the beauty of elasticity and virtualization: you can treat resources as ephemeral. You can still use Fabric/Capistrano to automate the AMI builds (I use Ansible) and configure environment variables, packages, etc. Then you can define a CloudFormation stack (with a JSON file) and in it add an auto-scaling group with your prebaked AMI.
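To illustrate that pattern with a hedged sketch: instead of pushing variables to running machines, you bake the write into the user data of the launch configuration, so every autoscaled instance configures itself at boot (the values shown are placeholders):

#!/bin/bash
# User-data sketch: runs once when each autoscaled instance first boots.
# Values could equally be pulled from S3 or an internal config service.
cat >> /etc/environment <<'EOF'
APP_ENV=production
DATABASE_URL=postgres://user:secret@db.internal:5432/app
EOF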
Another way of deploying your app is to simply use the AWS OpsWorks service. It's pretty comprehensive and has a lot of options, but it may not be for everybody, since some people may want a bit more flexibility.
If you want to go the 'pull' model you can use Puppet, Chef or CFEngine. In this case you have a master policy server somewhere in the cloud (Puppet master, Chef server or CFEngine policy server). When a server gets spun up, an agent (Puppet agent, Chef client, CFEngine agent) connects to its master to pick up its policy and then executes it. The policy may contain all the packages and environment variables that you need for your application to function. Again, it's a different model. This model scales pretty well, but it depends on how many agents the master can handle and how you stagger the connections from the agents to the master. You can load balance multiple masters if you want to scale to thousands of servers, or simply use multiple masters. From experience, if you want something really "fast" CFEngine works pretty well; there's a good blog post comparing the speed of Puppet and CFEngine here: http://www.blogcompiler.com/2012/09/30/scalability-of-cfengine-and-puppet-2/
You can also go completely "push" with tools like Fabric, Ansible or Capistrano. However, you are constrained by how many simultaneous connections a single server (or laptop) can hold open to the thousands of servers it's trying to push to. You are also constrained by network bandwidth, but you can get creative and stagger your push updates, and perhaps use multiple servers to push. Again, it works, and it's a different model, so it depends which direction you want to go.
Hope this helps.
If you don't need Beanstalk, you can look at AWS OpsWorks (http://aws.amazon.com/opsworks/). It's ideal for web-worker kinds of deployment scenarios. You can pass any variable from outside the code here (even Chef recipes).
It might be late, but here's what we are doing.
We have a Python script that takes env vars as JSON and sends them as POST data to another Python script that converts those vars to a YAML file.
After that we use a Jenkins multibranch pipeline (Groovy). Jenkins does the build, and then CodeDeploy copies those env vars to the EC2 instances running in the Auto Scaling group.
Of course, we do some manipulation from YAML to a plain text file so CodeDeploy can place it in /etc/environment.
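The conversion step itself is small; a rough equivalent using jq rather than Python (a sketch, not the exact scripts described above):

# Turn {"APP_ENV":"prod","PORT":"8080"} into APP_ENV=prod / PORT=8080 lines.
jq -r 'to_entries[] | "\(.key)=\(.value)"' env.json | sudo tee /etc/environment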

EC2 cluster Instances for offloading desktop-scale computing tasks

I'm using EC2 to offload some computing tasks from my desktop - basically running some jobs that would take hours or days on a desktop. Nothing particularly large scale, so I'm not looking to set up anything too complex; it should be able to run on a single instance running Ubuntu. I know this is stretching the use case of EC2 and there are better long-term solutions than using EC2 in this way, but I'll address that at a later point in time.
However, if I use standard, high-memory or high-CPU Ubuntu server instances, even the XL classes (e.g. m2.4xlarge) are fairly slow in terms of computing capability, and the cluster compute instances are probably more appropriate for my needs. However, I can't use the cluster compute instances unless I choose the "ubuntu server for cluster instances" images, which lack preinstalled libraries and software. I can install the packages piece by piece, but this seems like a roundabout way of doing something they're not intended for (I tried swapping an EBS volume from a regular server instance into a cluster instance, but the instance wouldn't boot).
Basically, the bottom line is that I would like to use the hardware of the cluster compute instances without the stripped-down OS, so I can run some single-instance jobs with minimal setup. What's the best way to go about this?
You can try using cloud-init to install your required packages at boot. Basically you write a shell script that is executed when the instance first boots.
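For example, a user-data script along these lines would be run by cloud-init on first boot (the package list is a placeholder for whatever the stripped-down cluster image is missing):

#!/bin/bash
# User-data sketch: cloud-init executes this on the instance's first boot.
apt-get update
apt-get install -y build-essential gfortran python-numpy   # placeholders
# Any other one-time setup (mounting volumes, fetching code) goes here.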
Did you look into bootstrapping? A CloudFormation template might be an answer.

Getting started with EC2 for a compute-intensive (non-web) parallel application

I'm using LIBSVM for regression analysis. Works like a champ. But a 3-parameter grid search to optimize parameters for the model maxes out all four cores on my 2.66 GHz Intel box, and I still have to wait a couple of hours to generate a single model.
This seems like a job for Amazon EC2.
I've seen plenty of tutorials and introductory material on using EC2 for web-related tasks.
But what if you have a small compute-intensive custom ANSI-C program that you want to run multiple instances of on EC2? Can anyone provide pointers on how to do that (or even just buzzwords to search for)?
I don't think your quest is too different from that of a web application. Your stack is different of course, but regardless – the principles remain the same.
As someone commented on your question: Elastic MapReduce might be what you're looking for to parallelize your work easily. If that is too limited, you could look into Cloudera, a ready-to-rumble Hadoop distribution with support for EC2 as well.
If map-reduce is not to your liking, then you need to set up your own instances. Roughly speaking, the key points are as follows:
You want to figure out a way to start EC2 instances.
You want to figure out a way to bootstrap and configure them.
Cluster/network?
Starting EC2 instances
If you don't require something like auto-scaling or a custom interface, the AWS Console does an extremely good job. You have to select an AMI (Amazon Machine Image) suitable for your project. I'd probably look into either the official AMI or something Ubuntu-based (if I remember correctly, Ubuntu is the most used Linux on EC2).
But that is up to you and your liking. (And I don't know enough about your project.)
Once you've figured out a setup that works for you, the easiest way to clone your work is to create your own AMI and start instances from it.
Bootstrapping
Bootstrapping can be done with what EC2 calls user data. It allows you to pass a shell script to the instance, which executes the calls needed to set up your stack. I'm not sure what's required in your case, so if you comment or extend your question, I can go into more detail here.
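In rough terms, a user-data script for a compute job might look like this (the repository URL, paths and program are invented for the sake of the example):

#!/bin/bash
# User-data sketch: bootstrap the instance, run one job, then stop paying.
apt-get update && apt-get install -y build-essential subversion
svn checkout http://example.com/svn/mysolver /opt/mysolver   # hypothetical repo
make -C /opt/mysolver
# One slice of the grid search; parameters could come from user data
# or a queue (e.g. SQS) in a fancier setup.
/opt/mysolver/run --params /opt/mysolver/params-01.conf
shutdown -h now   # shut down when done so the instance stops billing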
Cluster/Networking
This is a wild guess, since I'm not sure what your code does or how it works. If a cluster isn't strictly necessary, I'd probably scale up on a single instance first. You can get a lot of cores and RAM provisioned easily with EC2. Depending on whether your work needs more RAM or CPU, look into the high-CPU and high-memory instance types.
You can start off with a t1.micro, which you can currently even get for free, and go from there.
Let me know if this helps!

Managing server instance identity on EC2

I recently brought up a cluster on EC2, and I felt like I had to invent a lot of things. I'm wondering what kinds of tools, patterns, ideas are out there for how to deal with this.
Some context:
I had 3 different kinds of servers, so first I created AMIs for each of them. The first AMI had zookeeper, so step one in deploying the system was to get the zookeeper server running.
My script then made a note of the mapping between EC2's completely arbitrary and unpredictable hostnames, and the zookeeper server.
Then as I brought up new instances of the other 2 kinds of servers, the first thing I would do is ssh to the new server and add the zookeeper server to its /etc/hosts file. Then as the server software on each instance starts up, it can find zookeeper.
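(For reference, that manual step is easy to script; a sketch assuming a hypothetical Name=zookeeper tag and the AWS CLI on the new instance:)

#!/bin/sh
# Resolve the zookeeper instance's private IP and pin it in /etc/hosts.
ZK_IP=$(aws ec2 describe-instances \
  --filters "Name=tag:Name,Values=zookeeper" "Name=instance-state-name,Values=running" \
  --query "Reservations[0].Instances[0].PrivateIpAddress" --output text)
echo "${ZK_IP} zookeeper" | sudo tee -a /etc/hosts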
Obviously this is a problem that lots of people have to solve, and it probably works a little bit differently in different clouds.
Are there products that address this concept? I was pretty surprised that EC2 didn't provide some kind of way to tie your own name to its name.
Thanks for any ideas.
"How to do some service discovery on Amazon EC2" seems to have some good options.
I think you might want to look at http://puppetlabs.com/mcollective/introduction/ and the suite of tools from http://puppetlabs.com in general.
From the site:
The Marionette Collective AKA MCollective is a framework to build server orchestration or parallel job execution systems.
Primarily we’ll use it as a means of programmatic execution of Systems Administration actions on clusters of servers. In this regard we operate in the same space as tools like Func, Fabric or Capistrano.
I am fairly certain MCollective was built to solve exactly the problem you are trying to address. But be forewarned: it's not a DNS-based solution; it's a method of addressing arbitrarily large and arbitrarily tagged groups of hosts.
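As a taste of the model (mco is MCollective's command-line client; the fact filter here is invented):

mco ping                      # which nodes respond on the message bus?
mco find -W "role=zookeeper"  # discover nodes by fact/class, not by DNS name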
