EC2 cluster instances for offloading desktop-scale computing tasks

I'm using EC2 to offload some computing tasks from my desktop: jobs that would take hours or days on a desktop, nothing particularly large scale. I'm not looking to set up anything too complex, and it should be able to run on a single instance running Ubuntu. I know this is stretching the use case of EC2 and that there are better long-term solutions, but I'll address that at a later point in time.
However, if I use standard, high-memory, or high-CPU Ubuntu server instances, even the XL classes (e.g. m2.4xlarge) are fairly slow in terms of computing capability, and the cluster compute instances are probably more appropriate for my needs. The catch is that I can't use the cluster compute instances unless I choose the "ubuntu server for cluster instances" images, which lack many preinstalled libraries and software. I can install the packages piece by piece, but this seems like a roundabout way of using the images for something they're not intended for. (I tried swapping an EBS volume from a regular server instance into a cluster instance, but the instance wouldn't boot when I did that.)
Basically the bottom line is I would like to use the hardware of their cluster compute instances but not use the stripped down OS so I can run some single instance jobs with a minimal setup. What's the best way to go about this?

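You can try using cloud-init to install your required packages at boot. Basically, you write a shell script (or a cloud-config file), pass it as user data, and it is executed when the instance first boots. By default, user-data scripts run once per instance rather than on every restart.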
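As a minimal sketch of that idea, the following Python helper builds a cloud-config user-data document that asks cloud-init to install a list of packages and run a few commands at first boot. The helper name and the package names are hypothetical placeholders, not something from the question; `package_update`, `packages`, and `runcmd` are standard cloud-config keys.

```python
import json


def build_cloud_config(packages, run_commands=()):
    """Return a #cloud-config user-data document as a string.

    packages: apt package names to install at first boot.
    run_commands: shell commands to run once the packages are installed.
    """
    lines = ["#cloud-config", "package_update: true", "packages:"]
    lines += [f"  - {name}" for name in packages]
    if run_commands:
        lines.append("runcmd:")
        # json.dumps yields a double-quoted string, which is also valid YAML.
        lines += [f"  - {json.dumps(cmd)}" for cmd in run_commands]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical package list; replace with whatever your jobs need.
    print(build_cloud_config(
        ["build-essential", "gfortran", "libatlas-base-dev"],
        ["echo bootstrap finished > /var/log/bootstrap.done"],
    ))
```

The printed text can be pasted into the user-data field when launching the cluster compute instance, so the stripped-down image is brought up to your usual package set without manual installation.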

Did you look into bootstrapping? A CloudFormation template might be an answer.

Related

Choosing Hadoop solution for Big Data project - Pricing Options

I have to use Hadoop for my research work and I am trying to decide on the best option to start with. So far I have ended up going with Cloudera. I've downloaded the QuickStart VM and started working through different tutorials.
The issue is that my system struggles to run it and performs very slowly, and I think it might just stop working once I feed it all the data and run other services.
I was advised to go for a cloud service with a 4-node cluster. Can someone please help me by suggesting the best option and an estimated price to consider? A 1-year plan might be enough to complete my research.
Thanks.
If you are a Linux user, just download the individual components (HDFS, MR1, YARN, HBase, Hive, etc.) from the Cloudera archives instead of loading the Cloudera QuickStart VM.
If you want to try a 4-node cluster, the easiest option is to use the cloud.
There are plenty of cloud providers. I have personally used AWS, Google Cloud, Microsoft Azure, and IBM SmartCloud. Of these, AWS is the best to start with.
It is pay-as-you-go. I recommend a decent EC2 setup: 4 x m3.large machines.
Type: m3.large, CPU: 2, RAM: 7.5 GB, Storage: 1 x 32 GB SSD, Price: $0.133 per hour (see AWS Pricing)
If you plan to do the research for one year, I recommend you go for VPC.
Cons of plain AWS EC2:
If you launch a machine in EC2, its IP address and hostname change the moment you stop and start it.
In AWS VPC, the private IP address and hostname remain the same.
If you run 4 machines 24x7 for one month, it costs you about $389.44.
You can calculate the AWS cost yourself.
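The arithmetic behind that estimate can be sketched in a few lines of Python. The 732 hours per month (30.5 days x 24 hours) is an assumption about how the monthly figure was derived, not something stated above; with the quoted $0.133 hourly rate it gives about $389.42, which is within a couple of cents of the number quoted.

```python
# Assumption: a "month" is 30.5 days of continuous running.
HOURS_PER_MONTH = 30.5 * 24  # 732 hours


def monthly_cost(instances, hourly_rate, hours=HOURS_PER_MONTH):
    """On-demand cost in dollars for running `instances` machines for `hours`."""
    return instances * hourly_rate * hours


if __name__ == "__main__":
    per_month = monthly_cost(instances=4, hourly_rate=0.133)
    print(f"4 x m3.large for one month: ${per_month:.2f}")
    print(f"Same cluster for twelve months: ${12 * per_month:.2f}")
```

Because the cost scales linearly with hours, shutting the cluster down when you are not working on the research cuts the bill proportionally.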
As best as I can see you have two paths:
Set up Hadoop in a cloud service provider (e.g. Amazon's EC2 or Red Hat's OpenShift).
Use Hadoop-as-a-service (e.g. Amazon's EMR or Microsoft's HDInsight).
The first path, setting up Hadoop in a cloud service provider will require you to become a semi-competent Hadoop administrator. If that's your goal, great! However you'll spend a great deal of time learning the necessary skills and mindset to become that. I don't suspect that that is your goal.
The second path is the one I'd recommend out of these two. Using Hadoop-as-a-service you will get up and running faster, but it will cost more up front and on an ongoing (per-hour) basis. You'll still probably save money because you'll be spending less time troubleshooting your Hadoop cluster and more time doing the work you wanted to do in the first place.
I have to wonder, if you can even fit your dataset on your laptop, why are you using big data tools in the first place? True, they'll work. However Big Data is at least partially defined as data sets and computational problems that just don't fit on a single machine.

About online distributed environment

I am learning MapReduce and Hadoop now. I know I can do some tests and run some samples on a single node, but I really want to practice on a real distributed environment. So I want to ask:
Is there a website that offers a distributed environment where I can run some experiments?
Somebody told me that I can use Amazon Web Services to build a distributed environment. Is that true? Does anyone have experience with this?
I would also like to know how you learned Hadoop before you used it in your work.
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon has a service called Elastic MapReduce (EMR), which is essentially Hadoop running on AWS. You write your Hadoop job, decide how many machines you want in your cluster and which type of machines you want, and then run it. EMR does everything else: it bootstraps the machines for you, runs your job, and stores the results on S3. I would recommend looking at this tutorial to get an idea of how to set up a job on EMR. Just remember that EMR is not free, so you'll have to pay for your computing resources.
Alternatively if you're not looking to pay the cost of EMR, you could always setup Hadoop on your local machine in non-distributed mode, and experiment with it, as described here. Even if it's a single node setup, the abstractions will be the same as if you were using a big cluster, so it's a good way to get up to speed and then go on EMR or a real cluster when you want to get serious.
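To get a feel for the Map/Reduce paradigm before touching any of these options, here is a minimal single-process sketch of a word count in plain Python. It is not Hadoop or JSMapReduce code; the function names are made up for illustration. It only mirrors the three conceptual stages (map, shuffle, reduce) that a real cluster would distribute across machines.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the list of values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

On a real cluster the map calls run in parallel on different input splits and the shuffle moves data between nodes, but the functions you write keep this same shape.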
Amazon offers a free tier, so you can spin up some VMs and try experimenting that way. The micro instances they have aren't very powerful, but they are fine for small-scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMware Player. You can install any flavor of Linux you like for free; Ubuntu is pretty easy to start with. When you set up the networking for your VMs, be sure to use bridged networking. That way each VM will get its own IP address on your network so they can communicate with each other.
Well, this is maybe not '100% online', but it should be a really good alternative, so here are some details.
If you are not ready to pay for online cluster resources (such as the EMR solution mentioned here), you don't want to build your own physical cluster, and you are not satisfied with a single-node setup, you can try building a virtual cluster on a powerful enough desktop.
You need a minimum of 3 VMs (I prefer Ubuntu); 4 is better. To see Hadoop behave realistically you need a replication factor of at least 3, so you need 3 DataNodes and 3 TaskTrackers. You also need a NameNode/JobTracker. It could share one of the DataNode machines, but I'd recommend a separate VM. If you need HBase, for example, you again need one Master and a minimum of 3 RegionServers. So, again, you need 3 VMs, but 4 is better.
There is a pretty good free product, Cloudera CDH, which is a 'somewhat commercial' Hadoop distribution. It also comes with a manager that has a GUI and simplified installation. By the way, they even have prepared demo VMs, but I have never used them. You can download everything here. They also host a lot of material about Hadoop and their environment.
An alternative between the completely free solution with VMs on a desktop and a paid service like EMR is a virtual cluster built on top of one dedicated server, if you have a spare. This is what I personally did: one physical server running VMware's free hypervisor, 4 virtual machines, 1 SSD for the OS, and 3 'general' HDDs for storage. Every VM runs Ubuntu 11.04 (again free), with the free edition of Cloudera Manager and CDH. So everything is free, but you need some hardware, which is often available as a spare. And then you have a playground. OK, you need to invest time, but in my opinion you will get the best experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

getting started with EC2 for compute-intensive (non-web) parallel application

I'm using LIBSVM for regression analysis. Works like a champ. But a 3-parameter grid search to optimize parameters for the model maxes out all four cores on my 2.66 GHz Intel box, and I still have to wait a couple of hours to generate a single model.
This seems like a job for Amazon EC2.
I've seen plenty of tutorials and introductory material on using EC2 for web-related tasks.
But what if you have a small compute-intensive custom ANSI-C program that you want to run multiple instances of on EC2? Can anyone provide pointers on how to do that (or even just buzzwords to search for)?
I don't think your quest is too different from that of a web application. Your stack is different of course, but regardless – the principles remain the same.
As someone commented on your question: Elastic MapReduce might be what you're looking for to parallelize your work easily. If that is too limited, you could look into Cloudera, a ready-to-rumble Hadoop distribution with support for EC2 as well.
If map-reduce is not to your liking, then you need to set up your own instance. Roughly speaking, the key points are as follows:
You want to figure out a way to start EC2 instances.
You want to figure out a way to bootstrap and configure them.
Cluster/network?
Starting EC2 instances
If you don't require something like auto-scaling or a custom interface, the AWS Console does an extremely good job. You have to select an AMI (Amazon Machine Image) suitable for your project. I'd probably look into either the official AMI or something Ubuntu-based (If I remember correctly, Ubuntu is the most used Linux on EC2).
But that is up to you and your liking. (And I don't know enough about your project.)
Once you have figured out a setup that works for you, the easiest way to clone your work is to create your own AMI and start instances from it.
Bootstrapping
Bootstrapping can be done using what EC2 calls user data. It allows you to pass a shell script to the instance, which then executes the calls needed to set up your stack. I'm not sure what is required in this case, so if you comment or extend your question, I could go into more detail here.
Cluster/Networking
This is a wild guess since I'm not sure what your code does or how it works. If a cluster is not necessary, I'd probably scale this up on a single instance first. You can get a lot of cores and RAM provisioned easily with EC2. Depending on whether your work requires more RAM or more CPU, look into the high-memory and high-CPU instance types.
You can start off with a t1.micro, which you can currently get for free even and go from there.
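Since a grid search is a set of independent evaluations, one way to spread it over several worker processes on one large instance, or over several instances, is to enumerate the grid and deal the combinations out round-robin. The sketch below shows only that partitioning step. The parameter names (`cost`, `gamma`, `epsilon`) and their values are assumptions about what a 3-parameter LIBSVM regression search might look like, and actually launching the ANSI-C program on each chunk is left out.

```python
from itertools import product


def parameter_grid(**axes):
    """Return every combination of the given parameter values as a dict."""
    names = list(axes)
    return [dict(zip(names, combo)) for combo in product(*axes.values())]


def split_round_robin(items, n_workers):
    """Deal items out to n_workers lists so each gets a near-equal share."""
    return [items[i::n_workers] for i in range(n_workers)]


if __name__ == "__main__":
    # Hypothetical search ranges; substitute the values you actually sweep.
    grid = parameter_grid(
        cost=[1, 10, 100],
        gamma=[0.01, 0.1, 1.0],
        epsilon=[0.01, 0.1],
    )
    for worker_id, chunk in enumerate(split_round_robin(grid, n_workers=4)):
        print(f"worker {worker_id}: {len(chunk)} parameter sets")
```

Each chunk can then be written to a file and passed to one copy of the program, so no communication between workers is needed until the results are compared at the end.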
Let me know if this helps!

how to monitor a linux system on amazon ec2 without cloudwatch?

I would like to monitor the following on Amazon EC2 instances running Amazon Linux, every X minutes:
disk statistics
process stats (similar to what top does)
ram usage
check if my scripts are running fine
Should I use my own scripts, or are there tools that already do this?
I searched and found a suggestion to use Munin.
Which seems to be the better approach?
Here is a great article on scale. About halfway down, the author lists monitoring tools and how they differ.
http://highscalability.com/blog/2010/8/16/scaling-an-aws-infrastructure-tools-and-patterns.html
We've implemented Cacti. It was very easy to set up, and it creates all sorts of graphs and reports (covering most of what you mentioned). Munin is one that he lists as well, but we have not tried that solution yet.
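If you go the "own scripts" route instead, a minimal sketch of the disk and RAM checks could look like the following. It uses only the Python standard library and reads `/proc/meminfo`, so it assumes a Linux instance; the function names are made up for illustration, and running it every X minutes (for example from cron) is an assumption about how you would schedule it.

```python
import shutil


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a {field: value} dict (mostly kB)."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            info[name.strip()] = int(fields[0])
    return info


def disk_usage_percent(path="/"):
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def snapshot(meminfo_path="/proc/meminfo"):
    """Collect one sample of disk and memory statistics."""
    with open(meminfo_path) as handle:
        mem = parse_meminfo(handle.read())
    return {
        "disk_used_pct": round(disk_usage_percent("/"), 1),
        "mem_total_kb": mem.get("MemTotal"),
        "mem_free_kb": mem.get("MemFree"),
    }


if __name__ == "__main__":
    print(snapshot())
```

This only covers two of the four items in the question. Process statistics and script health checks need more work, which is where an existing tool such as Cacti or Munin saves time.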

How to use a "Rocks" cluster

I've just joined a research lab at my university and have been given access to a cluster to compile and run the C++ code that I write. I use SSH to access it and simply use the cluster like a Linux terminal.
I often have to wait a relatively long time while my code runs. I'm trying to figure out if there's a more efficient way to use the cluster. For example, there are different CPUs/nodes in the cluster, some of which are more heavily used than others. How do I access a specific CPU? I have access to the "Ganglia" overview page, which gives information about the different nodes.
Also, if I run 2 processes in different SSH windows, will they automatically use different processors or nodes, or do I have to specify that manually?
I couldn't find any documentation to help me with these issues, so I'd appreciate a little help.
Thanks
Simply running something on a cluster does not mean it is taking advantage of the cluster at all. By default, it will probably just run on the head node. Software needs to be written specifically for a cluster.
There is likely to be some kind of scheduler running that you need to interface with. Perhaps you could also see if distcc is installed and configured for your particular cluster (for doing the compilation across multiple machines). There may also be a particular flavour of MPI running to allow processes on different nodes to communicate.
Cluster software setups tend to be very specialised to the hardware and computing environment. Really, I would recommend that you put these kinds of questions to someone who has used the machine before, because any advice you receive here is unlikely to be completely accurate for your particular cluster.
