How to use a "Rocks" cluster - cluster-computing

I've just joined a research lab at my University and been given access to a Cluster to compile and run the c++ code that I write. I use SSH to access it and simply use the cluster like a Linux terminal.
I often have to wait a relatively long time while my code runs. I'm trying to figure out if there's a more efficient way to use the cluster. For example, there are different CPUs/nodes in the cluster, some of which are more in use and others less in use. How do I access a specific CPU? I have access to the "Ganglia" overview page, which gives information about the different nodes.
Also, if I run two processes in different SSH windows, will they automatically use different processors or nodes, or do I have to specify that manually?
I couldn't find any documentation to help me with these issues, so I'd appreciate a little help.
Thanks

Simply running something on a cluster does not mean it is taking advantage of the cluster at all. By default, it will probably just run on the head node. Software needs to be written specifically for a cluster.
There is likely to be some kind of scheduler running that you need to interface with. Perhaps you could also see if distcc is installed and configured for your particular cluster (for doing the compilation across multiple machines). There may also be a particular flavour of MPI running to allow processes on different nodes to communicate.
Cluster software setups tend to be very specialised to the hardware and computing environment. Really, I would recommend that you put these kinds of questions to someone who has used the machine before, because any advice you receive here is unlikely to be completely accurate for your particular cluster.
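One way to check where your work actually lands is to have each task report its host and process ID. Here is a minimal Python sketch using only the standard library; the names `where_am_i` and `run_tasks` are made up for illustration and are not part of any cluster software. Started from a plain SSH session, every task should normally report the same hostname: the operating system spreads the worker processes over that one machine's cores, but nothing moves to another node unless a scheduler or an MPI launcher places it there.

```python
import os
import socket
from multiprocessing import Pool


def where_am_i(task_id):
    """Report which host and which process handled a task."""
    return {"task": task_id, "host": socket.gethostname(), "pid": os.getpid()}


def run_tasks(n_tasks, n_workers):
    """Run n_tasks across n_workers local processes and collect the reports."""
    with Pool(processes=n_workers) as pool:
        return pool.map(where_am_i, range(n_tasks))


if __name__ == "__main__":
    for report in run_tasks(n_tasks=4, n_workers=2):
        print(report)
```

If the reports show different PIDs but a single host, the job is using several cores of one machine only, which is the situation described above.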

Related

What is the purpose of a single Hadoop node?

I am new to Hadoop so this may seem like a silly question.
The purpose of Hadoop is to distribute processing power and storage across multiple computers.
So what is the purpose of a single Hadoop node? It's only one computer, so there is no distribution or sharing of resources available.
Strictly learning and getting started.
It is also useful for local unit testing of very small workloads without touching a standing production cluster or a large dataset. For example, you can parse a file and make sure the logic works before you scale it up and run into issues that are harder to debug.
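As a rough illustration of that kind of local check, the sketch below parses a few sample records in plain Python before any cluster is involved. The record layout ("timestamp level message") and the function names are assumptions invented for this example, not something Hadoop prescribes.

```python
from collections import Counter


def parse_line(line):
    """Split one 'timestamp level message' record; return None if malformed."""
    parts = line.strip().split(" ", 2)
    if len(parts) != 3:
        return None
    timestamp, level, message = parts
    return {"timestamp": timestamp, "level": level, "message": message}


def count_levels(lines):
    """Count records per level, skipping lines that do not parse."""
    records = (parse_line(line) for line in lines)
    return Counter(r["level"] for r in records if r is not None)


if __name__ == "__main__":
    sample = [
        "2024-01-01T00:00:00 INFO job started",
        "2024-01-01T00:00:05 ERROR disk full",
        "garbage",
        "2024-01-01T00:00:09 INFO job finished",
    ]
    print(count_levels(sample))
```

Once the parsing logic behaves on a handful of lines like these, the same function can be reused inside the distributed job.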

About online distributed environment

I am learning MapReduce and Hadoop now. I know I can do some tests and run some samples on a single node. But I really want to practice in a real distributed environment. So I want to ask:
Is there a website which can offer a distributed environment for me to do some experiments?
Somebody told me that I can use Amazon Web Services to build a distributed environment. Is that true? Does anyone have experience with this?
And I want to know how you learned Hadoop before you used it in your work.
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon has a service called Elastic MapReduce (EMR), which is essentially Hadoop running on AWS. You write your Hadoop job, decide how many machines you want in your cluster and which type of machines you want, and then run it. EMR does everything else: it bootstraps the machines for you, runs your job and stores the results on S3. I would recommend looking at this tutorial to get an idea of how to set up a job on EMR. Just remember, EMR is not free, so you'll have to pay for your computing resources.
Alternatively, if you're not looking to pay the cost of EMR, you could always set up Hadoop on your local machine in non-distributed mode and experiment with it, as described here. Even if it's a single-node setup, the abstractions will be the same as if you were using a big cluster, so it's a good way to get up to speed and then move to EMR or a real cluster when you want to get serious.
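To make the point about abstractions concrete, here is a small sketch of the map, shuffle and reduce steps in plain Python, with no Hadoop involved. The function names are invented for this example, and the in-memory sort only stands in for the shuffle that a real cluster performs across machines.

```python
from itertools import groupby
from operator import itemgetter


def map_words(lines):
    """Map step: emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.lower().split():
            yield word, 1


def reduce_counts(pairs):
    """Reduce step: sum the values of each key (sorting stands in for the shuffle)."""
    ordered = sorted(pairs, key=itemgetter(0))
    for word, group in groupby(ordered, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run the whole pipeline on an in-memory list of lines."""
    return dict(reduce_counts(map_words(lines)))


if __name__ == "__main__":
    print(word_count(["a b a", "B c"]))
```

The map and reduce functions never need to know whether the data came from one machine or many, which is why a single-node setup is enough to learn the programming model.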
Amazon offers a free tier, so you can spin up some VMs and try experimenting that way. The micro instances they have aren't very powerful, but are fine for small-scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMware Player. You can install any flavor of Linux you like for free. Ubuntu is pretty easy to start with. When you set up the networking for your VMs, be sure to use bridged networking. That way each VM will get its own IP address on your network so they can communicate with each other.
Well, this is maybe not '100% online', but it should be a really good alternative, so here are some details.
If you are not ready to pay for online cluster resources (such as the EMR solution mentioned here) and you don't want to build your own physical cluster, but you are not satisfied with a single-node setup, you can try to build a virtual cluster on a powerful enough desktop.
You need a minimum of 3 VMs (I prefer Ubuntu); 4 is better. To see real Hadoop behaviour you need a replication factor of at least 3, so you need 3 DataNodes and 3 TaskTrackers. You also need a NameNode / JobTracker; it could share one of the DataNode VMs, but I'd recommend a separate VM. If you need HBase, for example, you again need one Master and a minimum of 3 RegionServers. So, again, you need 3 VMs, but 4 is better.
There is a pretty good free product, Cloudera CDH, which is a 'somewhat commercial' Hadoop distribution. They also have a manager with a GUI and simplified installation. BTW, they even have prepared demo VMs, but I have never used them. You can download everything here. They also host a lot of material about Hadoop and their environment.
An alternative between a completely free solution with VMs on a desktop and a paid service like EMR is a virtual cluster built on top of one dedicated server, if you have a spare one. This is what I personally did: one physical server running VMware's free product, 4 virtual machines, 1 SSD for the OS and 3 'general' HDDs for storage. Every VM runs Ubuntu 11.04 (again free), with Cloudera Manager free edition and CDH. So everything is free, but you need some hardware, which is often available as a spare. And you have a playground. OK, you need to invest time, but in my opinion you will get the most experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

Can I run two processes that use Spread on the same machine without them seeing each other?

I have two processes that use spread-toolkit and I want to run them on the same machine, but they are not supposed to see each other in Spread.
The only simple solution that I can come up with is running two Spread instances with different ports and configurations on the same machine.
Is there any way to separate them in the spread configuration instead of the solution above?
Spread specific answer
According to the FAQ:
Configuration and Setup Questions
What ports can you run it on?
Any ports you want. Just change the ports in the configuration file spread.conf
and restart the Spread daemons. We recommend using random high ports over 2000.
If you're on Linux or a similar platform, the configuration is in /etc/spread.conf. If you're on a Windows platform, you'll need to poke around to find it.
You can set up multiple spread segments on different ports. See pages 9-12 of the user guide. In addition, you may find a scrap or two of information in this Stack Overflow question. Here's a quick example fragment:
Spread_Segment 192.168.0.255:2000 {
    machine1 192.168.0.1
    machine2 192.168.0.2
}
Spread_Segment 192.168.0.255:2001 {
    machine1 192.168.0.1
    machine2 192.168.0.2
}
Caveat: I have merely updated my answer with some readily available information that I hope will be helpful. I do not carry practical experience in using Spread at this time.
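For the approach from the question (two daemons on different ports), a small script can generate one configuration per port so the two files stay consistent. This is only a sketch: the helper names are hypothetical, the addresses are the placeholders from the fragment above, and it covers nothing but the Spread_Segment block, so check the output against the user guide before using it.

```python
def spread_conf(broadcast, port, machines):
    """Render one Spread_Segment block in the format shown above."""
    lines = [f"Spread_Segment {broadcast}:{port} {{"]
    lines += [f"    {name} {ip}" for name, ip in machines]
    lines.append("}")
    return "\n".join(lines) + "\n"


def isolated_configs(ports, broadcast, machines):
    """Build one configuration text per port, keyed by port number."""
    return {port: spread_conf(broadcast, port, machines) for port in ports}


if __name__ == "__main__":
    # Placeholder values taken from the example fragment, not from a real network.
    configs = isolated_configs(
        ports=[2000, 2001],
        broadcast="192.168.0.255",
        machines=[("machine1", "192.168.0.1")],
    )
    for port, text in configs.items():
        print(f"# configuration for the daemon on port {port}")
        print(text)
```

Each process would then connect only to the daemon on its own port, so the two never share a group.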
Original Answer
There may be a solution specific to the Spread toolkit; however, not being familiar with it, I will mention some more general methods you might use.
If your cluster is running Linux, you can probably do what you want by using Linux Containers. These are based on a kernel feature called control groups.
If your cluster is running a BSD derivative, the corresponding technology is BSD Jails. BSD Jails have been in existence longer than the Linux option, and are very well tested.
Both of these methods use operating system virtualisation, which is much lighter (less overhead) than both full- and para-virtualisation.

EC2 cluster Instances for offloading desktop-scale computing tasks

I'm using EC2 to offload some computing tasks from my desktop - basically running some jobs that would take hours or days on a desktop, nothing particularly large scale, so I'm not looking to setup anything too complex - it should be able to run on a single instance running ubuntu. I know this is stretching the use case of EC2 and there are better long term solutions than using EC2 in this way, but I'll address that at a later point in time.
However, if I use standard, high memory, or high cpu ubuntu server instances, even the XL classes (e.g. m2.4xlarge) are fairly slow in terms of their computing capability, and the cluster compute instances are probably more appropriate for my needs. However, I can't use the cluster compute instances unless I choose the "ubuntu server for cluster instances" images, which are lacking in preinstalled libraries and software. I can install the packages piece-by-piece but this seems like a roundabout way of doing something they're not intended for (I tried swapping an EBS volume from a regular server instance into a cluster instance, but the instance wouldn't boot when I did that).
Basically the bottom line is I would like to use the hardware of their cluster compute instances but not use the stripped down OS so I can run some single instance jobs with a minimal setup. What's the best way to go about this?
You can try using cloud-init to install your required packages at boot. Basically, you write a shell script and pass it to the instance as user data; by default, cloud-init runs such a script when the instance first boots, not on every start.
Did you look into bootstrapping? A CloudFormation template might be an answer.
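As a rough sketch of the bootstrapping idea, the script below builds the text of a user-data shell script that installs a list of packages. The function name and the package names are only examples chosen for illustration; the script text would be pasted into the user-data field when launching the instance, and how you launch it (console, CloudFormation, API) is up to you.

```python
import shlex


def build_user_data(packages):
    """Return a shell script that installs the given apt packages."""
    if not packages:
        raise ValueError("at least one package is required")
    # Quote each name so unusual characters cannot break the command line.
    quoted = " ".join(shlex.quote(p) for p in packages)
    return "\n".join([
        "#!/bin/bash",
        "set -e",
        "export DEBIAN_FRONTEND=noninteractive",
        "apt-get update",
        f"apt-get install -y {quoted}",
    ]) + "\n"


if __name__ == "__main__":
    # Example package names only; substitute whatever your jobs need.
    print(build_user_data(["build-essential", "gfortran"]))
```

Keeping the package list in one place like this also makes it easy to rebuild the same environment on a different instance type later.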

getting started with EC2 for compute-intensive (non-web) parallel application

I'm using LIBSVM for regression analysis. Works like a champ. But a 3-parameter grid search to optimize parameters for the model maxes out all four cores on my 2.66 GHz Intel box, and I still have to wait a couple of hours to generate a single model.
This seems like a job for Amazon EC2.
I've seen plenty of tutorials and introductory material on using EC2 for web-related tasks.
But what if you have a small compute-intensive custom ANSI-C program that you want to run multiple instances of on EC2? Can anyone provide pointers on how to do that (or even just buzzwords to search for)?
I don't think your quest is too different from that of a web application. Your stack is different of course, but regardless – the principles remain the same.
As someone commented on your question, Elastic MapReduce might be what you're looking for to parallelize your work easily. If that is too limited, you could look into Cloudera, a ready-to-rumble Hadoop distribution with support for EC2 as well.
If map-reduce is not to your liking, then you need to set up your own instance. Roughly speaking, the key points are as follows:
You want to figure out a way to start EC2 instances.
You want to figure out a way to bootstrap and configure them.
Cluster/network?
Starting EC2 instances
If you don't require something like auto-scaling or a custom interface, the AWS Console does an extremely good job. You have to select an AMI (Amazon Machine Image) suitable for your project. I'd probably look into either the official AMI or something Ubuntu-based (If I remember correctly, Ubuntu is the most used Linux on EC2).
But that is up to you and your liking. (And I don't know enough about your project.)
Once you have figured out a setup that works for you, the easiest way to clone your work is to create your own AMI and start instances from it.
Bootstrapping
Bootstrapping can be done with what EC2 calls user data. It allows you to pass a shell script to the instance, which then runs the commands that set up your stack. I'm not sure what is required in this case, so if you comment or extend your question, I can go into more detail here.
Cluster/Networking
This is a wild guess since I'm not sure what your code does or how it works. If a cluster is not necessary, I'd probably run this on a single instance first. You can get a lot of cores and RAM provisioned easily with EC2. Depending on whether your work requires more RAM or more CPU, look into the high-memory and high-CPU instance types.
You can start off with a t1.micro, which you can currently even get for free, and go from there.
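To give an idea of how a grid search spreads over the cores of one instance, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the three parameter names, the grid values, and above all `score`, which is a placeholder objective standing in for training and cross-validating a real model. The `split_grid` helper shows one simple way to divide the same grid between several instances if one machine turns out not to be enough.

```python
from itertools import product
from multiprocessing import Pool


def score(params):
    """Placeholder objective; a real run would train and cross-validate here."""
    c, gamma, epsilon = params
    return (c - 1.0) ** 2 + (gamma - 0.1) ** 2 + (epsilon - 0.01) ** 2


def make_grid(c_values, gamma_values, epsilon_values):
    """List every combination of the three parameter ranges."""
    return list(product(c_values, gamma_values, epsilon_values))


def split_grid(grid, n_chunks):
    """Deal grid points round-robin into n_chunks lists, e.g. one per instance."""
    return [grid[i::n_chunks] for i in range(n_chunks)]


def best_point(grid, n_workers=None):
    """Score all points in parallel; return the (score, params) pair with the lowest score."""
    with Pool(processes=n_workers) as pool:
        scores = pool.map(score, grid)
    return min(zip(scores, grid))


if __name__ == "__main__":
    grid = make_grid([0.1, 1.0, 10.0], [0.01, 0.1, 1.0], [0.001, 0.01, 0.1])
    print(best_point(grid))
```

Because each grid point is independent, the work scales with the number of cores, and the chunks from `split_grid` can be handed to separate instances without any communication between them.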
Let me know if this helps!
