Which one of the following is not part of a Beowulf cluster? - cluster-computing

I need an answer to this:
Which one of the following is not part of a Beowulf cluster?
Windows.
Linux.
MPI.
IP Stack.

IP Cluster.
You can eliminate the other options using: http://en.wikipedia.org/wiki/Beowulf_cluster (Paragraph 3)
I think a simple Google search could have answered the question.

Related

About online distributed environment

I am learning MapReduce and Hadoop now. I know I can do some tests and run some samples on a single node, but I really want to practice in a real distributed environment. So I want to ask:
Is there a website that offers a distributed environment for me to do some experiments?
Somebody told me that I can use Amazon Web Services to build a distributed environment. Is that true? Does anyone have experience with this?
And I want to know how you learned Hadoop before you used it in your work.
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon offers Elastic MapReduce (EMR), which is essentially Hadoop running on AWS. You write your Hadoop job, decide how many machines you want in your cluster and which type, and run it; EMR does the rest: it bootstraps the machines, runs your job, and stores the results on S3. I would recommend looking at this tutorial to get an idea of how to set up a job on EMR. Just remember that EMR is not free, so you'll have to pay for your computing resources.
Alternatively, if you don't want to pay for EMR, you can always set up Hadoop on your local machine in non-distributed mode and experiment with it, as described here. Even with a single-node setup, the abstractions are the same as on a big cluster, so it's a good way to get up to speed before moving to EMR or a real cluster when you want to get serious.
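To make the Map/Reduce paradigm mentioned above concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and does not use the Hadoop API; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration. It only shows the three steps a framework such as Hadoop performs for you: map, group by key, reduce.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle step: group all emitted values by key.

    A real framework does this across machines; here it is a dict.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: combine all values seen for one key."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "the fox"]))
```

On a real cluster the map and reduce functions keep the same shape; what changes is that the framework splits the input, runs many mappers and reducers in parallel, and moves the grouped data between machines.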
Amazon offers a free tier, so you can spin up some VMs and experiment that way. The micro instances aren't very powerful, but they are fine for small-scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMware Player. You can install any flavor of Linux you like for free; Ubuntu is pretty easy to start with. When you set up the networking for your VMs, be sure to use bridged networking. That way each VM gets its own IP address on your network, so they can communicate with each other.
Well, it's maybe not about '100% online' but should give really good alternative with some details.
If you are not ready to pay for online cluster resources (such as EMR solution mentioned here) and you don't like to build your own cluster but you are not satisfied with single node setup, you can try to build virtual cluster on powerful enough desktop.
You need minimun 3 VM, I prefer Ubuntu. 4 is better. To see real Hadoop you need minimal replication factor 3. So you need 3 dataNode, 3 taskTrackers. Well, you also need nameNode / JobTracker - it could be one of nodes used for dataNode but I'd recommend to have separate VM. If you need HBase, for example, you again need one Master and minimum 3 RegionServer. So, again, you need 3 but better 4 VM,
There is pretty good free product, Cloudera CDH which is 'somewhat commercial' Hadoop distribution. They also have manager with GUI and simplified installation. BTW they have even prepared demo VMs but I never have used them. You can download everything here. They also host lot of materials about Hadoop and their environment.
Alternative between completely free solution with VMs on desktop and paid service like EMR is your virtual cluster built on top of one dedicated server if you have spare. This is what I personally did. One physical server powered by VmWare free solution, 4 virtual machine, 1 SSD for OS and 3 'general' HDD for storages. Every VM runs Ubuntu 11.04 (again free). Cloudera manager free edition, CDH. So everything is free but you need some hardware that is often available as spare. And you have playground. OK, you need to invest time but by my mind you will get greatest experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

How to install Hadoop? Can I use it to simulate many computers?

Suppose I want to find connected components in a huge graph where the number of nodes is very large. I don't have many machines to do it with. I just want to simulate a large network, with the machines doing the computation using MapReduce. Any direction?
I would try this: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
You can install Hadoop in a single computer and use it just for learning and developing.
Another option is to use the Cloudera virtual machines (CDH4) (see http://blog.cloudera.com/blog/2012/08/hadoop-on-your-pc-clouderas-cdh4-virtual-machine/).
If you are looking to simulate a large number of nodes, the only practical way of doing this is to use a service like Amazon's EMR.
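For the connected-components question itself, the algorithm can be prototyped on one machine before moving to Hadoop. The sketch below simulates one common approach (iterative minimum-label propagation) as repeated map and reduce rounds in plain Python. It is an illustration under assumptions, not Hadoop code: the function names are made up, the graph is assumed to be undirected and to fit in memory, and node identifiers are assumed to be comparable (for example integers).

```python
from collections import defaultdict


def map_phase(labels, edges):
    """Map step: every node sends its current label to itself and its neighbours."""
    for node, label in labels.items():
        yield node, label
    for u, v in edges:
        yield v, labels[u]
        yield u, labels[v]


def reduce_phase(pairs):
    """Reduce step: every node keeps the smallest label it received."""
    best = {}
    for node, label in pairs:
        if node not in best or label < best[node]:
            best[node] = label
    return best


def connected_components(nodes, edges):
    """Repeat map/reduce rounds until no label changes, then group nodes by label."""
    labels = {n: n for n in nodes}  # each node starts in its own component
    while True:
        new_labels = reduce_phase(map_phase(labels, edges))
        if new_labels == labels:
            break
        labels = new_labels
    components = defaultdict(set)
    for node, label in labels.items():
        components[label].add(node)
    return sorted(sorted(c) for c in components.values())


if __name__ == "__main__":
    print(connected_components([1, 2, 3, 4, 5], [(1, 2), (2, 3), (4, 5)]))
```

Each pass through the loop corresponds to one MapReduce job on a cluster, so the number of rounds grows with the diameter of the largest component. That is worth keeping in mind before running this approach on a very large graph.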

Can I run two processes that use Spread on the same machine without them seeing each other?

I have two processes that use the Spread toolkit, and I want to run them on the same machine, but they are not supposed to see each other in Spread.
The only simple solution that I can come up with is running two Spread instances with different ports and configurations on the same machine.
Is there any way to separate them in the Spread configuration instead of the solution above?
Spread specific answer
According to the FAQ:
Configuration and Setup Questions
What ports can you run it on?
Any ports you want. Just change the ports in the configuration file spread.conf
and restart the Spread daemons. We recommend using random high ports over 2000.
If you're on Linux or a similar platform, the configuration is in /etc/spread.conf. If you're on a Windows platform, you'll need to poke around to find it.
You can set up multiple spread segments on different ports. See pages 9-12 of the user guide. In addition, you may find a scrap or two of information in this Stack Overflow question. Here's a quick example fragment:
Spread_Segment 192.168.0.255:2000 {
machine1 192.168.0.1
machine2 192.168.0.2
}
Spread_Segment 192.168.0.255:2001 {
machine1 192.168.0.1
machine2 192.168.0.2
}
Caveat: I have merely updated my answer with some readily available information that I hope will be helpful. I do not carry practical experience in using Spread at this time.
Original Answer
There may be a solution specific to the spread toolkit, however, not being familiar with it, I will mention some more general methods you might use.
If your cluster is running Linux, you can probably do what you want by using Linux Containers. These are based on a kernel feature called control groups.
If your cluster is running a BSD derivative, the corresponding technology is BSD Jails. BSD Jails have been in existence longer than the Linux option and are very well tested.
Both of these methods use operating system virtualisation, which is much lighter (less overhead) than both full- and para-virtualisation.

Setting up multiple instances of Memcached on multiple servers

While reading the documentation on Memcached, I got the impression that one can set up memcached across multiple servers, thus creating a cluster. The question that naturally comes up is the exact procedure for doing this.
I think that if such a feature exists, it is not well documented, and the Memcached Wiki needs this addition. I found somewhere that you must include the IP list of all the servers in the configuration files, but is this enough? If someone could point me to a link or something outlining this procedure, I would be grateful.
Install it, run it, point all your clients to it the same way. Did you try this?
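The reason this works is that memcached servers do not talk to each other: the client library decides which server holds a given key, so every client must be configured with the same server list. The Python sketch below illustrates that idea only. It is not the algorithm of any particular memcached client (many real clients use consistent hashing instead of a plain modulo), and the server addresses and the server_for function are hypothetical.

```python
import hashlib

# Hypothetical server list; every client must use the same list in the same order.
SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]


def server_for(key, servers=SERVERS):
    """Pick the server responsible for a key by hashing the key.

    Simple modulo hashing, for illustration only: adding or removing a
    server remaps most keys, which is why real clients often prefer
    consistent hashing.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]


if __name__ == "__main__":
    for key in ("user:42", "session:abc", "page:/home"):
        print(key, "->", server_for(key))
```

Because the mapping depends only on the key and the server list, any client that asks for "user:42" goes to the same server that another client wrote it to, with no coordination between the servers themselves.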

How to use a "Rocks" cluster

I've just joined a research lab at my university and been given access to a cluster to compile and run the C++ code that I write. I use SSH to access it and simply use the cluster like a Linux terminal.
I often have to wait a relatively long time while my code runs. I'm trying to figure out if there's a more efficient way to use the cluster. For example, there are different CPUs/nodes in the cluster, some of which are more heavily used than others. How do I access a specific CPU? I have access to the "Ganglia" overview page, which gives information about the different nodes.
Also, if I run 2 processes in different SSH windows, will they automatically use different processors or nodes, or do I have to specify that manually?
I couldn't find any documentation to help me with these issues, so I'd appreciate a little help.
Thanks
Simply running something on a cluster does not mean it is taking advantage of the cluster at all. By default, it will probably just run on the head node. Software needs to be written specifically for a cluster.
There is likely to be some kind of scheduler running that you need to interface with. Perhaps you could also see if distcc is installed and configured for your particular cluster (for doing the compilation across multiple machines). There may also be a particular flavour of MPI running to allow processes on different nodes to communicate.
Cluster software setups tend to be very specialised to the hardware and computing environment. Really, I would recommend that you put these kinds of questions to someone who has used the machine before, because any advice you receive here is unlikely to be completely accurate for your particular cluster.

Resources