Hadoop virtual cluster vs single machine - hadoop

I have a question regarding speed & performance of
using multiple virtualized nodes in a single machine VS single node on the single machine itself.
which one will perform better?
The reason why I ask this question is because I am currently learning hadoop on a single machine, and I see some tutorials on the internet that shows the use of multiple virtualized nodes in a single machine.
Thank you in advance

There is always some overhead that comes with virtualization, so unless really necessary I wouldn't advise to run Hadoop in a virtualized environment.
That being said, I know VMWare did a lot of work on making Hadoop work in a virtualized environment, and they have published some benchmarks in which they claim under certain conditions to have better performance with VMs that a native application. I haven't played much with vSphere, but this could be something to look at if you want to explore virtualization further. But don't take the numbers for granted, it really depends on the type of hardware you're running, so in some conditions I think you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.
If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that you can run multiple Hadoop daemons on the same box, each as a separate process. That's what I used to get started with Hadoop, and it's a good head start. You can find more info here (or might need another page depending on which Hadoop version you're running).
If you get to the point where you want to test with a real cluster, but don't have the resources, I would advise looking at Amazon Elastic Map/Reduce: it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. More info here.
the bottom line is, I think if the purpose is simply testing, you don't really need a virtual cluster.

A performance analysis case study conducted on this topic showed that a virtual Hadoop cluster is only around 4% less efficient compared to its native counterpart: Virtualized hadoop performance case study

Related

Suggestion for building a small Hadoop cluster for learning purpose

I have a test for my Big Data class where I have to do some sort of big data analytics with 'smaller' datasets. I actually have my stuff figured it out. I installed Hadoop 2.8.1 and Spark 2.2.0 (I use PySpark to build a program) in standalone mode on my Ubuntu 16.04 from source. I'm actually good to go to do my thing by my own.
The thing is, some of my friends are struggling in configuring all of these and I thought to myself "why don't I make my own little cluster with my classmates". So I'm looking for suggestions.
My laptop has 12 GB RAM and Intel Core i5.
If I understand correctly, your friends have trouble setting up spark in standalone mode (meaning no cluster at all, just local computation). I don't think setting up a cluster they can work with takes away from the complexity they will face. Or are they trying to set up a cluster? Because standalone mode of Spark really doesn't need much configuration.
Another approach is to use a preconfigured VM everyone can use individually. Either prepared by yourself, or there are sandboxes by different providers, e.g. Cloudera and Hortonworks.

Condor, Sun Grid Engine, or something else?

I'm trying to work out whether we should try out Condor or Sun Grid Engine at work (or possibly something else).
We often have lots of unused WinXp workstations. The hope is that we could use wake-on-LAN, run all our jobs, and then shut down automatically. We'd mainly be running Matlab, Java or Python simulations for either monte-carlo or parameter explorations.
With my limited knowledge of Condor, it sounds like using a the vm universe might be a convenient way of taking care of snapshots without having to modify existing code.
Is SGE or something else better than condor for this kind of work?
SGE doesn't really support windows. It comes with all kinds of caveats and missing bits on Windows.
I've been running Condor pools for many years now and it is a superb HTPC setup for both cycle-stealing and dedicated, always-on hardware, on Linux and Windows machines. The recent addition of their Rooster daemon lets you put machines to sleep between job cycles and wake them up when new work appears in the pool. They also have an active and very helpful support community. Checkpointing is the only Condor feature not available on Windows. Everything else is there. With the addition of the VM Universe, checkpointing is getting less and less useful. Really: to use checkpointing successfully you need to be able to relink your entire code stack. So if you're running Matlab jobs, even on Linux, checkpointing isn't going to be possible.
If you have specific questions about getting Condor running on Windows I'd be happy to answer them, share my experiences with it. I run Condor across 4 pools around the globe with a total of about 1500 dedicated machines in all the pools and some 1000 or so additional desktop machines that are available as users care to donate them.
I'd start with Condor. It has good support for Windows, and newer versions have built-in support for sending wake-on-lan in a very configurable way when jobs can run on certain machines. It can also shut the machines down based on user-defined policies.
After Oracle's takeover of SGE (Sun Grid Engine), there is the Open Grid Scheduler project that still offers open-source Grid Engine.
http://gridscheduler.sourceforge.net/
For dedicated hardware I'd go with Grid Engine.
For scavenging clock cycles on machines which may be in use I'd go with Condor.
For hardware which you have dedicated access to for fixed periods, such as overnight and at weekends, I'd probably still go with Condor but might be able to persuade myself to use Grid Engine.
I've had to choose between condor and SGE for a customer project recently. I was favoring SGE (because I was more familiar with that environment), but Condor won finally because:
the customer infrastructure is Windows oriented, and the SGE solution requires a Unix or Linux machine for the Central Manager, + installing MS Services for Unix on the computation hosts
support and installation process of Condor on Windows was much simpler.
However, you cannot use the most interesting features of Condor on Windows : checkpointing is not available, nor the Condor specific IOs. I'm not using the VM universe, so I cannot comment on that aspect.
I've only tried Condor, and it was a pain to attempt to set up. If you need all the clock cycles you can fully utiilize, go with Condor.
I'm about to try SGE, and I'll tell you how it goes. However at my company, people have had experience setting up SGE, so I'll probably say SGE is easier.
SGE doesn't exist... it's OGE, and it's very expensive. Go with Condor.

Can computer clusters be used for general everyday applications?

Does anyone know how a computer cluster can be used for everyday applications, like for example video games?
I would like to build a computer cluster that can run applications over the cluster that were not specifically designed for computer clusters and still see the performance increase. One use would be for video games, but I would also like to utilize the increased computing power for running a large network of virtualized machines.
It won't help, especially in the case of video games. You have to build around the cluster; the cluster does not work around you.
At any rate, video games require sub-50ms response time on input and response,and network propagation would just destroy any performance gains you might see. Video processing, on the other hand, benefits GREATLY from the cluster as the task is inherently geared toward parallelization. It does not require user input, and output is only measured in terms of the batch process.
If you have a program written for a single core, running it on a four-core processor won't help you (except that one core can be devoted to that program). For example, I have Visual Studio compiling on multiple cores on this machine, but linking is done on one core (and is annoyingly slow). In order to get use out of multiple cores, I have to either run something that can use multiple cores or run several separate programs.
Clusters are like that, only more so. All communication between the machines is explicit and must be programmed in. There are things you can do with a cluster (see Google's map-reduce algorithm), but they do require special programming and work.
Typical clusters are used either to specialize machines (one might be a database server and one a web server, for example), or to run large numbers of programs simultaneously.
You will not be able to easily run a video game on a cluster, unless it was already designed to work on multiple machines. I have not heard of such a game. You may have some luck creating a virtual server farm, but I doubt it will be easy to get it working perfectly. If you are interested in this, one example would be amazon's EC2 service. They offer virtual machines for "rent" by the hour. Behind the scenes, I assume they have a giant cluster that is supplying all of these virtual machines.
Unfortunately, unless you have some pretty clever operating system / software design in mind - simply connecting programs together via a cluster and hoping to get increased performance is not likely to work - especially not for video games. In order to get increased performance from running things in a cluster you have to program for it otherwise there is a good change you'd see a decrease in performance rather than an increase.

Software needed to build a cluster

I've been thinking about getting a little bit greener with my computers and using some lower power, mini-itx boards in my next computer. Some can generate under 10 watts and are pretty inexpensive.
So I thought, if one is such low cost and low power, why not try to make a cluster out of them? However, I'm not really sure what I would need to do in terms of Operating System or management software to make this happen?
Can anyone provide advice on existing software to do this or any ideas as to how to design my own?
What do you want to actually do with your cluster kind of decides what software you will need.
Do you need job scheduling?
Monitoring tools?
Do you need to deploy software across all nodes at once seamlessly?
One file system across all nodes (recommended).
You could just as easily install a linux or *BSD on the boards and just use ssh to manage and run jobs across all the nodes. No other software really required.
Software you might find useful:
PBS (mostly job scheduling, google)
Kerrighed (Single System Image based, Linux distro)
Rocks (cluster based distribution)
Mosix ( cluster Management, openMosix also )
Ganglia (Monitoring, probably over kill for you)
Lustre (Super fast, opensource cluster filesytem from Sun)
Take a look at beowulf to get started.
That being said, the best advice I can give is to carefully measure whether you are actually being more green with your cluster. I've been a little way down this road before, and in my experience, the losses involved in having many separate computers end up wiping out any energy savings. Keep in mind that every computer needs a power supply, which converts your household voltage down to a level that the computer wants. The conversion is inefficient, and wastes heat (this is why the power supplies have fans). The same can be said for each hard drive, RAM bank and motherboard that you need.
This isn't meant to discourage you from the project. Just be sure to profile. Exactly like writing software! :)
You can use Beowulf to run a cluster.
There's a lot to this question.
First, if you just want to get a cluster up and running, there are many suggestions listed already here. Once you have the cluster up and running, though, you're just starting.
At that point, you need to have software that will work correctly across the cluster. If you are working on your own software, you'll need to design it to be parallelized across a cluster, using something like MPI.
Without software written to run across the cluster, though, the cluster is nothing but a highly customized box that doens't do anything special...

How to benchmark virtual machines

I am trying to perform a fair comparison of XenServer vs ESX and one comparison I would like to make is performance with multiple VMs. Does anyone know how to go about benchmarking VM performance in a fair way?
On each server I would like to run a fixed number of XP/Vista VMs (for example 8) and have some measure of how quickly each one runs when under load. Ideally I would like some benchmark of the overall system (CPU/Memory/Disk/Network) rather than just one aspect.
It seems to me this is actually a very tricky thing to do and obtain any meaningful results so would be grateful for any suggestions!
I would also be interested to see any existing reports or comparisons that have been published (preferably independent rather than vendor biased!)
As a general answer, VMware (together with other virtualization vendors in the SPEC Virtualization sub-committee) has put together a hypervisor benchmarking suite called VMmark that is available for download. The VMmark website discusses why this benchmark may be useful for comparing hypervisors, including an FAQ and a whitepaper describing the benchmark.
That said, if you are looking for something very specific (e.g., how will it perform under your workload), you may have to roll your own variants of VMmark, especially if you are not trying to do the sorts of things that VMmark benchmarks (e.g., web servers, database servers, file servers, etc.) Nonetheless, the methodology behind its development should be of interest.
Disclaimer: I work for VMware, though not on VMmark.
I don't see why you can't use common benchmarks inside the VMs: WinSAT, Passmark, Futuremark, SiSoftware, etc... Host the VMs over different hosts and see how it goes.
As an aside, benchmarks that don't closely match your intended usage may actually hinder your evaluation. Depending on the importance of getting this right, you may have to build-your-own to make it relevant.
Why do you want to bench?
How about some anecdotal evidence?
I'm going to assume this is a test environment, because you're wanting to benchmark on XP/Vista. Please correct me if I'm wrong.
My current test environment is about 20 VMs with varying OS's (2000/XP/Vista/Vista64/Server 2008/Server 2003) in different configurations on a Dual Quad Core Xeon machine with 8Gb RAM (looking to upgrade to 16Gb soon) and the slowest machines of all are Vista primarily due to heavy disk access (this is even with Windows Defender disabled)
Recommendations
- Hardware RAID. Too painful to run Vista VMs otherwise.
- More RAM.
If you're benchmarking and looking to run Vista VMs, I would suggest putting your focus on benchmarking disk access. If there are going to be performance differences elsewhere I doubt they would be of anything significant.
I recently came across VMware's ESX Performance documentation - http://www.vmware.com/pdf/VI3.5_Performance.pdf. There is quite a bit in there on improving performance and benchmarking.

Resources