Installing a Hadoop cluster across VMs on different physical machines - hadoop

I am responsible for teaching Hadoop to a group of people (let's say 5 people), but without any hardware available.
Each of them has a laptop, with a quite good amount of memory and processors.
I would like to have them create a Hadoop cluster across their own laptops, which will be connected to the same network.
So far, my plan is to:
create a VM image preconfigured with Ubuntu 16 (my choice of OS) so that it is ready to act as a cluster node
ask each of them to run the VM on their computer
create a cluster on top of this network of VMs
However, I have a few sticking points:
1/ is it possible to create a private network of VMs located on their different machines, so that the hadoop cluster is isolated from the network that links the physical machines?
2/ What could be wrong with this approach?
3/ Is there a better way to handle this need of setting up a Hadoop cluster across different personal machines?
By the way, I am pretty ok with Hadoop installation and so on...
Thanks in advance for your help, suggestions, ...

is it possible to create a private network of VMs located on their different machines
Yes, companies do this all the time with clusters of VMs. Granted, these companies have people with years of experience doing networking setups like this, and some deep knowledge of firewalls and routing tables.
so that the hadoop cluster is isolated from the network that links the physical machines?
Not without a specific subnet for connecting all the machines. I'm guessing each laptop is sharing the same router, though, and each device has one network interface shared between the host and the VM, so creating this may prove difficult.
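If you do get a shared subnet working (bridged or host-only), it is worth sanity-checking connectivity between the VMs before installing anything. Here is a minimal probe sketch; the peer IPs are placeholders for whatever addresses your network actually hands out, and the ports are common Hadoop 2.x defaults you should adjust to your own configuration.

```python
# Minimal reachability check between the students' VMs.
import socket

# Hypothetical addresses of the other VMs on the shared subnet.
PEERS = ["192.168.56.101", "192.168.56.102", "192.168.56.103"]

# Common Hadoop 2.x ports: NameNode RPC, ResourceManager web UI, DataNode transfer.
# Adjust to whatever your *-site.xml files actually configure.
PORTS = [8020, 8088, 50010]

def reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host in PEERS:
    for port in PORTS:
        status = "open" if reachable(host, port) else "unreachable"
        print(f"{host}:{port} -> {status}")
```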
What could be wrong with this approach?
You need to designate at least one machine as the "master": the NameNode and the ResourceManager. Without this machine, nothing will work. A better approach uses an HA deployment, but then you're reliant on "two people" instead of one.
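To make the "one master" point concrete, here is a rough sketch of the two settings every node needs to agree on. The hostname hadoop-master is an assumption (you would map it in each VM's /etc/hosts), and the paths assume a standard Hadoop 2.x layout under $HADOOP_HOME.

```python
# Sketch: point every node at a single master, assuming the hostname
# "hadoop-master" resolves on all VMs (e.g. via /etc/hosts). The property names
# are the standard Hadoop 2.x ones; paths assume $HADOOP_HOME/etc/hadoop.
import os

MASTER = "hadoop-master"   # hypothetical hostname of the NameNode/ResourceManager VM
CONF_DIR = os.path.join(os.environ.get("HADOOP_HOME", "/opt/hadoop"), "etc", "hadoop")

CORE_SITE = f"""<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://{MASTER}:8020</value>
  </property>
</configuration>
"""

YARN_SITE = f"""<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>{MASTER}</value>
  </property>
</configuration>
"""

for name, body in [("core-site.xml", CORE_SITE), ("yarn-site.xml", YARN_SITE)]:
    path = os.path.join(CONF_DIR, name)
    with open(path, "w") as f:
        f.write(body)
    print("wrote", path)
```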
Is there a better way for handling this need of setting a Hadoop cluster
Use a free tier/credit on AWS, Azure, or GCP to set up a cluster. You can start with 2-3 nodes rather than 5.

Related

Web server with redundant MariaDB databases

I would like some advice on how to set up a highly available web server for a PHP application, using virtual machines running Red Hat Linux.
The idea is to have two virtual web servers sharing a common document root over NFS or iSCSI, plus two more MariaDB databases replicating the data.
I have some documentation to follow; anyway, I'd like to know your opinion, in particular about how to cope with the replication of the databases, which must be redundant.
Many Thanks
Riccardo
Do not try to share files. The code does not know how to coordinate things between two instances of mysqld touching the same data files.
Then you talk about replication. This involves separate data files. But, if you put them on the same physical drive, what is the advantage? If the drive crashes, you lose both.
Read more about HA solutions, and make a priority list of what things "keep you up at night". Drive failure, motherboard failure, network corruption, floods, earthquakes, bats in the belfry, etc.
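For the replication part specifically, the usual starting point is classic binlog replication between two separate MariaDB instances on separate drives/hosts. The sketch below shows roughly what the replica-side wiring looks like from Python (it needs the mysql-connector-python package); all hostnames, credentials, and binlog coordinates are placeholders you would read off SHOW MASTER STATUS on the primary.

```python
# Sketch of wiring a MariaDB replica to its primary, assuming classic
# binlog-based replication. Hostnames, credentials, and binlog coordinates
# are placeholders; take them from SHOW MASTER STATUS on the primary.
import mysql.connector

replica = mysql.connector.connect(host="db2.example.lan", user="root", password="secret")
cur = replica.cursor()

cur.execute("""
    CHANGE MASTER TO
        MASTER_HOST='db1.example.lan',
        MASTER_USER='repl',
        MASTER_PASSWORD='repl-password',
        MASTER_LOG_FILE='mariadb-bin.000001',
        MASTER_LOG_POS=4
""")
cur.execute("START SLAVE")

cur.execute("SHOW SLAVE STATUS")
print(cur.fetchall())   # check the Slave_IO_Running / Slave_SQL_Running columns
replica.close()
```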

Is it possible to have Windows machines participate part-time in a Hadoop cluster?

We are running into the need for some map-reduce computation and don't want to build out a dedicated cluster for it just yet. We do have a lot of Windows machines on our network that are used for development, mail, etc. It would be great if these machines could participate when they are underutilized.
Is there a way to have machines participate in Hadoop (or a different map-reduce/distributed computation system) when they are idle rather than being a dedicated resource?

About renting and using a cluster on Amazon EC2

I am currently researching the topic of improving the MapReduce scheduler, but unfortunately my university does not provide a cluster for research purposes. I was thinking about renting a cluster and I heard about Amazon EC2, but I have no experience with its services and I do not know how to use them.
I am in need of 5 machines with the following specifications (for each machine):
A dual-processor (2.2 GHz AMD Opteron(tm) Processor 4122 with 4 physical cores)
8GB of RAM
500GB disk
I want to set up the Linux operating system and the Hadoop framework manually, just as I would if I had the machines physically in my hands. I would like to know if Amazon EC2 offers something like this, and I would like to estimate the cost of this infrastructure for, let's say, a month.
In case I choose Amazon's Elastic MapReduce framework, would I be able to control the version of Hadoop? Would I also be able to change the configuration of its scheduler so that I can plug in my own algorithm?
Finally, I would like to know if there is any kind of simulator for MapReduce to make different experiments.
Please excuse my multiple questions; I am new to this field and any guidance would be really appreciated.
I was thinking about renting a cluster and I heard about Amazon EC2, but I have no experience with its services and I do not know how to use them.
AWS has elaborate documentation; for reference, here is the Getting Started link to get you going. The AWS self-paced labs are also worth checking out.
I am in need of 5 machines with the following specifications (for each machine): A dual-processor, 8GB of RAM, and 500GB of disk.
AWS provides a wide range of EC2 instance types. Choose the one that best fits your use case from the list of instance types.
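As an illustration, once you have picked a type, launching the five nodes is a few lines of boto3. Everything below is a sketch: the AMI ID and key pair are placeholders, and c5.xlarge (4 vCPUs, 8 GiB RAM) is only one plausible match for the specs you listed.

```python
# Sketch of launching five identical EC2 instances with boto3, assuming AWS
# credentials are already configured for this account/region.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder: your chosen Linux AMI
    InstanceType="c5.xlarge",          # one reasonable 4-vCPU / 8 GiB choice
    MinCount=5,
    MaxCount=5,
    KeyName="my-keypair",              # placeholder: an existing EC2 key pair
    BlockDeviceMappings=[{
        "DeviceName": "/dev/sda1",     # root device name depends on the AMI
        "Ebs": {"VolumeSize": 500, "VolumeType": "gp2"},  # the 500 GB disk per node
    }],
)

for inst in response["Instances"]:
    print(inst["InstanceId"], inst["State"]["Name"])
```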
I want to set up the Linux operating system and the Hadoop framework manually, just as I would if I had the machines physically in my hands. I would like to know if Amazon EC2 offers something like this, and I would like to estimate the cost of this infrastructure for, let's say, a month.
AWS does not provide a VM without an OS installed. All the VMs provided by AWS come pre-loaded with an OS, and you can manually install Hadoop on top of that. Of course, AWS offers a wide range of operating systems.
Amazon also provides a Simple Monthly Calculator to estimate how much your cluster might cost based on the instances you have selected and the number of EBS volumes you have attached to each instance.
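If you just want a rough number before opening the calculator, a back-of-the-envelope estimate looks like this; the hourly and per-GB rates below are placeholders, not current AWS prices.

```python
# Back-of-the-envelope monthly cost estimate. The hourly instance rate and the
# per-GB EBS price are placeholders -- look up current prices before trusting the number.
NODES = 5
HOURS_PER_MONTH = 24 * 30

instance_rate = 0.17   # hypothetical on-demand $/hour for one instance
ebs_gb = 500           # GB of EBS storage per node
ebs_rate = 0.10        # hypothetical $/GB-month for the chosen EBS volume type

compute = NODES * HOURS_PER_MONTH * instance_rate
storage = NODES * ebs_gb * ebs_rate
print(f"compute ~${compute:.0f}/month, storage ~${storage:.0f}/month, "
      f"total ~${compute + storage:.0f}/month")
```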
In case I choose Amazon's Elastic MapReduce framework, would I be able to control the version of Hadoop? Would I also be able to change the configuration of its scheduler so that I can plug in my own algorithm?
If you use AWS EMR to deploy the Hadoop cluster, you can select the version of Hadoop to be installed; the Hadoop versions supported by Amazon are 2.4.0, 2.2.0, 1.0.3, and 0.20.205.
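For what it's worth, on the later ReleaseLabel-based EMR releases you can pin the release (and therefore the Hadoop version it ships) and override scheduler settings at cluster creation time. The boto3 sketch below assumes such a release and the default EMR role names; the older AMI-version releases listed above are configured differently (AmiVersion plus bootstrap actions).

```python
# Sketch of creating an EMR cluster with a pinned release and a scheduler override.
# Role names, instance types, and counts are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="scheduler-experiments",
    ReleaseLabel="emr-5.36.0",          # pins the Hadoop version shipped with this release
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 5,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    Configurations=[{
        # Overrides capacity-scheduler.xml on the cluster; the property below is
        # just an example of the kind of scheduler knob you can change this way.
        "Classification": "capacity-scheduler",
        "Properties": {"yarn.scheduler.capacity.maximum-am-resource-percent": "0.5"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```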
Finally, I would like to know if there is any kind of simulator for MapReduce to make different experiments.
I did not understand the MapReduce simulator part, though.

About an online distributed environment

I am learning MapReduce and Hadoop now. I know I can do some tests and run some samples on a single node, but I really want to do some practice in a real distributed environment. So I want to ask:
Is there a website which can offer a distributed environment for me to do some experiments?
Somebody told me that I can use Amazon Web Services to build a distributed environment. Is that true? Does anyone have such experience?
And I want to know: how did you learn Hadoop before you used it in your work?
Thank you!
There are a few options:
If you just want to learn about the Map/Reduce paradigm, I would recommend you take a look at JSMapReduce. This is embedded directly in the browser, you have nothing to install, and you can create real Map/Reduce programs.
If you want to learn about Hadoop specifically, Amazon has a service called Elastic MapReduce (EMR), which is essentially Hadoop running on AWS. It lets you write your Hadoop job, decide how many machines you want in your cluster and which type of machines you want, and then run it; EMR does everything else: it bootstraps the machines for you, runs your job, and stores the results on S3. I would recommend looking at this tutorial to get an idea of how to set up a job on EMR. Just remember, EMR is not free, so you'll have to pay for your computing resources.
Alternatively, if you're not looking to pay the cost of EMR, you can always set up Hadoop on your local machine in non-distributed mode and experiment with it, as described here (a minimal word-count sketch follows below). Even if it's a single-node setup, the abstractions will be the same as if you were using a big cluster, so it's a good way to get up to speed and then move on to EMR or a real cluster when you want to get serious.
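If you go the local route from that last option, a word-count job written for Hadoop Streaming is a good first experiment, because the same script runs unchanged on a single node, on a VM cluster, or on EMR. This is only a minimal sketch; the file name and the map/reduce subcommands are just one convenient arrangement.

```python
#!/usr/bin/env python3
# Word-count mapper and reducer in one file, usable with Hadoop Streaming or
# tested locally with:
#   cat input.txt | python3 wordcount.py map | sort | python3 wordcount.py reduce
import sys

def mapper():
    # Emit "word<TAB>1" for every word on stdin.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Sum counts for each word, relying on the input being sorted by key
    # (Hadoop's shuffle, or `sort` when testing locally, guarantees this).
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()
```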
Amazon offers a free tier, so you can spin up some VMs and try experimenting that way. The micro instances they offer aren't very powerful, but they are fine for small-scale tests.
You can also spin up VMs on your desktop if it is powerful enough. I have done this myself using VMware Player. You can install any flavor of Linux you like for free; Ubuntu is pretty easy to start with. When you set up the networking for your VMs, be sure to use bridged networking. That way each VM will get its own IP address on your network so they can communicate with each other.
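A small helper like the one below, run once on each VM, makes it easy to collect the hostname/IP pairs you then paste into every node's /etc/hosts. This is a sketch; the 8.8.8.8 address is only used to discover which local interface routes outward.

```python
# Print one /etc/hosts line ("IP<TAB>hostname") for the VM this runs on.
import socket

hostname = socket.gethostname()

# Open a UDP socket "towards" a public address to learn the bridged interface's
# IP; no traffic is actually sent on a UDP socket until data is written.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.connect(("8.8.8.8", 80))
    ip = s.getsockname()[0]

print(f"{ip}\t{hostname}")
```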
Well, it's maybe not a "100% online" option, but it should give you a really good alternative, with some details.
If you are not ready to pay for online cluster resources (such as the EMR solution mentioned here), you don't want to build your own physical cluster, but you are not satisfied with a single-node setup, you can try to build a virtual cluster on a powerful enough desktop.
You need a minimum of 3 VMs (I prefer Ubuntu); 4 is better. To see real Hadoop behavior you need a minimum replication factor of 3, so you need 3 DataNodes and 3 TaskTrackers. You also need a NameNode / JobTracker; it could be one of the nodes used as a DataNode, but I'd recommend a separate VM. If you need HBase, for example, you again need one Master and a minimum of 3 RegionServers. So, again, you need 3 VMs, but 4 is better.
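As a concrete example of the replication-factor point, this is roughly the hdfs-site.xml you would push to every VM. The Python wrapper and the $HADOOP_HOME/etc/hadoop path are assumptions (older Hadoop 1.x installs keep the same file under $HADOOP_HOME/conf); dfs.replication is the standard property.

```python
# Sketch: set the HDFS replication factor on every VM to match the
# three-DataNode minimum described above.
import os

CONF_DIR = os.path.join(os.environ.get("HADOOP_HOME", "/opt/hadoop"), "etc", "hadoop")

HDFS_SITE = """<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
</configuration>
"""

path = os.path.join(CONF_DIR, "hdfs-site.xml")
with open(path, "w") as f:
    f.write(HDFS_SITE)
print("wrote", path)
```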
There is a pretty good free product, Cloudera CDH, which is a "somewhat commercial" Hadoop distribution. They also have a manager with a GUI and a simplified installation. By the way, they have even prepared demo VMs, but I have never used them. You can download everything here. They also host a lot of material about Hadoop and their environment.
An alternative between a completely free solution with VMs on your desktop and a paid service like EMR is a virtual cluster built on top of one dedicated server, if you have a spare one. This is what I personally did: one physical server running VMware's free hypervisor, 4 virtual machines, 1 SSD for the OS, and 3 "general" HDDs for storage. Every VM runs Ubuntu 11.04 (again free), with the free edition of Cloudera Manager and CDH. So everything is free, but you need some hardware, which is often available as a spare. And you have a playground. OK, you need to invest time, but in my view you will get the greatest experience from this approach.
Although I do not know much about it, another option may be Greenplum's analytic workbench (1000 node cluster w/ Hadoop for testing): http://www.greenplum.com/solutions/analytics-workbench

Managing server instance identity on EC2

I recently brought up a cluster on EC2, and I felt like I had to invent a lot of things. I'm wondering what kinds of tools, patterns, ideas are out there for how to deal with this.
Some context:
I had 3 different kinds of servers, so first I created AMIs for each of them. The first AMI had zookeeper, so step one in deploying the system was to get the zookeeper server running.
My script then made a note of the mapping between EC2's completely arbitrary and unpredictable hostnames, and the zookeeper server.
Then as I brought up new instances of the other 2 kinds of servers, the first thing I would do is ssh to the new server, and add the zookeeper server to its /etc/hosts file. Then as the server software on each instance starts up, it can find zookeeper.
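That per-instance step is small enough to script. Here is a rough sketch of it; the IP and the name "zookeeper" are placeholders for whatever your ZooKeeper instance and your server software actually use, and it has to run as root to touch /etc/hosts.

```python
# Sketch of the per-instance bootstrap step: add a stable name for the
# ZooKeeper server to /etc/hosts if it is not already there.
ZK_IP = "10.0.12.34"    # placeholder: private IP of the ZooKeeper instance
ZK_NAME = "zookeeper"   # the name the other servers are configured to look up
ENTRY = f"{ZK_IP}\t{ZK_NAME}\n"

with open("/etc/hosts", "r+") as hosts:
    lines = hosts.readlines()
    if not any(ZK_NAME in line.split() for line in lines):
        hosts.write(ENTRY)      # file pointer is at the end after readlines()
        print("added", ENTRY.strip())
    else:
        print(ZK_NAME, "already present, nothing to do")
```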
Obviously this is a problem that lots of people have to solve, and it probably works a little bit differently in different clouds.
Are there products that address this concept? I was pretty surprised that EC2 didn't provide some kind of way to tie your own name to its name.
Thanks for any ideas.
The question "How to do some service discovery on Amazon EC2" seems to have some good options.
I think you might want to look at http://puppetlabs.com/mcollective/introduction/ and the suite of tools from http://puppetlabs.com in general.
From the site:
The Marionette Collective AKA MCollective is a framework to build server orchestration or parallel job execution systems.
Primarily we’ll use it as a means of programmatic execution of Systems Administration actions on clusters of servers. In this regard we operate in the same space as tools like Func, Fabric or Capistrano.
I am fairly certain MCollective was built to solve exactly the problem you are trying to address. But be forewarned: it's not a DNS-based solution; it's a method of addressing arbitrarily large and arbitrarily tagged groups of hosts.
