Apache Hadoop - node machine disparity?

I have an old desktop with an Intel dual-core processor (32-bit) running Ubuntu 12.04 Desktop edition (again, 32-bit). I wish to set up at least a 4-node Apache Hadoop cluster. For that, I'm planning to buy some used desktops, which may come at a cheap price. However, I'm confused about the following queries:
Can Apache Hadoop work with disparate nodes in a cluster - one running 32-bit Ubuntu 12.04 while another runs the 64-bit version?
I think the OS version has to be the same across the cluster nodes - am I correct?
As per the official site, 1.0.3 is the latest stable version - will it work with 32-bit machines, or do all the nodes need to be 64-bit?
The answers to the above queries will help me determine what kind of processors, etc. I must purchase to build the cluster (suggestions are welcome!).

Can Apache Hadoop work with disparate nodes in a cluster - one running 32-bit Ubuntu 12.04 while another runs the 64-bit version?
As per the official site, 1.0.3 is the latest stable version - will it work with 32-bit machines, or do all the nodes need to be 64-bit?
Everything runs on top of Java, so if you can install a 32-bit Java, you can run Hadoop. There are, however, some native parts; I believe they are cross-compiled and work on both x86 and x64.
Since the communication takes place via RPC (pure Java code), this should work, although I haven't tried it out yet.
I think the OS version has to be the same across the cluster nodes - am I correct?
Not necessarily, but for ease of debugging problems and to keep the cluster homogeneous when applying updates, I wouldn't mix versions.
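Not something the answer above covers, but if you do end up mixing boxes, a quick way to see what each node actually has is a small script like the sketch below, run on every node; it only assumes Python and a java binary on the PATH, and the output labels are purely illustrative.

    # Illustrative check of a node's OS and JVM bitness before adding it to a
    # mixed 32-bit/64-bit cluster. Assumes only that "java" is on the PATH.
    import platform
    import struct
    import subprocess

    pointer_bits = struct.calcsize("P") * 8   # 32 or 64 for this userland
    print("OS      : %s %s" % (platform.system(), platform.release()))
    print("Machine : %s (%d-bit userland)" % (platform.machine(), pointer_bits))

    # "java -version" prints its banner on stderr; a 64-bit JVM normally
    # reports "64-Bit Server VM" there.
    banner = subprocess.Popen(["java", "-version"],
                              stderr=subprocess.PIPE).communicate()[1]
    print("JVM     : %s" % banner.decode("utf-8", "replace").strip().splitlines()[-1])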

Related

How to install pyspark & spark for learning purposes on a laptop with limited resources?

I have a Windows 7 laptop with 6 GB of RAM. What is the most RAM/resource-efficient way to install PySpark & Spark on this laptop just for learning purposes? I don't want to work with actual big data; a small dataset is ideal, since this is just for learning PySpark & Spark in general. I would prefer the latest version of Spark.
FYI: I don't have Hadoop installed.
Thanks
You've basically got three options:
Build everything from source
Install Virtualbox and use a pre-built VM like Cloudera Quickstart
Install Docker and find a suitable container
Getting everything up and running when you choose to build from source can be a pain. You've got to install the JDK, build Hadoop and Spark (both of which require you to install additional software to build them), set up a bunch of environment variables and then pray that you didn't mess anything up.
VMs are nice, particularly the one from Cloudera, but you'll often be stuck with an older version of Spark and it might be tight with the resources you described.
I'd go with Docker.
Once you've got Docker installed, it becomes very easy to try Spark (and lots of other technologies). My favorite containers for playing around use IPython or Jupyter notebooks.
Install Docker:
https://docs.docker.com/installation/windows/
Jupyter Notebook Python, Spark, Mesos Stack
https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
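To make the "small dataset, just for learning" part concrete: once you have a working PySpark (for example inside that jupyter/pyspark-notebook container, whose README starts it with something like docker run -p 8888:8888 jupyter/pyspark-notebook), a purely local session is all you need. The sketch below is only illustrative and assumes Spark 2.x or later for the SparkSession API; the master setting, column names and values are made up.

    # Minimal local PySpark session for learning; no Hadoop cluster required.
    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[2]")           # two local threads, no cluster
             .appName("learning-pyspark")
             .getOrCreate())

    # A tiny in-memory dataset is plenty for learning the DataFrame API.
    df = spark.createDataFrame(
        [("alice", 34), ("bob", 45), ("carol", 29)],
        ["name", "age"])

    df.filter(df.age > 30).show()
    spark.stop()

Running with master("local[2]") keeps everything inside a single JVM, which is exactly what you want on a 6 GB laptop.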
One thing to keep in mind is that you are going to have to allocate a certain amount of memory for the VM and the remaining memory still has to operate Windows. Windows 7 requires a minimum of 1 GB for a 32-bit OS or 2 GB for a 64-bit OS. So likely you are only going to wind up with around 4 GB of RAM for running the VM, which is not much.
Assuming you are 64-bit, note that Cloudera requires a minimum of 4 GB RAM to run CDH 5, but if you want to run Cloudera Express, you need 8 GB.
Running Docker from Windows will require you to use boot2docker, which keeps the entire VM in memory. It uses minimal memory (like around 27 MB) to run, so you should be fine there. A MUCH better solution than running VirtualBox!
Another option to consider would be to spin up a free machine on something like Amazon Web Services (http://aws.amazon.com) or Google Cloud (http://cloud.google.com). Particularly with the latter, you can get a free trial amount of credits, which you could use to spin up a machine with more RAM than you would typically get with AWS.

Hadoop features when installed on Windows using VirtualBox

Do I get fewer features or functions in the Hadoop environment when it is installed on a Windows machine using VirtualBox? Is this sort of Hadoop installation good for beginner practice? Or, what is the difference between installing Hadoop on a Linux machine and installing it in VirtualBox on a Windows machine?
You can have a fully distributed cluster on your Windows machine using multiple nodes in VirtualBox. However, for beginners I would recommend setting up a single-node cluster and practicing on that. It's not the case that you get fewer features: you will be running Hadoop in pseudo-distributed mode, and all the daemons will be running. The only thing is that, since you have a single Windows machine with limited storage/RAM, you can't test the cluster with huge amounts of data. Hope this helps.
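As a rough illustration (not part of the answer above) of what "all the daemons will be running" means in pseudo-distributed mode, you can check for the classic Hadoop 1.x daemons from inside the VM with something like the sketch below; it assumes the JDK's jps tool is on the PATH, and the daemon names are the MR1 ones.

    # Rough check that the usual Hadoop 1.x pseudo-distributed daemons are up.
    # Assumes the JDK's "jps" tool is on the PATH inside the VM.
    import subprocess

    EXPECTED = ["NameNode", "SecondaryNameNode", "DataNode",
                "JobTracker", "TaskTracker"]   # classic MR1 daemons

    running = subprocess.check_output(["jps"]).decode("utf-8", "replace")
    for daemon in EXPECTED:
        status = "up" if daemon in running else "MISSING"
        print("%-18s %s" % (daemon, status))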

Practicing Hadoop on a local machine

I have a PC with Windows 8.1 and Ubuntu 12.04. The hardware is a Core i3 with 4 GB of RAM.
Now I want to practice Hadoop on my local PC. Is my current system adequate for the Hadoop frameworks available in the market? I am a little bit confused. I went through many tutorials, but they proposed software that is not applicable to my current system. So what can I do now?
You can install Hadoop on the system. Go through this link, which describes the installation of Hadoop on a single node (computer). It will help:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

Set up a multi-node Hadoop cluster using virtual machines on my laptop

I have a Windows 7 laptop and I need to set up a Hadoop (multi-node) cluster on it.
I have the following things ready -
Virtualization software, i.e. VirtualBox and VMware Player.
Two virtual machines, i.e.
Ubuntu - for the Hadoop master and
Ubuntu - for (1x) Hadoop slave
Has anyone done a setup of such a cluster using virtual machines on their laptop?
If yes, please help me install it.
I've searched on Google, but I can't work out how to configure this multi-node Hadoop cluster using VMs.
How do I run two Ubuntu OSes on Windows 7 using VMware or VirtualBox?
Should we use the same Ubuntu version for both VM images, or VM images with different versions of Ubuntu Linux?
Yes, you can use two Ubuntu nodes. I am using five nodes (1 master, 4 datanodes).
If you want to install multi-node in VMware (or VirtualBox), just download Ubuntu from this link: http://www.ubuntu.com/download/desktop
Install two machines, and install Java and OpenSSH on each.
Then download the shell script for a multi-node installation from this link:
https://github.com/tonyreddy/Apache-MultiNode-Insatallation-Shellscript
And try it.
All the best.
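A step that trips up most first multi-node setups is passwordless SSH from the master to every slave, and whatever install script you use, it is worth verifying yourself. Once both VMs are up, a quick sanity check is something like the sketch below; the slaves-file path and the Hadoop 1.x layout are just examples, so adjust them for your installation.

    # Sanity-check passwordless SSH from the Hadoop master to each slave VM.
    # The slaves-file path is an example (Hadoop 1.x layout); adjust as needed.
    import subprocess

    SLAVES_FILE = "/usr/local/hadoop/conf/slaves"

    with open(SLAVES_FILE) as f:
        hosts = [h.strip() for h in f if h.strip() and not h.startswith("#")]

    for host in hosts:
        # BatchMode=yes makes ssh fail instead of prompting for a password.
        rc = subprocess.call(["ssh", "-o", "BatchMode=yes",
                              "-o", "ConnectTimeout=5", host, "true"])
        print("%-20s %s" % (host, "ok" if rc == 0 else "ssh failed (exit %d)" % rc))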
Since you're running Hadoop on your laptop, obviously you're doing it for learning purposes, building a POC, or functional debugging.
Instead of going through the hassle of installing and setting up Hadoop and related big-data software, you can simply install a pre-configured pseudo-distributed VM.
Some good options are:
Cloudera QuickStart VM
Hortonworks Sandbox
I've been using Cloudera's VM on my laptop for quite some time now and it's been working great.
Cloudera and Hortonworks are the fastest way to get it up and running.
Make sure you have enough RAM installed on your laptop for the operating system that is already running; otherwise your laptop will often restart abruptly while you use the virtual machines.
Let me give you an example: if you are using Windows 10, it needs 3-5 GB of RAM to work smoothly. This means that if you load a 5 GB virtual machine into RAM, Windows may crash when it cannot find enough RAM to operate.
You should upgrade the RAM from 8 GB to 12 GB, or ideally 16 GB, for smooth operation of your laptop.
Hope it helps

Does CentOS support Condor?

I plan to build an HPC cluster using Condor as middleware. Is CentOS a good choice for the OS? I mean, does it support Condor, and is there any tutorial which could be helpful in the installation process?
Regards,
Yes - indeed, Condor is even distributed as RPMs for RHEL 4, 5, and 6 (which is to say, CentOS 4/5/6, as CentOS is just a rebadged RHEL). So I guess it would be fairer to say that Condor supports CentOS than vice versa.
A lot of people use RHEL/CentOS for HPC applications; I'm not a big fan myself, as an HPC compute node is not the same as a print server or web server, and having very out of date libraries and compilers can be a PITA. But the issue isn't as bad (now) for CentOS 6 as it was for CentOS 5.
