Hadoop install on computer or virtual machine? - macos

I have a school project that requires a Hadoop installation. (It is basically so we get familiar with it; I don't see it needing further applications.) Would you recommend installing it directly on my computer (a Mac with an M1) or using Parallels and installing it in a Windows VM?
TIA

I would definitely not recommend a Windows environment for Hadoop, virtual or not.
If it's a throwaway environment, a VM (or Docker setup) would be preferred. However, Hadoop is easiest to install directly on the host (brew install hadoop), where it has full access to your machine for multithreading.
Alternatively, cloud providers offer schools deep discounts, and a cluster of several machines is a few clicks away rather than needing to tune everything just for your one machine.
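If you go the host-install route, a quick sanity check is to confirm that the launcher ended up on your PATH. The following is only a minimal sketch: the helper names find_hadoop and hadoop_version are made up for illustration, and it assumes the install exposes a `hadoop` executable that accepts the `version` subcommand.

```python
import shutil
import subprocess


def find_hadoop():
    """Return the full path of the `hadoop` launcher on PATH, or None if absent."""
    return shutil.which("hadoop")


def hadoop_version(path):
    """Run `<path> version` and return the first line of its output, if any."""
    result = subprocess.run(
        [path, "version"], capture_output=True, text=True, check=False
    )
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    path = find_hadoop()
    if path is None:
        print("hadoop not found on PATH")
    else:
        print(path, hadoop_version(path))
```

If the launcher is found but prints no version line, the usual suspect is a missing or misconfigured Java installation.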

Related

How to install pyspark & spark for learning purpose on a laptop with limited resources?

I have a Windows 7 laptop with 6 GB of RAM. What is the most RAM/resource-efficient way to install PySpark and Spark on this laptop, just for learning purposes? I don't want to work on actual big data; a small dataset is ideal, since this is just for learning PySpark and Spark in general. I would prefer the latest version of Spark.
FYI: I don't have Hadoop installed.
Thanks
You've basically got three options:
Build everything from source
Install VirtualBox and use a pre-built VM like Cloudera Quickstart
Install Docker and find a suitable container
Getting everything up and running when you choose to build from source can be a pain. You've got to install the JDK, build Hadoop and Spark (both of which require you to install additional software to build them), set up a bunch of environment variables and then pray that you didn't mess anything up.
VMs are nice, particularly the one from Cloudera, but you'll often be stuck with an older version of Spark and it might be tight with the resources you described.
I'd go with Docker.
Once you've got Docker installed, it becomes very easy to try Spark (and lots of other technologies). My favorite containers for playing around use IPython or Jupyter notebooks.
Install Docker:
https://docs.docker.com/installation/windows/
Jupyter Notebook Python, Spark, Mesos Stack
https://github.com/jupyter/docker-stacks/tree/master/pyspark-notebook
One thing to keep in mind is that you are going to have to allocate a certain amount of memory for the VM and the remaining memory still has to operate Windows. Windows 7 requires a minimum of 1 GB for a 32-bit OS or 2 GB for a 64-bit OS. So likely you are only going to wind up with around 4 GB of RAM for running the VM, which is not much.
Assuming you are 64-bit, note that Cloudera requires a minimum of 4 GB RAM to run CDH 5, but if you want to run Cloudera Express, you need 8 GB.
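The arithmetic above can be written out as a short sketch. The figures are the ones quoted in this answer (6 GB total from the question, the Windows 7 minimums, and the 4 GB CDH 5 minimum); the function name vm_memory_budget_gb is only illustrative.

```python
def vm_memory_budget_gb(total_gb, host_os_gb):
    """Return the RAM (in GB) left for a VM after reserving the host OS minimum."""
    remaining = total_gb - host_os_gb
    if remaining < 0:
        raise ValueError("host OS minimum exceeds total RAM")
    return remaining


TOTAL_GB = 6      # laptop RAM from the question
WIN7_64_GB = 2    # Windows 7 minimum, 64-bit
WIN7_32_GB = 1    # Windows 7 minimum, 32-bit
CDH5_MIN_GB = 4   # Cloudera CDH 5 minimum quoted above

budget = vm_memory_budget_gb(TOTAL_GB, WIN7_64_GB)
print(budget)                 # 4
print(budget >= CDH5_MIN_GB)  # True, but with no headroom at all
```

Keep in mind that the host minimum is a floor, not a comfortable working size, so the real budget for the VM will be smaller in practice.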
Running Docker from Windows will require you to use boot2docker, which keeps the entire VM in memory. It uses minimal memory (around 27 MB) to run, so you should be fine there. A MUCH better solution than running VirtualBox!
Another option to consider would be to spin up a free machine on something like Amazon Web Services (http://aws.amazon.com) or Google Cloud (http://cloud.google.com). Particularly with the latter, you can get a free trial amount of credits, which you could use to spin up a machine with more RAM than you would typically get with AWS.

Hadoop features when installed on windows using virtual box

Do I get fewer features or functions of the Hadoop environment when it is installed on a Windows machine using VirtualBox? Is it good to have this sort of Hadoop installation for beginners' practice? Or, what is the difference between Hadoop installed on a Linux machine and Hadoop installed in VirtualBox on a Windows machine?
You can have a fully distributed cluster on your Windows machine using multiple nodes in VirtualBox. However, for beginners I would recommend that you set up a single-node cluster and practice on that. You will not get fewer features. You will be running Hadoop in pseudo-distributed mode, and all the daemons will be running. The only limitation is that, since you have a single Windows machine with limited storage and RAM, you can't test the cluster with huge amounts of data. Hope this helps.
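One way to confirm that "all the daemons are running" is to look at the output of the JDK's jps tool inside the VM. The sketch below only parses text in the "pid name" format that jps prints. It assumes a Hadoop 2 or later (YARN) setup, where the usual pseudo-distributed daemons are the five listed in EXPECTED_DAEMONS; the sample output and its process IDs are invented for illustration.

```python
# Daemons typically present in a pseudo-distributed Hadoop 2+ (YARN) setup.
# This list is an assumption; older Hadoop 1 setups use different daemon names.
EXPECTED_DAEMONS = {
    "NameNode",
    "DataNode",
    "SecondaryNameNode",
    "ResourceManager",
    "NodeManager",
}


def missing_daemons(jps_output):
    """Return the expected daemons that do not appear in `jps` style output."""
    running = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:  # each useful line looks like "<pid> <name>"
            running.add(parts[1])
    return EXPECTED_DAEMONS - running


# Invented sample output; the process IDs are placeholders.
sample = """\
4821 NameNode
4933 DataNode
5120 SecondaryNameNode
5302 ResourceManager
5417 NodeManager
5688 Jps
"""

print(sorted(missing_daemons(sample)))  # []
```

An empty result means every expected daemon showed up; anything listed is a daemon whose log is worth checking first.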

How to create a redistributable self contained binary distribution of a VM with VirtualBox?

Is it possible to create a self contained binary distribution of a VM with VirtualBox or some other tool?
My requirements:
no VirtualBox install
self contained binary/-ies to start and stop VM (with all VirtualBox environment support on it)
possibly no administrator rights to start and stop the VM
at least windows, but better if cross platform
In theory, it is possible to create a giant blob that bundles some kind of hypervisor, which would first extract and install itself along with the VM (disk, config, etc.) and then run itself and the extracted VM.
However, that is only theory. In practice, hypervisors are very complex pieces of software and require some sort of ring-0 access (kernel level) to talk directly with the CPU and other hardware and VirtualBox is no exception. So installing them, on any operating system that cares even a little bit about security, will require admin/root/supervisor access as you cannot install drivers and other kernel components otherwise.
If performance is of no concern, it may be possible to use an emulator like QEMU/Bochs, which can work without ring-0 access. However, I'm not currently aware of any projects that have such self-extracting and runnable emulators for pre-baked VM images (even more so on Windows).
As Tekn0 writes, low-level access to the host OS layer is required.
I found the project Portable VirtualBox, which sets up the host machine on the fly.
I tested it and it is not satisfactory enough. From the site:
Note
VirtualBox needs several kernel drivers installed and needs to start
several services: if the drivers and services are not already
installed you’ll need administrator rights to run Portable-VirtualBox.
When Portable-VirtualBox starts, it checks to see if the drivers are
installed. If they are not it will install them before running
VirtualBox and will remove them afterward. Similarly,
Portable-VirtualBox checks to see if the services are running. If not,
it will start them and then stop them when it exits.
The result is a product that does not always run and that produces strange kernel errors.
There is another project (starting from Tekn0's observations), Kquemu Portable, and finally Bochs.

Hadoop cluster with ubuntu and Windows

I have three laptops (with Ubuntu) that I am networking to act as a cluster for Hadoop. I also have a Windows-only machine. Is it possible to add that to the cluster and make it act as a node? Is it feasible? Has anyone come across such an issue?
If you have a Windows environment, I would suggest that you use VirtualBox with any Linux as the guest OS.
You can build your Hadoop cluster on that. There are numerous installation procedures available for Linux and you can't go wrong with that.
We are using it exactly this way for development purposes. Performance of the Hadoop cluster is not as much of a concern as functionality.
It also allows you to fine-tune your dev ops, since you can tear down a VM and start afresh with a new one.
The easiest approach to build it this way is to:
Install VirtualBox
Install Vagrant
Use a community provided box from: http://www.vagrantbox.es/
Bootstrap your VM for yum packages
Move from NAT interface to Bridged Ethernet interface
Install Hadoop using SCM: http://www.cloudera.com/products-services/tools/
Bring up your cluster
Yes, it is possible. On the Ubuntu machines, the Hadoop installation should be straightforward; you just need to follow the regular steps. Since Hadoop runs in a Linux environment, on your Windows machine you need to install Cygwin, which is a Linux-like environment for Windows that will enable you to install and run Linux-based applications (like Hadoop) there.
Here is the link for Cygwin Installation: http://www.cygwin.com/install.html

Which virtualization software for windows to virtualize a linux distro for small web server etc?

I've bought a new notebook, but I'm not sure whether Linux fully supports it, so I decided to use a VM for the time being. The only virtualization software I've used so far is VirtualBox on Linux, but I think it's a bit overkill for my needs.
All I need is to use it like a VPS hosted on my machine. Command-line access would be enough. It'd be nice if it's free/open source and easy to configure.
Thanks.
VirtualBox + VBoxHeadless + Ubuntu Server edition works for me. I access it with WinSCP/PuTTY and I don't have performance issues on my notebook.
The Vagrant utility seems to be specifically designed to make it easy to do this. It requires you to have VirtualBox installed, but manages the configuration etc. You could also use the free VMware Player and one of the ready-made VM images for it.
