Configuring Hadoop in Local Mode

Hi, I have successfully installed Hadoop in pseudo-distributed mode on a VM. I write code in Eclipse, export it as a jar file onto the Hadoop cluster, and then debug it there. Now, just for learning purposes, I am trying to install Hadoop in local (standalone) mode on my Windows machine. That way I can test without going through all the hassle of creating jar files, exporting them, and testing on the Hadoop cluster.
My question is: can anyone help me understand how Hadoop works in local mode on Windows (HDFS vs. the local file system), and how I can configure Hadoop locally on the Windows machine (what steps should I follow)?
I tried following various blogs on this but was not able to understand much from them, so I am posting the question here.
Let me know if any other information is needed. Thanks in advance.

Unfortunately, you can't use Hadoop on Windows out of the box; however, you can use Cygwin to achieve effectively the same thing.
I managed to get local mode and distributed mode running directly from Cygwin, but I was unable to get pseudo-distributed mode to work nicely due to various cygpath conversion issues between Unix and Windows path styles.
In practice, though, I still build the jars and send them straight across to the cluster using rsync, since that is much faster once your project reaches a certain size, and remote debugging of the Linux cluster can be done from Eclipse on Windows.
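For illustration, here is a minimal sketch of a driver that runs a job in local (standalone) mode so it can be launched and debugged straight from Eclipse, with no jar or cluster involved. It assumes the Hadoop 1.x-era API and property names (fs.default.name, mapred.job.tracker); the class name and the input/output paths are placeholders, not something taken from the answer above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalModeDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point Hadoop at the local filesystem and the local job runner
        // instead of HDFS and a JobTracker (these are also the defaults
        // when no cluster configuration is on the classpath).
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");

        Job job = new Job(conf, "local-mode-test");
        job.setJarByClass(LocalModeDriver.class);
        // job.setMapperClass(MyMapper.class);   // plug in your own mapper
        // job.setReducerClass(MyReducer.class); // and reducer here

        // With the default identity mapper/reducer, TextInputFormat's
        // LongWritable/Text records pass straight through to the output.
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Plain local paths; no HDFS is involved in standalone mode.
        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Run it as an ordinary Java application; "input" and "output" are directories on the local disk, and breakpoints in the mapper and reducer are hit in the same JVM.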

Related

Running Jenkins slave on different OS than master (and host)

I'm trying to introduce continuous integration in an old project, and we've got a quite specific situation: the CI server can only be put on our test server, which runs CentOS. That server has quite a lot of unused RAM and CPU capacity.
However, we need to run the Ant builds on Windows (this is also how the project used to do its packaging), because it turned out that the Unix versions of Java and Ant do not produce the same output (after a binary compare).
I drew up a diagram of how, in my mind, it could work, but I'm really wondering whether that is even possible with the tools already available.
The black part is implemented; I'm curious whether the red part is possible. Can a Jenkins slave communicate with a master on a different OS?
It should be possible. I have a feeling you will need to play with your network settings, but before you start changing anything, see if you can start a headless slave by following these directions: https://wiki.jenkins-ci.org/display/JENKINS/Step+by+step+guide+to+set+up+master+and+slave+machine
Using VirtualBox on CentOS, you can run a Windows VM on your CentOS host.
I'm not sure you need Docker to launch your Jenkins slave.
It may be better to use a standard JNLP Windows service to connect your Windows slave to the Dockerised Jenkins master.
If the master is not able to view the Windows node using this method, you may have to tweak your network configuration on the Windows VM.
But I'm not sure it's necessary.

How to install Sqoop in Cygwin with Windows 7?

I've already installed Cygwin on Windows 7. Now I plan to add Sqoop to Cygwin for Hadoop, but I'm not getting it right...
Can anybody please suggest the correct way of doing this, or a link detailing it?
I think you should reconsider installing Hadoop on Windows. It is not very easy to do and is probably more trouble than it is worth, although I believe others have done it.
There are several other options you could consider. First, there are two companies I know of that provide free VMs, and one of them has worked with Microsoft to try to integrate Hadoop into Windows. These are the links:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Otherwise you can try your luck with the default Apache installation, though I warn you: if you're new to Linux or don't like spending a lot of time changing configuration files, this is not the best route. I did my installation this way, and you have to modify a lot of files, plus anything extra like Hive, Sqoop, HBase, etc. needs to be installed and configured separately as well.
Please don't make things more complicated for yourself than they need to be.
I can only recommend running Sqoop on Hadoop in a Linux virtual machine or on native Linux. Although I successfully ran Hadoop 0.20.0 on Windows XP + Cygwin and Windows 7 + Cygwin, I once tried setting up a newer version of Hadoop on Windows 7 and failed miserably due to errors in Hadoop.
I have wasted days and weeks on this.
So my advice: run Hadoop on Linux if you can; you'll avoid a serious amount of problems.

Is it possible to install Hive and Hadoop on Windows?

I want to know whether I can install Hive on Windows. If yes, how can I do that?
As of now, the Microsoft-provided "Hadoop on Windows" is not available for general consumption, and there is no public information about its general availability.
In my blog post below you will see that I have had the chance to use the binaries in the past, but most of the focus is now on "Hadoop on Azure", which is in a limited CTP release 2:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/28/creating-your-own-hadoop-cluster-on-windows-azure-by-using-your-own-windows-azure-subscription-account.aspx
That said, there are developers who have written articles or solutions on getting Hadoop and other components running on Windows using Cygwin, which you can try; however, there is nothing very robust and stable that I really know of. If you really want to give it a try, I would personally suggest downloading the Cloudera Hadoop VM to your Windows box and trying it with any virtual machine player application.
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
You need to install Cygwin.
Would I? No; I'd run the Cloudera VMs and not try to deal with all the possible issues.
Coming real soon now is the Hadoop for Windows Server program, formerly known as Isotope.
Getting Started with Hadoop For Windows Server
Performance is way better than an equivalent Cygwin install - all the file I/O is done natively instead. It comes with Pig and Hive thrown into the bag too, and has an equivalent Azure install package as well - check it out.

Is it possible to install Hadoop on a Linux box and try most of (if not all of) the Hadoop utilities?

I am trying to learn Hadoop. Is it possible to install Hadoop on a Linux box and try most of (if not all of) the Hadoop utilities?
You can download the CDH3 virtual machine from Cloudera (https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM) and have everything integrated in one VM. IMHO it is the simplest way to start with Hadoop.
Yes. It is possible.
Hadoop can run in two modes locally:
Standalone mode -- no daemons start up; you can run Hadoop jobs locally against local files.
Pseudo-distributed mode -- effectively distributed mode, but all of the daemons start up locally on one node.
How to set these up and get started with them is documented on the Hadoop site.
Since you say you want to try the Hadoop utilities, you probably want to try pseudo-distributed mode. When using the command-line tools, MapReduce jobs, Pig, Hive, etc., a local cluster running in pseudo-distributed mode will look like a 1000-node cluster (except that it can't hold as much data).
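To make the difference concrete, here is a hedged Java sketch of the configuration that selects each mode; it uses the Hadoop 1.x-era property names, and the localhost addresses are the defaults suggested by the single-node setup guide, not values taken from this answer.

```java
import org.apache.hadoop.conf.Configuration;

public class ModeConfig {
    // Standalone mode: local filesystem, local job runner, no daemons running.
    static Configuration standalone() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        return conf;
    }

    // Pseudo-distributed mode: a single-node HDFS and JobTracker, typically
    // reached at the addresses from the single-node setup guide.
    static Configuration pseudoDistributed() {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://localhost:9000");
        conf.set("mapred.job.tracker", "localhost:9001");
        return conf;
    }
}
```

The same job-submission code works against either configuration; in pseudo-distributed mode the NameNode, DataNode, JobTracker, and TaskTracker daemons must already be running on the machine.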

How to use Mahout in a Windows environment?

I am trying to use Mahout in an application running on Windows. I want to build clusters from a Lucene index using k-means.
As soon as I have to create sequence files (creating vectors from a Lucene index), I get a Hadoop exception, since Hadoop makes command-line calls to programs that don't exist in a Windows environment (e.g. chmod). Running in Cygwin is not an option, since I want to be able to run the app from Eclipse.
So my question is
is there a way to avoid having to create sequence files to retrieve my vectors from a Lucene index?
or is there a way to create sequence files in a Windows environment?
The only way you can run Hadoop in a Windows environment is to install Cygwin. For more info, see this blog post:
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
Cygwin will provide all the command-line utilities (like chmod) that Hadoop relies on. You can still run your Hadoop jobs from within Eclipse if you want.
Do you know the SequenceFile API? Have a look here: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
You can try to write/read the data by yourself.
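As a hedged illustration of that suggestion, here is a small sketch that writes and reads a SequenceFile directly through that API (Hadoop 1.x-era calls; the file name and the Text/Text key-value types are placeholders, and for Mahout vectors you would typically store a VectorWritable as the value). Whether this sidesteps the chmod calls on Windows would still need to be verified.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);   // stay on the local filesystem
        Path path = new Path("vectors.seq");         // hypothetical output file

        // Write a few key/value pairs.
        SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
        try {
            writer.append(new Text("doc1"), new Text("example value"));
            writer.append(new Text("doc2"), new Text("another value"));
        } finally {
            writer.close();
        }

        // Read them back.
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            Text key = new Text();
            Text value = new Text();
            while (reader.next(key, value)) {
                System.out.println(key + " -> " + value);
            }
        } finally {
            reader.close();
        }
    }
}
```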
I think you can run Mahout from Eclipse on Windows in stand-alone mode, but you will run into several shortcomings and barriers. You should try it and see how far you get.
In my opinion you shouldn't insist on running Mahout from Eclipse. ;-)
You can use a virtual machine to run your Hadoop environment.
For me, the best solution is the Hortonworks project (http://hortonworks.com/).
Everything works pretty well.
