Is it possible to install Hive and Hadoop on Windows? - windows

I want to know if I can install Hive on Windows. If yes, how can I do that?

As of now, the Microsoft-provided "Hadoop on Windows" is not available for general consumption, and there is no public information about its general availability.
If you look at my blog post below, you will see that I have had a chance to use the binaries in the past, but most of the focus is now on "Hadoop on Azure", which is in a limited CTP release 2:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/28/creating-your-own-hadoop-cluster-on-windows-azure-by-using-your-own-windows-azure-subscription-account.aspx
I would say that there are developers who have written articles or solutions on getting Hadoop and other components running on Windows using Cygwin, which you can try; however, I don't know of anything very robust and stable. If you really want to give it a try, I would personally suggest downloading the Cloudera Hadoop VM onto your Windows box and running it with any virtual machine player application.

http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
You need to install Cygwin.
Would I? No; I'd run the Cloudera VMs and not try to deal with all the possible issues.

Coming real-soon-now is the Hadoop for Windows Server program formerly known as Isotope.
Getting Started with Hadoop For Windows Server
Performance is far better than an equivalent Cygwin install, since all the file I/O is done natively. It comes with Pig and Hive thrown into the bag too, and has an equivalent Azure install package as well - check it out.

Related

How to install Sqoop in Cygwin with Windows 7?

I've already installed Cygwin on Windows 7. Now I plan to add Sqoop to Cygwin for Hadoop, but I'm not getting it right...
Can anybody please suggest the correct way to do this, or a link detailing it?
I think you should reconsider installing Hadoop on Windows; it is not very easy to do and is probably more trouble than it is worth, although I believe others have done it.
There are several other options you could consider with regard to Hadoop. First, there are two companies I know of that provide free VMs, and one of them has worked with Microsoft to try to integrate Hadoop into Windows. These are the links:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Otherwise you can try your luck with the default Apache installation, though I warn you: if you're new to Linux or don't like spending a lot of time changing configuration files, going this way is not the best. I did my installation this way, and you have to modify a lot of files, plus anything extra like Hive, Sqoop, HBase, etc. needs to be installed and configured separately as well.
Please don't make things more complicated than they need to be.
I can only recommend running Sqoop on Hadoop in a Linux virtual machine or on native Linux. Although I successfully ran Hadoop 0.20.0 on Windows XP + Cygwin and Windows 7 + Cygwin, I once tried setting up a newer version of Hadoop on Windows 7 and failed miserably due to errors in Hadoop.
I have wasted days and weeks on this.
So my advice: run Hadoop on Linux if you can; you'll avoid a serious number of problems.

How to build and install Spark in an offline environment?

I am trying to install Spark 1.3.1 on an offline cluster (no Internet access at all, only LAN). However, I don't know how to build it from source, since building it via either Maven or sbt requires a network connection. Can someone offer some help or possible solutions?
Thanks.
A simple (albeit somewhat hacky) solution would be to build it on a machine with Internet access and then copy everything in ~/.ivy2 over to the machine with only LAN access so that it can use the cached artifacts. Another, perhaps simpler, option would be to use a pre-built Spark distribution, if that's an acceptable solution.

Configuring Hadoop in Local Mode

Hi, I have successfully installed Hadoop in pseudo-distributed mode on a VM. I write code in Eclipse, then export it as a jar file onto the Hadoop cluster and do my debugging there. Now, just for learning purposes, I am trying to install Hadoop in local (standalone) mode on my Windows machine. That way I can test without going through all the hassle of creating jar files, exporting them, and testing on the Hadoop cluster.
My question is: can anyone help me understand how Hadoop works in local mode (HDFS vs. the local file system) on Windows, and how I can configure Hadoop locally on the Windows machine (what steps to follow)?
I tried following various blogs on this but was not able to understand much from them, so I am posting here.
Let me know if any other information is needed. Thanks in advance.
Unfortunately, you can't use Hadoop on Windows out of the box - however, you can use Cygwin to achieve effectively the same thing.
I managed to get local mode and distributed mode running directly from Cygwin; however, I was unable to get pseudo-distributed mode to work nicely due to various cygpath conversion issues between Unix and Windows path styles.
However, in practice I still build the jars and send them straight across to the cluster using rsync, as that is much faster for testing once your project reaches a certain size, and remote debugging can be done from Eclipse on Windows against the Linux cluster.
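For what it's worth, local (standalone) mode really only comes down to two configuration properties: pointing the default filesystem at the local disk and running the whole job in a single in-process JVM instead of submitting it to a JobTracker. Below is a minimal sketch of a driver you could launch straight from Eclipse; it assumes the Hadoop 1.x property names (fs.default.name / mapred.job.tracker - newer releases use fs.defaultFS / mapreduce.framework.name) and uses the built-in identity Mapper/Reducer as placeholders for your own classes. On Windows you may still need Cygwin on the PATH for the shell utilities Hadoop calls (chmod and friends):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class LocalModeDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Point the default filesystem at the local disk instead of HDFS.
            conf.set("fs.default.name", "file:///");
            // Run map and reduce tasks in this single JVM instead of on a cluster.
            conf.set("mapred.job.tracker", "local");

            Job job = new Job(conf, "local test job");
            job.setJarByClass(LocalModeDriver.class);
            job.setMapperClass(Mapper.class);    // identity mapper - swap in your own
            job.setReducerClass(Reducer.class);  // identity reducer - swap in your own
            job.setOutputKeyClass(LongWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path("input"));    // ordinary local directory
            FileOutputFormat.setOutputPath(job, new Path("output")); // ordinary local directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

With those two properties set, the "HDFS vs. local file system" question mostly goes away: all paths resolve against the ordinary local filesystem, and there is no NameNode or DataNode to configure at all.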

How to use Mahout in a Windows environment?

I am trying to use Mahout in an application running on Windows. I want to build clusters from a Lucene index using k-means.
As soon as I have to create sequence files (creating vectors from a Lucene index), I get a Hadoop exception, because Hadoop makes command-line calls to programs that don't exist in a Windows environment (e.g. chmod). Running in Cygwin is not an option, since I want to be able to run the app from Eclipse.
So my question is:
is there a way to avoid having to create sequence files to retrieve my vectors from a Lucene index?
or is there a way to create sequence files in a Windows environment?
The only way you can run Hadoop on a Windows environment is to install Cygwin. For more info, see this blog post:
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
Cygwin will provide all the command-line utilities (like chmod) that Hadoop relies on. You can still run your Hadoop jobs from within Eclipse if you want.
Do you know the SequenceFile API? Have a look here: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
You can try writing and reading the data yourself.
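To make "write the data yourself" concrete, here is a minimal sketch that writes Mahout term vectors into a SequenceFile keyed by document id, which is the Text/VectorWritable layout Mahout's k-means expects as input. It assumes the classic SequenceFile.createWriter(FileSystem, ...) signature and Mahout's math classes; the document id, dimensionality and vector contents are toy values, and filling the vectors from your Lucene index is up to you. Note that writing through the local FileSystem may still shell out to chmod on some Hadoop versions, which is exactly the Windows problem described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class VectorSequenceFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);
            Path path = new Path("vectors/part-00000");

            // Key = document id, value = the (sparse) term vector for that document.
            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, path, Text.class, VectorWritable.class);
            try {
                Vector v = new RandomAccessSparseVector(10000); // toy dimensionality
                v.set(42, 1.0);   // term index 42 occurs once
                v.set(137, 3.0);  // term index 137 occurs three times
                writer.append(new Text("doc-1"), new VectorWritable(v));
            } finally {
                writer.close();
            }
        }
    }

Reading the data back works the same way with SequenceFile.Reader, so you can also inspect the clustering output without going through the Mahout command-line drivers.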
I think you can run Mahout from Eclipse on Windows in stand-alone mode, but you will run into several shortcomings and barriers. You should try it and see how far you get.
In my opinion you shouldn't insist on running Mahout from Eclipse. ;-)
You can use a virtual machine to run your Hadoop environment.
For me, the best solution is using the http://hortonworks.com/ project.
Everything works pretty well.

How to Setup a Low cost cluster

At my house I have about 10 computers, all with different processors and speeds (all x86 compatible). I would like to cluster these. I have looked at openMosix, but since they stopped development on it I am deciding against using it. I would prefer to use the latest or next-to-latest version of a mainstream Linux distribution (SUSE 11, SUSE 10.3, Fedora 9, etc.).
Does anyone know any good sites (or books) that explain how to get a cluster up and running using free open source applications that are common on most mainstream distributions?
I would like a load-balancing cluster for custom software I will be writing. I cannot use something like Folding@home because I need constant contact with every part of the application. For example, if I were running a simulation, one computer might control where rain is falling while another controls what my herbivores are doing in the simulation.
I recently set up an OpenMPI cluster using Ubuntu. There is an existing write-up at https://wiki.ubuntu.com/MpichCluster .
Your question is too vague. What cluster application do you want to use?
By far the easiest way to set up a "cluster" is to install Folding@home on each of your machines. But I doubt that's really what you're asking for.
I have set up clusters for music/video transcoding using simple bash scripts and ssh shared keys before.
I manage mail server clusters at work.
You only need a cluster if you know what you want to do. Come back with an actual requirement, and someone will suggest a solution.
Take a look at Rocks. It's a full-blown cluster "distribution" based on CentOS 5.1. It installs everything you need (libraries, applications and tools) to run a cluster, and it is dead simple to install and use. You do all the tweaking and configuration on the master node, and it helps you kickstart all your other nodes. I've recently installed a 1200+ node (over 10,000 cores!) cluster with it! And I would not hesitate to install it on a 4-node cluster, since the work needed to install the master is next to none!
You can either run applications written for cluster libraries such as MPI or PVM, or you can use the queueing system (Sun Grid Engine) to distribute any type of job. Or use distcc to compile the code of your choice on all nodes!
And it's open source, GPL, free - everything that you like!
I think he's looking for something similar to openMosix, some kind of general cluster on top of which any application can run distributed among the nodes. AFAIK there's nothing like that available. MPI-based clusters are the closest thing you can get, but I think you can only run MPI applications on them.
Linux Virtual Server
http://www.linuxvirtualserver.org/
I use PVM and it works. But even with just a nice SSH setup that allows logging in to the machines without entering a password, you can easily launch commands remotely on your different compute nodes.
