I've already installed Cygwin on Windows 7. Now I plan to add Sqoop to Cygwin for Hadoop, but I'm not getting it right...
Can anybody please suggest the correct way to do this, or a link that explains it?
I think you should reconsider installing Hadoop on Windows. It is not easy to do and is probably more trouble than it is worth, although I believe others have managed it.
That said, there are several other options you could consider. There are two companies I know of that provide free VMs, and one of them has worked with Microsoft to try to integrate Hadoop into Windows. These are the links:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Otherwise you can try your luck with the plain Apache installation, though I warn you: if you're new to Linux or don't like spending a lot of time editing configuration files, this is not the best route. I did my installation this way, and you have to modify a lot of files; anything extra like Hive, Sqoop, or HBase has to be installed and configured separately as well.
Please don't make things harder for yourself than they need to be.
I can only recommend running Sqoop on Hadoop in a Linux virtual machine or on native Linux. Although I successfully ran Hadoop 0.20.0 on Windows XP + Cygwin and on Windows 7 + Cygwin, when I later tried setting up a newer version of Hadoop on Windows 7 I failed miserably due to errors in Hadoop itself.
I have wasted days and weeks on this.
So my advice: run Hadoop on Linux if you can; you'll avoid a serious number of problems.
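For what it's worth, once Sqoop is installed alongside Hadoop on a Linux box, a typical import is a single command. A rough sketch (the JDBC URL, credentials, table, and target directory below are made-up placeholders):

    # import one table from MySQL into HDFS, using a single mapper
    sqoop import \
      --connect jdbc:mysql://dbhost/salesdb \
      --username dbuser -P \
      --table orders \
      --target-dir /user/hadoop/orders \
      -m 1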
Related
I am experimenting with Hadoop and Spark, as the company I work for is getting ready to spin up Hadoop and wants to use Spark and other resources to do a lot of machine learning on our data.
Most of that falls to me, so I am preparing by learning on my own.
I have a machine I have set up as a single-node Hadoop cluster.
Here is what I have:
CentOS 7 (minimal server install, added XOrg and OpenBox for GUI)
Python 2.7
Hadoop 2.7.2
Spark 2.0.0
I followed these guides to set this up:
http://www.tecmint.com/install-configure-apache-hadoop-centos-7/
http://davidssysadminnotes.blogspot.com/2016/01/installing-spark-centos-7.html
When I attempt to run 'pyspark' I get the following:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
I opened up the pyspark file in vi and examined it.
I see a lot of stuff going on there, but I don't know where to start to make the corrections I need to make.
My Spark installation is under:
/opt/spark-latest
The pyspark script is under /opt/spark-latest/bin/, and my Hadoop installation (though I don't think it factors in) is under /opt/hadoop/.
I know there must be a change I need to make in the pyspark file somewhere; I just don't know where to begin.
I did some googling and found references to similar problems, but nothing that laid out the steps to fix this.
Can anyone give me a push in the right direction?
If you're just starting to learn Spark in a Hadoop environment: at the moment, Spark 2.0 isn't officially supported by the major distributions (Cloudera CDH or Hortonworks HDP). I'll go ahead and assume your company isn't standing up Hadoop outside of one of those distributions, because of enterprise support.
That being said, Spark 1.6 (with Hadoop 2.6) is the latest supported combination. The reason is that there are a few breaking changes in Spark 2.0.
Now, if you use Spark 1.6, you shouldn't get those errors. Anaconda isn't strictly necessary (the PySpark and Scala shells should just work). If you use Jupyter notebooks, you could look at Apache Toree, which I've had good success with for setting up notebooks. Otherwise, Apache Zeppelin is probably the recommended notebook environment on a production Hadoop cluster.
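If you do stay on Spark 2.0, the error message itself points at the fix: drop the old variables and export the new ones before launching. A minimal sketch, assuming ipython is on your PATH and using the /opt/spark-latest path from your question:

    # remove the Spark 1.x variables
    unset IPYTHON IPYTHON_OPTS
    # tell Spark 2.0 which Python to use for the driver shell
    export PYSPARK_DRIVER_PYTHON=ipython
    export PYSPARK_DRIVER_PYTHON_OPTS=""
    /opt/spark-latest/bin/pyspark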
Has anyone tried and succeeded in building and configuring Hue on a standard Hadoop installation (no CDH etc.)?
And if so: what versions (Hadoop, Hive, Hue) did you use? Is there a guide somewhere that explains how to do this?
I have tried and run into a bunch of problems trying to get it to work.
After I run 'make apps' with the pom.xml configured for Hadoop 2.6.1, it seems to build successfully, but it doesn't connect to Hive, the Oozie server won't start, and so on...
Every guide I can find online seems to be for CDH, and nothing has worked so far.
I'm using OS X Mountain Lion, by the way. Do you know if there is a way to make Cloudera's or Hortonworks' distribution work on this OS? If so, I might try that first...
Hue is compatible with any standard Hadoop. See the guides in the 'Configure' menu of gethue.com, in particular:
http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/
http://cloudera.github.io/hue/docs-3.7.0/index.html
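Most of the setup comes down to pointing desktop/conf/hue.ini at your existing services. A rough sketch, assuming the stock ports (8020 for the NameNode, 50070 for WebHDFS, 10000 for HiveServer2); your cluster may well differ:

    [hadoop]
      [[hdfs_clusters]]
        [[[default]]]
          # must match fs.defaultFS in core-site.xml
          fs_defaultfs=hdfs://localhost:8020
          # WebHDFS (or HttpFS) endpoint Hue uses to browse files
          webhdfs_url=http://localhost:50070/webhdfs/v1

    [beeswax]
      # HiveServer2 host and port
      hive_server_host=localhost
      hive_server_port=10000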
Hi, I have successfully installed Hadoop in pseudo-distributed mode on a VM. I write code in Eclipse, then export it as a jar file onto the Hadoop cluster, and then do my debugging there. Now, just for learning purposes, I am trying to install Hadoop in local (standalone) mode on my Windows machine. That way I can test without going through all the hassle of creating jar files, exporting them, and testing on the Hadoop cluster.
My question is: can anyone help me understand how Hadoop works in local mode (HDFS vs. the local file system) on Windows, and how I can configure Hadoop on my Windows machine (what steps should I follow)?
I tried following various blogs on this but was not able to understand much from them, so I am posting the question here.
Let me know if any other information is needed. Thanks in advance.
Unfortunately, you can't use Hadoop on Windows out of the box; however, you can use Cygwin to achieve effectively the same thing.
I managed to get local mode and distributed mode running directly from Cygwin, but I was unable to get pseudo-distributed mode to work nicely due to various cygpath conversion issues between Unix and Windows path styles.
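As for how local mode works: there is no HDFS and no daemons at all; everything runs in a single JVM directly against the local file system. In Hadoop 1.x terms it comes down to two properties, which happen to be the defaults anyway (a sketch using the classic pre-YARN property names; newer releases renamed them):

    <!-- core-site.xml: use the local file system instead of HDFS -->
    <property>
      <name>fs.default.name</name>
      <value>file:///</value>
    </property>

    <!-- mapred-site.xml: run MapReduce in-process instead of via a JobTracker -->
    <property>
      <name>mapred.job.tracker</name>
      <value>local</value>
    </property>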
However, in practice I still build the jars and send them straight across to the cluster using rsync, as that is much faster once your project reaches a certain size, and remote debugging can be done from Eclipse on Windows against the Linux cluster.
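For example, something like this (the jar name and cluster host are hypothetical):

    # copy the freshly built job jar to the cluster, compressed
    rsync -avz target/my-job.jar hadoop@cluster-master:/home/hadoop/jobs/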
I want to know if I can install Hive on Windows. If yes, how can I do that?
As of now, the Microsoft-provided "Hadoop on Windows" is not available for general consumption, and there is no public information about its general availability.
In my blog post below you will see that I had a chance to use the binaries in the past, but most of the focus is on "Hadoop on Azure" now, which is in a limited CTP release 2:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/28/creating-your-own-hadoop-cluster-on-windows-azure-by-using-your-own-windows-azure-subscription-account.aspx
Some developers have written articles or solutions on running Hadoop and other components on Windows using Cygwin, which you can try; however, I don't know of anything very robust and stable. If you really want to give it a try, I would personally suggest downloading the Cloudera Hadoop VM on your Windows box and running it with any virtual machine player application.
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
You need to install Cygwin.
Would I? No; I'd run the Cloudera VMs and not try to deal with all the possible issues.
Coming real-soon-now is the Hadoop for Windows Server program formerly known as Isotope.
Getting Started with Hadoop For Windows Server
Performance is way better than the equivalent Cygwin install, since all the file I/O is done natively instead. It comes with Pig and Hive thrown in too, and there is an equivalent Azure install package as well; check it out.
I am trying to use Mahout in an application running on Windows. I want to build clusters from a Lucene index using k-means.
As soon as I have to create sequence files (creating vectors from a Lucene index), I get a Hadoop exception, since Hadoop makes command-line calls to programs that don't exist in a Windows environment (e.g. chmod). Running in Cygwin is not an option, since I want to be able to run the app from Eclipse.
So my question is:
is there a way to avoid having to create sequence files to retrieve my vectors from a Lucene index?
or is there a way to create sequence files in a Windows environment?
The only way you can run Hadoop on a Windows environment is to install Cygwin. For more info, see this blog post:
http://hayesdavis.net/2008/06/14/running-hadoop-on-windows/
Cygwin will provide all the command-line utilities (like chmod) that Hadoop relies on. You can still run your Hadoop jobs from within Eclipse if you want.
Do you know the SequenceFile API? Have a look here: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/SequenceFile.html
You can try to write and read the data yourself.
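Here is a minimal sketch of writing and reading a SequenceFile directly with the Hadoop 1.x API. The output path and the Text key/value types are just placeholders; for Mahout's k-means input you would, as far as I know, append VectorWritable values instead:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SequenceFileDemo {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.getLocal(conf);  // local FS, no HDFS needed
            Path path = new Path("vectors.seq");        // placeholder output path

            // Write a few key/value pairs.
            SequenceFile.Writer writer =
                    SequenceFile.createWriter(fs, conf, path, Text.class, Text.class);
            try {
                writer.append(new Text("doc1"), new Text("first record"));
                writer.append(new Text("doc2"), new Text("second record"));
            } finally {
                writer.close();
            }

            // Read them back.
            SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
            try {
                Text key = new Text();
                Text value = new Text();
                while (reader.next(key, value)) {
                    System.out.println(key + "\t" + value);
                }
            } finally {
                reader.close();
            }
        }
    }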
I think you can run Mahout from Eclipse on Windows in stand-alone mode, but you will run into several shortcomings and barriers. You should try it and see how far you get.
In my opinion you shouldn't insist on running Mahout from Eclipse. ;-)
You can use a virtual machine to run your Hadoop environment.
For me, the best solution is the Hortonworks distribution: http://hortonworks.com/
Everything works nicely.