Spark and IPython on CentOS 7 - hadoop

I am experimenting with Hadoop and Spark, as the company I work for is getting ready to spin up Hadoop and wants to use Spark and other tools to do a lot of machine learning on our data.
Most of that falls to me, so I am preparing by learning on my own.
I have a machine I have set up as a single-node Hadoop cluster.
Here is what I have:
CentOS 7 (minimal server install, added XOrg and OpenBox for GUI)
Python 2.7
Hadoop 2.7.2
Spark 2.0.0
I followed these guides to set this up:
http://www.tecmint.com/install-configure-apache-hadoop-centos-7/
http://davidssysadminnotes.blogspot.com/2016/01/installing-spark-centos-7.html
When I attempt to run 'pyspark' I get the following:
IPYTHON and IPYTHON_OPTS are removed in Spark 2.0+. Remove these from the environment and set PYSPARK_DRIVER_PYTHON and PYSPARK_DRIVER_PYTHON_OPTS instead.
I opened up the pyspark file in vi and examined it.
I see a lot of stuff going on there, but I don't know where to start with the corrections I need to make.
My Spark installation is under:
/opt/spark-latest
The pyspark script is under /opt/spark-latest/bin/ and my Hadoop installation (though I don't think this factors in) is under /opt/hadoop/.
I know there must be a change I need to make in the pyspark file somewhere, I just don't know where to begin.
I did some googling and found references to similar problems, but nothing that laid out the steps to fix this.
Can anyone give me a push in the right direction?

If you're just starting to learn Spark in a Hadoop environment: at the moment, Spark 2.0 isn't officially supported by either major Hadoop distribution (Cloudera CDH or Hortonworks HDP). I'll go ahead and assume your company isn't standing up Hadoop outside of one of those distributions (because of enterprise support).
That being said, Spark 1.6 (with Hadoop 2.6) is the latest supported combination; the reason is that there are a few breaking changes in Spark 2.0.
Now, if you use Spark 1.6, you shouldn't get those errors. Anaconda isn't strictly necessary (the PySpark and Scala shells should just work). If you want Jupyter notebooks, you could look at Apache Toree, which I've had good success with for setting up notebooks. Otherwise, Apache Zeppelin is probably the recommended notebook environment in a production Hadoop cluster.
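For reference, the error message itself names the fix: nothing inside the pyspark script needs editing; the old variables just have to come out of the environment and the new ones go in. A minimal sketch of a Spark 2.0-style setup, assuming the /opt/spark-latest path from the question and that IPython is installed for your Python 2.7 (the executable names are assumptions):

# Spark 2.0+ aborts if the old variables are set, so clear them first
unset IPYTHON
unset IPYTHON_OPTS

# Tell Spark where it lives and which Python should drive the shell
export SPARK_HOME=/opt/spark-latest
export PYSPARK_DRIVER_PYTHON=ipython

# Or, for a Jupyter notebook instead of a terminal shell:
# export PYSPARK_DRIVER_PYTHON=jupyter
# export PYSPARK_DRIVER_PYTHON_OPTS=notebook

$SPARK_HOME/bin/pyspark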

Related

How to install Sqoop in Cygwin on Windows 7?

I've already installed Cygwin on Windows 7. Now I plan to add Sqoop to Cygwin for Hadoop, but I'm not getting it right...
Can anybody suggest the correct way to do this, or a link that details it?
I think you should reconsider installing Hadoop on Windows; it is not easy to do and is probably more trouble than it is worth, although I believe others have done it.
Anyway, there are several other options you could consider for Hadoop. First, there are two companies I know of that provide free VMs, and one of them has worked with Microsoft to try to integrate Hadoop into Windows. These are the links:
http://www.cloudera.com/content/www/en-us/downloads/quickstart_vms/5-4.html
http://hortonworks.com/products/hortonworks-sandbox/#install
Otherwise you can try your luck with the plain Apache installation, though I warn you: if you're new to Linux or don't like spending a lot of time changing configuration files, this is not the best route. I did my installation this way, and you have to modify a lot of files; plus, anything extra like Hive, Sqoop, or HBase needs to be installed and configured separately as well.
Please don't make things complicated for yourself.
I can only recommend running Sqoop on Hadoop in a Linux virtual machine or on native Linux. Although I successfully ran Hadoop 0.20.0 on Windows XP + Cygwin and on Windows 7 + Cygwin, when I tried setting up a newer version of Hadoop on Windows 7 I failed miserably due to errors in Hadoop.
I have wasted days and weeks on this.
So my advice: run Hadoop on Linux if you can; you'll avoid a serious number of problems.
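For what it's worth, once you are on Linux (VM or native), a plain tarball install of Sqoop 1.x is short. A rough sketch, assuming Hadoop already lives under /usr/local/hadoop and a Sqoop 1.4.6 tarball; the paths and version are assumptions, not from the question:

# Unpack the Sqoop release and point it at the existing Hadoop install
tar -xzf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz -C /usr/local
export SQOOP_HOME=/usr/local/sqoop-1.4.6.bin__hadoop-2.0.4-alpha
export HADOOP_COMMON_HOME=/usr/local/hadoop
export HADOOP_MAPRED_HOME=/usr/local/hadoop
export PATH=$PATH:$SQOOP_HOME/bin

# Drop the JDBC driver for your database (e.g. the MySQL connector jar)
# into $SQOOP_HOME/lib, then sanity-check the install:
sqoop version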

Hue for Standard Hadoop 2.6.0

Has anyone tried and succeeded in building and configuring Hue on a standard Hadoop installation (no CDH etc.)?
And if so: what versions (Hadoop, Hive, Hue) did you use? Is there a guide somewhere that explains how to do this?
I have tried and run into a bunch of problems trying to get it to work.
After I run 'make apps' with the pom.xml configured for Hadoop 2.6.1, it seems to build successfully, but it doesn't connect to Hive, the Oozie server won't start, and so on...
Any guide I can find online seems to be for CDH and nothing has worked so far.
I'm using OS X Mountain Lion, by the way. Do you know if there is a way to make Cloudera's or Hortonworks' distribution work on this OS? If so, I might try that first...
Hue is compatible with any standard Hadoop; see the guides in the 'Configure' menu of gethue.com, and in particular:
http://gethue.com/hadoop-hue-3-on-hdp-installation-tutorial/
http://cloudera.github.io/hue/docs-3.7.0/index.html
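Whichever guide you follow, the step that matters on a non-CDH install is pointing hue.ini at your own daemons rather than CDH's defaults. A minimal sketch of the relevant sections, assuming everything runs on localhost with stock ports (8020 for the NameNode, 50070 for WebHDFS, 10000 for HiveServer2; adjust to your cluster) and that WebHDFS is enabled in hdfs-site.xml:

[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      # Must match fs.defaultFS in core-site.xml
      fs_defaultfs=hdfs://localhost:8020
      webhdfs_url=http://localhost:50070/webhdfs/v1

[beeswax]
  # HiveServer2, for the Hive app
  hive_server_host=localhost
  hive_server_port=10000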

Configuring Hadoop in Local Mode

Hi, I have successfully installed Hadoop in pseudo-distributed mode on a VM. I write code in Eclipse, then export it as a jar file onto the Hadoop cluster and do my debugging there. Now, just for learning purposes, I am trying to install Hadoop in local mode on my Windows machine. This way I will be able to test without going through all the hassle of creating jar files, exporting them, and testing on the Hadoop cluster.
My question: can anyone help me understand how Hadoop works in local mode (HDFS vs. the local file system) on Windows, and how I can configure Hadoop locally on the Windows machine (what steps should I follow)?
I tried following various blogs on this but was not able to understand much from them, so I am posting the question here.
Let me know if any other information is needed. Thanks in advance.
Unfortunately, you can't use Hadoop on Windows out of the box; however, you can use Cygwin to achieve effectively the same thing.
I managed to get local mode and distributed mode running directly from Cygwin, but was unable to get pseudo-distributed mode to work nicely due to various cygpath conversion issues between Unix and Windows path styles.
However, in practice I still build the jars and send them straight across to the cluster using rsync, as that is much faster once your project reaches a certain size, and remote debugging can be done from Eclipse on Windows against the Linux cluster.
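On the original question of how local mode works: standalone (local) mode is mostly the absence of configuration. Everything, including your MapReduce job, runs in a single JVM against the local file system, with no HDFS or MapReduce daemons at all. A sketch of the two properties that control this, using Hadoop 2.x names (the 1.x equivalents are fs.default.name and mapred.job.tracker); these values are in fact the defaults, so an unconfigured install is already in local mode:

In core-site.xml:
  <property>
    <name>fs.defaultFS</name>
    <value>file:///</value>   <!-- local file system, no HDFS -->
  </property>

In mapred-site.xml:
  <property>
    <name>mapreduce.framework.name</name>
    <value>local</value>      <!-- single-JVM LocalJobRunner -->
  </property>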

Is it possible to install Hive and Hadoop on Windows?

I want to know if I can install Hive on Windows. If yes, how can I do that?
As of now, the Microsoft-provided "Hadoop on Windows" is not available for general consumption, and there is no public information about its general availability.
In my blog post below you will see that I had a chance to use the binaries in the past, but most of the focus now is on "Hadoop on Azure", which is in a limited CTP release 2:
http://blogs.msdn.com/b/avkashchauhan/archive/2012/01/28/creating-your-own-hadoop-cluster-on-windows-azure-by-using-your-own-windows-azure-subscription-account.aspx
I would say that there are developers who have written articles or solutions on running Hadoop and other components on Windows using Cygwin, which you can try; however, I don't know of anything very robust and stable. If you really want to give it a try, I would personally suggest downloading the Cloudera Hadoop VM onto your Windows box and running it with any virtual machine player application.
http://hadoop.apache.org/common/docs/r1.0.3/single_node_setup.html
You need to install Cygwin.
Would I? No; I'd run the Cloudera VMs and not try to deal with all the possible issues.
Coming real-soon-now is the Hadoop for Windows Server program formerly known as Isotope.
Getting Started with Hadoop For Windows Server
Performance is much better than the equivalent Cygwin install, since all the file I/O is done natively instead. It comes with Pig and Hive thrown into the bag too, and has an equivalent Azure install package as well; check it out.

HBase data loss? Missing HDFS append support? Running the HMaster without HDFS append support enabled?

I am using HBase. I have installed it and have the distributed environment running now.
However, it shows a warning in HMaster's interface page:
"You are currently running the HMaster without HDFS append support enabled. This may result in data loss"
How can I solve this if I don't use CDH3's Hadoop? Can someone give me very detailed instructions, please?
Thanks!!!!
As you just found out, you cannot (should not) use the standard Apache release of Hadoop 0.20.* with HBase, as it is missing append support (HDFS-200). There is no official ASF Hadoop release that has append support. Cloudera's release is the easiest way; can you elaborate on why you cannot use it? It is distributed under the same license as Apache, and if you use a tarball release it is similar to the Apache release, so you don't need special permission to install RPMs.
The other choices I am aware of are rolling your own Hadoop from the hadoop-append branch (not fun) and using MapR, which I have no first-hand experience with.
For a while on the HBase mailing lists, some people have had luck replacing the Hadoop jar in their Hadoop install with the Hadoop jar that ships with HBase. That approach does seem fraught with risk, and not everyone is happy with it.
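For completeness, the flag HBase is checking for is dfs.support.append. Setting it only helps on a build that actually ships the append/sync code (CDH3, a 0.20-append build, or the HBase-bundled jar mentioned above); on a stock Apache 0.20 release the flag alone will not make the write-ahead log durable. A sketch, assuming such a build:

In hdfs-site.xml, on every node and in HBase's conf directory as well:
  <property>
    <name>dfs.support.append</name>
    <value>true</value>
  </property>

Then restart HDFS and HBase so both sides pick up the change.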
