Hadoop Customization overwritten by Ambari - hadoop-env.sh - hadoop

It seems a simple task: change JAVA_HOME in /etc/hadoop/conf/hadoop-env.sh to use a different version of Java.
However, it seems Ambari will overwrite any changes you make in hadoop-env.sh through its templating scheme.
The template seems to contain the following line:
export JAVA_HOME={{java_home}}
So, if this template is used to regenerate and replace the environment file on each node, how do I define {{java_home}}?

As of Ambari 1.7.0, you can modify hadoop-env from the Ambari Web UI. You can learn about the other features in Ambari 1.7.0, as well as 2.0.0 (the latest release), from the links on this page:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=30755705
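If you just want a specific JAVA_HOME on every node, one option is to edit the hadoop-env template that Ambari exposes in the web UI rather than the file on disk (the exact location of the template field varies by Ambari version, and the Java path below is only an illustrative placeholder):
# In the hadoop-env template in Ambari's configuration screen, replace
export JAVA_HOME={{java_home}}
# with an explicit path, for example
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
Because the change is made in the template itself, Ambari then pushes the hardcoded value to each node instead of substituting its own {{java_home}}.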

Related

How to change the tmp directory in YARN

I have written an MR job and run it in local mode with the following configuration settings:
mapred.local.dir=<<local directory having good amount of space>>
fs.default.name=file:///
mapred.job.tracker=local
on Hadoop 1.x
Now I am using Hadoop 2.x and running the same job with the same configuration settings, but I am getting the error:
Disk Out of Space
Is it that if I switch from Hadoop 1.x to 2.x (using Hadoop 2.6 jars), the same configuration settings for changing the tmp directory no longer work?
What are the new settings to configure the "tmp" directory of MR1 (mapred API) on Hadoop 2.6?
Kindly advise.
Regards
Cheers :))
Many properties in 1.x have been deprecated and replaced with new properties in 2.x.
mapred.child.tmp has been replaced by mapreduce.task.tmp.dir
mapred.local.dir has been replaced by mapreduce.cluster.local.dir
Have a look at the complete list of deprecated properties and their new equivalents at the Apache website link.
It can be done by setting
mapreduce.cluster.local.dir=<<local directory having good amount of space>>
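As a rough sketch, the 1.x local-mode settings from the question would translate to something like the following on 2.x (the directory value is a placeholder, and mapreduce.framework.name=local stands in for mapred.job.tracker=local):
fs.defaultFS=file:///
mapreduce.framework.name=local
mapreduce.cluster.local.dir=<<local directory having good amount of space>>
mapreduce.task.tmp.dir=<<local directory having good amount of space>>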

Setting up Hadoop Client on Mac OS X

Currently, I have a 3-node cluster running CDH 5.0 using MRv1. I am trying to figure out how to set up Hadoop on my Mac so that I can submit jobs to the cluster. According to "Managing Hadoop API Dependencies in CDH 5", you just need the files in /usr/lib/hadoop/client-0.20/*. Do I need the following files too? Does Cloudera have hadoop-client as a tarball?
- core-site.xml
- hdfs-site.xml
- mapred-site.xml
Yes, I think you can make use of the Cloudera tarball for setting up a Hadoop client; it can be downloaded from the following path. The configuration files are available under the etc/hadoop/ directory; you just need to modify those files according to your environment.
http://archive-primary.cloudera.com/cdh5/cdh/5/hadoop-2.2.0-cdh5.0.0-beta-2.tar.gz
If the above link doesn't match your version, use the following link to see the available Hadoop versions:
http://archive-primary.cloudera.com/cdh5/cdh/5/
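As a minimal sketch, the client-side files under etc/hadoop/ mainly need to point at your cluster; the hostnames below are placeholders, and 8020/8021 are common CDH defaults that you should verify against your own cluster:
# core-site.xml
fs.default.name=hdfs://<namenode-host>:8020
# mapred-site.xml (MRv1)
mapred.job.tracker=<jobtracker-host>:8021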

Not able to run Hadoop job remotely

I want to run a hadoop job remotely from a windows machine. The cluster is running on Ubuntu.
Basically, I want to do two things:
Execute the hadoop job remotely.
Retrieve the result from hadoop output directory.
I don't have any idea how to achieve this. I am using Hadoop version 1.1.2.
I tried passing the JobTracker/NameNode URL in the job configuration, but it fails.
I have tried the following example : Running java hadoop job on local/remote cluster
Result: I consistently get a "cannot load directory" error. It is similar to this post:
Exception while submitting a mapreduce job from remote system
Welcome to a world of pain. I've just implemented this exact use case, but using Hadoop 2.2 (the current stable release) patched and compiled from source.
What I did, in a nutshell, was:
Download the Hadoop 2.2 sources tarball to a Linux machine and decompress it to a temp dir.
Apply these patches which solve the problem of connecting from a Windows client to a Linux server.
Build it from source, using these instructions. It will also ensure that you have 64-bit native libs if you have a 64-bit Linux server. Make sure you fix the build files as the post instructs or the build would fail. Note that after installing protobuf 2.5, you have to run sudo ldconfig, see this post.
Deploy the resulting dist tar from hadoop-2.2.0-src/hadoop-dist/target on the server node(s) and configure it. I can't help you with that since you need to tweak it to your cluster topology.
Install Java on your client Windows machine. Make sure that the path to it has no spaces in it, e.g. c:\java\jdk1.7.
Deploy the same Hadoop dist tar you built on your Windows client. It will contain the crucial fix for the Windows/Linux connection problem.
Compile winutils and Windows native libraries as described in this Stackoverflow answer. It's simpler than building entire Hadoop on Windows.
Set up JAVA_HOME, HADOOP_HOME and PATH environment variables as described in these instructions
Use a text editor or unix2dos (from Cygwin or standalone) to convert all .cmd files in the bin and etc\hadoop directories, otherwise you'll get weird errors about labels when running them.
Configure the connection properties to your cluster in your config XML files, namely fs.default.name, mapreduce.jobtracker.address, yarn.resourcemanager.hostname and the like (see the sketch after these steps).
Add the rest of the configuration required by the patches mentioned above. This is required for the client side only; otherwise the patch won't work.
If you've managed all of that, you can start your Linux Hadoop cluster and connect to it from your Windows command prompt. Joy!
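As a minimal sketch of the connection properties mentioned above, a Hadoop 2.2 / YARN client configuration might contain something like the following; the hostnames and port are placeholders to be replaced with your cluster's values:
# core-site.xml
fs.default.name=hdfs://<namenode-host>:8020
# mapred-site.xml
mapreduce.framework.name=yarn
# yarn-site.xml
yarn.resourcemanager.hostname=<resourcemanager-host>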

Install Hue without Cloudera

Has anyone tried/succeeded in installing Hue on Hadoop without Cloudera?
I have gotten to a point where I can reliably reproduce a hadoop cluster with hbase and hive and can set it all up in about 15 minutes. I'd love to have Hue along with all this without having to go back and redo my setup with Cloudera.
Check out slides #19 & #5; Hue is getting everywhere and is compatible with Hadoop 0.20 / 1.2.0 / 2.2.0: http://gethue.com/hue-goes-to-paris-hug-france/
Hue has tarball releases that you are free to install. You can also simply clone the source code (Hue is open source and Apache licensed) from GitHub: https://github.com/cloudera/hue and build the branch you want.
Upstream documentation is here, or CDH's version is here.
Hue is also packaged in BigTop (and so based on Vanilla Hadoop).
Hue is a web server (Django based) which acts as a view on top of Hadoop. So Hue just needs to be installed and then configured by adding the hosts of the NameNode, JobTracker, Resource Manager, Oozie, HiveServer, etc. in its hue.ini.
Also, as detailed on gethue.com/releases, the version you need might depend on your Hive version.
Notice that without Cloudera's distribution your mileage might vary, but feel free to chime in on the Hue user list or gethue.com ;)
We are also looking at improving the Hue setup on Amazon AWS/EMR!
To build and run Hue 3.6.0 with Apache Hadoop 2.4.1:
git clone https://github.com/cloudera/hue.git (Note: releases/tag/release-3.6.0 is unstable; it's better to build from the latest master. I built from the Aug 7 commit 87d6b2da1, which is stable.)
cd hue
$ vi maven/pom.xml
change hadoop.version to 2.4.1
replace hadoop-core with hadoop-common
set hadoop-test version to 1.2.1
remove files which need hadoop mr1
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/mapred/ThriftJobTrackerPlugin.java
$ rm desktop/libs/hadoop/java/src/main/java/org/apache/hadoop/thriftfs/ThriftJobTrackerPlugin.java
build hue $ make apps
configure hue $ vi desktop/conf/pseudo-distributed.ini
run hue server in dev mode $ build/env/bin/hue runserver 0.0.0.0:8000
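As a rough example of the configuration step above, the [hadoop] section of pseudo-distributed.ini (or hue.ini) points Hue at your cluster; the hosts and ports below are placeholders, and the exact keys may differ slightly between Hue versions:
[hadoop]
  [[hdfs_clusters]]
    [[[default]]]
      fs_defaultfs=hdfs://<namenode-host>:8020
      webhdfs_url=http://<namenode-host>:50070/webhdfs/v1
  [[yarn_clusters]]
    [[[default]]]
      resourcemanager_host=<resourcemanager-host>
      resourcemanager_api_url=http://<resourcemanager-host>:8088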
Follow the Hue manual installation steps from the Hortonworks documentation; it will take you step by step through how to do it manually.
Quote: "...without Cloudera's distribution your mileage might vary...."
Indeed, it will vary A LOT! It would seem that the following is quite true:
Per the install guide:
http://cloudera.github.io/hue/docs-2.0.1/manual.html#_install_hue
NOTE:
Hue requires the Hadoop contained in Cloudera’s Distribution including Apache Hadoop (CDH), version 3 update 4 or later.
I've tried it and have run into walls with Hue trying to connect to Hive, Pig and OOZIE.
At this stage - from my experience at least - Hue will NOT run on a standard Apache Hadoop installation using standard Apache tools like Hive and Pig. It must be a vintage of Cloudera’s Distribution.
If anyone has any other (positive) experiences installing Hue outside of the Cloudera’s Distribution, I'd be quite interested to hear about them...

Installing/Configuring Hue with Apache frameworks

Has anyone tried configuring Hue with the frameworks from Apache rather than with CDH? The documentation says to set the mapred.jobtracker.plugins property to org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin and to check the JT log files to make sure the Thrift plugin has been loaded. But I don't see anything related to Thrift in the JT log files. Also, it looks like mapred.jobtracker.plugins is not defined in the mapred-site.xml for Hadoop 1.2.1, which is the latest stable release.
Did you add mapred.jobtracker.plugins to the mapred-site.xml? Hadoop has supported plugins since 1.2.0, so you should be good.
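For reference, the property the documentation refers to goes into the JobTracker's mapred-site.xml roughly as follows (a sketch; the Hue plugin jar also needs to be on the JobTracker's classpath, otherwise the class cannot be loaded):
mapred.jobtracker.plugins=org.apache.hadoop.thriftfs.ThriftJobTrackerPlugin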
