Installing Oozie on a separate machine from Hadoop

Very new to Oozie, hence please excuse me if I sound like a newbie.
I have a Hadoop cluster that is up and running. I want to install Oozie on a separate machine from the Hadoop cluster. Is this possible? I ask because every installation guide I have seen says to install Hadoop on the same machine, so I am not sure whether it is technically possible to run Oozie on a different machine from Hadoop.
Thanks in advance

The Oozie server handles client requests. It is a web application that runs on an embedded Tomcat, and it can be installed on any machine from which Hadoop is reachable; it is not tied to the Hadoop nodes themselves. You specify Hadoop's nameNode and jobTracker in the workflow properties so that Oozie knows where to send its jobs.
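As a minimal sketch of those workflow properties, the shell snippet below writes a job.properties file on the Oozie machine. The hostnames, ports, and application path are placeholders assumed for illustration, not values from the question; substitute the addresses of your own cluster.

```shell
#!/bin/sh
# Sketch: create a job.properties file for an Oozie workflow.
# All hostnames, ports and paths below are placeholders (assumptions).
# The quoted 'EOF' keeps the shell from expanding ${nameNode} and ${user.name};
# Oozie resolves those variables itself.
cat > job.properties <<'EOF'
# HDFS NameNode of the remote Hadoop cluster
nameNode=hdfs://namenode.example.com:8020
# JobTracker address of the remote Hadoop cluster
jobTracker=jobtracker.example.com:8021
# HDFS directory that holds workflow.xml
oozie.wf.application.path=${nameNode}/user/${user.name}/apps/example-wf
EOF
```

The file would then be passed to the Oozie command-line client, for example: oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run (oozie-host is again a placeholder).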

Related

Make spark environment for cluster

I wrote a Spark application that analyzes file data. Since the input files can be large, running the application on a single machine is not enough. With one more physical machine available, how should I set up the architecture?
I'm considering Mesos as the cluster manager, but I'm new to HDFS. Is there any way to do this without HDFS (for sharing file data)?
Spark supports several cluster managers: YARN, Mesos, and Standalone. You can start with Standalone mode, in which you work with your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article on using Spark's built-in scripts, which launch a Spark cluster automatically.
If you are running on an on-prem environment, the way to run in Standalone mode is as follows:
-Start a standalone master
./sbin/start-master.sh
-The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) on your cluster use the URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
-To verify that the worker was added to the cluster, open http://localhost:8080 on the master machine; the Spark UI shows more information about the cluster and its workers.
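Once the worker is listed, an application can be pointed at the master by passing that URL to spark-submit with --master. The snippet below is a sketch only: the host name, application class, JAR, and input path are hypothetical, and 7077 is the standalone master's default port. It prints the command rather than running it. Note that without HDFS the input path has to be readable at the same location on every worker, for example through a shared mount.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with the values printed by start-master.sh.
MASTER_HOST="master.example.com"
MASTER_PORT=7077   # default port of the standalone master
MASTER_URL="spark://${MASTER_HOST}:${MASTER_PORT}"

# Hypothetical application class, JAR and input file.
CMD="./bin/spark-submit --master $MASTER_URL --class com.example.FileAnalyzer /path/to/app.jar /shared/data/input.txt"

# Dry run: print the command so it can be reviewed before executing it.
echo "$CMD"
```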
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

Apache Kylin installation without Sandbox

I was wondering if there are any resources regarding Apache Kylin installation without any sandbox (like cloudera, hortonworks) support. I have managed to do the following:
Install Hadoop 2.6
Install Hive
Install HBase
Then I used the binary from the Kylin site and have been able to run it so far. The problem starts when I try to build a cube: the MapReduce job gets stuck in step 2. I suspect it still assumes it is in sandbox mode and is not submitting the job to Hadoop at all (there is no entry in the Hadoop JobTracker).
So I need a solution for these two points:
1. Possible configuration of Kylin in a pure Hadoop setup (no sandbox)
2. Some way to make the Kylin setup submit jobs to Hadoop.
There is no sandbox or non-sandbox configuration in Kylin. Just make sure the machine where Kylin runs has Hadoop set up correctly and you should be fine.
Under the hood, kylin.sh uses hbase classpath and hive -e set | grep 'env:CLASSPATH' to detect the Hadoop settings. If you are not sure which cluster Kylin connects to, double-check that these commands work as expected.
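To run that check quickly, a small script along these lines may help. It is a sketch, not part of Kylin: check_cmd is a hypothetical helper, and the script only confirms that the hbase and hive commands are on the PATH before running the two commands mentioned above.

```shell
#!/bin/sh
# check_cmd is a hypothetical helper: it reports whether a command is on the PATH.
check_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
        return 1
    fi
}

# Count how many of the required client commands are missing.
missing=0
for cmd in hbase hive; do
    check_cmd "$cmd" || missing=$((missing + 1))
done

if [ "$missing" -eq 0 ]; then
    # The same two commands kylin.sh uses to detect the Hadoop settings.
    hbase classpath
    hive -e set | grep 'env:CLASSPATH' || echo "hive did not report env:CLASSPATH"
else
    echo "Fix the PATH on the Kylin machine before starting Kylin."
fi
```

If the printed classpath points at configuration directories for a different cluster than the one you expect, that would explain jobs never showing up there.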
If Kylin has problems submitting MR jobs, check two places. First, look at the Hadoop resource manager to see whether the job has really been submitted; sometimes it is just running slowly. Second, check kylin.log for exceptions. Post the log to the Kylin dev mailing list and someone will be able to help.
You can install hadoop-2.6, hive-0.14, and hbase-0.98.8-hadoop2, with either the built-in ZooKeeper or an external zookeeper-3.5.
You can then run kylin-v1.1-release on it.
If you still face issues, paste the log here.

Should oozie be installed on all the hadoop nodes inside a single hadoop cluster?

I am running Oozie on Hadoop 1.0.3. Does Oozie have to be installed on all the Hadoop nodes in a single cluster, or is it sufficient to install it on the (Hadoop) master node only? I searched through the Oozie documentation but could not find the answer.
Thank you,
Mohsin.
Oozie need not be installed on all the nodes in a cluster. It can be installed on a dedicated machine or along with any other framework. Check this guide for a quick installation of Oozie.
Note that Oozie has a client and a server component. The server component contains a scheduler and a workflow engine, and the workflow engine uses hPDL (Hadoop Process Definition Language) to define workflows.

How to administer Hadoop Cluster

I have a running 4-node Hadoop cluster, and I am looking for a way to administer it remotely, for example from my laptop, for tasks such as:
-executing MapReduce tasks
-disabling or enabling data nodes
Is there any way to do that remotely?
If you're using the Cloudera distribution, the Cloudera Manager webapp would let you do that.
Other distributions may have similar control apps. That would give you per-node control.
For executing MR tasks, you would normally submit the job from an external node anyway, pointing it at the correct JobTracker and NameNode. So I'm not sure what else you're asking for there.
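For illustration, here is roughly what submitting from a laptop could look like, using Hadoop's generic -fs and -jt options to point at the NameNode and JobTracker. The addresses, JAR name, class name, and paths are all hypothetical, and the options only take effect if the job's driver parses generic options (for example through ToolRunner). The snippet prints the command instead of running it.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with the cluster's real addresses.
NAMENODE="hdfs://namenode.example.com:8020"
JOBTRACKER="jobtracker.example.com:8021"

# Hypothetical job JAR, driver class, and HDFS input/output paths.
CMD="hadoop jar my-job.jar com.example.MyJob -fs $NAMENODE -jt $JOBTRACKER /input /output"

# Dry run: print the command so it can be checked before executing it.
echo "$CMD"
```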

Does Mahout need to be installed on the Hadoop's master node?

That's a dumb question, but somebody has to ask it.
I've tried running Mahout locally, which worked. Now I want the work to be performed by a remote cluster, not my local machine.
So, should I deploy the Mahout code on the Hadoop machines, or can Mahout on my local machine still interface remotely with Hadoop?
No, you don't install Hadoop programs on the Hadoop workers yourself. That would be a nightmare to maintain. Hadoop does it for you when you provide it the JAR file with all code via hadoop jar.
What runs on your local machine, when you run Mahout or anything else Hadoop-based, is a client program that uses Hadoop code to send info to a cluster to start work. That cluster might be local, or remote -- makes no difference to how you run the client, just what the client talks to.
