Running Hadoop Job Through Web Interface - hadoop

Is there any way to run Hadoop job through a web interface? e.g. giving command for Hadoop job execution with a button.
I want to implement a web interface for my Hadoop project.
Thanks!

Cloudera's Hue will be useful here; it is designed for exactly this purpose.
http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/4.2.2/Hue-2-User-Guide/hue26.html

Try the below options:
Option 1
Create a Java web project with a web service and collect all the UI inputs on this client side.
Create another web project as a remote server that receives all of the above job inputs and passes them on to the jobs.
The remote-server web project should always be up and running on the cluster, listening for the client's requests.
On the server side, use JSch to invoke whatever hadoop commands you pass from the UI, as and when they arrive (see the sketch below).
OR
Option 2
You could use a MySQL database to store all the job parameters coming from the UI, then write a simple Java program with JSch that polls the DB and runs those hadoop commands, packaged as a runnable jar that runs all the time.
Hope the above 2 ideas help you.
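For concreteness, here is a rough sketch of the JSch piece both options rely on: SSH into a cluster node and run a hadoop command assembled from the UI (or DB) parameters. The host, user, key path, jar and class names are placeholders, not part of the original setup.

// Rough sketch only: host, user, key path and job details below are placeholders.
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.BufferedReader;
import java.io.InputStreamReader;

public class RemoteHadoopJobRunner {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        jsch.addIdentity("/home/webapp/.ssh/id_rsa");              // placeholder key file
        Session session = jsch.getSession("hdfs", "edge-node.example.com", 22);
        session.setConfig("StrictHostKeyChecking", "no");          // acceptable for a demo only
        session.connect();

        // Command assembled from the parameters captured by the web UI / MySQL table.
        String cmd = "hadoop jar /opt/jobs/wordcount.jar com.example.WordCount /input /output";

        ChannelExec channel = (ChannelExec) session.openChannel("exec");
        channel.setCommand(cmd);
        channel.setErrStream(System.err);
        BufferedReader out = new BufferedReader(new InputStreamReader(channel.getInputStream()));
        channel.connect();

        // Stream the job's console output back so the UI can show progress.
        String line;
        while ((line = out.readLine()) != null) {
            System.out.println(line);
        }
        channel.disconnect();
        session.disconnect();
    }
}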

Related

Call MapReduce from a local web app

I have a web app deployed on my localhost.
I also have a MapReduce job (cleandata.jar) in a Hortonworks sandbox on my PC.
How can I call my MapReduce .jar from my web app?
I'm trying to do this with JSch and ChannelExec in order to issue a system call to the virtual machine, and it works. Is there a more elegant/easy way to do this?
I didn't use the Hortonworks Sandbox, but the proper way of launching YARN (and MapReduce) applications programmatically is by using the YarnClient Java class. It's quite complicated though, because you need to know some Hadoop internals to do that. First, you should have network access to the ResourceManager, NodeManagers, DataNodes and NameNode. Next, you should set the Configuration properties according to the hdfs-site.xml and yarn-site.xml files you will probably find in the sandbox (you can copy them and put them on the classpath of your webapp).
You can take a look here: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Note that if your cluster is secured, the webapp from which you submit the job has to run on a Java with extended security (JCE), and you should authenticate using UserGroupInformation.
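As a starting point, here is a minimal sketch of wiring a webapp to the sandbox's ResourceManager with YarnClient; it only lists the applications the ResourceManager knows about, which is a useful connectivity check before attempting a real submission. The addresses below are assumptions; take the real values from the sandbox's yarn-site.xml and core-site.xml, or simply drop those files onto the webapp classpath.

// Minimal connectivity sketch; host names and ports below are assumptions.
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class SandboxYarnCheck {
    public static void main(String[] args) throws Exception {
        // Picks up yarn-site.xml / core-site.xml if they are on the classpath;
        // otherwise set the essentials by hand (the sandbox addresses are placeholders).
        YarnConfiguration conf = new YarnConfiguration();
        conf.set("yarn.resourcemanager.address", "sandbox.hortonworks.com:8050");
        conf.set("fs.defaultFS", "hdfs://sandbox.hortonworks.com:8020");

        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(conf);
        yarnClient.start();

        // Sanity check: list the applications known to the ResourceManager.
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.println(report.getApplicationId() + " " + report.getYarnApplicationState());
        }
        yarnClient.stop();
    }
}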

Does an embedded Flume agent need Hadoop to function on the cluster?

I am trying to write an embedded Flume agent into my web service to transfer my logs to another Hadoop cluster where my Flume agent is running. To work with the embedded Flume agent, do we need Hadoop to be running on the server where my web service is running?
TLDR: I think, no.
Longer version: I haven't checked, but in the developer guide (https://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent) it says
Note: The embedded agent has a dependency on hadoop-core.jar.
And in the User Guide (https://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you can specify the HDFS path:
HDFS directory path (eg hdfs://namenode/flume/webdata/)
On the other hand, are you sure you want to work with the embedded agent instead of running Flume on the cluster where you want to put the data and using, for example, an HTTP Source? (https://flume.apache.org/FlumeUserGuide.html#http-source) (...or any other source you can send data to)
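For reference, a sketch of what the embedded agent setup looks like when it ships events to Avro sources on the remote cluster, following the pattern in the developer guide; the collector host names and ports are placeholders. Consistent with the TLDR above, the only local Hadoop piece should be the hadoop-core.jar dependency on the classpath, not a running Hadoop daemon, since HDFS is written by the Flume agents on the other cluster.

// Sketch of an embedded Flume agent inside the web service; collector hosts/ports are placeholders.
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class LogShipper {
    public static void main(String[] args) throws Exception {
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "10000");
        properties.put("sinks", "sink1 sink2");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "collector1.example.com");  // placeholder
        properties.put("sink1.port", "4545");
        properties.put("sink2.type", "avro");
        properties.put("sink2.hostname", "collector2.example.com");  // placeholder
        properties.put("sink2.port", "4545");
        properties.put("processor.type", "load_balance");

        EmbeddedAgent agent = new EmbeddedAgent("webservice-agent");
        agent.configure(properties);
        agent.start();

        // Hand log events to the agent; the remote Flume agents take care of writing to HDFS.
        agent.put(EventBuilder.withBody("sample log line", StandardCharsets.UTF_8));

        agent.stop();
    }
}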

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input file data could be big, running my application standalone on one machine isn't enough. With one more physical machine, how should I set up the architecture for it?
I'm considering using Mesos as the cluster manager, but I'm pretty much a noob at HDFS. Is there any way to do it without HDFS (for sharing the file data)?
Spark supports a couple of cluster modes: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work on your cluster's file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that load a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI there shows more info about the cluster and its workers.
There are many more parameters to play with. For more info, please refer to this documentation.
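For completeness, a minimal sketch of how an application then attaches to that standalone master (the master host and input path are examples, not real values). Note that without HDFS, the input file has to be readable at the same path on every worker, e.g. via a shared NFS mount, which partly answers the "without HDFS" part of the question.

// Minimal sketch; the master URL and the shared input path are examples.
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class FileAnalyzer {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("file-analyzer")
                .setMaster("spark://master-host:7077");   // the URL printed by start-master.sh
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Without HDFS, this path must exist on every worker (shared mount, rsync, etc.).
        JavaRDD<String> lines = sc.textFile("file:///shared/data/input.log");
        System.out.println("line count: " + lines.count());

        sc.stop();
    }
}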
Hope I have managed to help! :)

JobTracker web UI - not working in pseudo-distributed mode v 2.7.1

I have installed Hadoop 2.7.1 in pseudo-distributed mode (all daemons on a single machine). It's up and running: I'm able to access HDFS through the command line, run jobs, and see their output.
I can access http://localhost:50070/dfshealth.html#tab-overview. It shows the version and cluster status and lets me browse the Hadoop file system.
I found one link and applied its accepted solution, but that does not work for me. When I try to access http://127.0.0.1:54310, I get the error message below:
It looks like you are making an HTTP request to a Hadoop IPC port. This is
not the correct port for the web interface on this daemon.
Any help is appreciated.
Thanks..
I am using MR2 and am not able to track my job on port 8088. When I run a MapReduce job, it submits the job at http://localhost:8080 and that URL does not open to track the job.
Use port 50030 if you are using MRv1; for YARN, use port 8088 to access the ResourceManager.

How to know current running topologies from storm command line client?

Is there any way to display all the currently running Storm topologies from the storm command-line client?
Storm documentation doesn't say anything about this.
http://storm.apache.org/documentation/Command-line-client.html
You can run
$STORM_HOME/bin/storm list
Storm also provides a web-based UI for monitoring this kind of information.
However, you can also write your own Thrift client that connects to Nimbus and pulls whatever metrics you need. If you come from a Java background or similar, it should be easy to write and run from the prompt; a sketch follows.
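If you do go the Thrift route, a small sketch of listing topologies from Java could look like this. The package names shown are the pre-1.0 ones (backtype.storm.*); in Storm 1.x they become org.apache.storm.*.

// Small sketch: ask Nimbus (via its Thrift API) for the cluster summary and print topologies.
import java.util.Map;
import backtype.storm.generated.ClusterSummary;
import backtype.storm.generated.TopologySummary;
import backtype.storm.utils.NimbusClient;
import backtype.storm.utils.Utils;

public class ListTopologies {
    public static void main(String[] args) throws Exception {
        Map conf = Utils.readStormConfig();                       // reads storm.yaml / defaults
        NimbusClient nimbus = NimbusClient.getConfiguredClient(conf);
        ClusterSummary summary = nimbus.getClient().getClusterInfo();
        for (TopologySummary topology : summary.get_topologies()) {
            System.out.println(topology.get_name() + "  " + topology.get_status());
        }
    }
}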
