Call MapReduce from a local web app - hadoop

I have a web app deployed on my localhost.
I also have a MapReduce job (cleandata.jar) in a Hortonworks Sandbox on my PC.
How can I call my MapReduce .jar from my web app?
I'm currently using JSch and ChannelExec to make a system call into the virtual machine, and this works. Is there a more elegant/easier way to do this?

I haven't used the Hortonworks Sandbox, but the proper way of launching YARN (and MapReduce) applications programmatically is the YarnClient Java class. It's quite complicated, though, because you need to know some Hadoop internals to do it. First, you need network access to the ResourceManager, NodeManagers, DataNodes and NameNode. Next, you should set Configuration properties according to the hdfs-site.xml and yarn-site.xml files you will probably find in the sandbox (you can copy them and put them on the classpath of your webapp).
You can take a look here: https://hadoop.apache.org/docs/r2.7.2/hadoop-yarn/hadoop-yarn-site/WritingYarnApplications.html
Note that if your cluster is secured, the webapp from which you submit the job has to run on a JVM with the Java Cryptography Extension (JCE) unlimited-strength policy installed, and you should authenticate using UserGroupInformation.
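For an existing MapReduce jar like cleandata.jar, a simpler alternative to raw YarnClient is the MapReduce Job API. This is only a sketch, assuming the sandbox's core-site.xml, hdfs-site.xml and yarn-site.xml are on the webapp's classpath; the jar path and HDFS input/output paths below are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteJobLauncher {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml / yarn-site.xml from the classpath,
        // so the job is submitted to the sandbox cluster instead of running locally.
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "cleandata");
        // Ship the pre-built jar instead of reconstructing the job from classes.
        job.setJar("/path/to/cleandata.jar");

        // Hypothetical input/output locations on the sandbox's HDFS.
        FileInputFormat.addInputPath(job, new Path("/user/sandbox/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/sandbox/output"));

        // submit() returns immediately; waitForCompletion(true) would block and log progress.
        job.submit();
    }
}
```

This keeps the submission inside the webapp's JVM, with no SSH/JSch round trip, but it needs the same network access to the cluster daemons described above.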

Related

hadoop access without ssh

Is there a way to allow a developer to access a hadoop command line without SSH? I would like to place some hadoop clusters in a specific environment where SSH is not permitted. I have searched for alternatives such as a desktop client but so far have not seen anything. I will also need to federate sign on info for developers.
If you're asking about hadoop fs and similar commands, you don't need SSH for this.
You just need to download the Hadoop clients and configure core-site.xml (and hdfs-site.xml) to point at the remote cluster. However, this is an administrative security hole, so setting up an edge node that does have trusted and audited SSH access is preferred.
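For example, a minimal client-side core-site.xml like the following (the NameNode host and port are placeholders) is enough for hadoop fs commands to target the remote cluster:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://remote-namenode.example.com:8020</value>
  </property>
</configuration>
```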
Similarly, Hive, HBase or Spark jobs can be run with the appropriate clients or configuration files, without any SSH access - just local libraries.
You don't need SSH to use Hadoop. Also, Hadoop is a combination of different stacks - which part of Hadoop are you referring to specifically? If you are talking about HDFS, you can use WebHDFS. If you are talking about YARN, you can use its REST API. There are also various UI tools such as Hue, and notebook apps such as Zeppelin or Jupyter can also be helpful.
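As a quick illustration of the WebHDFS route: HDFS operations are plain REST calls, so reading a file needs no SSH and no Hadoop client libraries at all. The host, port and file path below are placeholders (WebHDFS listens on the NameNode's HTTP port, typically 50070 on Hadoop 2.x or 9870 on 3.x):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class WebHdfsRead {
    // WebHDFS exposes HDFS operations as REST calls under /webhdfs/v1.
    static String webHdfsUrl(String host, int port, String path, String op) {
        return "http://" + host + ":" + port + "/webhdfs/v1" + path + "?op=" + op;
    }

    public static void main(String[] args) throws Exception {
        // OPEN streams a file's contents back over HTTP.
        URL url = new URL(webHdfsUrl("namenode.example.com", 9870, "/tmp/data.txt", "OPEN"));
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setInstanceFollowRedirects(true); // the NameNode redirects to a DataNode
        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```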

Spark Job Submission with AWS Hadoop cluster setup

I have a Hadoop cluster set up in AWS EC2, but my development setup (Spark) is on my local Windows system. When I try to connect to the AWS Hive Thrift server I am able to connect, but I get a connection-refused error when I try to submit a job from my local Spark configuration. Note that on Windows my user name is different from the user name under which the Hadoop ecosystem runs on the AWS server. Can anyone explain how the underlying system works in this setup?
1) When I submit a job from my local Spark to the Hive Thrift server, if any MR job is involved, will the AWS Hive setup submit that job to the NameNode under its own identity, or will it carry forward my local Spark identity?
2) In my configuration, do I need to run Spark locally with the same user name as the Hadoop cluster in AWS?
3) Do I also need to configure SSL to authenticate my local system?
Note that my local system is not part of the Hadoop cluster and cannot be added to it.
Please let me know what the actual setup should be for an environment where my Hadoop cluster is in AWS and Spark runs on my local machine.
To simplify the problem, you are free to compile your code locally, produce an uber/shaded JAR, SCP it to any Spark client node in AWS, then run spark-submit --master yarn --class <classname> <jar-file>.
However, if you want to just Spark against EC2 locally, then you can set a few properties programmatically.
Spark submit YARN mode HADOOP_CONF_DIR contents
Alternatively, as mentioned in that post, the best way would be to get your cluster's XML files from HADOOP_CONF_DIR and copy them into your application's classpath. This is typically src/main/resources for a Java/Scala application.
Not sure about Python, R, or the SSL configs.
And yes, you need to add a remote user account for your local Windows username on all the nodes. This is how user impersonation will be handled by Spark executors.
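A minimal sketch of the programmatic route, assuming the cluster's XML files are in src/main/resources (the hostnames and the remote account name are placeholders; spark.hadoop.* properties are passed through to the underlying Hadoop Configuration):

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RemoteYarnSpark {
    public static void main(String[] args) {
        // Use the remote account name instead of the local Windows user;
        // UserGroupInformation reads HADOOP_USER_NAME when security is off.
        System.setProperty("HADOOP_USER_NAME", "hadoop");

        SparkConf conf = new SparkConf()
                .setAppName("remote-yarn-test")
                .setMaster("yarn")
                // Normally these come from the core-site.xml / yarn-site.xml
                // copied into the classpath; shown explicitly for illustration.
                .set("spark.hadoop.fs.defaultFS", "hdfs://ec2-namenode.example.com:8020")
                .set("spark.hadoop.yarn.resourcemanager.address", "ec2-rm.example.com:8032");

        JavaSparkContext sc = new JavaSparkContext(conf);
        System.out.println("Default parallelism: " + sc.defaultParallelism());
        sc.stop();
    }
}
```

Note the security groups on the EC2 nodes must allow inbound traffic from your local machine on the ResourceManager and NameNode ports, or you will keep getting connection-refused errors.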

Does Embedded flume agent need hadoop to function on cluster?

I am trying to write an embedded Flume agent into my web service to transfer my logs to another Hadoop cluster, where my Flume agent is running. To work with the embedded Flume agent, does Hadoop need to be running on the server where my web service runs?
TLDR: I think, no.
Longer version: I haven't checked, but in the developer guide (https://flume.apache.org/FlumeDeveloperGuide.html#embedded-agent) it says
Note: The embedded agent has a dependency on hadoop-core.jar.
And in the User Guide (https://flume.apache.org/FlumeUserGuide.html#hdfs-sink), you can specify the HDFS path:
HDFS directory path (eg hdfs://namenode/flume/webdata/)
On the other hand, are you sure you want to work with the embedded agent instead of running Flume where you want to put the data and using an HTTP Source, for example? (https://flume.apache.org/FlumeUserGuide.html#http-source) (...or any other source you can send data to)
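To illustrate, the embedded agent is configured entirely in code and only needs the flume-ng-embedded-agent dependency (which pulls in hadoop-core.jar transitively) - not a running Hadoop on the web-service host. This sketch follows the developer guide's pattern; the collector hostname and port are placeholders:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import org.apache.flume.agent.embedded.EmbeddedAgent;
import org.apache.flume.event.EventBuilder;

public class LogShipper {
    public static void main(String[] args) throws Exception {
        // The embedded agent supports only a memory channel and Avro sinks;
        // the remote Flume agent (with an Avro source) does the HDFS writing.
        Map<String, String> properties = new HashMap<>();
        properties.put("channel.type", "memory");
        properties.put("channel.capacity", "200");
        properties.put("sinks", "sink1");
        properties.put("sink1.type", "avro");
        properties.put("sink1.hostname", "collector.example.com");
        properties.put("sink1.port", "5564");

        EmbeddedAgent agent = new EmbeddedAgent("webServiceAgent");
        agent.configure(properties);
        agent.start();

        // Ship one log line to the remote agent.
        agent.put(EventBuilder.withBody("a log line".getBytes(StandardCharsets.UTF_8)));

        agent.stop();
    }
}
```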

Make spark environment for cluster

I made a Spark application that analyzes file data. Since the input file data size could be big, it's not enough to run my application standalone. With one more physical machine, how should I architect it?
I'm considering using Mesos as the cluster manager, but I'm pretty much a noob at HDFS. Is there any way to do it without HDFS (for sharing file data)?
Spark supports several cluster managers: YARN, Mesos and Standalone. You may start with Standalone mode, which means you work with your cluster's own file system.
If you are running on Amazon EC2, you may refer to the following article in order to use Spark's built-in scripts that launch a Spark cluster automatically.
If you are running in an on-prem environment, the way to run in Standalone mode is as follows:
- Start a standalone master:
./sbin/start-master.sh
- The master will print out a spark://HOST:PORT URL for itself. For each worker (machine) in your cluster, use that URL in the following command:
./sbin/start-slave.sh <master-spark-URL>
- To validate that the worker was added to the cluster, open http://localhost:8080 on your master machine; the Spark UI shows more info about the cluster and its workers.
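Once the master and workers are up, pointing an application at the cluster is just a matter of the master URL. A minimal sketch (host and file path are placeholders); note that without HDFS, the input path must exist at the same location on every worker node, e.g. via an NFS mount:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class StandaloneApp {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf()
                .setAppName("file-analyzer")
                // The URL printed by start-master.sh
                .setMaster("spark://master-host:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Without HDFS, this path must be readable at the same
        // location on every worker (shared mount or copied file).
        long lines = sc.textFile("/shared/data/input.txt").count();
        System.out.println("Line count: " + lines);
        sc.stop();
    }
}
```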
There are many more parameters to play with. For more info, please refer to this documentation
Hope I have managed to help! :)

How to renew a delegation token for a long-running application beyond the time set in the Hadoop cluster

I have an Apache Apex application which runs in my Hadoop environment.
I have no problem with the application except that it fails after 7 days. I realized that this is because of a cluster-level setting that applies to any application.
Is there any way I can renew the delegation token periodically, at some interval, to ensure the job runs continuously without failing?
I couldn't find any resources online on how to renew HDFS delegation tokens. Can someone please share your knowledge?
This problem is mentioned in the Apex documentation, which also describes two solutions in detail. The one that is non-intrusive for the Hadoop system is the 'Auto-refresh approach'.
Basically you need to copy your keytab file into HDFS and configure
<property>
  <name>dt.authentication.store.keytab</name>
  <value>hdfs-path-to-keytab-file</value>
</property>
in your dt-site.xml.
HTH
