submit hadoop job on cloudera - hadoop

I am wondering if we can setup a cloudera cluster on amazon and kick off a hadoop job from my local linux without ssh into amazon's node.
Is there anything like a client to do this communication?

The tips from the following tutorial really work. You should be able to put a working Hadoop Cluster in under 20 minutes, from cold iron to production ready, using just his guidance:
Hadoop Quickstart: Build a Cluster In The Cloud In 20 Minutes
Really worth checking it.

You can install an Hadoop client in your local linux and use the "hadoop jar" command with your own jar. Specify the option mapred.job.tracker in the command line and the client will push your jar to the jobtracker and duplicate it in all the tasktrackers that will be used for this job.

Related

Can Luigi run remote Hadoop jobs?

If one of the tasks in the Luigi graph need to run on a remote Hadoop cluster, is that possible? The machine on which Luigi runs is different from the Hadoop cluster. Can luigi still check the if the HDFS file in the remote cluster exists?
I tried to find documentation for this but wasn't able to.
You can run a job that launches any script.
The HDFS target documentation is here:
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.html
https://luigi.readthedocs.io/en/stable/api/luigi.contrib.hdfs.target.html

Hadoop job submission using Apache Ignite Hadoop Accelerators

Disclaimer: I am new to both Hadoop and Apache Ignite. sorry for the lengthy background info.
Setup:
I have installed and configured Apache Ignite Hadoop Accelerator. Start-All.sh brings up the below services. I can submit Hadoop jobs. They complete and I can see results as expected. The start all uses traditional core-site, hdfs-site, mapred-site, and yarn-site configuration files.
28336 NodeManager
28035 ResourceManager
27780 SecondaryNameNode
27429 NameNode
28552 Jps
27547 DataNode
I also have installed Apache Ignite 2.6.0. I am able to start ignite nodes, connect to it using web console. I was able to load the cache from MySQL and run SQL queries and java programs against this cache.
For running Hadoop jobs using ignited Hadoop, I created a separate ignite-config directory, in which I have customized core-site and mapred-site configurations as per the instructions in the Apache ignite web site.
Issue:
When I run a Hadoop job using the command:
hadoop --config ~/ignite-conf jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.1.0.jar wordcount input output1
I get the below error (Note, the same job ran successfully against the Hadoop/without ignite):
java.io.IOException: Failed to get new job ID.
...
...
Caused by: class org.apache.ignite.internal.client.GridClientDisconnectedException: Latest topology update failed.
...
...
Caused by: class org.apache.ignite.internal.client.GridServerUnreachableException: Failed to connect to any of the servers in list: [/:13500]
...
...
It looks like, there was attempt made to lookup the jobtracker (13500) and it was not able to find. From the service list above, it's obvious that job tracker is not running. However, the job ran just fine on non-ignited hadoop over YARN.
Can you help please?
This is resolved in my case.
The job tracker here meant the Apache Ignite memory cache services listening on port 11211.
After making this change in mapred-site.xml, the job ran!

Setting spark yarn client

I would like to make spark yarn client (link). Does it need to install hadoop ? or is it ok to install only yarn? ( by this
link)
No Spark do not require Hadoop for running. Apache Spark is an independent project which can run on its own. If you want you can even run it without apache yarn.
Spark support 3 type of cluster manager which are mesos, yarn and standalone. if you do not have yarn installed then it can use mesos and standalone and by default it uses standalone when you do not mention any preference for cluster manager.Links which you have mentioned is fine to use but I think more better resources are available on google.

where is the hadoop task manager UI

I installed the hadoop 2.2 system on my ubuntu box using this tutorial
http://codesfusion.blogspot.com/2013/11/hadoop-2x-core-hdfs-and-yarn-components.html
Everything worked fine for me and now when I do
http://localhost:50070
I can see the management UI for HDFS. Very good!!
But the I am going through another tutorial which tells me that there must be a task manager UI running at http://mymachine.com:50030 and http://mymachine.com:50060
on my machine I cannot open these ports.
I have already done
start-dfs.sh
start-yarn.sh
start-all.sh
is something wrong? why can't I see the task manager UI?
You have installed YARN (MRv2) which runs the ResourceManager. The URL http://mymachine.com:50030 is the web address for the JobTracker daemon that comes with MRv1 and hence you are not able to see it.
To see the ResourceManager UI, check your yarn-site.xml file for the following property:
yarn.resourcemanager.webapp.address
By default, it should point to : resource_manager_hostname:8088
Assuming your ResourceManager runs on mymachine, you should see the ResourceManager UI at http://mymachine.com:8088/
Make sure all your deamons are up and running before you visit the URL for the ResourceManager.
For Hadoop 2[aka YARN/MRV2] - Any hadoop installation version-ed 2.x or higher its at port number 8088. eg. localhost:8088
For Hadoop 1 - Any hadoop installation version-ed lower than 2.x[eg 1.x or 0.x] its at port number 50030. eg localhost:50030
By default HadoopUI location is as below
http://mymachine.com:50070

hadoop cluster clarification

I am a newbie in hadoop and I am trying to run a hadoop jar on Amazon EC2. I have started my amazon ec2 instance through the console, uploaded my files to the dfs and then was able to successfully run the job jar and generate output on the instance.
But still I am confused on one part. I am not sure if the job was run on a single machine in amazon ec2 or was it ran on a cluster? How do I find the number of worker nodes involved for my jar run?
In some reference links I see we have to use launch-cluster command , for example "bin/hadoop-ec2 launch-cluster test-cluster 2" . What is the difference in starting the instance from the console and using this command like launch-cluster.

Resources