Get status when running job without hadoop - hadoop

When I run a hadoop job with the hadoop application it prints a lot of stuff. Among them, It show the relative progress of the job ("map: 30%, reduce: 0%" and stuff like that). But, when running a job without the application it does not print anything, not even errors. Is there a way to get that level of logging without the application? That is, without running [hadoop_folder]/bin/hadoop jar <my_jar> <indexer> <args>....

You can get this information from Application Master (assuming you use YARN and not MR1 where you would get it from Job Tracker). There is usually web UI where you can find this information. Details will depend on your Hadoop installation / distribution.

In case of Hadoop v1 check Job tracker web URL and in case of Hadoop v2 check Application Master web UI

Related

Spark Launcher Jobs not starting because of token cant be found in cache after 24 hours

I have a Java Application, which runs continuously and checks a table in database for new records. When a New record is added in the table, the Java application do a unzip file and puts into HDFS location and then a Spark Job gets triggered(I am pro-grammatically triggering the Spark Job using 'SparkLauncher" class inside the Java Application), which does the processing for newly added file in HDFS location.
I have scheduled the Java Application in cluster using Oozie Java Action.
The cluster is HDP kerberized cluster.
The Job is working perfectly fine for 24 hours. All the unzip happens and spark job is running.
But after 24 hours the unzip happens in Java Application but the Spark Job is not get triggered in Resource Manager.
Exception : Exception encountered while connecting to the server :INFO: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): token (owner=****, renewer=oozie mr token, realUser=oozie, issueDate=1498798762481, maxDate=1499403562481, sequenceNumber=36550, masterKeyId=619) can't be found in cache
As per my understanding, after 24 hours oozie is renewing the token, and that token is not getting updated for the Spark launcher Job. The spark Launcher is still looking for the older Token which is not available in cache.
Please help me, how I can make Spark Launcher to look for the new-token.
As per my understanding, after 24 hours oozie is renewing the token
Why? Can you point to any documentation, source code, blog?
Remember that Oozie is a scheduler for batch jobs, and its canonical use case (at Yahoo!) is for triggering hourly jobs.
Only a pathological batch job would run for more than 24h, therefore renewal of the Hadoop delegation token is not really useful in Oozie.
But your Java thing acts as a service, running continuously, and needing automatic restart if it ever crashes. So you should consider...
either Slider, if you really want to run it inside YARN (although there
are many, many drawbacks -- how do you inspect the
logs of a running YARN job? how can you make sure that the app starts on time and is not delayed by a lack of resources? how can you make sure that your app will not be killed because YARN needs resources for a high-priority job?) but it is probably overkill for simply running your toy app
or a plain Linux service running on some Edge Node -- it's a Do-It-Yourself task, but not extremely complicated, and there are tutorials on the web
If you insist on using Oozie, in spite of all the limitations of both YARN and Oozie, then you have to change the way your app runs -- for instance, schedule the Coordinator to launch a job every 12h and pass the "nominal time" as Workflow property, edit the Workflow to pass that time to the Java app, edit the Java code so that the app exits at (arg + 11:58) and clears the way for the next exec.

Find and set Hadoop logs to verbose level

I need to track what is happening when I run a job or upload a file to HDFS. I do this using sql profiler in sql server. However, I miss such a tool for hadoop and so I am assuming that I can get some information from logs. I thing all logs are stored at /var/logs/hadoop/ but I am confused with what file I need to look at and how to set that file to capture detailed level information.
I am using HDP2.2.
Thanks,
Sree
'Hadoop' represents an entire ecosystem of different products. Each one has its own logging.
HDFS consists of NameNode and DataNode services. Each has its own log. Location of logs is distribution dependent. See File Locations for Hortonworks or Apache Hadoop Log Files: Where to find them in CDH, and what info they contain for Cloudera.
In Hadoop 2.2, MapReduce ('jobs') is a specific application in YARN, so you are talking about ResourceManager and NodeManager services (the YARN components), each with its own log, and then there is the MRApplication (the M/R component), which is a YARN applicaiton yet with its own log.
Jobs consists of taks, and tasks themselves have their own logs.
In Hadoop 2 there is a dedicated Job History service tasked with collecting and storing the logs from the jobs executed.
Higher level components (eg. Hive, Pig, Kafka) have their own logs, asside from the logs resulted from the jobs they submit (which are logging as any job does).
The good news is that vendor specific distribution (Cloudera, Hortonworks etc) will provide some specific UI to expose the most common logs for ease access. Usually they expose the JobHistory service collected logs from the UI that shows job status and job history.
I cannot point you to anything SQL Profiler equivalent, because the problem space is orders of magnitude more complex, with many different products, versions and vendor specific distributions being involved. I recommend to start by reading about and learning how the Job History server runs and how it can be accessed.

Hadoop ResourceManager not show any job's record

I install Hadoop MultiNode cluster based on this link http://pingax.com/install-apache-hadoop-ubuntu-cluster-setup/
then I try to run wordcount example in my environment, but when I access to Resource Manager http://HadoopMaster:8088 to see the job's details, no records show in UI.
I also search this problem, one guy give the solution like that Hadoop is not showing my job in the job tracker even though it is running but in my case, I'm just running hadoop's example, in which wordcount also didn't add any extra configuration for yarn.
Anyone has install successfully Hadoop2 Muiltinode and Hadoop web UI works correctly can help me about this issue or can give a link to install correctly.
Whether you got the output of word-count job?

Debugging procedure for Hadoop Failed/Hung job

I am learning Hadoop Administration but I don't know how to start debugging if my job is taking more time than its average, or where to start the debugging if my job has failed.
I generally starts with logs in Resource Manager UI but I want to know if there is any other process to debug as a Hadoop Admin. I am looking for a generic approach to follow for debugging Hadoop jobs using Hortonworks Ambari Web UI.
Logs help in case you end up with failed jobs. I am assuming your jobs are successful but slow.
Best place to start debugging for slow running jobs would be to look at the Job Counters (Once you get into the MR App Master page for the Job, you can find the counters link on left panel). Look at Hadoop Definitive Guide's Chapter 8 for details on what each counter means.

hadoop mapreduce - API for getting job log

I’m developing a hadoop mapreduce application and i need to present the end user the task log.
(same as hue does).
is there a java-api that extract the logs of specific job?
i tried "JobClient" API without any success.
the Job Attempts API of the HistoryServer provides a link to the logs of each task

Resources