YARN: get containers by applicationId - hadoop

I'd like to list the nodes on which the containers are running for a particular MR job.
I only have the application_id.
Is it possible to do it with Hadoop REST API and/or through command line?

This can be done using the yarn command.
Run yarn applicationattempt -list <Application Id> to get an app attempt id
Run yarn container -list <Application Attempt Id> to get the container ids
Run yarn container -status <Container Id> to get the host for any particular container.
If you want this in a bash script, or want to get every host for an application with a large number of containers, you will probably want to parse out the attempt/container IDs and hosts, but this is at least a start; a rough sketch of such a script follows below.
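A minimal sketch, assuming the attempt and container IDs sit in the first column of the -list output and that the -status report contains a Host line (column layouts and field names can vary between Hadoop versions, so treat the awk patterns as a starting point):

#!/usr/bin/env bash
# Minimal sketch: print "container_id host" for every container of one application.
APP_ID="$1"   # e.g. application_1414530900704_0007

for ATTEMPT in $(yarn applicationattempt -list "$APP_ID" 2>/dev/null \
                   | awk '$1 ~ /^appattempt_/ {print $1}'); do
  for CONTAINER in $(yarn container -list "$ATTEMPT" 2>/dev/null \
                       | awk '$1 ~ /^container_/ {print $1}'); do
    # Pull the host out of the status report; the "Host" field name is an assumption.
    HOST=$(yarn container -status "$CONTAINER" 2>/dev/null \
             | awk -F':' 'tolower($1) ~ /host/ {gsub(/[[:space:]]/, "", $2); print $2; exit}')
    echo "$CONTAINER $HOST"
  done
done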

You can find them using the ResourceManager UI. Find your application by ID among the existing applications and click on the link with the ID you have. You will see your application stats. Find the tracking URL and click on the 'History' link. There you'll be able to find the tasks of your map and reduce operations. You can open each task and see its information: which node it was assigned to, the number of attempts, the logs for each task and attempt, and lots of other useful information.
To get the container status from the command line, you can use the yarn container -status command, as in the example below.
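For example (the container ID here is purely illustrative, and the exact field name in the report may differ between versions):

# Show only the host/node line of the container status report.
yarn container -status container_1414530900704_0007_01_000001 | grep -i 'host'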

Related

No containers listed by command `yarn container -list <Application Attempt Id>`

In the YARN Node Manager web UI, the total number of allocated containers for application attempt appattempt_1606741386263_0002_000001 is 4, but the containers list for the app attempt in the web UI shows "No data available in table".
The command yarn container -list appattempt_1606741386263_0002_00000 doesn't list any containers either.
The Hadoop version is 3.1.4.
The Linux distribution is Ubuntu 20.04.
Any help will be appreciated!
I started the Timeline Server, and yarn container -list <Application Attempt Id> is now able to print the containers' information, but the web UI still shows "No data available in table". The Timeline Server web UI (port 8188) does show the container information.
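For reference, a rough sketch of the fix on the CLI side, assuming the Timeline Service is already enabled in yarn-site.xml (yarn.timeline-service.enabled is the standard property name, but check your distribution's docs):

# Start the Timeline Server daemon (Hadoop 3.x syntax) and re-check the attempt.
yarn --daemon start timelineserver
yarn container -list appattempt_1606741386263_0002_000001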

Check whether the job is completed or not through unix

I have to run multiple Spark jobs one by one in a sequence, so I am writing a shell script. One way is to check for the success file in the output folder to get the job status, but I want to know whether there is any other way to check the status of a spark-submit job from the Unix script where I am running my jobs.
You can use the command
yarn application -status <APPLICATION ID>
where <APPLICATION ID> is your application ID, and check for a line like:
State : RUNNING
This will give you the status of your application.
To list the applications run via YARN, you can use the command
yarn application -list
You can also add -appTypes to limit the listing to particular application types.
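If you want your shell script to block until the job finishes, a minimal polling sketch could look like this (it assumes the -status output contains lines such as "State : RUNNING" and that FINISHED/FAILED/KILLED are the terminal states):

#!/usr/bin/env bash
# Wait for a YARN application to reach a terminal state, then report it.
APP_ID="$1"   # e.g. application_1606741386263_0002
while true; do
  STATE=$(yarn application -status "$APP_ID" 2>/dev/null \
            | awk -F':' '$1 ~ /^[[:space:]]*State[[:space:]]*$/ {gsub(/[[:space:]]/, "", $2); print $2; exit}')
  case "$STATE" in
    FINISHED|FAILED|KILLED) break ;;
    *) sleep 30 ;;   # still ACCEPTED/RUNNING (or status not yet available)
  esac
done
echo "Application $APP_ID ended in state: $STATE"

For a FINISHED application you can additionally check the Final-State line (SUCCEEDED or FAILED) before deciding whether to launch the next Spark job.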

How to get the job ID of a specific running Hadoop job

I need to get the ID of a specific Hadoop job.
In my case, I launch a Sqoop command remotely and I want to verify the job status with this command:
hadoop job -status job_id | grep -w 'state'
I can get this information from the GUI, but I want to do it from the command line.
Can anyone help me?
You can use the YARN REST APIs, via your browser or curl from the command line. They will list all the currently running and previously run jobs, including Sqoop jobs and the MapReduce jobs that Sqoop generates and executes. Use the UI first: if you have it up and running, just point your browser to http://<host>:8088/cluster (the port may not be the same on all Hadoop distributions; I believe 8088 is the default on Apache). Alternatively, you can use yarn commands directly, e.g. yarn application -list.
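A sketch of the REST route (the host is an assumption for your cluster; 8088 is the usual default ResourceManager web port):

# List running applications as JSON from the ResourceManager REST API.
# The trailing json.tool pipe is just optional pretty-printing.
curl -s "http://<rm-host>:8088/ws/v1/cluster/apps?states=RUNNING" | python3 -m json.tool

# Or narrow it down by application type, e.g. only MapReduce apps:
curl -s "http://<rm-host>:8088/ws/v1/cluster/apps?states=RUNNING&applicationTypes=MAPREDUCE"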

View Log from Map/Reduce Task

I know that I can find the map/reduce task logs under /usr/local/hadoop/logs/userlogs/.
Is there a friendlier way to view them?
For example, when I open http://127.0.0.1:8088/cluster/, I can see all jobs executed on the cluster. Then I click on a FINISHED job. But when I try to click on Tracking URL: History, it gives me an error. Why can't I see the task logs from there?
I would like to see the stderr, stdout and syslog from each task.
Try using Job Browser from HUE
or use the command
yarn logs -applicationId [OPTIONS]
General options:
-appOwner <AppOwner>         app owner (assumed to be the current user if not specified)
-containerId <ContainerId>   container ID (must be specified if a node address is specified)
-nodeAddress <NodeAddress>   node address in the format nodename:port (must be specified if a container ID is specified)
Example: yarn logs -applicationId application_1414530900704_0007
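A couple of usage sketches; note that yarn logs relies on log aggregation (yarn.log-aggregation-enable) being turned on and, on older Hadoop versions, only returns logs once the application has finished. The container ID below is purely illustrative.

# Dump stdout, stderr and syslog of every container into one local file.
yarn logs -applicationId application_1414530900704_0007 > app_0007_logs.txt

# Or fetch the logs of a single container (take a real container ID from
# `yarn container -list <Application Attempt Id>` while the job is running).
yarn logs -applicationId application_1414530900704_0007 \
          -containerId container_1414530900704_0007_01_000002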

Error: Jobflow entered COMPLETED while waiting to ssh

I just started to practice AWS EMR.
I have a sample word-count application set up, run, and completed from the web interface.
Following the guideline here, I have set up the command-line interface.
so when I run the command:
./elastic-mapreduce --list
I receive
j-27PI699U14QHH    COMPLETED    ec2-54-200-169-112.us-west-2.compute.amazonaws.com    Word count
COMPLETED Setup hadoop debugging
COMPLETED Word count
Now, I want to see the log files. I run the command
./elastic-mapreduce --ssh --jobflow j-27PI699U14QHH
Then I receive the following error:
Error: Jobflow entered COMPLETED while waiting to ssh
Can someone please help me understand what's going on here?
Thanks,
When you set up a job on EMR, Amazon provisions a cluster on demand for you for a limited amount of time. During that time, you are free to ssh to your cluster and look at the logs as much as you want, but by the time your job has finished running, your cluster is taken down! At that point, you won't be able to ssh anymore because your cluster simply won't exist.
The workflow typically looks like this:
Create your jobflow
It will sit in the STARTING status for a few minutes. At that point, if you try to run ./elastic-mapreduce --ssh --jobflow <jobid>, it will simply wait because the cluster is not available yet.
After a while the status will switch to RUNNING. If you had already started the ssh command above, it should automatically connect you to your cluster. Otherwise, you can initiate your ssh command now and it should connect you directly without any wait.
Depending on the nature of your job, the RUNNING step could take a while or be very short; it depends on the amount of data you're processing and the nature of your computations.
Once all your data has been processed, the status will switch to SHUTTING_DOWN. At that point, if you had already sshed in, you will get disconnected. If you try to use the ssh command at that point, it will not connect.
Once the cluster has finished shutting down, it will enter a terminal state of either COMPLETED or FAILED, depending on whether your job succeeded or not. At that point your cluster is no longer available, and if you try to ssh you will get the error you are seeing.
Of course there are exceptions: you could set up an EMR cluster in interactive mode, for example if you just want to have Hive available, ssh in, and run Hive queries, in which case you would have to take your cluster down manually. But if you just want a MapReduce job to run, then you will only be able to ssh for the duration of the job.
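If you want to watch those state transitions from a script instead of re-running --list by hand, the modern AWS CLI exposes the same information; a rough sketch, using the cluster ID from the listing above (note that the current state names differ slightly from the old jobflow states described here):

# Poll the cluster state until it reaches a terminal or idle state.
# (aws emr describe-cluster is the current CLI; the old elastic-mapreduce
#  ruby client has since been deprecated.)
while true; do
  STATE=$(aws emr describe-cluster --cluster-id j-27PI699U14QHH \
            --query 'Cluster.Status.State' --output text)
  echo "cluster state: $STATE"
  case "$STATE" in
    TERMINATED|TERMINATED_WITH_ERRORS|WAITING) break ;;
    *) sleep 60 ;;
  esac
done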
That being said, if all you want to do is debugging, there is no need to ssh in the first place! When you create your jobflow, you have the option to enable debugging, so you could do something like this:
./elastic-mapreduce --create --enable-debugging --log-uri s3://myawsbucket
What that means is that all the logs for your job will end up being written to the specified S3 bucket (you have to own this bucket, of course, and have permission to write to it). If you do that, you can then go into the AWS console, in the EMR section, and you will see a debug button next to your job, which should make your life much easier.
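With logging enabled like that, you can also pull the logs straight from S3 once the jobflow has terminated. A sketch, assuming the default layout of one prefix per jobflow ID under the configured log URI (browse the bucket to confirm the exact structure):

# List and download the logs written under the configured --log-uri.
aws s3 ls --recursive s3://myawsbucket/ | grep j-27PI699U14QHH
aws s3 cp --recursive s3://myawsbucket/j-27PI699U14QHH/ ./emr-logs/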
