How to get names of the currently running hadoop jobs? - hadoop

I need to get the list of job names that are currently running, but hadoop job -list gives me only a list of job IDs.
Is there a way to get names of the running jobs?
Is there a way to get the job names from jobIDs?

I've had to do this a number of times so I came up with the following command line that you can throw in a script somewhere and reuse. It prints the jobid followed by the job name.
hadoop job -list | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "hadoop job -status {} | egrep '^tracking' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "echo -n {} | sed 's/.*jobid=//'; echo -n ' ';curl -s -XGET {} | grep 'Job Name' | sed 's/.* //' | sed 's/<br>//'"

If you use Hadoop YARN, don't use mapred job -list (or its deprecated version, hadoop job -list). Just run:
yarn application -appStates RUNNING -list
That also prints out the application/job name. For mapreduce applications you can get the corresponding JobId by replacing the application prefix of the Application-Id with job.
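The prefix swap described above can be sketched as a small shell helper. The function name app_id_to_job_id and the sample ID are made up for illustration; only the replacement of the application prefix with job comes from the answer.

```shell
# Hypothetical helper: turn a YARN Application-Id into the matching MapReduce
# JobId by swapping the "application" prefix for "job".
app_id_to_job_id() {
    printf 'job%s\n' "${1#application}"
}

# Example with a made-up ID:
app_id_to_job_id application_1400000000000_0001
```

The parameter expansion ${1#application} only strips the prefix if it is present, so an ID that does not start with application simply gets job prepended.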

Modifying AnthonyF's script, you can use the following on Yarn:
mapred job -list 2> /dev/null | egrep '^\sjob' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} 2>/dev/null | egrep 'Job File' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "hadoop fs -cat {} 2>/dev/null | egrep 'mapreduce.job.name' | sed 's/.*<value>//' | sed 's/<\/value>.*//'"

If you run $HADOOP_HOME/bin/hadoop job -status <jobid>, you will get a tracking URL in the output. Going to that URL gives you the tracking page, which shows the name:
Job Name: <job name here>
The -status command also prints the path of the job's XML file, which can also be reached from the tracking URL. This file contains a property that holds the job name.
I didn't find a way to access the job name from the command line. Not to say there isn't... but not found by me. :)
The tracking URL and xml file are probably your best options for getting the job name.
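Pulling the name out of the job XML file can be sketched as below. This is only an illustration under two assumptions that are not stated in the answer: that the property is called mapreduce.job.name (older Hadoop 1 releases used mapred.job.name), and that its name and value elements sit on the same line. The function name and the sample line are made up.

```shell
# Assumption: the job file stores the name in a mapreduce.job.name property
# with <name> and <value> on the same line.
job_name_from_xml() {
    grep 'mapreduce.job.name' | sed 's/.*<value>//; s/<\/value>.*//'
}

# Made-up sample line standing in for the output of `hadoop fs -cat <job file>`:
printf '%s\n' '<property><name>mapreduce.job.name</name><value>word count</value></property>' |
    job_name_from_xml
```

If the XML is pretty-printed with one element per line, a line-based filter like this will not work and a real XML tool would be the safer choice.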

You can find this information in the JobTracker UI. There you can see:
Name of the job
State of the job (whether it succeeded or failed)
Start Time
Finish Time
Map % Complete
Reduce % Complete, etc.

In case anyone is interested, here is a more recent command to get the job name :-). It is a modification of Pirooz's command:
mapred job -list 2> /dev/null | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} 2>/dev/null | egrep 'Job File'" | awk '{print $3}' | xargs -n 1 -I{} sh -c "hadoop fs -cat {} 2>/dev/null" | egrep 'mapreduce.job.name' | awk -F"<value>" '{print $2}' | awk -F "</value>" '{print $1}'

I needed to look through history, so I changed mapred job -list to mapred job -list all....
I ended up adding a -L to the curl command, so the block there was:
curl -s -L -XGET {}
This allows for redirection, such as if the job is retired and in the job history. I also found that it's JobName in the history HTML, so I changed the grep:
grep 'Job.*Name'
Plus of course changing hadoop to mapred. Here's the full command:
mapred job -list all | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} | egrep '^tracking' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "echo -n {} | sed 's/.*jobid=//'; echo -n ' ';curl -s -L -XGET {} | grep 'Job.*Name' | sed 's/.* //' | sed 's/<br>//'"
(I also changed around the first grep so that I was only looking at a certain username....YMMV)

How do I redirect a list of IP addresses to a command line function?

I want to see what countries are trying to access my VPS. I have installed a tool called "goiplookup", which was forked from another effort called "geoiplookup". If I type this at the command line:
It returns this:
US, United States
So I figured out how to get a list of IPs that are trying to access my server by using this:
sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}'
Which gives a long list of IPs like this:
I cannot figure out how to get this list of IPs to be processed by the "goiplookup" tool. I tried this:
sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}' | goiplookup
but that did not work. I also tried with no luck:
sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}' | xargs -0 goiplookup
Try this:
sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}' | sort | uniq | xargs -n 1 goiplookup
I added | sort | uniq so that each IP appears only once,
and xargs -n 1 so that each IP found is passed to goiplookup in its own invocation.
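The sort/uniq plus xargs -n 1 pattern can be tried without goiplookup installed. In this sketch the function name lookup_each is made up, echo stands in for goiplookup, and the addresses are documentation-range examples. It also hints at why the xargs -0 attempt failed: -0 expects NUL-separated input, so a newline-separated list arrives as one single argument.

```shell
# Run a lookup command once per unique IP read from stdin.
# $1 is the command to run; "echo" is used below as a stand-in for goiplookup.
lookup_each() {
    sort -u | xargs -n 1 "$1"
}

# Made-up addresses; the duplicate is dropped before the command runs.
printf '%s\n' 203.0.113.7 198.51.100.2 203.0.113.7 | lookup_each echo
```

With the real tool the call would presumably be lookup_each goiplookup at the end of the grep/awk pipeline.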
I would put it into a file and make a small utility to parse it:
sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}' | sort -u > ./file.txt
while read -r line; do
goiplookup "$line"
done < ./file.txt
This will read through the file one line at a time and execute the goiplookup with each IP.
sudo grep disconnect /var/log/auth.log | awk '!/COMMAND/ && !seen[$9]++ {system("geoiplookup \"" $9 "\"")}'
Note that geoiplookup only allows one IP per invocation.
The whole thing can be done in awk, but using grep allows the rest to be run unprivileged.
Consider whether grep -w (match whole word) is appropriate, and in awk you can do a similar thing with !/(^|[^[:alnum:]_])COMMAND($|[^[:alnum:]_])/.
I just made a shell script, which works.
readarray -t array < <(sudo grep "disconnect" /var/log/auth.log | grep -v COMMAND | awk '{print $9}' | sort | uniq)
for ip in "${array[@]}"; do
country=$(/usr/local/bin/goiplookup -c "$ip")
echo "$ip $country"
done

ssh remote command execution quoting and piping awk

I'm working on a script, that should find certain disks and add hostname to them.
I'm using this for 40 servers with a for loop in bash
for i in myservers{1..40}
do ssh user@$i findmnt -o SIZE,TARGET -n -l |
grep '1.8T\|1.6T\|1.7T' |
sed 's/^[ \t]*//' |
cut -d ' ' -f 2 |
awk -v HOSTNAME=$HOSTNAME '{print HOSTNAME ":" $0}'; done |
tee sorted.log
Can you help out with the quoting here? It looks like awk gets HOSTNAME from localhost, not from the remote server.
Everything after the first pipe is running locally, not on the remote server.
Try quoting the entire pipeline to have it run on the remote server:
for i in myservers{1..40}
do ssh user@$i "findmnt -o SIZE,TARGET -n -l |
sed 's/^[ \t]*//' |
cut -d ' ' -f 2 |
awk -v HOSTNAME=\$HOSTNAME '{print HOSTNAME \":\" \$0}'" ;
done | tee sorted.log
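The difference between local and remote expansion can be seen without any servers. In this sketch fake_remote is a hypothetical stand-in for ssh that runs its argument in a child shell, and WHERE is a made-up variable playing the role of HOSTNAME.

```shell
# Hypothetical stand-in for ssh: runs its argument in a child shell whose
# WHERE variable differs from the caller's, as HOSTNAME would on a remote host.
fake_remote() {
    env WHERE=remote sh -c "$1"
}

WHERE=local

fake_remote "echo $WHERE"     # expanded by the calling shell before the child runs
fake_remote "echo \$WHERE"    # escaped, so the child shell expands it
fake_remote 'echo $WHERE'     # single quotes also defer expansion to the child
```

The first call prints the caller's value and the other two print the child's, which is the same reason \$HOSTNAME and \$0 are escaped inside the double-quoted ssh command.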
This is a shorter version of your stuff:
findmnt -o SIZE,TARGET -n -l |
awk -v HOSTNAME=$HOSTNAME '/M/{print HOSTNAME ":" $2}'
Applied to the above:
for i in myservers{1..40}
do ssh user@$i bash -c '
findmnt -o SIZE,TARGET -n -l |
awk -v HOSTNAME=$HOSTNAME '"'"'/M/{print HOSTNAME ":" $2}'"'"' '
done |
tee sorted.log
see: How to escape the single quote character in an ssh / remote bash command?

Error in my Shell Script (at kill command)

ssh user@hostname "ps -ef | grep java | grep dev | kill -9 `awk '{print \$2}'` && nohup java -jar application.jar --server.port=8090&"
Usage: kill [-lL] [-n signum] [-s signame] job ...
Or: kill [ options ] -l [arg ...]
Does anyone know what is causing the error?
The \ in awk print is a syntax error.
Try this:
ps -ef | grep java | grep dev | kill -9 `awk '{print $2}'`
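Since kill takes PIDs as arguments and does not read them from standard input, another way to structure this is to extract the PID column first and hand it to kill through xargs. The following is only a dry-run sketch: the function name pids_matching and the sample process lines are made up, and echo is placed in front of kill so nothing is actually signalled.

```shell
# Print the PID column (field 2 of `ps -ef` output) for lines mentioning both
# "java" and "dev". The two greps mirror the ones in the question.
pids_matching() {
    grep java | grep dev | awk '{print $2}'
}

# Made-up ps-style lines; echo keeps this a dry run that only prints the command.
printf '%s\n' 'app 111 1 java -Denv=dev' 'app 222 1 bash' |
    pids_matching | xargs echo kill -9
```

Run for real, the input would come from ps -ef and the echo would be removed; inside a double-quoted ssh command the $2 would again need escaping as \$2.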

Command composition in bash

So I have the equivalent of a list of files being output by another command, and it looks something like this:
I need to run the XML in each file through xmlstarlet, so I'm doing ... | xargs gzip -d | xmlstarlet ..., except I want xmlstarlet to be called once for each line going into gzip, not on all of the xml documents appended to each other. Is it possible to compose 'gzip -d' 'xmlstarlet ...', so that xargs will supply one argument to each of their composite functions?
Why not read your file and process each line separately in the shell? i.e.
cat "${fileList}" \
| while read -r fName ; do
gzip -dc "${fName}" | xmlstarlet > "${fName}.new"
done
I hope this helps.
Although the right answer is the one suggested by shelter (+1), here is a one-liner "divertimento", provided that the input is what Andrey proposed (a command that generates the list of URLs) :-)
~$ eval $(command | awk '{a=a "wget -O - "$0" | gzip -d | xmlstarlet > $(basename "$0" .gz ).new; " } END {print a}')
It just generates one long command line that runs wget -O - http://foo.xml.gz | gzip -d | xmlstarlet > $(basename foo.xml.gz .gz).new for each of the URLs in the input; the resulting command line is then evaluated.
Use GNU Parallel:
cat filelist | parallel 'zcat {} | xmlstarlet >{.}.out'
or if you want to include the fetching of urls:
cat urls | parallel 'wget -O - {} | zcat | xmlstarlet >{.}.out'
It is easy to read, and you get the added benefit of running one job per CPU in parallel.
If xmlstarlet can operate on stdin instead of having to pass it a filename, then:
some command | xargs -i -n1 sh -c 'zcat "{}" | xmlstarlet options ...'
The xargs option -i means you can use the "{}" placeholder to indicate where the filename should go. Use -n 1 to indicate that xargs should take only one line at a time from its input.
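A variant of the same idea, sketched with the -I{} spelling (GNU xargs documents -i as deprecated in favour of -I) and with the input line passed to sh as $1 instead of being spliced into the command text. The function name each_line is made up, and tr stands in for the real zcat | xmlstarlet pipeline so the example runs anywhere.

```shell
# Run a small shell pipeline once per input line. The line is passed as $1
# rather than spliced into the command text, so odd characters stay data.
each_line() {
    xargs -I{} sh -c "$1" _ {}
}

# Stand-in pipeline: upper-case each name instead of `zcat "$1" | xmlstarlet ...`.
printf '%s\n' a.xml.gz b.xml.gz | each_line 'printf "%s\n" "$1" | tr a-z A-Z'
```

With real files the argument would look something like 'zcat "$1" | xmlstarlet options ...', giving one gzip and one xmlstarlet process per input line.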

bash: comment a long pipeline

I've found that it's quite powerful to create long pipelines in bash scripts, but the main drawback that I see is that there doesn't seem to be a way to insert comments.
As an example, is there a good way to add comments to this script?
#find all my VNC sessions
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace vncconfig -display {} -get desktop \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"
Let the pipe be the last character of each line and use # instead of \, like this:
ls -t $HOME/.vnc/*.pid | #comment here
xargs -n1 | #another comment
This works too:
# comment here
ls -t $HOME/.vnc/*.pid |
#comment here
xargs -n1 |
#another comment
based on
it comes down to s/|//;s!\!|!.
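A self-contained example of the trailing-pipe style, using a made-up function count_unique and ordinary tools instead of the VNC commands so it can be run as is:

```shell
# Count the distinct lines on stdin, with a comment on every pipeline stage.
count_unique() {
    sort |      # bring identical lines next to each other
    uniq |      # collapse each run of duplicates to one line
    wc -l       # count the lines that remain
}

printf '%s\n' b a b | count_unique
```

This works because a line ending in | is syntactically incomplete, so the shell keeps reading on the next line; a backslash continuation would instead swallow the newline and leave no room for a comment.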
Unless they're spectacularly long pipelines, you don't have to comment inline, just comment at the top:
# Find all my VNC sessions.
# xargs does something.
# sed does something else
# the second xargs destroys the universe.
# :
# and so on.
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace /opt/tools/bin/restrict_resources -T1 \
-- vncconfig -display {} -get desktop 2>/dev/null \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"
As long as comments are relatively localised, it's fine. So I wouldn't put them at the top of the file (unless your pipeline was the first thing in the file, of course) or scribbled down on toilet paper and locked in your desk at work.
But the first thing I do when looking at a block is to look for comments immediately preceding the block. Even in C code, I don't comment every line, since the intent of comments is to mostly show the why and a high-level how.
for pid in "$HOME"/.vnc/*.pid; do
disp=${pid##*/}; disp=${disp%.pid} # strip the directory and the .pid suffix
xdpyinfo -display "$disp" | # comment here
egrep "^name|dimensions|depths"
done
I don't understand the need for vncconfig if all it does is append '(user)', which you subsequently remove for the call to xdpyinfo. Also, all those pipes add quite a bit of overhead; if you time your script against mine, I think you'll find the performance comparable, if not faster.
