bash: comment a long pipeline

I've found that it's quite powerful to create long pipelines in bash scripts, but the main drawback that I see is that there doesn't seem to be a way to insert comments.
As an example, is there a good way to add comments to this script?
#find all my VNC sessions
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace vncconfig -display {} -get desktop \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"

Make the pipe the last character of each line; then you can use # comments instead of \ continuations, like this:
ls -t $HOME/.vnc/*.pid | #comment here
xargs -n1 | #another comment
...

This works too:
# comment here
ls -t $HOME/.vnc/*.pid |
#comment here
xargs -n1 |
#another comment
...
Based on https://stackoverflow.com/a/5100821/1019205.
In short, it comes down to dropping the trailing \ and ending each line with the | instead.
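Applied to the pipeline from the question (same commands, just reflowed; the comments are my reading of each stage):
#find all my VNC sessions
ls -t $HOME/.vnc/*.pid |                  # newest sessions first
xargs -n1 |                               # one .pid file per line
sed 's|\.pid$||; s|^.*\.vnc/||g' |        # reduce each path to a display name
xargs -P50 --replace vncconfig -display {} -get desktop |  # query each display
grep "($USER)" |                          # keep only my sessions
awk '{print $1}' |                        # the display is the first field
xargs -n1 xdpyinfo -display |             # dump info for each display
egrep "^name|dimensions|depths"           # keep the interesting lines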

Unless they're spectacularly long pipelines, you don't have to comment inline; just comment at the top:
# Find all my VNC sessions.
# xargs does something.
# sed does something else
# the second xargs destroys the universe.
# :
# and so on.
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace /opt/tools/bin/restrict_resources -T1 \
-- vncconfig -display {} -get desktop 2>/dev/null \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"
As long as comments are relatively localised, it's fine. So I wouldn't put them at the top of the file (unless your pipeline was the first thing in the file, of course) or scribble them down on toilet paper and lock them in your desk at work.
But the first thing I do when looking at a block is to look for comments immediately preceding it. Even in C code, I don't comment every line, since the intent of comments is mostly to show the why and a high-level how.

#!/bin/bash
for pid in "$HOME"/.vnc/*.pid; do
    tmp=${pid##*/}    # strip the leading path
    disp=${tmp%.*}    # strip the .pid extension, leaving the display name
    xdpyinfo -display "$disp" | # comment here
    egrep "^name|dimensions|depths"
done
I don't understand the need for vncconfig if all it does is append '(user)', which you subsequently remove for the call to xdpyinfo. Also, all those pipes add quite a bit of overhead; if you time your script against mine, I think you'll find the performance comparable, if not faster.
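If you want to test the overhead claim, here's a rough comparison; pipeline.sh and loop.sh are hypothetical names for the two versions saved as scripts:
time bash pipeline.sh > /dev/null    # the original multi-pipe version
time bash loop.sh > /dev/null        # the for-loop version above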

Related

Get latest version of Maven with curl

My task is to grab the latest version of Maven in a bash script using curl from https://apache.osuosl.org/maven/maven-3/
The output should be: 3.8.5
curl -s "https://apache.osuosl.org/maven/maven-3/" | grep -o "[0-9]\.[0-9]\.[0-9]" | tail -1
Works for me, thanks!
One way is to:
Use grep to get only the lines containing the folder link
Use sed with a regex to get only the version number
Use sort to sort the lines
Use tail -n1 to get the last line
curl -s https://apache.osuosl.org/maven/maven-3/ \
| grep folder \
| gsed -E 's/.*([[:digit:]]\.[[:digit:]]\.[[:digit:]]).*/\1/' \
| sort \
| tail -n1
Output:
3.8.5
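One caveat: a plain sort is lexical, so a hypothetical 3.10.0 would sort before 3.9.5. A sketch of the same idea using GNU sort -V (version sort) and a regex that allows multi-digit components:
curl -s https://apache.osuosl.org/maven/maven-3/ \
| grep -Eo '[0-9]+\.[0-9]+\.[0-9]+' \
| sort -V \
| tail -n1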
The URL you're mentioning is an HTML document and curl is not an HTML parser, so I'd look into an HTML parser like xidel if I were you:
$ xidel -s "https://apache.osuosl.org/maven/maven-3/" -e '//pre/a[last()]'
3.8.5/
$ xidel -s "https://apache.osuosl.org/maven/maven-3/" -e 'substring-before(//pre/a[last()],"/")'
3.8.5
I would suggest using the Central repository instead of one of the Apache Maven sites, because the information on Central is better suited for machine consumption.
The file maven-metadata.xml contains the appropriate information.
There are two options: using the <version>..</version> tags or the <latest>..</latest> tag.
MVNVERSION=$(curl -s https://repo1.maven.org/maven2/org/apache/maven/maven/maven-metadata.xml \
| grep "<version>.*</version>" \
| tr -s " " \
| cut -d ">" -f2 \
| cut -d "<" -f1 \
| sort -V -r \
| head -1)
The second option:
MVNVERSION=$(curl -s https://repo1.maven.org/maven2/org/apache/maven/maven/maven-metadata.xml \
| grep "<latest>.*</latest>" \
| cut -d ">" -f2 \
| cut -d "<" -f1)
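If xmllint (from libxml2) is available, a proper XML query avoids the grep/cut gymnastics; a minimal sketch, assuming the same maven-metadata.xml layout as above:
MVNVERSION=$(curl -s https://repo1.maven.org/maven2/org/apache/maven/maven/maven-metadata.xml \
| xmllint --xpath 'string(/metadata/versioning/latest)' -)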

Is it possible to comment inline in a multi-line newline escaped script?

When I have long multi-line piped commands in my scripts I would like to comment on what each line does, but I haven't found a way of doing so.
Given this snippet:
git branch -r --merged \
| grep " $remote" \
| egrep -v "HEAD ->" \
| util.esed -n 's/ \w*\/(.*)/\1/p' \
| egrep -v \
"$(skipped $skip | util.esed -e 's/,/|/g' -e 's/(\w+)/^\1$/g' )" \
| paste -s
Is it possible to insert comments in between the lines? It seems that using the backslash to escape the newline prevents me from adding comments at the end of the line, and I can't add the comment before the backslash, as that would hide the escaping.
Pseudo-code of what I would like the above script to look like
It seems I was unclear about what I wanted above, so to give an idea of what I am looking for, it should be something in the vein of this:
git branch -r --merged \ # list merged remote branches
| grep " $remote" \ # filter out the ones for $remote
| egrep -v "HEAD ->" \ # remove some garbage
#strip some whitespace:
| util.esed -n 's/ \w*\/(.*)/\1/p' \
# remove the skipped branches:
| egrep -v \
"$(skipped $skip | util.esed -e 's/,/|/g' -e 's/(\w+)/^\1$/g' )" \
| paste -s # something else
It doesn't have to be exactly like this (obviously, it's not valid syntax), but something similar. If it's not possible directly, due to syntactical restrictions, perhaps it's possible to write self-modifying code that will have comments that are removed before executing it?
You can try something like this:
git branch --remote | # some comment
grep origin | # another comment
tr a-z A-Z
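Applied to the pipeline from the question (same commands; the comments are now in legal positions, since a comment line is also allowed after a trailing pipe):
git branch -r --merged |   # list merged remote branches
grep " $remote" |          # keep only the ones for $remote
egrep -v "HEAD ->" |       # remove some garbage
# strip some whitespace:
util.esed -n 's/ \w*\/(.*)/\1/p' |
# remove the skipped branches:
egrep -v "$(skipped $skip | util.esed -e 's/,/|/g' -e 's/(\w+)/^\1$/g' )" |
paste -s                   # something else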

Wget page title

Is it possible to Wget a page's title from the command line?
input:
$ wget http://bit.ly/rQyhG5
output:
If it’s broke, fix it right - Keeping it Real Estate. Home
This script would give you what you need:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
But there are lots of situations where it breaks, including if there is a <title>...</title> in the body of the page, or if the title is on more than one line.
This might be a little better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but it does not fit your case as your page contains the following head opening:
<head profile="http://gmpg.org/xfn/11">
Again, this might be better:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!' \
| sed -e 's!.*<title>\(.*\)</title>.*!\1!'
but there are still ways to break it, including no head/title in the page.
Again, a better solution might be:
wget --quiet -O - http://bit.ly/rQyhG5 \
| paste -s -d " " \
| sed -n -e 's!.*<head[^>]*>\(.*\)</head>.*!\1!p' \
| sed -n -e 's!.*<title>\(.*\)</title>.*!\1!p'
but I am sure we can find a way to break it. This is why a true XML parser is the right solution, but as your question is tagged shell, the above is the best I can come up with.
The paste and the two seds can be merged into a single sed, but that is less readable. However, this version has the advantage of working on multi-line titles:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;T;s!.*<title>\(.*\)</title>.*!\1!p}'
Update:
As explained in the comments, the last sed above uses the T command, which is a GNU extension. If you do not have a compatible version, you can use:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext;b;:next;s!.*<title>\(.*\)</title>.*!\1!p}'
Update 2:
As the above still does not work on Mac, try:
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -e 'H;${x;s!.*<head[^>]*>\(.*\)</head>.*!\1!;tnext};b;:next;s!.*<title>\(.*\)</title>.*!\1!p'
and/or
cat << EOF > script
H
\$x
\$s!.*<head[^>]*>\(.*\)</head>.*!\1!
\$tnext
b
:next
s!.*<title>\(.*\)</title>.*!\1!p
EOF
wget --quiet -O - http://bit.ly/rQyhG5 \
| sed -n -f script
(Note the \ before the $ to avoid variable expansion.)
It seems that the :next label does not like being prefixed by a $, which could be a problem in some sed versions.
The following will pull whatever lynx thinks the title of the page is, saving you from all of the regex nonsense. Assuming the page you are retrieving is standards compliant enough for lynx, this should not break.
lynx -dump example.com | sed '2q;d'
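In the same spirit, an HTML parser such as xidel (used in the Maven answer elsewhere on this page) can extract the title directly, assuming it is installed:
xidel -s "http://bit.ly/rQyhG5" -e '//title'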

Command composition in bash

So I have the equivalent of a list of files being output by another command, and it looks something like this:
http://somewhere.com/foo1.xml.gz
http://somewhere.com/foo2.xml.gz
...
I need to run the XML in each file through xmlstarlet, so I'm doing ... | xargs gzip -d | xmlstarlet ..., except I want xmlstarlet to be called once for each line going into gzip, not on all of the xml documents appended to each other. Is it possible to compose 'gzip -d' 'xmlstarlet ...', so that xargs will supply one argument to each of their composite functions?
Why not read your file and process each line separately in the shell? i.e.
fileList=/path/to/my/xmlFileList.txt
cat ${fileList} \
| while read fName ; do
    # gzip -dc decompresses to stdout instead of replacing the file
    gzip -dc "${fName}" | xmlstarlet > "${fName}.new"
done
I hope this helps.
Although the right answer is the one suggested by shelter (+1), here is a one-liner "divertimento", provided the input is the one proposed by Andrey (a command that generates the list of URLs) :-)
~$ eval $(command | awk '{a=a "wget -O - "$0" | gzip -d | xmlstarlet > $(basename "$0" .gz ).new; " } END {print a}')
It just generates a multi-command line that does wget http://foo.xml.gz | gzip -d | xmlstarlet > $(basename foo.xml.gz .gz).new for each of the URLs in the input; the resulting command line is then evaluated.
Use GNU Parallel:
cat filelist | parallel 'zcat {} | xmlstarlet >{.}.out'
or if you want to include the fetching of urls:
cat urls | parallel 'wget -O - {} | zcat | xmlstarlet >{.}.out'
It is easy to read and you get the added benefit of having one job per CPU running in parallel. Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
If xmlstarlet can operate on stdin instead of having to pass it a filename, then:
some command | xargs -i -n1 sh -c 'zcat "{}" | xmlstarlet options ...'
The xargs option -i means you can use the "{}" placeholder to indicate where the filename should go. Use -n1 to indicate that xargs should take only one line at a time from its input.
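For instance, with the URL list from the question, a sketch; xmlstarlet fo (which pretty-prints stdin) stands in for whatever xmlstarlet operation you actually need:
some command \
| xargs -n1 -I{} sh -c 'wget -qO- "$1" | zcat | xmlstarlet fo - > "$(basename "$1" .xml.gz).xml"' sh {}
Passing the URL as a positional parameter ($1) instead of substituting {} into the script body avoids quoting problems with odd characters in the URLs.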

How to get names of the currently running hadoop jobs?

I need to get the list of job names that are currently running, but hadoop job -list gives me a list of job IDs.
Is there a way to get the names of the running jobs?
Is there a way to get the job names from the job IDs?
I've had to do this a number of times so I came up with the following command line that you can throw in a script somewhere and reuse. It prints the jobid followed by the job name.
hadoop job -list | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "hadoop job -status {} | egrep '^tracking' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "echo -n {} | sed 's/.*jobid=//'; echo -n ' ';curl -s -XGET {} | grep 'Job Name' | sed 's/.* //' | sed 's/<br>//'"
If you use Hadoop YARN don't use mapred job -list (or its deprecated version hadoop job -list) just do
yarn application -appStates RUNNING -list
That also prints out the application/job name. For mapreduce applications you can get the corresponding JobId by replacing the application prefix of the Application-Id with job.
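For example, a sketch of that substitution on the command line (assuming the usual application_<timestamp>_<sequence> ID format):
yarn application -appStates RUNNING -list 2>/dev/null \
| awk '/^application_/ {print $1}' \
| sed 's/^application/job/'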
Modifying AnthonyF's script, you can use the following on Yarn:
mapred job -list 2> /dev/null | egrep '^\sjob' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} 2>/dev/null | egrep 'Job File' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "hadoop fs -cat {} 2>/dev/null | egrep 'mapreduce.job.name' | sed 's/.*<value>//' | sed 's/<\/value>.*//'"
If you do $HADOOP_HOME/bin/hadoop job -status <jobid> you will get a tracking URL in the output. Going to that URL will give you the tracking page, which has the name:
Job Name: <job name here>
The -status command also gives a file, which can also be seen from the tracking URL. In this file is a mapred.job.name which has the job name.
I didn't find a way to access the job name from the command line. Not to say there isn't... but not found by me. :)
The tracking URL and xml file are probably your best options for getting the job name.
You can find the information in JobTracker UI
You can see
Jobid
Priority
User
Name of the job
State of the job (whether it succeeded or failed)
Start Time
Finish Time
Map % Complete
Reduce % Complete, etc.
Just in case anyone is interested in the latest query to get the job name :-). Modified Pirooz's command:
mapred job -list 2> /dev/null | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} 2>/dev/null | egrep 'Job File'" | awk '{print $3}' | xargs -n 1 -I{} sh -c "hadoop fs -cat {} 2>/dev/null" | egrep 'mapreduce.job.name' | awk -F'<value>' '{print $2}' | awk -F'</value>' '{print $1}'
I needed to look through history, so I changed mapred job -list to mapred job -list all....
I ended up adding a -L to the curl command, so the block there was:
curl -s -L -XGET {}
This allows for redirection, such as if the job is retired and in the job history. I also found that it's JobName in the history HTML, so I changed the grep:
grep 'Job.*Name'
Plus of course changing hadoop to mapred. Here's the full command:
mapred job -list all | egrep '^job' | awk '{print $1}' | xargs -n 1 -I {} sh -c "mapred job -status {} | egrep '^tracking' | awk '{print \$3}'" | xargs -n 1 -I{} sh -c "echo -n {} | sed 's/.*jobid=//'; echo -n ' ';curl -s -L -XGET {} | grep 'Job.*Name' | sed 's/.* //' | sed 's/<br>//'"
(I also changed around the first grep so that I was only looking at a certain username....YMMV)
By typing "jps" in your terminal.
