Command composition in bash

So I have the equivalent of a list of files being output by another command, and it looks something like this:
http://somewhere.com/foo1.xml.gz
http://somewhere.com/foo2.xml.gz
...
I need to run the XML in each file through xmlstarlet, so I'm doing ... | xargs gzip -d | xmlstarlet ..., except I want xmlstarlet to be called once for each line going into gzip, not on all of the XML documents appended to each other. Is it possible to compose 'gzip -d' and 'xmlstarlet ...', so that xargs will supply one argument to their composition?
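Something like the following sketch is the kind of composition I mean (assuming xmlstarlet can read XML from stdin; the sel query, the output naming and process_one are placeholders I made up):
# hypothetical helper: fetch one URL, decompress it, run the XML through xmlstarlet
process_one() {
    wget -q -O - "$1" | gzip -dc | xmlstarlet sel -t -c '/' - > "$(basename "$1" .xml.gz).out"
}
export -f process_one    # make the function visible to the bash that xargs starts
some_command | xargs -I{} bash -c 'process_one "$1"' _ {}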

Why not read your file and process each line separately in the shell? i.e.
fileList=/path/to/my/xmlFileList.txt
cat "${fileList}" \
| while read -r fName ; do
    gzip -dc "${fName}" | xmlstarlet ... > "${fName}.new"
done
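Since the input in the question is actually a stream of URLs produced by another command rather than a local file list, the same loop could read that stream directly; a sketch (some_command stands for the URL-producing command, and the xmlstarlet arguments are placeholders):
some_command \
| while read -r url ; do
    wget -q -O - "$url" | gzip -dc | xmlstarlet sel -t -c '/' - > "$(basename "$url" .gz).new"
done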
I hope this helps.

Although the right answer is the one suggested by shelter (+1), here is a one-liner "divertimento", provided that the input is the one proposed by Andrey (a command that generates the list of URLs) :-)
~$ eval $(command | awk '{a=a "wget -O - "$0" | gzip -d | xmlstarlet > $(basename "$0" .gz ).new; " } END {print a}')
It just generates a multi-command line that does wget -O - http://foo.xml.gz | gzip -d | xmlstarlet > $(basename foo.xml.gz .gz).new for each of the URLs in the input; the resulting command line is then evaluated.

Use GNU Parallel:
cat filelist | parallel 'zcat {} | xmlstarlet >{.}.out'
or if you want to include the fetching of urls:
cat urls | parallel 'wget -O - {} | zcat | xmlstarlet >{.}.out'
It is easy to read and you get the added benefit of having one job per CPU run in parallel. Watch the intro video to learn more: http://www.youtube.com/watch?v=OpaiGYxkSuQ
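By default parallel runs one job per CPU core; if you would rather cap the number of simultaneous downloads, -j sets the job count explicitly (a sketch; -j 4 and the xmlstarlet arguments are assumptions of mine):
cat urls | parallel -j 4 'wget -O - {} | zcat | xmlstarlet sel -t -c / - > {.}.out'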

If xmlstarlet can operate on stdin instead of having to pass it a filename, then:
some command | xargs -i -n1 sh -c 'zcat "{}" | xmlstarlet options ...'
The xargs option -i means you can use the "{}" placeholder to indicate where the filename should go. Use -n 1 to tell xargs to use only one line at a time from its input.
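A slightly safer variant (just a sketch) passes the name to the inner shell as a positional parameter instead of splicing {} into the command string, so names containing quotes or spaces cannot break the command; the xmlstarlet options stay whatever you need:
some command | xargs -I{} sh -c 'zcat "$1" | xmlstarlet options ...' _ {}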

Related

How to pipe multiple commands to bash?

I want to check some file on the remote website.
Here is a bash command that generates the commands to calculate each file's md5:
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\;
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce | md5sum;
wget -q -O - -i https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce | md5sum;
Then I pipe the generated commands to bash, but only the first command is executed.
[root]# head -n 3 zrcpathAll | awk '{print $3}' | xargs -I {} echo wget -q -O - -i {}e \| md5sum\; | bash -v
wget -q -O - -i https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce | md5sum;
3d2f0e76e04444f4ec456ef9f11289ec -
[root]#
Would you please try the following instead:
while read -r _ _ url _; do
    wget -q -O - "$url"e | md5sum
done < <(head -n 3 zrcpathAll)
We should not put -i in front of "$url" here.
[Explanation of the -i option]
The wget manpage says:
-i file
--input-file=file
Read URLs from a local or external file. [snip]
If this function is used, no URLs need be present on the command line. [snip]
If the file is an external one, the document will be automatically treated as html if the Content-Type matches text/html.
Furthermore, the file's location will be implicitly used as base href if none was specified.
where the file contains line(s) of URLs such as:
https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce
https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce
https://example.com/zrc/5300b80d194f677226c4dc6e17ba3b85.zrce
Whereas if we invoke it as wget -i url, wget first downloads the url and treats the downloaded document as a file containing the list of URLs as above. In our case the url is itself the target to download, not a list of URLs, so wget fails with the error: No URLs found in url.
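In other words, -i expects its argument to name a list of URLs, one per line. A sketch of what correct -i usage would look like with a local list file (urllist.txt is a made-up name, and note that this hashes the concatenation of all downloads, which is not what the question wants):
printf '%s\n' \
    https://example.com/zrc/3d2f0e76e04444f4ec456ef9f11289ec.zrce \
    https://example.com/zrc/e1bd7171263adb95fb6f732864ceb556.zrce \
    > urllist.txt    # the file -i expects: target URLs, one per line
wget -q -O - -i urllist.txt | md5sum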
Even if wget fails, why does the command output just one line, not three md5sum results?
This seems to be because the head command immediately flushes the remaining lines when the piped subprocess fails.
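If you prefer to keep xargs rather than switch to a while loop, another option (a sketch that keeps the question's file name and column layout) is to let xargs run the pipeline itself instead of echoing command lines into bash:
head -n 3 zrcpathAll | awk '{print $3}' \
    | xargs -I{} sh -c 'wget -q -O - "$1"e | md5sum' _ {}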

Pipe grep response to a second command?

Here's the command I'm currently running:
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")'
The response from this command is a URL, like this:
$ curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")'
http://google.com
I want to use whatever that URL is to essentially do this:
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | curl 'http://google.com'
Is there any simple way to do this all in one line?
Use xargs with a placeholder for the output from stdin via the -I{} flag, as below. The -r flag ensures that the curl command is not invoked on empty output from the preceding grep.
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | xargs -r -I{} curl {}
A short description of the -I and -r flags from the GNU xargs man page:
-I replace-str
Replace occurrences of replace-str in the initial-arguments with
names read from standard input.
-r, --no-run-if-empty
If the standard input does not contain any nonblanks, do not run
the command. Normally, the command is run once even if there is
no input. This option is a GNU extension
Or, if you are looking for a bash approach without other tools:
curl 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")' | while read -r line; do [ -n "$line" ] && curl "$line"; done
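If the grep is only ever expected to print a single URL, plain command substitution is another option (a sketch using the question's placeholder URLs):
curl "$(curl -s 'http://test.com/?id=12345' | grep -o -P '(?<=content="2;url=).*?(?=")')"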

curl complex usage with pattern

I'm trying to get 2 files using curl based on some pattern but that doesn't seem to work:
Files:
SystemOut_15.04.01_21.12.36.log
SystemOut_15.04.01_15.54.05.log
curl -f -k -u "login:password" https://myserver/cgi-bin/logviewer/index.cgi?getlogfile=SystemOut_15.04.01_21.12.36.log'&'server=qwerty123.com'&'numlines=100000000'&'appenv=MBL%20-%20PROD'&'directory=/app/WAS/was85/profiles/node/logs/mbl-server1
I know there is the -A option, but it doesn't work since my file name is part of the link.
How can I extract those 2 files using a pattern?
I did it myself. One curl gets the list of logs from the webpage; another downloads those files.
The code looks like this:
for file in $(curl -f -k -u "user:pwd" https://selfservice.pwj.com/cgi-bin/logviewer/index.cgi?listdirectory=/app/smx_client_mob/data/log'&'appenv=MBL%20-%20PROD'&'server=xshembl04pap.she.pwj.com |
        grep href |
        sed 's/.*href="//' |
        sed 's/".*//' |
        sed 's/javascript:getLog//g' |
        sed "s/['();]//g" |
        grep -i 'service' |
        grep '^[a-zA-Z].*'); do
    curl -o "$file" -f -k -u "user:pwd" https://selfservice.pwj.com/cgi-bin/logviewer/index.cgi?getlogfile="$file"'&'server=xshembl04pap.she.pwj.com'&'numlines=100000000'&'appenv=MBL%20-%20PROD'&'directory=/app/smx_client_mob/data/log
done
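As an aside, quoting each & separately with '&' works, but it can be easier to read if the whole URL is quoted once; a sketch of the download step written that way (credentials and paths are the answer's own placeholders):
url="https://selfservice.pwj.com/cgi-bin/logviewer/index.cgi?getlogfile=${file}&server=xshembl04pap.she.pwj.com&numlines=100000000&appenv=MBL%20-%20PROD&directory=/app/smx_client_mob/data/log"
curl -o "$file" -f -k -u "user:pwd" "$url"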

How to check broken links in a webpage?

I maintained a list of links to some resources in my blog.
If I find a link is broken, I add a class="broken" to it.
Sometimes the broken links come alive again, so I remove the class="broken".
When the list gets very long, it's very hard to check them one by one.
<ul>
<li>a</li>
<li>b</li>
<li>c</li>
<li>d</li>
</ul>
How to write a bash script to do the editing?
Maybe it's not the answer you're looking for, but why do it from bash instead of making the page use javascript that can do it on request / on the fly? This should get you going: http://www.egrappler.com/jquery-broken-link-checker-plugin-jslink/
But I think it would also be possible to create similar logic on your own with the jQuery $.get / $.load methods.
Not quite an appropriate task for Bash.
Option 1: I'd use Java or Groovy, have a SAX handler simply dump all data to output, except for the <a> elements for which it would check the href value, and if broken, add the class="broken" part.
Option 2: Have an XSLT which would call a custom XSLT function on <a> elements. Again, I'd do this with Java, but any language with a good XSLT engine can do that.
Option 3: If you really, really want to feel geeky ;-) here's a one-liner that gives a quite unreliable link checker for Bash:
grep -R '(?:href="(http://[^"]+)")' -ohPI | grep -oP 'http://[^"]+' | sort | uniq | wget -nv -S -O /dev/null -i - 2>&1 | grep -P '(wget:| -> |HTTP/|Location:)'
It could probably be better, but I was okay with this.
Option 4: You could employ curl -L ... (the -L follows the redirects) instead of wget.
grep -R '(?:"(http://[^"]+)")' -ohPI | grep -v search.maven.org | grep -oP 'http://[^"]+' | sort | uniq | xargs -I{} sh -c 'echo && echo "$1" && curl -i -I -L -m 5 -s -S "$1"' -- {} 2>&1 | grep -P '(^$|curl:|HTTP/|http://|https://|Location:)'
Pro tip: curl seems to have more scripting-friendly output, so you can make it parallel to speed things up: ... | xargs -n 1 -P 8 curl -L ... This will run 8 curl processes and pass one argument (URL) at a time. Sorting out the output is up to you; I'd probably create one file for each curl invocation and then concatenate them.
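For instance, one way to sort out the parallel output (a sketch; urls.txt and the linkcheck/ directory are made-up names) is to give every curl invocation its own output file, named after a hash of the URL, and concatenate them at the end:
mkdir -p linkcheck
xargs -P 8 -I{} sh -c 'curl -s -o /dev/null -L -m 5 -w "%{http_code} %{url_effective}\n" "$1" > "linkcheck/$(printf %s "$1" | md5sum | cut -c1-32)"' _ {} < urls.txt
cat linkcheck/*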

bash: comment a long pipeline

I've found that it's quite powerful to create long pipelines in bash scripts, but the main drawback that I see is that there doesn't seem to be a way to insert comments.
As an example, is there a good way to add comments to this script?
#find all my VNC sessions
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace vncconfig -display {} -get desktop \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"
Let the pipe be the last character of each line and use # instead of \, like this:
ls -t $HOME/.vnc/*.pid | #comment here
xargs -n1 | #another comment
...
This works too:
# comment here
ls -t $HOME/.vnc/*.pid |
#comment here
xargs -n1 |
#another comment
...
Based on https://stackoverflow.com/a/5100821/1019205.
It comes down to s/|//;s!\!|!.
Unless they're spectacularly long pipelines, you don't have to comment inline; just comment at the top:
# Find all my VNC sessions.
# xargs does something.
# sed does something else
# the second xargs destroys the universe.
# :
# and so on.
ls -t $HOME/.vnc/*.pid \
| xargs -n1 \
| sed 's|\.pid$||; s|^.*\.vnc/||g' \
| xargs -P50 --replace /opt/tools/bin/restrict_resources -T1 \
-- vncconfig -display {} -get desktop 2>/dev/null \
| grep "($USER)" \
| awk '{print $1}' \
| xargs -n1 xdpyinfo -display \
| egrep "^name|dimensions|depths"
As long as comments are relatively localised, it's fine. So I wouldn't put them at the top of the file (unless your pipeline was the first thing in the file, of course) or scribbled down on toilet paper and locked in your desk at work.
But the first thing I do when looking at a block is to look for comments immediately preceding the block. Even in C code, I don't comment every line, since the intent of comments is to mostly show the why and a high-level how.
#!/bin/bash
for pid in $HOME/.vnc/*.pid; do
    tmp=${pid##*/}
    disp=${tmp%.*}
    xdpyinfo -display "$disp" |   # comment here
        egrep "^name|dimensions|depths"
done
I don't understand the need for vncconfig if all it does is append '(user)', which you subsequently remove for the call to xdpyinfo. Also, all those pipes add quite a bit of overhead; if you time your script against mine, I think you'll find the performance comparable, if not faster.
