Extract field from xml file - bash

xml file:
<head>
<head2>
<dict type="abc" file="/path/to/file1"></dict>
<dict type="xyz" file="/path/to/file2"></dict>
</head2>
</head>
I need to extract the list of files from this. So the output would be
/path/to/file1
/path/to/file2
So far, I've managed the following:
grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'

Quick and dirty, based on your sample rather than on all the XML possibilities:
# sed, a bit safer
sed -e '/<head>/,/<\/head>/!d' -e '/.*[[:blank:]]file="\([^"]*\)".*/!d' -e 's//\1/' YourFile
# sed in brute force
sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' YourFile
# awk quick unsecure using your sample
awk -F 'file="|">' '/<head>/{h=1} /\/head>/{h=0} h && /[[:blank:]]file/ { print $2 }' YourFile
Now, I don't promote this kind of extraction on XML unless you really know your source's format and content: extra fields, escaped quotes, string content that looks like tags, etc. are a big cause of failure and unexpected results, so use this only when no more appropriate tool is available.
Now, to build on your own script:
#grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
awk '! /<dict.*file=/ {next} {$0=$3;FS="\"";$0=$0;print $2;FS=OFS}' YourFile
There is no need for a grep with awk; use the starting pattern filter /<dict.*file=/ instead.
The second awk, used only to apply a different separator (FS), can be folded into the same script by changing FS. But because a new FS only takes effect at the next evaluation (the next line, by default), you have to force a re-evaluation of the current content with $0=$0.
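As a quick illustration of that re-evaluation trick on one sample line (a sketch, not a general XML parser):

```shell
# Assigning $0=$3 keeps only the third whitespace-separated field.
# The FS change alone would not affect the current line; $0=$0 forces a
# re-split of the shortened record with the new separator.
printf '<dict type="abc" file="/path/to/file1"></dict>\n' |
awk '{ $0 = $3; FS = "\""; $0 = $0; print $2 }'
```

This prints /path/to/file1: after the re-split on double quotes, $2 holds exactly the attribute value.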

Use an xmllint solution with --xpath as //head/head2/dict/@file:
xmllint --xpath "//head/head2/dict/@file" input-xml | awk 'BEGIN{FS="file="}{printf "%s\n%s\n", gensub(/"/,"","g",$2), gensub(/"/,"","g",$3)}'
/path/to/file1
/path/to/file2
Unfortunately I couldn't come up with pure xmllint logic: I thought applying
xmllint --xpath "string(//head/head2/dict/@file)" input-xml
would return the file attribute from both nodes, but it returned only the first instance.
So I coupled my logic with GNU Awk to extract the required values. Running
xmllint --xpath "//head/head2/dict/@file" input-xml
returns values as
file="/path/to/file1" file="/path/to/file2"
On the above output, setting the field separator to file= and removing the double quotes with the gensub() function solved the requirement.

Also PE [perl everywhere :) ] solution:
perl -MXML::LibXML -E 'say $_->to_literal for XML::LibXML->load_xml(location=>q{file.xml})->findnodes(q{/head/head2/dict/@file})'
it prints
/path/to/file1
/path/to/file2
For the above you need to have installed the XML::LibXML module.

With xmlstarlet it would be:
xmlstarlet sel -t -v "//head/head2/dict/@file" -n input.xml

This command:
awk -F'[=" >]' '/file=/ {print $8}' file
produces:
/path/to/file1
/path/to/file2
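Since that field number is tied to the exact attribute layout, a variant that splits on the double quotes instead is slightly less position-sensitive (still sample-specific, not a real XML parse):

```shell
# With FS set to the double quote, the file attribute's value is always
# the field right after the " that follows file= -- here, field 4.
printf '%s\n' '<dict type="abc" file="/path/to/file1"></dict>' \
              '<dict type="xyz" file="/path/to/file2"></dict>' |
awk -F'"' '/file=/ { print $4 }'
```

This prints the two paths, one per line, and ignores the <head>/<head2> wrapper lines because of the /file=/ guard.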

Related

Need string extraction between tags

I have a string named as <tr><td>-Xms36g</td></tr>
I need to extract Xms36g from it and I have tried and ended successfully with
grep -oE '[Xms0-9g]' | xargs | sed 's| ||g'
But I would like to know is there any other way I can achieve this.
Thank you.
Using grep with PCRE (-P)
grep -Po -- '-\K[^<]+'
- matches the - literally, and \K drops everything matched so far from the reported match
[^<]+ gets the portion up to the next <, i.e. our desired portion
With sed:
sed -E 's/^[^-]*-([^<]+)<.*/\1/'
^[^-]*- matches the substring up to the -
The only captured group, ([^<]+), gets the portion up to the next <
<.* matches the rest
In the replacement we use only the captured group
Example:
% grep -Po -- '-\K[^<]+' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
% sed -E 's/^[^-]*-([^<]+)<.*/\1/' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
Parsing HTML with regular expressions is frowned upon. If you have xmllint, which is shipped with libxml2-utils, you can use this:
xmllint --html --xpath '//text()' file
You can also pipe to standard input. In this case you need to use - for the filename:
foo | xmllint --html --xpath '//text()' -
There are seemingly endless ways you could do this. Here's an awk example:
awk -F'-|<' '{print $4}'
Another variation:
awk -F'[-<]' '$0=$4 {print}'
Using sed:
sed -E 's/.*-([^/<>]*).*/\1/'
Using cut:
cut -b 10-15
Using echo:
echo "${str:9:6}"
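The fixed-position variants (cut -b and ${str:9:6}) only hold while the string layout never changes; a quick check on the sample string:

```shell
str='<tr><td>-Xms36g</td></tr>'
echo "$str" | cut -b 10-15   # bytes 10-15 of the line
echo "${str:9:6}"            # offset 9 (0-based), length 6
```

Both lines print Xms36g; any change in tag names or the value's length would shift the offsets and break them.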

Using grep to pull a series of random numbers from a known line

I have a simple scalar file producing strings like...
bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)
Once I use grep to find this line in output.txt, is there a way I can directly grab the "0.9413" portion? I am attempting to make a CSV file and just need whatever value is generated.
Thanks in advance.
There are several ways to combine finding and extracting into a single command:
awk (POSIX-compliant)
awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }' file
sed (GNU sed or BSD/OSX sed)
sed -En 's/^bpred_2lev\.ras_rate\.PP +([^ ]+).*$/\1/p' file
GNU grep
grep -Po '^bpred_2lev\.ras_rate\.PP +\K[^ ]+' file
You can use awk like this:
grep <your_search_criteria> output.txt | awk '{ print $2 }'
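To produce the CSV the question mentions, the extracted value can simply be glued onto a line; a sketch (the ras_rate column name is made up here, and the line is inlined instead of read from output.txt):

```shell
# Extract the second field of the matching line, then emit a CSV row.
line='bpred_2lev.ras_rate.PP 0.9413 # RAS prediction rate (i.e., RAS hits/used RAS)'
rate=$(printf '%s\n' "$line" | awk '$1 == "bpred_2lev.ras_rate.PP" { print $2 }')
printf 'ras_rate,%s\n' "$rate"   # -> ras_rate,0.9413
```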

How to grep with a pattern that includes a " quotation mark?

I want to grep a line that includes a quotation mark, more specifically I want to grep lines that include a " mark.
more specifically I want to grep lines like:
#include "something.h"
then pipe into sed to just return something.h
A single grep will do this job.
grep -oP '(?<=")[^"]*(?=")' file
Example:
$ echo '#include "something.h"' | grep -oP '(?<=")[^"]*(?=")'
something.h
sed '#n
/"/ s/.*"\([^"]*\)" *$/\1/p' YourFile
There is no need for grep (unless you want its performance on a huge file) with sed; sed can filter and transform the content directly.
In your case, the /"/ address would probably be refined to /#include *"/.
In case of several quoted strings on a line:
sed '#n
/"/ {s/"[^"]*$/"/;s/[^"]*"\([^"]*\)" */\1/gp;}' YourFile
You can use awk to get included filename:
awk -F'"' '{print $2}' file.c
something.h

Reading numbers from a text line in bash shell

I'm trying to write a bash shell script, that opens a certain file CATALOG.dat, containing the following lines, made of both characters and numbers:
event_0133_pk.gz
event_0291_pk.gz
event_0298_pk.gz
event_0356_pk.gz
event_0501_pk.gz
What I wanna do is print the numbers (only the numbers) inside a new file NUMBERS.dat, using something like > ./NUMBERS.dat, to get:
0133
0291
0298
0356
0501
My problem is: how do I extract the numbers from the text lines? Is there something to make the script read just the number as a variable, like event_0%d_pk.gz in C/C++?
A grep solution:
grep -oP '[0-9]+' CATALOG.dat >NUMBERS.dat
A sed solution:
sed 's/[^0-9]//g' CATALOG.dat >NUMBERS.dat
And an awk solution:
awk -F"[^0-9]+" '{print $2}' CATALOG.dat >NUMBERS.dat
There are many ways that you can achieve your result. One way would be to use awk:
awk -F_ '{print $2}' CATALOG.dat > NUMBERS.dat
This sets the field separator to an underscore, then prints the second field which contains the numbers.
Awk
awk 'gsub(/[^[:digit:]]/,"")' infile
Bash
while read line; do echo ${line//[!0-9]}; done < infile
tr
tr -cd '[:digit:]\n' <infile
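A quick check of the tr approach on the sample data (tr deletes, -d, the complement, -c, of the given set, so only digits and newlines survive):

```shell
# Everything that is neither a digit nor a newline is deleted.
printf 'event_0133_pk.gz\nevent_0291_pk.gz\n' | tr -cd '[:digit:]\n'
```

This prints 0133 and 0291 on separate lines; keeping \n in the set is what preserves the line structure.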
You can use grep command to extract the number part.
grep -oP '(?<=_)\d+(?=_)' CATALOG.dat
gives output as
0133
0291
0298
0356
0501
Or, more simply:
grep -oP '\d+' CATALOG.dat
You don't need perl mode in grep for this. BREs can do this.
grep -o '[[:digit:]]\+' CATALOG.dat > NUMBERS.dat

shell command to truncate/cut a part of string

I have a file with the below contents. I've got a command that prints the version number out of it, but I need to truncate the last part of the version.
file.spec:
Version: 3.12.0.2
Command used:
VERSION=($(grep -r "Version:" /path/file.spec | awk '{print ($2)}'))
echo $VERSION
Current output : 3.12.0.2
Desired output : 3.12.0
There is absolutely no need for external tools like awk, sed, etc. for this simple task if your shell is POSIX-compliant (which it should be) and supports parameter expansion:
$ cat file.spec
Version: 3.12.0.2
$ version=$(<file.spec)
$ version="${version#* }"
$ version="${version%.*}"
$ echo "${version}"
3.12.0
Try this:
VERSION=($(grep -r "Version:" /path/file.spec| awk '{print ($2)}' | cut -d. -f1-3))
cut splits the string on the field delimiter (-d); you then select the desired fields with the -f parameter.
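A minimal check of the cut step in isolation (the grep stage is replaced here by an inlined sample line):

```shell
# Take the second whitespace field, then keep dot-separated fields 1-3.
printf 'Version: 3.12.0.2\n' | awk '{print $2}' | cut -d. -f1-3
```

This prints 3.12.0, dropping only the last dot-separated component.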
You could use this single awk script awk -F'[ .]' '{print $2"."$3"."$4}':
$ VERSION=$(awk -F'[ .]' '{print $2"."$3"."$4}' /path/file.spec)
$ echo $VERSION
3.12.0
Or this single grep
$ VERSION=$(grep -Po 'Version: \K\d+[.]\d+[.]\d' /path/file.spec)
$ echo $VERSION
3.12.0
But you never need grep and awk together.
If you only grep a single file, -r makes no sense.
Also, based on the output of your command line, this grep should work:
grep -Po '(?<=Version: )(\d+\.){2}\d+' /path/file.spec
gives you:
3.12.0
The \K is also nice: it works for both fixed- and variable-length look-behind (available since PCRE 7.2), and there is another answer about it. But I feel a look-behind is easier to read when it is fixed-length.
