I am using an awk command to fetch values from XML elements, using the command below:
$(awk -F "[><]" '/'$tag_name'/{print $3}' $FILE_NAME | sort | uniq)
Here:
FILE_NAME: the XML file.
tag_name: the name of the XML element whose value we need.
Sample XML
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
One of the tags in the XML contains CDATA, and for that tag the script is not working as expected: when I try to print the value, it prints a blank.
Instead of using a generic tool like awk, which is not aware of XML's specificities, I suggest you use xmlstarlet to select the nodes you want. For instance:
xmlstarlet select -t -v '//tag1' -n input.xml
will give as result:
test
Issuing:
xmlstarlet select -t -v '//tag2' -n input.xml
gives as output:
cdata_test
If you don't need the newline at the end of the returned string, just remove the -n from the options of the xmlstarlet command.
Keep it simple.
As xmlstarlet is not installed on my machine, I used sed before my awk command as follows, and that works for me:
$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' ${FILE_NAME} | awk -F "[><]" '/'$tag_name'/{print $3}' | sort | uniq)
Also, if anybody has any other solution, that is welcome too.
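Putting the pieces together, here is a minimal end-to-end sketch of the sed + awk approach on the sample XML from the question (the file path and tag name are illustrative, and the CDATA markers' brackets are escaped so sed treats them literally):

```shell
# Build the sample file from the question (path is illustrative)
cat > /tmp/sample.xml <<'EOF'
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
EOF

tag_name="tag2"
# Strip the CDATA wrapper first, then split on < and > as before
value=$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' /tmp/sample.xml \
    | awk -F "[><]" '/'"$tag_name"'/{print $3}' | sort | uniq)
echo "$value"    # cdata_test
```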
Related
Hi, I have an XML file containing a line like this:
<parameter name="END_DATE">20181031</parameter>
I want to replace this tag's value with some other value. I tried this:
dt=$(awk -F '[<>]' '/_DATE/{print $3}' test.xml)
I extracted the tag's value.
I have another variable with a value like this:
newdt=20181108
Now I need to replace the extracted value in the file with this new value.
Any help would be appreciated.
Though Chepner is right that awk and sed are not the right tools for XML, in case you do NOT have xmlstarlet on your system, try the following.
echo $newdt
20181108
awk -v dat="$newdt" 'match($0,/>[0-9]+</){$0=substr($0,1,RSTART) dat substr($0,RSTART+RLENGTH-1)} 1' Input_file
If sed works for you -
sed -Ei 's/( name="END_DATE")>20181031</\1>20181108</' test.xml
An XML parser is probably a better idea, though.
If you need to embed the variable -
sed -Ei "s/( name=\"END_DATE\")>20181031</\1>$newdt</" test.xml
xml file:
<head>
<head2>
<dict type="abc" file="/path/to/file1"></dict>
<dict type="xyz" file="/path/to/file2"></dict>
</head2>
</head>
I need to extract the list of files from this. So the output would be
/path/to/file1
/path/to/file2
So far, I've managed to the following.
grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
Quick and dirty, based on your sample rather than on all XML possibilities:
# sed a bit secure
sed -e '/<head>/,/<\/head>/!d' -e '/.*[[:blank:]]file="\([^"]*\)".*/!d' -e 's//\1/' YourFile
# sed in brute force
sed -n 's/.*[[:blank:]]file="\([^"]*\)".*/\1/p' YourFile
# awk, quick and unsecure, based on your sample
awk -F 'file="|">' '/<head>/{h=1} /\/head>/{h=0} h && /[[:blank:]]file/ { print $2 }' YourFile
Now, I don't promote this kind of extraction on XML unless you really know the format and content of your source: extra fields, escaped quotes, tag-like strings inside content, etc. are a big cause of failures and unexpected results when no more appropriate tool is available.
Now, to use your own script:
#grep "<dict*file=" /path/to/xml.file | awk '{print $3}' | awk -F= '{print $NF}'
awk '! /<dict.*file=/ {next} {$0=$3;FS="\"";$0=$0;print $2;FS=OFS}' YourFile
No need for a grep with awk; use the starting pattern filter /<dict.*file=/ instead.
The second awk, which uses a different field separator (FS), can be folded into the same script by changing FS; but because a change to FS only takes effect at the next record (the next line by default), you have to force a re-evaluation of the current line with $0=$0, as done here.
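A sketch of that single-awk rewrite on the sample file, showing the $0=$0 re-split in action:

```shell
cat > /tmp/dicts.xml <<'EOF'
<head>
<head2>
<dict type="abc" file="/path/to/file1"></dict>
<dict type="xyz" file="/path/to/file2"></dict>
</head2>
</head>
EOF

# Keep only <dict ... file=...> lines; reduce the record to field 3,
# switch FS to a double quote, force a re-split with $0=$0, print the path,
# then restore FS for the next record
paths=$(awk '! /<dict.*file=/ {next} {$0=$3; FS="\""; $0=$0; print $2; FS=OFS}' /tmp/dicts.xml)
echo "$paths"
```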
Use an xmllint solution with --xpath as //head/head2/dict/@file:
xmllint --xpath "//head/head2/dict/@file" input-xml | awk 'BEGIN{FS="file="}{printf "%s\n%s\n", gensub(/"/,"","g",$2), gensub(/"/,"","g",$3)}'
/path/to/file1
/path/to/file2
Unfortunately I couldn't provide pure xmllint logic: I thought that applying
xmllint --xpath "string(//head/head2/dict/@file)" input-xml
would return the file attributes from both nodes, but it returns only the first instance.
So I coupled my logic with GNU awk to extract the required values. Running
xmllint --xpath "//head/head2/dict/@file" input-xml
returns values as
file="/path/to/file1" file="/path/to/file2"
On the above output, setting the field separator to file= and removing the double quotes with the gensub() function solved the requirement.
Also PE [perl everywhere :) ] solution:
perl -MXML::LibXML -E 'say $_->to_literal for XML::LibXML->load_xml(location=>q{file.xml})->findnodes(q{/head/head2/dict/@file})'
it prints
/path/to/file1
/path/to/file2
For the above you need to have installed the XML::LibXML module.
With xmlstarlet it would be:
xmlstarlet sel -t -v "//head/head2/dict/@file" -n input.xml
This command, which relies on the fields lining up exactly as in your sample (the path has to land in field 12):
awk -F'[=" ">]' '{print $12}' file
will produce:
/path/to/file1
/path/to/file2
I have a string such as <tr><td>-Xms36g</td></tr>.
I need to extract Xms36g from it, and I tried and succeeded with
grep -oE '[Xms0-9g]' | xargs | sed 's| ||g'
But I would like to know whether there is any other way I can achieve this.
Thank you.
Using grep with PCRE (-P)
grep -Po -- '-\K[^<]+'
- matches the leading - literally, and \K discards everything matched so far from the reported match
[^<]+ then matches the portion up to the next <, i.e. our desired substring
With sed:
sed -E 's/^[^-]*-([^<]+)<.*/\1/'
^[^-]*- matches the substring up to and including the -
The only captured group, ([^<]+), captures the portion up to the next <
<.* matches the rest
In the replacement we use only the captured group
Example:
% grep -Po -- '-\K[^<]+' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
% sed -E 's/^[^-]*-([^<]+)<.*/\1/' <<<'<tr><td>-Xms36g</td></tr>'
Xms36g
Parsing HTML with regular expressions is frowned upon. If you have xmllint, which ships with libxml2-utils, you can use this:
xmllint --html --xpath '//text()' file
You can also pipe to standard input. In this case you need to use - for the filename:
foo | xmllint --html --xpath '//text()' -
There are seemingly endless ways you could do this. Here's an awk example:
awk -F'-|<' '{print $4}'
Another variation:
awk -F'[-<]' '$0=$4 {print}'
Using sed:
sed -E 's/.*-([^/<>]*).*/\1/'
Using cut:
cut -b 10-15
Using echo:
echo "${str:9:6}"
How to extract a single value from a given json?
{
"Vpc": {
"InstanceTenancy": "default",
"State": "pending",
"VpcId": "vpc-123",
"CidrBlock": "10.0.0.0/16",
"DhcpOptionsId": "dopt-123"
}
}
Tried this but with no luck:
grep -e '(?<="VpcId": ")[^"]*'
You probably wanted -Po, which works with your regex:
$ grep -oP '(?<="VpcId": ")[^"]*' infile
vpc-123
If GNU grep with its -P option isn't available, we can't use look-arounds and have to resort to for example using grep twice:
$ grep -o '"VpcId": "[^"]*' infile | grep -o '[^"]*$'
vpc-123
The first one extracts up to and excluding the closing quotes, the second one searches from the end of the line for non-quotes.
But, as mentioned, you'd be better off properly parsing your JSON. Apart from jq mentioned in another answer, I know of
Jshon
JSON.sh
A jq solution would be as simple as this:
$ jq '.Vpc.VpcId' infile
"vpc-123"
Or, to get raw output instead of JSON:
$ jq -r '.Vpc.VpcId' infile
vpc-123
Something like
grep '^ *"VpcId":' json.file \
| awk '{ print $2 }' \
| sed -e 's/,$//' -e 's/^"//' -e 's/"$//'
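A runnable sketch of that grep | awk | sed pipeline against the sample JSON (the file path is illustrative):

```shell
cat > /tmp/vpc.json <<'EOF'
{
    "Vpc": {
        "InstanceTenancy": "default",
        "State": "pending",
        "VpcId": "vpc-123",
        "CidrBlock": "10.0.0.0/16",
        "DhcpOptionsId": "dopt-123"
    }
}
EOF

# grep isolates the line, awk takes the value token,
# sed strips the trailing comma and the surrounding quotes
id=$(grep '^ *"VpcId":' /tmp/vpc.json \
    | awk '{ print $2 }' \
    | sed -e 's/,$//' -e 's/^"//' -e 's/"$//')
echo "$id"    # vpc-123
```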
you can do:
sed -r -n -e '/^[[:space:]]*"VpcId":/s/^[^:]*: *"(.*)", *$/\1/p'
But really, using shell tools to run regexes over JSON content is a bad idea. You should consider a much saner language, like Python:
python -c 'import json, sys; print(json.loads(sys.stdin.read())["Vpc"]["VpcId"]);'
Try this regex pattern:
\"VpcId\":\s?(\"\S+\")
If you can install a tool, I would suggest using jq. It allows very simple querying, with great support for piping too.
The OP asks for solutions using grep. In case that just means working from the terminal, the node CLI is an alternative, since its JSON support is complete. One option is the command node --eval "script":
echo '{"key": 42}' \
| node -e 'console.log(JSON.parse(require("fs").readFileSync(0).toString()).key)'    # prints 42
I have these two lines, and I am trying to get them into one line as a variable in bash.
Initial values:
DEST
none
and I would like this as the result:
DEST="none"
Many thanks in advance for any suggestion,
Al.
You can use the paste command for that:
echo -e "DEST\nnone" | paste -s -d '='
or
cat <file> | paste -s -d '='
You can use the following awk command:
awk '!(NR%2){print s,$1}NR%2{s=$1}' OFS== <file>
Depending on the contents of the file, you might need to enclose the value (every second line) in quotes:
awk '!(NR%2){print s,"\""$1"\""}NR%2{s=$1}' OFS== <file>
This would give you:
DEST="none"