How to parse xml using xmllint and store in arrays

How to parse xml using xmllint and store in arrays - bash

In a shell script, I have an XML file, p.xml, shown below. I want to parse it and store the values in two arrays. I am trying to use xmllint but cannot get the desired data.
<?xml version="1.0" encoding="UTF-8"?>
<Share_Collection>
<Share id="data/Backup" resource-id="data/Backup" resource-type="SimpleShare" share-name="Backup" protocols="cifs,afp"/>
<Share id="data/Documents" resource-id="data/Documents" resource-type="SimpleShare" share-name="Documents" protocols="cifs,afp"/>
<Share id="data/Music" resource-id="data/Music" resource-type="SimpleShare" share-name="Music" protocols="cifs,afp"/>
<Share id="data/OwnCloud" resource-id="data/OwnCloud" resource-type="SimpleShare" share-name="OwnCloud" protocols="cifs,afp"/>
<Share id="data/Pictures" resource-id="data/Pictures" resource-type="SimpleShare" share-name="Pictures" protocols="cifs,afp"/>
<Share id="data/Videos" resource-id="data/Videos" resource-type="SimpleShare" share-name="Videos" protocols="cifs,afp"/>
</Share_Collection>
I want one array containing all the share ids and one array containing the share-names. The two arrays would look like this:
share-ids-array = ["data/Backup", "data/Documents", "data/Music", "data/OwnCloud", "data/Pictures", "data/Videos"]
share-names-array = ["Backup", "Documents", "Music", "OwnCloud", "Pictures", "Videos"]
I started as follows:
xmllint --xpath '//Share/@id' p.xml
xmllint --xpath '//Share/@share-name' p.xml
that gives me
id="data/Backup"
id="data/Documents" id="data/Music" id="data/OwnCloud" id="data/Pictures" id="data/Videos"
Any help to build those two arrays will be appreciated.

Here is one solution with grep (and tr); sed or awk are other alternatives. By the way, you cannot use hyphens in variable names in bash.
share_ids=($( xmllint --xpath '//Share/@id' p.xml | grep -Po '".*?"' | tr -d \" ))
share_names=($( xmllint --xpath '//Share/@share-name' p.xml | grep -Po '".*?"' | tr -d \" ))
Example:
$ echo "${share_names[@]}"
Backup Documents Music OwnCloud Pictures Videos
Using xmlstarlet is probably better, though:
share_names=($( xmlstarlet sel -T -t -m '//Share/@share-name' -v '.' -n p.xml ))
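One caveat with the array=($( ... )) form is that it relies on word splitting, so a value containing whitespace would end up as several elements. Below is a minimal sketch that avoids this by pulling the quoted values out with a bash regex loop. The raw string is hard-coded as a stand-in for the xmllint output so that the sketch runs on its own (in practice it would come from raw=$(xmllint --xpath '//Share/@id' p.xml)), and the "data/My Files" value is hypothetical, added only to show the whitespace case.

```shell
#!/usr/bin/env bash
# Stand-in for: raw=$(xmllint --xpath '//Share/@id' p.xml)
# The last value ("data/My Files") is hypothetical and not in the original p.xml.
raw=' id="data/Backup" id="data/Documents" id="data/My Files"'

# Group 1 captures one quoted value, group 2 captures the rest of the string.
re='="([^"]*)"(.*)'

share_ids=()
while [[ $raw =~ $re ]]; do
  share_ids+=("${BASH_REMATCH[1]}")   # keep the value as a single element
  raw=${BASH_REMATCH[2]}              # continue with the unparsed remainder
done

# Print one element per line to show that spaces are preserved.
printf '%s\n' "${share_ids[@]}"
```

This only post-processes the attribute="value" text that xmllint prints; it does not decode entities such as &amp;quot; inside values.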

Related

Is it possible to use sed instead of Grep -oP to extract a word? [duplicate]

Sometimes I need to quickly extract some arbitrary data from XML files to put into a CSV format. What's your best practices for doing this in the Unix terminal? I would love some code examples, so for instance how can I get the following problem solved?
Example XML input:
<root>
<myel name="Foo" />
<myel name="Bar" />
</root>
My desired CSV output:
Foo,
Bar,

Peter's answer is correct, but it outputs a trailing line feed.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="root">
<xsl:for-each select="myel">
<xsl:value-of select="@name"/>
<xsl:text>,</xsl:text>
<xsl:if test="not(position() = last())">
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Just run e.g.
xsltproc stylesheet.xsl source.xml
to generate the CSV results into standard output.

Use a command-line XSLT processor such as xsltproc, saxon or xalan to parse the XML and generate CSV. For your case, the stylesheet would be:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="root">
<xsl:apply-templates select="myel"/>
</xsl:template>
<xsl:template match="myel">
<xsl:for-each select="@*">
<xsl:value-of select="."/>
<xsl:value-of select="','"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>

If you just want the name attributes of any element, here is a quick but incomplete solution.
(Your example text is in the file example)
grep "name" example | cut -d'"' -f2 | xargs -I{} echo "{},"

XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information, see XMLStarlet Command Line XML Toolkit)
No files to write, just pipe your file to xmlstarlet and apply an xpath filter.
cat file.xml | xml sel -t -m 'xpathExpression' -v 'elemName' 'literal' -v 'elname' -n
-m expression
-v value
'' included literal
-n newline
So for your case the xpath expression would be //myel/@name
which would provide the two attribute values.
Very handy tool.

Here's a little ruby script that does exactly what your question asks (pull an attribute called 'name' out of elements called 'myel'). Should be easy to generalize
#!/usr/bin/ruby -w
require 'rexml/document'
xml = REXML::Document.new(File.open(ARGV[0].to_s))
xml.elements.each("//myel") { |el| puts "#{el.attributes['name']}," if el.attributes['name'] }

Using xidel:
xidel -s input.xml -e '//myel/concat(@name,",")'

Answering the original question, assuming the XML file is "test.xml" and contains:
<root>
<myel name="Foo" />
<myel name="Bar" />
</root>
tr -s "\"" " " < test.xml | awk '{printf "%s,\n", $3}'

Your test file is in test.xml.
sed -n 's/^\s*<myel\s*name="\([^"]*\)".*$/\1,/p' test.xml
It has its pitfalls: for example, if it is not guaranteed that each myel is on its own line, you have to "normalize" the XML file first (so that each myel is on a separate line).

yq can be used for XML parsing.
It is a lightweight and portable command-line YAML processor and can also deal with XML.
The syntax is similar to jq
Input
<root>
<myel name="Foo" />
<myel name="Bar">
<mysubel>stairway to heaven</mysubel>
</myel>
</root>
usage example 1
yq e '.root.myel.0.+name' $INPUT (version >= 4.30: yq e '.root.myel.0.+@name' $INPUT)
Foo
usage example 2
yq has a nice builtin feature to make XML easily grep-able
yq --input-format xml --output-format props $INPUT
root.myel.0.+name = Foo
root.myel.1.+name = Bar
root.myel.1.mysubel = stairway to heaven
usage example 3
yq can also convert an XML input into JSON or YAML
yq --input-format xml --output-format json $INPUT
{
"root": {
"myel": [
{
"+name": "Foo"
},
{
"+name": "Bar",
"mysubel": "stairway to heaven"
}
]
}
}
yq --input-format xml $FILE (YAML is the default format)
root:
myel:
- +name: Foo
- +name: Bar
mysubel: stairway to heaven

How to get attribute values of multiple nodes in xpath with just xmllint?

I want to query the names of all the persons in the test.xml below.
<body>
<person name="abc"></person>
<person name="def"></person>
<person name="ghi"></person>
</body>
basic query
This has the problem of including "name", which I don't want.
$ xmllint --xpath '//body/person/@name' test.xml
name="abc"
name="def"
name="ghi"
string function
Using the string function, I only get one result.
$ xmllint --xpath 'string(//body/person/@name)' test.xml
abc
sed and grep
This works but looks needlessly complicated to me.
xmllint --xpath '//body/person/@name' test.xml | grep -o '"\([^"]*\)"' | sed 's|"||g'
abc
def
ghi
Question
Is it possible to get multiple values without the attribute name and without using another tool like grep?

I don't know about xmllint, but xmlstarlet can do it:
xmlstarlet sel -t -v 'body/person/@name' test.xml
Output:
abc
def
ghi
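For completeness, here is a sketch that stays within xmllint. In XPath 1.0, string() applied to a node-set returns the value of the first node only, which is why the attempt in the question printed just abc. One workaround is to count the nodes and then query them one at a time by position. This assumes bash and an installed xmllint; the temporary file simply recreates the test.xml from the question so that the example is self-contained.

```shell
#!/usr/bin/env bash
# Recreate the question's test.xml in a temporary file.
xml=$(mktemp)
cat > "$xml" <<'EOF'
<body>
<person name="abc"></person>
<person name="def"></person>
<person name="ghi"></person>
</body>
EOF

# count() returns the number of matching nodes.
count=$(xmllint --xpath 'count(//body/person)' "$xml")

# Ask for each node by position; string() then yields exactly one value.
names=()
for ((i = 1; i <= count; i++)); do
  names+=("$(xmllint --xpath "string((//body/person)[$i]/@name)" "$xml")")
done

printf '%s\n' "${names[@]}"
rm -f "$xml"
```

This starts one xmllint process per node, so it is only reasonable for small documents; for larger ones the xmlstarlet command above is the simpler choice.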

extract xml tag and its value

I want to read an XML file and store its values in variables.
For example:
qhr2400.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
I want its value in a variable like $CLLI =518, $COLLECTION_DATE = 06/04/20 00:45:00, SS7RT = 99..
so that I can use these values further to write an insert query.
Basically I want to load this .xml data into a database table.
This is what I tried.
read_xml.sh
awk 'NF==1 && (/ +<[a-zA-Z]+>/ || /^<[a-zA-Z]+>/ || / +<\/[a-zA-Z]+>/){
next
}
{
sub(/^ +/,"")
gsub(/\"|<|>/,"",$0);
sub(/\/.*/,"");
if($0){
print
}
}
' qhr2400.xml
Output
OPERATION type=1
CLLI5018
COLLECTION_DATE06
SS7RT99
AQPRT_184
L7RMSUOCT_0180
L7RMSUOCT_0280
Any help is appreciated.
Thanks!

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.
theory :
According to compiler theory, XML/HTML can't be parsed with regexes, which are based on finite state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and an LALR grammar, handled with a tool like YACC.
Check this thread too, why-its-not-possible-to-use-regex-to-parse-html-xml
realLife©®™ everyday tool in a shell :
You can use one of the following :
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
or you can use high level languages and proper libs, I think of :
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby nokogiri, check this example
php DOMXpath, check this example
Check: Using regular expressions with HTML tags

For this, you need an XML parser and an xpath query in your shell, see:
$ xidel -se '//CLLI/text()' file.xml
Once your XML is fixed (opening/closing tag mismatch: TABLENANE/TABLENAME):
xmllint --xpath '//CLLI/text()' file
This command is installed with libxml2 and is far from exotic, because it's installed by default on many Linux distros
Output
518
So now, you can retrieve all wanted values in shell variables, one example:
$ collectiondate=$(xidel -se '//COLLECTION_DATE/text()' file)
$ echo "$collectiondate"
But, please, don't use awk nor regex to parse XML.
There are other tools, check:
How to execute XPath one-liners from shell?
Check too: Using regular expressions with HTML tags (same thing for XML)
 Going further
declare -A arr
for i in CLLI COLLECTION_DATE SS7RT; do
read arr[$i] < <(xmllint --xpath "//$i/text()" file.xml)
done
Now you have an associative array with CLLI COLLECTION_DATE SS7RT keys:
Keys:
printf '%s\n' "${!arr[@]}"
CLLI
SS7RT
COLLECTION_DATE
Values:
$ printf '%s\n' "${arr[@]}"
518
99
06/04/20 00:45:00
for COLLECTION_DATE:
$ echo "${arr[COLLECTION_DATE]}"
06/04/20 00:45:00
It's also possible to fill an indexed array in one line:
readarray a < <(xidel -se '//*[self::CLLI or self::COLLECTION_DATE or self::SS7RT]/text()' file.xml)
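One detail worth checking when using readarray this way: without -t, each element keeps its trailing newline. The sketch below shows the difference. The fake_query function is a hypothetical stand-in for the xidel command, so the example runs without xidel installed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for:
#   xidel -se '//*[self::CLLI or self::COLLECTION_DATE or self::SS7RT]/text()' file.xml
fake_query() { printf '%s\n' '518' '06/04/20 00:45:00' '99'; }

# Without -t every element still ends with a newline character.
readarray with_newlines < <(fake_query)

# With -t the trailing newline is stripped from each element.
readarray -t values < <(fake_query)

# The brackets make any stray whitespace visible.
printf '[%s]\n' "${values[@]}"
```

With -t, "${values[1]}" is exactly the date string, which matters if the value is later compared or embedded in a query.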

I want its value in a variable like $CLLI =518, $COLLECTION_DATE = 06/04/20 00:45:00, SS7RT = 99.. so that I can use these values further to write an insert query.
I'm going to interpret this as: you want every child node of the "ROW" node, together with its value, exported as a variable.
As "Gilles Quenot" already mentioned, please don't parse xml with regex. I'd suggest you give xidel a try.
You could do it manually and call xidel for each and every node...
CLLI=$(xidel -s qhr2400.xml -e '//CLLI')
COLLECTION_DATE=$(xidel -s qhr2400.xml -e '//COLLECTION_DATE')
[...]
...but xidel itself can also export variables, multiple at once even:
#multiple queries, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI' -e 'COLLECTION_DATE:=//COLLECTION_DATE' -e '[...]' --output-format=bash
#or one query, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
[...]
The output is just strings. To actually set/export these variables you have to use Bash's eval built-in command:
eval "$(xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash)"
And finally, to do it fully automatic for every child-node in the "ROW"-node:
xidel -s qhr2400.xml -e '//ROW/*/name()'
CLLI
COLLECTION_DATE
SS7RT
AQPRT_1
L7RMSUOCT_01
L7RMSUOCT_02
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"//ROW/{$x}")'
518
06/04/20 00:45:00
99
84
80
80
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'
AQPRT_1='84'
L7RMSUOCT_01='80'
L7RMSUOCT_02='80'
result=
eval "$(xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash)"

Another approach is to use XSLT (XSL Transformation)
Here is a fixed and indented version of the OP's XML file:
$ cat demo.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
This is the stylesheet I will use:
$ cat demo.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" />
<xsl:strip-space elements="*"/>
<xsl:template match="ROW">
<xsl:text>CLLI="</xsl:text><xsl:value-of select="CLLI"/><xsl:text>" </xsl:text>
<xsl:text>COLLECTION_DATE="</xsl:text><xsl:value-of select="COLLECTION_DATE"/><xsl:text>" </xsl:text>
<xsl:text>SS7RT="</xsl:text><xsl:value-of select="SS7RT"/><xsl:text>" </xsl:text>
<xsl:text>AQPRT_1="</xsl:text><xsl:value-of select="AQPRT_1"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_01="</xsl:text><xsl:value-of select="L7RMSUOCT_01"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_02="</xsl:text><xsl:value-of select="L7RMSUOCT_02"/><xsl:text>" </xsl:text>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
Here is a simple shell script which uses xsltproc to transform demo.xml into text suitable as input to eval, in order to create shell variables for the required element values.
$ cat demo.sh
#!/bin/bash
eval $(xsltproc demo.xsl demo.xml)
echo "CLLI: $CLLI"
echo "COLLECTION_DATE: $COLLECTION_DATE"
echo "SS7RT: $SS7RT"
echo "AQPRT_1: $AQPRT_1"
echo "L7RMSUOCT_01: $L7RMSUOCT_01"
echo "L7RMSUOCT_02: $L7RMSUOCT_02"
Run the script:
$ ./demo.sh
CLLI: 518
COLLECTION_DATE: 06/04/20 00:45:00
SS7RT: 99
AQPRT_1: 84
L7RMSUOCT_01: 80
L7RMSUOCT_02: 80
$
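Since the stated goal was to build an insert query from these variables, here is a minimal sketch of that last step. The values are hard-coded so the example runs on its own (in the script above they would be set by the eval line), and both the table name qhr_table and the helper sql_quote are hypothetical, introduced only for illustration.

```shell
#!/usr/bin/env bash
# Hard-coded here; in demo.sh these would be set by the eval line.
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'

# Hypothetical helper: wrap a value in single quotes for use as a SQL
# string literal, doubling any single quote found inside the value.
sql_quote() {
  local s=$1 q="'"
  s=${s//$q/$q$q}
  printf "'%s'" "$s"
}

# qhr_table is a hypothetical table name.
insert_sql="INSERT INTO qhr_table (CLLI, COLLECTION_DATE, SS7RT) VALUES ($(sql_quote "$CLLI"), $(sql_quote "$COLLECTION_DATE"), $(sql_quote "$SS7RT"));"

printf '%s\n' "$insert_sql"
```

If the database client supports bound parameters, passing the values that way is usually safer than building the statement as text.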

read_xml.sh
gawk '
BEGIN {
FS="<|>"
}
// {
{
if($3 ~ /[0-9]/) { vars[$2] = $3; next }
}
}
END {
print vars["CLLI"]
print vars["SS7RT"]
print vars["COLLECTION_DATE"]
# etc...
}
' qhr2400.xml
result:
518
99
06/04/20 00:45:00
of course, instead of printing in END, you can use these variables from the vars array for something.
Rejecting AWK as an XML or HTML parser is unreasonable. AWK is great as a parser for any file, including damaged XML files. Using AWK requires more thought, but in return you don't need to install any exotic software. An XML file can be laid out so that AWK reads some lines incorrectly, but the same can be said of XML analysis tools.
EDIT:
We fix the XML file error - splitting the field into several lines.
file qhr2400.xml contains:
<CLLI>
518
</CLLI>
instead of
<CLLI>518</CLLI>
call:
cat qhr2400.xml |tr -d '\n' |sed 's/ *//g' |sed 's/</\n</g' |awk -f readxml.awk
readxml.awk is now:
BEGIN {
FS="<|>"
}
// {
{
if($3 ~ /[0-9]/) { vars[$2] = $3; next }
}
}
END {
print vars["CLLI"]
print vars["SS7RT"]
print vars["COLLECTION_DATE"]
# etc...
}
the result is correct
EDIT2
For some time, there has been a worrying fashion for adding complexity instead of simplifying the environment. A ready-made additional tool is usually a quick solution and may tempt you with its simplicity of use. Unfortunately, it is not always possible to install a huge Perl, Python, or Ruby environment (e.g. on an embedded system with 32MB of flash), it is not always possible to compile even a smaller tool for your processor architecture, and company policy can rightly prohibit adding anything to the standard set. There is also the case of one-time processing of a file. AWK, sed, and tr are usually available, and then they are the only rescue. Also, parsing an XML file does not always mean extracting key-value pairs; it can be something completely different, e.g.
"ROW> <CLLI> 518 </CLLI> <COLLECTION", which renders ready-made XPath-based analysis tools useless. AWK is a programming language written specifically for parsing text files, in a practically unlimited way if we add standard Unix tools.
However, if you have little experience, it is better to rely on ready-made solutions if possible.

Look for more than one value using xmllint

I need to retrieve more than one value from several XML blocks inside an XML file. How can I use xmllint to do this?
I noticed this solution (xml_grep get attribute from element) and tried to extend it. Unfortunately without any luck so far.
xmllint --xpath 'string(//identity/@name @placeofbirth @photo)' file.xml
Example XML file:
<eid>
<identity>
<name>Menten</name>
<firstname>Kasper</firstname>
<middlenames>Marie J</middlenames>
<nationality>Belg</nationality>
<placeofbirth>Sint-Truiden</placeofbirth>
<photo>base64-string</photo>
</identity>
<identity>
<name>Herbal</name>
<firstname>Jane</firstname>
<middlenames>Helena</middlenames>
<nationality>Frans</nationality>
<placeofbirth>Paris</placeofbirth>
<photo>notavailable</photo>
</identity>
</eid>
Output wanted
Kasper, Sint-Truiden, base64-string
Jane, Paris, notavailable

One way to do that is
# Read xml into variable
xmlStr=$(cat test.xml)
# Count identity nodes
nodeCount=$(echo "$xmlStr" | xmllint --xpath "count(//identity)" -)
# Iterate the nodeset by index
for i in $(seq 1 $nodeCount);do
echo "$xmlStr" | xmllint --xpath "concat((//identity)[$i]/name,', ',(//identity)[$i]/placeofbirth, ', ', (//identity)[$i]/photo)" - ; echo
done
Result:
Menten, Sint-Truiden, base64-string
Herbal, Paris, notavailable

bash XHTML parsing using xpath

I'm writing a small script to learn how to parse an XHTML web page. The following command:
cat q?s=goog.xhtml | xpath '//span[@id="yfs_l10_goog"]'
returns:
Found 2 nodes:
-- NODE --
<span id="yfs_l10_goog">624.50</span>-- NODE --
<span id="yfs_l10_goog">624.50</span>
Two questions:
How do I need to write my command in order to extract only the value 624.50?
What do I need to do to extract it only once?
source page I'm parsing: http://finance.yahoo.com/q?s=goog

Edit 2:
Give this a try:
xpath -q -e '//span[@id="yfs_l10_goog"][1]/text()'
Edit:
Pipe your output through:
sed -n '/span/{s/<span[^<]*>\([^<]*\)<.*/\1/;p;q}'
Original answer:
Using xmlstarlet:
echo -e '<foo><span id="yfs_l10_goog">624.50</span>\n<bar>xyz</bar><span id="yfs_l10_goog">555.50</span>\n<span id="yfs_l10_goog">123.50</span></foo>' |
xmlstarlet sel -t -v "//span[@id='yfs_l10_goog']"
Result of query:
624.50
Result of echo:
<foo><span id="yfs_l10_goog">624.50</span>
<bar>xyz</bar><span id="yfs_l10_goog">555.50</span>
<span id="yfs_l10_goog">123.50</span></foo>
Result of xml fo:
<?xml version="1.0"?>
<foo>
<span id="yfs_l10_goog">624.50</span>
<bar>xyz</bar>
<span id="yfs_l10_goog">555.50</span>
<span id="yfs_l10_goog">123.50</span>
</foo>
Other queries:
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][1]"
624.50
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][3]"
123.50
$ echo -e '...' | xmlstarlet sel -t -v "//span[@id='yfs_l10_goog'][last()]"
123.50
