I want to read an XML file and set its values into shell variables.
For example,
qhr2400.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
I want the values in variables like $CLLI = 518, $COLLECTION_DATE = 06/04/20 00:45:00, $SS7RT = 99, and so on, so that I can use these values later to write an insert query.
Basically, I want to load this .xml data into a database table.
This is what I tried.
read_xml.sh
awk 'NF==1 && (/ +<[a-zA-Z]+>/ || /^<[a-zA-Z]+>/ || / +<\/[a-zA-Z]+>/){
next
}
{
sub(/^ +/,"")
gsub(/\"|<|>/,"",$0);
sub(/\/.*/,"");
if($0){
print
}
}
' qhr2400.xml
Output
OPERATION type=1
CLLI518
COLLECTION_DATE06
SS7RT99
AQPRT_184
L7RMSUOCT_0180
L7RMSUOCT_0280
Any help is appreciated.
Thanks!
Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.
Theory:
According to compiler theory, XML/HTML can't be parsed with regexes, which are based on finite-state machines. Because of the hierarchical structure of XML/HTML, you need a pushdown automaton and an LALR grammar, built with a tool like YACC.
Check this thread too, why-its-not-possible-to-use-regex-to-parse-html-xml
realLife©®™ everyday tools in a shell:
You can use one of the following :
xmllint: often installed by default with libxml2; XPath 1.0
xmlstarlet: can edit, select, transform...; not installed by default; XPath 1.0
xpath: installed via Perl's XML::XPath module; XPath 1.0
xidel: XPath 3.0
saxon-lint: my own project, a wrapper over @Michael Kay's Saxon-HE Java library; XPath 3.0
Or you can use a high-level language and a proper library; I think of:
Python's lxml (from lxml import etree)
Perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
Ruby's nokogiri, check this example
PHP's DOMXpath, check this example
Check: Using regular expressions with HTML tags
For this, you need an XML parser and an XPath query in your shell, see:
$ xidel -se '//CLLI/text()' file.xml
Once your XML is fixed (opening/closing tag mismatch: TABLENANE/TABLENAME):
xmllint --xpath '//CLLI/text()' file
This command ships with libxml2 and is far from exotic, since libxml2 is installed by default on many Linux distros.
Output
518
So now, you can retrieve all wanted values in shell variables, one example:
$ collectiondate=$(xidel -se '//COLLECTION_DATE/text()' file)
$ echo "$collectiondate"
But please don't use awk or regexes to parse XML.
There are other tools; check:
How to execute XPath one-liners from shell?
See also: Using regular expressions with HTML tags (the same applies to XML)
Going further
declare -A arr
for i in CLLI COLLECTION_DATE SS7RT; do
read arr[$i] < <(xmllint --xpath "//$i/text()" file.xml)
done
Now you have an associative array with CLLI COLLECTION_DATE SS7RT keys:
Keys:
printf '%s\n' "${!arr[#]}"
CLLI
SS7RT
COLLECTION_DATE
Values:
$ printf '%s\n' "${arr[#]}"
518
99
06/04/20 00:45:00
for COLLECTION_DATE:
$ echo "${arr[COLLECTION_DATE]}"
06/04/20 00:45:00
It's possible to fill an indexed array in one line too:
readarray a < <(xidel -se '//*[self::CLLI or self::COLLECTION_DATE or self::SS7RT]/text()' file.xml)
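And since the stated goal is an insert query, here is a minimal sketch of building one from the associative array filled above (the table name TABLE is taken from the TABLENAME element; proper quoting/escaping for your database is left aside):
# Hedged sketch: assemble an INSERT statement from arr (filled above)
cols='' vals=''
for k in "${!arr[@]}"; do
    cols+="${cols:+, }$k"             # comma-separated column list
    vals+="${vals:+, }'${arr[$k]}'"   # comma-separated quoted values
done
echo "INSERT INTO TABLE ($cols) VALUES ($vals);"
# e.g. INSERT INTO TABLE (CLLI, SS7RT, COLLECTION_DATE) VALUES ('518', '99', '06/04/20 00:45:00');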
I want its value in a variable like $CLLI =518, $COLLECTION_DATE = 06/04/20 00:45:00, SS7RT = 99.. so that I can use these values further to write an insert query.
I'm going to interpret this as: you want every child node of the "ROW" node, and its value, exported as a variable.
As Gilles Quenot already mentioned, please don't parse XML with regex. I'd suggest you give xidel a try.
You could do it manually and call xidel for each and every node...
CLLI=$(xidel -s qhr2400.xml -e '//CLLI')
COLLECTION_DATE=$(xidel -s qhr2400.xml -e '//COLLECTION_DATE')
[...]
...but xidel itself can also export variables, multiple at once even:
#multiple queries, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI' -e 'COLLECTION_DATE:=//COLLECTION_DATE' -e '[...]' --output-format=bash
#or one query, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
[...]
The output is just strings. To actually set/export these variables you have to use Bash's eval built-in command:
eval "$(xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash)"
And finally, to do it fully automatically for every child node of the "ROW" node:
xidel -s qhr2400.xml -e '//ROW/*/name()'
CLLI
COLLECTION_DATE
SS7RT
AQPRT_1
L7RMSUOCT_01
L7RMSUOCT_02
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"//ROW/{$x}")'
518
06/04/20 00:45:00
99
84
80
80
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW{$x}")[0]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'
AQPRT_1='84'
L7RMSUOCT_01='80'
L7RMSUOCT_02='80'
result=
eval "$(xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW{$x}")[0]' --output-format=bash)"
Another approach is to use XSLT (XSL Transformations).
Here is a fixed and indented version of the OP's XML file:
$ cat demo.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
This is the stylesheet I will use:
$ cat demo.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" />
<xsl:strip-space elements="*"/>
<xsl:template match="ROW">
<xsl:text>CLLI="</xsl:text><xsl:value-of select="CLLI"/><xsl:text>" </xsl:text>
<xsl:text>COLLECTION_DATE="</xsl:text><xsl:value-of select="COLLECTION_DATE"/><xsl:text>" </xsl:text>
<xsl:text>SS7RT="</xsl:text><xsl:value-of select="SS7RT"/><xsl:text>" </xsl:text>
<xsl:text>AQPRT_1="</xsl:text><xsl:value-of select="AQPRT_1"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_01="</xsl:text><xsl:value-of select="L7RMSUOCT_01"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_02="</xsl:text><xsl:value-of select="L7RMSUOCT_02"/><xsl:text>" </xsl:text>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
Here is a simple shell script which uses xsltproc to transform demo.xml into text suitable for input to eval, in order to create shell variables for the required element values.
$ cat demo.sh
#!/bin/bash
eval $(xsltproc demo.xsl demo.xml)
echo "CLLI: $CLLI"
echo "COLLECTION_DATE: $COLLECTION_DATE"
echo "SS7RT: $SS7RT"
echo "AQPRT_1: $AQPRT_1"
echo "L7RMSUOCT_01: $L7RMSUOCT_01"
echo "L7RMSUOCT_02: $L7RMSUOCT_02"
Run the script:
$ ./demo.sh
CLLI: 518
COLLECTION_DATE: 06/04/20 00:45:00
SS7RT: 99
AQPRT_1: 84
L7RMSUOCT_01: 80
L7RMSUOCT_02: 80
$
read_xml.sh
gawk '
BEGIN { FS = "<|>" }
$3 ~ /[0-9]/ { vars[$2] = $3; next }
END {
    print vars["CLLI"]
    print vars["SS7RT"]
    print vars["COLLECTION_DATE"]
    # etc...
}
' qhr2400.xml
result:
518
99
06/04/20 00:45:00
Of course, instead of printing in END, you can use the variables from the vars array for something else, as shown below.
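For instance, here is a hedged sketch that emits the OP's insert statement straight from the END block (the table name TABLE is assumed from the TABLENAME element; \047 is an escaped single quote in awk strings):
gawk '
BEGIN { FS = "<|>" }
$3 ~ /[0-9]/ { vars[$2] = $3; next }
END {
    # build the INSERT from the collected values
    printf "INSERT INTO TABLE (CLLI, COLLECTION_DATE, SS7RT) VALUES (%s, \047%s\047, %s);\n",
        vars["CLLI"], vars["COLLECTION_DATE"], vars["SS7RT"]
}
' qhr2400.xml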
Rejecting AWK as an XML or HTML parser is unreasonable. AWK is great as a parser for any file, including damaged XML files. Using AWK requires more thought, but in exchange you don't need to install any exotic software. An XML file can be saved in a way that makes AWK read some lines incorrectly, but the same can be said of XML analysis tools.
EDIT:
Suppose the XML file has an error: a field split across several lines.
The file qhr2400.xml contains:
<CLLI>
518
</CLLI>
instead of
<CLLI>518</CLLI>
call:
cat qhr2400.xml |tr -d '\n' |sed 's/ *//g' |sed 's/</\n</g' |awk -f readxml.awk
readxml.awk is now:
BEGIN { FS = "<|>" }
$3 ~ /[0-9]/ { vars[$2] = $3; next }
END {
    print vars["CLLI"]
    print vars["SS7RT"]
    print vars["COLLECTION_DATE"]
    # etc...
}
The result is correct.
EDIT2
For some time there has been a worrying fashion for adding complexity instead of simplifying the environment. Reaching for a ready-made extra tool is usually the quick solution and may tempt you with its ease of use. Unfortunately, it is not always possible to install a huge Perl, Python, or Ruby environment (e.g. on an embedded system with 32 MB of flash), it is not always possible to compile even a smaller tool for your processor architecture, and company policy may rightly prohibit adding anything to the standard set; there is also a case for one-time processing of a file. AWK, sed, and tr are usually present and are the only rescue then. Also, parsing an XML file does not always mean extracting key-value pairs; it can be something completely different, e.g.
"ROW> <CLLI> 518 </CLLI> <COLLECTION", which makes ready-made XPath-based analysis tools useless. AWK is a programming language written specifically for parsing text files in a practically unlimited way, especially if we add the standard Unix tools.
However, if you have little experience, it's better to rely on ready-made solutions where possible.
Related
I'm looking for a way to turn this:
hello &lt; world
to this:
hello < world
I could use sed, but how can this be accomplished without using cryptic regex?
Try recode (archived page; GitHub mirror; Debian page):
$ echo '&lt;' | recode html..ascii
<
Install on Linux and similar Unix-y systems:
$ sudo apt-get install recode
Install on Mac OS using:
$ brew install recode
With perl:
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
With php from the command line:
cat foo.html | php -r 'while(($line=fgets(STDIN)) !== FALSE) echo html_entity_decode($line, ENT_QUOTES|ENT_HTML401);'
An alternative is to pipe through a web browser -- such as:
echo '&#33;' | w3m -dump -T text/html
This worked great for me in cygwin, where downloading and installing distributions are difficult.
This answer was found here
Using xmlstarlet:
echo 'hello &lt; world' | xmlstarlet unesc
A python 3.2+ version:
cat foo.html | python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
This answer is based on: Short way to escape HTML in Bash? which works fine for grabbing answers (using wget) on Stack Exchange and converting HTML to regular ASCII characters:
sed 's/&nbsp;/ /g; s/&amp;/\&/g; s/&lt;/\</g; s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g; s/&rdquo;/\"/g;'
Edit 1: April 7, 2017 - Added left double quote and right double quote conversion. This is part of a bash script that web-scrapes SE answers and compares them to local code files, here: Ask Ubuntu -
Code Version Control between local files and Ask Ubuntu answers
Edit June 26, 2017
Using sed was taking ~3 seconds to convert HTML to ASCII on a 1K line file from Ask Ubuntu / Stack Exchange. As such I was forced to use Bash built-in search and replace for ~1 second response time.
Here's the function:
LineOut="" # Make global
HTMLtoText () {
LineOut=$1 # Parm 1= Input line
# Replace external command: Line=$(sed 's/&amp;/\&/g; s/&lt;/\</g;
# s/&gt;/\>/g; s/&quot;/\"/g; s/&#39;/\'"'"'/g; s/&ldquo;/\"/g;
# s/&rdquo;/\"/g;' <<< "$Line") -- With faster builtin commands.
LineOut="${LineOut//&nbsp;/ }"
LineOut="${LineOut//&amp;/&}"
LineOut="${LineOut//&lt;/<}"
LineOut="${LineOut//&gt;/>}"
LineOut="${LineOut//&quot;/'"'}"
LineOut="${LineOut//&#39;/"'"}"
LineOut="${LineOut//&ldquo;/'"'}" # TODO: ASCII/ISO for opening quote
LineOut="${LineOut//&rdquo;/'"'}" # TODO: ASCII/ISO for closing quote
} # HTMLtoText ()
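Usage is simple: pass an escaped line as the first argument and read the global afterwards, e.g.:
$ HTMLtoText '&lt;b&gt;bold&lt;/b&gt; &amp; more'
$ echo "$LineOut"
<b>bold</b> & more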
On macOS, you can use the built-in command textutil (which is a handy utility in general):
echo '👋 hello &lt; world 🌐' | textutil -convert txt -format html -stdin -stdout
outputs:
👋 hello < world 🌐
To support the unescaping of all HTML entities only with sed substitutions would require too long a list of commands to be practical, because every Unicode code point has at least two corresponding HTML entities.
But it can be done using only sed, grep, the Bourne shell and basic UNIX utilities (the GNU coreutils or equivalent):
#!/bin/sh
htmlEscDec2Hex() {
file=$1
[ ! -r "$file" ] && file=$(mktemp) && cat >"$file"
printf -- \
"$(sed 's/\\/\\\\/g;s/%/%%/g;s/&#[0-9]\{1,10\};/\&#x%x;/g' "$file")\n" \
$(grep -o '&#[0-9]\{1,10\};' "$file" | tr -d '&#;')
[ x"$1" != x"$file" ] && rm -f -- "$file"
}
htmlHexUnescape() {
printf -- "$(
sed 's/\\/\\\\/g;s/%/%%/g
;s/&#x\([0-9a-fA-F]\{1,8\}\);/\&#x0000000\1;/g
;s/&#x0*\([0-9a-fA-F]\{4\}\);/\\u\1/g
;s/&#x0*\([0-9a-fA-F]\{8\}\);/\\U\1/g' )\n"
}
htmlEscDec2Hex "$1" | htmlHexUnescape \
| sed -f named_entities.sed
Note, however, that a printf implementation supporting \uHHHH and \UHHHHHHHH sequences is required, such as the GNU utility’s. To test, check for example that printf "\u00A7\n" prints §. To call the utility instead of the shell built-in, replace the occurrences of printf with env printf.
This script uses an additional file, named_entities.sed, in order to support the named entities. It can be generated from the specification using the following HTML page:
<!DOCTYPE html>
<head><meta charset="utf-8" /></head>
<body>
<p id="sed-script"></p>
<script type="text/javascript">
const referenceURL = 'https://html.spec.whatwg.org/entities.json';
function writeln(element, text) {
element.appendChild( document.createTextNode(text) );
element.appendChild( document.createElement("br") );
}
(async function(container) {
const json = await (await fetch(referenceURL)).json();
container.innerHTML = "";
writeln(container, "#!/usr/bin/sed -f");
const addLast = [];
for (const name in json) {
const characters = json[name].characters
.replace("\\", "\\\\")
.replace("/", "\\/");
const command = "s/" + name + "/" + characters + "/g";
if ( name.endsWith(";") ) {
writeln(container, command);
} else {
addLast.push(command);
}
}
for (const command of addLast) { writeln(container, command); }
})( document.getElementById("sed-script") );
</script>
</body></html>
Simply open it in a modern browser, and save the resulting page as text as named_entities.sed. This sed script can also be used alone if only named entities are required; in this case it is convenient to give it executable permission so that it can be called directly.
Now the above shell script can be used as ./html_unescape.sh foo.html, or inside a pipeline reading from standard input.
For example, if for some reason it is needed to process the data by chunks (it might be the case if printf is not a shell built-in and the data to process is large), one could use it as:
nLines=20
seq 1 $nLines $(grep -c $ "$inputFile") | while read n
do sed -n "$n,$((n+nLines-1))p" "$inputFile" | ./html_unescape.sh
done
Explanation of the script follows.
There are three types of escape sequences that need to be supported:
&#D; where D is the decimal value of the escaped character’s Unicode code point;
&#xH; where H is the hexadecimal value of the escaped character’s Unicode code point;
&N; where N is the name of one of the named entities for the escaped character.
The &N; escapes are supported by the generated named_entities.sed script which simply performs the list of substitutions.
The central piece of this method for supporting the code point escapes is the printf utility, which is able to:
print numbers in hexadecimal format, and
print characters from their code point’s hexadecimal value (using the escapes \uHHHH or \UHHHHHHHH).
The first feature, with some help from sed and grep, is used to reduce the &#D; escapes into &#xH; escapes. The shell function htmlEscDec2Hex does that.
The function htmlHexUnescape uses sed to transform the &#xH; escapes into printf’s \u/\U escapes, then uses the second feature to print the unescaped characters.
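As a quick smoke test of both stages (assuming the script above is saved as html_unescape.sh and named_entities.sed has been generated), the decimal escape &#167; and the hexadecimal escape &#xA7; should both come out as the section sign:
$ echo 'Section &#167; and &#xA7;' | ./html_unescape.sh
Section § and §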
I like the Perl answer given in https://stackoverflow.com/a/13161719/1506477.
cat foo.html | perl -MHTML::Entities -pe 'decode_entities($_);'
But it produced an unequal number of lines on plain text files (and I don't know Perl well enough to debug it).
I like the python answer given in https://stackoverflow.com/a/42672936/1506477 --
python3 -c 'import html, sys; [print(html.unescape(l), end="") for l in sys.stdin]'
but it creates a list [ ... for l in sys.stdin] in memory, which is prohibitive for large files.
Here is another easy pythonic way without buffering in memory: using awkg.
$ echo 'hello &lt; &#58; &quot; world' | \
awkg -b 'from html import unescape' 'print(unescape(R0))'
hello < : " world
awkg is a Python-based awk-like line processor. You may install it using pip (https://pypi.org/project/awkg/):
pip install awkg
-b is the equivalent of awk's BEGIN{} block, which runs once at the beginning.
Here we just did from html import unescape.
Each line record is in the R0 variable, for which we do
print(unescape(R0))
Disclaimer:
I am the maintainer of awkg
I have created a sed script based on the list of entities, so it should handle most of them.
sed -f htmlentities.sed < file.html
My original answer got some comments saying that recode does not work for UTF-8 encoded HTML files. This is correct: recode supports only HTML 4. The encoding HTML is an alias for HTML_4.0:
$ recode -l | grep -iw html
HTML-i18n 2070 RFC2070
HTML_4.0 h h4 HTML
The default encoding for HTML 4 is Latin-1. This changed in HTML 5: the default encoding for HTML 5 is UTF-8. This is the reason why recode does not work for HTML 5 files.
HTML 5 defines the list of entities here:
https://html.spec.whatwg.org/multipage/named-characters.html
The definition includes a machine readable specification in JSON format:
https://html.spec.whatwg.org/entities.json
The JSON file can be used to perform a simple text replacement. The following example is a self modifying Perl script, which caches the JSON specification in its DATA chunk.
Note: For some obscure compatibility reasons, the specification allows entities without a terminating semicolon. Because of that, the entities are sorted by length in reverse order, to make sure that the correct entities are replaced first and do not get clobbered by their semicolon-less variants.
#! /usr/bin/perl
use utf8;
use strict;
use warnings;
use open qw(:std :utf8);
use LWP::Simple;
use JSON::Parse qw(parse_json);
my $entities;
INIT {
if (eof DATA) {
my $data = tell DATA;
open DATA, '+<', $0;
seek DATA, $data, 0;
my $entities_json = get 'https://html.spec.whatwg.org/entities.json';
print DATA $entities_json;
truncate DATA, tell DATA;
close DATA;
$entities = parse_json ($entities_json);
} else {
local $/ = undef;
$entities = parse_json (<DATA>);
}
}
local $/ = undef;
my $html = <>;
for my $entity (sort { length $b <=> length $a } keys %$entities) {
my $characters = $entities->{$entity}->{characters};
$html =~ s/$entity/$characters/g;
}
print $html;
__DATA__
Example usage:
$ echo '😊 &amp; ٱلْعَرَبِيَّة' | ./html5-to-utf8.pl
😊 & ٱلْعَرَبِيَّة
With Xidel:
echo 'hello &lt; &#58; &quot; world' | xidel -s - -e 'parse-html($raw)'
hello < : " world
I have a file with multiple occurrences of an XML element. I want to grep for a pattern only in the first element. I want to use grep because I need to use this as the condition of an if check in a bash script. NOTE that unfortunately, I am not guaranteed that the XML element(s) are contained in an enclosing tag (this file is generated by another program out of my control).
Example of a match for "mango"
<element>
apple
banana
orange
mango
</element>
<element>
apple
banana
orange
mango
</element>
Example of a non-match for "mango"
In the following XML snippet, I want my search to fail because mango doesn't exist in the first element.
<element>
apple
banana
orange
</element>
<element>
apple
banana
orange
mango
</element>
Here's how I solved this, but I had to use a pipe combining grep with sed. This solution only worked for me because the first <element> is on the first line of the file.
sed -n '0,/<\/element>/p' /path/to/file | grep -q mango
sed prints from the first line of the file up to the first closing </element> tag.
grep exits true or false depending on whether mango matches; see below for the if form.
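Wrapped in the if check the question mentions, reusing the same pipeline:
# Minimal sketch: act on whether mango appears in the first element
if sed -n '0,/<\/element>/p' /path/to/file | grep -q mango; then
    echo "mango found in first element"
else
    echo "mango not found in first element"
fi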
For handling XML data I would always recommend XML tools; only they can handle the specifics of XML in a safe way. For the command line there is a tool called xsltproc available. It is a simple-to-use XSLT processor and can do the job better than sed. The only drawback is that you need an additional XSLT stylesheet.
Example stylesheet: test.xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="text"/>
<xsl:template match="element[position()=1]">
<xsl:value-of select="."/>
</xsl:template>
<xsl:template match="*|#*|text()|comment()|processing-instruction()">
<xsl:apply-templates select="*|#*|text()|comment()|processing-instruction()"/>
</xsl:template>
</xsl:stylesheet>
With the stylesheet and xsltproc you can run a command like this:
xsltproc test.xslt test.xml | grep mango
This may be quite a lengthy solution; however, it works.
./check.sh mango
This calls a simple awk script for each file, referenced by the FILES variable
Note:
I saved the XML files as xml1 and xml2.
For the example above, it produces the following output:
mango found in xml1
mango not found in xml2
is-here.awk:
BEGIN {
tagOpened="not yet"
tagsPresent=0
}
/<[[:alnum:]]+>/ {
if (tagsPresent <= 1) # remove this condition to check ALL occurrences
{
tagOpened="true"
tagsPresent++
}
}
/<[/][[:alnum:]]+>/ {
tagOpened="false"
}
// {
if (match($1, value) && tagOpened=="true" && length($1)==length(value))
{
found++
}
}
END {
if (found == tagsPresent)
{
print "present"
}
else
{
print "not"
}
}
check.sh
#! /bin/bash
function check()
{
local file=$1
local pattern=$2
local result=$(cat $file | gawk -f is-here.awk -v value=$pattern)
echo $result
}
FILES="xml1 xml2"
for file in $FILES
do
result=$(check $file $1)
if [ "$result" == "present" ]
then
echo "$1 found in $file"
else
echo "$1 not found in $file"
fi
done
I have a flat file like this:
File:
# Environment
Application.Env~DEV
# Identity
Application.ID~999
# Name
Application.Name~appname
An XML like this:
<name>Application/Env</name>
<value>XXX</value>
<name>Application/ID</name>
<value>000</value>
<name>Application/Name</name>
<value>AAA</value>
I'm looking for a script (awk, sed, etc.) to read the flat file and replace all of the data in the <value> tags in the XML with the data found after the ~ whenever the <name> tag matches the data before the ~. Ultimately the resulting XML will look like:
<name>Application/Env</name>
<value>DEV</value>
<name>Application/ID</name>
<value>999</value>
<name>Application/Name</name>
<value>appname</value>
Thanks for your help!
Using XMLStarlet, this would look something like the following:
#!/bin/bash
# usage: [script] [flatfile-name] <in.xml >out.xml
flatfile=$1
# store an array of variables, and an array of edit commands
xml_vars=( )
xml_cmd=( )
count=0
while read -r line; do
[[ $line = *"~"* ]] || continue
key=${line%%"~"*} # put everything before the ~ into key
key=${key//"."/"/"} # change "."s to "/"s in key
val=${line#*"~"} # put everything after the ~ into val
# assign key to an XMLStarlet variable to avoid practices that can lead to injection
xml_vars+=( --var "var$count" "'$key'" )
# update the first value following a matching name
xml_cmd+=( -u "//name[.=\$var${count}]/following-sibling::value[1]" \
-v "$val" )
# increment the counter used to assign variable names
(( ++count ))
done <"$flatfile"
if (( ${#xml_cmd[@]} )); then
xmlstarlet ed "${xml_vars[@]}" "${xml_cmd[@]}"
else
cat # no edits to do
fi
This will run a command like the following:
xmlstarlet ed \
--var var0 "Application/Env" \
--var var2 "Application/ID" \
--var var3 "Application/Name" \
-u '//name[.=$var0]/following-sibling::value[1]' -v 'DEV' \
-u '//name[.=$var1]/following-sibling::value[1]' -v '999' \
-u '//name[.=$var2]/following-sibling::value[1]' -v 'appname'
...which replaces the first value after the name Application/Env with DEV, the first value after the name Application/ID with 999, and the first value after the name Application/Name with appname.
A slightly less paranoid approach might instead generate queries like //name[.="Application/Name"]/following-sibling::value[1]; putting the variables out-of-band is done here as a security practice. Consider what could happen otherwise if the input file contained:
Application.Foo"or 1=1 or .="~bar
...and the resulting XPath were
//name[.="Application/Foo" or 1=1 or .=""]/following-sibling::value[1]
Because 1=1 is always true, this would then match every name, and thus change every value in the file to bar.
Unfortunately, the implementation of XMLStarlet doesn't effectively guard against this; however, using bind variables makes it possible for an implementation to provide such precautions, so a future release could be safe in this context.
Using Perl and XML::XSH2, a wrapper around XML::LibXML:
#!/usr/bin/perl
use warnings;
use strict;
use XML::XSH2;
open my $IN, '<', 'flatfile' or die $!;
$XML::XSH2::Map::replace = { map { chomp; split /~/ } grep /~/, <$IN> };
xsh << 'end.';
open 1.xml ;
for //name {
set following-sibling::value[1]
xsh:lookup('replace', xsh:subst(., '/', '.')) ;
}
save :b ;
end.
I wrapped the XML into a <root> tag to make it well formed.
I have an XML node like this:
<point type="2D" x="61" y="273" />
I wish to multiply x by 2 using Bash. I had tried the following:
echo '<rect key="frame" x="61" y="273" width="199" height="21"/>' | sed "s/x=\"\([[:digit:]]*\)\"/x=\"$((\1 * 2))\"/"
But it failed with:
syntax error: operand expected (error token is "\\1 * 2")
Any idea how to make this work?
sed is not the right tool for this. You can use this GNU awk command with a custom record separator:
awk -v RS='.*x="|".*' '!NF{ s=RT } NF{ print s $1*2 RT }' file
<point type="2D" x="122" y="273" />
However, it is better to use a proper XML parser for thorough XML processing; see the example below.
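If xmlstarlet happens to be installed, here is a hedged sketch of the same edit with a real parser: -u selects the x attribute and -x replaces it with the value of an XPath expression.
# Sketch: double the x attribute of every <point> with xmlstarlet
echo '<point type="2D" x="61" y="273"/>' | xmlstarlet ed -u '//point/@x' -x '. * 2'
<?xml version="1.0"?>
<point type="2D" x="122" y="273"/>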
This is quite easy to achieve with perl:
perl -p -e 's/x="([0-9]+)"/"x=\"".($1*2)."\""/e' input.xml
To replace directly, add -i like you would with sed.
I am on AIX, with bash, and we cannot install additional software at this time, so I am limited to command-line batch processing and maybe custom Java scripts.
root_dir
Pages
PAGES_1.XML
Queries
QUERIES_1.XML
QUERIES_2.XML
QUERIES_3.XML
I have put together a script that gets me almost everything I want, but I don't know how to do the last piece of the puzzle, if it is even possible in a batch script. I create a new directory under root, copy all of the XML files into the new directory, then rename them to remove any spaces in the names and zero-pad the integer so they can be sorted in alphabetical/numerical order. The new output looks like this:
copy_dir
PAGES_001.XML
QUERIES_001.XML
QUERIES_002.XML
QUERIES_003.XML
I am almost there. The last piece is that these separate XML files need to be combined into one XML file per type, so HISTORY_001.XML to HISTORY_099.XML need to be combined, then QUERIES_001.XML to QUERIES_099.XML, but only after a specific point in each file. I have regexes that will select the parts of the files that I want; now I just need to figure out how to loop through each file subset. Maybe I jumped the gun and should do it before moving them, but assuming they are all in one directory, how can I go about this?
Here is an example of the data. All of the XML files carry these same types of information.
Pages
<?xml version="1.0"?>
<project name="">
<rundate></rundate>
<object_type code="false" firstitem="1" id="5" items="65" name="Pages">
<primary_key>Page Name</primary_key>
<secondary_key>Language Code</secondary_key>
<secondary_key>Page Field ID</secondary_key>
<secondary_key>Field Type</secondary_key>
<secondary_key>Record (Table) Name</secondary_key>
<secondary_key>Field Name</secondary_key>
<item id="ACCTG_TEMPLATE_AP">
...
</item>
<item id="ACCTG_TEMPLATE_AR">
...
</item>
</object_type>
</project>
Queries
<?xml version="1.0"?>
<project name="">
<rundate></rundate>
<object_type code="false" firstitem="1" id="10" items="46" name="Queries">
<primary_key>Query Name</primary_key>
<primary_key>User ID</primary_key>
<item id="1099G_ALL_SHORT. ">
...
</item>
<item id="1099G_ALL_VOUCHERS. ">
...
</item>
</object_type>
</project>
Regex to pull out header
(?:(?!(^\s*<item)).)*
Regex to pull out detail
^(\s*<item id=).*(</item>)
Regex to pull out footer
^(\s*</object_type).*
So I am assuming that what I want to do is have a counter and loop through each object-type XML subset: on the first loop, pull the header and detail and output them to a new summary file; continue concatenating the detail for all other files; then, on the last file or on a change to a new object type, output the footer as well. Do you think this is possible using a bash script?
This will emit commands to do the sorting and classification; just provide functions/scripts/whatever that do the right thing for files that are first, middle, last, or only in a group. The first and middle commands have to handle empty argument lists: middle for two-element groups and first for groups without a 1-sequenced file.
Edit: I broke the seds out to one command per line to handle seds that don't like semicolons.
Run this as e.g. sh this.sh *_*.*
#!/bin/sh
#
# spit commands to sort, group, and classify argument filenames
# sorting by the number between `_` and `.` in their names and
# grouping by the text before the _.
{
# Everything through the sort would just be `ls -v` on GNU/anything...
for f; do
pfx=${f%%_*}
tail=${f#*_}
sortable=`printf %s_%03d.%s $pfx ${tail%.*} ${tail##*.}`
[ $f != $sortable ] \
&& echo mv $f $sortable >&2
echo $sortable
done \
| sort \
| sed '
/_0*1\./! H
// {
x
1! {
y/\n/ /
p
}
}
$!d
x
y/\n/ /
' \
| sed '
s/\([^ ]*\)\(.*\) \(.*\)/first \1\nmiddle\2\nlast \3/
t
s/^/only /
'
} 2>&1
The first of the above seds accumulates groups of one-per-line Words that can be identified by their first line. The second classifies the groups and subs in the right commands. They're separate because the first sed involves a double-pump to handle a widow group, plus they're hairy enough as it is.
combine()
{
# pull the header from 1st file
while IFS= read && word=($REPLY) && [ "$word" != "<item" ]
do echo "$REPLY"
done <$1
# concat the detail from all files
for file
do cmd=:
while IFS= read && word=($REPLY)
do case $word in \<item) cmd=echo;; esac
$cmd "$REPLY"
case $word in \</item\>) cmd=:;; esac
done <$file
done
# output the footer
while IFS= read && word=($REPLY)
do case $word in \</object_type\>) cmd=echo;; esac
$cmd "$REPLY"
done <$file
}
combine PAGES_???.XML >PAGES.XML
combine QUERIES_???.XML >QUERIES.XML