Grep for pattern in first occurrence of XML element - bash

I have a file with multiple occurrences of an XML element. I want to grep for a pattern only in the first element. I want to use grep because I need to use this as the condition of an if check in a bash script. NOTE that unfortunately, I am not guaranteed that the XML element(s) are contained in an enclosing tag (this file is generated by another program out of my control).
Example of a match for "mango"
<element>
apple
banana
orange
mango
</element>
<element>
apple
banana
orange
mango
</element>
Example of a non-match for "mango"
In the following XML snippet, I want my search to fail because mango doesn't exist in the first element.
<element>
apple
banana
orange
</element>
<element>
apple
banana
orange
mango
</element>

Here's how I solved this, but I had to use a pipe combining grep with sed. This solution only worked for me because the first <element> is on the first line of the file.
sed -n '0,/<\/element>/p' /path/to/file | grep -q mango
Uses sed to print from the first line of the file up to the first closing </element> tag.
Uses grep to exit true or false depending on whether it matches mango.
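Since the whole point is to use this as the condition of an if check, the same pipeline can be dropped straight into one; grep -q stays quiet and only sets the exit status:
if sed -n '0,/<\/element>/p' /path/to/file | grep -q mango; then
echo "mango is in the first element"
fi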

For handling XML data I would always recommend XML tools; only those can handle the specifics of XML in a safe way. For the command line there is a tool called xsltproc available. It is a simple-to-use XSLT processor, and it can do this job better than sed. The only drawback is that you need an additional XSLT stylesheet.
Example stylesheet: test.xslt
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="text"/>
<xsl:template match="element[position()=1]">
<xsl:value-of select="."/>
</xsl:template>
<xsl:template match="*|#*|text()|comment()|processing-instruction()">
<xsl:apply-templates select="*|#*|text()|comment()|processing-instruction()"/>
</xsl:template>
</xsl:stylesheet>
With this stylesheet and xsltproc you can run a command like this:
xsltproc test.xslt test.xml | grep mango

This may be quite a lengthy solution, but it works.
./check.sh mango
This calls a simple awk script for each file, referenced by the FILES variable
Note: I saved the XML files as xml1 and xml2.
For the example above, it produces the following output:
mango found in xml1
mango not found in xml2
is-here.awk:
BEGIN {
tagOpened="not yet"
tagsPresent=0
}
/<[[:alnum:]]+>/ {
if (tagsPresent <= 1) # remove this condition to check ALL occurrences
{
tagOpened="true"
tagsPresent++
}
}
/<[/][[:alnum:]]+>/ {
tagOpened="false"
}
// {
if (match($1, value) && tagOpened=="true" && length($1)==length(value))
{
found++
}
}
END {
if (found == tagsPresent)
{
print "present"
}
else
{
print "not"
}
}
check.sh
#! /bin/bash
function check()
{
local file=$1
local pattern=$2
local result=$(gawk -f is-here.awk -v value="$pattern" "$file")
echo $result
}
FILES="xml1 xml2"
for file in $FILES
do
result=$(check "$file" "$1")
if [ "$result" == "present" ]
then
echo "$1 found in $file"
else
echo "$1 not found in $file"
fi
done
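A possible refinement (just a sketch, untested against the original files): have check() return an exit status instead of echoing a string, so callers can use it directly as an if condition:
function check()
{
local file=$1
local pattern=$2
[ "$(gawk -f is-here.awk -v value="$pattern" "$file")" == "present" ]
}

if check xml1 mango; then
echo "mango found in xml1"
fi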

Is it possible to use sed instead of Grep -oP to extract a word? [duplicate]

Sometimes I need to quickly extract some arbitrary data from XML files to put into CSV format. What are your best practices for doing this in the Unix terminal? I would love some code examples, so for instance, how can I solve the following problem?
Example XML input:
<root>
<myel name="Foo" />
<myel name="Bar" />
</root>
My desired CSV output:
Foo,
Bar,
Peter's answer is correct, but it outputs a trailing line feed.
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text"/>
<xsl:template match="root">
<xsl:for-each select="myel">
<xsl:value-of select="#name"/>
<xsl:text>,</xsl:text>
<xsl:if test="not(position() = last())">
<xsl:text>
</xsl:text>
</xsl:if>
</xsl:for-each>
</xsl:template>
</xsl:stylesheet>
Just run e.g.
xsltproc stylesheet.xsl source.xml
to generate the CSV results into standard output.
Use a command-line XSLT processor such as xsltproc, saxon or xalan to parse the XML and generate CSV. Here's an example, which for your case is the stylesheet:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text"/>
<xsl:template match="root">
<xsl:apply-templates select="myel"/>
</xsl:template>
<xsl:template match="myel">
<xsl:for-each select="#*">
<xsl:value-of select="."/>
<xsl:value-of select="','"/>
</xsl:for-each>
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
If you just want the name attributes of any element, here is a quick but incomplete solution.
(Your example text is in the file example)
grep "name" example | cut -d"\"" -f2,2
| xargs -I{} echo "{},"
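A slightly sturdier variant of the same idea, still regex-based and still incomplete (it assumes GNU grep for -o), matches only the quoted value of the name attribute and then appends the comma:
grep -o 'name="[^"]*"' example | cut -d'"' -f2 | sed 's/$/,/'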
XMLStarlet is a command line toolkit to query/edit/check/transform
XML documents (for more information, see XMLStarlet Command Line XML Toolkit)
No files to write, just pipe your file to xmlstarlet and apply an xpath filter.
cat file.xml | xml sel -t -m 'xpathExpression' -v 'elemName' 'literal' -v 'elname' -n
-m expression
-v value
'' included literal
-n newline
So for your xpath the xpath expression would be //myel/@name
which would provide the two attribute values.
Very handy tool.
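For this concrete question, that would be something like the following sketch, where -o emits a literal:
xml sel -t -m '//myel' -v '@name' -o ',' -n file.xml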
Here's a little ruby script that does exactly what your question asks (pull an attribute called 'name' out of elements called 'myel'). Should be easy to generalize
#!/usr/bin/ruby -w
require 'rexml/document'
xml = REXML::Document.new(File.open(ARGV[0].to_s))
xml.elements.each("//myel") { |el| puts "#{el.attributes['name']}," if el.attributes['name'] }
Using xidel:
xidel -s input.xml -e '//myel/concat(#name,",")'
Answering the original question, assuming the XML file is "test.xml" and contains:
<root>
<myel name="Foo" />
<myel name="Bar" />
</root>
tr -s "\"" " " < test.xml | awk '{printf "%s,\n", $3}'
Your test file is in test.xml.
sed -n 's/^\s*<myel\s*name="\([^"]*\)".*$/\1,/p' test.xml
It has its pitfalls; for example if it is not strictly given that each myel is on one line you have to "normalize" the XML file first (so each myel is on a separate line).
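If the layout is not guaranteed, one way to normalize first, assuming xmllint from libxml2 is available, is to let an XML tool reindent the file before the sed runs:
xmllint --format test.xml | sed -n 's/^\s*<myel\s*name="\([^"]*\)".*$/\1,/p'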
yq can be used for XML parsing.
It is a lightweight and portable command-line YAML processor and can also deal with XML.
The syntax is similar to jq
Input
<root>
<myel name="Foo" />
<myel name="Bar">
<mysubel>stairway to heaven</mysubel>
</myel>
</root>
usage example 1
yq e '.root.myel.0.+name' $INPUT (version >= 4.30: yq e '.root.myel.0.+@name' $INPUT)
Foo
usage example 2
yq has a nice builtin feature to make XML easily grep-able
yq --input-format xml --output-format props $INPUT
root.myel.0.+name = Foo
root.myel.1.+name = Bar
root.myel.1.mysubel = stairway to heaven
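That flat output can then be grepped like any other key-value text, for example:
yq --input-format xml --output-format props $INPUT | grep mysubel
root.myel.1.mysubel = stairway to heaven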
usage example 3
yq can also convert an XML input into JSON or YAML
yq --input-format xml --output-format json $INPUT
{
"root": {
"myel": [
{
"+name": "Foo"
},
{
"+name": "Bar",
"mysubel": "stairway to heaven"
}
]
}
}
yq --input-format xml $FILE (YAML is the default format)
root:
myel:
- +name: Foo
- +name: Bar
mysubel: stairway to heaven

Unable to associate or grouping each set of xml attributes in bash script

I have XML in the following format, which has multiple occurrences of the same attributes (name, code and format).
<?xml version="1.0" encoding="UTF-8"?>
<config>
<input>
<pattern>
<name>ABC</name>
<code>1234</code>
<format>txt</format>
</pattern>
</input>
<input>
<pattern>
<name>XYZ</name>
<code>7799</code>
<format>csv</format>
</pattern>
</input>
</config>
I want to parse each of these patterns and construct strings like ABC-1234-txt, XYZ-7799-csv, etc., and add them to an array. The idea here is to group each pattern by constructing a string which will be used further on.
I have tried the command below but am unable to maintain the grouping:
awk -F'</?name>|</?code>|</?format>' ' { print $2 } ' sample.xml
It simply prints the available values of these attributes in the XML. As I am not an expert in bash, can anyone please suggest how to group each pattern into a string in the format mentioned above?
With bash and xmlstarlet:
mapfile -t array < <(
xmlstarlet select \
--text --template --match '//config/input/pattern' \
--value-of "concat(name,'-',code,'-',format)" -n file.xml
)
declare -p array
Output:
declare -a array=([0]="ABC-1234-txt" [1]="XYZ-7799-csv")
See: help mapfile and xmlstarlet select
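If you need the individual fields again later, each array entry can be split back apart on the dashes; a small sketch:
for entry in "${array[@]}"; do
IFS=- read -r name code format <<< "$entry"
echo "name=$name code=$code format=$format"
done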
with xslt:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" omit-xml-declaration="yes" indent="no"/>
<xsl:strip-space elements="*"/>
<xsl:template match="pattern">
<xsl:value-of select="concat(name,'-',code,'-',format,'
')"/>
</xsl:template>
</xsl:stylesheet>
Apply the transform via xsltproc:
$ xsltproc example.xslt sample.xml
ABC-1234-txt
XYZ-7799-csv
Populate array with xslt output:
$ declare -a my_array
$ my_array=($(xsltproc example.xslt sample.xml))
$ echo "${my_array[#]}"
ABC-1234-txt XYZ-7799-csv
$ echo "${my_array[1]}"
XYZ-7799-csv
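Note that my_array=($(...)) relies on word splitting, which works here because the values contain no whitespace; a splitting-safe alternative is mapfile, as in the answer above:
mapfile -t my_array < <(xsltproc example.xslt sample.xml)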

extract xml tag and its value

I want to read an XML file and set its values into variables.
For example,
qhr2400.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
I want its value in a variable like $CLLI =518, $COLLECTION_DATE = 06/04/20 00:45:00, SS7RT = 99..
so that I can use these values further to write an insert query.
Basically I want to load this .xml data into a database table.
This is what I tried.
read_xml.sh
awk 'NF==1 && (/ +<[a-zA-Z]+>/ || /^<[a-zA-Z]+>/ || / +<\/[a-zA-Z]+>/){
next
}
{
sub(/^ +/,"")
gsub(/\"|<|>/,"",$0);
sub(/\/.*/,"");
if($0){
print
}
}
' qhr2400.xml
Output
OPERATION type=1
CLLI518
COLLECTION_DATE06
SS7RT99
AQPRT_184
L7RMSUOCT_0180
L7RMSUOCT_0280
Any help is appreciated.
Thanks!
Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.
Theory:
According to compiling theory, XML/HTML can't be parsed using regexes based on finite state machines. Due to the hierarchical construction of XML/HTML you need to use a pushdown automaton and manipulate an LALR grammar using a tool like YACC.
Check this thread too, why-its-not-possible-to-use-regex-to-parse-html-xml
realLife©®™ everyday tools in a shell:
You can use one of the following:
xmllint often installed by default with libxml2, xpath1
xmlstarlet can edit, select, transform... Not installed by default, xpath1
xpath installed via perl's module XML::XPath, xpath1
xidel xpath3
saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3
Or you can use high-level languages and proper libs; I think of:
python's lxml (from lxml import etree)
perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath
ruby nokogiri, check this example
php DOMXpath, check this example
Check: Using regular expressions with HTML tags
For this, you need an XML parser and an XPath query in your shell, see:
$ xidel -se '//CLLI/text()' file.xml
Once you've fixed your XML (opening/closing tag mismatch: TABLENANE/TABLENAME):
xmllint --xpath '//CLLI/text()' file
This command is installed with libxml2 and is far from exotic, because it's installed by default on many Linux distros.
Output
518
So now you can retrieve all the wanted values into shell variables; one example:
$ collectiondate=$(xidel -se '//COLLECTION_DATE/text()' file)
$ echo "$collectiondate"
But please don't use awk or regexes to parse XML.
There are other tools, check:
How to execute XPath one-liners from shell?
Check too: Using regular expressions with HTML tags (same thing for XML)
Going further
declare -A arr
for i in CLLI COLLECTION_DATE SS7RT; do
read arr[$i] < <(xmllint --xpath "//$i/text()" file.xml)
done
Now you have an associative array with CLLI COLLECTION_DATE SS7RT keys:
Keys:
printf '%s\n' "${!arr[#]}"
CLLI
SS7RT
COLLECTION_DATE
Values:
$ printf '%s\n' "${arr[@]}"
518
99
06/04/20 00:45:00
for COLLECTION_DATE:
$ echo "${arr[COLLECTION_DATE]}"
06/04/20 00:45:00
It's possible to feed a numeric array in one line too:
readarray -t a < <(xidel -se '//*[self::CLLI or self::COLLECTION_DATE or self::SS7RT]/text()' file.xml)
I want its value in a variable like $CLLI =518, $COLLECTION_DATE = 06/04/20 00:45:00, SS7RT = 99.. so that I can use these values further to write an insert query.
I'm going to interpret this as: you want every child node of the "ROW" node, and its value, exported as a variable.
As "Gilles Quenot" already mentioned, please don't parse xml with regex. I'd suggest you give xidel a try.
You could do it manually and call xidel for each and every node...
CLLI=$(xidel -s qhr2400.xml -e '//CLLI')
COLLECTION_DATE=$(xidel -s qhr2400.xml -e '//COLLECTION_DATE')
[...]
...but xidel itself can also export variables, multiple at once even:
#multiple queries, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI' -e 'COLLECTION_DATE:=//COLLECTION_DATE' -e '[...]' --output-format=bash
#or one query, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
[...]
The output is just strings. To actually set/export these variables you have to use Bash's eval built-in command:
eval "$(xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash)"
And finally, to do it fully automatically for every child node in the "ROW" node:
xidel -s qhr2400.xml -e '//ROW/*/name()'
CLLI
COLLECTION_DATE
SS7RT
AQPRT_1
L7RMSUOCT_01
L7RMSUOCT_02
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"//ROW/{$x}")'
518
06/04/20 00:45:00
99
84
80
80
xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'
AQPRT_1='84'
L7RMSUOCT_01='80'
L7RMSUOCT_02='80'
eval "$(xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash)"
Another approach is to use XSLT (XSL Transformations).
Here is a fixed and indented version of the OP's XML file:
$ cat demo.xml
<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>
This is the stylesheet I will use:
$ cat demo.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" />
<xsl:strip-space elements="*"/>
<xsl:template match="ROW">
<xsl:text>CLLI="</xsl:text><xsl:value-of select="CLLI"/><xsl:text>" </xsl:text>
<xsl:text>COLLECTION_DATE="</xsl:text><xsl:value-of select="COLLECTION_DATE"/><xsl:text>" </xsl:text>
<xsl:text>SS7RT="</xsl:text><xsl:value-of select="SS7RT"/><xsl:text>" </xsl:text>
<xsl:text>AQPRT_1="</xsl:text><xsl:value-of select="AQPRT_1"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_01="</xsl:text><xsl:value-of select="L7RMSUOCT_01"/><xsl:text>" </xsl:text>
<xsl:text>L7RMSUOCT_02="</xsl:text><xsl:value-of select="L7RMSUOCT_02"/><xsl:text>" </xsl:text>
</xsl:template>
<xsl:template match="text()"/>
</xsl:stylesheet>
Here is a simple shell script which uses xsltproc to transform demo.xml into text suitable for input to eval, in order to create shell variables holding the required element values.
$ cat demo.sh
#!/bin/bash
eval "$(xsltproc demo.xsl demo.xml)"
echo "CLLI: $CLLI"
echo "COLLECTION_DATE: $COLLECTION_DATE"
echo "SS7RT: $SS7RT"
echo "AQPRT_1: $AQPRT_1"
echo "L7RMSUOCT_01: $L7RMSUOCT_01"
echo "L7RMSUOCT_02: $L7RMSUOCT_02"
Run the script:
$ ./demo.sh
CLLI: 518
COLLECTION_DATE: 06/04/20 00:45:00
SS7RT: 99
AQPRT_1: 84
L7RMSUOCT_01: 80
L7RMSUOCT_02: 80
$
read_xml.sh
gawk '
BEGIN {
FS="<|>"
}
{
if ($3 ~ /[0-9]/) { vars[$2] = $3; next }
}
END {
print vars["CLLI"]
print vars["SS7RT"]
print vars["COLLECTION_DATE"]
# etc...
}
' qhr2400.xml
result:
518
99
06/04/20 00:45:00
Of course, instead of printing in END, you can use the values from the vars array for something else.
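For example, since the stated goal is to build an insert query, the END block could emit SQL directly. A sketch, with made-up table and column names, written as it would appear in a separate .awk file (the embedded single quotes would clash with the shell quoting of the inline form):
END {
printf "INSERT INTO measurements (clli, collection_date, ss7rt) VALUES ('%s', '%s', '%s');\n", vars["CLLI"], vars["COLLECTION_DATE"], vars["SS7RT"]
}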
Rejecting AWK as an XML or HTML parser is unreasonable. AWK is great as a parser for any file, including damaged XML files. Using AWK requires more thought, but in exchange you don't need to install any exotic software. An XML file can be written in a way that makes AWK read some lines incorrectly, but the same can be said of XML analysis tools.
EDIT:
We fix the case where the XML file splits a field across several lines.
The file qhr2400.xml contains:
<CLLI>
518
</CLLI>
instead of
<CLLI>518</CLLI>
call:
cat qhr2400.xml | tr -d '\n' | sed 's/ *//g' | sed 's/</\n</g' | awk -f readxml.awk
readxml.awk is now:
BEGIN {
FS="<|>"
}
{
if ($3 ~ /[0-9]/) { vars[$2] = $3; next }
}
END {
print vars["CLLI"]
print vars["SS7RT"]
print vars["COLLECTION_DATE"]
# etc...
}
The result is correct.
EDIT2
For some time there has been a worrying fashion for adding complexity instead of simplifying the environment. Using a ready-made additional tool is usually a quick solution and may tempt you with its simplicity. Unfortunately, it is not always possible to install a huge Perl or Python or Ruby environment, e.g. on an embedded system with 32 MB of flash; it is not always possible to compile even a smaller tool for your processor architecture; and company policy can rightly prohibit adding anything to the standard set. There is also a case for one-time processing of a file. AWK, sed and tr are usually present, and they are the only rescue then. Also, parsing an XML file does not always mean extracting key-value pairs; it can be something completely different, e.g.
"ROW> <CLLI> 518 </CLLI> <COLLECTION", which makes ready-made analytical tools based on XPath useless. AWK is a programming language written specifically for parsing text files, in a practically unlimited way if we add the standard Unix tools.
However, if you have little experience, it is better to rely on ready-made solutions where possible.

Search and delete matches of patterns array

I made an array of the filenames of files that match a pattern:
lista=($(grep -El "<LastVisitedURL>.+</LastVisitedURL>.*<FavoriteTopic>0</FavoriteTopic>" *))
Now I want to delete from the file index.xml every enclosing tag that contains one of the filenames in the array.
for e in ${lista[*]}
do
sed '/\<TopicKey FileName=\"$e\"\>.*\<\/TopicKey\>/d' index.xml
done
The complete script is:
#! /bin/bash
#search xml files watched and no favorites.
lista=($(grep -El "<LastVisitedURL>.+</LastVisitedURL>.*<FavoriteTopic>0</FavoriteTopic>" *))
#declare -p lista
for e in ${lista[*]}
do
sed '/<TopicKey FileName=\"$e\">.*<\/TopicKey>/d' index.xml
done
Apart from the regex pattern not working, using sed's -i option to edit index.xml in place rereads the index file once for every filename in the array, and this is bad.
Any suggestions?
Here is an example using xmlstarlet in a shell:
% cat file.xml
<?xml version="1.0"?>
<root>
<foobar>aaa</foobar>
<LastVisitedURL>http://foo.bar/?a=1</LastVisitedURL>
<LastVisitedURL>http://foo.bar/?a=2</LastVisitedURL>
<LastVisitedURL>http://foo.bar/?a=3</LastVisitedURL>
</root>
Then, the command line:
% xmlstarlet edit --delete '//LastVisitedURL' file.xml
<?xml version="1.0"?>
<root>
<foobar>aaa</foobar>
</root>
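Applied to the question, the whole loop collapses into a single in-place edit. A sketch, assuming the TopicKey elements carry a FileName attribute as shown in the question:
args=()
for e in "${lista[@]}"; do
args+=(--delete "//TopicKey[@FileName='$e']")
done
xmlstarlet edit --inplace "${args[@]}" index.xml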

Command line combine files at change in part of name and part of file

I am on AIX, with bash, and we cannot install additional software at this time, so I am very limited to command-line batch processing and maybe custom Java scripts. So, I have a ton of XML files in different directories. Here is what a subset may look like.
root_dir
Pages
PAGES_1.XML
Queries
QUERIES_1.XML
QUERIES_2.XML
QUERIES_3.XML
I have put together a script that gets me almost everything I want, but I don't know how to do the last piece of the puzzle, if it is possible in a batch script. I create a new directory under root, copy all of the XML files into the new directory, then rename them to remove any spaces in the name and zero-pad the integer so they can be sorted in alphabetical/numerical order. The new output looks like this:
copy_dir
PAGES_001.XML
QUERIES_001.XML
QUERIES_002.XML
QUERIES_003.XML
I am almost there. The last piece is that these separate XML files need to be combined into one XML file for each type, so HISTORY_001.XML to HISTORY_099.XML need to be combined, then QUERIES_001.XML to QUERIES_099.XML need to be combined, but only after a specific point in the file. I have a regex for the files that will select the parts that I want; now I just need to figure out how to loop through each file subset. Maybe I jumped the gun and should do it before moving them, but assuming they are all in one directory, how can I go about this?
Here is an example of the data. All of the XML files carry these same types of information.
Pages
<?xml version="1.0"?>
<project name="">
<rundate></rundate>
<object_type code="false" firstitem="1" id="5" items="65" name="Pages">
<primary_key>Page Name</primary_key>
<secondary_key>Language Code</secondary_key>
<secondary_key>Page Field ID</secondary_key>
<secondary_key>Field Type</secondary_key>
<secondary_key>Record (Table) Name</secondary_key>
<secondary_key>Field Name</secondary_key>
<item id="ACCTG_TEMPLATE_AP">
...
</item>
<item id="ACCTG_TEMPLATE_AR">
...
</item>
</object_type>
</project>
Queries
<?xml version="1.0"?>
<project name="">
<rundate></rundate>
<object_type code="false" firstitem="1" id="10" items="46" name="Queries">
<primary_key>Query Name</primary_key>
<primary_key>User ID</primary_key>
<item id="1099G_ALL_SHORT. ">
...
</item>
<item id="1099G_ALL_VOUCHERS. ">
...
</item>
</object_type>
</project>
Regex to pull out header
(?:(?!(^\s*<item)).)*
Regex to pull out detail
^(\s*<item id=).*(</item>)
Regex to pull out footer
^(\s*</object_type).*
So I am assuming that what I want to do is have a counter and loop through each object-type XML subset: if I am on the first loop, pull the header and detail and output them to a new summary file; then continue through all other files, concatenating the detail; then, on the last file or on a change to a new object type, output the footer as well. Do you think this is possible using a bash script?
This will spit out commands to do the sorting and classification; just provide functions/scripts/whatever that do the right thing for files that are first, middle, last, or the only one in a group. The first and middle commands have to handle empty argument lists: middle for two-element groups and first for groups without a 1-sequenced file.
Edit: I broke the seds out to one command per line to handle seds that don't like semicolons
Run this as e.g. sh this.sh *_*.*
#!/bin/sh
#
# spit commands to sort, group, and classify argument filenames
# sorting by the number between `_` and `.` in their names and
# grouping by the text before the _.
{
# Everything through the sort would just be `ls -v` on GNU/anything...
for f; do
pfx=${f%%_*}
tail=${f#*_}
sortable=`printf %s_%03d.%s $pfx ${tail%.*} ${tail##*.}`
[ $f != $sortable ] \
&& echo mv $f $sortable >&2
echo $sortable
done \
| sort \
| sed '
/_0*1\./! H
// {
x
1! {
y/\n/ /
p
}
}
$!d
x
y/\n/ /
' \
| sed '
s/\([^ ]*\)\(.*\) \(.*\)/first \1\nmiddle\2\nlast \3/
t
s/^/only /
'
} 2>&1
The first of the above seds accumulates groups of one-per-line words that can be identified by their first line. The second classifies the groups and subs in the right commands. They're separate because the first sed involves a double-pump to handle a widow group, plus they're hairy enough as it is.
combine()
{
# pull the header from 1st file
while IFS= read && word=($REPLY) && [ "$word" != "<item" ]
do echo "$REPLY"
done <$1
# concat the detail from all files
for file
do cmd=:
while IFS= read && word=($REPLY)
do case $word in \<item) cmd=echo;; esac
$cmd "$REPLY"
case $word in \</item\>) cmd=:;; esac
done <$file
done
# output the footer
while IFS= read && word=($REPLY)
do case $word in \</object_type\>) cmd=echo;; esac
$cmd "$REPLY"
done <$file
}
combine PAGES_???.XML >PAGES.XML
combine QUERIES_???.XML >QUERIES.XML
