xmllint: Formatting without adding header - bash

Is there a way to use $xmllint --format file without the header section?
<?xml version="1.0"?>
<Tag>
<Sub>A</Sub>
</Tag>
I know you can use --c14n but that does not seem to mix well with --format.
As $xmllint --format --c14n file will just produce:
<Tag><Sub>A</Sub></Tag>
Desired Result
<Tag>
<Sub>A</Sub>
</Tag>

You can use sed to remove the first line. Not saying it's the best but it would get you going:
xmllint --format <file> | sed 1d
Preferably, though, you would avoid making one million separate calls to xmllint (and to sed or tail).
I'm not sure if xmllint supports in-place editing, but if it does, something like this might be possible:
xargs < list_of_files_to_change.txt xmllint --inplace --format
xargs < list_of_files_to_change.txt sed -i 1d
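If you are calling sed per file anyway, a slightly safer variant deletes the declaration by pattern rather than by line number. A sketch (the /tmp path and sample content are illustrative):

```shell
# Sample file, as xmllint --format would emit it (illustrative content).
cat > /tmp/formatted.xml <<'EOF'
<?xml version="1.0"?>
<Tag>
  <Sub>A</Sub>
</Tag>
EOF

# Delete the XML declaration wherever it appears, rather than blindly
# dropping line 1; this is a no-op on files that have no declaration.
sed '/^<?xml /d' /tmp/formatted.xml
```

This way a file that happens to lack the declaration is passed through unchanged.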

Related

Fetch value from xml element which is enclosed with CDATA through awk

I am using awk command to fetch values from xml elements. Using below command
$(awk -F "[><]" '/'$tag_name'/{print $3}' $FILE_NAME | sort | uniq)
Here
File_name: XML File.
tag_name: name of xml element whose value we
need.
Sample XML
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
One of the tags in the xml contains CDATA, and for that tag the script is not working as expected: when I try to print the value, it prints blank.
Instead of using a generic tool like awk, which is not aware of XML's specificities, I suggest using xmlstarlet to select the nodes you want. For instance:
xmlstarlet select -t -v '//tag1' -n input.xml
will give as result:
test
Issuing:
xmlstarlet select -t -v '//tag2' -n input.xml
gives as output:
cdata_test
If you don't need the newline at the end of the returned string, just remove the -n from the options of the xmlstarlet command.
Keep it simple.
As xmlstarlet is not installed on my machine, I used sed before my awk command as follows, and that works for me:
$(sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' "${FILE_NAME}" | awk -F "[><]" '/'"$tag_name"'/{print $3}' | sort | uniq)
Also, if anybody has any other solution, that too is welcome.
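For reference, here is that combined pipeline run against the sample from the question, with the CDATA brackets escaped for sed (the file path and tag name are illustrative):

```shell
# Recreate the sample XML from the question.
cat > /tmp/input.xml <<'EOF'
<item>
<tag1>test</tag1>
<tag2><![CDATA[cdata_test]]></tag2>
</item>
EOF

tag_name=tag2
# Strip the CDATA wrapper first (brackets escaped, since [ is special
# in sed regexes), then split on < and > and print the element content.
sed -e 's/<!\[CDATA\[//g; s/\]\]>//g' /tmp/input.xml |
  awk -F "[><]" '/'"$tag_name"'/{print $3}' | sort | uniq
```

With the wrapper removed, the line becomes `<tag2>cdata_test</tag2>`, so awk's third field is the value again.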

replace XML tags value in shell script

Hi, I have an xml file like this:
<parameter name="END_DATE">20181031</parameter>
I want to replace this tag's value with some other value. I tried like this:
dt=$(awk -F '[<>]' '/_DATE/{print $3}' test.xml)
That extracts the tag's value.
I have another variable with a value like this:
newdt=20181108
Now I need to replace the extracted value with this new one.
Any help would be appreciated.
Though Chepner is right that awk and sed are not the right tools for xml, in case you do NOT have xmlstarlet on your system, try the following:
echo $newdt
20181108
awk -v dat="$newdt" 'match($0,/>[0-9]+</){$0=substr($0,1,RSTART) dat substr($0,RSTART+RLENGTH-1)} 1' Input_file
If sed works for you -
sed -Ei 's/( name="END_DATE")>20181031</\1>20181108</' test.xml
An xml parser is probably a better idea, though.
If you need to embed the variable -
sed -Ei "s/( name=\"END_DATE\")>20181031</\1>$newdt</" test.xml
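A variant that doesn't hard-code the old value, matching whatever digits the END_DATE parameter currently holds. A sketch (-i without a suffix argument is GNU sed behavior, and the file path is illustrative):

```shell
# Recreate the sample line from the question.
cat > /tmp/test.xml <<'EOF'
<parameter name="END_DATE">20181031</parameter>
EOF

newdt=20181108
# Match any run of digits after the END_DATE attribute and swap in $newdt.
sed -Ei "s/(name=\"END_DATE\">)[0-9]+</\1$newdt</" /tmp/test.xml
cat /tmp/test.xml
```

This keeps working after the first replacement, when the old value is no longer 20181031.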

Substitute all version="n.n" occurrences except xml version with bash

I have a bunch of files in the directory that contain this pattern: version="0". It can be any number inside, but I don't want to affect the <?xml version="1.0" ?> parts. The declaration is not necessarily on the first line, so we can't just skip the first line.
The main problem is that sed and awk's gsub don't support lookbehind. I decided it is acceptable to do double work: replace all versions, then fix the xml declarations back. But sed with -r misunderstands the capturing groups.
What I have so far:
sed -r 's#(\<\?xml .*)version="[^"]*"(.*\?\>)#\1version="1.0"\2#g' fixing xmls
To change all version attributes within an XML document, the following XMLStarlet command will suffice:
xmlstarlet ed --inplace \
-u '//*[@version="0"]/@version' -v 1 \
/your/directory/*.xml
I think I kind of made it:
find test -exec sed -i 's/version="[^"]*"/version="800"/g' {} \; -print | xargs -I FILE sed -i 's#\(<?xml \)version="[^"]*"\(.*\)#\1version="1.0"\2#g' FILE
where 800 is the desired value, but it is still double work.
Don't escape the < and > if you aren't using them as word-boundary operators. Try this:
sed -r 's#(<\?xml .*)version="[^"]*"(.*\?>)#\1version="1.0"\2#g' file
That said, you should avoid the any-character .* pattern, which is greedy. A safer command would be:
sed -r 's#(<\?xml[^>]*)version="[^"]*"([^>]*)\?>#\1version="1.0"\2?>#g' file
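Another way to avoid the double pass over the files is a single sed invocation that shields the declaration with a placeholder first. A sketch (@XMLDECL@ is an arbitrary token assumed not to occur in the files, and the sample file is illustrative):

```shell
# Sample file with a declaration plus two ordinary version attributes.
cat > /tmp/doc.xml <<'EOF'
<?xml version="1.0" ?>
<root version="0">
  <child version="3"/>
</root>
EOF

# 1. Shield the declaration's version attribute with a placeholder.
# 2. Rewrite every remaining version attribute.
# 3. Restore the declaration.
sed -e 's/\(<?xml[^>]*\)version="[^"]*"/\1version=@XMLDECL@/' \
    -e 's/version="[^"]*"/version="800"/g' \
    -e 's/version=@XMLDECL@/version="1.0"/' /tmp/doc.xml
```

The three expressions run in order on each line, so the placeholder protects the declaration from the global substitution in the middle.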

Getting text between first occurrence of two strings [Shell]

I have a feed.xml file that looks something like this. What I want to do is grab test.html from this feed (basically, the topmost item's url). Any thoughts on how to do this?
<rss>
<item>
<title>ABC</title>
<url>
test.html
</url>
</item>
<item>
<title>CDE</title>
<url>
test1.html
</url>
</item>
</rss>
Thanks!
If the structure is fixed and you know that the URL has the postfix .html, you can simply do:
cat <yourfile> | grep ".html" | head -n1
If you don't know the postfix (or the string "html" could occur earlier), you can do:
cat <yourfile> | grep -A1 "<url>" | head -n2 | tail -n1
EDIT
If the structure is not fixed (i.e., there are no newlines), then this:
cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | cut -d'>' -f2 | cut -d'<' -f1
or that
cat <yourfile> | grep -o "<url>[^<]*</url>" | head -n1 | sed -E -e"s#<url>(.*)</url>#\1#"
may work.
This might work for you:
sed '/<url>/,/<\/url>/{//d;s/ *//;q};d' file.xml
This awk script should work:
awk '/<url>/ && url==0 {url=1;next;} {if(url==1) {print;url=2;}}' file
EDIT:
Following grep command might also work:
grep -m 1 "^ *<url>" -A1 file | grep -v "<url>"
Instead of using line-based tools, I'd suggest using an xsl transform to get the data you want out of the document without making assumptions about the way it's formatted.
If you save this to get-url.xsl:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:xs="http://www.w3.org/2001/XMLSchema">
<xsl:output method="text"/>
<xsl:template match="/">
<xsl:value-of select="normalize-space(rss/item/url)"/>
</xsl:template>
</xsl:stylesheet>
Then you can get the value of url from feed.xml like this:
$ xsltproc get-url.xsl feed.xml; echo
test.html
$
The extra echo is just there to give you a newline after the end of the output, to make it friendly for an interactive shell. Just remove it if you're assigning the result to a shell variable with $().
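If xsltproc isn't available, the same first-match extraction can be sketched in awk, assuming the value sits on the line after <url> as in the sample (the /tmp path is illustrative):

```shell
# Recreate the sample feed from the question.
cat > /tmp/feed.xml <<'EOF'
<rss>
<item>
<title>ABC</title>
<url>
test.html
</url>
</item>
<item>
<title>CDE</title>
<url>
test1.html
</url>
</item>
</rss>
EOF

# On the first <url>, read the next line, trim whitespace, print, stop.
awk '/<url>/ {getline; gsub(/^[ \t]+|[ \t]+$/, ""); print; exit}' /tmp/feed.xml
```

Unlike the XSL approach, this does depend on the formatting, so it is only a fallback.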

pipe one long line as multiple lines

Say I have a bunch of XML files which contain no newlines, but basically contain a long list of records, delimited by </record><record>
If the delimiter were </record>\n<record> I would be able to do something like cat *.xml | grep xyz | wc -l to count instances of records of interest, because cat would emit the records one per line.
Is there a way to write SOMETHING *.xml | grep xyz | wc -l where SOMETHING can stream out the records one per line? I tried using awk for this but couldn't find a way to avoid streaming the whole file into memory.
Hopefully the question is clear enough :)
This is a little ugly, but it works:
sed 's|</record>|</record>\
|g' *.xml | grep xyz | wc -l
(Yes, I know I could make it a little bit shorter, but only at the cost of clarity.)
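With GNU sed, \n in the replacement avoids the literal-newline quoting; a self-contained sketch (the sample data is illustrative, and grep -c is equivalent to grep | wc -l here):

```shell
# One long line holding three records, one of which contains "xyz".
printf '<record>abc</record><record>xyz</record><record>def</record>' > /tmp/records.xml

# Break the stream after each closing tag, then count matching lines.
sed 's|</record>|</record>\n|g' /tmp/records.xml | grep -c xyz
```

Because sed processes the stream as it goes, this avoids loading whole files into memory the way a naive awk record-splitting approach might.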
If your record body contains no characters like <, /, or >, then you may try this:
grep -E -o 'SEARCH_STRING[^<]*</record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^/]*/record>' *.xml| wc -l
or
grep -E -o 'SEARCH_STRING[^>]*>' *.xml| wc -l
Here is a different approach using xsltproc, grep, and wc. Warning: I am new to XSL so I can be dangerous :-). Here is my count_records.xsl file:
<?xml version="1.0"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output method="text" /> <!-- Output text, not XML -->
<xsl:template match="record"> <!-- Search for "record" node -->
<xsl:value-of select="text()"/> <!-- Output: contents of node record -->
<!-- Output: a new line (comments are not allowed inside xsl:text) -->
<xsl:text>
</xsl:text>
</xsl:template>
</xsl:stylesheet>
On my Mac, I found a command line tool called xsltproc, which reads instructions from an XSL file and processes XML files. So the command would be:
xsltproc count_records.xsl *.xml | grep SEARCH_STRING | wc -l
The xsltproc command displays the text in each node, one line at a time
The grep command filters out the text you are interested in
Finally, the wc command produces the count
You may also try xmlstarlet for gig-sized files:
# cf. http://niftybits.wordpress.com/2008/03/27/working-with-huge-xml-files-tools-of-the-trade/
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
awk '{n+=$1} END {print n}'
xmlstarlet sel -T -t -v "count(//record[contains(normalize-space(text()),'xyz')])" -n *.xml |
paste -s -d '+' /dev/stdin | bc
