Using XPath, return the text positioned after the last comma

Using xpath, I want to return the value 000078 & 000077 from the below xml. The text for "Entity" tag can be 2 comma separated values or 3 or more. I always want the last value.
<Parent ID="123">
<SubParent ID="1">
<Name>Modem</Name>
<Entity>000006,000069,000078</Entity>
</SubParent>
<SubParent ID="2">
<Name>Modem</Name>
<Entity>000006,000077</Entity>
</SubParent>
</Parent>

XPath 1.0 is a selection language, not a string processing (or general purpose programming) language, and with it you can only select from the distinct nodes in your document.
The nodes that contain the values you are looking for are two text nodes, '000006,000069,000078' and '000006,000077', so //Entity/text() (or //Entity) is the closest you can get with XPath 1.0 alone.
Any further string processing, like pulling out the substring after the last comma when the number of commas varies, must be done in the host language. (If your processor supports XPath 2.0 or later, //Entity/tokenize(., ',')[last()] returns the last value directly.)
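As a rough illustration of doing the string step in the host language, here is a minimal Python sketch using only the standard library. The function name last_entity_values is made up for this example, and the XML is the sample from the question.

```python
import xml.etree.ElementTree as ET

# Sample document from the question.
XML = """<Parent ID="123">
  <SubParent ID="1">
    <Name>Modem</Name>
    <Entity>000006,000069,000078</Entity>
  </SubParent>
  <SubParent ID="2">
    <Name>Modem</Name>
    <Entity>000006,000077</Entity>
  </SubParent>
</Parent>"""


def last_entity_values(xml_text):
    """Return the part after the last comma of every <Entity> text.

    Hypothetical helper: the selection is done by the XML library,
    the string handling by plain Python.
    """
    root = ET.fromstring(xml_text)
    values = []
    for entity in root.iter("Entity"):
        text = entity.text or ""
        # rsplit with maxsplit=1 splits once, starting from the right,
        # so the final list element is whatever follows the last comma.
        values.append(text.rsplit(",", 1)[-1].strip())
    return values


print(last_entity_values(XML))
```

If an Entity contains no comma at all, rsplit returns the whole text, which is probably the behaviour you want for a single-value field.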
This is one of the examples that show that storing opaque strings that contain multiple data points (like comma-separated values) in XML is a bad idea.
This is what your XML should look like:
<Parent ID="123">
<SubParent ID="1">
<Name>Modem</Name>
<Entity>000006</Entity>
<Entity>000069</Entity>
<Entity>000078</Entity>
</SubParent>
<SubParent ID="2">
<Name>Modem</Name>
<Entity>000006</Entity>
<Entity>000077</Entity>
</SubParent>
</Parent>
With that structure you could simply select //Entity[last()]/text() and get exactly two nodes, one per SubParent.
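To see the effect of the restructured document, here is a small sketch with Python's standard library. ElementTree supports only a limited XPath subset (no text() step), so this example selects the last Entity element of each SubParent and reads its text in Python. The variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The restructured document: one value per <Entity> element.
RESTRUCTURED = """<Parent ID="123">
  <SubParent ID="1">
    <Name>Modem</Name>
    <Entity>000006</Entity>
    <Entity>000069</Entity>
    <Entity>000078</Entity>
  </SubParent>
  <SubParent ID="2">
    <Name>Modem</Name>
    <Entity>000006</Entity>
    <Entity>000077</Entity>
  </SubParent>
</Parent>"""

root = ET.fromstring(RESTRUCTURED)

# [last()] is evaluated per parent, so each SubParent contributes
# its own final <Entity> child.
last_entities = [e.text for e in root.findall("./SubParent/Entity[last()]")]

print(last_entities)
```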

Related

How to remove multi-line blocks of text of varying sizes from a file given the first and last lines and a substring?

I have an xml file listing several games and their metadata, like so:
<?xml version="1.0"?>
<gameList>
<game>
<path>./Besiege.desktop</path>
<name>Besiege</name>
<desc>Long description of game</desc>
<releasedate>20150128T000000</releasedate>
<developer>Spiderling Studios</developer>
<publisher>Spiderling Studios</publisher>
<genre>Strategy</genre>
<players>1</players>
</game>
<A bunch of other entries>
<game>
<path>./67000.The Polynomial.txt</path>
<name>The Polynomial - Space of the music</name>
<desc>Long description of game</desc>
<releasedate>20101015T000000</releasedate>
<developer>Dmytry Lavrov</developer>
<publisher>Dmitriy Uvarov</publisher>
<genre>Shooter, Music</genre>
<players>1</players>
<favorite>true</favorite>
</game>
<Another bunch of entries>
</gameList>
I want to remove every entry that contains the substring ".desktop" and leave all the rest. But just removing the line which contains this string isn't enough, I want to remove the whole block from <game> to </game>.
I know that in Linux, with bash, there are several ways to remove a fixed number of lines before or after a given string. But by comparing the two entries above, you can see that they don't always have the same number of fields. The descriptions inside the "<desc>" tags also vary from one to four paragraphs separated by empty lines. I have not found any solutions that deal with a variable number of lines around a target substring.
I thought there would be an easy way to split the text into blocks from the opening <game> tag to the closing </game> tag so that I could operate on them in a similar way to how one normally does with lines, in which case a simple while loop that tested for the presence of the substring and deleted the block if true, or something similar, would solve my problem. Well, I've been banging my head against grep, sed and awk and I've tried to set a convenient value for IFS so that it would only end lines at "</game>" and I am growing increasingly frustrated because I'm almost at the point where it would have been faster to do this manually. But then I'd remain ignorant.
I'm only just beginning to learn Bash so there is so much that I don't know, and I feel like this is the sort of thing that someone more knowledgeable could do with a single-liner but I'm completely stumped. So thank you for your time and please point me in the right direction.
Do not use line tools to edit XML files. Do not use Bash to edit XML files. Use XML tools to edit XML files. Write a program in Python, Perl, or another capable programming language with an XML library to edit XML.
The following with xmlstarlet is quite simple:
$ xmlstarlet ed -d '/gameList/game[ contains(path, ".desktop") ]' input.xml
<?xml version="1.0"?>
<gameList>
<game>
<path>./67000.The Polynomial.txt</path>
<name>The Polynomial - Space of the music</name>
<desc>Long description of game</desc>
<releasedate>20101015T000000</releasedate>
<developer>Dmytry Lavrov</developer>
<publisher>Dmitriy Uvarov</publisher>
<genre>Shooter, Music</genre>
<players>1</players>
<favorite>true</favorite>
</game>
</gameList>
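For comparison, the same deletion can be sketched in Python with the standard library, in line with the advice to use an XML library. The function name remove_desktop_entries is invented for this example, and the sample is a shortened version of the file in the question.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the question's file.
SAMPLE = """<?xml version="1.0"?>
<gameList>
  <game>
    <path>./Besiege.desktop</path>
    <name>Besiege</name>
  </game>
  <game>
    <path>./67000.The Polynomial.txt</path>
    <name>The Polynomial - Space of the music</name>
  </game>
</gameList>"""


def remove_desktop_entries(root):
    """Delete every <game> child whose <path> contains '.desktop'."""
    # findall returns a list, so removing children while looping is safe.
    for game in root.findall("game"):
        path = game.findtext("path") or ""
        if ".desktop" in path:
            root.remove(game)
    return root


root = remove_desktop_entries(ET.fromstring(SAMPLE))
remaining = [game.findtext("name") for game in root.findall("game")]
print(remaining)
```

For a file on disk you would use ET.parse() to load it and the tree's write() method to save the result, instead of the inline string used here.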

xmllint / Xpath extract parent node where child contains text from google shopping feed

I am trying to extract all "item" nodes containing a g:custom_label_0 with the text value "2020-2021"
So far, I manage to find all nodes containing the child g:custom_label_0, but I don't manage to filter by the text value of the field.
Here is the example XML:
<item>
<description>[...]</description>
<g:availability>in stock</g:availability>
<g:brand>Barts</g:brand>
<g:condition>new</g:condition>
<g:custom_label_0>2020-2021</g:custom_label_0>
<g:id>108873/10-3</g:id>
<g:image_link>[...]</g:image_link>
<g:price>26.99 EUR</g:price>
<g:sale_price>26.99 EUR</g:sale_price>
<g:shipping>
<g:country>NL</g:country>
<g:price>4.50 EUR</g:price>
</g:shipping>
<g:shipping_weight>7.95</g:shipping_weight>
<link>[....]</link>
</item>
...
There are nodes containing values other than 2020-2021, but I want to extract all complete item nodes containing this text.
Here's what I made in order to extract all nodes having the field available.
xmllint --xpath '//item["g:custom_label_0"]' myfile.xml
I tried adding a text filter via square brackets etc., but I have the feeling the quotation marks around custom_label_0 might be causing trouble. Adding more filters within the quotes is accepted (no error), but then I can't add more quotation marks inside to filter the string.
Does work, throws no error:
xmllint --xpath '//item["g:custom_label_0[text()]"]' myfile.xml
If I wanted to filter the text now, I would need to use quotation marks again. Escaping them breaks the code. How can I further filter down to the text "2020-2021" when both types of quotation marks are already used?
You're right: the quotes around g:custom_label_0 are causing trouble. They turn the predicate into a string literal, and a non-empty string always evaluates to true, so the expression returns all item elements.
The g: is a namespace prefix. To bind a namespace to a prefix in xmllint, you have to use it in shell mode (see https://stackoverflow.com/a/8266075/317052 for an example).
An alternative is to test the element name to select the g:custom_label_0 element and then test the value of that element to see if it's 2020-2021.
Example...
xmllint --xpath '//item[*[name()="g:custom_label_0"][.="2020-2021"]]' myfile.xml
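If a script is an option, the same filter can be sketched in Python with the namespace bound explicitly. This assumes the feed declares the g prefix as http://base.google.com/ns/1.0 (the question's excerpt does not show the declaration). The wrapper elements, the second and third items, and the function name items_with_label are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: the feed binds the g prefix to this namespace URI.
NS = {"g": "http://base.google.com/ns/1.0"}

# Reduced, partly hypothetical feed; only the first item is from the question.
FEED = """<rss xmlns:g="http://base.google.com/ns/1.0"><channel>
  <item>
    <g:id>108873/10-3</g:id>
    <g:custom_label_0>2020-2021</g:custom_label_0>
  </item>
  <item>
    <g:id>hypothetical-2</g:id>
    <g:custom_label_0>2019-2020</g:custom_label_0>
  </item>
  <item>
    <g:id>hypothetical-3</g:id>
  </item>
</channel></rss>"""


def items_with_label(root, label):
    """Return the <item> elements whose g:custom_label_0 text equals label."""
    return [
        item
        for item in root.iter("item")
        # findtext returns None when the child is missing, so items
        # without the label are skipped automatically.
        if item.findtext("g:custom_label_0", namespaces=NS) == label
    ]


matches = items_with_label(ET.fromstring(FEED), "2020-2021")
ids = [item.findtext("g:id", namespaces=NS) for item in matches]
print(ids)
```

Each matching element can be serialised again with ET.tostring() if you need the complete item nodes as text.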

xpath how to extract the element itself and one of its child?

I'm fetching data with python requests & xpath.
<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>
I want to get aaabbbccc.
I tried //div/*[not(self::p) and not(self::span)]//text() to exclude the p and span element, but it only returns bbb.
What is the correct path?
If the element structure is totally predictable and only the content of text nodes varies, then you can use //div/node()[not(self::p|self::span)]/descendant-or-self::text(). Note that this returns a sequence of text nodes, not a single string. This may also return some whitespace text nodes which you may want to filter out with the predicate [normalize-space(.)].
Another possibility would be //text()[not(parent::p|parent::span)].
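Python's standard library has no text() step, so the following is only an approximation of the second expression, written against ElementTree's text/tail model rather than real XPath. It assumes the flat structure shown in the question (no p or span nested inside other children). The helper name text_outside is invented here.

```python
import xml.etree.ElementTree as ET

# Markup from the question (well-formed, so an XML parser accepts it).
HTML = """<div class="test">
<p>pppp</p>
aaa
<em>bbb</em>
ccc
<span>span</span>
</div>"""

SKIP = {"p", "span"}


def text_outside(div, skip=SKIP):
    """Join all text in div except the text inside skipped child elements."""
    parts = [div.text or ""]
    for child in div:
        if child.tag not in skip:
            # Text inside a child that is kept, e.g. <em>bbb</em>.
            parts.extend(child.itertext())
        # The tail is text that follows the child but belongs to div,
        # so it is kept even when the child itself is skipped.
        parts.append(child.tail or "")
    # Strip each piece to drop the whitespace-only text between elements.
    return "".join(part.strip() for part in parts)


result = text_outside(ET.fromstring(HTML))
print(result)
```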

XPath - How to get image source from xml

Hello, I have this XML:
<item>
<title> Something for title»</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>
I tried to get the img src with the path //description//div[@class='feed-description']//div[@class='feed-image']//img/@src but it doesn't work.
Is there any solution?
A CDATA section escapes its contents. In other words, CDATA prevents its contents from being parsed as markup when the rest of the document is parsed. So the <div>s in there are not seen as XML elements, only as flat text. The <description> element has no element children ... only a single text child. As such, XPath can't select any <div> descendant of <description> because none exists in the parsed XML tree.
What to do?
If your XPath environment supports XPath 3.0, you could use parse-xml() to turn the flat text into a tree, then use XPath to select //div[@class='feed-description']//div[@class='feed-image']//img/@src from the resulting tree.
Otherwise, your best workaround may be to use primitive string-processing functions like substring-before(), substring-after(), or matches(). (The latter uses regular expressions and requires XPath 2.0.) Of course, many people will tell you not to use regular expressions to analyze markup like XML and HTML. For good reason: in the general case, it's very difficult to do it right (with regexes or with plain string searches). But for very restricted cases where the input is highly predictable, and in the absence of better tools, it can be the best tool for a less-than-ideal job.
For example, for the data shown in your question, you could use
substring-before(substring-after(//description, 'img src="'), '"')
In this case, the inner call substring-after(//description, 'img src="') returns pictureUrl.jpg" /></div>text for desc</div>, of which the substring before " is pictureUrl.jpg.
This isn't really robust, for example it'll fail if there's a space between src and =. But if the exact formatting is predictable, you'll be OK.
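If the host language is available, the two-stage idea behind parse-xml() can also be sketched there: read the CDATA content as a string, then parse that string as its own document. The Python example below uses only the standard library and assumes the embedded markup is well-formed XML, as it is in the question. Real-world feed HTML often is not, in which case an HTML parser would be needed instead. The function name image_src is made up.

```python
import xml.etree.ElementTree as ET

# Item from the question, with the title simplified.
ITEM = """<item>
<title>Something for title</title>
<link>some url</link>
<description><![CDATA[<div class="feed-description"><div class="feed-image"><img src="pictureUrl.jpg" /></div>text for desc</div>]]></description>
<pubDate>Thu, 11 Jun 2015 16:50:16 +0300</pubDate>
</item>"""


def image_src(item_xml):
    """Return the src of the image embedded in <description>, or None."""
    item = ET.fromstring(item_xml)
    # The parser hands back the CDATA content as plain text.
    embedded = item.findtext("description") or ""
    # Second parse: turn that text into a tree of its own.
    # Raises ET.ParseError if the embedded markup is not well-formed.
    fragment = ET.fromstring(embedded)
    img = fragment.find(".//div[@class='feed-image']/img")
    return img.get("src") if img is not None else None


src = image_src(ITEM)
print(src)
```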

How do I construct an XPath to select items that do not contain a string

How do I use something similar to the example below, but with the opposite result, i.e. items that do not contain (default) in the text?
<test>
<item>Some text (default)</item>
<item>Some more text</item>
<item>Even more text</item>
</test>
Given this
//test/item[contains(text(), '(default)')]
would return the first item. Is there a not operator that I can use with contains?
Yes, there is:
//test/item[not(contains(text(), '(default)'))]
Hint: not() is a function in XPath instead of an operator.
An alternative, possibly better way to express this is:
//test/item[not(text()[contains(., '(default)')])]
There is a subtle but important difference between the two expressions (let's call them A and B, respectively).
Simple case: If all <item> only have a single text node child, both A and B behave the same.
Complex case: If <item> can have multiple text node children, expression A only matches when '(default)' occurs in the first of them.
This is because text() matches all text node children and produces a node-set. So far no surprise. Now, contains() accepts a node-set as its first argument, but it needs to convert it to a string to do its job. And conversion from node-set to string only produces the string value of the first node in the set; all other nodes are disregarded (try string(//item) to see what I mean). In the simple case this is exactly what happens as well, but the result is not as surprising.
Expression B deals with this by explicitly checking every text node child individually instead of only the first one. It's therefore the more robust of the two.
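The difference can be made concrete with a small emulation in Python. This does not run the XPath expressions themselves. It rebuilds the list of text node children from ElementTree's text and tail attributes and applies the same logic as A and B. The third item with mixed content and all function names are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# First two items are from the question; the third is a made-up
# mixed-content case where '(default)' sits in the second text node.
DOC = """<test>
<item>Some text (default)</item>
<item>Some more text</item>
<item>Mixed <b>content</b> here (default)</item>
</test>"""


def text_node_children(elem):
    """Text node children of elem, in document order (emulates text())."""
    candidates = [elem.text] + [child.tail for child in elem]
    return [t for t in candidates if t]


def matches_a(item):
    """Emulates not(contains(text(), '(default)')): first text node only."""
    nodes = text_node_children(item)
    first = nodes[0] if nodes else ""
    return "(default)" not in first


def matches_b(item):
    """Emulates not(text()[contains(., '(default)')]): every text node."""
    return not any("(default)" in t for t in text_node_children(item))


items = ET.fromstring(DOC).findall("item")
result_a = [n for n, item in enumerate(items, 1) if matches_a(item)]
result_b = [n for n, item in enumerate(items, 1) if matches_b(item)]
print(result_a, result_b)
```

Under this emulation, A also selects the third item because it only looks at "Mixed ", while B correctly leaves it out.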
