I'm thinking this is a very simple xpath question .. I'm just not sure why my xpath isn't working.
Here's what my XML looks like:
<A>
<B>foo</B>
</A>
<C>
<A>
<B>foo</B>
</A>
</C>
Now .. I want to grab all "A" elements which contain a "B" with contained text "foo".
//A[B[text()='foo']]
//A matches all A elements,
//A[B] keeps only those that have a B as a child,
//A[B[text()='foo']] keeps only those whose B child contains foo as text.
I suggest reading the XPath tutorial at w3schools.com.
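As a quick check, here is a minimal sketch of that query using Python's standard xml.etree.ElementTree. Two assumptions: the sample is wrapped in a hypothetical <root> element so that it is well-formed, and because ElementTree supports only a subset of XPath (no nested predicates, no text()), the predicate is written in its [B='foo'] form, which should select the same elements for this sample. The extra A with "bar" is added only to show the filter doing something.

```python
import xml.etree.ElementTree as ET

# Hypothetical <root> wrapper: the original sample has two top-level elements.
xml_text = (
    "<root>"
    "<A><B>foo</B></A>"
    "<C><A><B>foo</B></A></C>"
    "<A><B>bar</B></A>"  # added for illustration; should not match
    "</root>"
)
root = ET.fromstring(xml_text)

# ElementTree spelling of //A[B[text()='foo']] for this simple case:
# every A, at any depth, that has a B child whose text equals "foo".
matches = root.findall(".//A[B='foo']")
count = len(matches)
print(count)
```

With a full XPath 1.0 engine the original expression //A[B[text()='foo']] can be used unchanged.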
I have the following HTML structure (there are many blocks using the same architecture):
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
Now, I want to get the content of <b> using Xpath which I tried like so:
//span[@id="mySpan"]/i/node()[1][contains(text(),"Price")]
which does not match anything. How can I match this using the node()[1] text as an anchor?
Regarding the XPath you tried: text() selects text-node children, and the first node inside <i> is itself a text node with no children, so simply use . instead:
//span[@id="mySpan"]/i/node()[1][contains(.,"Price")]
For the ultimate goal, I'd suggest this XPath:
//span[@id="mySpan"]/i[contains(.,"Price")]/b
or, if you specifically want to match against the first node within <i>:
//span[@id="mySpan"]/i[contains(node(),"Price")]/b
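For anyone who wants to try the idea outside an XPath 1.0 engine, here is a rough sketch with Python's standard xml.etree.ElementTree. It is an approximation, not the expression above: ElementTree has no contains(), so that test is done in plain Python on the element's text content, and the <div> wrapper is a hypothetical container added so the <span> can be found as a descendant.

```python
import xml.etree.ElementTree as ET

# Hypothetical <div> wrapper around the block from the question.
markup = """<div>
<span id="mySpan">
<i>
Price
<b>
3 900
<small>€</small>
</b>
</i>
</span>
</div>"""
doc = ET.fromstring(markup)

prices = []
for i_el in doc.findall(".//span[@id='mySpan']/i"):
    # Plain-Python stand-in for the predicate [contains(., "Price")].
    if "Price" in "".join(i_el.itertext()):
        for b in i_el.findall("b"):
            # Collapse runs of whitespace in the text content of <b>.
            prices.append(" ".join("".join(b.itertext()).split()))

print(prices)
```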
It's a basic question, but I couldn't find the answer anywhere.
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<c2>
<d2>D2</d2>
<e2>E2</e2>
</c2>
</b2>
</a>
From the above I'd like to return:
<a>
<d1>D1</d1>
<e1>E1</e1>
<d2>D2</d2>
<e2>E2</e2>
</a>
And not:
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<d2>D2</d2>
<e2>E2</e2>
</b2>
</a>
If that makes any sense. I tried "/a", but that gave me:
<a>
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<b2>
<c2>D2E2</c2>
</b2>
</a>
If you meant to select all leaf elements (elements without child elements), you can try this XPath:
//*[not(*)]
Or use an XPath union (|) to get the child elements of <b1> and <c2>:
(//b1/* | //c2/*)
Given the sample XML you posted, both of the XPath expressions above will return:
<d1>D1</d1>
<e1>E1</e1>
<d2>D2</d2>
<e2>E2</e2>
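The same leaf selection can be sketched in Python's standard xml.etree.ElementTree. Since that module does not support not() in path expressions, the test "has no child elements" is done here as len(el) == 0; treat this as an equivalent filter in the host language rather than as the XPath itself.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<a>"
    "<b1><d1>D1</d1><e1>E1</e1></b1>"
    "<b2><c2><d2>D2</d2><e2>E2</e2></c2></b2>"
    "</a>"
)

# Host-language equivalent of //*[not(*)]: elements with no child elements.
leaves = [el for el in doc.iter() if len(el) == 0]
leaf_tags = [el.tag for el in leaves]
print(leaf_tags)
```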
But if you really need the result to be wrapped in <a>, then I agree with @minopret's comment: that isn't what XPath is meant to do. XSLT is a more suitable way to transform XML into a different format.
UPDATE:
In response to your last comment, there is no such grouping in XPath; it should be done in the host language if you need that data structure. Your best bet is to select the parents of the desired nodes in XPath so you get them grouped by parent, and then do further processing in the host language. For example:
//*[not(*)]/parent::*
//*[*[not(*)]]
Either of the two XPath queries above returns:
<b1>
<d1>D1</d1>
<e1>E1</e1>
</b1>
<c2>
<d2>D2</d2>
<e2>E2</e2>
</c2>
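To illustrate the "group by parent, then process in the host language" idea, here is a small Python sketch using the standard xml.etree.ElementTree. The dictionary layout is just one assumed shape for the grouped data, and the filter is written in Python because ElementTree does not support not() in paths.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<a>"
    "<b1><d1>D1</d1><e1>E1</e1></b1>"
    "<b2><c2><d2>D2</d2><e2>E2</e2></c2></b2>"
    "</a>"
)

# Host-language equivalent of //*[*[not(*)]]: every element that has at
# least one leaf child, mapped to the text of those leaf children.
groups = {}
for parent in doc.iter():
    leaf_children = [child for child in parent if len(child) == 0]
    if leaf_children:
        groups[parent.tag] = [child.text for child in leaf_children]

print(groups)
```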
XPath can only return nodes that are already present in your source tree. To construct new nodes, or reorganise the tree, you need XSLT or XQuery.
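If neither XSLT nor XQuery is at hand, the new tree can also be built in the host language. The following is only a sketch in Python with the standard xml.etree.ElementTree: it copies the leaf elements into a freshly created <a> element, which is the flattened shape asked for in the question.

```python
import copy
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<a>"
    "<b1><d1>D1</d1><e1>E1</e1></b1>"
    "<b2><c2><d2>D2</d2><e2>E2</e2></c2></b2>"
    "</a>"
)

# Build a new root with the same tag and append copies of the leaf
# elements, so the source tree itself is left untouched.
flat = ET.Element(doc.tag)
for el in doc.iter():
    if len(el) == 0:
        flat.append(copy.deepcopy(el))

flat_xml = ET.tostring(flat, encoding="unicode")
print(flat_xml)
```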
Here is the code:
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
Using Xpath I want to extract the inner text of all the Paragraph tags which are contained inside the anchor tag as a single text entity.
The final result I want is
string letterBody = document.DocumentNode.SelectSingleNode("//XPATH QUERY").InnerText;
where letterBody="Dear Sir, This is with...................Regards."
You just need to select the <a> element; its inner text will contain all the text nodes beneath it.
So your XPath would be /a[@id='Letter1'] or just /a.
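The same "select the container, then read its whole text" approach can be sketched in Python with the standard xml.etree.ElementTree (the question itself uses C#, so this is only an illustration). The <body> wrapper is a hypothetical container, and the whitespace collapsing is an assumption about how the single text entity should look.

```python
import xml.etree.ElementTree as ET

# Hypothetical <body> wrapper around the markup from the question.
markup = """<body>
<a id='Letter1'>
<p>Dear Sir, </p>
<p>This is with.........</p>
<p>I would be.......</p>
<p>Hoping to hear from you soon</p>
<p>Regards.</p>
</a>
</body>"""
doc = ET.fromstring(markup)

anchor = doc.find(".//a[@id='Letter1']")
# itertext() walks every text node under <a>; join and collapse whitespace.
letter_body = " ".join("".join(anchor.itertext()).split())
print(letter_body)
```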
I am working on extracting text from HTML documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The HTML document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world2 that is between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your XPath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
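For comparison, here is how the same distinction looks in Python's standard xml.etree.ElementTree, which has no text() step at all: the text inside an element is its .text, and the text that follows it inside the parent is its .tail. This is only a sketch, and the helper name text_after is made up for the example.

```python
import xml.etree.ElementTree as ET

td = ET.fromstring(
    "<td>"
    '<a name="hw1">HELLOWORLD1</a>Hello world1'
    '<a name="hw2">HELLOWORLD2</a>Hello world2'
    '<a name="hw3">HELLOWORLD3</a>Hello world3'
    "</td>"
)

def text_after(parent, name):
    """Return the text that follows <a name=...> inside parent (hypothetical helper)."""
    a = parent.find(f"a[@name='{name}']")
    if a is None:
        return None
    return (a.tail or "").strip()

inside = td.find("a[@name='hw2']").text  # text of the <a> itself
after = text_after(td, "hw2")            # text of the <td> after that <a>
print(inside, "|", after)
```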
Update (following comments and update to question):
This XPath selects the second text node from an $item that has an a element whose name attribute is set to hw:
$item[a/@name='hw']//text()[2]
"I would not want to use text()[3] but is there some way I could extract the text out between /a[@name='hw2'] and /a[@name='hw3']."
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (aka the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace $ns1 in the above expression with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
Lastly, if you really have XQuery (or XPath 2.0), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
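To make the intersection idea concrete, here is a sketch in Python using the standard xml.dom.minidom, which (unlike ElementTree) exposes text nodes as real sibling nodes. It does not evaluate the XPath above; it walks the sibling axes by hand and intersects the two lists by node identity. The helper names are made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<td>"
    '<a name="hw1">HELLOWORLD1</a>Hello world1'
    '<a name="hw2">HELLOWORLD2</a>Hello world2'
    '<a name="hw3">HELLOWORLD3</a>Hello world3'
    "</td>"
)
td = doc.documentElement

def anchor(name):
    """Find the <a> element with the given name attribute (hypothetical helper)."""
    for a in td.getElementsByTagName("a"):
        if a.getAttribute("name") == name:
            return a
    raise LookupError(name)

def sibling_text(node, step):
    """Collect text-node siblings, moving via 'nextSibling' or 'previousSibling'."""
    found = []
    sib = getattr(node, step)
    while sib is not None:
        if sib.nodeType == sib.TEXT_NODE:
            found.append(sib)
        sib = getattr(sib, step)
    return found

ns1 = sibling_text(anchor("hw2"), "nextSibling")      # following-sibling::text()
ns2 = sibling_text(anchor("hw3"), "previousSibling")  # preceding-sibling::text()

# Intersection by identity: nodes of ns1 that are also members of ns2.
between = [n for n in ns1 if any(n is m for m in ns2)]
texts = [n.data for n in between]
print(texts)
```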
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
XML: /A/B or /A
I want to get all A nodes that do not have any B children.
I've tried
/A[not(B)]
/A[not(exists(B))]
without success.
I prefer a solution with the syntax /*[local-name()="A" and .... ], if possible. Any ideas that work?
Clarification: the XML looks like:
<WhatEver>
<A>
<B></B>
</A>
</WhatEver>
or
<WhatEver>
<A></A>
</WhatEver>
Maybe
*[local-name() = 'A' and not(descendant::*[local-name() = 'B'])]?
Also, there should be only one root element, so for /A[...] you're either getting all your XML back or none. Maybe //A[not(B)] or /*/A[not(B)]?
I don't really understand why /A[not(B)] doesn't work for you.
~/xml% xmllint ab.xml
<?xml version="1.0"?>
<root>
<A id="1">
<B/>
</A>
<A id="2">
</A>
<A id="3">
<B/>
<B/>
</A>
<A id="4"/>
</root>
~/xml% xpath ab.xml '/root/A[not(B)]'
Found 2 nodes:
-- NODE --
<A id="2">
</A>
-- NODE --
<A id="4" />
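The same check can be reproduced without a command-line tool. This is a sketch with Python's standard xml.etree.ElementTree; because that module does not support not() inside a path, the "no B child" condition is tested in Python with find() rather than in the path expression.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<root>"
    '<A id="1"><B/></A>'
    '<A id="2"></A>'
    '<A id="3"><B/><B/></A>'
    '<A id="4"/>'
    "</root>"
)

# Host-language equivalent of /root/A[not(B)]:
# direct A children of the root that have no direct B child.
without_b = [a for a in root.findall("A") if a.find("B") is None]
ids = [a.get("id") for a in without_b]
print(ids)
```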
Try this "/A[not(.//B)]" or this "/A[not(./B)]".
The first / causes XPath to start at the root of the document, I doubt that is what you intended.
Perhaps you meant //A[not(B)] which would find all A nodes in the document at any level that do not have a direct B child.
Or perhaps you are already at a node that contains A nodes in which case you just want A[not(B)] as the XPath.
If you are trying to get A anywhere in the hierarchy from the root, this works (for XSLT 1.0 as well as 2.0, in case it's used in XSLT):
//descendant-or-self::node()[local-name(.) = 'a' and not(count(b))]
OR you can also do
//descendant-or-self::node()[local-name(.) = 'a' and not(b)]
OR also
//descendant-or-self::node()[local-name(.) = 'a' and not(child::b)]
There are any number of ways in XSLT to achieve the same thing.
Note: XPath is case-sensitive, so if your node names differ from these (I'm sure no one actually uses A and B), make sure the case matches.
Use this:
/*[local-name()='A' and not(descendant::*[local-name()='B'])]
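A local-name() test is mostly useful when the document uses namespaces, which is a common reason for a plain A[not(B)] to match nothing. As a sketch of the same idea in Python's standard xml.etree.ElementTree: the namespace urn:example is a made-up assumption, the helper local_name is hypothetical, and the scan covers all elements (closer to //* than to /*) so that the nested <A> elements of the clarified sample are found.

```python
import xml.etree.ElementTree as ET

# Assumed namespace for illustration; the original sample declares none.
doc = ET.fromstring(
    '<WhatEver xmlns="urn:example">'
    '<A id="1"><B></B></A>'
    '<A id="2"></A>'
    "</WhatEver>"
)

def local_name(el):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return el.tag.rsplit("}", 1)[-1]

def has_descendant(el, name):
    """True if any descendant of el (excluding el itself) has that local name."""
    return any(local_name(d) == name for d in el.iter() if d is not el)

hits = [
    el for el in doc.iter()
    if local_name(el) == "A" and not has_descendant(el, "B")
]
hit_ids = [el.get("id") for el in hits]
print(hit_ids)
```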