XPath: Question concerning the p element from HTML

I have a question concerning XPath and the p-element from HTML. Let's say I'm confronted with an HTML-structure that looks like this:
<div id="this-is-a-text">
This is text segment 1.
<p>This is text segment 2.</p>
this is text segment 3.
<div id="this-is-not-part-of-the-text">This doesn't belong to the text.</div>
This is text segment 4.
</div>
I'm wondering what the correct way is to select all text segments, no matter whether they're inside a p-element or not. (NB: The sequence of the elements is random.)
What I don't understand is why //div[@id="this-is-a-text"]/p seems to do the job (instead of just returning text segment 3), whereas //div[@id="this-is-a-text"]/text() doesn't return any results at all.
Can anyone help me understand this?
Thanks!
Bob

As Martin Honnen mentioned, the query //div[@id="this-is-a-text"]/text() should return a set of three text segments:
"\nThis is text segment 1.\n",
"\nthis is text segment 3.\n",
"\nThis is text segment 4.\n"
If I understand your question correctly, you need a query like
//div[@id="this-is-a-text"]//text()
This should return the set:
"\nThis is text segment 1.\n",
"This is text segment 2.",
"\nthis is text segment 3.\n",
"This doesn't belong to the text.",
"\nThis is text segment 4.\n"

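To see the difference between /text() and //text() concretely, here is a minimal Python sketch. It assumes the markup is well-formed enough to parse as XML. Python's standard-library ElementTree does not support text() in its XPath subset, so the sketch emulates the two queries with the .text and .tail attributes instead of running the XPath itself; the helper names child_text_nodes and descendant_text_nodes are made up for this illustration.

```python
import xml.etree.ElementTree as ET

HTML = """<div id="this-is-a-text">
This is text segment 1.
<p>This is text segment 2.</p>
this is text segment 3.
<div id="this-is-not-part-of-the-text">This doesn't belong to the text.</div>
This is text segment 4.
</div>"""

root = ET.fromstring(HTML)


def child_text_nodes(elem):
    """Emulate elem/text(): only text nodes that are direct children."""
    # ElementTree stores the text before the first child in .text and
    # the text after each child in that child's .tail.
    nodes = [elem.text] + [child.tail for child in elem]
    return [t for t in nodes if t]


def descendant_text_nodes(elem):
    """Emulate elem//text(): text nodes at any depth, in document order."""
    return list(elem.itertext())


direct = child_text_nodes(root)
everything = descendant_text_nodes(root)

for label, nodes in (("/text()", direct), ("//text()", everything)):
    print(label, [t.strip() for t in nodes])
```

The first list holds segments 1, 3 and 4; the second additionally holds segment 2 and the text of the inner div, which matches the two result sets listed above.
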
Related

Find nodes containing string split over 2 tags

Is there a way to get nodes containing a specific string that is split over 2 tags? I tried this, but it doesn't work: I can't manage to ignore the foreign tag.
$crawler->filterXPath('//p/text()[contains(., "caractère a priori")]');
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
Thanks a lot !
The XPath below should work for you: it returns only the <p> nodes whose string value contains the text specified in the contains() call. I've expanded the example a bit so that I could test it.
XPath:
div/p[contains(., 'caractère a priori')]
Input
<div>
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
<p>leur poisson <foreign xml:lang="lat">a priori</foreign>, soit..</p>
</div>
Output
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
Hopefully that gives you enough to go on!
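The reason contains(., ...) works on the p element but not on its individual text() nodes is that the string value of an element is the concatenation of all its descendant text. A rough Python sketch of that idea, using the standard-library ElementTree rather than the asker's PHP crawler (the variable names are illustrative only):

```python
import xml.etree.ElementTree as ET

XML = """<div>
<p>leur caractère <foreign xml:lang="lat">a priori</foreign>, soit..</p>
<p>leur poisson <foreign xml:lang="lat">a priori</foreign>, soit..</p>
</div>"""

NEEDLE = "caractère a priori"
root = ET.fromstring(XML)

# Per text node: the phrase is split across two nodes, so none contains it.
per_node_hits = [
    p for p in root.findall("p")
    if any(NEEDLE in chunk for chunk in p.itertext())
]

# Per element string value: join every descendant text node first,
# which is what contains(., '...') sees when . is the <p> element.
string_value_hits = [
    p for p in root.findall("p")
    if NEEDLE in "".join(p.itertext())
]

print(len(per_node_hits), len(string_value_hits))
```
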

Extracting text in between nodes through XPath

I'm trying to read specific parts of a webpage through XPath. The page is not very well-formed but I can't change that...
<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>
I want to extract the text of the various items, i.e. the text in between the header divs (e.g. 'Here is the text of the first item.'). I've used this XPath expression so far:
//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]
However, I cannot hardcode the name of the ending item, because the order of the items differs across the pages I want to scrape (e.g. 'First item' may be followed by 'Third item').
Any help on how to adapt my XPath query would be greatly appreciated.
Found it!
//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]
Indeed your solution, Aleh, won't work for tags inside the text.
Now, the one remaining case is the last item, which is not followed by an element with class=header, so it will include all text found until the end of the document. Ideas?
//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
EDIT: Sorry, this will not work for <strong> cases. Will update my answer
EDIT2: Used @Michiel's part. Looks like omg but works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
Seems that this should be solved with a better solution :)
For the sake of completeness, the final query, composed of various suggestions throughout the thread:
//*[
  @class='textfield' and position() = 1
]
//text()[
  preceding::*[
    @class='header' and contains(text(),'First item')
  ]
][
  following::*[
    preceding::*[
      @class='header'
    ][1][
      contains(text(),'First item')
    ]
  ]
]
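If the XPath is being driven from a host language anyway, a procedural walk over the container's children can sidestep the last-item problem. The following is only a sketch of that alternative, not the single-expression XPath solution discussed in the thread: it assumes the page parses as XML, uses Python's standard-library ElementTree, and the function name sections_by_header is invented for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<root>
<div class="textfield">
<div class="header">First item</div>
Here is the text of the <strong>first</strong> item.
<div class="header">Second item</div>
<span>Here is the text of the second item.</span>
<div class="header">Third item</div>
Here is the text of the third item.
</div>
<div class="textfield">
Footer text
</div>
</root>"""


def sections_by_header(container):
    """Map each header's text to the text that follows it, up to the next header."""
    sections = {}
    current = None
    for child in container:
        if child.get("class") == "header":
            # A new section starts; the text right after the header is its tail.
            current = "".join(child.itertext()).strip()
            sections[current] = child.tail or ""
        elif current is not None:
            # Inline elements such as <strong> or <span>: keep their text and tail.
            sections[current] += "".join(child.itertext()) + (child.tail or "")
    # Collapse runs of whitespace, similar to normalize-space().
    return {name: " ".join(text.split()) for name, text in sections.items()}


root = ET.fromstring(PAGE)
first_textfield = root.find("div[@class='textfield']")
sections = sections_by_header(first_textfield)
print(sections)
```

Because the walk stays inside the first textfield div, the last item ends where that div ends, and the footer text in the second div is never reached.
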

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes while excluding the content of some subelements of those nodes. Put simply, I want to search for a string in the paragraphs of a text, excluding the footnotes, which are child elements of the paragraphs.
For example, given this document:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0-only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (derived from yours, with a p added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
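The idea shared by these answers, searching a paragraph's text while skipping everything under footnote, can also be sketched outside XPath. The example below is an illustration under assumptions: it uses the asker's original two-paragraph document, Python's standard-library ElementTree (which has no ancestor axis), and a hypothetical helper named text_outside.

```python
import xml.etree.ElementTree as ET

DOC = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>"""


def text_outside(elem, skip_tag):
    """Concatenate elem's text, leaving out any subtree rooted at skip_tag."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != skip_tag:
            parts.append(text_outside(child, skip_tag))
        # The tail follows the child, so it belongs to elem even when
        # the child itself is skipped.
        parts.append(child.tail or "")
    return "".join(parts)


root = ET.fromstring(DOC)

# Plain contains()-style search: footnote text is included, both match.
naive = [p.get("n") for p in root.findall("p")
         if "text" in "".join(p.itertext())]

# Footnote-aware search: only the first paragraph matches.
filtered = [p.get("n") for p in root.findall("p")
            if "text" in text_outside(p, "footnote")]

print(naive, filtered)
```
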

XPath to find text node that is a sibling of other nodes

Given the following fragment of html:
<fieldset>
<legend>My Legend</legend>
<p>Some text</p>
Text to capture
</fieldset>
Is there an xpath expression that will return only the 'Text to capture' text node?
Trying /fieldset/text() yields three nodes, not just the one I need.
Assuming what you want is the text node containing non-whitespace text:
//fieldset/text()[normalize-space(.)]
If what you want is the last text node, then:
//fieldset/text()[last()]
I recommend you accept Steven D. Majewski's answer, but here is the explanation (text nodes highlighted with square brackets):
<fieldset>[
]<legend>My Legend</legend>[
]<p>Some text</p>[
Text to capture
]</fieldset>
so /fieldset/text() returns
"\n "
"\n "
"\n Text to capture\n"
And this is why you want /fieldset/text()[normalize-space()], and you want the result trimmed before use.
Also note that the above is short for /fieldset/text()[normalize-space(.) != '']. When normalize-space() returns a non-empty string, the predicate evaluates to true, while the empty string evaluates to false.
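The filtering and trimming described above can be mimicked in a few lines of Python. This is a sketch only: the two-space indentation of the fragment is an assumption, ElementTree's .text and .tail stand in for the text() nodes, and normalize_space is a hand-written approximation of the XPath function of the same name.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<fieldset>
  <legend>My Legend</legend>
  <p>Some text</p>
  Text to capture
</fieldset>"""


def normalize_space(value):
    """Approximate XPath normalize-space(): trim and collapse whitespace."""
    return " ".join(value.split())


fieldset = ET.fromstring(FRAGMENT)

# The three child text nodes that /fieldset/text() would return.
text_nodes = [fieldset.text or ""] + [child.tail or "" for child in fieldset]

# Keep only nodes whose normalized value is non-empty; an empty string is
# falsy in Python just as it is false in an XPath predicate.
captured = [normalize_space(t) for t in text_nodes if normalize_space(t)]

print(len(text_nodes), captured)
```
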

Xquery to extract text in html

I am working on extracting text out of html documents and storing it in a database. I am using the webharvest tool to extract the content, but I am stuck at one point. Inside webharvest I use an XQuery expression in order to extract the data. The html document that I am parsing is as follows:
<td><a name="hw">HELLOWORLD</a>Hello world</td>
I need to extract "Hello world" text from the above html script.
I have tried extracting the text in this fashion:
$hw := data($item//a[@name='hw']/text())
However what I always get is "HELLOWORLD" instead of "Hello world".
Is there a way to extract "Hello world"? Please help.
What if I want to do it this way:
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
I would like to extract the text Hello world 2 that is in between hw2 and hw3. I would not want to use text()[3], but is there some way I could extract the text between /a[@name='hw2'] and /a[@name='hw3']?
Your xpath is selecting the text of the a nodes, not the text of the td nodes:
$item//a[@name='hw']/text()
Change it to this:
$item[a/@name='hw']/text()
Update (following comments and update to question):
This xpath selects the second text node of each $item that has an a tag with a name attribute set to hw:
$item[a/@name='hw']//text()[2]
I would not want to use text()[3] but
is there some way I could extract the
text out between /a[@name='hw2'] and
/a[@name='hw3'].
If there is just one text node between the two <a> elements, then the following would be quite simple:
/a[@name='hw3']/preceding::text()[1]
If there is more than one text node between the two elements, then you need to express the intersection of all text nodes following the first element with all text nodes preceding the second element. The formula for the intersection of two node-sets (aka the Kaysian method of intersection) is:
$ns1[count(.|$ns2) = count($ns2)]
So, just replace in the above expression $ns1 with:
/a[@name='hw2']/following-sibling::text()
and $ns2 with:
/a[@name='hw3']/preceding-sibling::text()
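To make the Kaysian formula less abstract, here is a small Python model of it. It is a sketch under assumptions: ElementTree has no text-node objects, so each child node of td is represented by its position in a flattened list, and the two node-sets are sets of those positions. The names nodes, anchor_position, ns1 and ns2 are illustrative.

```python
import xml.etree.ElementTree as ET

TD = """<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>"""

td = ET.fromstring(TD)

# Flatten td's children (elements and text) into document order.
nodes = []
if td.text:
    nodes.append(td.text)
for child in td:
    nodes.append(child)
    if child.tail:
        nodes.append(child.tail)


def anchor_position(name):
    """Index of the <a> element whose name attribute equals `name`."""
    return next(i for i, n in enumerate(nodes)
                if not isinstance(n, str) and n.get("name") == name)


start, end = anchor_position("hw2"), anchor_position("hw3")

# ns1: text nodes following a[@name='hw2']; ns2: text nodes preceding a[@name='hw3'].
ns1 = {i for i, n in enumerate(nodes) if isinstance(n, str) and i > start}
ns2 = {i for i, n in enumerate(nodes) if isinstance(n, str) and i < end}

# Kaysian test: a node is in both sets when adding it to ns2 does not grow ns2.
kaysian = sorted(i for i in ns1 if len({i} | ns2) == len(ns2))
between = [nodes[i].strip() for i in kaysian]

print(between)
```

In Python a plain set intersection (ns1 & ns2) would do the same job; the count-based test is spelled out only because XPath 1.0 lacks an intersect operator.
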
Lastly, if you really have XQuery (or XPath 2), then this is simply:
/a[@name='hw2']/following-sibling::text()
intersect
/a[@name='hw3']/preceding-sibling::text()
This handles your expanded case, while letting you select by attribute value rather than position:
let $item :=
<td>
<a name="hw1">HELLOWORLD1</a>Hello world1
<a name="hw2">HELLOWORLD2</a>Hello world2
<a name="hw3">HELLOWORLD3</a>Hello world3
</td>
return $item//node()[./preceding-sibling::a/@name = "hw2"][1]
This gets the first node that has a preceding-sibling "a" element with a name attribute of "hw2".
