Improve XPath-query to distinguish text-nodes correctly

I have used XPath extensively in the past. Currently I am facing a problem that I am unable to solve.
Constraints
pure XPath 1.0
no aux-functions (e.g. no "concat()")
HTML-Markup
<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>
Challenge
I want to extract the three coherent strings:
Peter: Lorem Impsum
Paul Smith: Foo Bar BAZ
Mary: One Two Three
XPath
The following XPath queries are the best I've come up with after HOURS of research:
XPath-query 1
//span[contains(@class, "container")]
=> Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
XPath-query 2
//span[contains(@class, "container")]//text()
=> Peter: Lorem Impsum Paul Smith: Foo Bar BAZ Mary: One Two Three
Problem
Although it is possible to post-process the resulting string using (PHP) string functions afterwards, I am not able to split it into the correct three chunks: I need an XPath-query which enables me to distinguish the text-nodes correctly.
Is it possible to integrate some "artificial separators" between the text-nodes?

You're expecting too much from XPath 1.0. On its own, XPath 1.0 can only help you here to select
a string, or
a set of text nodes
Then, you'll have to complete your processing outside of XPath (as Mads suggests in the comments).
To understand the limits you're hitting against, your first XPath,
//span[contains(@class, "container")]
selects a nodeset of span elements. The environment in which XPath 1.0 is operating is showing you (some variation of) the string value of the single such node in your document:
Peter: Lorem ImpsumPaul Smith: Foo Bar BAZMary: One Two Three
But be clear: Your XPath is selecting a nodeset of span elements, not strings here.
Your second XPath,
//span[contains(@class, "container")]//text()
selects a nodeset of text() nodes. The environment in which XPath 1.0 is operating is showing the string value of each selected text() node.
If you could use XPath 2.0, you could directly, within XPath, select a sequence of strings,
//span[contains(@class, "container")]/text()/string()
or you could join them,
string-join(//span[contains(@class, "container")]/text(), "|")
and directly get
Peter: Lorem Impsum
|
Paul Smith: Foo Bar BAZ
|
Mary: One Two Three
or
string-join(//span[contains(@class, "container")]/text()/normalize-space(), "|")
to get
Peter: Lorem Impsum|Paul Smith: Foo Bar BAZ|Mary: One Two Three
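If only XPath 1.0 is available, the "complete your processing outside of XPath" step could look roughly like the sketch below. It is written in Python with the standard-library ElementTree module purely for illustration (the question uses PHP), and the helper name child_text_nodes is hypothetical. It assumes the markup parses as well-formed XML, which the sample does.

```python
import xml.etree.ElementTree as ET

markup = """<span class="container">
Peter: Lorem Impsum
<i class="divider" role="img" aria-label="|"></i>
Paul Smith: Foo Bar BAZ
<i class="divider" role="img" aria-label="|"></i>
Mary: One Two Three
</span>"""


def child_text_nodes(element):
    """Return the non-blank direct text nodes of element, trimmed (hypothetical helper)."""
    # ElementTree keeps the text before the first child in .text and the
    # text after each child in that child's .tail.
    pieces = [element.text] + [child.tail for child in element]
    return [p.strip() for p in pieces if p and p.strip()]


span = ET.fromstring(markup)      # the root element is the span itself
chunks = child_text_nodes(span)   # one entry per text node
joined = "|".join(chunks)         # same idea as string-join(..., "|")

print(chunks)
print(joined)
```

Keeping the text nodes as a list avoids having to split a concatenated string afterwards; the "|" join is only needed if a single string is required.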

Related

xpath handle double quotes with some other tags

I have this html sample
<html>
<body>
....
<p id="book-1" class="abc">
<b>
book-1
section
</b>
"I have a lot of "
<i>different</i>
"text, and I want "
<i>all</i>
" text and we may or may not have italic surrounded text."
</p>
....
the xpath I currently have is this:
@"/html[1]/body[1]/p[1]/text()"
this gives this result:
I have a lot of
but I want this result:
I have a lot of different text, and I want all text and we may or may not have italic surrounded text.
Thanks for your help.

In XPath 2 and higher I think you could use string-join(/html[1]/body[1]/p[1]/b/following-sibling::node(), ''). It is not quite clear which nodes you want, but that would select all sibling nodes following the b child of the p and then concatenate their string values into one string.
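For readers without XPath 2, the same "concatenate everything after the b child" idea can be sketched in a host language. The example below uses Python's standard-library ElementTree as an assumption (the question appears to use a different environment), a simplified version of the sample markup without the displayed quote characters, and a hypothetical helper named text_after.

```python
import xml.etree.ElementTree as ET

# Simplified, well-formed version of the sample paragraph (an assumption).
markup = (
    '<p id="book-1" class="abc"><b>book-1 section</b>'
    "I have a lot of <i>different</i> text, and I want <i>all</i>"
    " text and we may or may not have italic surrounded text.</p>"
)


def text_after(parent, tag):
    """Concatenate the string values of all nodes following the first `tag` child."""
    children = list(parent)
    start = next(i for i, child in enumerate(children) if child.tag == tag)
    parts = [children[start].tail or ""]           # text node right after <b>
    for sibling in children[start + 1:]:
        parts.append("".join(sibling.itertext()))  # string value of the element
        parts.append(sibling.tail or "")           # text node that follows it
    return "".join(parts)


p = ET.fromstring(markup)
result = text_after(p, "b")
print(result)
```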

Ruby Nokogiri text search not working with br tags and others

I'm using the Nokogiri gem in Ruby and running into some problems.
I want to scrape addresses from webpages and there is no set format to the way the addresses will be displayed.
I've got a list of postcodes and I want my Ruby script to return the node including the postcode so that I can find the rest of the address.
This is what I've got in Ruby, with some example HTML content:
require 'nokogiri'
require 'open-uri'
content1 = '
<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
doc = Nokogiri::HTML(content1)
result = doc.search "[text()*='N21 4DD']"
puts result.inspect
This returns []
I understand the example above is a strange way for an address to appear in HTML but it's the simplest way I can show the problems I've had. Here's another content variable that returns nothing:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street<br>
North Town<br>
North County<br>
N21 4DD
</div>
</div>'
I know that Nokogiri might have trouble with the above because the <br> tags should be </br> but this is quite common on websites.
THIS EXAMPLE WORKS:
content1 = '
<div>
<div>Our Address:</div>
<div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>'
Can someone explain why the node is not being found from the first two content examples above and how I can fix this?
I'm not looking for a custom solution that will find the postcode in the sample content examples above – these are just for demonstration purposes. The postcode (and address) could be anywhere in the html – body, p, div, td, span, li etc.
Thanks.

With Xpath:
doc.xpath('.//div[contains(.,"N21 4DD")]')
This still returns two nodes because there is a nested div. I'm not sure that there is a way to get the middle div without the 'Our Address' div because it is in the same node.

Let's look at the first one and how Nokogiri translates your "css" (that's not valid css btw):
Nokogiri::CSS.xpath_for "[text()*='N21 4DD']"
#=> ["//*[contains(child::text(), 'N21 4DD')]"]
OK, so the problem here is that contains() converts child::text() to a string using only the first text node, which is the whitespace-only text before the "Our Address" div.
doc.search("//*[contains(child::text(), 'N21 4DD')]").length
#=> 0
No matches = not good.
Now let's try it jquery-style using the :contains pseudo:
Nokogiri::CSS.xpath_for ":contains('N21 4DD')"
#=> ["//*[contains(., 'N21 4DD')]"]
doc.search("//*[contains(., 'N21 4DD')]").length
#=> 4
This is actually correct, but maybe not what you expected.
Let's try it one more way:
doc.search("//*[text()[contains(., 'N21 4DD')]]").length
#=> 1
It sounds like this is what you're looking for. Just the div that has the string in a child text node.
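The difference between the three expressions can be reproduced outside Nokogiri. The following is a rough Python sketch using the standard-library ElementTree, which is an assumption for illustration only: it is an XML parser, so unlike Nokogiri::HTML it does not wrap the fragment in html and body elements. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

content = """<div>
<div>
<div>Our Address:</div>
1 North Street
North Town
North County
N21 4DD
</div>
</div>"""


def direct_text_nodes(element):
    """Direct text-node children of element, in document order."""
    nodes = [element.text] + [child.tail for child in element]
    return [n for n in nodes if n is not None]


def first_text_contains(element, needle):
    # Mimics contains(child::text(), needle): only the first text node counts.
    nodes = direct_text_nodes(element)
    return bool(nodes) and needle in nodes[0]


def any_text_contains(element, needle):
    # Mimics text()[contains(., needle)]: every direct text node is tested.
    return any(needle in n for n in direct_text_nodes(element))


def string_value_contains(element, needle):
    # Mimics contains(., needle): the text of all descendants is included.
    return needle in "".join(element.itertext())


elements = list(ET.fromstring(content).iter())
first_hits = [e for e in elements if first_text_contains(e, "N21 4DD")]
any_hits = [e for e in elements if any_text_contains(e, "N21 4DD")]
dot_hits = [e for e in elements if string_value_contains(e, "N21 4DD")]

print(len(first_hits), len(any_hits), len(dot_hits))
```

Here the string-value test matches the two enclosing div elements; with Nokogiri's HTML parser the added html and body elements match as well, which accounts for the count of 4 above.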

Extract all text in between two nodes using xpath for web scraping?

<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>
So I am trying to extract all text after the </h2> and before the <div align="right" style=...> node.
What I have tried so far:
jokes = response.xpath('//div[@class="jokeContent"]')
for joke in jokes:
    # keep only the direct text nodes that are not pure whitespace
    text = joke.xpath('text()[normalize-space()]').extract()
    if len(text) > 0:
        yield text
This works to some extent, but the website's HTML is inconsistent: sometimes the text is embedded in <p> TEXT </p>, sometimes in <br> TEXT </br>, and sometimes it is just TEXT.
So I thought that extracting everything after the header and before the styled div node might make sense, and then the filtering can be done afterwards.

If you are looking for a literal xpath of what you are describing, it could be something like:
In [1]: sel.xpath("//h2/following-sibling::*[not(self::div) and not(preceding-sibling::div)]//text()").extract()
Out[1]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
But there's probably a more logical, cleaner solution:
In [2]: sel.xpath("//h2/following-sibling::p//text()").extract()
Out[2]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
This is just selecting paragraph tags. You said the paragraph tags might be something else and you can match several different tags with self::tag specification:
In [3]: sel.xpath("//h2/following-sibling::*[self::p or self::br]//text()").extract()
Out[3]: [u'Can I be Frank with you? ', u'Submitted by Calamjo', u'Edited by Curtis']
Edit: apparently I missed the text directly under the div itself. This can be amended with the | (union) operator:
In [4]: sel.xpath("//h2/../text()[normalize-space(.)] | //h2/../p//text()").extract()
Out[4]:
[u'\n What did Ellen Degeneres say to Kathy Lee? \n ',
u'Can I be Frank with you? ',
u'Submitted by Calamjo',
u'Edited by Curtis']
normalize-space(.) is there only to get rid of text values that contain no text (e.g. ' \n').
You can append the first part of this xpath to any of the above and you'd get similar results.
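The original idea of "everything after the header and before the styled div" can also be sketched as a plain walk over the children. The example below is only an illustration using Python's standard-library ElementTree rather than Scrapy selectors; it assumes the fragment is well-formed XML, and text_between is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

markup = """<div class="jokeContent">
<h2 style="color:#369;">Can I be Frank</h2>
What did Ellen Degeneres say to Kathy Lee?
<p></p> <p>Can I be Frank with you? </p>
<p>Submitted by Calamjo</p>
<p>Edited by Curtis</p>
<div align="right" style="margin-top:10px;margin-bottom:10px;">#joke #short </div>
<div style="clear:both;"></div>
</div>"""


def text_between(parent, start_tag, stop_tag):
    """Non-blank text after the first start_tag child and before the next stop_tag child."""
    pieces = []
    started = False
    for child in parent:
        if not started:
            if child.tag == start_tag:
                started = True
                pieces.append(child.tail)   # bare text right after the header
            continue
        if child.tag == stop_tag:
            break                           # stop at the first following div
        pieces.extend(child.itertext())     # text inside p, br, or any other tag
        pieces.append(child.tail)           # bare text between elements
    return [p.strip() for p in pieces if p and p.strip()]


joke = ET.fromstring(markup)
lines = text_between(joke, "h2", "div")
print(lines)
```

Because the walk does not care which tag wraps the text, it covers the p, br and bare-text variants in one pass; filtering out the "Submitted by" and "Edited by" lines would still be a separate step.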

XPath / XQuery: find text in a node, but ignoring content of specific descendant elements

I am trying to find a way to search for a string within nodes while excluding the content of some subelements of those nodes. Put simply, I want to search for a string in the paragraphs of a text, excluding the footnotes, which are child elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)

"I want to search for a string in paragraphs of a text, excluding the footnotes which are children elements of the paragraphs"
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours, with a p element added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
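The same "ignore anything under a footnote" rule can be sketched procedurally. The following Python example uses the standard-library ElementTree as an assumption for illustration; since ElementTree has no ancestor axis, the hypothetical helper text_outside simply refuses to descend into footnote elements.

```python
import xml.etree.ElementTree as ET

xml = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>"""


def text_outside(element, skip_tag):
    """Yield the text nodes under element, skipping subtrees rooted at skip_tag."""
    if element.text:
        yield element.text
    for child in element:
        if child.tag != skip_tag:
            yield from text_outside(child, skip_tag)
        if child.tail:
            yield child.tail  # the tail is a child of the parent, so it is kept


document = ET.fromstring(xml)
matching = [
    p.get("n")
    for p in document.findall("p")  # top-level paragraphs only
    if any("text" in t for t in text_outside(p, "footnote"))
]
print(matching)
```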

//p[(.//text() except .//footnote//text())[contains(., 'text')]]

/document/p[text()[contains(., 'text')]] should do.

For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]

Xpath: how do you select the second text node (specific text node)

consider a html page
<html>
apple
orange
drugs
</html>
how can you select orange using xpath ?
/html/text()[2]
doesn't work.

You can't do it by selecting a node directly: the three words are all in a single text node, so text()[2] does not exist. You need to call an XPath string function on text() to cut out the string you want, something like this:
substring-after(/html/text(), " ")
here is a list of string functions
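If the splitting is done in the host language instead, the idea looks like the short sketch below. It uses Python's standard-library ElementTree, which is an assumption for illustration, and treats the page as XML so that the words stay directly under the html element.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<html>\napple\norange\ndrugs\n</html>")

# There are no child elements, so all three words share one text node.
child_count = len(list(root))
words = root.text.split()   # split that single text node on whitespace
second = words[1]

print(child_count, second)
```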

If the strings are separated with <br> it works
doc = Nokogiri::HTML("""<html>
apple
<br>
orange
<br>
drugs
</html>""")
p doc.xpath('//text()[2]') #=> orange
