How to get node text without children? - ruby

I use Nokogiri to parse an HTML page with content like this:
<p class="parent">
Useful text
<br>
<span class="child">Useless text</span>
</p>
When I call the method page.css('p.parent').text Nokogiri returns 'Useful text Useless text'. But I need only 'Useful text'.
How to get node text without children?

XPath includes the text() node test for selecting text nodes, so you could do:
page.xpath('//p[@class="parent"]/text()')
Using XPath to select HTML classes can become quite tricky if the element in question could belong to more than one class, so this might not be ideal.
Fortunately Nokogiri adds the text() selector to CSS, so you can use:
page.css('p.parent > text()')
to get the text nodes that are direct children of p.parent. This will also return some nodes that are whitespace only, so you may have to filter them out.
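The whitespace filtering mentioned above can be sketched as follows. This is a minimal illustration, not the answerer's code: it uses Ruby's bundled REXML library instead of Nokogiri so that it runs without extra gems, and it assumes a well-formed variant of the question's markup (<br/> rather than <br>). The variable names are made up for the example. With Nokogiri, the same XPath string would be passed to page.xpath.

```ruby
require 'rexml/document'

# Well-formed variant of the question's markup (assumption: <br/> instead of <br>).
xml = <<~XML.strip
  <p class="parent">
  Useful text
  <br/>
  <span class="child">Useless text</span>
  </p>
XML

doc = REXML::Document.new(xml)

# Select only the text nodes that are direct children of p.parent.
text_nodes = REXML::XPath.match(doc, "//p[@class='parent']/text()")

# Trim each node and drop the ones that contained nothing but whitespace.
own_text = text_nodes.map { |node| node.value.strip }.reject(&:empty?)

puts own_text.join(' ')
```

The span's "Useless text" is never selected because it is a child of the span, not of the p element.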

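You should be able to remove the child elements with page.css('p.parent > *').remove. (Calling .children.remove instead would also delete the text nodes, because text nodes are children too.)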
Then your page.css('p.parent').text will return the text without the children nodes.
Note: the page will be modified by the remove

Related

When adding text() to my XPath, the number of results are duplicated. Why?

The following Xpath executed in Chrome's web inspector returns the expected number, 13, of nodes
//*[@id="day1"]//span[contains(@class, 'day-time-clock')]
However, when I add text() to it:
//*[@id="day1"]//span[contains(@class, 'day-time-clock')]/text()
it returns 26 nodes. However, only every other hit actually points somewhere in the source code, the others are just "numb".
The end node looks like this
<span class="medium bold day-time-clock">
09:00
<div class="tooltip-box first-free-tip ">
<div class="tooltip-box-inner">
<span class="fa fa-clock-o"></span>
Some text
</div>
</div>
</span>
The code sample above doesn't show exactly how it looks in the web inspector, there are a couple of empty rows in the text of this node. Here is a small screenshot of how it really looks.
Why is this happening? And what can I do about it?
Your span elements have multiple text node children. Some of the text node children contain only whitespace. In your example, the outer span element has one text node child containing "....09:00...." where "...." represents whitespace, plus one text node child immediately following the child div element. (Incidentally, my HTML is rusty, but I didn't think that having a div inside a span was allowed.)
Your second (inner) span element contains no text nodes, so /text() on this should select nothing.
Generally, using /text() in XPath is a bad idea unless you have some very good reason and know exactly what you are doing.
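To make the doubling concrete, here is a small sketch, assuming a simplified, well-formed version of the span from the question and using Ruby's bundled REXML library (the question itself ran the XPath in Chrome, so this only illustrates the node structure). It counts the text node children of one span and then keeps only the ones with visible content.

```ruby
require 'rexml/document'

# Simplified version of the span from the question (assumption: tooltip markup reduced).
xml = <<~XML.strip
  <span class="medium bold day-time-clock">
    09:00
    <div class="tooltip-box">Some text</div>
  </span>
XML

doc = REXML::Document.new(xml)

# One span, but two text node children: the one holding the time and the
# whitespace-only one that follows the div.
nodes = REXML::XPath.match(doc, "//span[contains(@class, 'day-time-clock')]/text()")
puts nodes.size

# Keep only the text nodes that contain something besides whitespace.
times = nodes.map { |node| node.value.strip }.reject(&:empty?)
puts times.inspect
```

With 13 such spans, two text nodes each would account for the 26 results reported in the question.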

Trouble accessing a text with XPath query

I have this html snippet
<div id="overview">
<strong>some text</strong>
<br/>
some other text
<strong>more text</strong>
TEXT I NEED IS HERE
<div id="sub">...</div>
</div>
How can I get the text I am looking for (shown in caps)?
I tried this, but I get an error message saying the element cannot be located:
"//div[@id='overview']/strong[position()=2]/following-sibling"
I also tried this, which returns the div with id=sub but not the text (correctly so):
"//div[@id='overview']/*[preceding-sibling::strong[position()=2]]"
Is there any way to get the text, other than doing some string matching or regex on the contents of the overview div?
Thanks.
following-sibling is the axis, you still need to specify the actual node (in your example the XPath processor is searching for an element named following-sibling). You separate the axis from the node with ::.
Try this:
//div[@id='overview']/strong[position()=2]/following-sibling::text()[1]
This specifies the first text node after the second strong in the div.
If you always want the text immediately preceding the <div id="sub"> then you could try
//div[@id='sub']/preceding-sibling::text()[1]
That would give you everything between the </strong> and the opening <div ..., i.e. the upper case text plus its leading and trailing new lines and whitespace.

Selenium: Extracting only Text with out any sub elements from <p>

Below is the sample code
<p>
I want this Text
<sup> not this </sup>
.(Need this too).
<sup> and not this </sup>
</p>
Using Selenium RC, selenium.getText("//...") returns all the text, including the text inside the <sup> elements.
Is there any way to get the text from <p> without the <sup> contents?
Please let me know. Thanks
Your only option is to get the text of the three elements and strip out the parts you don't want. That, or resort to using getEval() to run some JavaScript that gets the <P> element's innerHTML property, then remove the parts inside the <SUP> elements yourself.

Xpath getting node without node child contents

Hey guys, I couldn't get around this. I have HTML structured as follows:
<div class="review-text">
<div id="reviewerprofile">
<div id="revimg"></div>
<div id="reviewr">marc</div>
<div id="revdate">2011-07-06</div>
</div>
this is an awesome review
</div>
What I am trying to get is just the text "this is an awesome review", but every time I query the node I also get the content of the child elements. I am using something like this now: ".//div[@class='review-text']". How do I get just that text? Thank you very much.
You're almost there! Just add /text() at the end of your XPath to get the text node.
An XPath expression such as //div returns a set of nodes, in this case div elements. These are in effect pointers to the original nodes in the original tree; the nodes are still connected to their parents, children, ancestors, and siblings. If you see the children of the div element and don't want them, that's not the fault of the XPath processor, it's the fault of whatever software is processing the results returned by the XPath expression.
You can get the text that's an immediate child of the div element by using /text() as suggested. However, that assumes that you know exactly what you are expecting to find in the HTML page - if "awesome" were in italics, it would give you something different.
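The caveat about italics can be illustrated with a short sketch. It assumes a trimmed, well-formed copy of the question's markup in which "awesome" has been wrapped in a hypothetical <i> element, and it uses Ruby's bundled REXML library purely for demonstration. The /text() step then returns the pieces around the <i> element, not the whole sentence.

```ruby
require 'rexml/document'

# Trimmed copy of the review markup; the <i> element is an assumption added
# to show what happens when part of the sentence sits in a child element.
xml = <<~XML.strip
  <div class="review-text">
    <div id="reviewerprofile">
      <div id="reviewr">marc</div>
    </div>
    this is an <i>awesome</i> review
  </div>
XML

doc = REXML::Document.new(xml)

# Text nodes that are immediate children of the review div.
text_nodes = REXML::XPath.match(doc, "//div[@class='review-text']/text()")

# Trim and drop whitespace-only nodes; "awesome" is missing because it
# belongs to the <i> element, not to the div itself.
pieces = text_nodes.map { |node| node.value.strip }.reject(&:empty?)
puts pieces.inspect
```

If the emphasised word matters, the pieces have to be combined with the text of the inline children, which is why /text() only works when the page structure is known in advance.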

Selecting specific using x-path while disregarding certain nodes

I have some html that looks pretty much like this.
<p>
<a img src="img src">
<strong>foo</strong>
<strong>bar</strong>
<strong>baz</strong>
<strong>eek</strong>
This is the text I want to select using xpath.
</p>
How can I select only this particular text node as indicated above using xpath?
How do I get at only this particular
text element in question using xpath?
Use:
/p/text()[last()]
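The expression /p/text() selects the text nodes that are direct children of the p element in the markup posted in the question; the [last()] predicate keeps only the final one.
/p/text()[normalize-space()]
This keeps only the text nodes that are non-empty after whitespace normalization, so whitespace-only nodes are dropped. It does not trim the text of the node it selects. For the markup in the question, it selects exactly the node you want.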
There is very good tutorial at http://www.w3schools.com/xpath/
