select parent node containing text inside children's node - ruby

Basically, I want to select a node (div) whose child nodes (h1, b, h3) contain the specified text.
<html>
<div id="contents">
<p>
<h1> Child text 1</h1>
<b> Child text 2 </b>
...
</p>
<h3> Child text 3 </h3>
</div>
I am expecting /html/div/, not /html/div/h1.
I have the expression below, but unfortunately it returns the children instead of the XPath to the div.
expression = "//div[contains(text(), 'Child text 1')]"
doc.xpath(expression)
So is there a way to do this simply with xpath syntax?

The following expression selects a node (div) in which any descendant element (not just h1, b, h3) contains the specified text; text directly inside the div itself is not considered:
doc.xpath('//div[.//*[contains(text(), "Child text 1")]]')
You can refine that to return only the div with the id contents, as in your example:
doc.xpath('//div[@id="contents" and .//*[contains(text(), "Child text 1")]]')
It does not match if the text is a text node of the div itself (directly inside the div), which is my interpretation of the question.

You could append "/.." to step back up from the matching child to its parent. Not sure if there's a more robust method.
expression = "//*[contains(text(), 'Child text 1')]/.."

Related

XPATH - grab content of div after named element

There are a number of labels, I want to specify them in xpath and then grab the text after them, example:
<div class="info-row">
<div class="info-label"><span>Variant:</span></div>
<div class="info-content">
<p>750 ml</p>
</div>
</div>
So in this case, I want to say "after the span named 'Variant', grab the p tag":
Result: 750 ml
I tried:
//span[text()='Variant:']/following-sibling::p
and variations of this but to no avail.
following-sibling is an axis that selects all siblings after the context node.
The span with the text 'Variant:' has no siblings, so the correct approach is to search the siblings of the span's parent div.
Here is an example that will work:
//span[text()='Variant:']/ancestor::div[@class="info-label"]/following-sibling::div/p

How to find direct children which contain nodes with specified text with xpath?

I need to extract all children which have nodes with some text. Html structure might be the following:
<div>
<div>
A
</div>
<p>
<b>A</b>
</p>
<span>
B
</span>
</div>
I need to extract child nodes which have "A" text. It should return div and p nodes
I tried the following xpaths:
./*/*[contains(text(), 'A')]
./*/*[./*[contains(text(), 'A')]]
but the first one returns only the div with the "A" text and the second one returns only the p with the "A" text.
Is it possible to construct an XPath that returns both children?
The node containing the "A" text might be at any level inside the child node.
If you need an XPath that returns both child nodes, try
./*/*[contains(., "A")]
I suspect contains() is wrong here, unless you really want to select a node whose value is "HAT" as well as one whose value is "A".
Try
*/*[normalize-space(.)='A']

Selenium Webdriver / XPath: How to find element by attribute and the text of its children

Given html:
<div class="class1">
<div class="class2">1 2 3</div>
<div class="class3">a b c</div>
</div>
As I have several div elements in my HTML that use the class "class1" and none has an id, I want to find/fetch this parent element by the text of its children.
I tried different variants like
By.xpath("//div[contains(@class, 'class1') "
+ "and text()[contains(.,'1 2 3')] "
+ "and text()[contains(.,'a b c')]]"));
but nothing seems to work yet.
In the example above, I guess the text of the class1 element is checked, but not that of its children.
Can anybody help?
So you're looking for a div with class class1 that has children with texts 1 2 3 and a b c. From your example of what you've tried, I'm assuming there are no further conditions (eg class) on the children:
//div[@class='class1' and div/text()='1 2 3' and div/text()='a b c']
You can make those children node names into * if you don't care whether they are divs or not. You can make the children node names prefixed by descendant:: if you don't require them to be direct children.
Try either of the XPath expressions below.
Using the class attribute of the <div> tags:
//div[@class='class2']/..//div[@class='class3']/..//parent::div[@class='class1']
Explanation: first locate both child elements by the class attribute of the <div> tag, then move up to the parent <div> with the parent axis and its class attribute.
OR
Using text() on the <div> tags:
//div[text()= '1 2 3']/..//div[text()= 'a b c']/..//parent::div[@class='class1']
Explanation: first locate both child elements by the text of the <div> tag, then move up to the parent <div> with the parent axis and its class attribute.
Both expressions locate your parent element <div class="class1">.

How can I select nodes that don't contain links but which do contain specific text using xpath

Given the following HTML:
$content =
'<html>
<body>
<div>
<p>During the interim there shall be nourishment supplied</p>
</div>
<div>
<p>During the interim there shall be interim nourishment supplied</p>
</div>
<div>
<ul><li>During the interim there shall be nourishment supplied</li></ul>
</div>
</body>
</html>';
I want all the nodes containing the word "interim" but not if the word "interim" is part of a link element.
The nodes I would expect back are the first P node and the LI node only.
I've tried the following:
'//*/text()[not(a) and contains(.,"interim")]'
... but this still returns the A and also returns part of its parent P node (the part after the A), neither of which is desired. You can see my attempt here: https://glot.io/snippets/ehp7hmmglm
If you use the XPath expression //*[not(self::a) and not(a) and text()[contains(.,"interim")]] then you get all elements that do not contain an a element, are not a elements and contain a text node child containing that word.

Nokogiri grab text with formatting and link tags, <em>,<strong>, <a>, etc

How can I recursively capture all the text with formatting tags using Nokogiri?
<div id="1">
This is text in the TD with <strong> strong </strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
For example, I would like to capture:
"This is text in the TD with <strong> strong </strong> tags"
"This is a child node. with <b> bold </b> tags"
"another line of text to a link "
"This is text inside a div <em>inside<em> another div inside a paragraph tag"
I can't just use .text() because it strips the formatting tags and I'm not sure how to do it recursively.
ADDED DETAIL: Sanitize looks like an interesting gem; I'm reading about it now. However, here is some added info that might clarify what I need to do.
I need to traverse each node, get the text, process it, and put it back. So I would grab the first piece of text, "This is text in the TD with strong tags", and modify it to something like "This is the modified text in the TD with strong tags". Then go to the next tag in div 1 and get its text, "This is a child node. with bold tags", modify it to "This is a modified child node. with bold tags.", and put it back. Then go to div#2 and grab the text, "another line of text to a link ", modify it to "another line of modified text to a link ", and put it back. Finally, go to the next node in div#2, grab the text from the paragraph tag, and modify it to "This is modified text inside a div inside another div inside a paragraph tag".
so after everything is processed the new html should be look like this...
<div id="1">
This is modified text in the TD with <strong> strong </strong> tags
<p>This is a modified child node. with <b> bold </b> tags</p>
<div id=2>
"another line of modified text to a link "
<p> This is modified text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
Below is my quasi-code. I'm really stuck on two parts. First, grabbing just the text with its formatting tags: Sanitize helps with that, but it grabs all tags, and I need to preserve the formatting of the text only, including spaces, without grabbing the unrelated child tags. Second, traversing down only the children that are directly related to the full text.
# Quasi-code
doc = Nokogiri.HTML(html)
kids = doc.at('div#1')
text_kids = kids.descendant_elements
text_kids.each do |i|
  # grab the full text (full sentences and paragraphs) with formatting tags
  # currently, I have no way to grab just the text with formatting and not the other tags
  modified_text = processing_code(i.full_text_w_formating())
  i.full_text_w_formating = modified_text
end
def processing_code(string)
  # code to process string (not relevant for this example)
  return modified_string
end
# Recursive 1
class Nokogiri::XML::Node
  def descendant_elements
    # This is flawed because it grabs every child and even
    # splits it based on any tag.
    # I need to traverse down only the text-related children.
    element_children.map{ |kid|
      [kid, kid.descendant_elements]
    }.flatten
  end
end
I'd use two tactics, Nokogiri to extract the content you want, then a blacklist/whitelist program to strip tags you don't want or keep the ones you want.
require 'nokogiri'
require 'sanitize'
html = '
<div id="1">
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id=2>
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</p>
</div>
</div>
'
doc = Nokogiri.HTML(html)
html_fragment = doc.at('div#1').to_html
will capture the contents of <div id="1"> as an HTML string:
This is text in the TD with <strong> strong <strong> tags
<p>This is a child node. with <b> bold </b> tags</p>
<div id="2">
"another line of text to a link "
<p> This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em></p>
</div>
</strong></strong>
The trailing </strong></strong> is the result of two opening <strong> tags. That might be deliberate, but with no closing tags Nokogiri will do some fixup to make the HTML correct.
Passing html_fragment to the Sanitize gem:
doc = Sanitize.clean(
html_fragment,
:elements => %w[ a b em strong ],
:attributes => {
'a' => %w[ href ],
},
)
The returned text looks like:
This is text in the TD with <strong> strong <strong> tags
This is a child node. with <b> bold </b> tags
"another line of text to a link "
This is text inside a div <em>inside<em> another div inside a paragraph tag</em></em>
</strong></strong>
Again, because the HTML was malformed with no closing </strong> tags, the two trailing closing tags are present.
