How to get parent's text using Nokogiri - ruby

I have a situation where I want to get the text of a parent <p> tag. For example:
<p>
<a name="TOPIC"></a>
<b><font color="#800000" size="4" face="Arial">Exapmles</font></b>
</p>
This is working fine for this example:
test = Nokogiri::HTML(row['test'])
raw_attributes = test.root.css("p a").inject({}) do |accumulator, element|
accumulator[element.attr("name")] = (element.parent.text).strip
accumulator
end
But it's not working for the following example:
<p>
<font>
<a name="TOPIC"></a>
<b><font color="#800000" size="4" face="Arial">Exapmles</font></b>
</font>
</p>
How can I do this using Nokogiri? I want a solution that works for both of the cases above.

puts doc.at_xpath("//p[.//a[@name='TOPIC']]").inner_text.strip
#=> "Exapmles"
Decoded, this says:
//p — find a p element anywhere in the document
[…] — that matches this condition
.//a — it has an a element as a descendant
[@name='TOPIC'] — with a name attribute whose value is TOPIC.

Related

XPath: how to find a node that does not contain text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div that contains no text, but the XPath
//div[@class='grid' and text()='']
doesn't seem to work. And if I don't know the text that the other divs have, how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with @class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[@class='grid' and not(text()[normalize-space(.)])]
You need a not() statement + normalize-space():
//div[@class='grid' and not(normalize-space(text()))]
or
//div[@class='grid' and normalize-space(text())='']

Extract text and ignore next node

From this:
<span class="postbody">
<span style="color: #8e2fb6">
<span style="font-weight: bold">nickname</span>
</span>
<br>
Example text
<br>
Example text
<br>
<p class="signature">THIS IS WHAT I DO NOT WANT</p>
</span>
I want to extract:
<br>
Example text
<br>
Example text
<br>
I tried span/text()[1], but it doesn't seem to work: I always get the unwanted <p class="signature"> as well. Is this even possible?
First, load your HTML string into an HtmlDocument or HtmlNode (using the .load() function).
The ChildNodes collection contains every child of the current node (that is, every node directly under span.postbody).
After that, just grab the #text and br nodes. Keep in mind that some of the #text nodes will contain only whitespace; you may want to filter those out of the result.
//load html to HtmlNode
node.ChildNodes.Where(n => n.Name.Equals("#text") || n.Name.Equals("br")) //It will return collection of HtmlNode
You can use the jQuery selector for postbody, then the .text method, which should ignore the HTML. This will also ignore the <br> tags.
$('.postbody').text();
An alternative would be to iterate through the children of $('.postbody').
'//text()[preceding-sibling::br and normalize-space()]'

Access two elements simultaneously in Nokogiri

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
  @parsed = Nokogiri::HTML(f, nil, 'windows-1251')
  puts @parsed.xpath('//span[@id="f5"]//div[@id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either a <div> or a <span> with id="f5". With my current XPath selector that's not possible. If I remove //span[@id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts @parsed.xpath('//div[@id="f5"]').inner_text
puts @parsed.xpath('//span[@id="f5"]').inner_text
but then the order would be a complete mess. The parsed span has to stay right next to its div, as in the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.
If I understand this correctly, you can use the following XPath:
//*[self::div or self::span][@id="f5"]
This XPath finds every element named either div or span whose id attribute equals "f5".
Output:
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

Accessing a <p> in Watir that doesn't have any attributes

I have this html code.
<div class="main" data-reactid=".0.2.1.1">
<div contenteditable="true" data-reactid=".0.2.1.1.0" autocomplete="off">
<p>
<br>
</p>
</div>
</div>
I have to write in the <p> tag. For this I wrote:
paragraph(:article_title) {div_element(:class=>'main').div(:index=>1).paragraph(:index=>1)}
but it gives an error. I don't understand what is wrong with it.
There are a couple of problems:
Watir uses a 0-based index. As a result, div(:index=>1) actually means to find the 2nd div tag. As this does not exist, you will get an unable to locate element error.
div and paragraph are not methods defined in the page-object gem. You will get deprecation errors when you try to use them. It should be div_element and paragraph_element respectively.
Try doing:
paragraph(:article_title) {div_element(:class=>'main').div_element(:index=>0).paragraph_element(:index=>0)}
More simply, since :index => 0 is implied:
paragraph(:article_title) {div_element(:class=>'main').div_element.paragraph_element}
As there is only one paragraph element, you could further simplify it to:
paragraph(:article_title) {div_element(:class=>'main').paragraph_element}

How do I write an xpath expression for an element following another with certain text?

How do I construct an XPath expression that selects the <p> element following a <p> element whose child <strong> element has the text "About this event:"?
In other words, given the markup below, what path expression will give me the <p> element with the text "Hello" that follows the <p> containing the <strong> text?
<p>
<strong>About this event:</strong>
…
</p>
<p>Hello</p>
//p[strong[.='About this event:']]/following-sibling::p
or
//p[strong[.='About this event:']]/following-sibling::p[.='Hello']
Try this:
//p[normalize-space(strong)='About this event:']/following-sibling::p
You can also narrow this down to the first p by adding [1]:
//p[normalize-space(strong)='About this event:']/following-sibling::p[1]
I believe this should be do-able. Here's the output from Perl's xpath tool:
so zacharyyoung$ xpath so.xml '//p[strong = "About this event:"]/following-sibling::p[1]'
Found 1 nodes:
-- NODE --
<p>Hello</p>
The XPath is //p[strong = "About this event:"]/following-sibling::p[1]
