Access two elements simultaneously in Nokogiri - ruby

I have some weirdly formatted HTML files which I have to parse.
This is my Ruby code:
File.open('2.html', 'r:utf-8') do |f|
#parsed = Nokogiri::HTML(f, nil, 'windows-1251')
puts #parsed.xpath('//span[#id="f5"]//div[#id="f5"]').inner_text
end
I want to parse a file containing:
<span style="position:absolute;top:156pt;left:24pt" id=f6>36.4.1.1. варенье, джемы, конфитюры, сиропы</span>
<div style="position:absolute;top:167.6pt;left:24.7pt;width:709.0;height:31.5;padding-top:23.8;font:0pt Arial;border-width:1.4; border-style:solid;border-color:#000000;"><table></table></div>
<span style="position:absolute;top:171pt;left:28pt" id=f5>003874</span>
<div style="position:absolute;top:171pt;left:99pt" id=f5>ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div style="position:absolute;top:180pt;left:99pt" id=f5>325гр. </div>
<div style="position:absolute;top:167.6pt;left:95.8pt;width:2.8;height:31.5;padding-top:23.8;font:0pt Arial;border-width:0 0 0 1.4; border-style:solid;border-color:#000000;"><table></table></div>
I need to select either <div> or <span> with id==5. With my current XPath selector it's not possible. If I remove //span[#id="f5"], for example, then the divs are selected correctly. I can output them one after another:
puts #parsed.xpath('//div[#id="f5"]').inner_text
puts #parsed.xpath('//span[#id="f5"]').inner_text
but then the order would be a complete mess. The parsed span have to be directly underneath the div from the original file.
Am I missing some basics? I haven't found anything on the web regarding parallel parsing of two elements. Most posts are concerned with parsing two classes of a div for example, but not two different elements at a time.

If I understand this correctly, you can use the following XPath :
//*[self::div or self::span][#id="f5"]
xpathtester demo
The XPath above will find element named either div or span that have id attribute value equals "f5"
output :
<span id="f5" style="position:absolute;top:171pt;left:28pt">003874</span>
<div id="f5" style="position:absolute;top:171pt;left:99pt">ВАРЕНЬЕ "ЭКОПРОДУКТ" ЧЕРНАЯ СМОРОДИНА</div>
<div id="f5" style="position:absolute;top:180pt;left:99pt">325гр.</div>

Related

xpath:how to find a node that not contains text?

I have a html like:
...
<div class="grid">
"abc"
<span class="searchMatch">def</span>
</div>
<div class="grid">
<span class="searchMatch">def</span>
</div>
...
I want to get the div which not contains text,but xpath
//div[#class='grid' and text()='']
seems doesn't work,and if I don't know the text that other divs have,how can I find the node?
Let's suppose I have inferred the requirement correctly as:
Find all <div> elements with #class='grid' that have no directly-contained non-whitespace text content, i.e. no non-whitespace text content unless it's within a child element like a <span>.
Then the answer to this is
//div[#class='grid' and not(text()[normalize-space(.)])]
You need a not() statement + normalize-space() :
//div[#class='grid' and not(normalize-space(text()))]
or
//div[#class='grid' and normalize-space(text())='']

Xpath: select div that contains class AND whose specific child element contains text

With the help of this SO question I have an almost working xpath:
//div[contains(#class, 'measure-tab') and contains(., 'someText')]
However this gets two divs: in one it's the child td that has someText, the other it's child span.
How do I narrow it down to the one with the span?
<div class="measure-tab">
<!-- table html omitted -->
<td> someText</td>
</div>
<div class="measure-tab"> <-- I want to select this div (and use contains #class)
<div>
<span> someText</span> <-- that contains a deeply nested span with this text
</div>
</div>
To find a div of a certain class that contains a span at any depth containing certain text, try:
//div[contains(#class, 'measure-tab') and contains(.//span, 'someText')]
That said, this solution looks extremely fragile. If the table happens to contain a span with the text you're looking for, the div containing the table will be matched, too. I'd suggest to find a more robust way of filtering the elements. For example by using IDs or top-level document structure.
You can use ancestor. I find that this is easier to read because the element you are actually selecting is at the end of the path.
//span[contains(text(),'someText')]/ancestor::div[contains(#class, 'measure-tab')]
You could use the xpath :
//div[#class="measure-tab" and .//span[contains(., "someText")]]
Input :
<root>
<div class="measure-tab">
<td> someText</td>
</div>
<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>
</root>
Output :
Element='<div class="measure-tab">
<div>
<div2>
<span>someText2</span>
</div2>
</div>
</div>'
You can change your second condition to check only the span element:
...and contains(div/span, 'someText')]
If the span isn't always inside another div you can also use
...and contains(.//span, 'someText')]
This searches for the span anywhere inside the div.

Page Object Gem. Making accessors for elements without ids

Lets say I have a simple page that has less IDs than I'd like for testing
<div class="__panel_body">
<div class="__panel_header">Real Estate Rating</div>
<div class="__panel_body">
<div class="__panel_header">Property Rating Info</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">General Risks</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
<div class="__panel_body">
<div class="__panel_header">Amenities</div>
<a class="icon.edit"></a>
<a class="icon.edit"></a>
</div>
</div>
I'm using Jeff Morgan's Page Object gem and I want to make accessors for the edit links in any given section.
The challenge is that the panel headers differentiate what body I want to choose. Then I need to access the parent and get all links with class "icon.edit". Assume I can't change the HTML to solve this.
Here's a start
module RealEstateRatingPageFields
div(:general_risks_section, ....)
def general_risks_edit_links
general_risks_section_element.links(class: "icon.edit")
end
end
How do I get the general_risks_section accessor to work, though?
I want that to represent the parent div to the panel header with text 'General Risks'...
There are a number of ways to get the general risk section.
Using a Block
The accessors can take a block where you can more programatically describe how to locate the element. This allows you to locate a distinguishing element and then traverse the DOM to the element you actually want. In this case, you can locate the header with the matching text and navigate to its parent.
div(:general_risks_section) { div_element(class: '__panel_header', text: 'General Risks').parent }
Using XPath
While harder to read and write, you could also use an XPath locator. The concept and thought process is the same as using the block. The only benefit is that it reduces the number of element calls, which slightly improves performance.
div(:general_risks_section, xpath: './/div[#class="__panel_body"][./div[#class="__panel_header" and text() = "General Risks"]]')
The XPath is saying:
.//div # Find a div element that
[#class="__panel_body"] # Has the class "__panel_body" and
[./div[ # Contains a div element that
#class="__panel_header" and # Has the class "__panel_header" and
text() = "General Risks" # Has the text "General Risks"
]]
Using the Body Text
Given the HTML, you could also just locate the section directly based on its text.
div(:general_risks_section, class: '__panel_body', text: 'General Risks')
Note that this assumes that the HTML given was not simplified. If there are actually other text nodes, this probably would not be the best option.

Extracting contents from a list split across different divs

Consider the following html
<div id="relevantID">
<div class="column left">
<h1> Section-Header-1 </h1>
<ul>
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
</ul>
</div>
<div class="column">
<ul> <!-- Pay attention here -->
<li>item1e</li>
<li>item1f</li>
</ul>
<h1> Section-Header-2 </h1>
<ul>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
</ul>
</div>
<div class="column right">
<h1> Section-Header-3 </h1>
<ul>
<li>item3a</li>
<li>item3b</li>
<li>item3c</li>
<li>item3d</li>
</ul>
</div>
</div>
My objective is to extract the items for each Section headers. However, inconveniently the designer of the webpage decided to break up the data into three columns, adding an additional div (with classes column right etc).
My current method of extraction was using the xpath
for section headers, I use the xpath (get all h1 elements withing a div with given id)
//div[#id="relevantID"]//h1
above returns a list of h1 elements, looping over each element I apply the additional selector, for each matched h1 element, look up the next ul node and retreive all its li nodes.
following-sibling::ul//li
But thanks to the designer's aesthetics, I am failing in the one particular case I've marked in the HTML file. Where the items are split across two different column divs.
I can probably bypass this problem by stripping out the column divs entirely, but I don't think modifying the html to make a selector match is considered good (I haven't seen it needed anywhere in the examples I've browsed so far).
What would be a good way to extract data that has been formatted like this? Full solutions are not neccessary, hints/tips will do. Thanks!
The columns do frustrate use of following-sibling:: and preceding-sibling::, but you could instead use the following:: and preceding:: axis if the columns at least keep the list items in proper document order. (That is indeed the case in your example.)
The following XPath will select all li items, regardless of column, occurring after the "Section-Header-1" h1 and before the "Section-Header-2" h1 header in document order:
//div[#id='relevantID']//li[normalize-space(preceding::h1) = 'Section-Header-1'
and normalize-space(following::h1) = 'Section-Header-2']
Specifically, it selects the following items from your example HTML:
<li>item1a</li>
<li>item1b</li>
<li>item1c</li>
<li>item1d</li>
<li>item1e</li>
<li>item1f</li>
You can combine following-sibling and preceding-sibling to get possible li elements in a div before the h2 and use the union operator |. As example for the second h2:
((//div[#id="relevantID"]//h1)[2]/preceding-sibling::ul//li) |
((//div[#id="relevantID"]//h1)[2]/following-sibling::ul//li)
Result:
<li>item1e</li>
<li>item1f</li>
<li>item2a</li>
<li>item2b</li>
<li>item2c</li>
<li>item2d</li>
As you're already selecting all h1 using //div[#id="relevantID"]//h1 and retrieving all li items for each h1 using as a second step following-sibling::ul//li, you could combine this to following-sibling::ul//li | preceding-sibling::ul//li.

Xpath: Find an element whose descendant has got a certain attribute value

I have the following elements in a web page. I would like to fetch the element with id "pmt-id-234" which has got a descendant with classname as type2.
<div id="cards">
<div id="pmt-id-123" class="payments">
<div>
<div class="type1">Text1</div>
<div>
</div>
<div id="pmt-id-234" class="payments">
<div>
<div class="type2">Text1</div>
<div>
</div>
</div>
Notes:
I don't know the highlighted part in "pmt-id-123", hence direct query with ID is not possible.
The div with class="typeX" can be nested multiple levels down.
What is tried? The below gives me two div elements.
'//*[#id="cards"]//*[starts-with(#id,"pmt-id-")]'
Now, how to fetch the div which has a descendant div with class="type2"
The following din't yield any results.
'//*[#id="cards"]//*[starts-with(#id,"pmt-id-")//*[contains(#class, "type2")]]'
'//*[#id="cards"]//*[starts-with(#id,"pmt-id-")][contains(#class, "type2")]'
Please let me know how to do this?
I'd test against div rather than * if there are only divs there.
This XPath will select the div under one with an id of cards that has an id that starts with pmt-id- and also has a descendant div of class type2:
'//div[#id="cards"]//div[starts-with(#id,"pmt-id-") and .//div[contains(#class, "type2"]]'
Note that you may have to take extra care with the matching against the #class to avoid matching type22 or abctype2 if such types are possible.

Resources