Exclude nodes between elements with an extra twist - xpath

I have a tricky XPath issue that I can't quite seem to solve. Let's say I have the following:
<content>
<body>
<block id="123">
<html>
<p align="left">Some text</p>
</html>
</block>
<block id="abc8383">
<html>
<p></p>
</html>
</block>
<block id="456">
<html>
<p><span>Some more text</span></p>
</html>
</block>
<block id="789">
<html>
<p></p>
</html>
</block>
<block id="012356">
<html>
<p class="finalBlock"><h3>content</h3><span>xyz</span></p>
</html>
</block>
</body>
</content>
I want to select all nodes above the element that has a p tag with a "finalBlock" class inside its html element, except for the ones that have no content (node text), e.g. block id 789. However, this rule should only apply until the first node with content is encountered again; beyond that point, the empty elements should all be included. This means that the input above should produce the following output:
<content>
<body>
<block id="123">
<html>
<p align="left">Some text</p>
</html>
</block>
<block id="abc8383">
<html>
<p></p>
</html>
</block>
<block id="456">
<html>
<p><span>Some more text</span></p>
</html>
</block>
<block id="012356">
<html>
<p class="finalBlock"><h3>content</h3><span>xyz</span></p>
</html>
</block>
</body>
</content>
Here the element with an id of 789 was removed, but all others were kept. I've managed to craft the XPath query that excludes the block elements I want (empty ones), but am struggling with implementing the "between" rule. Any thoughts would be greatly appreciated!
Here's the expression excluding the empty block elements:
//block[html/p]/html/p[normalize-space(.) != '']

This expression selects "the element which has a p tag inside the html, with a finalBlock class", which is <block id="012356">:
//*[html/p[@class='finalBlock']]
This one selects all the block nodes that precede it ("all nodes above", which does not include the ancestor nodes):
//*[html/p[@class='finalBlock']]/preceding-sibling::*
You can add a predicate to restrict that to only the ones that have a non-empty p descendant:
//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[string()]]
And the ones that have an empty p descendant, except the nearest one (on the reverse preceding-sibling axis, position() = 1 is the sibling closest to the finalBlock element):
//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[not(string())]][not(position() = 1)]
If you perform a union of the previous two expressions, you will obtain all the block nodes that satisfy the requirements you stated:
//*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[string()]]
| //*[html/p[@class='finalBlock']]/preceding-sibling::*[descendant::p[not(string())]][not(position() = 1)]
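The union above can be checked outside an XPath engine. The following is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree; because that module supports only a subset of XPath, the axis and position logic is written out by hand. The helper names (has_p, blocks_before_final) and the condensed SAMPLE string are illustrative assumptions, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Condensed copy of the sample document from the question.
SAMPLE = """
<content><body>
  <block id="123"><html><p align="left">Some text</p></html></block>
  <block id="abc8383"><html><p></p></html></block>
  <block id="456"><html><p><span>Some more text</span></p></html></block>
  <block id="789"><html><p></p></html></block>
  <block id="012356"><html><p class="finalBlock"><h3>content</h3><span>xyz</span></p></html></block>
</body></content>
"""


def has_p(block, with_text):
    """Mimic descendant::p[string()] (with_text=True) or descendant::p[not(string())]."""
    return any(bool("".join(p.itertext())) == with_text for p in block.iter("p"))


def blocks_before_final(body):
    """Hypothetical helper: hand-written equivalent of the union expression."""
    blocks = list(body)
    final_index = next(
        i for i, b in enumerate(blocks)
        if b.find("html/p[@class='finalBlock']") is not None
    )
    preceding = blocks[:final_index]  # preceding-sibling::*
    non_empty = [b for b in preceding if has_p(b, True)]
    # Reverse order matches how position() counts on the preceding-sibling axis.
    empty_nearest_first = [b for b in reversed(preceding) if has_p(b, False)]
    kept_empty = empty_nearest_first[1:]  # [not(position() = 1)]
    keep = {id(b) for b in non_empty + kept_empty}  # the union of both node sets
    return [b for b in preceding if id(b) in keep]  # back in document order


body = ET.fromstring(SAMPLE).find("body")
selected_ids = [b.get("id") for b in blocks_before_final(body)]
print(selected_ids)
```

On the sample input this keeps blocks 123, abc8383 and 456 and drops 789, which, together with the finalBlock element itself, matches the desired output in the question.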

Related

xpath: how to locate a node that contains more than 1 of another specific node

Using xpath, I want to return all section tags that contain more than one title tag. I've tried
count(concept/conbody/section:child:title>1)
and that didn't return the results. I want to run this XPath across many files to locate those concepts that have a section containing more than one title.
<concept>
<title>Topic Title</title>
<shortdesc>Short description text.</shortdesc>
<conbody>
<section>
<title>Section Title</title>
<p>paragraph text.</p>
</section>
<section>
<title>Section Title</title>
<p>paragraph text.</p>
<title>Section Title</title>
<p>paragraph text.</p>
</section>
</conbody>
</concept>
Depending on how fixed the ancestors of section are, you may use:
concept/conbody/section[count(title) >1]
or:
//section[count(title) >1]
Query for sections that have a second title element; this saves the engine from retrieving every title, which counting them would require:
concept/conbody/section[title[2]]
This should work (section is a child of conbody, so that step must be part of the path):
concept/conbody/section[count(title)>1]
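For running the check from a script, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree. Its XPath subset has no count(), so the count is done in Python instead; the function name sections_with_multiple_titles is a hypothetical helper, and SAMPLE is the document from the question.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<concept>
  <title>Topic Title</title>
  <shortdesc>Short description text.</shortdesc>
  <conbody>
    <section>
      <title>Section Title</title>
      <p>paragraph text.</p>
    </section>
    <section>
      <title>Section Title</title>
      <p>paragraph text.</p>
      <title>Section Title</title>
      <p>paragraph text.</p>
    </section>
  </conbody>
</concept>
"""


def sections_with_multiple_titles(concept):
    """Hand-written equivalent of conbody/section[count(title) > 1], relative to <concept>."""
    return [
        section
        for section in concept.findall("conbody/section")
        if len(section.findall("title")) > 1  # direct <title> children only
    ]


concept = ET.fromstring(SAMPLE)
matches = sections_with_multiple_titles(concept)
print(len(matches))
```

To cover many files, the same function could be applied to the root of each parsed file in turn.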

xpath: check if element is within other element

I have quite a large XML structure that in its simplest form looks kinda like this:
<document>
<body>
<section>
<p>Some text</p>
</section>
</body>
<backm>
<section>
<p>Some text</p>
<figure><title>This</title></figure>
</section>
</backm>
</document>
The section levels can be almost limitless (both within the body and backm elements), so I can have a section in a section in a section, etc., and the figure element can be within a numlist, an itemlist, a p, and many more elements.
What I want to do is to check if the title in figure element is somewhere within the backm element. Is this possible?
A document could have multiple <backm> elements and it could have multiple <figure><title>Title</title></figure> elements in it. How you build your query depends on the situations you're trying to distinguish between.
//backm/descendant::figure/title
Will return the <title> elements that are the child of a <figure> element and the descendant of a <backm> element.
So:
count(//backm/descendant::figure/title) > 0
Will return True if there are 1 or more such title elements.
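You can also express a related check using double negation. Note that this version is true only if every <backm> element contains such a title (and it is also true when the document has no <backm> at all):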
not(//backm[not(descendant::figure/title)])
I'm under the impression that this should have better performance.
//title[parent::figure][ancestor::backm]
Lists all <title> elements with a parent of <figure> and a <backm> ancestor.
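The difference between the "at least one" check and the double-negation check can be seen in a small script. This is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree and the sample document from the question; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<document>
  <body>
    <section>
      <p>Some text</p>
    </section>
  </body>
  <backm>
    <section>
      <p>Some text</p>
      <figure><title>This</title></figure>
    </section>
  </backm>
</document>
"""

root = ET.fromstring(SAMPLE)

# Roughly //backm/descendant::figure/title: figure titles at any depth below a <backm>.
titles_in_backm = [
    title
    for backm in root.iter("backm")
    for title in backm.findall(".//figure/title")
]

# count(...) > 0: at least one such title exists.
any_title_in_backm = len(titles_in_backm) > 0

# not(//backm[not(descendant::figure/title)]): no <backm> lacks such a title.
every_backm_has_title = all(
    backm.find(".//figure/title") is not None for backm in root.iter("backm")
)

print(any_title_in_backm, every_backm_has_title)
```

With a single <backm> the two checks agree; they diverge once a second <backm> without a figure title is added.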

Magento Block Tag

Please explain all the attributes of the Magento block tag:
<block type="catalog/product_featured" name="product_featured"
as="product_featured"
template="catalog/product/featured.phtml"></block>
<block type="catalog/product_featured" name="product_featured" template="catalog/product/featured.phtml">
<action method="setLimit"><limit>2</limit></action>
</block>
Also, why do we need the block tag twice?
type = The PHP block class in which the template looks for its methods. Here it is Mage_Catalog_Block_Product_Featured.
name = Name of the block. It should be unique in the page.
as = Alias, a shorter form of the name. It should be unique within its parent block.
template = The template file (view) this block is attached to. Inside it you can call methods of the block type through $this, e.g. $this->getName().
name vs. as example:
<reference name="left">
<block type="block/type1" name="first_block" template="template1.phtml">
<block type="abc/abc" name="abc" as="common" template="abc.phtml"/>
</block>
<block type="block/type2" name="second_block" template="template2.phtml">
<block type="xyz/xyz" name="xyz" as="common" template="xyz.phtml"/>
</block>
</reference>
So you can now render block abc from first_block and block xyz from second_block with the same call, $this->getChildHtml('common');. The block returned differs depending on the parent that makes the call.

watir-webdriver: how to retrieve entire line from HTML for which I found substring in it?

I've got something like this in the HTML coming from the server:
<html ...>
<head ...>
....
<link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
<link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
...
</head>
<body>
...
</body>
</html>
If b holds the browser object navigated to the page I need to look through, I'm able to find rel="canonical" with b.html.include? statement, but how could I retrieve the entire line where this substring was found? And I also need the next (not empty) one.
You can use a css-locator (or xpath) to get link elements.
The following would return the html (which would be the line) for the link element that has the rel attribute value of "canonical":
b.element(:css => 'link[rel="canonical"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
I am not sure what you mean by "I also need the next (not empty) one.". If you mean that you want the one with rel attribute value of "next", you can similarly do:
b.element(:css => 'link[rel="next"]').html
#=> <link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
You could use String#each_line to iterate through each line in b.html and check for rel=:
b.goto('http://www.iana.org/domains/special')
b.html.each_line {|line| puts line if line.include? "rel="}
That should print every line that includes rel= (although it could print lines that you don't want, such as <a> tags with rel attributes).
Alternately, you could use nokogiri to parse the HTML:
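require 'nokogiri'
require 'open-uri'
# URI.open is needed on Ruby 3.0+, where open-uri no longer patches Kernel#open.
doc = Nokogiri::HTML(URI.open("http://www.iana.org/domains/special"))
# Collect every <link> element and print its HTML.
nodes = doc.css('link')
nodes.each { |node| puts node }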
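The same "parse, then pick the link by its rel value" idea can be sketched without a browser. The following is a minimal example in Python using only the standard library's html.parser; it is an illustration of the approach, not a watir or nokogiri API. The LinkCollector class and the condensed SAMPLE page are assumptions made for the example.

```python
from html.parser import HTMLParser

# Condensed stand-in for the page in the question (the elided parts are left out).
SAMPLE = """<html>
<head>
<link href="http://mydomain.com/Digital_Cameras--~all" rel="canonical" />
<link href="http://mydomain.com/Digital_Cameras--~all/sec_~product_list/sb_~1/pp_~2" rel="next" />
</head>
<body></body>
</html>"""


class LinkCollector(HTMLParser):
    """Hypothetical helper: maps each <link> rel value to the tag's source text."""

    def __init__(self):
        super().__init__()
        self.links = {}

    def handle_starttag(self, tag, attrs):
        if tag == "link":
            rel = dict(attrs).get("rel")
            if rel:
                # get_starttag_text() returns the tag exactly as written in the source.
                self.links[rel] = self.get_starttag_text()


collector = LinkCollector()
collector.feed(SAMPLE)
print(collector.links["canonical"])
print(collector.links["next"])
```

Looking elements up by attribute rather than by line keeps the result stable even if the server changes how it breaks lines in the HTML.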

Select the xth element on a page that is a yth child of its parent

There are lots of similar questions, however I wasn't able to find an answer to this.
Imagine you have an HTML page like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page title</title>
</head>
<body>
<div id="content">
<table>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>
</div>
</body>
</html>
and you want to select the second <td> element on the page that is a first child of its parent. In this case, it's the element <td>D</td>.
Note that this wording should be kept intact, for example it's not the same as selecting the second <tr> and then its first child (results in the same element), because the original page I'm working with is far more complex than this minimal testcase and this approach wouldn't work there.
What I have done so far:
A CSS selector #content td:first-child finds me A and D, now I am able to select the second element either via JS (document.querySelectorAll("query")[1]) or in Java (where I'm working with those elements in the end). However, it's quite inconsistent to use additional code for what could be done via a selector.
Similarly, I can use an XPath expression: id('content')//td[1]. It's the equivalent to the CSS selector above. It returns a node-set, so I thought that id('content')//td[1][2] will work the way I wanted, but no luck.
After some time, I discovered ( id('content')//td[1] )[2] to be working the way I want so I went for that and am quite happy with it.
Still, it's a letdown for me to see that I couldn't do a single query to get my element, and therefore an academic question is in place: Is there any other solution, either with a CSS selector, or an XPath expression to do my query? What did I miss? Can it be done?
CSS selectors currently don't provide any way to select the nth element in a set of globally-matched elements or the nth occurrence of some element in the entire DOM. The structural :nth-*() functional pseudo-classes that are provided by both Selectors 3 and Selectors 4 all count by the nth child of its parent matching the criteria, rather than by the nth element in the entire DOM.
The current Selectors syntax doesn't provide an intuitive way to say "this is the nth of a set of matched elements in the DOM"; even :nth-match() and :nth-last-match() in Selectors 4 have a pretty awkward syntax as they currently stand. So that is indeed a letdown.
As for XPath, the expression to use is (id('content')//td[1])[2], as you have already found. The outer () simply means "this entire subexpression should be evaluated before the [2] predicate" or "the [2] predicate should operate on the result of this entire subexpression, not just //td[1]." Without them, the expression td[1][2] would be treated collectively, with two conflicting predicates that would never work together (you can't have the same element be both first and second!).
Having parentheses around a subexpression doesn't make it an extra query per se; if it were, then you could consider each of id('content'), //td, [1] and [2] a "query" as well in its own right, with implied (or optional) parentheses. And that's a lot of queries :)
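The two-stage evaluation described above (first collect every td that is the first td of its parent, then take the second item of that collection) can be mirrored in a short script. This is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree, which supports positional predicates such as td[1] but neither id() nor parenthesized expressions, so the outer [2] becomes a list index. SAMPLE is a condensed, well-formed copy of the page body.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<html>
  <body>
    <div id="content">
      <table>
        <tr><td>A</td><td>B</td><td>C</td></tr>
        <tr><td>D</td><td>E</td><td>F</td></tr>
      </table>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)
content = root.find(".//div[@id='content']")

# Inner step, //td[1]: every <td> that is the first <td> of its own parent.
first_cells = content.findall(".//td[1]")

# Outer step, (...)[2]: the second node of that result (Python lists are 0-based).
second_first_cell = first_cells[1]

print([cell.text for cell in first_cells], second_first_cell.text)
```

The inner predicate is evaluated per parent, while the index is applied to the combined result, which is exactly the distinction the parentheses make in the XPath expression.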
Use this simple XPath expression:
(//td[1])[2]
XSLT-based verification:
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output omit-xml-declaration="yes" indent="yes"/>
<xsl:template match="node()|@*">
<xsl:copy-of select="(//td[1])[2]"/>
</xsl:template>
</xsl:stylesheet>
When this transformation is applied on the provided XML document:
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Page title</title>
</head>
<body>
<div id="content">
<table>
<tr>
<td>A</td>
<td>B</td>
<td>C</td>
</tr>
<tr>
<td>D</td>
<td>E</td>
<td>F</td>
</tr>
</table>
</div>
</body>
</html>
the XPath expression is evaluated and the result of this evaluation is copied to the output:
<td>D</td>
