Here's my XML:
<w:tc>
<w:p>
<w:pPr></w:pPr>
<w:r></w:r>
</w:p>
</w:tc>
<w:tc>
<w:p>
<w:pPr></w:pPr>
</w:p>
</w:tc>
I want to match a w:p that is inside a w:tc and whose w:pPr has no following sibling w:r. Precisely, I want the second w:tc. Here is what I have tried:
<xsl:template match="w:pPr[ancestor::w:p[ancestor::w:tc] and not(following-sibling::w:r)]">
I need an XPath for a w:pPr that has no following sibling.
The problem arises when w:pPr is followed by w:hyperlink. I now exclude w:hyperlink as well.
If you want to match a w:pPr that has no following sibling elements at all (regardless of name), then just use a match pattern of
w:pPr[ancestor::w:p[ancestor::w:tc] and not(following-sibling::*)]
or equivalently (and slightly shorter)
w:tc//w:p//w:pPr[not(following-sibling::*)]
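The "no following sibling" test can be checked outside XSLT. Below is a minimal Python sketch using the standard library's ElementTree. Its XPath subset has no following-sibling axis, so the check is written in Python as "w:pPr is the last child element of its w:p". The w:tr wrapper element and the helper name are assumptions made for illustration, and the sketch only covers the case where w:pPr is a direct child of w:p.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace bound to the w: prefix
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# The two cells from the question, wrapped in a hypothetical w:tr root
XML = f"""<w:tr xmlns:w="{W}">
  <w:tc><w:p><w:pPr/><w:r/></w:p></w:tc>
  <w:tc><w:p><w:pPr/></w:p></w:tc>
</w:tr>"""

def ppr_without_following_sibling(root):
    """Return each w:pPr that is the last child element of a w:p inside a w:tc."""
    matches = []
    for p in root.findall(".//w:tc//w:p", NS):
        children = list(p)
        # Being the last child is the same as having no following sibling element
        if children and children[-1].tag == f"{{{W}}}pPr":
            matches.append(children[-1])
    return matches

root = ET.fromstring(XML)
found = ppr_without_following_sibling(root)
print(len(found))
```

Only the w:pPr in the second w:tc is returned, because the first one is followed by a w:r.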
Using XPath is simple and straightforward: you only have to filter elements. The filtering can be based on the content of an element (using [] with a path inside the brackets). You can then work with the filtered elements just as with the XML tree (filter again or select the final elements).
In your case, first you have to choose the correct tc element (filter the element as you need):
Based on the count of elements: //tc[count(./p/*) = 1], or
Based on a non-existent r element: //tc[not(./p/r)], or
Based on non-existent r and hyperlink elements: //tc[not(./p/r) and not(./p/hyperlink)], or
Based on an existing pPr and a non-existent r (not strictly necessary, because pPr is filtered in the second step): //tc[./p/pPr and not(./p/r)]
It returns the following XML.
<tc>
<p>
<pPr>pPr</pPr>
</p>
</tc>
Then simply say what you want from the new XML:
Do you want the pPr element? Use: /p/pPr
All together:
//tc[count(./p/*) = 1]/p/pPr
or
//tc[not(./p/r)]/p/pPr
Note: // means find the element anywhere in the document.
Update 1: Hyperlink condition added.
I'm trying to scrape an HTML web page. I'd like to fetch the three alternate texts (alt, highlighted) from the three "img" elements.
I'm using the following code to extract the whole "img" element of slide-1.
from lxml import html
import requests
page = requests.get('sample.html')
tree = html.fromstring(page.content)
text_val = tree.xpath('//a[class="cover-wrapper"][id = "slide-1"]/text()')
print text_val
The alternate text values are not displayed; I get an empty list instead.
HTML Script used:
This is one possible XPath :
//div[@id='slide-1']/a[@class='cover-wrapper']/img/@alt
Explanation :
//div[@id='slide-1'] : This part finds the target <div> element by comparing the id attribute value. Notice the use of the @attribute_name syntax to reference an attribute in XPath. Omitting the @ symbol changes the meaning of the selector so that it references a child element with that name instead of an attribute.
/a[@class='cover-wrapper'] : from each <div> element found by the previous part of the XPath, find the child <a> elements whose class attribute equals 'cover-wrapper'
/img/@alt : then, from each such <a> element, find the child <img> element and return its alt attribute
You might want to change the id filter to starts-with(@id,'slide-') if you meant to return all 3 alt attributes in the screenshot.
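The step-by-step path above can be tried on a small sample. This is a minimal sketch using Python's standard-library ElementTree instead of lxml; the markup is hypothetical, since the original page is not shown. ElementTree's XPath subset cannot return attribute nodes, so the final /@alt step becomes a call to .get("alt"). With lxml, the full expression can be passed to tree.xpath() directly.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: the real page is not shown, so this structure is assumed
HTML = """<body>
  <div class="showcase-wrapper" id="slide-1">
    <a class="cover-wrapper" href="#"><img src="1.png" alt="First cover"/></a>
  </div>
  <div class="showcase-wrapper" id="slide-2">
    <a class="cover-wrapper" href="#"><img src="2.png" alt="Second cover"/></a>
  </div>
</body>"""

root = ET.fromstring(HTML)

# Attributes are referenced with @ inside the predicates
imgs = root.findall(".//div[@id='slide-1']/a[@class='cover-wrapper']/img")

# Read the alt attribute from each matched <img> element
alts = [img.get("alt") for img in imgs]
print(alts)
```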
Try this:
//a[@class="cover-wrapper"]/img/@alt
Here I first select the a element whose class is cover-wrapper, then its img child, and then the alt attribute of that img.
To find the whole image element:
//a[@class="cover-wrapper"]/img
I think you want:
//div[@class="showcase-wrapper"][@id="slide-1"]/a/img/@alt
Suppose the two elements below never appear at the same time:
<a title='a' />
<b title='b' />
I want to check whether either of them is present.
Does XPath support an 'or' function? I just want to write it in one line:
//a[@title='a'] or .. @title='b' ??
XPath Operators
Select either matching nodes (your case here):
//a[@title='a'] | //b[@title='b']
Select one element with either matching attributes
//a[@title='a' or @title='b']
If you want to match either <a/> elements with @title='a' attribute or <b/> elements with @title='b' attribute, you can also match all elements and perform a test on their name:
//*[local-name(.) = 'a' and @title='a' or local-name(.) = 'b' and @title='b']
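The name-and-attribute test can be mirrored in plain Python to see which elements it keeps. This is only a sketch: the sample document is hypothetical, and because the standard library's ElementTree has no `|` or `or` in its XPath subset, the predicate is written as a Python function instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: both wanted elements, plus one <a> that should not match
root = ET.fromstring("<root><a title='a'/><b title='b'/><a title='x'/></root>")

def matches(elem):
    # 'and' binds tighter than 'or', in XPath as in Python,
    # so no extra parentheses are needed around each pair of tests
    return (elem.tag == "a" and elem.get("title") == "a"
            or elem.tag == "b" and elem.get("title") == "b")

# root.iter() walks every element in document order, like //*
found = [(e.tag, e.get("title")) for e in root.iter() if matches(e)]
print(found)
```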
I am trying to find a way to search for a string within nodes while excluding the content of some subelements of those nodes. Put simply, I want to search for a string in the paragraphs of a text, excluding the footnotes, which are child elements of the paragraphs.
For example,
My document being:
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>
When I'm searching for "text", I would like the Xpath / XQuery to retrieve the first p element, but not the second one (where "text" is contained only in the footnote subelement).
I have tried the contains() function, but it retrieves both p elements.
Any help would be much appreciated :)
I want to search for a string in
paragraphs of a text, excluding the
footnotes which are children elements
of the paragraphs
An XPath 1.0 - only solution:
Use:
//p//text()[not(ancestor::footnote) and contains(.,'text')]
Against the following XML document (obtained from yours, with a p added inside the footnote to make this more interesting):
<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there
<footnote>It's not a very long text!
<p>text</p>
</footnote>
</p>
</document>
this XPath expression selects exactly the wanted text node:
My text starts here/
//p[(.//text() except .//footnote//text())[contains(., 'text')]]
/document/p[text()[contains(., 'text')]] should do.
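The idea behind this last expression is to test only the text nodes that are direct children of each p, so footnote content is never looked at. A minimal Python sketch of that test with the standard library's ElementTree follows; the helper name direct_text_nodes is made up for illustration. In ElementTree, the direct text of an element is its .text plus the .tail of each child.

```python
import xml.etree.ElementTree as ET

XML = """<document>
<p n="1">My text starts here/</p>
<p n="2">Then it goes on there<footnote>It's not a very long text!</footnote></p>
</document>"""

def direct_text_nodes(elem):
    """Yield the text nodes that are direct children of elem (XPath text())."""
    if elem.text:
        yield elem.text
    for child in elem:
        # Text that follows a child element still belongs to the parent
        if child.tail:
            yield child.tail

root = ET.fromstring(XML)

# Keep the paragraphs where at least one direct text node contains "text"
hits = [p.get("n") for p in root.findall("p")
        if any("text" in t for t in direct_text_nodes(p))]
print(hits)
```

Only the first paragraph is kept, because in the second one the word occurs only inside the footnote.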
For the record, as a complement to the other answers, I've found this workaround that also seems to do the job:
//p[contains(child::text()|not(descendant::footnote), "text")]
Is there any way to specify that I want to select only tag-less child elements (in the following example - "text")?
<div>
<p>...</p>
"text"
</div>
The text() function matches text nodes. Example: //div/text() — matches all text children within all div elements.
Use:
/*/text()[normalize-space()]
This selects all text nodes that are children of the top element of the document and that do not consist only of white-space characters.
In the concrete example this will select only the text node with string value:
'
"text"
'
The XPath expressions:
/*/text()
or
/div/text()
both select two text nodes, the first of which contains only white-space and the second is the same text node as above:
'
"text"
'
select only tag-less child elements
To me this sounds like selecting all elements that don't have other elements as children. But then again, "text" in your example is not an element but a text node, so I'm not really sure what you want to select...
Anyway, here is a solution for selecting such elements.
//*[not(*)]
Selects all elements that don't have an element as a child. Replace the first * with an element name if you only want to select certain elements that don't have child elements. Also note that using // is generally slow since it runs through the whole document. Consider using a more specific path when possible (like /div/*[not(*)] in this case).
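Both readings of the question can be compared on the sample document. This is a minimal sketch with Python's standard-library ElementTree, which lacks not() and normalize-space(), so the two XPath expressions are emulated in Python: len(e) == 0 stands in for not(*), and a strip() test stands in for the normalize-space() predicate.

```python
import xml.etree.ElementTree as ET

# The sample from the question
root = ET.fromstring('<div>\n<p>...</p>\n"text"\n</div>')

# //*[not(*)] : elements that have no child elements
leaves = [e.tag for e in root.iter() if len(e) == 0]

# /*/text()[normalize-space()] : text nodes directly under the top element
# that are not whitespace-only. The direct text nodes are root.text plus
# the tail of every child element.
candidates = [root.text] + [child.tail for child in root]
texts = [t for t in candidates if t and t.strip()]

print(leaves)
print(texts)
```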
I'd like to use Nokogiri to extract all nodes in an element that contain a specific attribute name.
e.g., I'd like to find the 2 nodes that contain the attribute "blah" in the document below.
@doc = Nokogiri::HTML::DocumentFragment.parse <<-EOHTML
<body>
<h1 blah="afadf">Three's Company</h1>
<div>A love triangle.</div>
<b blah="adfadf">test test test</b>
</body>
EOHTML
I found this suggestion (below) at this website: http://snippets.dzone.com/posts/show/7994, but it doesn't return the 2 nodes in the example above. It returns an empty array.
# get elements with attribute:
elements = @doc.xpath("//*[@*[blah]]")
Thoughts on how to do this?
Thanks!
I found this here
elements = @doc.xpath("//*[@*[blah]]")
This is not a useful XPath expression. It says to give you all elements that have attributes that have child elements named 'blah'. And since attributes can't have child elements, this XPath will never return anything.
The DZone snippet is confusing in that when they say
elements = @doc.xpath("//*[@*[attribute_name]]")
the inner square brackets are not literal... they're there to indicate that you put in the attribute name. Whereas the outer square brackets are literal. :-p
They also have an extra * in there, after the @.
What you want is
elements = @doc.xpath("//*[@blah]")
This will give you all the elements that have an attribute named 'blah'.
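The attribute-existence predicate is not specific to Nokogiri. As a quick cross-check, here is a minimal Python sketch using the standard library's ElementTree, whose XPath subset supports [@name] directly. The leading dot is needed because ElementTree paths are evaluated relative to the element they are called on.

```python
import xml.etree.ElementTree as ET

# The sample markup from the question
XML = """<body>
  <h1 blah="afadf">Three's Company</h1>
  <div>A love triangle.</div>
  <b blah="adfadf">test test test</b>
</body>"""

root = ET.fromstring(XML)

# Every descendant element that carries a "blah" attribute, whatever its value
elements = root.findall(".//*[@blah]")
print([e.tag for e in elements])
```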
You can use CSS selectors:
elements = @doc.css "[blah]"