Is it possible to select the properties of a node with XPath?

I have an XML of the form:
<articleslist>
<articles>
<originalId>507948</originalId>
<title>Hogan Lovells Training Contract</title>
<slug>hogan-lovells-training-contract</slug>
<metaTitle>Hogan Lovells Training Contract</metaTitle>
<metaDescription>Find out about the Hogan Lovells Training Contract and Application Process</metaDescription>
<language>en</language>
<disableAds>false</disableAds>
<shortUrl>false</shortUrl>
<category_slug>law</category_slug>
<subcategory_slug>industry</subcategory_slug>
<updatedAt>2021-03-15T18:38:51.058+00:00</updatedAt>
<createdAt>2018-11-29T06:42:51.665+00:00</createdAt>
</articles>
</articleslist>
I'm able to select the row values with the XPATH //articles.
How can I select the child properties of articles (i.e. the column headings), so I get back a list of the form:
originalId
title
slug
etc...

It depends on your XPath version.
In XPath 2.0 it's simply //articles/*/name()
In XPath 1.0 it's not possible in a single expression, because there is no such data type as a "sequence of strings". You would have to return the set of elements with //articles/* and then extract their names in the calling program.
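For the XPath 1.0 route, the "extract their names in the calling program" step might look like the minimal sketch below. It assumes Python's standard xml.etree.ElementTree module (which supports only a subset of XPath) and a shortened, well-formed copy of the sample document; neither comes from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample document (assumption: root tags match).
xml_text = """<articleslist>
<articles>
<originalId>507948</originalId>
<title>Hogan Lovells Training Contract</title>
<slug>hogan-lovells-training-contract</slug>
</articles>
</articleslist>"""

root = ET.fromstring(xml_text)

# Select the child elements with a path expression,
# then read each element's name in the host language.
names = [child.tag for child in root.findall(".//articles/*")]
print(names)  # ['originalId', 'title', 'slug']
```

The path only returns elements; turning them into a list of names is done by the list comprehension, which is exactly the part XPath 1.0 cannot express.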

Related

The <title> of the page is changing. How to get it with XPath?

Got the page with dynamic <title> tag depending on language selected by user, e.g.
<title>English</title> or <title>Italiano</title>
I'm trying to select that page among many others with XPath selector:
//*[contains(@title, 'English') or contains(@title, 'Italiano')]
but it doesn't work at all.
I also tried
(//*[contains(@title, 'English')] | //*[contains(@title, 'Italiano')])[1]
with no positive result.
title is an element, not an attribute, so there is no need for the @ prefix:
//*[contains(title, 'English') or contains(title, 'Italiano')]
This returns the parent element of the matching title. If you want to select the title element itself, try
//title[.='English' or .='Italiano']
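To see the difference between selecting the parent and selecting title itself, here is a small sketch using Python's standard xml.etree.ElementTree. Its XPath subset has no "or" and no contains(), so this sketch runs one exact-match query per language instead; the sample page and the variable names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical page; only the <title> text matters here.
page = ET.fromstring("<html><head><title>Italiano</title></head><body/></html>")
wanted = ("English", "Italiano")

# ElementTree predicates cannot combine conditions with "or",
# so query once per language and merge the results.
parents = [el for lang in wanted for el in page.findall(f".//*[title='{lang}']")]
titles = [el for lang in wanted for el in page.findall(f".//title[.='{lang}']")]

print([el.tag for el in parents])  # ['head']
print([el.text for el in titles])  # ['Italiano']
```

The first query matches the element that has a title child (head here), the second matches the title element itself. The [.='text'] form needs Python 3.7 or later.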

Xpath expression (nokogiri) to get tag's child element?

From my xml, I can get this :
<home>
<creditors>
<count>2</count>
</creditors>
</home>
OR even this :
<home>
<creditors>
<moreThan>2</moreThan>
</creditors>
</home>
Which xpath expression can I use to get "<count>2</count>" instead of getting only "2" OR to get "<moreThan>2</moreThan>" instead of getting "2" ?
This XPath,
//creditors/count
will select all count child elements of all creditors elements in the XML document.
Update per OP's request in comments for a single XPath that selects both count and moreThan elements:
This XPath,
//creditors/*[self::count or self::moreThan]
will select all count or moreThan child elements of all creditors elements in the XML document.
Assuming that your xpath expression is OK, you just need to convert the element to string:
doc.xpath("home/creditors/*").to_s
=> "<count>2</count>"
Please check with queries returning more than one element, to make sure that it's desired behaviour.
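For comparison, roughly the same idea outside Nokogiri: a sketch with Python's standard xml.etree.ElementTree, which has no self:: axis, so the name filter is done in Python after selecting all children of creditors. The variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

home = ET.fromstring("<home><creditors><count>2</count></creditors></home>")

# Select every child of creditors, then keep only the two names of interest.
children = [el for el in home.findall("./creditors/*")
            if el.tag in ("count", "moreThan")]

# Serialize each element back to markup instead of reading only its text.
markup = [ET.tostring(el, encoding="unicode") for el in children]
print(markup)  # ['<count>2</count>']
```

Because the result is a list with one string per element, a query that matches several elements stays unambiguous.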

XPath to only select the text contained within an element

I am new to xpath so I apologize in advance for how basic this question is.
How do I extract just the text from a specific element? For example, how would I extract just "text"
<h1>text</h1>
I tried the following but it seems to select everything including the tags instead of just the text.
//h1/text()
Thanks for your help
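import java.io.File;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Parse the XML file into a DOM document.
DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
Document doc = docBuilder.parse(new File("src/myFile.xml"));

// Evaluate the path as a string to get the element's text content.
XPath xpath = XPathFactory.newInstance().newXPath();
String sessionId = (String) xpath.evaluate(
    "/Envelope/Body/LoginProcessResponse/loginResponse/sessionId",
    doc, XPathConstants.STRING);
Here Envelope is my top-level element, and I just walked down the path to the element I needed (sessionId in my case). Evaluating with XPathConstants.STRING returns the text rather than the node.
Hope it helps.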
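The same "ask the host API for a string" idea can be sketched in Python with the standard xml.etree.ElementTree module. Its XPath subset does not include text(), so the text is read through findtext instead; the wrapper body element is an assumption added to make the snippet well-formed.

```python
import xml.etree.ElementTree as ET

# <body> is a hypothetical wrapper around the <h1> from the question.
doc = ET.fromstring("<body><h1>text</h1></body>")

# findtext returns the text of the first matching element, without the tags.
heading_text = doc.findtext(".//h1")
print(heading_text)  # text
```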
This is more of an XSLT answer than an XPath answer, but many of the concepts still apply.
The XPath expression
//h1/text()
seems to be correct. It does select all text() nodes that are direct children of <h1> elements.
But one problem may be that the XSLT built-in template rules still copy all the other text() nodes, as described in the W3C specification:
In the absence of a select attribute, the xsl:apply-templates instruction processes all of the children of the current node, including text nodes.
So to solve your problem, you have to define an explicit template that
ignores all other text() nodes like this:
<xsl:template match="text()" />
If you add this line to your XSL processing, the result will most likely be more pleasant to you.

XPath Wildcard -- Any Node Name, Must have Specific Attribute Value

I am having difficulty figuring out an XPath query that would allow me to return nodes based on the value of the Program attribute in the example below. For example, I would like to be able to search all nodes for a value of the Program attribute = "011.pas". I tried /Items/*[Program="012.pas"] and also /Items/Item*[Program="01.pas"] but neither works. What is the correct expression?
<Items>
<Item0 Program="01.pas"></Item0>
<Item1 Program="011.pas"></Item1>
</Items>
The attribute is selected with @Program and the child elements of the Items element with /Items/*, so you want /Items/*[@Program = '011.pas'].
Try this (note that element names are case-sensitive, so it must be Items, not items):
/Items/*[@Program='011.pas']
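A quick way to check the expression is the sketch below, which assumes Python's standard xml.etree.ElementTree. The root element there is already Items, so the path is written relative to it as ./* rather than /Items/*.

```python
import xml.etree.ElementTree as ET

items = ET.fromstring(
    '<Items>'
    '<Item0 Program="01.pas"></Item0>'
    '<Item1 Program="011.pas"></Item1>'
    '</Items>'
)

# Any child element name, but the Program attribute must match exactly.
matches = items.findall("./*[@Program='011.pas']")
print([el.tag for el in matches])  # ['Item1']
```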

using xpath to select an element after another

I've seen similar questions, but the solutions I've seen won't work on the following. I'm far from an XPath expert. I just need to parse some HTML. How can I select the table that follows Header 2. I thought my solution below should work, but apparently not. Can anyone help me out here?
content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br></p>
<table>
<tr>
<td>Something</td>
</tr>
</table>
</div>
"""
from lxml import etree
tree = etree.HTML(content)
tree.xpath("//table/following::p/b[text()='Header 2']")
Some alternatives to @Arup's answer:
tree.xpath("//p[b='Header 2']/following-sibling::table[1]")
select the first table sibling following the p containing the b header containing "Header 2"
tree.xpath("//b[.='Header 2']/following::table[1]")
select the first table in document order after the b containing "Header 2"
See XPath 1.0 specifications for details on the different axes:
the following axis contains all nodes in the same document as the context node that are after the context node in document order, excluding any descendants and excluding attribute nodes and namespace nodes
the following-sibling axis contains all the following siblings of the context node; if the context node is an attribute node or namespace node, the following-sibling axis is empty
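To make the following-sibling semantics concrete, here is a hand-rolled sketch in plain Python. It uses the standard xml.etree.ElementTree module, which does not support XPath axes, so the helper first_following_sibling is a hypothetical stand-in for following-sibling::table[1], not an lxml API. The sample markup is the question's, with <br> closed so that it parses as XML.

```python
import xml.etree.ElementTree as ET

content = """<div>
<p><b>Header 1</b></p>
<p><b>Header 2</b><br/></p>
<table><tr><td>Something</td></tr></table>
</div>"""


def first_following_sibling(parent, start, tag):
    """Return the first child of parent located after start with the given tag, or None."""
    siblings = list(parent)
    for node in siblings[siblings.index(start) + 1:]:
        if node.tag == tag:
            return node
    return None


div = ET.fromstring(content)

# Locate the p whose b child has the text "Header 2".
header = div.find("./p[b='Header 2']")

# Emulate following-sibling::table[1] by walking the later siblings.
table = first_following_sibling(div, header, "table")
print(table.find("./tr/td").text)  # Something
```

Only siblings under the same parent are considered, which is what distinguishes following-sibling from the broader following axis.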
You can use the following XPath 1.0 expression, which relies on the preceding axis:
//table[preceding::p[1]/b[.='Header 2']]
