HtmlAgilityPack XPath

I have hit a wall with this one, please could someone help me out?
From the URL below I am looking to get to the inner text of
A2A
The XPath syntax I am using doesn't return any data:
.//table[@class='table_dati']//tbody[@class='constituents']//tr//td[@class='name']//a
The URL is
http://www.borsaitaliana.it/borsa/azioni/ftse-mib/lista.html?lang=en&page=1
Thanks in advance,
Grant

How about //tbody[@class='constituents']//td[@class='name']/a? This should work pretty well, actually.

Your XPath starts with ., so it is relative to the context node. But you haven't told us anything about the context. Maybe you want to omit the initial . and make it "absolute":
//table[@class='table_dati']/tbody[@class='constituents']/tr/td[@class='name']/a
I would also change the // to / wherever you're looking for a direct child (not descendant in general) relationship.
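To illustrate the point about the context node, here is a minimal sketch in Python using only the standard library (not HtmlAgilityPack). The markup is a hypothetical, simplified stand-in for the page, and the "header" div is invented for the demonstration. The same relative path is evaluated from two different context nodes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page; the real markup may differ.
PAGE = """<html><body>
  <div id="header"><a href="/home">Home</a></div>
  <table class="table_dati">
    <tbody class="constituents">
      <tr><td class="name"><a href="/a2a">A2A</a></td><td>1.23</td></tr>
      <tr><td class="name"><a href="/enel">Enel</a></td><td>4.56</td></tr>
    </tbody>
  </table>
</body></html>"""

root = ET.fromstring(PAGE)

# A relative path using direct-child steps (/) where the nesting is known.
path = ".//table[@class='table_dati']/tbody[@class='constituents']/tr/td[@class='name']/a"

# Evaluated from the document element, the path finds both links.
names_from_root = [a.text for a in root.findall(path)]

# Evaluated from an unrelated context node, the same path finds nothing.
header = root.find(".//div[@id='header']")
names_from_header = [a.text for a in header.findall(path)]

print(names_from_root)    # ['A2A', 'Enel']
print(names_from_header)  # []
```

If your query returns nothing, it is worth checking which node you are calling it on before changing the expression itself.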

From my experience, HTMLAgilityPack doesn't play nice with the tbody tag. I just follow up the table with the tr td to find the right cell, skipping tbody altogether.
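One common reason for tbody trouble is that browsers add a tbody element to the DOM even when the source HTML has none, so an XPath copied from the browser can fail against the raw markup. Whether that applies to this particular page is an assumption. The sketch below uses Python's standard library and hypothetical markup without tbody to show the effect of skipping it.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw markup in which the author never wrote a <tbody>.
RAW = """<html><body>
<table class="table_dati">
  <tr><td class="name"><a>A2A</a></td></tr>
</table>
</body></html>"""

root = ET.fromstring(RAW)

# Path copied from a browser inspector, which would show an implied tbody.
with_tbody = root.findall(
    ".//table[@class='table_dati']/tbody/tr/td[@class='name']/a")

# Same query with tbody skipped: go from the table to any descendant tr.
without_tbody = root.findall(
    ".//table[@class='table_dati']//tr/td[@class='name']/a")

print(len(with_tbody))                   # 0
print([a.text for a in without_tbody])   # ['A2A']
```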

Related

Block for xPath request

I need help with an xPath request, in importXML. I am absolutely not a pro in the field.
I had a type request:
//*[@id="search"]/div[1]/a/@href
I had taken it from the search field on the societe.com page.
The page has since changed. I think the ID is now input_search, but despite trying many variations I can't get the right expression.
Could you guide me on this problem?
Thank you.
EDIT: Here is how I retrieve the info. CompagnieName is just an example and can be replaced with any company. I think the XPath line is incorrect, but I cannot find what to change; the problem may be with the div or something else.
The XPath you showed works if you search for a company that actually exists.
However, if you want the complete result list you may want to try that URL instead:
https://www.societe.com/cgi-bin/liste?nom=XX
and this XPath:
//*[@id="liste"]/a/@href
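For anyone who wants to try the shape of that expression outside a spreadsheet, here is a rough Python sketch. The markup and the href values are made up for illustration and are not the real societe.com page. Python's ElementTree cannot select attributes with a trailing /@href step, so the sketch selects the a elements and then reads href from each one.

```python
import xml.etree.ElementTree as ET

# Made-up result list; only the id="liste" container mirrors the answer above.
RESULTS = """<html><body>
<div id="liste">
  <a href="/societe/example-one.html">Example One</a>
  <a href="/societe/example-two.html">Example Two</a>
</div>
</body></html>"""

root = ET.fromstring(RESULTS)

# Equivalent of //*[@id="liste"]/a, followed by reading the href attribute.
links = root.findall(".//*[@id='liste']/a")
hrefs = [a.get("href") for a in links]

print(hrefs)
```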

Understanding X-Path Expression

I'm trying to get an understanding of XPath in order to parse a diffxml file. I skimmed over the w3schools site. Am I understanding these correctly?
Statement 1: /node()[1]/node()[3]
Selects the third child of the root node
Statement 2: /node()[1]/node()[1]/node()[1]
Selects the child of the first node of the root node
Statement 3: /node()[1]/node()[3]/node()[2]
Selects the second child of the third node under the root node.
Yes, you understand them correctly, but this is not how you'd normally use XPath. First, node() can match any kind of node, not just elements. Second, selecting purely by index is arguably the worst way of selecting things; you should really use names, and possibly predicates, to filter the node-sets.
You'll find a lot of criticism of w3schools on this site. Personally I find it a useful resource, but only when I'm trying to remind myself of something I once knew. It's not really designed for teaching yourself things from scratch, and I suggest you need a different learning strategy. Call me old-fashioned, but when I'm learning a new technology I find there's nothing better than a good book.
You've understood your examples correctly as far as I can tell. But have you understood what a "node" is? For example, do you know under what circumstances whitespace text counts as a node? The key to understanding XPath is to understand the data model, and the way in which the data model relates to the lexical (angle-bracket) form of the XML.
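The whitespace question can be checked directly. The sketch below uses Python's xml.dom.minidom, which keeps whitespace-only text nodes; the element names are a hypothetical diff-like document, not taken from your file. Other parsers may be configured to drop such nodes, so treat this as one data model among several.

```python
from xml.dom import minidom

# Hypothetical indented document; the newlines and spaces are text nodes too.
DIFF = "<delta>\n  <insert/>\n  <delete/>\n</delta>"

doc = minidom.parseString(DIFF)
root = doc.documentElement

# Every child node, which is what node() steps count over.
all_nodes = [n.nodeName for n in root.childNodes]

# Element children only, which is what * or a name test counts over.
elements_only = [n.nodeName for n in root.childNodes
                 if n.nodeType == n.ELEMENT_NODE]

print(all_nodes)      # ['#text', 'insert', '#text', 'delete', '#text']
print(elements_only)  # ['insert', 'delete']
```

With this data model, the third node() child of delta is a whitespace text node, while the second element child is delete. That is why index-only paths built on node() depend so heavily on how the file is indented.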

facing issue to find xpath expression

My XPath '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']' works fine, but after this tag there is a '#document' entry, and after that an html tag. When I extend the XPath expression to '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document/html', it throws the following exception:
Caused by: class org.jaxen.saxpath.XPathSyntaxException:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document:
70: Expected one of '.', '..', '@', '*', QName.
So please guide me how to write XPath for this.
Thanks,
Dhananjay
From what I can gather, XPath does not descend into iframes. You see, XPath expressions are tied to a particular XML document, such as an HTML document [1], that they can be evaluated against. In the browser, an iframe counts as a separate document. The <iframe> node itself is a part of the parent document, but it is merely a pointer to another document (the iframe's contents), which is completely separate.
That seems to be the gist of this email chain, and seems to fall naturally out of the fact that XPath expressions are evaluated by calling document.evaluate (that is, a member of a particular document object), as implemented in Firefox. This suggests that the overlap between the various specs defining iframes and XPath excludes traversing that document boundary in a single XPath expression — or at least that seems to be Mozilla's interpretation.
But take note that all of this is guesswork based on Firefox's particular implementation of the XPath specification. This limitation may or may not apply to other browsers, but I would suspect that it does.
It also seems to explain why Selenium requires you to switch context from one document (the parent HTML page) to another (the iframe itself) in order to execute XPath expressions against it, as hinted at by the solution posted by @singaravelan, and others.
[1] But only if the HTML document is magical enough! (Not all HTML documents are well-formed XML: browsers are much more lenient than XML parsers can be; cf. @MathiasMüller's comment.)
You haven't shown your source XML, but one thing we know for sure is that it doesn't contain an element called "#document", because that isn't a legal element name. For the same reason, you can't request an element called "#document" in your XPath expression.
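To make that concrete: "#document" is the name the DOM gives to a document node, which is why it appears in inspectors, but it is never the name of an element. A small sketch with Python's xml.dom.minidom, using a trivial hypothetical document rather than your page:

```python
from xml.dom import minidom, Node

# Trivial hypothetical document standing in for the iframe's content.
doc = minidom.parseString("<html><body><p>hi</p></body></html>")

# The document node itself is the thing called "#document".
doc_name = doc.nodeName
is_document_node = doc.nodeType == Node.DOCUMENT_NODE

# The first real element is the document element, here <html>.
root_name = doc.documentElement.nodeName

# Searching for an element named "#document" finds nothing.
matches = doc.getElementsByTagName("#document")

print(doc_name, is_document_node, root_name, len(matches))
# #document True html 0
```

So a step such as /#document has nothing to match; the html element has to be reached from inside the frame's own document.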
You can use a different XPath that bypasses #document by using the descendant axis instead.
For example:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/descendant::*[1]
or something like that. It depends on what you want from the inner HTML.
First, thanks for raising this question; I faced the same problem.
The following line solved it in my case:
driver.SwitchTo().Frame(driver.FindElement(By.Name("fraToc")));
Thanks.

What am I doing wrong with this Xpath Query?

I've been having a play with some Xpath queries but I just can't get this one.
Here's the current string: "/html/body/div/div[8]/table/tr/td[2]/a"
It's showing the information below, but I need to grab "Australia" or node 5. I've tried last() and selecting a node on the a but no luck.
Anyone able to help?
The following seems to work:
/html/body/div/div[8]/table/tr[3]/td[2]/a
You seemed to be on the wrong row. But will the structure always be this static? Maybe you should try to look for something "better" in the page, such as the href containing "country", to be somewhat more resilient to structure changes.
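Here is a rough comparison of the two approaches in Python, on a made-up table (the rows and hrefs are assumptions, since the original page isn't shown). In full XPath 1.0 the href test could be written as //a[contains(@href, 'country')]; Python's ElementTree has no contains(), so the sketch does that filter in plain Python.

```python
import xml.etree.ElementTree as ET

# Made-up table; only the "country" href idea comes from the answer above.
TABLE = """<table>
  <tr><td>1</td><td><a href="/about">About</a></td></tr>
  <tr><td>2</td><td><a href="/contact">Contact</a></td></tr>
  <tr><td>3</td><td><a href="/country/au">Australia</a></td></tr>
</table>"""

table = ET.fromstring(TABLE)

# Positional: breaks as soon as a row is added or removed above it.
by_position = table.find("./tr[3]/td[2]/a").text

# Content-based: keeps working wherever the link ends up in the table.
by_href = [a.text for a in table.iter("a")
           if "country" in a.get("href", "")]

print(by_position)  # Australia
print(by_href)      # ['Australia']
```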

Problems with Xalan using XPATH (unclosed tags)

Greetings,
I'm facing a problem with the following tech-stack: JWebUnit -> HtmlUnit -> Xalan.
I'm trying to find an element by XPATH, but the HTML document is pretty malformed.
Xalan stops finding elements when I reach the /body element on XPATH. I believe it's because the document contains two <body> tags and one being unclosed.
Everything works for /html/head or /html. But when I try /html/body (or /html/body[1], //body[1], or anything inside those tags) I get only null from Xalan.
Is there any way to get around that? I just can't change the HTML document itself. Thank you kindly for your attention.
Best regards,
Thiago
HtmlUnit must be using something to convert HTML to XML. Perhaps you can tell it to use jsoup or tagsoup, which are very tolerant of messy HTML?
You might as well also write code to just dump the XML tree to a file so you can see what's in it.
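As a rough idea of what such a dump can reveal, here is a sketch in Python rather than in the Java stack from the question. It does not use HtmlUnit or Xalan; it only lists the tags that Python's lenient html.parser reports for a hypothetical broken page with two body tags, so the output will not match what HtmlUnit builds internally.

```python
from html.parser import HTMLParser


class TagDumper(HTMLParser):
    """Record start and end tags in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append(f"<{tag}>")

    def handle_endtag(self, tag):
        self.lines.append(f"</{tag}>")


# Hypothetical malformed page: a second <body> opens before the first closes.
BROKEN = "<html><head></head><body><p>one<body><p>two</body></html>"

dumper = TagDumper()
dumper.feed(BROKEN)
dumper.close()

# Print the tag stream; redirect it to a file to inspect larger pages.
print("\n".join(dumper.lines))
body_opens = dumper.lines.count("<body>")
```

Seeing two opening body tags and only one closing tag in the dump would at least confirm where the structure goes wrong before you try a more tolerant parser.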

Resources