xpath function to serialize node? - xpath

I am using a stack at the moment where I don't have direct access to the XML and can only pull data through XPath selectors. By default it returns the result of 'string()' if the response is not text.
'string()' concatenates all text nodes. I am looking for a way to return the serialization of a node, i.e. text + tags + attributes.
I can't see anything that looks like this, but it seems like an obvious thing to want, so I reckon I am not looking in the right place.

Unless you work in an XPath or XQuery 3.0 environment with http://www.w3.org/TR/xpath-functions-30/#func-serialize being supported (or a similar extension function) you won't have access to such a function.

Related

XPath - Extract specific file name from string

I'm trying to extract just the filename from a javascript link in import.io, eg googlebolver.htm from href="javascript:finpopup('googlebolver.htm',920,620,0)"
I've managed to get to the 'link' (javascript:finpopup('googlebolver.htm',920,620,0)) with the following XPath
//*[text()='GOOGLE.MAPS']/@href
but I would like to get to the actual address on its own.
As I am running the import.io Extractor on multiple URLs, I want it to find something like *.htm
I believe this may be possible by using the substring function, but I don't know how to do it.
The following questions on this site looked promising, but one only works for fixed-length strings, and the other I don't completely understand and it works only for a specific 'word'
Extract value from javascript object in site using xpath and import.io
How to use substring() with Import.io?
Thanks in advance for your help
EDIT: Here is the URL
You can use the XPath functions substring-after and substring-before to select the text after, say, (' and before ',
In your example, it would be
substring-before(substring-after(//*[text()='GOOGLE.MAPS']/@href,"('"),"',")
Note: I don't know whether import.io supports these standard XPath functions.
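To make the nesting easier to follow, here is a minimal Python sketch of what the two XPath functions do to the href value from the question. The helper names substring_after and substring_before are only illustrative stand-ins that mimic the XPath 1.0 behaviour (an empty string when the separator is missing); they are not part of import.io or of any XPath library.

```python
def substring_after(text, sep):
    """Mimic XPath substring-after(): text following the first `sep`, or ''."""
    if sep == "":
        return text  # XPath returns the whole string for an empty separator
    _head, found, tail = text.partition(sep)
    return tail if found else ""


def substring_before(text, sep):
    """Mimic XPath substring-before(): text preceding the first `sep`, or ''."""
    if sep == "":
        return ""  # XPath returns '' for an empty separator
    head, found, _tail = text.partition(sep)
    return head if found else ""


# The attribute value from the question.
href = "javascript:finpopup('googlebolver.htm',920,620,0)"

# Inner call drops everything up to and including ("
# Outer call keeps everything before the first ',
filename = substring_before(substring_after(href, "('"), "',")
print(filename)
```

Because neither call depends on the length of the file name, the same expression should work for any link of this shape.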

Get the inner XML using XPath?

This is my XML
<my_xml>
<record>
<p>hello <b>world</b> this is some html</p>
</record>
</my_xml>
Can I use XPath to return the following?
<p>hello <b>world</b> this is some html</p>
my_xml/record/child::*
child::* selects all element children of the context node
see details
The quick answer is, no. You can't accomplish this with XPath, but, once you select the parent node (i.e. "record" in your example), you should be able to manipulate it in whichever language you are using to parse the XML. Unfortunately, it may not be "easy".
It sounds like you would want something like the innerHTML property, but for XML DOM instead of the HTML DOM. Unfortunately, nothing like this exists for the XML DOM. If you don't care about the nodes themselves, you could use the textContent property; in the case of your example, you would get "hello world this is some html", which doesn't seem to be what you want.
Check out this similar question, which includes a parsing algorithm in Java. It seems that you will need to write a similar algorithm in whichever language you're using to parse the XML.
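As a rough illustration of "select the parent node, then serialize it in the host language", here is a sketch in Python using the standard library's xml.etree.ElementTree. The choice of Python and the helper name inner_xml are assumptions made for this example; the question does not say which language is in use.

```python
import xml.etree.ElementTree as ET

XML = """<my_xml>
<record>
<p>hello <b>world</b> this is some html</p>
</record>
</my_xml>"""


def inner_xml(element):
    """Serialize everything inside `element`: text, child tags and attributes."""
    parts = [element.text or ""]
    for child in element:
        # tostring() emits the child's markup plus its trailing (tail) text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()


root = ET.fromstring(XML)        # root is <my_xml>
record = root.find("record")     # path step, like my_xml/record
result = inner_xml(record)
print(result)
```

The path expression only locates the node; the serialization itself is done by the library, which matches the point made above that XPath alone does not return markup.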
For anyone looking for this in the future, this IS very much possible using a dot (.), which returns the entire node content as text (at least in MSSQL XPath it does).
'(/my_xml/record/.)[1]'

facing issue to find xpath expression

My XPath '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']' is working fine, but after this tag there is '#document' text present, and after this '#document' there is an html tag. So when I extend the XPath expression to '//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document/html', it throws the following exception:
Caused by: class org.jaxen.saxpath.XPathSyntaxException:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/#document:
70: Expected one of '.', '..', '@', '*', QName.
So please guide me how to write XPath for this.
Thanks,
Dhananjay
From what I can gather, XPath does not descend into iframes. XPath expressions are tied to a particular XML document, such as an HTML document [1], that they can be evaluated against. In the browser, an iframe counts as a separate document. The <iframe> node itself is a part of the parent document, but it is merely a pointer to another document (the iframe's contents) which is completely separate.
That seems to be the gist of this email chain, and seems to fall naturally out of the fact that XPath expressions are evaluated by calling document.evaluate (that is, a member of a particular document object), as implemented in Firefox. This suggests that the overlap between the various specs defining iframes and XPath excludes traversing that document boundary in a single XPath expression, or at least that seems to be Mozilla's interpretation.
But take note that all of this is guesswork based on Firefox's particular implementation of the XPath specification. This limitation may or may not apply to other browsers, but I would suspect that it does.
It also seems to explain why Selenium requires you to switch context from one document (the parent HTML page) to another (the iframe itself) in order to execute XPath expressions against it, as hinted at by the solution posted by @singaravelan, and others.
[1] But only if the HTML document is magical enough! (Not all HTML documents are well-formed XML: browsers are much more lenient than XML parsers can be; cf. @MathiasMüller's comment.)
You haven't shown your source XML, but one thing we know for sure is that it doesn't contain an element called "#document", because that isn't a legal element name. For the same reason, you can't request an element called "#document" in your XPath expression.
You can use a different XPath that bypasses '#document' by using the descendant axis instead.
For example:
//div[@id='sharetools-container-div']/iframe[@id='sharetools-iframe']/descendant::*[1]
or something like that. It depends on what you want from the inner HTML.
First, thanks for raising this question. I was facing the same problem.
The following line solved it in my case.
driver.SwitchTo().Frame(driver.FindElement(By.Name("fraToc")));
Thanks.
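For readers using the Python bindings rather than C#, the same "switch into the frame, then query" step might look like the sketch below. The function name find_inside_iframe is hypothetical, and the code assumes a Selenium 4 style WebDriver object is passed in; the literal "xpath" is the value of Selenium's By.XPATH constant, written out so the snippet has no import of its own.

```python
BY_XPATH = "xpath"  # same string as selenium.webdriver.common.by.By.XPATH


def find_inside_iframe(driver, iframe_xpath, inner_xpath):
    """Locate an iframe in the parent page, switch into it, then query inside it.

    `driver` is assumed to be a Selenium WebDriver. After this call the driver
    stays in the iframe's context; call driver.switch_to.default_content()
    to return to the parent page.
    """
    # Step 1: the <iframe> element itself belongs to the parent document.
    frame = driver.find_element(BY_XPATH, iframe_xpath)
    # Step 2: make the iframe's document the current context.
    driver.switch_to.frame(frame)
    # Step 3: this XPath is now evaluated against the iframe's own document.
    return driver.find_element(BY_XPATH, inner_xpath)
```

With the page from the question, a call such as find_inside_iframe(driver, "//iframe[@id='sharetools-iframe']", "/html") would replace the failing '/#document/html' step.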

Letting Nokogiri decide whether to use #fragment or #parse

I have a piece of HTML that I would like to parse with Nokogiri, but I do not know whether it is a full HTML document (with DOCTYPE, etc) or a fragment (e.g. just a div with some elements in it).
This makes a difference for Nokogiri, because it should use #fragment for parsing fragments but #parse for parsing full documents.
Is there a way to determine whether a given piece of text is a fragment or a full HTML document?
Denis
Depends on how trashed your page is, but
/^\s*(?:<!DOCTYPE|<html)/
should work in most cases.
The simplest way would be to look for the mandatory <html> tag, using for instance a regular expression /<html[\s>]/ (allowing attributes).
Is this sufficient to solve your problem?
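The two suggestions can be combined into one check. The sketch below shows the idea in Python's re module rather than Ruby, purely for illustration; the function name looks_like_full_document is made up for this example, and the heuristic deliberately ignores edge cases such as a leading comment or byte-order mark.

```python
import re

# A full document is assumed to start (after optional whitespace) with
# either a DOCTYPE declaration or an <html> tag, possibly with attributes.
FULL_DOCUMENT = re.compile(r"\A\s*(?:<!DOCTYPE|<html[\s>])", re.IGNORECASE)


def looks_like_full_document(markup):
    """Return True if `markup` appears to be a whole HTML document."""
    return FULL_DOCUMENT.search(markup) is not None


samples = [
    "<!DOCTYPE html><html><body></body></html>",
    "  <html lang='en'><body></body></html>",
    "<div><p>just a fragment</p></div>",
]
for sample in samples:
    print(looks_like_full_document(sample), sample)
```

In Nokogiri terms, a true result would suggest #parse and a false result #fragment.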

CSS equivalent to XPath parenthetical grouping and indexing?

This question is geared towards testing via Selenium / Web Driver, though applies to general web application/development.
XPath has a very nice feature of grouping a given XPath and combining it with indexing to say "give me element N of all the elements returned by the given XPath", written as (//someXpath)[n].
I was wondering if there is a translatable equivalent in CSS. If not via standard CSS, then how about Sizzle/jQuery? If none exists, it would be nice if that kind of thing were added to the CSS standard in the future. Something like "(someCssSelector):nth-of-type(n)".
Other than that, the alternative for XPath and CSS is to be more specific in describing the DOM tree, going up the tree to get uniqueness in identifying elements (as opposed to (someShorterSimplerXpath)[n]).
You can access jquery sets like arrays: $('selector')[n]
For the relative / xpath, you can use children(), so for an xpath like //selector/foo you'd do $('selector').children('foo'). For the relative // xpath, you can use find(): for //selector//foo use $('selector').find('foo'). For .. you can use parent(): for //selector/.. use $('selector').parent()
With CSS, while there are no parent selectors, there is an nth-of-type pseudo-class (specification here). So you can do selector:nth-of-type(n).
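One caveat worth illustrating: a positional test such as nth-of-type(n), like //b[n] in XPath, counts elements among their siblings, whereas (//b)[n] and $('selector')[n] index into the whole result set. The sketch below shows the difference with Python's standard xml.etree.ElementTree on a made-up document; the element names are purely illustrative.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<root>"
    "<a><b>1</b><b>2</b></a>"
    "<a><b>3</b><b>4</b></a>"
    "</root>"
)

# Per-parent position, like //b[2] or b:nth-of-type(2):
# every <b> that is the second <b> under its own parent.
per_parent = [el.text for el in doc.findall(".//b[2]")]

# Grouped position, like (//b)[3] or $('b')[2]:
# collect all matches in document order, then index the list (0-based).
all_matches = doc.findall(".//b")
grouped = all_matches[2].text

print(per_parent)
print(grouped)
```

Indexing the collected list is what the jQuery array access above does, which is why it is the closer match to (someXpath)[n].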
